Lectures On Ergodic Theory
Special thanks to Snir Ben-Ovadia, Keith Burns, Yair Daon, Dimitris Gatzouras,
Yair Hartman, Qiujie Qiao, Abhishek Khetan, Ian Melbourne, Tom Meyerovitch,
Ofer Shwartz and Andreas Strömbergsson for indicating typos and mistakes in earlier versions of this set of notes.
If you find additional errors please let me know! O.S.
Contents
2 Ergodic Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.1 The Mean Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.2 The Pointwise Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.3 The non-ergodic case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.3.1 Conditional expectations and the limit in the ergodic theorem 40
2.3.2 Conditional probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.3.3 The ergodic decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.4 An Ergodic Theorem for Zᵈ-actions . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5 The Subadditive Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6 The Multiplicative Ergodic Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.6.1 Preparations from Multilinear Algebra . . . . . . . . . . . . . . . . . . 52
3 Spectral Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.1 The spectral approach to ergodic theory . . . . . . . . . . . . . . . . . . . . . . . . 89
3.2 Weak mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2.1 Definition and characterization . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.2.2 Spectral measures and weak mixing . . . . . . . . . . . . . . . . . . . . . 92
3.3 The Koopman operator of a Bernoulli scheme . . . . . . . . . . . . . . . . . . . 95
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.1 Information content and entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.2 Properties of the entropy of a partition . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.1 The entropy of α ∨ β . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.2.2 Convexity properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.2.3 Information and independence . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3 The Metric Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3.1 Definition and meaning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3.2 The Shannon–McMillan–Breiman Theorem . . . . . . . . . . . . . . 109
4.3.3 Sinai’s Generator theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.1 Bernoulli schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.2 Irrational rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4.3 Markov measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4.4 Expanding Markov Maps of the Interval . . . . . . . . . . . . . . . . . 115
4.5 Abramov’s Formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.6 Topological Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
4.6.1 The Adler–Konheim–McAndrew definition . . . . . . . . . . . . . . 118
4.6.2 Bowen’s definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.6.3 The variational principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.7 Ruelle’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.7.1 Preliminaries on singular values . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7.2 Proof of Ruelle’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Chapter 1
Basic definitions and constructions
Dynamical systems and ergodic theory. Ergodic theory is a part of the theory of
dynamical systems. In its simplest form, a dynamical system is a function T defined
on a set X. The iterates of the map are defined by induction: T⁰ := id, Tⁿ := T ∘ Tⁿ⁻¹,
and the aim of the theory is to describe the behavior of Tⁿ(x) as n → ∞.
More generally one may consider the action of a semi-group of transformations,
namely a family of maps Tg : X → X (g ∈ G) satisfying Tg1 ◦ Tg2 = Tg1 g2 . In the
particular case G = R+ or G = R we have a family of maps Tt such that Tt ◦Ts = Tt+s ,
and we speak of a semi-flow or a flow.
The original motivation was classical mechanics. There X is the set of all possible
states of a given dynamical system (sometimes called the configuration space or
phase space), and T : X → X is the law of motion, which prescribes that if the system
is at state x now, then it will evolve to state T(x) after one unit of time. The orbit
{Tⁿ(x)}_{n∈Z} is simply a record of the time evolution of the system, and understanding
the behavior of Tⁿ(x) as n → ∞ is the same as understanding the behavior
of the system in the far future. Flows T_t arise when one insists on studying continuous,
rather than discrete, time. More complicated group actions, e.g. Zᵈ–actions,
arise in material science. There x ∈ X codes the configuration of a d–dimensional
lattice (e.g. a crystal), and {Tv : v ∈ Zd } are the symmetries of the lattice.
The theory of dynamical systems splits into subfields which differ by the struc-
ture which one imposes on X and T :
1. Differentiable dynamics deals with actions by differentiable maps on smooth
manifolds;
2. Topological dynamics deals with actions of continuous maps on topological
spaces, usually compact metric spaces;
3. Ergodic theory deals with measure preserving actions of measurable maps on a
measure space, usually assumed to be finite.
It may seem strange to assume so little about X and T. The discovery that such
meagre assumptions yield non-trivial information is due to Poincaré, who should be
considered the progenitor of the field.
Poincaré’s Recurrence Theorem and the birth of ergodic theory. Imagine a box
filled with gas, made of N identical molecules. Classical mechanics says that if
we know the positions qi = (q1i , q2i , q3i ) and momenta pi = (p1i , p2i , p3i ) of the i-th
molecule for all i = 1, . . . , N, then we can determine the positions and momenta of
each particle at time t by solving Hamilton's equations

q̇_i = ∂H/∂p_i ,   ṗ_i = −∂H/∂q_i   (i = 1, …, N).   (1.1)

Here H is the Hamiltonian (the energy). Let X be a set of states of fixed total energy,
and let T_t : X → X denote the map which gives the solution of (1.1) at time t with
initial condition (q(0), p(0)). If H is sufficiently regular, then (1.1) has a unique
solution for all t and every initial condition. The uniqueness of the solution implies
that T_t is a flow. The law of conservation of energy implies that x ∈ X ⇒ T_t(x) ∈ X
for all t.
Question: Suppose the system starts at a certain state (q(0), p(0)), will it eventually
return to a state close to (q(0), p(0))?
For general H, the question seems intractable because (1.1) is a strongly coupled
system of an enormous number of equations (N ∼ 10²⁴). Poincaré's startling discovery
is that the question is trivial if viewed from the right perspective. To understand
his solution, we need to recall a classical fact, known as Liouville’s theorem: The
Lebesgue measure m on X satisfies m(Tt E) = m(E) for all t and all measurable
E ⊂ X (problem 1.1).
Here is Poincaré's solution. Define T := T₁, and observe that Tⁿ = T_n. Fix ε > 0
and consider the set W of all states x = (q, p) such that d(x, Tⁿ(x)) > ε for all n ≥ 1
(here d is the Euclidean distance). Divide W into finitely many disjoint pieces W_i of
diameter less than ε.
For each fixed i, the sets T^{−n}(W_i) (n ≥ 1) are pairwise disjoint: otherwise
T^{−n}(W_i) ∩ T^{−(n+k)}(W_i) ≠ ∅ for some n, k ≥ 1, so W_i ∩ T^{−k}(W_i) ≠ ∅, and there
exists x ∈ W_i with T^k(x) ∈ W_i. But then d(x, T^k(x)) ≤ diam(W_i) < ε, contradicting
the definition of W.
Since {T^{−k}W_i}_{k≥1} are pairwise disjoint, m(X) ≥ ∑_{k≥1} m(T^{−k}W_i). But the sets T^{−k}(W_i)
all have the same measure (Liouville's theorem), and m(X) < ∞, so we must have
m(W_i) = 0. Summing over i we get that m(W) = 0. In summary, a.e. x has the
property that d(Tⁿ(x), x) ≤ ε for some n ≥ 1. Considering the countable collection
ε = 1/n, we obtain the following result:
Poincaré’s Recurrence Theorem: For almost every x = (q(0), p(0)), if the system
is at state x at time zero, then it will return arbitrarily close to this state infinitely
many times in the arbitrarily far future.
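The argument can be watched in action on a toy example. The sketch below is not the Hamiltonian setting of the theorem: it replaces the flow T_t by the measure-preserving rotation T(x) = x + α mod 1 on [0, 1) (which preserves Lebesgue measure, so the theorem applies) and records the times n at which the orbit returns ε-close to its starting point.

```python
import math

# Toy illustration of the recurrence argument (a sketch: the Hamiltonian
# flow is replaced by the measure preserving rotation T(x) = x + alpha mod 1
# on [0,1), which preserves Lebesgue measure).
alpha = math.sqrt(2) - 1           # an irrational rotation number
x0, eps = 0.3, 0.01
returns = []                       # times n >= 1 with d(T^n(x0), x0) < eps
x = x0
for n in range(1, 20000):
    x = (x + alpha) % 1.0
    d = min(abs(x - x0), 1 - abs(x - x0))   # distance on the circle
    if d < eps:
        returns.append(n)

print(returns[:5])                 # nonempty, and the orbit keeps coming back
```

The theorem only guarantees that the list of return times is infinite; how fast the orbit returns depends on the arithmetic of α.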
Poincaré's Recurrence Theorem is a tour de force, because it turns a problem
which looks intractable into a triviality, simply by looking at it from a different angle.
The only thing the solution requires is the existence of a finite measure on X such
that m(T −1 E) = m(E) for all measurable sets E. This startling realization raises
the following mathematical question: What other dynamical information can one
extract from the existence of a measure m such that m = m ◦ T −1 ? Of particular
interest was the justification of the following “assumption” made by Boltzmann in
his work on statistical mechanics:
The Ergodic Hypothesis: For certain invariant measures µ, many functions f : X →
R, and many states x = (q, p), the time average

lim_{T→∞} (1/T) ∫₀ᵀ f(T_t(x)) dt

exists, and equals the space average (1/µ(X)) ∫ f dµ.
(This is not Boltzmann's original formulation.) The ergodic hypothesis is a quantitative
version of Poincaré's recurrence theorem: If f is the indicator of the ε–ball
around a state x, then the time average of f is the frequency of times when T_t(x) is
within ε of x, and the ergodic hypothesis is a statement on its value.
The proof of Poincaré’s Recurrence Theorem suggests the study of the following
setup.
Definition 1.1. A measure space is a triplet (X, B, µ) where
1. X is a set, sometimes called the space.
2. B is a σ –algebra: a collection of subsets of X which contains the empty set,
and which is closed under complements, countable unions and countable intersections.
The elements of B are called measurable sets.
3. µ : B → [0, ∞], called the measure, is a σ–additive function: if E₁, E₂, … ∈ B
are pairwise disjoint, then µ(⊎ᵢ Eᵢ) = ∑ᵢ µ(Eᵢ).
If µ(X) = 1 then we say that µ is a probability measure and (X, B, µ) is a probability space.
In order to avoid measure theoretic pathologies, we will always assume that
(X, B, µ) is the completion (see problem 1.2) of a standard measure space: a measure
space (X, B′, µ′), where X is a complete and separable metric space and B′ is
its Borel σ –algebra.
It can be shown that such spaces are Lebesgue spaces: They are isomorphic
to the union of a compact interval equipped with Lebesgue's σ–algebra and
Lebesgue's measure, and a finite, countable, or empty collection of points with
positive measure. See the appendix for details.
Much of the power and usefulness of ergodic theory is due to the following probabilistic
interpretation of the abstract setup discussed above. Suppose (X, B, µ, T)
is a ppt (a probability preserving transformation).
1. We imagine X to be a sample space: the collection of all possible outcomes ω of
a random experiment.
2. We interpret B as the collection of all measurable events: all sets E ⊂ X such
that we have enough information to answer the question “is ω ∈ E?”
3. We use µ to define the probability law: Pr[ω ∈ E] := µ(E).
4. We view measurable functions f : X → R as random variables, and the sequences
X_n := f ∘ Tⁿ (n ≥ 1) as stochastic processes.
The invariance of µ guarantees that such stochastic processes are always stationary:
Pr[X_{i₁+m} ∈ E_{i₁}, …, X_{i_k+m} ∈ E_{i_k}] = Pr[X_{i₁} ∈ E_{i₁}, …, X_{i_k} ∈ E_{i_k}] for all m.
The point is that we can ask what are the properties of the stochastic processes
{ f ◦ T n }n≥1 arising out of the ppt (X, B, µ, T ), and bring in tools and intuition from
probability theory to the study of dynamical systems.
Note that we have found a way of studying stochastic phenomena in a context
which is, a priori, completely deterministic (if we know the state of the system at
time zero is x, then we know with full certainty that its state at time n is T n (x)). The
modern treatment of the question “how come a deterministic system can behave
randomly” is based on this idea.
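A concrete instance of this idea, as a numerical sketch (it uses the angle doubling map T(x) = 2x mod 1, introduced in §1.5 below): the observables X_n(x) = 1_{[1/2,1)}(Tⁿx) are the binary digits of x, and under the Lebesgue measure they behave like i.i.d. fair coin tosses, even though the dynamics is completely deterministic.

```python
import random

# Deterministic dynamics producing "random" data: for the doubling map
# T(x) = 2x mod 1, the observables X_n(x) = 1_{[1/2,1)}(T^n x) are the
# binary digits of x; under Lebesgue measure they look like fair coin
# tosses: Pr[X_0 = 1] = 1/2 and Pr[X_0 = 1, X_7 = 1] = 1/4.
random.seed(0)

def digits(x, n):
    out = []
    for _ in range(n):
        out.append(1 if x >= 0.5 else 0)
        x = (2 * x) % 1.0              # apply T
    return out

samples = [digits(random.random(), 20) for _ in range(20000)]
p1  = sum(s[0] for s in samples) / len(samples)          # estimates Pr[X_0 = 1]
p11 = sum(s[0] * s[7] for s in samples) / len(samples)   # Pr[X_0 = 1, X_7 = 1]
print(p1, p11)                                           # ~0.5 and ~0.25
```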
(This is because µ(A_i △ A_j) = ∥1_{A_i} − 1_{A_j}∥₁, where 1_{A_i} is the indicator function of A_i,
equal to one on A_i and to zero outside A_i.) The triangle inequality implies that
µ(E₀ △ E) ≤ ∑_{k=1}^∞ µ(E △ T^{−k}E) ≤ ∑_{k=1}^∞ ∑_{i=0}^{k−1} µ(T^{−i}E △ T^{−(i+1)}E)
= ∑_{k=1}^∞ k · µ(E △ T^{−1}E)   (∵ µ ∘ T^{−1} = µ).
Since µ(E△T −1 E) = 0, µ(E0 △E) = 0 and we have shown that (1) implies (2).
Next assume (2), and let f be a measurable function s.t. f ∘ T = f almost everywhere.
For every t, [f > t] △ T^{−1}[f > t] ⊂ [f ≠ f ∘ T], so
Just like ergodicity, strong mixing can be defined in terms of functions. Before
we state the condition, we recall a relevant notion from statistics. The correlation
coefficient of non-constant f , g ∈ L2 (µ) is defined to be
ρ(f, g) := ( ∫ fg dµ − ∫ f dµ · ∫ g dµ ) / ( ∥f − ∫ f dµ∥₂ · ∥g − ∫ g dµ∥₂ ).
It follows that lim sup |Cov( f , g ◦ T n )| ≤ 4ε(∥ f ∥2 + ∥g∥2 + ε). Since ε is arbitrary,
the limsup, whence the limit itself, is equal to zero. ⊓⊔
1.5 Examples
Let T := [0, 1) equipped with the Lebesgue measure m, and define for α ∈ [0, 1)
Rα : T → T by Rα (x) = x + α mod 1. Rα is called a circle rotation, because the
map π(x) = exp[2πix] is an isomorphism between Rα and the rotation by the angle
2πα on the unit circle S1 .
Proposition 1.4.
1. Rα is measure preserving for every α;
2. Rα is ergodic iff α ̸∈ Q;
3. Rα is never strongly mixing.
Proof. A direct calculation shows that the Lebesgue measure m satisfies m(R_α^{−1}I) =
m(I) for all intervals I ⊂ [0, 1). Thus the collection M := {E ∈ B : m(R_α^{−1}E) =
m(E)} contains the algebra of finite disjoint unions of intervals. It is easy to check
M is a monotone class, so by the monotone class theorem (see appendix) M contains
all Borel sets. It clearly contains all null sets. Therefore it contains all Lebesgue
measurable sets. Thus M = B and (1) is proved.
We prove (2). Suppose first that α = p/q for p, q ∈ N. Then R_α^q = id. Fix some x ∈
[0, 1), and pick ε so small that the ε–neighborhoods of x +kα for k = 0, . . . , q −1 are
disjoint. The union of these neighborhoods is an invariant set of positive measure,
and if ε is sufficiently small then it is not equal to T. Thus Rα is not ergodic.
Next assume that α ∉ Q. Suppose E is an invariant set, and set f = 1_E. Expand
f to a Fourier series: f = ∑_{n∈Z} f̂(n) exp[2πinx].
Equating the coefficients of f and f ∘ R_α, we see that f̂(n) = f̂(n) exp[2πinα]. Thus either f̂(n) = 0
or exp[2πinα] = 1. Since α ∉ Q, f̂(n) = 0 for all n ≠ 0. We obtain that f = f̂(0)
a.e., whence 1_E = m(E) almost everywhere. This can only happen if m(E) = 0 or
m(E) = 1, proving the ergodicity of m.
To show that m is not mixing, we consider the function f(x) = exp[2πix]. This
function satisfies f ∘ R_α = λf with λ = exp[2πiα] (such a function is called an
eigenfunction). For every α there is a sequence n_k → ∞ s.t. n_kα mod 1 → 0 (Dirichlet's
theorem), thus ∥f ∘ R_α^{n_k} − f∥₂ = |λ^{n_k} − 1| → 0 as k → ∞. It follows that F := Re(f) =
cos(2πx) satisfies ∥F ∘ R_α^{n_k} − F∥₂ → 0, whence ∫ F ∘ R_α^{n_k} · F dm → ∫ F² dm ≠
(∫ F dm)², and m is not mixing. ⊓⊔
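Ergodicity of the irrational rotation can be watched numerically. The following sketch anticipates the ergodic theorems of Chapter 2: for irrational α, Birkhoff averages (1/N) ∑_{n<N} f(R_α^n x) approach the space average ∫ f dm, while for rational α they depend on the (finite) orbit. The observable f = 1_{[0,1/4)} and the starting points are arbitrary choices.

```python
import math

# Numerical sketch: Birkhoff averages for the rotation R_alpha.
# Irrational alpha: the average tends to the space average \int f dm = 1/4.
# Rational alpha (here 1/2): the orbit of 0.1 is {0.1, 0.6}, average 1/2.
f = lambda x: 1.0 if x < 0.25 else 0.0       # indicator of [0, 1/4)

def birkhoff_average(alpha, x, N):
    total = 0.0
    for _ in range(N):
        total += f(x)
        x = (x + alpha) % 1.0                # apply the rotation R_alpha
    return total / N

avg_irr = birkhoff_average(math.sqrt(2) - 1, x=0.1, N=100000)
avg_rat = birkhoff_average(0.5, x=0.1, N=100000)
print(avg_irr, avg_rat)                      # ~0.25 vs. 0.5
```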
Again, we work with T := [0, 1] equipped with the Lebesgue measure m, and define
T : T → T by T (x) = 2x mod 1. T is called the angle doubling map, because the
map π(x) := exp[2πix] is an isomorphism between T and the map eiθ 7→ e2iθ on S1 .
Proposition 1.5. The angle doubling map is probability preserving, and strong mixing, whence ergodic.
Thus Cov(f, g ∘ T^k) → 0 as k → ∞ for all indicators of cylinders. Every L²–function can be
approximated in L² by a finite linear combination of indicators of cylinders (prove!).
One can therefore proceed as in the proof of proposition 1.3 to show that Cov(f, g ∘
T^k) → 0 for all L² functions. ⊓⊔
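The mixing statement m(A ∩ T^{−k}A) → m(A)² can be tested by Monte Carlo. A sketch: the set A = [0, 1/3) and the sample sizes are arbitrary choices, and floating point limits k to a few dozen doublings (each application of T discards one binary digit).

```python
import random

# Monte Carlo sketch of mixing for T(x) = 2x mod 1: for A = [0, 1/3),
# m(A \cap T^{-k}A) should approach m(A)^2 = 1/9 as k grows.
random.seed(1)
in_A = lambda x: x < 1.0 / 3.0

def overlap(k, trials=200000):
    """Estimate m(A intersect T^{-k}A) by sampling x uniformly in [0,1)."""
    hits = 0
    for _ in range(trials):
        x = random.random()
        y = x
        for _ in range(k):
            y = (2 * y) % 1.0              # after the loop, y = T^k(x)
        hits += in_A(x) and in_A(y)
    return hits / trials

o1, o10 = overlap(1), overlap(10)
print(o1, o10)    # ~1/6 = m(A \cap T^{-1}A), then ~1/9 = m(A)^2
```

For k = 1 the exact value is m([0,1/6)) = 1/6, so the drop toward 1/9 is visible already after a few iterates.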
Let S be a finite set, called the alphabet, and let X := S^N be the set of all one–sided
infinite sequences of elements of S. Impose the following metric on X: d(x, y) := 2^{−min{k : x_k ≠ y_k}}.
∑_{|c|=N−|b^j|} µ[a, β^j, c] = µ[a, β^j] ∑_c p_c = µ[a, β^j] ≡ µ[b^j].
to the collection of all possible words w of length N − |a| (otherwise the right hand
side misses some sequences). Thus
∑_j µ[b^j] = ∑_{|w|=N−|a|} µ[a, w] = µ[a] ∑_c p_c = µ[a],
and [0, 1) (prove that µ(X ′ ) = 1), and it is clear that π ◦ σ = T ◦ π. Since the image
of a cylinder of length n is a dyadic interval of length 2^{−n}, π preserves the measures
of cylinders. The collection of measurable sets which are mapped by π to sets of the
same measure is a σ –algebra. Since this σ –algebra contains all the cylinders and all
the null sets, it contains all measurable sets. ⊓⊔
The proof is the same as in the case of the angle doubling map. Alternatively, it
follows from the mixing of the angle doubling map and the fact that the two are
isomorphic.
We saw that the angle doubling map is isomorphic to a dynamical system acting
as the left shift on a space of sequences (a Bernoulli scheme). Such representations
appear frequently in the theory of dynamical systems, but more often than not, the
space of sequences is slightly more complicated than the set of all sequences.
Let S be a finite set, and A = (t_{ij})_{S×S} a matrix of zeroes and ones without columns
or rows made entirely of zeroes.
Definition 1.8. The subshift of finite type (SFT) with alphabet S and transition matrix
A is the set Σ_A^+ = {x = (x₀, x₁, …) ∈ S^N : t_{x_i x_{i+1}} = 1 for all i}, together with the
metric d(x, y) := 2^{−min{k : x_k ≠ y_k}} and the action σ(x₀, x₁, x₂, …) = (x₁, x₂, …).
This is a compact metric space, and the left shift map σ : ΣA+ → ΣA+ is continuous.
We think of Σ_A^+ as the space of all infinite paths on a directed graph with vertices
S and an edge a → b for each pair a, b ∈ S such that t_{ab} = 1.
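The path picture suggests a standard counting fact (not stated above, but easy to prove by induction): the entry (Aⁿ)_{ab} counts the admissible words of length n + 1 which start at a and end at b. A sketch, using the "golden mean" SFT as an example:

```python
# Counting admissible words in a SFT: the number of words of length n is the
# sum of the entries of A^(n-1), since (A^k)_{ab} counts admissible words of
# length k+1 from a to b.
A = [[1, 1],
     [1, 0]]          # golden mean shift: state 1 may not follow state 1

def mat_mult(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def count_words(n):
    """Number of admissible words of length n (sum of entries of A^(n-1))."""
    P = [[1, 0], [0, 1]]          # identity = A^0
    for _ in range(n - 1):
        P = mat_mult(P, A)
    return sum(sum(row) for row in P)

print([count_words(n) for n in range(1, 8)])   # -> [2, 3, 5, 8, 13, 21, 34]
```

The Fibonacci-like growth rate of these counts is the combinatorial shadow of the topological entropy of the shift, a theme taken up in Chapter 4.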
Let ΣA+ be a SFT with set of states S, |S| < ∞, and transition matrix A = (Aab )S×S .
• A stochastic matrix is a matrix P = (pab )a,b∈S with non-negative entries, such
that ∑b pab = 1 for all a, i.e. P1 = 1. The matrix is called compatible with A, if
Aab = 0 ⇒ pab = 0.
• A probability vector is a vector p = ⟨p_a : a ∈ S⟩ of non-negative entries, s.t.
∑_a p_a = 1.
• A stationary probability vector is a probability vector p = ⟨pa : a ∈ S⟩ s.t. pP = p:
∑a pa pab = pb .
Given a probability vector p and a stochastic matrix P compatible with A, one can
define a Markov measure µ on Σ_A^+ (or Σ_A) by

µ[a₀, a₁, …, a_{n−1}] := p_{a₀} p_{a₀a₁} p_{a₁a₂} ⋯ p_{a_{n−2}a_{n−1}},

where [a₀, …, a_{n−1}] := {x : x_i = a_i for i = 0, …, n − 1}.
Canceling the identical terms on both sides gives ∑_a p_a p_{ab₀} = p_{b₀}. Thus µ is shift
invariant iff p is P–stationary.
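As a sketch, the cylinder formula µ[a₀, …, a_{n−1}] = p_{a₀} p_{a₀a₁} ⋯ p_{a_{n−2}a_{n−1}} and the invariance criterion can be checked numerically. The matrix P and vector p below are made-up example values satisfying pP = p.

```python
# Markov measure of a cylinder, and shift invariance on cylinders:
# sigma^{-1}[w] is the disjoint union over a of [a, w], so invariance
# amounts to sum_a mu([a] followed by w) = mu[w], i.e. to pP = p.
P = [[0.5, 0.5],
     [0.25, 0.75]]                 # a compatible stochastic matrix (example)
p = [1.0 / 3.0, 2.0 / 3.0]         # stationary: 1/3*0.5 + 2/3*0.25 = 1/3

def mu(word):
    m = p[word[0]]
    for a, b in zip(word, word[1:]):
        m *= P[a][b]
    return m

w = (0, 1, 1)
lhs = sum(mu((a,) + w) for a in (0, 1))   # measure of sigma^{-1}[w]
print(lhs, mu(w))                          # equal (up to round-off), ~0.125
```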
We now show that every stochastic matrix has a stationary probability vector.
Consider the right action q ↦ qP of P on the simplex ∆ of probability vectors in R^N:
Thus every stochastic matrix determines (at least one) shift invariant measure.
Such measures are called Markov measures. We ask when this measure is ergodic,
and when it is mixing.
There are obvious obstructions to ergodicity and mixing. To state them concisely,
we introduce some terminology. Suppose P = (p_{ab})_{S×S} is a stochastic matrix. We
say that a connects to b in n steps, and write a →^n b, if there is a path of length n + 1,
(a, ξ₁, …, ξ_{n−1}, b) ∈ S^{n+1}, s.t. p_{aξ₁} p_{ξ₁ξ₂} ⋯ p_{ξ_{n−1}b} > 0 (see problem 1.5).
Definition 1.9. A stochastic matrix P = (p_{ab})_{a,b∈S} is called irreducible, if for every
a, b ∈ S there exists an n s.t. a →^n b.
Lemma 1.1. If P is irreducible, then p := gcd{n : a →^n a} is independent of a
(gcd = greatest common divisor). The common value p is called the period.
Proof. Let p_a := gcd{n : a →^n a}, p_b := gcd{n : b →^n b}, and Λ_b := {n : b →^n b}. Then
Λ_b + Λ_b ⊂ Λ_b, and therefore Λ_b − Λ_b is a subgroup of Z. Necessarily Λ_b − Λ_b = p_b Z.
By irreducibility, there are α, β s.t. a →^α b →^β a. So for all n ∈ Λ_b, a →^α b →^n b →^β a,
whence p_a | gcd(α + β + Λ_b). This clearly implies that p_a divides every number in
Λ_b − Λ_b = p_b Z. So p_a | p_b. Similarly one shows that p_b | p_a, whence p_a = p_b. ⊓⊔
is a shift invariant set which contains [a], but which is disjoint from [b]. So E is an
invariant set such that 0 < p_a ≤ µ(E) ≤ 1 − p_b < 1. So µ is not ergodic.
If P is irreducible, but not aperiodic, then any Markov measure on Σ_A^+ is non-mixing,
because the function f := 1_{[a]} satisfies f · f ∘ Tⁿ ≡ 0 for all n not divisible
by the period. This means that ∫ f · f ∘ Tⁿ dµ is equal to zero on a subsequence, and
therefore cannot converge to µ[a]².
We claim that these are the only possible obstructions to ergodicity and mixing.
The proof is based on the following fundamental fact, whose proof will be given in
the next section.
Corollary 1.1. A shift invariant Markov measure on a SFT ΣA+ is ergodic iff its
stochastic matrix is irreducible, and mixing iff its stochastic matrix is irreducible
and aperiodic.
Proof. Let µ be a Markov measure with stochastic matrix P and stationary probability
vector p. For all cylinders [a] = [a₀, …, a_{n−1}] and [b] = [b₀, …, b_{m−1}],
µ([a] ∩ σ^{−k}[b]) = µ[a] · ∑_{ξ∈W_{k−n}} p_{a_{n−1}ξ₁} ⋯ p_{ξ_{k−n}b₀} · (µ[b] / p_{b₀}) = µ[a]µ[b] · p^{(k−n)}_{a_{n−1}b₀} / p_{b₀}.
µ(E) = µ(E ∩ σ^{−k}E) = ∑_{i=1}^{N} µ(A_i ∩ σ^{−k}E) ± ε = ∑_{i,j=1}^{N} µ(A_i ∩ σ^{−k}A_j) ± 2ε.
Remark 1. The ergodic theorem for Markov chains can be visualized as follows.
Imagine that we distribute mass on the states of S according to a probability distribution
q = (q_a)_{a∈S}. Now shift mass from one state to another using the rule that
a p_{ab}–fraction of the mass at a is moved to b. The new mass distribution is qP
(check). After n steps, the mass distribution is qPⁿ. The previous theorem says that,
in the aperiodic case, the mass distribution converges to the stationary distribution:
the equilibrium state. It can be shown that the rate of convergence is exponential
(problem 1.7).
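Remark 1 can be simulated directly. In the sketch below the 2-state matrix is a made-up irreducible aperiodic example; its stationary vector is (0.8, 0.2), and the mass distribution converges to it exponentially fast (the second eigenvalue here is 0.5).

```python
# Mass redistribution q -> qP for an irreducible aperiodic stochastic matrix:
# a p_ab-fraction of the mass at a moves to b, and q P^n tends to the
# stationary distribution.
P = [[0.9, 0.1],
     [0.4, 0.6]]                    # example values; stationary vector (0.8, 0.2)

def step(q):
    n = len(q)
    return [sum(q[a] * P[a][b] for a in range(n)) for b in range(n)]

q = [1.0, 0.0]                      # start with all mass on state 0
for _ in range(100):
    q = step(q)
print(q)                            # ~[0.8, 0.2]
```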
Remark 2: The ergodic theorem for Markov chains has an important generalization
to all matrices with non-negative entries, see problem 1.6.
Suppose first that P is an irreducible and aperiodic stochastic matrix. This implies
that there is some power m such that all the entries of P^m are strictly positive.²
Let N := |S| denote the number of states, and consider the set of all probability
vectors ∆ := {(x1 , . . . , xN ) : xi ≥ 0, ∑ xi = 1}. Since P is stochastic, the map T (x) =
xP maps ∆ continuously into itself. By the Brouwer Fixed Point Theorem, there is
a probability vector p s.t. pP = p (this is the stationary probability vector).
The irreducibility of P implies that all the coordinates of p are strictly positive.
Indeed, ∑ pk = 1 so at least one coordinate pi is positive. For any other coordinate
p_j there is by irreducibility a path (i, ξ₁, …, ξ_{n−1}, j) such that p_{iξ₁} p_{ξ₁ξ₂} ⋯ p_{ξ_{n−1}j} > 0.
So p_j = (pPⁿ)_j = ∑_k p_k p^{(n)}_{kj} ≥ p_i p_{iξ₁} p_{ξ₁ξ₂} ⋯ p_{ξ_{n−1}j} > 0.
So p lies in the interior of ∆, and the set C := ∆ − p is a compact convex neighborhood
of the origin such that T(C) ⊂ C, T^m(C) ⊂ int[C]. (We mean the relative
interior in the (N − 1)-dimensional body C, not the ambient space R^N.)
Now consider L := span(C) (an (N − 1)–dimensional space). This is an invariant
space for T, whence for P^t (the transpose of P). We claim that all the eigenvalues of
P^t|_L have absolute value less than 1:
1. Eigenvalues of modulus larger than one are impossible, because P is stochastic,
so ∥vP∥₁ ≤ ∥v∥₁, so the spectral radius of P^t cannot be more than 1.
2. Roots of unity are impossible, because in this case for some k, P^{km} has a real
eigenvector v with eigenvalue one. There is no loss of generality in assuming
that v ∈ ∂C, otherwise multiply v by a suitable scalar. But P^{km} cannot have fixed
points on ∂C, because P^{km}(C) ⊂ int[C].
3. Eigenvalues e^{iθ} with θ ∉ 2πQ are impossible, because if e^{iθ} is an eigenvalue then
there are two real eigenvectors u, v ∈ ∂C such that the action of P on span{u, v}
is conjugate to
( cos θ  sin θ ; −sin θ  cos θ ),
an irrational rotation. This means ∃k_n → ∞ s.t. uP^{mk_n} → u ∈ ∂C. But this cannot
be the case, because uP^{mk_n} ∈ P^m[C] ⊂ int[C], and by compactness P^m[C] is
bounded away from ∂C.

² Begin by proving that if A is irreducible and aperiodic, then for every a there is an N_a s.t. a →^n a
for all n > N_a. Use this to show that for all a, b there is an N_{ab} s.t. a →^n b for all n > N_{ab}. Take
m = max{N_{ab}}.
In summary, the spectral radius of P^t|_L is less than one.
But R^N = L ⊕ span{p}. If we decompose a general vector v = q + tp with q ∈
L, then the above implies that vPⁿ = tp + O(∥Pⁿ|_L∥)∥q∥ → tp as n → ∞. It follows that
p^{(n)}_{ab} → p_b for all a, b.
This is almost the ergodic theorem for irreducible aperiodic Markov chains, the
only thing which remains is to show that P has a unique stationary vector. Suppose
q is another probability vector s.t. qP = q. We can write p^{(n)}_{ab} → p_b in matrix form
as follows:
Pⁿ → Q as n → ∞, where Q = (q_{ab})_{S×S} and q_{ab} = p_b.
The sets S_k are pairwise disjoint, because if b ∈ S_{k₁} ∩ S_{k₂}, then there are α_i = k_i mod p and β
s.t. v →^{α_i} b →^β v for i = 1, 2. By the definition of the period, p | α_i + β for i = 1, 2,
whence k₁ = α₁ = −β = α₂ = k₂ mod p.
It is also clear that every path of length ℓ which starts at Sk , ends at Sk′ where
k′ = k + ℓ mod p. In particular, every path of length p which starts at Sk ends at Sk .
This means that if p^{(p)}_{ab} > 0, then a, b belong to the same S_k.
It follows that P^p is conjugate, via a coordinate permutation, to a block matrix
with blocks (p^{(p)}_{ab})_{S_k×S_k}. Each of the blocks is stochastic, irreducible, and aperiodic.
Let π^{(k)} denote the stationary probability vectors of the blocks.
By the first part of the proof, p^{(ℓp)}_{ab} → π^{(k)}_b as ℓ → ∞ for all a, b in the same S_k, and
p^{(n)}_{ab} = 0 for n ≠ 0 mod p. More generally, if a ∈ S_{k₁} and b ∈ S_{k₂}, then

lim_{ℓ→∞} p^{(ℓp+k₂−k₁)}_{ab} = lim_{ℓ→∞} ∑_{ξ∈S_{k₂}} p^{(k₂−k₁+p)}_{aξ} p^{((ℓ−1)p)}_{ξb}
  = ∑_{ξ∈S_{k₂}} p^{(k₂−k₁+p)}_{aξ} π^{(k₂)}_b   (by the above)
  = π^{(k₂)}_b ∑_{ξ∈S} p^{(k₂−k₁+p)}_{aξ} = π^{(k₂)}_b.   (∵ p^{(k₂−k₁+p)}_{aξ} = 0 when ξ ∉ S_{k₂})

Similarly, p^{(ℓp+α)}_{ab} = 0 when α ≠ k₂ − k₁ mod p. Thus
lim_{n→∞} (1/n) ∑_{m=0}^{n−1} p^{(m)}_{ab} = (1/p) π^{(k)}_b   for the unique k s.t. S_k ∋ b.

The limiting vector π, defined by π_b := (1/p) π^{(k)}_b for b ∈ S_k, is a probability vector,
because ∑_{k=1}^{p} ∑_{b∈S_k} (1/p) π^{(k)}_b = 1.
We claim that π is the unique stationary probability vector of P. The limit theorem
for p^{(n)}_{ab} can be written in the form (1/n) ∑_{k=0}^{n−1} P^k → Q, where Q = (q_{ab})_{S×S} and
q_{ab} = π_b. As before, this implies that πP = π, and that for any probability vector q such
that qP = q, we also have qQ = q, whence q = π. ⊓⊔
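A minimal numerical sketch of the periodic case (p = 2, two states that swap): the powers Pⁿ oscillate between P and the identity, so they do not converge, but the Cesàro averages (1/n) ∑_{k<n} P^k converge to the matrix Q with q_{ab} = π_b = 1/2.

```python
# Period-2 chain: P^n alternates, but the Cesaro averages of P^k converge.
P = [[0.0, 1.0],
     [1.0, 0.0]]                   # the two states swap each step

def mat_mult(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 1000
S = [[0.0, 0.0], [0.0, 0.0]]       # accumulates (1/n) * sum_{k<n} P^k
Pk = [[1.0, 0.0], [0.0, 1.0]]      # P^0
for _ in range(n):
    for i in range(2):
        for j in range(2):
            S[i][j] += Pk[i][j] / n
    Pk = mat_mult(Pk, P)

print(S)                           # every entry ~0.5
```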
The hyperbolic plane is the surface H := {z ∈ C : Im(z) > 0} equipped with the
Riemannian metric ds = |dz|/Im(z), which gives it constant curvature (−1).
It is known that the orientation preserving isometries (i.e. distance preserving
maps) are the Möbius transformations which preserve H. They form the group
Möb(H) = { z ↦ (az + b)/(cz + d) : a, b, c, d ∈ R, ad − bc = 1 }
       ≃ { ( a b ; c d ) : a, b, c, d ∈ R, ad − bc = 1 } / {±id} =: PSL(2, R).
(We quotient by {±id} to identify ( a b ; c d ) and ( −a −b ; −c −d ), which represent
the same Möbius transformation.)
The geodesics (i.e. length minimizing curves) on the hyperbolic plane are vertical
half–lines, or circle arcs which meet ∂H at right angles. Here is why: Suppose ω ∈
T¹H is a unit tangent vector which points directly up; then it is not difficult to see
that the geodesic in direction ω is a vertical line. For general unit tangent vectors
ω, find an element ϕ ∈ Möb(H) which rotates them so that dϕ(ω) points up. The
geodesic in direction ω is the ϕ–preimage of the geodesic in direction dϕ(ω) (a
vertical half–line). Since Möbius transformations map lines to lines or circles in a
conformal way, the geodesic of ω is a circle meeting ∂ H at right angles.
The geodesic flow of H is the flow g_t on the unit tangent bundle T¹H which moves
a unit tangent vector ω at unit speed along the geodesic it determines. Möb(H) acts
simply transitively on T¹H, so we may identify T¹H with PSL(2, R).
It can be shown that in this coordinate system the geodesic flow takes the form
g_t ( a b ; c d ) = ( a b ; c d ) · ( e^{t/2} 0 ; 0 e^{−t/2} ).

To verify this, it is enough to calculate the geodesic flow on ω₀ ≃ ( 1 0 ; 0 1 ).
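The one-parameter group property g_{t+s} = g_t ∘ g_s is immediate from this formula, since diag(e^{t/2}, e^{−t/2}) · diag(e^{s/2}, e^{−s/2}) = diag(e^{(t+s)/2}, e^{−(t+s)/2}). A quick numerical check (the matrix ω below is an arbitrary element with determinant 1):

```python
import math

# Check of the flow property g_{t+s} = g_t o g_s for the matrix form of the
# geodesic flow: right multiplication by diag(e^{t/2}, e^{-t/2}).
def mat_mult(X, Y):
    return [[X[0][0]*Y[0][0] + X[0][1]*Y[1][0], X[0][0]*Y[0][1] + X[0][1]*Y[1][1]],
            [X[1][0]*Y[0][0] + X[1][1]*Y[1][0], X[1][0]*Y[0][1] + X[1][1]*Y[1][1]]]

def g(t, omega):
    return mat_mult(omega, [[math.exp(t / 2), 0.0], [0.0, math.exp(-t / 2)]])

omega = [[2.0, 1.0], [3.0, 2.0]]   # an arbitrary element, det = 2*2 - 1*3 = 1
lhs = g(0.7, g(0.3, omega))
rhs = g(1.0, omega)
print(lhs, rhs)                    # agree up to round-off
```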
Next we describe the Riemannian volume measure on T¹H (up to normalization).
Such a measure must be invariant under the action of all isometries. In our
coordinate system, the isometry ϕ(z) = (az + b)/(cz + d) acts by
ϕ · ( x y ; z w ) = ( a b ; c d ) ( x y ; z w ).
Since PSL(2, R) is a locally compact topological group, there is only one Borel
measure on PSL(2, R) (up to normalization) which is left invariant by all left translations
on the group: the Haar measure of PSL(2, R). Thus the Riemannian volume
measure is a left Haar measure of PSL(2, R), and this determines it up to normalization.
It is a convenient feature of PSL(2, R) that its left Haar measure is also invariant
under right translations. It follows that the geodesic flow preserves the volume
measure on T¹H. But this measure is infinite, and it is not ergodic (prove!).
To obtain ergodic flows, we need to pass to compact quotients of H. These are
called hyperbolic surfaces.
A hyperbolic surface is a Riemannian surface M such that every point in M has a
neighborhood V which is isometric to an open subset of H. Recall that a Riemannian
surface is called complete, if every geodesic can be extended indefinitely in both
directions.
Step 1. If f ∈ L², then f ∘ h_{te^{−s}} → f in L² as s → ∞.
In this section we discuss several standard methods for creating new measure preserving
transformations from old ones. These constructions appear quite frequently
in applications.
Products
Recall that the product of two measure spaces (X_i, B_i, µ_i) (i = 1, 2) is the measure
space (X₁ × X₂, B₁ ⊗ B₂, µ₁ × µ₂), where B₁ ⊗ B₂ is the smallest σ–algebra which
contains all sets of the form B₁ × B₂ with B_i ∈ B_i, and µ₁ × µ₂ is the unique
measure such that (µ₁ × µ₂)(B₁ × B₂) = µ₁(B₁)µ₂(B₂).
This construction captures the idea of independence from probability theory: if
(X_i, B_i, µ_i) are the probability models of two random experiments, and these experiments
are “independent”, then (X₁ × X₂, B₁ ⊗ B₂, µ₁ × µ₂) is the probability model
of the pair of experiments: for every E₁ ∈ B₁, E₂ ∈ B₂, the events F₁ := E₁ × X₂
and F₂ := X₁ × E₂ satisfy F₁ ∩ F₂ = E₁ × E₂; so (µ₁ × µ₂)(F₁ ∩ F₂) = (µ₁ × µ₂)(F₁)(µ₁ × µ₂)(F₂), showing
that the events F₁, F₂ are independent.
Proposition 1.9. The product of two ergodic mpt is not always ergodic. The product
of two mixing mpt is always mixing.
= µ((A₁ ∩ T^{−n}A₂) × (B₁ ∩ T^{−n}B₂))
1.6.1 Skew-products

We start with an example. Let µ be the (1/2, 1/2)–Bernoulli measure on the two-shift
Σ₂⁺ := {0, 1}^N. Let f : Σ₂⁺ → Z be the function f(x₀, x₁, …) = (−1)^{x₀}. Consider the
transformation
T_f(x, k) = (σx, k + f(x)),
where σ : Σ₂⁺ → Σ₂⁺ is the left shift. This system preserves the (infinite) measure
µ × m_Z where m_Z is the counting measure on Z. The n-th iterate is
T_f^n(x, k) = (σⁿx, k + X₀ + X₁ + ⋯ + X_{n−1}),   where X_i := f ∘ σ^i.
What we see in the second coordinate is the symmetric random walk on Z, started at
k, because (1) the steps X_i take the values ±1, and (2) {X_i} are independent because
of the choice of µ. We say that the second coordinate is a “random walk on Z driven
by the noise process (Σ₂⁺, B, µ, σ).”
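A simulation of this second coordinate (a sketch: it samples the Bernoulli digits directly rather than iterating the shift): after n steps the Z-coordinate started at 0 has mean ≈ 0 and mean square ≈ n, as the symmetric random walk should.

```python
import random

# The Z-coordinate of the skew product after n steps is
# k + f(x) + f(sigma x) + ... + f(sigma^{n-1} x) with f = (-1)^{x_0},
# i.e. a sum of n i.i.d. +/-1 steps (we sample the Bernoulli digits directly).
random.seed(2)

def z_coordinate(n):
    """Z-coordinate of T_f^n(x, 0) for a mu-random x in Sigma_2^+."""
    return sum(1 if random.random() < 0.5 else -1 for _ in range(n))

samples = [z_coordinate(1000) for _ in range(2000)]
mean = sum(samples) / len(samples)
msq = sum(s * s for s in samples) / len(samples)
print(mean, msq)    # mean ~0, mean square ~n = 1000, as for the SRW on Z
```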
Here is a variation on this example. Suppose T₀, T₁ are two measure preserving
transformations of the same measure space (Y, C, ν). Consider the transformation
(X × Y, B ⊗ C, µ × ν, T_f), where X = Σ₂⁺ and

T_f(x, y) = (σx, T_{x₀}(y)).

The n–th iterate takes the form T_f^n(x, y) = (σⁿx, T_{x_{n−1}} ··· T_{x₀} y). The second coordinate looks like the random concatenation of elements of {T₀, T₁}. We say that T_f is
a “random dynamical system driven by the noise process (X, B, µ, T).”
These examples suggest the following abstract constructions.
Suppose (X, B, µ, T ) is a mpt, and G is a locally compact Polish3 topological
group, equipped with a left invariant Haar measure mG . Suppose f : X → G is mea-
surable.
Definition 1.13. The skew–product with cocycle f over the basis (X, B, µ, T ) is the
mpt (X × G, B ⊗ B(G), µ × mG , T f ), where T f : X × G → X × G is the transforma-
tion T f (x, g) = (T x, g · f (x)).
(Check, using Fubini’s theorem, that this is a mpt.) The n–th iterate T_f^n(x, g) =
(Tⁿx, g · f(x) · f(Tx) ··· f(T^{n−1}x)) is a “random walk on G driven by the noise
process (X, B, µ, T).”
Now imagine that the group G is acting in a measure preserving way on some
space (Y, C , ν). This means that there are measurable maps Tg : Y → Y such that
ν ∘ T_g^{−1} = ν, T_{g₁g₂} = T_{g₁}T_{g₂}, and (g, y) ↦ T_g(y) is a measurable map from G × Y to Y.
Definition 1.14. The random dynamical system on (Y, C , ν) with action {Tg : g ∈
G}, cocycle f : X → G, and noise process (X, B, µ, T ), is the system (X × Y, B ⊗
C , µ × ν, T f ) given by T f (x, y) = (T x, T f (x) y).
3 “Polish”=has a topology which makes it a complete separable metric space.
(Check using Fubini’s theorem that this is measure preserving.) Here the n-th iterate
is T_f^n(x, y) = (Tⁿx, T_{f(T^{n−1}x)} ··· T_{f(Tx)} T_{f(x)} y).
It is obvious that if a skew–product (or a random dynamical system) is ergodic or
mixing, then its base is ergodic or mixing. The converse is not always true. The
ergodic properties of a skew–product depend in a subtle way on the interaction
between the base and the cocycle.
Here are two important obstructions to ergodicity and mixing for skew–products.
In what follows G is a Polish group and Ĝ is its group of characters,

Ĝ := {γ : G → S¹ : γ is a continuous homomorphism}.
So periodicity ⇒ non-mixing. ⊓⊔
1.6.2 Factors
An invertible mpt is a mpt (X, B, µ, T ) such that for some invariant set X ′ ⊂ X of full
measure, T : X ′ → X ′ is invertible, and T, T −1 : X ′ → X ′ are measurable. Invertible
mpt are more convenient to handle than general mpt because they have the following
properties: If (X, B, µ) is a complete measure space, then the forward image of a
measurable set is measurable, and for any countable collection of measurable sets
A_i, T(⋂ A_i) = ⋂ T(A_i) up to a set of measure zero. This is not true in general for
non-invertible maps. Luckily, every “reasonable” non-invertible mpt is the factor of
an invertible mpt. Here is the construction.
4 Such a σ –algebra exists: take the intersection of all sub-σ –algebras which make f ◦ T n all mea-
surable, and note that this intersection is not empty because it contains B.
Proof (the proof can be omitted on first reading). Let S denote the collection of all
sets of the form [E_{−n}, …, E_0] := {x ∈ X̃ : x_{−i} ∈ E_{−i} (i = 0, …, n)}, where n ≥ 0 and
E_{−n}, …, E_{−1}, E_0 ∈ B. We call the elements of S cylinders.
It is easy to see that S is a semi-algebra. Our plan is to define µ̃ on S and then
apply Carathéodory’s extension theorem. To do this we first observe the following
important identity:

[E_{−n}, …, E_0] = { x ∈ X̃ : x_{−n} ∈ ⋂_{i=0}^{n} T^{−(n−i)} E_{−i} }.   (1.3)

The inclusion ⊆ is because for every x ∈ [E_{−n}, …, E_0], T^{n−i}(x_{−n}) = x_{−i} ∈ E_{−i} by the definition of X̃, whence x_{−n} ∈ T^{−(n−i)}E_{−i}; and ⊇ is because if x_{−n} ∈ T^{−(n−i)}E_{−i} then x_{−i} ≡ T^{n−i}(x_{−n}) ∈ E_{−i} for i = 0, …, n. Motivated by this identity we define µ̃ : S → [0, ∞] by

µ̃[E_{−n}, …, E_0] := µ( ⋂_{i=0}^{n} T^{−(n−i)} E_{−i} ).

Since µ̃(X̃) = µ(X) = 1, µ̃ is σ-finite on X̃. We will check that µ̃ is σ-additive on
S and deduce the lemma from Carathéodory’s extension theorem. We begin with
simpler finite statements.
STEP 1. Suppose C_1, …, C_α are pairwise disjoint cylinders and D_1, …, D_β are pairwise disjoint cylinders. If ⊎_{i=1}^{α} C_i = ⊎_{i=1}^{β} D_i, then ∑ µ̃(C_i) = ∑ µ̃(D_i).
Proof. Notice that [E_{−n}, …, E_0] ≡ [X, …, X, E_{−n}, …, E_0] and (by the T-invariance
of µ) µ̃[E_{−n}, …, E_0] = µ̃[X, …, X, E_{−n}, …, E_0], no matter how many X’s we add to
the left. So there is no loss of generality in assuming that C_i, D_j all have the same
length: C_i = [C^{(i)}_{−n}, …, C^{(i)}_0], D_j = [D^{(j)}_{−n}, …, D^{(j)}_0].
By (1.3), since the C_i are pairwise disjoint, the sets ⋂_{k=0}^{n} T^{−(n−k)}C^{(i)}_{−k} are pairwise disjoint.
Similarly, the sets ⋂_{k=0}^{n} T^{−(n−k)}D^{(j)}_{−k} are pairwise disjoint. Since ⊎_{i=1}^{α} C_i = ⊎_{j=1}^{β} D_j, the identity (1.3) also implies that ⊎_{i=1}^{α} ⋂_{k=0}^{n} T^{−(n−k)}C^{(i)}_{−k} = ⊎_{j=1}^{β} ⋂_{k=0}^{n} T^{−(n−k)}D^{(j)}_{−k}. So
1.6 Basic constructions 25
∑_{i=1}^{α} µ̃(C_i) = ∑_{i=1}^{α} µ( ⋂_{k=0}^{n} T^{−(n−k)}C^{(i)}_{−k} ) = ∑_{j=1}^{β} µ( ⋂_{k=0}^{n} T^{−(n−k)}D^{(j)}_{−k} ) = ∑_{j=1}^{β} µ̃(D_j),
⋃_{i=1}^{n} C_i = ⊎_{i=1}^{n} ⊎_{k=1}^{n_i} C_{ik},  where ⊎_{k=1}^{n_i} C_{ik} ⊂ C_i.
µ(F) ≥ µ̃(C) − ε  and  µ(U_k) ≤ µ̃(C_k) + ε/2^k.

By (1.3), [F, X, …, X] (with n X’s) ⊂ C = ⊎_{k=1}^{∞} C_k ⊂ ⋃_{k=1}^{∞} [U_k, X, …, X] (with n_k X’s), and with respect to
the product topology on X̃, [F, X, …, X] is compact and [U_k, X, …, X] are open. So
there is a finite N s.t. [F, X, …, X] ⊂ ⋃_{k=1}^{N} [U_k, X, …, X]. By step 2,

µ̃[F, X, …, X] ≤ ∑_{k=1}^{N} µ̃[U_k, X, …, X]
Proof. It is clear that the natural extension is an invertible ppt. Let π : X̃ → X denote
the map π(x) = x_0; then π is measurable and π ∘ T̃ = T ∘ π. Since T X = X, every
point has a pre-image, and so π is onto. Finally, for every E ∈ B,

µ̃(π^{−1}(E)) = µ̃({x̃ ∈ X̃ : x_0 ∈ E}) = µ(E)

by construction. So (X̃, B̃, µ̃, T̃) is an invertible extension of (X, B, µ, T).
Suppose (Y, C, ν, S) is another invertible extension, and let π_Y : Y → X be the
factor map (defined a.e. on Y). We show that (Y, C, ν, S) extends (X̃, B̃, µ̃, T̃).
Let (Ỹ, C̃, ν̃, S̃) be the natural extension of (Y, C, ν, S). It is isomorphic to
(Y, C, ν, S), with the isomorphism given by ϑ(y) = (y_k)_{k∈Z}, y_k := S^k(y). Thus it
is enough to show that (X̃, B̃, µ̃, T̃) is a factor of (Ỹ, C̃, ν̃, S̃). Here is the factor
map: θ : (y_k)_{k∈Z} ↦ (π_Y(y_k))_{k∈Z}.
If T̃ is ergodic, then T is ergodic, because every T-invariant set E lifts to a T̃-invariant set Ẽ := π^{−1}(E). The ergodicity of T̃ implies that µ̃(Ẽ) = 0 or 1, whence
µ(E) = µ̃(π^{−1}(E)) = µ̃(Ẽ) = 0 or 1.
To see the converse (T is ergodic ⇒ T̃ is ergodic) we make use of the following
observation:

CLAIM: Let B̃_n := { {x̃ ∈ X̃ : x̃_{−n} ∈ E} : E ∈ B }; then
1. B̃_n are σ-algebras;
2. B̃_1 ⊂ B̃_2 ⊂ ··· and ⋃_{n≥0} B̃_n generates B̃;
3. T̃^{−1}(B̃_n) ⊂ B̃_n and (X̃, B̃_n, µ̃, T̃) is a ppt;
4. if T is ergodic then (X̃, B̃_n, µ̃, T̃) is ergodic.
Proof. We leave the first three items as exercises to the reader. To see the last item,
suppose T is ergodic and Ẽ ∈ B̃_n is T̃-invariant. By the definition of B̃_n, Ẽ = {x̃ :
x̃_{−n} ∈ E} with E ∈ B, and it is not difficult to see that E must be T-invariant. Since
T is ergodic, µ(E) = 0 or 1. So µ̃(Ẽ) = µ(E) = 0 or 1. So (X̃, B̃_n, µ̃, T̃) is ergodic.
We can now prove the ergodicity of (X̃, B̃, µ̃, T̃) as follows. Suppose f̃ is absolutely integrable and T̃-invariant, and let

f̃_n := E(f̃ | B̃_n)

(readers who are not familiar with conditional expectations can find their definition
in section 2.3.1).
We claim that f̃_n ∘ T̃ = f̃_n. This is because for every bounded B̃_n–measurable test
function ϕ̃,

∫ [ϕ̃ · (f̃_n ∘ T̃^{−1})] dµ̃ = ∫ (ϕ̃ · f̃_n ∘ T̃^{−1}) ∘ T̃ dµ̃ = ∫ (ϕ̃ ∘ T̃) · f̃_n dµ̃ = ∫ (ϕ̃ ∘ T̃) · f̃ dµ̃
= ∫ (ϕ̃ ∘ T̃) · (f̃ ∘ T̃) dµ̃ = ∫ ϕ̃ f̃ dµ̃ = ∫ ϕ̃ f̃_n dµ̃

(the first and fifth equalities use the T̃-invariance of µ̃; the third and last use the defining property of the conditional expectation, since ϕ̃ ∘ T̃ is B̃_n–measurable; the fourth uses the T̃-invariance of f̃).
So µ̃(Ã ∩ T̃^{−k}B̃) = µ(A ∩ T^{−k}B) −→ µ(A)µ(B) ≡ µ̃(Ã)µ̃(B̃) as k → ∞.
with the minimum of the empty set being understood as infinity. Note that ϕA < ∞
a.e. on A, hence A0 := {x ∈ A : ϕA (x) < ∞} is equal to A up to a set of measure zero.
Definition 1.18. The induced transformation on A is (A0 , B(A), µA , TA ), where
A0 := {x ∈ A : ϕA (x) < ∞}, B(A) := {E ∩ A0 : E ∈ B}, µA is the measure µA (E) :=
µ(E|A) = µ(E ∩ A)/µ(A), and TA : A0 → A0 is TA (x) = T ϕA (x) (x).
Theorem 1.7. Suppose (X, B, µ, T ) is a ppt, and A ∈ B has positive finite measure.
1. µA ◦ TA−1 = µA ;
2. if T is ergodic, then TA is ergodic (but the mixing of T ̸⇒ the mixing of TA );
3. Kac Formula: If µ is ergodic, then ∫ f dµ = ∫_A ∑_{k=0}^{ϕ_A − 1} f ∘ T^k dµ for every f ∈ L¹(X). In particular, ∫_A ϕ_A dµ_A = 1/µ(A).
Proof. Given E ⊂ A measurable, µ(E) = µ(T^{−1}E ∩ A) + µ(T^{−1}E ∩ A^c), where the first term equals µ(T_A^{−1}E ∩ [ϕ_A = 1]).
Passing to the limit as N → ∞, we see that µ(E) ≥ µ(T_A^{−1}E). Working with A \
E, and using the assumption that µ(X) < ∞, we get that µ(A) − µ(E) ≥ µ(A) −
µ(T_A^{−1}E), whence µ(E) = µ(T_A^{−1}E). Since µ_A is proportional to µ on B(A), we get
µ_A = µ_A ∘ T_A^{−1}.
We assume that T is ergodic, and prove that T_A is ergodic. Suppose f : A → R is
measurable and T_A–invariant; define F(x) := f(T^{r_A(x)}x), where r_A(x) := min{k ≥ 0 : T^k x ∈ A}.
This makes sense a.e. in X, because r_A < ∞ almost everywhere. This function is T–invariant, because either x, T x ∈ A and then F(T x) = f(T x) = f(T_A x) = f(x) = F(x),
or one of x, T x is outside A and then F(T x) = f(T^{r_A(T x)} T x) = f(T^{r_A(x)} x) = F(x).
Since T is ergodic, F is constant a.e. on X, and therefore f = F|_A is constant a.e. on
A. Thus the ergodicity of T implies the ergodicity of T_A.
Here is an example showing that the mixing of T does not imply the mixing of
TA . Let Σ + be a SFT with states {a, 1, 2, b} and allowed transitions
a → 1; 1 → 1, b; b → 2; 2 → a.
The second term is bounded by ∥f∥_∞ µ{x : T^j(x) ∉ A for all j ≤ N}. This bound
tends to zero, because µ{x : T^j(x) ∉ A for all j} = 0, because T is ergodic and recurrent (fill in the details). This proves the Kac formula for all L^∞ functions.
Every non-negative L¹–function is the increasing limit of L^∞ functions. By the
monotone convergence theorem, the Kac formula must hold for all non-negative
L¹–functions. Every L¹–function is the difference of two non-negative L¹–functions
(f = f · 1_{[f>0]} − |f| · 1_{[f<0]}). It follows that the Kac formula holds for all f ∈ L¹. ⊓⊔
Proposition 1.11. A Kakutani skyscraper over an ergodic base is ergodic, but there
are non-mixing skyscrapers over mixing bases.
Definition 1.20. The suspension semi-flow with base (X, B, µ, T ) and height func-
tion r is the semi-flow (Xr , B(Xr ), ν, Ts ), where
1. Xr := {(x,t) ∈ X × R : 0 ≤ t < r(x)};
2. B(Xr ) = {E ∈ B(X) ⊗ B(R) : E ⊆ Xr };
3. ν is the measure such that ∫_{X_r} f dν = ∫_X ∫_0^{r(x)} f(x, t) dt dµ(x) / ∫_X r dµ;
4. T_s(x, t) = (Tⁿx, t + s − ∑_{k=0}^{n−1} r(T^k x)), where n is s.t. 0 ≤ t + s − ∑_{k=0}^{n−1} r(T^k x) < r(Tⁿx).
Problems
1.4. Fill in the details in the proof above that the Markov measure corresponding to
a stationary probability vector and a stochastic matrix exists, and is a shift invariant
measure.
1.5. Suppose Σ_A⁺ is a SFT with stochastic matrix P. Let A = (t_{ab})_{S×S} denote the
matrix of zeroes and ones where t_{ab} = 1 if p_{ab} > 0 and t_{ab} = 0 otherwise. Write
Aⁿ = (t_{ab}^{(n)}). Prove that t_{ab}^{(n)} is the number of paths of length n starting at a and ending
at b. In particular: a → b in n steps ⇔ t_{ab}^{(n)} > 0.
The standard references for measure theory are [7] and [4], and the standard refer-
ences for ergodic theory of probability preserving transformations are [6] and [8].
For ergodic theory on infinite measure spaces, see [1]. Our proof of the Perron-
Frobenius theorem is taken from [3]. Kac’s formula has a very simple proof when T
is invertible. The proof we use (taken from [5]) works for non-invertible transfor-
mations, and extends to the conservative infinite measure setting. The ergodicity of
the geodesic flow was first proved by E. Hopf by other means. The short proof we
gave is due to Gelfand & Fomin and is reproduced in [2].
References
1. Aaronson, J.: Introduction to infinite ergodic theory, Mathematical Surveys and Monographs
50, AMS, 1997. xii+284pp.
2. Bekka, M.B. and Mayer, M.: Ergodic theory and topological dynamics of group actions on
homogeneous spaces. London Mathematical Society LNM 269, 2000. x+200pp.
3. Brin, M. and Stuck, G.: Introduction to dynamical systems. Cambridge University Press,
Cambridge, 2002. xii+240 pp.
4. Halmos, P. R.: Measure Theory. D. Van Nostrand Company, Inc., New York, N. Y., 1950.
xi+304 pp.
5. Krengel, U.: Ergodic theorems. de Gruyter Studies in Mathematics 6 1985. viii+357pp.
6. Petersen, K.: Ergodic theory. Corrected reprint of the 1983 original. Cambridge Studies in
Advanced Mathematics, 2. Cambridge University Press, Cambridge, 1989. xii+329 pp.
7. Royden, H. L.: Real analysis. Third edition. Macmillan Publishing Company, New York,
1988. xx+444 pp.
8. Walters, P.: An introduction to ergodic theory. Graduate Texts in Mathematics, 79. Springer-
Verlag, New York-Berlin, 1982. ix+250 pp.
Chapter 2
Ergodic Theorems
(1/N) ∑_{k=0}^{N−1} f ∘ T^k = (1/N)(g − g ∘ T^N),  and  ∥g ∘ T^N − g∥₂/N ≤ 2∥g∥₂/N −→ 0 as N → ∞.
∥f − f ∘ T∥₂² = ⟨f − f ∘ T, f − f ∘ T⟩ = ∥f∥₂² − 2⟨f, f ∘ T⟩ + ∥f ∘ T∥₂²
= 2∥f∥₂² − 2⟨f, f − (f − f ∘ T)⟩ = 2∥f∥₂² − 2∥f∥₂² = 0  ⟹  f = f ∘ T a.e.
So C^⊥ ⊆ {invariant functions}. Conversely, if f is invariant then for every g ∈ L²,

⟨f, g − g ∘ T⟩ = ⟨f, g⟩ − ⟨f, g ∘ T⟩ = ⟨f, g⟩ − ⟨f ∘ T, g ∘ T⟩ = ⟨f, g⟩ − ⟨f, g⟩ = 0,

so f ∈ C^⊥. Thus C^⊥ = {invariant functions}, and

L² = C ⊕ {invariant functions}.
We saw above that the MET holds for all elements of C with zero limit, and
holds for all invariant functions f with limit f. Therefore the MET holds for all L²–functions, and the limit f̄ is the orthogonal projection of f on the space of invariant
functions.
In particular f̄ is invariant. If T is ergodic, then f̄ is constant and f̄ = ∫ f dµ
almost everywhere. Also, since (1/N) ∑_{n=1}^{N} f ∘ Tⁿ → f̄ in L²,

∫ f dµ = (1/N) ∑_{n=1}^{N} ⟨1, f ∘ Tⁿ⟩ = ⟨1, (1/N) ∑_{n=1}^{N} f ∘ Tⁿ⟩ −→ ⟨1, f̄⟩ = ∫ f̄ dµ,

so ∫ f̄ dµ = ∫ f dµ, whence f̄ = ∫ f dµ almost everywhere. ⊓⊔
Remark 1. The proof shows that the limit f is the projection of f on the space of
invariant functions.
Remark 2. The proof only uses the fact that U f = f ◦T is an isometry of L2 . In fact
it works for all linear operators U : H → H on separable Hilbert spaces s.t. ∥U∥ ≤ 1,
see problem 2.1.
Remark 3. If f_n → f in L², then ⟨f_n, g⟩ → ⟨f, g⟩ for all g ∈ L². Specializing to the
case f_n = (1/n) ∑_{k=0}^{n−1} 1_B ∘ T^k, g = 1_A, we obtain the following corollary of the MET:

(1/n) ∑_{k=0}^{n−1} µ(A ∩ T^{−k}B) −→ µ(A)µ(B) as n → ∞.
So ergodicity is mixing “on the average.” We will return to this point when we
discuss the definition of weak mixing in the next chapter.
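“Mixing on the average” can be seen numerically in an ergodic but non-mixing example: for a circle rotation, µ(A ∩ T^{−k}B) oscillates and does not converge, yet its Cesàro averages approach µ(A)µ(B). A sketch with A = B = [0, 1/2) (the parameter choices and names are ours):

```python
import math

ALPHA = math.sqrt(2) - 1   # irrational, so the rotation is ergodic

def overlap(k):
    """Lebesgue measure of [0,1/2) ∩ T^{-k}[0,1/2) = [0,1/2) ∩ ([0,1/2) - k*alpha)."""
    s = (-k * ALPHA) % 1.0          # left endpoint of the shifted half-circle
    return abs(s - 0.5)             # overlap of two half-circles at offset s

# overlap(k) oscillates between 0 and 1/2, but its Cesaro average converges:
N = 20000
avg = sum(overlap(k) for k in range(N)) / N
assert abs(avg - 0.25) < 0.01       # close to mu(A) * mu(B) = 1/4
```

The convergence of the average follows from the equidistribution of {kα} mod 1, which is exactly the corollary of the MET applied to this rotation.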
2.2 The Pointwise Ergodic Theorem 37
A_n(x) := (1/n) ∑_{k=0}^{n−1} f(T^k x),   Ā(x) := limsup_{n→∞} A_n(x),   A̲(x) := liminf_{n→∞} A_n(x).

Ā(x), A̲(x) take values in [0, ∞], are measurable, and are T-invariant, as can be easily
checked by taking the limits on both sides of A_n(x) = ((n−1)/n) A_{n−1}(Tx) + O(1/n).
STEP 1. ∫ f dµ ≥ ∫ Ā dµ.
Proof. Fix ε, L, M > 0, and set A_L(x) := Ā(x) ∧ L = min{Ā(x), L} (an invariant function). The following function is well-defined and finite everywhere:

τ_L(x) := min{ n ≥ 1 : A_n(x) > A_L(x) − ε }.

• If k is red, then T^k x ∈ [τ_L > M], so ∑_{red k's} 1_{[τ_L > M]}(T^k x) ≥ number of red k’s.
• The average of f on each blue segment of length τ_L(T^k x) is larger than A_L(T^k x) −
ε = A_L(x) − ε. So for each blue segment

∑_{k=0}^{N−1} (f + A_L·1_{[τ_L > M]})(T^k x) ≥ #(blues)·(A_L(x) − ε) + #(reds)·A_L(x)
≥ #(blues and reds)·(A_L(x) − ε) ≥ (N − M)(A_L(x) − ε).

∫ f dµ ≥ ∫_{[τ_L ≤ M]} A_L dµ − ML/N − ε.
is well-defined and finite for a.e. x. We now repeat the coloring argument of step 1,
with θ replacing τ_L and A̲ replacing A_L. Let f_M(x) := f(x) ∧ M; then as before

∑_{k=0}^{N−1} f_M(T^k x) 1_{[θ ≤ M]}(T^k x) ≤ ∑_{k blue} f(T^k x) + ∑_{k red} 0 + ∑_{no color} M
≤ #(blues)·(A̲(x) + ε) + M² ≤ N(A̲(x) + ε) + M².

Integrating and dividing by N, we obtain ∫ 1_{[θ ≤ M]} (f ∧ M) ≤ ∫ A̲ + ε + O(1/N). Passing
to the limits N → ∞ and then M → ∞ (by the monotone convergence theorem), we
get ∫ f ≤ ∫ A̲ + ε. Since ε was arbitrary, ∫ f ≤ ∫ A̲.
STEP 3. lim_{n→∞} (1/n) ∑_{k=0}^{n−1} f(T^k x) exists almost everywhere.

Together, steps 1 and 2 imply that ∫ (Ā − A̲) dµ ≤ 0. But Ā ≥ A̲, so necessarily Ā = A̲
µ–almost everywhere (if µ[Ā > A̲] > 0, the integral would be strictly positive). So

limsup_{n→∞} (1/n) ∑_{k=0}^{n−1} f(T^k x) = liminf_{n→∞} (1/n) ∑_{k=0}^{n−1} f(T^k x)  a.e.,
≤ ∥f − ϕ∥₁ + o(1) + lim_{N→∞} ∫ (1/N) ∑_{k=0}^{N−1} |(ϕ − f) ∘ T^k| dµ  (by Fatou’s Lemma)
≤ 2ε + o(1).
The almost sure limit in the pointwise ergodic theorem is clear when the map is
ergodic: (1/N) ∑_{k=0}^{N−1} f ∘ T^k −→ ∫ f dµ as N → ∞. In this section we ask what is the limit in the
non-ergodic case.
If f belongs to L2 , the limit is the projection of f on the space of invariant func-
tions, because of the Mean Ergodic Theorem and the fact that every sequence of
functions which converges in L2 has a subsequence which converges almost every-
where to the same limit.1 But if f ∈ L1 we cannot speak of projections. The right
notion in this case is that of the conditional expectation.
¹ Proof: Suppose f_n → f in L². Pick a subsequence n_k s.t. ∥f_{n_k} − f∥₂ < 2^{−k}. Then ∑_{k≥1} ∥f_{n_k} − f∥₂ <
∞. This means that ∥∑ |f_{n_k} − f|∥₂ < ∞, whence ∑(f_{n_k} − f) converges absolutely almost surely. It
follows that f_{n_k} − f → 0 a.e.
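The non-ergodic behaviour can be illustrated with a toy system: two invariant circles, each carrying an ergodic rotation. The Birkhoff average of f then converges to the average of f over the circle containing the starting point, i.e. to a genuinely non-constant invariant limit. A sketch (the system and names are our illustrative choices):

```python
import math

ALPHA = math.sqrt(2) - 1

def f(i, x):
    """An observable on X = {0,1} x [0,1); its fibre averages are 0 and 1."""
    return i + math.cos(2 * math.pi * x)

def birkhoff_avg(i, x, n):
    """Average of f along the orbit of T(i, x) = (i, x + alpha mod 1)."""
    total = 0.0
    for _ in range(n):
        total += f(i, x)
        x = (x + ALPHA) % 1.0
    return total / n

# The limit depends on the invariant set (the circle) of the starting point:
a0 = birkhoff_avg(0, 0.3, 200000)   # circle i = 0: fibre average 0
a1 = birkhoff_avg(1, 0.7, 200000)   # circle i = 1: fibre average 1
assert abs(a0 - 0.0) < 1e-3
assert abs(a1 - 1.0) < 1e-3
```

The two distinct limits are exactly the values of the conditional expectation of f on the invariant σ-algebra, foreshadowing the discussion below.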
Suppose g is not F –measurable. What is the ‘best guess’ for g(x) given the infor-
mation F ?
Had g been in L2 , then the “closest” F –measurable function (in the L2 –sense) is
the projection of g on L2 (X, F , µ). The defining property of the projection Pg of g
is ⟨Pg, h⟩ = ⟨g, h⟩ for all h ∈ L2 (X, F , µ). The following definition mimics this case
when g is not necessarily in L2 :
Definition 2.1. The conditional expectation of f ∈ L¹(X, B, µ) given F is the
unique element E(f|F) of L¹(X, F, µ) such that
1. E(f|F) is F–measurable;
2. ∀ϕ ∈ L^∞ F–measurable, ∫ ϕ E(f|F) dµ = ∫ ϕ f dµ.
lim_{N→∞} (1/N) ∑_{k=0}^{N−1} f ∘ T^k = E(f | Inv(T))  a.e. and in L¹,
Definition 2.2. The measures µx are called the conditional probabilities of F . Note
that they are only determined almost everywhere.
Proof. By the isomorphism theorem for standard spaces, there is no loss of general-
ity in assuming that X is compact. Indeed, we may take X to be a compact interval.
Recall that for compact metric spaces X, the space of continuous functions C(X)
with the maximum norm is separable.2
Fix a countable dense set {f_n}_{n=0}^{∞} in C(X) s.t. f_0 ≡ 1. Let A_Q be the algebra
generated by these functions over Q. It is still countable.
Choose for every g ∈ AQ an F –measurable version E(g|F ) of E(g|F ) (recall
that E(g|F ) is an L1 –function, namely not a function at all but an equivalence class
of functions). Consider the following collection of conditions:
1. ∀α, β ∈ Q, g1,2 ∈ AQ , E(αg1 + β g2 |F )(x) = αE(g1 |F )(x) + β E(g2 |F )(x)
2. ∀g ∈ AQ , min g ≤ E(g|F )(x) ≤ max g
2 Proof: Compact metric spaces are separable, because for each n they have finite covers by balls of radius
1/n. Let {x_n} be a countable dense set of points; then ϕ_n(·) := dist(x_n, ·) is a countable family
of continuous functions, which separates points in X. The algebra which is generated over Q by
{ϕ_n} ∪ {1_X} is countable. By the Stone–Weierstrass Theorem, it is dense in C(X).
∫ µ_x(E) dµ(x) ≤ ∫ lim_{ε→0⁺} ∫ h_n^ε dµ_x dµ ≤ lim_{ε→0⁺} ∫∫ h_n^ε dµ_x dµ = lim_{ε→0⁺} ∫ h_n^ε dµ ≤ µ(U_n) −→ 0 as n → ∞.
Then
2.3 The non-ergodic case 43
E(f|F) = ∑_{n=1}^{∞} E(g_n|F),  because E(·|F) is a bounded operator on L¹
= ∑_{n=1}^{∞} ∫_X g_n dµ_x  a.e., because g_n ∈ C(X)
= ∫_X ∑_{n=1}^{∞} g_n dµ_x  a.e. (justification below)
= ∫_X f dµ_x  a.e.

Here is the justification: ∫ ∑ |g_n| dµ_x < ∞, because the integral of this expression, by
the monotone convergence theorem, is less than ∑ ∥g_n∥₁ < ∞. ⊓⊔
∫ f_n dµ_x = E_µ(f_n | Inv(T))(x) = lim_{N→∞} (1/N) ∑_{k=0}^{N−1} f_n(T^k x)  for all n.

∫ f_n ∘ T dµ_x = lim_{N→∞} (1/N) ∑_{k=0}^{N−1} f_n(T^{k+1} x)  a.e. (by the PET)
= lim_{N→∞} (1/N) ∑_{k=0}^{N−1} f_n(T^k x) = E_µ(f_n | Inv(T))(x) = ∫ f_n dµ_x.
∫ f ∘ T dµ_x = ∫ f dµ_x for all µ_x–integrable functions f. This means that µ_x ∘ T^{−1} = µ_x for all x ∈ Ω′.
0 = lim_{N→∞} ∥ (1/N) ∑_{k=0}^{N−1} f_n ∘ T^k − ∫_X f_n dµ_x ∥_{L¹(µ)}  (∵ L¹-convergence in the PET)
= lim_{N→∞} ∫_X ∥ (1/N) ∑_{k=0}^{N−1} f_n ∘ T^k − ∫_X f_n dµ_x ∥_{L¹(µ_x)} dµ(x)  (∵ µ = ∫_X µ_x dµ)
≥ ∫_X liminf_{N→∞} ∥ (1/N) ∑_{k=0}^{N−1} f_n ∘ T^k − ∫_X f_n dµ_x ∥_{L¹(µ_x)} dµ(x)  (∵ Fatou’s Lemma)
It follows that liminf_{n→∞} ∥ (1/n) ∑_{k=0}^{n−1} f_n ∘ T^k − ∫_X f_n dµ_x ∥_{L¹(µ_x)} = 0 µ–a.e. Let

Ω := { x ∈ Ω″ : liminf_{n→∞} ∥ (1/n) ∑_{k=0}^{n−1} f_n ∘ T^k − ∫_X f_n dµ_x ∥_{L¹(µ_x)} = 0 for all n },

liminf_{n→∞} ∥ (1/n) ∑_{k=0}^{n−1} f ∘ T^k − ∫_X f dµ_x ∥_{L¹(µ_x)} = 0  for all f ∈ L¹(µ_x).
where I_r is a sequence of subsets of Z^d_+ which “tends to Z^d_+”, and |I_r| = cardinality
of I_r. Such statements are not true for arbitrary choices of {I_r} (even when d = 1). Here
we prove the following pointwise ergodic theorem for increasing boxes. Boxes are
sets of the form
A sequence of boxes {I_r}_{r≥1} is said to be increasing if I_r ⊂ I_{r+1} for all r. An increasing sequence of boxes is said to tend to Z^d_+ if Z^d_+ = ⋃_{r≥1} I_r.
(1/|I_r|) ∑_{n∈I_r} f ∘ Tⁿ −→ E(f | Inv(T_1) ∩ ··· ∩ Inv(T_d))  almost surely, as r → ∞.
For convergence along more general sequences of boxes, see problem 2.8.
Proof. Fix f ∈ L¹. Almost sure convergence is obvious in the following two cases:
1. Invariant functions: If f ∘ T_i = f for i = 1, …, d, then (1/|I_r|) S_{I_r} f = f for all r, so the
limit exists and is equal to f.
2. Coboundaries: Suppose f = g_i − g_i ∘ T_i for some g_i ∈ L^∞ and some i. Then

(1/|I_r|) S_{I_r} f = (1/|I_r|) [ S_{I_r} g_i − S_{I_r + e_i} g_i ],  where e_1, …, e_d is the standard basis of R^d,
= (1/|I_r|) [ S_{I_r∖(I_r+e_i)} g_i − S_{(I_r+e_i)∖I_r} g_i ],  so |(1/|I_r|) S_{I_r} f| ≤ (|I_r △ (I_r+e_i)| / |I_r|) ∥g_i∥_∞.

Now |I_r △ (I_r + e_i)|/|I_r| −→ 0 as r → ∞, because the lengths ℓ_1(r), …, ℓ_d(r) of the sides
of the box I_r tend to infinity, and so

|I_r △ (I_r + e_i)| / |I_r| = 2/ℓ_i(r) −→ 0 as r → ∞.
µ( sup_r (1/|I_r|) S_{I_r} ϕ > t ) ≤ 2^d ∥ϕ∥₁ / t.   (2.2)
ϕ remain small in L¹ norm. It follows that (1/|I_r|) S_{I_r} f converges in L¹. The limit must
agree with the pointwise limit h (see the footnote on page 39).
Integrate (1/|I_r|) S_{I_r} f against a bounded invariant function g and pass to the limit. By
L¹–convergence, ∫ f g = ∫ hg. It follows that h = E(f | ⋂_{i=1}^{d} Inv(T_i)). In particular,
if the Z^d_+–action generated by T_1, …, T_d is ergodic, then h = ∫ f dµ. This finishes the
proof, assuming the maximal inequality.
Proof of the maximal inequality. Let ϕ ∈ L¹ be a non-negative function, fix N and
α > 0. We will estimate the measure of E_N(α) := {ω ∈ Ω : max_{1≤k≤N} (1/|I_k|) S_{I_k} ϕ > α}.
Pick a large box I, and let A(ω) := {n ∈ I : Tⁿ(ω) ∈ E_N(α)}. For every n ∈ A(ω)
there is a 1 ≤ k(n) ≤ N such that (S_{I_{k(n)}+n} ϕ)(ω) > α|I_{k(n)}|.
Imagine that we were able to find a disjoint subcollection {n + Ik(n) : n ∈ A′ (ω)}
which is “large” in the sense that there is some global constant K s.t.
∑_{n∈A′(ω)} |n + I_{k(n)}| ≥ (1/K) |A(ω)|.   (2.3)
The sets n + I_{k(n)} (n ∈ A′(ω)) are included in the box J ⊃ I obtained by increasing
the sides of I by 2 max{diam(I_1), …, diam(I_N)}. This, the non-negativity of ϕ, and
the invariance of µ imply that

∥ϕ∥₁ ≥ (1/|J|) ∫ (S_J ϕ)(ω) dµ ≥ (1/|J|) ∫ ∑_{n∈A′(ω)} (S_{I_{k(n)}+n} ϕ)(ω) dµ
≥ (1/|J|) ∫ ∑_{n∈A′(ω)} α |I_{k(n)} + n| dµ
≥ (α/(K|J|)) ∫ |A(ω)| dµ = (α/(K|J|)) ∫ ∑_{n∈I} 1_{E_N(α)}(Tⁿω) dµ = (α|I|/(K|J|)) µ[E_N(α)].
·········
(1) Let M_1 be a maximal disjoint collection of sets of the form I_1 + n where k(n) = 1
and such that all elements of M_1 are disjoint from ⋃(M_N ∪ ··· ∪ M_2).
• In the second case there are u ∈ I_{k(n)}, v ∈ I_{k(m)} s.t. n + u = m + v, and again we get
n ∈ m + I_{k(m)} − I_{k(n)} ⊆ m + I_{k(m)} − I_{k(m)} (since k(m) ≥ k(n) and I_r is increasing).

For a d–dimensional box I, |I − I| = 2^d |I|. Since A(ω) ⊆ ⋃_{n∈A′(ω)} (n + I_{k(n)} − I_{k(n)}),
|A(ω)| ≤ ∑_{n∈A′(ω)} 2^d |I_{k(n)}|, and we get (2.3) with K = 2^d. ⊓⊔
describes the position of a random walk on G, which starts at the identity, and whose
steps have the distribution Pr[step = gi ] = pi . What can be said on the behavior of
this random walk?
In the special case G = Z^d or G = R^d, f_n(x) = f(x) + f(Tx) + ··· + f(T^{n−1}x),
and the ergodic theorem says that (1/n) f_n(x) has an almost sure limit, equal to ∫ f dµ =
∑ p_i g_i. So: the random walk has speed ∥∑ p_i g_i∥, and direction ∑ p_i g_i / ∥∑ p_i g_i∥.
(Note that if G = Z^d, the direction need not lie in G.)
Example 2 (The derivative cocycle) Suppose T : V → V is a diffeomorphism act-
ing on an open set V ⊂ Rd . The derivative of T at x ∈ V is a linear transformation
(dT )(x) on Rd , v 7→ [(dT )(x)]v. By the chain rule,
(dT n )(x) = (dT )(T n−1 x) ◦ (dT )(T n−2 x) ◦ · · · ◦ (dT )(x).
What is the “speed” of this random walk? Does it have an asymptotic “direction”?
The problem of describing the “direction” of random walk on a group is deep, and
remain somewhat mysterious to this day, even in the case of groups of matrices. We
postpone it for the moment, and focus on the conceptually easier task of defining
the “speed.” Suppose G is a group of d × d matrices with real entries. Then G can
be viewed as a group of linear operators on R^d, and we can endow A ∈ G with the
operator norm ∥A∥ := max{∥Av∥₂/∥v∥₂ : 0 ≠ v ∈ R^d}. Notice that ∥AB∥ ≤ ∥A∥∥B∥.
We will measure the speed of f_n(x) := f(x) f(Tx) ··· f(T^{n−1}x) by analyzing
g^{(n)}(x) := log ∥f_n(x)∥. This quantity is sub-additive:

g^{(n+m)}(x) = log ∥f_{n+m}(x)∥ = log ∥f_n(x) f_m(Tⁿx)∥ ≤ log(∥f_n(x)∥ · ∥f_m(Tⁿx)∥)
= log ∥f_n(x)∥ + log ∥f_m(Tⁿx)∥ = g^{(n)}(x) + g^{(m)}(Tⁿx).
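The sub-additivity inequality can be checked numerically for a concrete matrix cocycle over the shift, with f(x) depending on the first symbol x₀ only. The matrices and the driving sequence below are arbitrary illustrative choices, not from the text:

```python
import math

M0 = [[2.0, 1.0], [1.0, 1.0]]
M1 = [[1.0, 0.0], [1.5, 0.5]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def op_norm(A):
    """Largest singular value of a 2x2 matrix, via the eigenvalues of A^T A."""
    G = mat_mul([[A[0][0], A[1][0]], [A[0][1], A[1][1]]], A)
    p, q, r = G[0][0], G[0][1], G[1][1]
    lam = 0.5 * (p + r) + math.sqrt((0.5 * (p - r)) ** 2 + q ** 2)
    return math.sqrt(lam)

def f_n(x, n):
    """f_n(x) = f(x) f(Tx) ... f(T^{n-1}x), with f(x) = M_{x_0} and T the shift."""
    P = [[1.0, 0.0], [0.0, 1.0]]
    for k in range(n):
        P = mat_mul(P, M0 if x[k] == 0 else M1)
    return P

x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1]
g = lambda seq, n: math.log(op_norm(f_n(seq, n)))

# g^{(n+m)}(x) <= g^{(n)}(x) + g^{(m)}(T^n x), since the shift acts by x -> x[1:]:
for n in range(1, 6):
    for m in range(1, 6):
        assert g(x, n + m) <= g(x, n) + g(x[n:], m) + 1e-9
```

The inequality is exactly ∥AB∥ ≤ ∥A∥∥B∥ in logarithmic form; the small tolerance absorbs floating-point rounding.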
Proof. We begin by observing that it is enough to treat the case when g(n) are all
non-positive. This is because h(n) := g(n) − (g(1) + g(1) ◦ T + · · · + g(1) ◦ T n−1 ) are
non–positive, sub-additive, and differ from g(n) by the ergodic sums of g(1) whose
asymptotic behavior we know by Birkhoff’s ergodic theorem.
Assume then that g^{(n)} are all non-positive. Define G(x) := liminf_{n→∞} g^{(n)}(x)/n (the
liminf may be equal to −∞). We claim that G ∘ T = G almost surely.
Starting from the subadditivity inequality g(n+1) ≤ g(n) ◦ T + g(1) , we see that
G ≤ G ◦ T . Suppose there were a set of positive measure E where G ◦ T > G + ε.
Then for every x ∈ E, G(T n x) ≥ G(T n−1 x) ≥ · · · ≥ G(T x) > G(x) + ε. But this is
impossible, because by Poincaré’s Recurrence Theorem, for a.e. x there is some
n > 0 such that G(T n x) = ∞ or |G(T n x) − G(x)| < ε (prove!). This contradiction
shows that G = G ∘ T almost surely. Henceforth we work on the set of full measure

X₀ := ⋂_{n≥1} [G ∘ Tⁿ = G].
• “bad”, if it is not good: g^{(ℓ)}(T^k x)/ℓ > G_M(x) + ε for all ℓ = 1, …, N.
Color the integers 0, …, n − 1 inductively as follows, starting from k = 0. Let k be
the smallest non-colored integer,
(a) If k ≤ n − N and k is “bad”, color it red;
(b) If k ≤ n − N and k is “good”, find the smallest 1 ≤ ℓ ≤ N s.t. g(ℓ) (T k x)/ℓ ≤
GM (T k x) + ε and color the segment [k, k + ℓ) blue;
(c) If k > n − N, color k white.
Repeat this procedure until all integers 0, . . . , n − 1 are colored.
The “blue” part can be decomposed into segments [τ_i, τ_i + ℓ_i), with ℓ_i s.t.
g^{(ℓ_i)}(T^{τ_i} x)/ℓ_i ≤ G_M(x) + ε. Let b denote the number of these segments.
The “red” part has size ≤ ∑_{k=1}^{n} 1_{B(N,M,ε)}(T^k x), where
Let r denote the size of the red part. The “white” part has size w ≤ N.
By the sub-additivity condition,

limsup_{n→∞} g^{(n)}(x)/n ≤ (G_M(x) + ε)(1 − E(1_{B(N,M,ε)} | Inv))  almost surely.

limsup_{n→∞} g^{(n)}(x)/n ≤ G_M(x) + ε  almost surely.
Since ε was arbitrary, limsup_{n→∞} g^{(n)}/n ≤ G_M almost surely, which proves almost sure
convergence by the discussion above. ⊓⊔
4 Suppose 0 ≤ fn ≤ 1 and fn ↓ 0. The conditional expectation is monotone, so E( fn |F ) is decreas-
ing at almost every point. Let ϕ be its almost sure limit, then 0 ≤ ϕ ≤ 1 a.s., and by the BCT,
E(ϕ) = E(lim E( fn |F )) = lim E(E( fn |F )) = lim E( fn ) = E(lim fn ) = 0, whence ϕ = 0 almost
everywhere.
2.5 The Subadditive Ergodic Theorem 51
Proposition 2.3. Suppose m is ergodic, and g^{(n)} ∈ L¹ for all n. Then the limit in
Kingman’s ergodic theorem is the constant inf_n [(1/n) ∫ g^{(n)} dm] ∈ [−∞, ∞).
Proof. Let G := lim g^{(n)}/n. Subadditivity implies that G ≤ G ∘ T. Recurrence implies that G ∘ T = G. Ergodicity implies that G = c a.e., for some constant c = c(g) ≥
−∞. We claim that c ≤ inf_n [(1/n) ∫ g^{(n)} dm]. This is because
c = lim_{k→∞} (1/(kn)) g^{(kn)} ≤ lim_{k→∞} (1/k) [ g^{(n)}/n + (g^{(n)}/n) ∘ Tⁿ + ··· + (g^{(n)}/n) ∘ T^{n(k−1)} ]
= (1/n) ∫ g^{(n)} dm   (Birkhoff’s ergodic theorem),
To prove the other inequality we first note (as in the proof of Kingman’s sub-
additive theorem) that it is enough to treat the case when g(n) are all non-positive.
Otherwise work with h(n) := g(n) − (g(1) + · · · + g(1) ◦ T n−1 ). Since g(1) ∈ L1 ,
(1/n)(g^{(1)} + ··· + g^{(1)} ∘ T^{n−1}) −→ ∫ g^{(1)} dm  pointwise and in L¹, as n → ∞.

Thus c(g) = lim g^{(n)}/n = c(h) + ∫ g^{(1)} dm = inf_n (1/n)[ ∫ h^{(n)} dm + ∫ S_n g^{(1)} dm ] = inf_n [(1/n) ∫ g^{(n)} dm].
Suppose then that g^{(n)} are all non-positive. Fix N, and set g_N^{(n)} := max{g^{(n)}, −nN}.
This is, again, subadditive because

g_N^{(n+m)} = max{g^{(n+m)}, −(n+m)N} ≤ max{g^{(n)} + g^{(m)} ∘ Tⁿ, −(n+m)N}
≤ max{g_N^{(n)} + g_N^{(m)} ∘ Tⁿ, −(n+m)N} ≡ g_N^{(n)} + g_N^{(m)} ∘ Tⁿ.
By Kingman’s theorem, g_N^{(n)}/n converges pointwise to a constant c(g_N). By definition, −N ≤ g_N^{(n)}/n ≤ 0, so by the bounded convergence theorem,

c(g_N) = lim_{n→∞} (1/n) ∫ g_N^{(n)} dm ≥ inf_n (1/n) ∫ g_N^{(n)} dm ≥ inf_n (1/n) ∫ g^{(n)} dm.   (2.4)
Case 1: c(g) = −∞. In this case g^{(n)}/n → −∞, and for every N there exists N(x) s.t.
n > N(x) ⇒ g_N^{(n)}(x)/n = −N. Thus c(g_N) = −N, and (2.4) gives inf_n [(1/n) ∫ g^{(n)} dm] ≤ −N for every N, whence inf_n [(1/n) ∫ g^{(n)} dm] =
−∞ = c(g).
Case 2: c(g) is finite. Take N > |c(g)| + 1. Then for a.e. x, if n is large enough, then
g^{(n)}/n > c(g) − ε > −N, whence g_N^{(n)} = g^{(n)}. Thus c(g) = c(g_N) ≥ inf_n (1/n) ∫ g^{(n)} dm.
The following immediate consequence will be used in the proof of the Oseledets
theorem for invertible cocycles:
Remark: Suppose (X, B, m, T ) is invertible, and let g(n) be a subadditive cocycle
s.t. g(1) ∈ L1 . Then for a.e. x, lim g(n) ◦ T −n /n exists and equals lim g(n) /n.
n→∞ n→∞
lim_{n→∞} g^{(n)} ∘ T^{−n} / n = inf_n (1/n) ∫ g^{(n)} ∘ T^{−n} dm_y = inf_n (1/n) ∫ g^{(n)} dm_y   m_y-a.e.
= lim_{n→∞} g^{(n)}/n   m_y-a.e.
Thus the set where the statement of the remark fails has zero measure with respect
to all the ergodic components of m, and this means that the statement is satisfied on
a set of full m–measure. □
Multilinear forms. Let V = Rn equipped with the Euclidean inner product ⟨v, w⟩ =
∑ v_k w_k. A linear functional on V is a linear map ω : V → R. The set of linear functionals is denoted by V*. Any v ∈ V determines v* ∈ V* via v* := ⟨v, ·⟩, and every linear
functional is of this form.
A k–multilinear function is a function T : V k → R such that for all i and
v1 , . . . , vi−1 , vi+1 , . . . , vk ∈ V , T (v1 , . . . , vi−1 , · , vi+1 , . . . , vk ) is a linear functional.
The set of all k–multilinear functions on V is denoted by T^k(V). The tensor
product of ω ∈ T^k(V) and η ∈ T^ℓ(V) is ω ⊗ η ∈ T^{k+ℓ}(V) given by

(ω ⊗ η)(v_1, …, v_{k+ℓ}) := ω(v_1, …, v_k) η(v_{k+1}, …, v_{k+ℓ}).

A k–multilinear function ω is called alternating if exchanging two of its arguments reverses its sign:

ω(v_1, …, v_i, …, v_j, …, v_n) = −ω(v_1, …, v_j, …, v_i, …, v_n)

(to see that this is equivalent to ω vanishing whenever two arguments coincide, expand ω(v_1, …, v_i + v_j, …, v_j + v_i, …, v_n)). The set of all
k–alternating forms is denoted by Ω^k(V).
Any multilinear form ω gives rise to an alternating form Alt(ω) via

Alt(ω) := (1/k!) ∑_{σ∈S_k} sgn(σ) σ·ω,
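The operator Alt can be implemented directly on the coefficient tensor of a form. The sketch below (our representation choice: a dict over basis index tuples, exact arithmetic over Q) checks that Alt produces alternating forms and fixes the forms that are already alternating:

```python
from fractions import Fraction
from itertools import permutations

def sign(perm):
    """Sign of a permutation given as a tuple of images of 0..k-1."""
    s, p = 1, list(perm)
    for i in range(len(p)):
        while p[i] != i:
            j = p[i]
            p[i], p[j] = p[j], p[i]
            s = -s
    return s

def alt(T, k):
    """Alt(T)(v_1..v_k) = (1/k!) sum over sigma of sgn(sigma) T(v_sigma(1)..v_sigma(k))."""
    fact = 1
    for i in range(2, k + 1):
        fact *= i
    return {idx: sum(sign(s) * T[tuple(idx[i] for i in s)]
                     for s in permutations(range(k))) / Fraction(fact)
            for idx in T}

# A generic (non-alternating) 2-form on R^2, given by its values on basis pairs:
T2 = {(i, j): Fraction(3 * i + j + 1) for i in range(2) for j in range(2)}
A = alt(T2, 2)
assert A[(0, 1)] == -A[(1, 0)]          # Alt(T) is alternating
assert alt(A, 2) == A                   # Alt is the identity on alternating forms
```

The second assertion is the standard projection property of Alt, which the text uses implicitly when computing Alt of wedge products.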
Alt[(Alt(ω1 ⊗ ω2 ) − ω1 ⊗ ω2 ) ⊗ ω3 ] = 0,
ω ∧ η := ((k+ℓ)!/(k! ℓ!)) Alt(ω ⊗ η).
The wedge product is bilinear, and the previous lemma shows that it is associative.
It is almost anti commutative: If ω ∈ Ω k (V ), η ∈ Ω ℓ (V ), then
ω ∧ η = (−1)kℓ η ∧ ω.
and we conclude that Alt(ξ) = 0. If, on the other hand, i_1, …, i_k are all different,
then it is easy to see using lemma 2.1 that Alt(e*_{i_1} ⊗ e*_{i_2} ⊗ ··· ⊗ e*_{i_k}) = (1/k!) e*_{i_1} ∧ ··· ∧ e*_{i_k}.
Thus ω = (1/k!) ∑ a_{i_1,…,i_k} e*_{i_1} ∧ ··· ∧ e*_{i_k}, and we have proved that the set of forms in
the statement spans Ω^k(V).
To see that this set is independent, we show that it is orthonormal. Suppose {i_1, …, i_k} ≠ {j_1, …, j_k}; then the sets {σ · e∗_{i_1} ⊗ ··· ⊗ e∗_{i_k}}, {σ · e∗_{j_1} ⊗ ··· ⊗ e∗_{j_k}} are disjoint, so Alt(e∗_{i_1} ⊗ ··· ⊗ e∗_{i_k}) ⊥ Alt(e∗_{j_1} ⊗ ··· ⊗ e∗_{j_k}). This proves orthogonality. Orthonormality is because
∥e∗_{i_1} ∧ ··· ∧ e∗_{i_k}∥² = ∥k! Alt(e∗_{i_1} ⊗ ··· ⊗ e∗_{i_k})∥² = ∥∑_{σ∈S_k} sgn(σ) σ · (e∗_{i_1} ⊗ ··· ⊗ e∗_{i_k})∥² = ∑_{σ∈S_k} sgn(σ)² (1/√k!)² = 1.
(This explains why we chose to define ∥e∗_{i_1} ⊗ ··· ⊗ e∗_{i_k}∥ := 1/√k!.) □
Corollary 2.2. e∗1 ∧ · · · ∧ e∗n is the determinant. This is the reason for the peculiar
normalization in the definition of ∧.
Proof. For I = (i_1, …, i_k) such that 1 ≤ i_1 < ··· < i_k ≤ n, write e∗_I := e∗_{i_1} ∧ ··· ∧ e∗_{i_k}. Represent ω := ∑ α_I e∗_I and η := ∑ β_J e∗_J; then
∥ω ∧ η∥² = ∥∑_{I,J} α_I β_J e∗_I ∧ e∗_J∥² = ∥∑_{I∩J=∅} ±α_I β_J e∗_{I∪J}∥² = ∑_{I∩J=∅} α_I² β_J² ≤ ∥ω∥² ∥η∥².
Take two multi-indices I, J. If I = J, then the inner product matrix is the identity matrix. If I ≠ J, then ∃α ∈ I \ J, and then the α–row and α–column of the inner product matrix will be zero. Thus the formula holds for any pair e∗_I, e∗_J. Since part (b) of the lemma holds for all basis vectors, it holds for all vectors. Part (c) immediately follows.
Next we prove part (d). Represent v_i = ∑_j α_{ij} u_j; then
v∗_1 ∧ ··· ∧ v∗_k = const. Alt(v∗_1 ⊗ ··· ⊗ v∗_k) = const. Alt( (∑_j α_{1j} u∗_j) ⊗ ··· ⊗ (∑_j α_{kj} u∗_j) ).
The terms where j_1, …, j_k are not all different are annihilated by Alt. The terms where j_1, …, j_k are all different are mapped by Alt to a form which is proportional to u∗_1 ∧ ··· ∧ u∗_k. Thus the result of the sum is proportional to u∗_1 ∧ ··· ∧ u∗_k. □
Thus ∥A^∧k ω∥² = ∑_I ω_I² ∏_{i∈I} λ_i² ≤ ∥ω∥² ∏_{i=1}^{k} λ_i². It follows that ∥A^∧k∥ ≤ λ_1···λ_k. To see that the inequality is in fact an equality, consider the case ω = v∗_I where I = {1, …, k}: ∥A^∧k ω∥² = (λ_1···λ_k)² = (λ_1···λ_k)² ∥ω∥². □
Exterior products and angles between vector spaces. The angle between vector spaces V, W ⊂ R^d is
∡(V, W) := min{∡(v, w) : v ∈ V, w ∈ W, ∥v∥ = ∥w∥ = 1}.
Proof. If V ∩W ̸= {0} then both sides are zero, so suppose V ∩W = {0}, and pick
an orthonormal basis e1 , . . . , en+k for V ⊕ W . Let w ∈ W, v ∈ V be unit vectors s.t.
∡(V,W ) = ∡(v, w), and write v = ∑ vi ei , w = ∑ w j e j , then
2.6 The Multiplicative Ergodic Theorem 57
∥v∗ ∧ w∗∥² = ∥∑_{i,j} v_i w_j e∗_i ∧ e∗_j∥² = ∥∑_{i<j} (v_i w_j − v_j w_i) e∗_i ∧ e∗_j∥² = ∑_{i<j} (v_i w_j − v_j w_i)²
= (1/2) ∑_{i,j} (v_i w_j − v_j w_i)²   (the terms where i = j vanish)
= (1/2) ∑_{i,j} [v_i² w_j² + v_j² w_i² − 2 v_i w_i · v_j w_j] = ∑_i v_i² ∑_j w_j² − (∑_i v_i w_i)²
= 1 − ⟨v, w⟩² = sin²∡(V, W).
exists a.e., and lim_{n→∞} (1/n) ln ∥A_n(x)Λ(x)^{−n}∥ = lim_{n→∞} (1/n) ln ∥(A_n(x)Λ(x)^{−n})^{−1}∥ = 0 a.s.
Proof. The matrix B_n(x) := √(A_n(x)^t A_n(x)) is symmetric, therefore it can be orthogonally diagonalized. Let λ_n^1(x) < ··· < λ_n^{s_n(x)}(x) be its different eigenvalues, and let R^d = W_n^{λ_n^1(x)} ⊕ ··· ⊕ W_n^{λ_n^{s_n(x)}(x)} be the orthogonal decomposition of R^d into the corresponding eigenspaces. The proof has the following structure:
Part 1: Let t_n^1(x) ≤ ··· ≤ t_n^d(x) be a list of the eigenvalues of B_n(x) := √(A_n(x)^t A_n(x)) with multiplicities; then for a.e. x, there is a limit t_i(x) = lim_{n→∞} [t_n^i(x)]^{1/n}, i = 1, …, d.
Part 2: Let λ_1(x) < ··· < λ_{s(x)}(x) be a list of the different values of {t_i(x)}_{i=1}^d. Divide {t_n^i(x)}_{i=1}^d into s(x) subsets of values {t_n^i(x) : i ∈ I_n^j}, (1 ≤ j ≤ s(x)), in such a way that t_n^i(x)^{1/n} → λ_j(x) for all i ∈ I_n^j. Let U_n^j(x) := ∑_{i∈I_n^j} [eigenspace of t_n^i(x)]. We show that the spaces U_n^j(x) converge as n → ∞ to some limiting spaces U^j(x) (in the sense that the orthogonal projections on U_n^j(x) converge to the orthogonal projection on U^j(x)).
Part 3: The theorem holds with Λ (x) : Rd → Rd given by v 7→ λi (x)v on U i (x).
Part 1 is proved by applying the subadditive ergodic theorem to a cleverly chosen subadditive cocycle (“Raghunathan's trick”). Parts 2 and 3 are (non-trivial) linear algebra.
Part 1: Set g_i^(n)(x) := ∑_{j=d−i+1}^{d} ln t_n^j(x). This quantity is finite, because A_n^t A_n is invertible, so none of its eigenvalues vanish.
The sequence g_i^(n) is subadditive! This is because the theory of exterior products says that exp g_i^(n) = product of the i largest e.v.'s of √(A_n(x)^t A_n(x)) = ∥A_n(x)^∧i∥, so
exp g_i^(n+m)(x) = ∥A_{n+m}(x)^∧i∥ = ∥A_m(T^n x)^∧i A_n(x)^∧i∥ ≤ ∥A_m(T^n x)^∧i∥ ∥A_n(x)^∧i∥ = exp[g_i^(m)(T^n x) + g_i^(n)(x)],
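The subadditivity above amounts to the matrix inequality ∥(BA)^∧i∥ ≤ ∥B^∧i∥∥A^∧i∥, i.e. the product of the i largest singular values is submultiplicative. A small numerical illustration (ours, not from the notes):

```python
import numpy as np

def g(A, i):
    """g_i = log of the product of the i largest singular values of A,
    i.e. log ||A^{wedge i}||."""
    s = np.linalg.svd(A, compute_uv=False)   # decreasing order
    return float(np.sum(np.log(s[:i])))

rng = np.random.default_rng(2)
A, B = rng.normal(size=(2, 4, 4))
# subadditivity gap g(B) + g(A) - g(BA) should be >= 0 for every i
gaps = [g(B, i) + g(A, i) - g(B @ A, i) for i in range(1, 5)]
```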
|ln t_n^i(x)| ≤ (1/2) max{|ln ∥A_n^t A_n∥|, |ln ∥(A_n^t A_n)^{−1}∥|}
≤ max{|ln ∥A_n(x)∥|, |ln ∥A_n(x)^{−1}∥|}
≤ ∑_{k=0}^{n−1} ( |ln ∥A(T^k x)∥| + |ln ∥A(T^k x)^{−1}∥| )
∴ |g_i^(n)(x)| ≤ i ∑_{k=0}^{n−1} ( |ln ∥A(T^k x)∥| + |ln ∥A(T^k x)^{−1}∥| ).   (2.5)
So ∥g_i^(n)∥_1 ≤ ni( ∥ ln∥A∥ ∥_1 + ∥ ln∥A^{−1}∥ ∥_1 ) < ∞.
Thus Kingman's ergodic theorem says that lim (1/n) g_i^(n)(x) exists almost surely, and belongs to [−∞, ∞). In fact the limit is finite almost everywhere, because of (2.5) and the pointwise ergodic theorem.
⁵ Proof: Let v be an eigenvector of λ with norm one; then |λ| = ∥Bv∥ ≤ ∥B∥ and 1 = ∥B^{−1}Bv∥ ≤ ∥B^{−1}∥∥Bv∥ = ∥B^{−1}∥|λ|.
Thus [t_n^i(x)]^{1/n} → t_i(x) as n → ∞, almost surely, for some t_i(x) ∈ R.
Part 2: Fix x s.t. [t_n^i(x)]^{1/n} → t_i(x) for all 1 ≤ i ≤ d. Henceforth we work with this x only, and write for simplicity A_n = A_n(x), t_i = t_i(x), etc.
Let s = s(x) be the number of the different t_i. List the different values of these quantities in increasing order: λ_1 < λ_2 < ··· < λ_s. Set χ_j := log λ_j. Fix 0 < δ < (1/2) min{χ_{j+1} − χ_j}. Since for all i there is a j s.t. (t_n^i)^{1/n} → λ_j, the index sets I_n^j eventually stabilize and are independent of n; call them I^j. Define:
• U_n^j := ∑_{i∈I^j} [eigenspace of t_n^i(x)] (this sum is not necessarily direct);
• V_n^r := ⊕_{j≤r} U_n^j;
• Ṽ_n^r := ⊕_{j≥r} U_n^j.
The linear spaces U_n^1, …, U_n^s are orthogonal, since they are eigenspaces of different eigenvalues for a symmetric matrix (√(A_n^t A_n)). We show that they converge as n → ∞, in the sense that their orthogonal projections converge.
The proof is based on the following technical lemma. Denote the projection of a vector v on a subspace W by v|W, and write χ_i := log λ_i.
Technical lemma: For every δ > 0 there exist constants K_1, …, K_s > 1 and N s.t. for all n > N, t = 1, …, s, k ∈ N, and u ∈ V_n^r,
∥u|Ṽ_{n+k}^{r+t}∥ ≤ K_t ∥u∥ exp(−n(χ_{r+t} − χ_r − δt)).
We give the proof later. First we show how it can be used to finish parts 2 and 3.
We show that Vnr converge as n → ∞. Since the projection on Uni is the projection
on Vni minus the projection on Vni−1 , it will then follow that the projections of Uni
converge.
Fix N large. We need it to be so large that
1. I j are independent of n for all n > N;
2. The technical lemma works for n > N with δ as above.
There will be other requirements below.
Fix an orthonormal basis (v_n^1, …, v_n^{d_r}) for V_n^r (d_r = dim(V_n^r) = ∑_{j≤r} |I^j|). Write w_{n+1}^i := v_n^i|V_{n+1}^r; one obtains
∥v_n^i − w_{n+1}^i∥ ≤ C_1 θ^n,
because {v_n^i} is an orthonormal system. It follows that for all n large enough, the w_{n+1}^i are linearly independent. A quick way to see this is to note that
∥(w_{n+1}^1)∗ ∧ ··· ∧ (w_{n+1}^{d_r})∗∥² = det(⟨w_{n+1}^i, w_{n+1}^j⟩)   (lemma 2.2)
= det(I + O(θ^n)) ≠ 0, provided n is large enough,
because for i ≠ j, |⟨w^i, w^j⟩| = |⟨w^i − v_n^i, w^j⟩ + ⟨v_n^i, w^j − v_n^j⟩ + ⟨v_n^i, v_n^j⟩| ≤ 2C_1 θ^n. Call the term in the brackets K, and assume n is so large that Kθ^n < 1/2; then |∥u^i∥ − 1| ≤ ∥u^i − w^i∥ ≤ Kθ^n, whence
Thus V_{n+k}^r → V^r.
Part 3: We saw that proj_{U_n^i(x)} → proj_{U^i(x)} as n → ∞, for some linear spaces U^i(x). Set Λ(x) ∈ GL(R^d) to be the matrix representing
Λ(x) = ∑_{j=1}^{s(x)} e^{χ_j(x)} proj_{U^j(x)}.
Since the U^i(x) are limits of the U_n^i, they are orthogonal, and they sum up to R^d. It follows that Λ is invertible, symmetric, and positive.
Choose an orthonormal basis {v_n^1(x), …, v_n^d(x)} of eigenvectors of B_n(x) := √(A_n(x)^t A_n(x)) so that B_n v_n^i = t_n^i v_n^i for all i, and let W_n^i := span{v_n^i}. Then for all v ∈ R^d,
(A_n^t A_n)^{1/2n} v = (√(A_n^t A_n))^{1/n} v = ∑_{i=1}^{d} t_n^i(x)^{1/n} proj_{W_n^i}(v)
= ∑_{j=1}^{s} ∑_{i∈I^j} t_n^i(x)^{1/n} proj_{W_n^i}(v)
= ∑_{j=1}^{s} e^{χ_j(x)} ∑_{i∈I^j} proj_{W_n^i}(v) + o(∥v∥),
lim_{n→∞} (1/n) log ∥A_n v∥ = χ_r := log λ_r uniformly on the unit ball in U^r.   (2.7)
To see that this is enough, note that Λv = ∑_{r=1}^{s} e^{χ_r}(v|U^r); for all δ > 0, if n is large enough, then for every v,
∥A_n Λ^{−n} v∥ ≤ ∑_{r=1}^{s} e^{−nχ_r} ∥A_n(v|U^r)∥ ≤ ∑_{r=1}^{s} e^{−nχ_r} e^{n(χ_r+δ)} ∥v∥ ≤ s e^{nδ} ∥v∥   (v ∈ R^d)
∥A_n Λ^{−n} v∥ = e^{−nχ_r} ∥A_n v∥ = e^{±nδ} ∥v∥   (v ∈ U^r)
To see this note that V_{n+k}^{r−t} = (Ṽ_{n+k}^{s−r+t+1})^# and U_n^r ⊂ (V_n^{s−r+1})^#, and apply the technical lemma to the cocycle generated by A^#.
We prove (2.7). Fix δ > 0 and N large (we will see how large later), and assume n > N. Suppose v ∈ U^r and ∥v∥ = 1. Write v = lim v_{n+k} with v_{n+k} := v|U_{n+k}^r ∈ U_{n+k}^r. Note that ∥v_{n+k}∥ ≤ 1. We decompose v_{n+k} as follows:
v_{n+k} = v_{n+k}|V_n^{r−1} + (v_{n+k}|U_n^r) + ∑_{t=1}^{s−r} v_{n+k}|U_n^{r+t},
and estimate the size of the image of each of the summands under An .
First summand:
Note the cancellation of χ_{r+t}: this is the essence of the technical lemma. We get: ∥A_n(v_{n+k}|U_n^{r+t})∥ = O(exp[n(χ_r + o(1))]). Summing over t = 1, …, s−r, we get that the third summand is O(exp[n(χ_r + o(1))]).
Putting these estimates together, we get that
∥An vn+k ∥ ≤ const. exp[n(χr + o(1))] uniformly in k, and on the unit ball in Ur .
“Uniformity” means that the o(1) can be made independent of v and k. It allows us
to pass to the limit as k → ∞ and obtain
Thus ∥An vn+k ∥ ≥ [1 + o(1)] exp[n(χr + o(1))] uniformly in v, k. Passing to the limit
as k → ∞, we get ∥An v∥ ≥ const. exp[n(χr + o(1))] uniformly on the unit ball in Ur .
These estimates imply (2.7).
Proof of the technical lemma: We are asked to estimate the norm of the projection of a vector in V_n^r on Ṽ_{n+k}^{r+t}. We do this in three steps:
1. V_n^r → Ṽ_{n+1}^{r+t}, all t > 0;
2. V_n^r → Ṽ_{n+k}^{r+1}, all k > 0;
3. V_n^r → Ṽ_{n+k}^{r+t}, all t, k > 0.
Step 1. The technical lemma for k = 1: Fix δ > 0; then for all n large enough and for all r′ > r, if u ∈ V_n^r, then ∥u|Ṽ_{n+1}^{r′}∥ ≤ ∥u∥ exp(−n(χ_{r′} − χ_r − δ)).
Proof. Fix ε, and choose N = N(ε) so large that t_n^i = e^{±nε} t_i for all n > N, i = 1, …, d. For every t = 1, …, s, if u ∈ V_n^r, then
∥A_{n+1} u∥ = √⟨A_{n+1}^t A_{n+1} u, u⟩
= √( ⟨A_{n+1}^t A_{n+1}(u|Ṽ_{n+1}^{r+t}), (u|Ṽ_{n+1}^{r+t})⟩ + ⟨A_{n+1}^t A_{n+1}(u|V_{n+1}^{r+t−1}), (u|V_{n+1}^{r+t−1})⟩ )
(because V_{n+1}^{r+t−1}, Ṽ_{n+1}^{r+t} are orthogonal, A_{n+1}^t A_{n+1}–invariant, and R^d = V_{n+1}^{r+t−1} ⊕ Ṽ_{n+1}^{r+t})
= √( ∥A_{n+1}(u|Ṽ_{n+1}^{r+t})∥² + ∥A_{n+1}(u|V_{n+1}^{r+t−1})∥² )
≥ ∥A_{n+1}(u|Ṽ_{n+1}^{r+t})∥ = e^{(χ_{r+t} ± ε)(n+1)} ∥u|Ṽ_{n+1}^{r+t}∥.
(1/n) log ∥A(T^n x)∥ = (1/n) ∑_{k=0}^{n} log ∥A(T^k x)∥ − (1/n) ∑_{k=0}^{n−1} log ∥A(T^k x)∥ −−−→ 0 a.e.
∥u|Ṽ_{n+k}^{r+1}∥ ≤ K_1 ∥u∥ exp[−n(χ_{r+1} − χ_r − δ)].
• Estimate of ∥C∥: ∥C∥ ≤ ∥u|Ṽ_{n+k−1}^{r+t}∥. By the induction hypothesis on k,
∥C∥ ≤ ∥u∥ ( ∑_{t′=1}^{t−1} K_{t′} ) ( ∑_{j=0}^{k−2} e^{−δ_0 j} ) ( ∑_{j=0}^{k−2} exp[−(n+j)(χ_{r+t} − χ_r − tδ)] ).
It is not difficult to see that when we add these bounds for ∥C∥, ∥B∥ and ∥A∥, the
result is smaller than the RHS of (2.9) for k. This completes the proof by induction
of (2.9). As explained above, step 3 follows by induction. □
Corollary 2.3. Let χ_1(x) < ··· < χ_{s(x)}(x) denote the logarithms of the (different) eigenvalues of Λ(x). Let U_{χ_i} be the eigenspace of Λ(x) corresponding to exp χ_i. Set V_χ := ⊕_{χ′≤χ} U_{χ′}.
1. χ(x, v) := lim_{n→∞} (1/n) log ∥A_n(x)v∥ exists a.s., and is invariant.
2. χ(x, v) = χ_i on V_{χ_i} \ V_{χ_{i−1}}.
3. If ∥A^{−1}∥, ∥A∥ ∈ L^∞, then lim_{n→∞} (1/n) log |det A_n(x)| = ∑ k_i χ_i, where k_i = dim U_{χ_i}.
{χi (x)} are called the Lyapunov exponents of x. {Vχi } is called the Lyapunov fil-
tration of x. Property (2) implies that {Vχ } is A–invariant: A(x)Vχ (x) = Vχ (T x).
Property (3) is sometimes called regularity.
Remark: V_{χ_i} \ V_{χ_{i−1}} is A–invariant, but if A(x) is not orthogonal, then U_{χ_i} doesn't need to be A–invariant. When T is invertible, there is a way of writing V_{χ_i} = ⊕_{j≤i} H^j so that A(x)H^j(x) = H^j(Tx) and χ(x, ·) = χ_j on H^j(x), see the next section.
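In practice, Lyapunov exponents are estimated numerically by re-orthogonalizing the cocycle with QR decompositions (Benettin's algorithm): the running averages of log|R_ii| converge to the exponents in decreasing order. A minimal sketch (our own illustration, not from the notes); for a constant cocycle A the exponents are the log-moduli of the eigenvalues of A, which gives an easy sanity check:

```python
import numpy as np

def lyapunov_spectrum(matrices):
    """Estimate the Lyapunov exponents of the cocycle A_n = A(T^{n-1}x)...A(x)
    by QR re-orthogonalization; returns exponents in decreasing order."""
    d = matrices[0].shape[0]
    Q = np.eye(d)
    sums = np.zeros(d)
    for A in matrices:
        Q, R = np.linalg.qr(A @ Q)
        sums += np.log(np.abs(np.diag(R)))
    return np.sort(sums / len(matrices))[::-1]

# constant cocycle with eigenvalues 2 and 0.5 (conjugated to hide triangularity)
rng = np.random.default_rng(0)
P = rng.normal(size=(2, 2))
A = P @ np.diag([2.0, 0.5]) @ np.linalg.inv(P)
chi = lyapunov_spectrum([A] * 2000)
```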
2. lim_{n→±∞} (1/|n|) log ∥A_n(x)v∥ = ±χ_i(x) on the unit sphere in H^i(x).
3. (1/n) log sin ∡(H^i(T^n x), H^j(T^n x)) −−−→ 0 as n → ∞.
Proof. Fix x, and let t_n^1 ≤ ··· ≤ t_n^d and t̄_n^1 ≤ ··· ≤ t̄_n^d be the eigenvalues of (A_n^t A_n)^{1/2} and (A_{−n}^t A_{−n})^{1/2}. Let t_i := lim (t_n^i)^{1/n}, t̄_i := lim (t̄_n^i)^{1/n}. These limits exist almost surely, and {log t_i}, {log t̄_i} are lists of the Lyapunov exponents of A_n and A_{−n}, repeated with multiplicity.
and A−n , repeated with multiplicity. The proof of the Oseledets theorem shows that
∑_{k=d−i+1}^{d} log t̄_k ≡ lim_{n→∞} (1/n) log ∥A_{−n}^{∧i}∥
= lim_{n→∞} (1/n) log ( |det A_{−n}| · ∥((A_{−n})^{−1})^{∧(d−i)}∥ )   (write using e.v.'s)
≡ lim_{n→∞} (1/n) log ( |det A_n ◦ T^{−n}|^{−1} · ∥A_n^{∧(d−i)} ◦ T^{−n}∥ )
≡ lim_{n→∞} (1/n) log ( |det A_n|^{−1} · ∥A_n^{∧(d−i)}∥ )   (remark after Kingman's theorem)
= ∑_{k=i+1}^{d} log t_k − ∑_{k=1}^{d} log t_k = − ∑_{k=1}^{i} log t_k.
V^i(x) := {v : lim_{n→∞} (1/n) log ∥A_n(x)v∥ ≤ χ_i(x)}.
Let V̄^1(x) ⊃ V̄^2(x) ⊃ ··· ⊃ V̄^s(x) be the following decreasing filtration, given by
V̄^i(x) := {v : lim_{n→∞} (1/n) log ∥A_{−n}(x)v∥ ≤ −χ_i(x)}.
These filtrations are invariant: A(x)V^i(x) = V^i(Tx), A(x)V̄^i(x) = V̄^i(Tx).
Set H^i(x) := V^i(x) ∩ V̄^i(x). We must have A(x)H^i(x) = H^i(Tx).
We claim that R^d = ⊕ H^i(x) almost surely. It is enough to show that for a.e. x, R^d = V^i(x) ⊕ V̄^{i+1}(x), because
R^d ≡ V̄^1 = V̄^1 ∩ [V^1 ⊕ V̄^2]   (V^1 ⊕ V̄^2 = R^d)
= H^1 ⊕ [V̄^1 ∩ V̄^2] = H^1 ⊕ V̄^2   (V̄^1 ⊇ V̄^2)
= H^1 ⊕ [V̄^2 ∩ (V^2 ⊕ V̄^3)]   (V^2 ⊕ V̄^3 = R^d)
= H^1 ⊕ H^2 ⊕ V̄^3 = ··· = H^1 ⊕ ··· ⊕ H^s.
Since the spectra of Λ, Λ̄ agree with matching multiplicities, dim V^i + dim V̄^{i+1} = d. Thus it is enough to show that E := {x : V^i(x) ∩ V̄^{i+1}(x) ≠ {0}} has zero measure for all i.
Assume otherwise, then by the Poincaré recurrence theorem, for almost every
x ∈ E there is a sequence nk → ∞ for which T nk (x) ∈ E. By the Oseledets theorem,
for every δ > 0, there is Nδ (x) such that for all n > Nδ (x),
∥A_n(x)u∥ ≤ ∥u∥ exp[n(χ_i + δ)]   for all u ∈ V^i ∩ V̄^{i+1},   (2.10)
∥A_{−n}(x)u∥ ≤ ∥u∥ exp[−n(χ_{i+1} − δ)]   for all u ∈ V^i ∩ V̄^{i+1}.   (2.11)
If n_k > N_δ(x), then A_{n_k}(x)u ∈ V^i(T^{n_k}x) ∩ V̄^{i+1}(T^{n_k}x) and T^{n_k}(x) ∈ E, so
whence |χi+1 − χi | < 2δ . But δ was arbitrary, and could be chosen to be much
smaller than the gaps between the Lyapunov exponents. With this choice, we get a
contradiction which shows that m(E) = 0.
Thus R^d = ⊕ H^i(x). Evidently V^i ⊇ ⊕_{j≤i} H^j, and V^i ∩ ⊕_{j>i} H^j ⊆ V^i ∩ V̄^{i+1} = {0}, so V^i = ⊕_{j≤i} H^j. In the same way V̄^i = ⊕_{j≥i} H^j. It follows that H^i \ {0} ⊂ (V^i \ V^{i−1}) ∩ (V̄^i \ V̄^{i+1}). Thus lim_{n→±∞} (1/|n|) log ∥A_n v∥ = ±χ_i on the unit sphere in H^i.
Next we study the angle between H^i(x) and H̃^i(x) := ⊕_{j≠i} H^j(x). Pick a basis (v_1^i, …, v_{m_i}^i) for H^i(x), and a basis (w_1^i, …, w_{m̃_i}^i) for H̃^i(x). Since A_k(x) is invertible, A_k(x) maps (v_1^i, …, v_{m_i}^i) onto a basis of H^i(T^k x), and (w_1^i, …, w_{m̃_i}^i) onto a basis of H̃^i(T^k x).
We view A_n^{∧p} as an invertible matrix acting on span{e∗_{i_1} ∧ ··· ∧ e∗_{i_p} : i_1 < ··· < i_p} via A_n^{∧p}(e∗_{i_1} ∧ ··· ∧ e∗_{i_p}) = (A_n(x)e_{i_1})∗ ∧ ··· ∧ (A_n(x)e_{i_p})∗. It is clear that
Λ_p(x) := lim_{n→∞} ((A_n^{∧p})^∗ (A_n^{∧p}))^{1/2n} = lim_{n→∞} ((A_n^∗ A_n)^{1/2n})^{∧p} = Λ(x)^{∧p},
n→∞ n→∞
thus the eigenspaces of Λ p (x) are the tensor products of the eigenspaces of Λ (x).
This determines the Lyapunov filtration of An (x)∧p , and implies – by Oseledets
theorem – that if v j ∈ Vχk( j) \Vχk( j)−1 , and v1 , . . . , vk are linearly independent, then
2.7 The Karlsson-Margulis ergodic theorem 69
lim_{n→∞} (1/n) log ∥A_n(x)^{∧p} ω∥ = ∑_{j=1}^{p} χ_{k(j)},   for ω := v_1 ∧ ··· ∧ v_p.
Suppose (X, d) is a “nice” metric space, and Isom(X) is the group of isometries of X. Let (Ω, F, µ, T) be a ppt and f : Ω → Isom(X) be a measurable map such that for some (any) x_0 ∈ X, ∫_Ω d(f(ω) · x_0, x_0) dµ(ω) < ∞.
Consider the “random walk” x_n(ω) := f(T^{n−1}ω) ◦ ··· ◦ f(ω) · x_0. It is not difficult to see that g(n)(ω) := d(x_0, x_n(ω)) is a sub-additive cocycle. The sub-additive ergodic theorem implies the existence of an almost sure asymptotic speed s(ω) = lim_{n→∞} (1/n) d(x_0, x_n(ω)). The Karlsson-Margulis theorem provides (under additional assumptions on (X, d)) the existence of an asymptotic velocity: a geodesic ray γ_ω(t) ⊂ X which starts at x_0 s.t. d(x_n(ω), γ_ω(s(ω)n)) = o(n) as n → ∞.
Example 1 (Birkhoff Ergodic Theorem): Take an ergodic ppt (Ω, F, µ, T), X = R^d and f(ω) · v := v + f(ω), where f(ω) := (f_1(ω), …, f_d(ω)) and f_i : Ω → R are absolutely integrable with non-zero integral. Then
x_n(ω) = x_0 + n( (1/n) ∑_{i=0}^{n−1} f_1(T^i ω), …, (1/n) ∑_{i=0}^{n−1} f_d(T^i ω) ),
so s(ω) = ∥v∥_2 a.e. and γ_ω(t) = tv/∥v∥ a.e., where v := (∫f_1 dµ, …, ∫f_d dµ).
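Example 1 can be watched numerically. In the sketch below (ours, not from the notes) i.i.d. Gaussian steps play the role of the ergodic ppt: the position after n steps, normalized by n, converges to v, so the speed tends to ∥v∥ and the direction to v/∥v∥.

```python
import numpy as np

rng = np.random.default_rng(4)
v = np.array([1.0, 2.0])                  # v = (integral of f1, integral of f2)
steps = v + rng.normal(size=(20000, 2))   # f(T^i w), modeled as i.i.d. noise
x_n = steps.sum(axis=0)                   # x_n - x_0
n = len(steps)
speed = np.linalg.norm(x_n) / n           # should approach ||v||
direction = x_n / np.linalg.norm(x_n)     # should approach v / ||v||
```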
Example 2 (Multiplicative Ergodic Theorem): See section 2.7.3 below.
4. A geodesic ray is a curve γ : [0, ∞) → X s.t. d(γ(t), γ(t′)) = |t − t′| for all t, t′.
5. A metric space is called a geodesic space if every A, B ∈ X are connected by a geodesic segment.
6. A geodesic metric space is called geodesically complete if every geodesic segment can be extended to a geodesic ray in the positive direction.
Suppose X is a non–compact proper geodesic metric space. We are interested in
describing the different ways of “escaping to infinity” in X.
The idea is to compactify X by adding to it a “boundary” so that tending to infinity in X in a certain “direction” corresponds to tending to a point in the boundary of X in X̂.
Fix once and for all a reference point x0 ∈ X (the “origin”), and define for each
x ∈ X the function
Dx (z) := d(z, x) − d(x0 , x).
D_x(·) has Lipschitz constant 1, and D_x(x_0) = 0. It follows that {D_x(·) : x ∈ X} is equicontinuous and uniformly bounded on compact subsets of X.
By the Arzela–Ascoli theorem, every sequence {Dxn }n≥1 has a subsequence
{Dxnk }k≥1 which converges pointwise (in fact uniformly on compacts). Let
X̂ := { lim_{n→∞} D_{x_n}(·) : {x_n}_{n≥1} ⊂ X, {D_{x_n}}_{n≥1} converges uniformly on compacts }.
Proof. X̂ is compact, because it is the closure of {D_x(·) : x ∈ X}, which is precompact by Arzela-Ascoli.
The map ı : x ↦ D_x(·) is one-to-one because x can be read off D_x(·) as the unique point where that function attains its minimum. The map ı is continuous, because if d(x_n, x) → 0, then
To see that ı^{−1} is continuous, we first note that since X̂ is metrizable, it is enough to show that if D_{x_n} → D_x uniformly on compacts, then x_n → x in X. Suppose D_{x_n} → D_x uniformly on compacts, and fix ε > 0. Suppose by way of contradiction that ∃n_k ↑ ∞ s.t. d(x_{n_k}, x) ≥ ε. Construct y_{n_k} on the geodesic segment connecting x to x_{n_k} s.t. d(x, y_{n_k}) = ε/2. We have
D_{x_{n_k}}(y_{n_k}) = d(y_{n_k}, x_{n_k}) − d(x_{n_k}, x_0) = d(x, x_{n_k}) − d(x, y_{n_k}) − d(x_{n_k}, x_0) = D_{x_{n_k}}(x) − ε/2.
Since d(y_{n_k}, x) = ε/2 and X is proper, the y_{n_k} lie in a compact subset of X. W.l.o.g. y_{n_k} → y ∈ X as k → ∞. Passing to the limit we see that D_x(y) = D_x(x) − ε/2 < D_x(x). But this is absurd, since D_x attains its minimum at x. It follows that x_n → x. □
Terminology: X̂ is called the horofunction compactification of X, and ∂X := X̂ \ ı(X) is called the horofunction boundary of X. Elements of ∂X are called horofunctions.
The horofunction compactification has a very nice geometric interpretation in
case (X, d) has “non–positive curvature”, a notion we now proceed to make precise.
Suppose (X, d) is a geodesic space; then any three points A, B, C ∈ X determine a geodesic triangle △ABC obtained by connecting A, B, C by geodesic segments. A euclidean triangle △ĀB̄C̄ ⊂ R² is called a (euclidean) comparison triangle for △ABC if it has the same side lengths:
d(A, B) = d_{R²}(Ā, B̄), d(B, C) = d_{R²}(B̄, C̄), d(C, A) = d_{R²}(C̄, Ā).
A point x̄ ∈ ĀB̄ is called a comparison point for x ∈ AB, if d(x, A) = d_{R²}(x̄, Ā) and d(x, B) = d_{R²}(x̄, B̄).
Definition 2.3. A geodesic metric space (X, d) is called a CAT(0) space if for any geodesic triangle △ABC in X and points x ∈ AC, y ∈ BC, if △ĀB̄C̄ is a euclidean comparison triangle for △ABC, and x̄ ∈ ĀC̄ and ȳ ∈ B̄C̄ are comparison points to x ∈ AC and y ∈ BC, then d(x, y) ≤ d_{R²}(x̄, ȳ). (See figure 2.1.)
Proof. Fix ε > 0 and N so large that if k > N, then d(x_k, x_0) > t and |D_{x_k}(z) − D(z)| < ε for all z s.t. d(z, x_0) ≤ t. Let y_k := γ_k(t), the point on the geodesic segment x_0 x_k at distance t from x_0. We show that {y_k}_{k≥1} is a Cauchy sequence.
Fix m, n > N, and construct the geodesic triangle △x_m y_n x_0. Let △x̄_m ȳ_n x̄_0 be its euclidean comparison triangle. Let ȳ_m ∈ [x̄_m, x̄_0] be the comparison point to y_m on the geodesic segment from x_m to x_0. By the CAT(0) property, d(y_m, y_n) ≤ |ȳ_m ȳ_n|.
Fig. 2.2
|ȳ_n ȳ_m| / |ȳ_m z̄| = 1/cos θ = 2t / |ȳ_n ȳ_m|.
It follows that |ȳ_n ȳ_m| ≤ √(2t |ȳ_m z̄|).
This shows that {γ_n(t)}_{n≥1} = {y_n}_{n≥1} is a Cauchy sequence. Moreover, the Cauchy criterion holds uniformly on compact subsets of t ∈ [0, ∞).
The limit γ(t) = lim γn (t) must be a geodesic ray emanating from x0 (exercise).
Step 2. D(z) = B_γ(z; x_0).
Proof. Let t_n := d(x_0, x_n), and define ξ_n := γ(t_n) for n odd, ξ_n := x_n for n even. Then D_{ξ_{2n}}(z) → D(z) and D_{ξ_{2n−1}}(z) → B_γ(z; x_0).
We use the fact that the geodesic segments x0 ξn converge to γ(t) to show that
|Dξn − Dξn+1 | → 0 uniformly on compacts. It will follow that D(z) = Bγ (z; x0 ).
Fix ε > 0 small and r > ρ > 0 large. Let ηk denote the point on the segment ξk x0
at distance r from x0 , then
The last summand tends to zero because ηk → γ(r). We show that the other two
summands are small for all z s.t. d(z, x0 ) ≤ ρ.
Let △ξ̄_k x̄_0 z̄ be a euclidean comparison triangle for △ξ_k x_0 z, and let η̄_k ∈ ξ̄_k x̄_0 be a comparison point to η_k ∈ ξ_k x_0. Let z′ be the projection of z̄ on the line through ξ̄_k, x̄_0 (figure 2.3).
By the CAT(0) property, d(η_k, z) ≤ d(η̄_k, z̄), and so
Fig. 2.3
Similarly, one shows that d(ξk+1 , ηk+1 )+d(ηk+1 , z)−d(ξk+1 , z) ≤ 2ρ 2 /r. Choosing
r > 2ρ 2 /ε sufficiently large, and k large enough so that d(xk , x0 ) > r we see that the
two remaining terms are less than ε, with the result that |Dξk+1 (z) − Dξk (z)| < 3ε for
all z s.t. d(z, x0 ) < ρ.
It follows that Bγ (z; x0 ) = lim Dξ2n−1 (z) = lim Dξ2n (z) = D(z).
Part 3. If B_{γ_1}( · ; x_0) = B_{γ_2}( · ; x_0), then γ_1 = γ_2.
Proof. Fix t_n ↑ ∞, and set x_n := γ_1(t_n) for n odd and x_n := γ_2(t_n) for n even. The sequences D_{x_{2n}}(·), D_{x_{2n−1}}(·) have the same limit, B_{γ_1}( · ; x_0) = B_{γ_2}( · ; x_0), therefore lim D_{x_n} exists. Step 1 in Part 2 shows that the geodesic segments x_0 x_n must converge uniformly to a geodesic ray γ(t). But these geodesic segments lie on γ_1 for n odd and on γ_2 for n even; it follows that γ_1 = γ_2. □
Proposition 2.6. Let (X, d) be a proper geodesic space with the CAT(0) property, and suppose x_n ∈ X tend to infinity at speed s, i.e. (1/n) d(x_0, x_n) → s. If
(1/n) D(x_n) → −s for some D ∈ ∂X,
then there is a geodesic ray γ emanating from x_0 s.t. d(x_n, γ(sn)) = o(n).
Fix n and t large, and consider the geodesic triangle △x_n x_0 γ(st) and the point γ(sn) on the segment from x_0 to γ(st). Let △x̄_n x̄_0 γ̄(st) be the euclidean comparison triangle, and let γ̄(sn) be the comparison point to γ(sn). By the CAT(0) property,
Let w_t be the point on the segment connecting γ̄(st) to x̄_0 at the same distance from γ̄(st) as x̄_n. Drop a height x̄_n z̄ to that segment (figure 2.4).
Fig. 2.4
|x̄_n z̄|² = |x̄_n x̄_0|² − |x̄_0 z̄|² = [sn + o(n)]² − [sn + o(n)]² = o(n²),
Throughout this section, (X, d) is a metric space which is proper, geodesic, geodesi-
cally complete, and which has the CAT(0) property. We fix once and for all some
point x0 ∈ X (“the origin”).
A map ϕ : X → X is called an isometry, if it is invertible, and d(ϕ(x), ϕ(y)) =
d(x, y) for all x, y ∈ X. The collection of isometries is a group, which we will denote
by Isom(X).
Suppose (Ω, B, µ, T) is a ppt and f : Ω → Isom(X) is measurable, in the sense that (ω, x) ↦ f(ω)(x) is a measurable map Ω × X → X. We study the behavior of f_n(ω)x_0 as n → ∞, where
f_n(ω) := f(ω) ◦ f(Tω) ◦ ··· ◦ f(T^{n−1}ω).
The subadditive theorem implies that {f_n(ω)x_0}_{n≥1} has an “asymptotic speed”:
s(ω) := lim_{n→∞} (1/n) d(x_0, f_n(ω)x_0).
We are interested in the existence of an “asymptotic direction.”
Theorem 2.14 (Karlsson-Margulis). Suppose (Ω, B, µ, T) is a ppt on a standard probability space, and f : Ω → Isom(X) is a measurable map. If ∫_Ω d(x_0, f(ω)x_0) dµ is finite, then for a.e. ω ∈ Ω there exists a geodesic ray γ_ω(t) emanating from x_0 s.t.
(1/n) d( f(ω)f(Tω)···f(T^{n−1}ω)x_0, γ_ω(ns(ω)) ) −−−→ 0.
The trick is to write the expression D( fn (ω)x0 ) as an ergodic sum for some other
dynamical system, and then apply the pointwise ergodic theorem.
We first extend the action of an isometry ϕ on X to an action on X̂. Recall that every D ∈ X̂ equals lim_{n→∞} D_{x_n}(·) where D_{x_n}(·) = d(x_n, ·) − d(x_n, x_0). Define (ϕ · D)(z) := D(ϕ^{−1}(z)) − D(ϕ^{−1}(x_0)), and S(ω, D) := (Tω, f(ω)^{−1} · D). Notice that the second coordinate of the iterate S^k(ω, D) = (T^k ω, f_k(ω)^{−1} · D) is the horofunction z ↦ D(f_k(ω)z) − D(f_k(ω)x_0). Define F : Ω × X̂ → R by
F(ω, D) := D(f(ω)x_0),
Technical step: To construct a probability measure µ̂_0 on Ω × X̂ such that
1. µ̂_0 is S–invariant, S–ergodic, and µ̂_0(Ω × ∂X) = 1;
2. µ̂_0 projects to µ in the sense that µ̂_0(E × X̂) = µ(E) for all measurable E ⊂ Ω;
3. F ∈ L¹(µ̂_0) and ∫ F dµ̂_0 = −s.
We give the details later. First we explain how to use such a µ̂_0 to prove the theorem. By the pointwise ergodic theorem, the set
U := { (ω, D) ∈ Ω × ∂X : (1/n) D(f_n(ω)x_0) −−−→ ∫ F dµ̂_0 = −s }
∫ (−F) dν̂ = −(1/n) ∑_{k=0}^{n−1} ∫ (F ◦ S^k) dν̂ = −(1/n) ∫ D(f_n(ω) · x_0) dν̂(ω, D)
≤ (1/n) ∫ d(x_0, f_n(ω) · x_0) dµ,   by (2.13) and the assumption that ν̂ projects to µ.
As this holds for all n, ∫(−F) dν̂ ≤ inf_n (1/n) ∫ d(x_0, f_n(ω) · x_0) dµ = s (the last equality is by the subadditive ergodic theorem).
We have −(1/n) ∑_{k=0}^{n−1} ∫ F ◦ S^k dη̂_n = −(1/n) ∫ D_{f_n(ω)x_0}(f_n(ω)x_0) dµ = (1/n) ∫ d(x_0, f_n(ω)x_0) dµ.
So −(1/n) ∑_{k=0}^{n−1} ∫ F ◦ S^k dη̂_n ≥ s. It is easy to see that η̂_n projects to µ.
But η̂_n is not S–invariant. Let µ̂_n := (1/n) ∑_{k=0}^{n−1} η̂_n ◦ S^{−k}. This measure still projects to µ (check!); −∫ F dµ̂_n = −(1/n) ∑_{k=0}^{n−1} ∫ F ◦ S^k dη̂_n ≥ s; and ∥µ̂_n ◦ S^{−1} − µ̂_n∥ ≤ 2/n (total variation norm).
We may think of µ̂_n as a positive bounded linear functional on L¹(Ω, C(X̂)) with norm 1. By the Banach-Alaoglu theorem, there is a subsequence µ̂_{n_k} which converges weak-star in L¹(Ω, C(X̂))∗ to some positive µ̂ ∈ L¹(Ω, C(X̂))∗. This functional restricts to a functional in C(Ω × X̂)∗, and therefore can be viewed as a positive finite measure on Ω × X̂. Abusing notation, we call this measure µ̂.
• µ̂ is a probability measure, because µ̂(1) = lim µ̂_{n_k}(1) = 1.
• µ̂ projects to µ, because for every E ∈ F, g_E(ω, D) := 1_E(ω) belongs to L¹(Ω, C(X̂)), whence µ̂(E × X̂) = µ̂(g_E) = lim µ̂_{n_k}(g_E) = lim µ̂_{n_k}(E × X̂) = µ(E).
• µ̂ is S-invariant, because if g ∈ C(Ω × X̂), then g, g ◦ S ∈ L¹(Ω, C(X̂)), whence |µ̂(g) − µ̂(g ◦ S)| = lim |µ̂_{n_k}(g) − µ̂_{n_k}(g ◦ S)| = 0.
• µ̂(−F) = s: The inequality ≤ is because of the S-invariance of µ̂. The inequality ≥ is because F ∈ L¹(Ω, C(X̂)), whence µ̂(−F) = lim µ̂_{n_k}(−F) ≥ s.
(These arguments require weak∗ convergence in L¹(Ω, C(X̂))∗, not just weak∗ convergence of measures, because in general g_E, F, g ◦ S ∉ C(Ω × X̂).)
We found an S-invariant probability measure µ̂ which projects to µ and such that µ̂(−F) = s. But µ̂ is not necessarily ergodic.
Let µ̂ = ∫_{Ω×∂X} µ̂_{(ω,D)} dµ̂ be its ergodic decomposition. Then: (a) almost every ergodic component is ergodic and invariant; (b) almost every ergodic component projects to µ (prove, using the extremality of µ); (c) almost every ergodic component gives (−F) integral s, because µ̂_{(ω,D)}(−F) ≤ s for a.e. (ω, D) by S-invariance, and the inequality cannot be strict on a set of positive measure, as this would imply that µ̂(−F) < s.
So a.e. ergodic component µ̂_0 of µ̂ is ergodic, invariant, projects to µ, and satisfies ∫ F dµ̂_0 = −s. It remains to check that µ̂_0 is carried by Ω × ∂X. This is because, by the ergodic theorem, for µ̂-a.e. (ω, D)
(1/n) D(f_n(ω)x_0) ≡ (1/n) ∑_{k=0}^{n−1} (F ◦ S^k)(ω, D) −−−→ −s < 0,
We explain how to obtain the multiplicative ergodic theorem as a special case of the
Karlsson-Margulis ergodic theorem. We begin with some notation and terminology.
• log := ln, log⁺ t := max{log t, 0}.
• Vectors in R^d are denoted by v, w etc.; ⟨v, w⟩ = ∑ v_i w_i and ∥v∥ = √⟨v, v⟩.
• The space of d × d matrices with real entries will be denoted by M_d(R).
• diag(λ_1, …, λ_d) := (a_{ij}) ∈ M_d(R) where a_{ii} := λ_i and a_{ij} = 0 for i ≠ j.
• I := identity matrix = diag(1, …, 1).
• For a matrix A = (a_{ij}) ∈ M_d(R), Tr(A) := ∑ a_{ii} (the trace), and Tr^{1/2}(A) := √Tr(A). A^t := (a_{ji}) (the transpose). ∥A∥ := sup_{∥v∥≤1} ∥Av∥.
• exp(A) := I + ∑_{k=1}^{∞} A^k/k!.
Proof. Suppose A ∈ GL(d, R), P ∈ Posd (R). Clearly APAt is symmetric. It is pos-
itive definite, because for every v ̸= 0, ⟨APAt v, v⟩ = ⟨PAt v, At v⟩ > 0 by the positive
definiteness of P and the invertibility of At . This proves (1). (2) is trivial. (3) is
because A · P = Q for A := Q1/2 (P−1 )1/2 . Here we use the elementary fact that
P ∈ Pos_d(R) ⇒ P is invertible and P^{−1} ∈ Pos_d(R). □
For details and proof, see Bridson & Haefliger: Metric spaces of non-positive cur-
vature, Springer 1999, chapter II.10.
Lemma 2.3. For every A ∈ GL(d, R), d(I, A · I) = √(2 ∑_{i=1}^{d} (log λ_i)²), where λ_i are the singular values of A (the eigenvalues of √(AA^t)).
Proof. The idea is to find the geodesic from I to A · I. Such geodesics take the form
γ(τ) = exp(τS) for S ∈ Symd (R) s.t. Tr(S2 ) = 1. To find S we write A · I = AAt =
exp(log AAt ). This leads naturally to the solution
It is well-known that ∥A∥ = λ_max and ∥A^{−1}∥ = 1/λ_min.⁶ The corollary follows from lemma 2.3 and the identity max_i |log λ_i| = max{|log λ_max|, |log λ_min|}. □
Lemma 2.4. Suppose P_n, P ∈ Pos_d(R). If d(P_n, P) → 0 as n → ∞, where d is the metric in theorem 2.15, then ∥P_n − P∥ → 0.
Let γn (t) denote the geodesic ray from I to P−1/2 Pn P−1/2 , then γn (t) = exp(tSn ) for
some symmetric matrix Sn such that Tr[Sn2 ] = 1. This gives us the identity
Since ∑ λ_i(n)² = 1, the middle term tends to I in norm, whence by the orthogonality of O_n, ∥P^{−1/2} P_n P^{−1/2} − I∥ → 0. It follows that ∥P_n − P∥ → 0. □
Proof. The idea is to apply the Karlsson-Margulis ergodic theorem to the space
X := Posd (R), the origin x0 := I, and the isometric action A(ω) · P := A(ω)PA(ω)t .
First we need to check the integrability condition:
Step 1. ∫ d(x_0, A(ω) · x_0) dµ < ∞.
This follows from corollary 2.5.
Step 2. For a.e. ω the following limit exists: s(ω) := lim_{n→∞} (1/n) d(I, A_n(ω) · I).
Proof. It is easy to see, using the triangle inequality and the fact that GL(d, R) acts isometrically on Pos_d(R), that g(n)(ω) := d(I, A_n(ω) · I) is a sub-additive cocycle. The step follows from the sub-additive ergodic theorem.
Step 3. Construction of Λ(ω) s.t. (1/n) log ∥A_n(ω)^{−1} Λ(ω)^n∥ −−−→ 0 a.e.
⁶ This can be easily deduced from the singular value decomposition: A = O_1 diag(λ_1, …, λ_d) O_2^t where O_1, O_2 ∈ O_d(R).
The Karlsson-Margulis ergodic theorem provides, for a.e. ω, a geodesic ray γ_ω(t) emanating from I such that d(A_n(ω) · I, γ_ω(n s(ω))) = o(n). By theorem 2.15, γ_ω(t) = exp(tS(ω)), for some S(ω) ∈ Sym_d(R) s.t. Tr[S²] = 1. Let
Λ(ω) := exp( (1/2) s(ω) S(ω) ).
A′_n := (A_n A_n^t)^{1/2n} = exp[ (1/2n) log(A_n A_n^t) ], a point on the geodesic from I to A_n · I, and
d(A′_n, Λ) = (1/2n) d(A_n · I, Λ^{2n}) ≡ (1/2n) d(A_n · I, γ_ω(ns)) = o(1).
Problems
2.3. Use the pointwise ergodic theorem to show that any two different ergodic in-
variant probability measures for the same transformation are mutually singular.
2.5. Prove that the Bernoulli (1/2, 1/2)–measure is the invariant probability measure for the adding machine (Problem 1.10), by showing that all cylinders of length n must have the same measure as [0^n]. Deduce from the previous problem that the adding machine is ergodic.
2.7. Prove:
1. f ↦ E(f|F) is linear, and a contraction in the L¹–metric;
2. f ≥ 0 ⇒ E(f|F) ≥ 0 a.e.;
3. if ϕ is convex, then E(ϕ ◦ f|F) ≥ ϕ(E(f|F)) a.e. (Jensen's inequality);
4. if h is F–measurable and bounded, then E(hf|F) = hE(f|F);
5. if F_1 ⊂ F_2, then E[E(f|F_2)|F_1] = E(f|F_1).
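For the σ-algebra generated by a finite partition, E(f|F) is just averaging over the partition cells, and properties (1) and (5) can be verified directly. A small sketch (our own, on a 4-point space with uniform measure):

```python
import numpy as np

def cond_exp(f, labels):
    """Conditional expectation w.r.t. the sigma-algebra generated by a finite
    partition of {0,...,n-1} under uniform measure; labels[i] = cell of point i."""
    out = np.empty_like(f, dtype=float)
    for c in np.unique(labels):
        mask = labels == c
        out[mask] = f[mask].mean()
    return out

f = np.array([1.0, 3.0, 2.0, 6.0])
fine = np.array([0, 0, 1, 1])      # partition {0,1}, {2,3}
coarse = np.array([0, 0, 0, 0])    # trivial partition (F1 contained in F2)
```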
Prove this theorem, using the following steps (W. Parry). It is enough to consider
non-negative f ∈ L1 .
1. Prove that E(f|F_n) −−−→ E(f|F) in L¹, using the following observations:
a. The convergence holds for all elements of ⋃_{n≥1} L¹(X, F_n, µ);
b. ⋃_{n≥1} L¹(X, F_n, µ) is dense in L¹(X, F, µ).
2. Set E_a := {x : max_{1≤n≤N} E(f|F_n)(x) > a}. Show that µ(E_a) ≤ (1/a) ∫ f dµ. (Hint:
3. Prove that E(f|F_n) −−−→ E(f|F) a.e. for every non-negative f ∈ L¹, using the following steps. Fix f ∈ L¹. For every ε > 0, choose n_0 and g ∈ L¹(X, F_{n_0}, µ) such that ∥E(f|F) − g∥_1 < ε.
a. Show that |E(f|F_n) − E(f|F)| ≤ E(|f − g| |F_n) + |E(f|F) − g| for all n ≥ n_0. Deduce that
µ( limsup_{n→∞} |E(f|F_n) − E(f|F)| > √ε ) ≤ µ( sup_n E(|f − g| |F_n) > (1/2)√ε ) + µ( |E(f|F) − g| > (1/2)√ε ).
b. Show that µ( limsup_{n→∞} |E(f|F_n) − E(f|F)| > √ε ) −−−→ 0 as ε → 0⁺. (Hint: Prove first that for every L¹ function F, µ[|F| > a] ≤ (1/a)∥F∥_1.)
c. Finish the proof.
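The convergence E( f |Fₙ) → f can be watched numerically in the simplest setting: ([0,1], Lebesgue) with the dyadic filtration, where E( f |Fₙ)(x) is the average of f over the dyadic interval of length 2⁻ⁿ containing x. This is an illustrative sketch (the filtration and the function f(x) = x² are my choices, not the notes'); it uses the exact average of x² over an interval to avoid numerical integration:

```python
# Martingale E(f|F_n) for the dyadic filtration on ([0,1], Lebesgue):
# average f over the dyadic interval of length 2^-n containing x.
def dyadic_cond_exp(f_mean, n, x):
    h = 2.0 ** (-n)
    k = int(x / h)                       # dyadic interval [k*h, (k+1)*h)
    return f_mean(k * h, (k + 1) * h)

f = lambda x: x * x
f_mean = lambda a, b: (b**3 - a**3) / (3 * (b - a))  # exact mean of x^2 on [a,b]

x = 0.3
errs = [abs(dyadic_cond_exp(f_mean, n, x) - f(x)) for n in (2, 5, 10, 20)]
assert errs == sorted(errs, reverse=True)   # errors shrink as F_n grows
assert errs[-1] < 1e-4                      # E(f|F_n)(x) -> f(x)
```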
2.7 The Karlsson-Margulis ergodic theorem 87
Prove this theorem using the following steps (R. Zweimüller). Fix a set A ∈ B s.t.
0 < µ(A) < ∞, and let (A, BA , TA , µA ) denote the induced system on A (problem
1.14). For every function F, set
Sn F := F + F ◦ T + · · · + F ◦ T n−1
SnA F := F + F ◦ TA + · · · + F ◦ TAn−1
1. Read problem 1.14, and show that a.e. x has an orbit which enters A infinitely many times. Let 0 < τ₁(x) < τ₂(x) < ··· be the times when T^{τᵢ}(x) ∈ A.
2. Suppose f ≥ 0. Prove that for every n ∈ (τ_{k−1}(x), τ_k(x)] and a.e. x ∈ A,
$$\frac{(S^A_{k-1}f)(x)}{(S^A_k 1_A)(x)}\ \le\ \frac{(S_n f)(x)}{(S_n 1_A)(x)}\ \le\ \frac{(S^A_k f)(x)}{(S^A_{k-1}1_A)(x)}.$$
3. Verify that S^A_j 1_A = j a.e. on A, and show that
$$\frac{(S_n f)(x)}{(S_n 1_A)(x)}\xrightarrow[n\to\infty]{}\frac{1}{\mu(A)}\int f\,d\mu\quad\text{a.e. on }A.$$
4. Finish the proof.
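For an ergodic probability-preserving map the conclusion of step 3 can be checked numerically. A minimal sketch, using the irrational rotation T x = x + α mod 1 (my choice of example; f and A are illustrative):

```python
import math

# Ratio S_n f / S_n 1_A for the irrational rotation x -> x + alpha mod 1.
# The ratio ergodic theorem predicts convergence to (int f d mu) / mu(A).
alpha = math.sqrt(2) - 1

def ratio(x, n):
    sf = sa = 0
    for _ in range(n):
        if x < 0.25:            # f = indicator of [0, 1/4)
            sf += 1
        if x < 0.5:             # A = [0, 1/2)
            sa += 1
        x = (x + alpha) % 1.0
    return sf / sa

r = ratio(0.1234, 200000)
assert abs(r - 0.5) < 0.01      # (1/4) / (1/2) = 1/2
```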
For a comprehensive reference to ergodic theorems, see [7]. The mean ergodic the-
orem was proved by von Neumann, and the pointwise ergodic theorem was proved
by Birkhoff. By now there are many proofs of Birkhoff’s theorem; the one we use
is taken from [6], where it is attributed to Kamae — who found it using ideas from
nonstandard analysis. The subadditive ergodic theorem was first proved by King-
man. The proof we give is due to Steele [11]. The proof of Tempelman’s pointwise
ergodic theorem for Zd is taken from [7]. For ergodic theorems for the actions of
other amenable groups see [8]. The multiplicative ergodic theorem is due to Os-
eledets. The “linear algebra” proof is due to Raghunathan and Ruelle, and is taken
from [10]. The geometric approach to the multiplicative ergodic theorem is due to
Kaimanovich [3] who used it to generalize that theorem to homogeneous spaces
other than Posd (R). The ergodic theorem for isometric actions on CAT(0) spaces
is due to Karlsson & Margulis [5]. The proof we give is due to Karlsson & Ledrappier [4]. The Martingale convergence theorem (problem 2.9) is due to Doob. The
proof sketched in problem 2.9 is taken from [2]. The proof of Hopf’s ratio ergodic
theorem sketched in problem 2.10 is due to R. Zweimüller [12].
References
A basic problem in ergodic theory is to determine whether two ppt are measure
theoretically isomorphic. This is done by studying invariants: properties, quantities,
or objects which are equal for any two isomorphic systems. The idea is that if two
ppt have different invariants, then they cannot be isomorphic. Ergodicity and mixing
are examples of invariants for measure theoretic isomorphism.
An effective method for inventing invariants is to look for a weaker equivalence
relation, which is better understood. Any invariant for the weaker equivalence rela-
tion is automatically an invariant for measure theoretic isomorphism. The spectral
approach to ergodic theory is an example of this method.
The idea is to associate to the ppt (X, B, µ, T ) the operator UT : L2 (X, B, µ) →
L2 (X, B, µ), UT f = f ◦ T . This is an isometry of L2 (i.e. ∥UT f ∥2 = ∥ f ∥2 and
⟨UT f ,UT g⟩ = ⟨ f , g⟩).
It is useful here to think of L² as a Hilbert space over ℂ, with inner product $\langle f,g\rangle:=\int f\bar g\,d\mu$. The need to consider complex numbers arises from the need to consider complex eigenvalues, see below.
Definition 3.1. Two ppt (X, B, µ, T ), (Y, C , ν, S) are called spectrally isomorphic,
if their associated L2 –isometries UT and US are unitarily equivalent: There exists a
linear operator W : L2 (X, B, µ) → L2 (Y, C , ν) s.t.
1. W is invertible;
2. ⟨W f ,W g⟩ = ⟨ f , g⟩ for all f , g ∈ L2 (X, B, µ);
3. WUT = USW .
It is easy to see that any two measure theoretically isomorphic ppt are spectrally iso-
morphic, but we will see later that there are Bernoulli schemes which are spectrally
isomorphic but not measure theoretically isomorphic.
90 3 Spectral theory
Proof. Suppose (X, B, µ, T ) is a ppt, and let UT be as above. The trick is to phrase
ergodicity and mixing in terms of UT .
Ergodicity is equivalent to the statement “all invariant L2 –functions are con-
stant”, which is the same as saying that dim{ f : UT f = f } = 1. Obviously, this
is a spectral invariant.
Mixing is equivalent to the following statement:
To see that (3.1) is a spectral invariant, we first note that (3.1) implies:
1. dim{g : U_T g = g} = 1. One way to see this is to note that mixing implies ergodicity. Here is a more direct proof: if U_T g = g, then (1/N)∑_{n=0}^{N−1} U_T^n g = g. By (3.1), (1/N)∑_{n=0}^{N−1} U_T^n g converges weakly to ⟨g, 1⟩1. Necessarily g = ⟨g, 1⟩1 = const.
2. For unitary equivalences W, W1 = c·1 with |c| = 1. Proof: W1 is an eigenfunction with eigenvalue one, so W1 = c·1 by 1. Since ∥W1∥₂ = ∥1∥₂ = 1, |c| = 1.
Suppose now that W : L²(X, F, µ) → L²(Y, G, ν) is a unitary equivalence between (X, F, µ, T) and (Y, G, ν, S). Suppose T satisfies (3.1). For every F, G ∈ L²(Y) we can write F = W f, G = W g with f, g ∈ L²(X), whence
H(T ) is a countable subgroup of the unit circle (problem 3.1). Evidently H(T ) is a
spectral invariant of T .
It is easy to see using Fourier expansions that for the irrational rotation Rα ,
H(Rα ) = {eikα : k ∈ Z} (problem 3.2), thus irrational rotations by different angles
are non-isomorphic.
Here are other related invariants:
Definition 3.4. Given a ppt (X, B, µ, T ), let Vd := span{eigenfunctions}. We say
that (X, B, µ, T ) has
1. discrete spectrum (sometimes called pure point spectrum), if Vd = L²,
2. continuous spectrum, if Vd = {constants} (i.e. is smallest possible),
3. mixed spectrum, if Vd ̸= L2 , {constants}.
3.2 Weak mixing 91
Any irrational rotation has discrete spectrum (problem 3.2). Any mixing transformation has continuous spectrum, because a non-constant eigenfunction f ∘ T = λ f would satisfy, along a subsequence n_k with λ^{n_k} → 1,
$$\langle f, f\circ T^{n_k}\rangle\xrightarrow[k\to\infty]{}\|f\|_2^2\ \neq\ \Bigl|\int f\,d\mu\Bigr|^2.$$
We saw that if a transformation is mixing, then it does not have non-constant eigen-
functions. But the absence of non-constant eigenfunctions is not equivalent to mix-
ing (see problems 3.8–3.10 for an example). Here we study the dynamical signifi-
cance of this property. First we give it a name.
Theorem 3.2. The following are equivalent for a ppt (X, B, µ, T ) on a Lebesgue
space:
1. weak mixing;
2. for all E, F ∈ B, (1/N)∑_{n=0}^{N−1} |µ(E ∩ T^{−n}F) − µ(E)µ(F)| → 0 as N → ∞;
3. for every E, F ∈ B, there exists N ⊂ ℕ of density zero (i.e. |N ∩ [1, N]|/N → 0 as N → ∞) s.t. µ(E ∩ T^{−n}F) → µ(E)µ(F) as n → ∞, n ∉ N;
4. T × T is ergodic.
Proof. We prove (2) ⇒ (3) ⇒ (4) ⇒ (1). The remaining implication (1) ⇒ (2)
requires additional preparation, and will be shown later.
(2) ⇒ (3) is a general fact from calculus (the Koopman–von Neumann Lemma): if aₙ is a bounded sequence of non-negative numbers, then (1/N)∑_{n=1}^N aₙ → 0 iff there is a set N ⊂ ℕ of zero density s.t. aₙ → 0 as n → ∞, n ∉ N (Problem 3.3).
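The lemma is easy to visualize with a concrete (hypothetical) sequence: take aₙ = 1 on the perfect squares, a density-zero set, and aₙ = 1/n elsewhere. The Cesàro averages tend to 0 although aₙ itself does not:

```python
import math

# Koopman-von Neumann in action: a_n = 1 on perfect squares, 1/n otherwise.
def a(n):
    r = math.isqrt(n)
    return 1.0 if r * r == n else 1.0 / n

N = 10**5
cesaro = sum(a(k) for k in range(1, N + 1)) / N
assert cesaro < 0.01                        # averages tend to zero

exceptional = [k for k in range(1, N + 1) if a(k) > 0.01]
assert len(exceptional) / N < 0.01          # a_n -> 0 outside a sparse set
```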
whence (1/N)∑_{n=0}^{N−1} m[(E₁ × F₁) ∩ S^{−n}(E₂ × F₂)] → m(E₁ × F₁)m(E₂ × F₂) as N → ∞. In summary, (1/N)∑_{n=0}^{N−1} m[A ∩ S^{−n}B] → m(A)m(B) for all A, B ∈ S.
Since S generates B ⊗ B, every A, B ∈ B ⊗ B can be approximated up to measure ε by finite disjoint unions of elements of S. A standard approximation argument shows that (1/N)∑_{n=0}^{N−1} m[A ∩ S^{−n}B] → m(A)m(B) for all A, B ∈ B ⊗ B, and this implies that T × T is ergodic.
Proof that (4) ⇒ (1): Suppose T were not weak mixing; then T has a non-constant eigenfunction f with eigenvalue λ. The eigenvalue λ has absolute value equal to one, because |λ|∥f∥₂ = ∥λ f∥₂ = ∥f ∘ T∥₂ = ∥f∥₂. Thus
H f := span{UTn f : n ∈ Z}.
Theorem 3.3 (Herglotz). A sequence {rₙ}_{n∈ℤ} is the sequence of Fourier coefficients of a positive Borel measure on S¹ iff r₋ₙ = r̄ₙ and {rₙ} is positive definite:
$$\sum_{n,m=-N}^{N} r_{n-m}\,a_n\overline{a_m}\ \ge\ 0\quad\text{for all sequences }\{a_n\}\text{ and all }N.$$
This measure is unique.
It is easy to check that rₙ = ⟨U_T^n f, f⟩ is positive definite (to see this expand ⟨∑_{n=−N}^{N} aₙU_T^n f, ∑_{m=−N}^{N} aₘU_T^m f⟩, noting that ⟨U_T^n f, U_T^m f⟩ = ⟨U_T^{n−m} f, f⟩).
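The positive-definiteness can also be tested numerically. A sketch, using a hypothetical atomic measure ν = ∑ⱼ wⱼ δ_{zⱼ} on S¹ (the atoms and weights below are arbitrary choices), whose Fourier coefficients are rₙ = ∑ⱼ wⱼ zⱼⁿ:

```python
import cmath, random

# Fourier coefficients of an atomic probability measure on S^1.
thetas = [0.13, 0.41, 0.77]       # atom positions, as fractions of a turn
ws = [0.2, 0.5, 0.3]              # atom masses, total mass 1

def r(n):
    return sum(w * cmath.exp(2j * cmath.pi * n * t) for w, t in zip(ws, thetas))

random.seed(0)
N = 6
a = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2 * N + 1)]
# the Herglotz quadratic form sum_{n,m} r_{n-m} a_n conj(a_m)
q = sum(r(n - m) * a[n + N] * a[m + N].conjugate()
        for n in range(-N, N + 1) for m in range(-N, N + 1))
assert abs(q.imag) < 1e-9 and q.real >= -1e-9       # real and non-negative
assert all(abs(r(-n) - r(n).conjugate()) < 1e-12 for n in range(5))
```

The form is non-negative because it equals ∑ⱼ wⱼ |∑ₙ aₙ zⱼⁿ|².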
Definition 3.6. Suppose (X, B, µ, T) is an invertible ppt, and f ∈ L² \ {0}. The spectral measure of f is the unique measure ν_f on S¹ s.t. ⟨f ∘ Tⁿ, f⟩ = ∫_{S¹} zⁿ dν_f for n ∈ ℤ.
Proposition 3.2. Suppose (X, B, µ, T ) is an invertible ppt, f ∈ L2 \ {0}, and
H f := span{UTn f : n ∈ Z}. Then UT : H f → H f is unitarily equivalent to the op-
erator g(z) 7→ zg(z) on L2 (S1 , B(S1 ), ν f ).
Proof. By the definition of the spectral measure,
$$\Bigl\|\sum_{n=-N}^{N}a_nz^n\Bigr\|_{L^2(\nu_f)}^2=\Bigl\langle\sum_{n=-N}^{N}a_nz^n,\sum_{m=-N}^{N}a_mz^m\Bigr\rangle=\sum_{n,m=-N}^{N}a_n\overline{a_m}\int_{S^1}z^{n-m}\,d\nu_f(z)$$
$$=\sum_{n,m=-N}^{N}a_n\overline{a_m}\,\langle U_T^{n-m}f,f\rangle=\sum_{n,m=-N}^{N}a_n\overline{a_m}\,\langle U_T^nf,U_T^mf\rangle=\Bigl\|\sum_{n=-N}^{N}a_nU_T^nf\Bigr\|_{L^2(\mu)}^2.$$
In particular, if ∑_{n=−N}^{N} aₙU_T^n f = 0 in L²(µ), then ∑_{n=−N}^{N} aₙzⁿ = 0 in L²(ν_f). It follows that W : U_T^n f ↦ zⁿ extends to a linear map from span{U_T^n f : n ∈ ℤ} to L²(ν_f).
The limit g must satisfy ⟨UT g, h⟩ = ⟨λ g, h⟩ for all h ∈ L2 (check!), therefore it must
be an eigenfunction with eigenvalue λ .
Next we claim that g is not constant, and obtain a contradiction to weak mixing:
$$\langle g,f\rangle=\lim_{k\to\infty}\frac{1}{N_k}\sum_{n=0}^{N_k-1}\lambda^{-n}\langle U_T^nf,f\rangle=\lim_{k\to\infty}\frac{1}{N_k}\sum_{n=0}^{N_k-1}\lambda^{-n}\int z^n\,d\nu_f(z)$$
$$=\nu_f\{\lambda\}+\lim_{k\to\infty}\frac{1}{N_k}\sum_{n=0}^{N_k-1}\int_{S^1\setminus\{\lambda\}}\lambda^{-n}z^n\,d\nu_f(z)$$
$$=\nu_f\{\lambda\}+\lim_{k\to\infty}\int_{S^1\setminus\{\lambda\}}\frac{1}{N_k}\cdot\frac{1-\lambda^{-N_k}z^{N_k}}{1-\lambda^{-1}z}\,d\nu_f(z).$$
The limit is equal to zero, because the integrand tends to zero and is uniformly bounded (by one). Thus ⟨g, f⟩ = ν_f{λ} ≠ 0, whence g ≠ const. ⊓⊔
$$\frac{1}{N}\sum_{n=0}^{N-1}\Bigl|\int f\cdot f\circ T^n\,d\mu\Bigr|^2=\frac{1}{N}\sum_{n=0}^{N-1}\int_{S^1}\int_{S^1}z^n\overline{w}^n\,d\nu_f(z)\,d\nu_f(w)=\int_{S^1}\int_{S^1}\Bigl(\frac{1}{N}\sum_{n=0}^{N-1}z^n\overline{w}^n\Bigr)d\nu_f(z)\,d\nu_f(w).$$
The integrand tends to zero and is bounded outside ∆ := {(z, w) : z = w}. If we can show that (ν_f × ν_f)(∆) = 0, then it will follow that
$$\frac{1}{N}\sum_{n=0}^{N-1}\Bigl|\int f\cdot f\circ T^n\,d\mu\Bigr|^2\xrightarrow[N\to\infty]{}0.$$
This is indeed the case: T is weak mixing, so by the previous proposition ν_f is non-atomic, whence (ν_f × ν_f)(∆) = ∫_{S¹} ν_f{w} dν_f(w) = 0 by Fubini–Tonelli.
It remains to note that by the Koopman–von Neumann theorem, for every bounded non-negative sequence aₙ, (1/N)∑_{n=1}^N aₙ² → 0 iff (1/N)∑_{n=1}^N aₙ → 0, because both conditions are equivalent to saying that aₙ converges to zero outside a set of indices of density zero. ⊓⊔
We can now complete the proof of the theorem in the previous section:
Proposition 3.4. If T is a weak mixing (possibly non-invertible) ppt, then for all real-valued f, g ∈ L²,
3.3 The Koopman operator of a Bernoulli scheme 95
$$\frac{1}{N}\sum_{n=0}^{N-1}\Bigl|\int g\cdot f\circ T^n\,d\mu-\int f\,d\mu\int g\,d\mu\Bigr|\xrightarrow[N\to\infty]{}0.\qquad(3.2)$$
Proof. Suppose first that T is invertible. It is enough to consider real-valued f ∈ L² such that ∫ f dµ = 0.
Set S(f) := span({U_T^k f : k ≥ 0} ∪ {1}). Then L² = S(f) ⊕ S(f)^⊥, and:
1. Every g ∈ S(f) satisfies (3.2). To see this note that if g₁, . . . , gₘ satisfy (3.2), then so does g = ∑αₖgₖ for any αₖ ∈ ℝ. Therefore it is enough to check (3.2) for g := U_T^k f and for g = const. Constant functions satisfy (3.2) because for such functions, ∫ g · f ∘ Tⁿ dµ = 0 for all n ≥ 0. Functions of the form g = U_T^k f satisfy (3.2) because for all n > k,
$$\int g\cdot f\circ T^n\,d\mu=\int f\circ T^k\cdot f\circ T^n\,d\mu=\int f\cdot f\circ T^{n-k}\,d\mu\xrightarrow[\mathscr N\not\ni n\to\infty]{}0$$
The statement that every event in the tail σ–algebra of a sequence of independent identically distributed random variables has probability zero or one is called "Kolmogorov's zero–one law."
It follows that B is independent of every element of T^{−|B|}A (the set of B's like that is a sigma-algebra). Thus every cylinder B is independent of every element of ⋂_{n≥1} T^{−n}A. Thus every element of B is independent of every element of ⋂_{n≥1} T^{−n}A (another sigma-algebra argument).
This means that every E ∈ ⋂_{n≥1} T^{−n}A is independent of itself. Thus µ(E) = µ(E)², whence µ(E) = 0 or 1.
$$L^2(X,\mathcal B,\mu)=\{\text{constants}\}\oplus\bigoplus_{n\in\mathbb Z}U_T^n(W)\quad\text{(orthogonal sum)}.$$
{1} ∪ {U_T^n f_λ : n ∈ ℤ, λ ∈ Λ}
Corollary 3.1. All systems with countable Lebesgue spectrum, whence all invertible
Bernoulli schemes, are spectrally isomorphic.
We see that all Bernoulli schemes are spectrally isomorphic. But it is not true that
all Bernoulli schemes are measure theoretically isomorphic. To prove this one needs
new (non-spectral) invariants. Enter the measure theoretic entropy, a new invariant
which we discuss in the next chapter.
Problems
3.1. Suppose (X, B, µ, T ) is an ergodic ppt on a Lebesgue space, and let H(T ) be
its group of eigenvalues.
1. Show that if f is an eigenfunction, then |f| = const. a.e., and that if λ, µ ∈ H(T), then 1, λµ, λ/µ ∈ H(T).
2. Show that eigenfunctions of different eigenvalue are orthogonal. Deduce that
H(T ) is a countable subgroup of the unit circle.
3.2. Prove that the irrational rotation Rα has discrete spectrum, and calculate H(Rα ).
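For problem 3.2, a quick numerical sanity check is possible. With the circle written as ℝ/ℤ (an assumption about the parametrization), the characters e_k(x) = exp(2πikx) are eigenfunctions of the Koopman operator of R_α with eigenvalues exp(2πikα):

```python
import cmath, math

alpha = math.sqrt(2) - 1           # an irrational rotation angle

def e(k, x):
    """Character e_k(x) = exp(2 pi i k x) on R/Z."""
    return cmath.exp(2j * cmath.pi * k * x)

# U_T e_k = exp(2 pi i k alpha) * e_k, checked pointwise
for k in (-3, -1, 1, 2, 5):
    lam = cmath.exp(2j * cmath.pi * k * alpha)
    for x in (0.0, 0.111, 0.75):
        assert abs(e(k, (x + alpha) % 1.0) - lam * e(k, x)) < 1e-12
```

Since the e_k span L², this is exactly why the rotation has discrete spectrum.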
3.4. Here is a sketch of an alternative proof of proposition 3.4, which avoids natural
extensions (B. Parry). Fill in the details.
1. Set H := L², V := ⋂_{n≥0} U_T^n(H), and W := H ⊖ U_T H := {g ∈ H : g ⊥ U_T H}.
2. U_T : V → V has a bounded inverse (hint: use the fact from Banach space theory that any bounded linear operator mapping one Banach space onto another Banach space which is one-to-one has a bounded inverse).
3. (3.2) holds for any f , g ∈ V .
4. if g ∈ UTk W for some k, then (3.2) holds for all f ∈ L2 .
5. if g ∈ V , but f ∈ UTk W for some k, then (3.2) holds for f , g.
6. (3.2) holds for all f , g ∈ L2 .
3.5. Show that every invertible ppt with countable Lebesgue spectrum is mixing,
whence ergodic.
3.6. Suppose (X, B, µ, T) has countable Lebesgue spectrum. Show that {f ∈ L² : ∫ f = 0} is spanned by functions f whose spectral measures ν_f are equal to the Lebesgue measure on S¹.
3.7. Show that any two ppt with countable Lebesgue spectrum are spectrally iso-
morphic.
3.8. Cutting and Stacking and Chacon’s Example
This is an example of a ppt which is weak mixing but not mixing. The example is
a certain map of the unit interval, which preserves Lebesgue’s measure. It is con-
structed using the method of “cutting and stacking” which we now explain.
Let A₀ = [0, 2/3) and R₀ := [2/3, 1] (thought of as a reservoir).
Step 1: Divide A₀ into three equal subintervals of length 2/9. Cut a subinterval B₀ of length 2/9 from the left end of the reservoir.
• Stack the three thirds of A0 one on top of the other, starting from the left and
moving to the right.
• Stick B0 between the second and third interval.
• Define a partial map f₁ by moving points vertically in the stack. The map is defined everywhere except on R₀ \ B₀ and the top floor of the stack. It can be viewed as a partially defined map of the unit interval.
Update the reservoir: R1 := R \ B0 . Let A1 be the base of the new stack (equal to
the rightmost third of A0 ).
Step 2: Cut the stack vertically into three equal stacks. The base of each of these thirds has length (1/3) × (2/9). Cut an interval B₁ of length (1/3) × (2/9) from the left side of the reservoir R₁.
100 3 Spectral theory
• Stack the three stacks one on top of the other, starting from the left and moving
to the right.
• Stick B1 between the second stack and the third stack.
• Define a partial map f₂ by moving points vertically in the stack. This map is defined everywhere except the union of the top floor and R₁ \ B₁.
Update the reservoir: R2 := R1 \ B1 . Let A2 be the base of the new stack (equal to
the rightmost third of A1 ).
Step 3: Cut the stack vertically into three equal stacks. The base of each of these thirds has length (1/3²) × (2/9). Cut an interval B₂ of length (1/3²) × (2/9) from the left side of the reservoir R₂.
• Stack the three stacks one on top of the other, starting from the left and moving
to the right.
• Stick B2 between the second stack and the third stack.
• Define a partial map f₃ by moving points vertically in the stack. This map is defined everywhere except the union of the top floor and R₂ \ B₂.
Update the reservoir: R₃ := R₂ \ B₂. Let A₃ be the base of the new stack (equal to the rightmost third of A₂).
[Figure: steps 1–2 of the construction. Step 1: A₀, R₀, B₀ produce a stack of height ℓ₁ over A₁, with reservoir R₁. Step 2 (cutting, then stacking): B₁ is inserted, producing a stack of height ℓ₂ over A₂, with reservoir R₂.]
the unit interval. Using this identification, we may view fn as partially defined maps
of the unit interval.
1. Show that fn is measure preserving where it is defined (the measure is Lebesgue’s
measure). Calculate the Lebesgue measure of the domain of fn .
2. Show that fn+1 extends fn (i.e. the maps agree on the intersection of their do-
mains). Deduce that the common extension of fn defines an invertible probability
preserving map of the open unit interval. This is Chacon’s example. Denote it by
(I, B, m, T ).
3. Let ℓn denote the height of the stack at step n. Show that the sets {T i (An ) : i =
0, . . . , ℓn , n ≥ 1} generate the Borel σ –algebra of the unit interval.
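The bookkeeping implicit in the construction can be sketched in a few lines (this is my own tally, read off from the description above: each step cuts the base into three, stacks the three columns, and inserts one reservoir interval of the same length as the new base):

```python
from fractions import Fraction

height = 1                       # the initial "stack" is A0 itself
base = Fraction(2, 3)            # |A0| = 2/3
reservoir = Fraction(1, 3)       # |R0| = 1/3
heights = []
for n in range(12):
    base = base / 3              # cut into three equal columns
    reservoir -= base            # B_n has the same length as the new base
    height = 3 * height + 1      # three columns plus the inserted B_n
    heights.append(height)

assert heights[:3] == [4, 13, 40]
# after n steps the reservoir has length 3^-(n+1): it is exhausted in the
# limit, so the stacks eventually fill the whole unit interval
assert reservoir == Fraction(1, 3**13)
```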
3.9. (Continuation) Prove that Chacon’s example is weak mixing using the follow-
ing steps. Suppose f is an eigenfunction with eigenvalue λ .
1. We first show that if f is constant on An for some n, then f is constant every-
where. (An is the base of the stack at step n.)
a. Let ℓk denote the height of the stack at step k. Show that An+1 ⊂ An , and
T ℓn (An+1 ) ⊂ An . Deduce that λ ℓn = 1.
b. Prove that λ ℓn+1 = 1. Find a recursive formula for ℓn . Deduce that λ = 1.
c. The previous steps show that f is an invariant function. Show that any invariant function which is constant on Aₙ is constant almost everywhere.
2. We now consider the case of a general L2 – eigenfunction.
a. Show, using Lusin’s theorem, that there exists an n such that f is nearly con-
stant on most of An . (Hint: part 3 of the previous question).
b. Modify the argument done above to show that any L2 –eigenfunction is con-
stant almost everywhere.
3.10. (Continuation) Prove that Chacon’s example is not mixing, using the following
steps.
1. Inspect the image of the top floor of the stack at step n, and show that for every n
and 0 ≤ k ≤ ℓn−1 , m(T k An ∩ T k+ℓn An ) ≥ 31 m(T k An ).
2. Use problem 3.8 part 3 and an approximation argument to show that for every
Borel set E and ε > 0, m(E ∩ T ℓn E) ≥ 31 m(E) − ε for all n. Deduce that T cannot
be mixing.
Notes to chapter 3
The spectral approach to ergodic theory is due to von Neumann. For a thorough
modern introduction to the theory, see Nadkarni’s book [1]. Our exposition follows
in parts the books by Parry [2] and Petersen [1]. A proof of the discrete spectrum
theorem mentioned in the text can be found in Walters’ book [5]. A proof of Her-
glotz’s theorem is given in [2].
References
We saw at the end of the last chapter that every two Bernoulli schemes are spec-
trally isomorphic (because they have countable Lebesgue spectrum). The question
whether any two Bernoulli schemes are measure theoretically isomorphic was a ma-
jor open problem until it was answered negatively by Kolmogorov and Sinai, who
introduced for this purpose a new invariant: metric entropy. Later, Ornstein proved
that this invariant is complete within the class of Bernoulli schemes: Two Bernoulli
schemes are measure theoretically isomorphic iff their metric entropies are equal.
Proposition 4.1. The only functions ϕ : [0, 1] → ℝ⁺ such that I(A) = ϕ[µ(A)] satisfies the above axioms for all probability spaces (X, B, µ) are ϕ(t) = c ln t with c < 0.
104 4 Entropy
$$H_\mu(\alpha|\beta)=-\sum_{B\in\beta}\mu(B)\sum_{A\in\alpha}\mu(A|B)\log\mu(A|B),\quad\text{where }\mu(A|B)=\frac{\mu(A\cap B)}{\mu(B)}.$$
We state and prove a formula which says that the information content of α ∨ β is the information content of α plus the information content of β given the knowledge of α.
Theorem 4.1 (The Basic Identity). Suppose α, β are measurable countable parti-
tions, and assume Hµ (α), Hµ (β ) < ∞, then
1. Iµ (α ∨ β |F ) = Iµ (α|F ) + Iµ (β |F ∨ σ (α));
2. Hµ (α ∨ β ) = Hµ (α) + Hµ (β |α).
Claim: µ(B|F ∨ σ(α)) = ∑_{A∈α} 1_A · µ(B ∩ A|F)/µ(A|F). Granting the claim,
$$I_\mu(\beta|\mathscr F\vee\sigma(\alpha))=-\sum_{B\in\beta}1_B\log\Bigl(\sum_{A\in\alpha}1_A\frac{\mu(B\cap A|\mathscr F)}{\mu(A|\mathscr F)}\Bigr)=-\sum_{B\in\beta}\sum_{A\in\alpha}1_{A\cap B}\log\frac{\mu(B\cap A|\mathscr F)}{\mu(A|\mathscr F)}$$
$$=-\sum_{B\in\beta}\sum_{A\in\alpha}1_{A\cap B}\log\mu(B\cap A|\mathscr F)+\sum_{A\in\alpha}\sum_{B\in\beta}1_{A\cap B}\log\mu(A|\mathscr F)=I_\mu(\alpha\vee\beta|\mathscr F)-I_\mu(\alpha|\mathscr F).$$
Lemma 4.1. Let ϕ(t) := −t logt, then for every probability vector (p1 , . . . , pn ) and
x1 , . . . , xn ∈ [0, 1] ϕ(p1 x1 + · · · + pn xn ) ≥ p1 ϕ(x1 ) + · · · + pn ϕ(xn ), with equality iff
all the xi with i s.t. pi ̸= 0 are equal.
Proof. This is because ϕ(·) is strictly concave. Let m := ∑ pi xi . If m = 0 then the
lemma is obvious, so suppose m > 0. It is an exercise in calculus to see that ϕ(t) ≤
ϕ(m) + ϕ ′ (m)(t − m) for t ∈ [0, 1], with equality iff t = m. In the particular case
m = ∑ pi xi and t = xi we get
$$\sum_{A\in\alpha}\sum_{B\in\beta}\mu(B)\,\varphi[\mu(A|B)]=\sum_{A\in\alpha}\varphi[\mu(A)].$$
It can be shown that the supremum is attained by finite measurable partitions (prob-
lem 4.9).
Proposition 4.4. The limit which defines hµ (T, α) exists.
Proof. Write αₙ := ⋁_{i=0}^{n−1} T^{−i}α. Then aₙ := H_µ(αₙ) is subadditive, because a_{n+m} := H_µ(α_{n+m}) ≤ H_µ(αₙ) + H_µ(T^{−n}αₘ) = aₙ + aₘ.
We claim that any sequence of numbers {an }n≥1 which satisfies an+m ≤ an + am
converges to a limit (possibly equal to minus infinity), and that this limit is inf[an /n].
Fix n. Then every m can be written m = kn + r with 0 ≤ r ≤ n − 1, so aₘ ≤ kaₙ + a_r and
$$\frac{a_m}{m}\le\frac{ka_n+a_r}{kn+r}\le\frac{a_n}{n}+\frac{a_r}{m},$$
whence lim sup(am /m) ≤ an /n. Since this is true for all n, lim sup am /m ≤ inf an /n.
But it is obvious that lim inf am /m ≥ inf an /n, so the limsup and liminf are equal,
and their common value is inf an /n.
We remark that in our case the limit is not minus infinity, because the H_µ(⋁_{i=0}^{n−1} T^{−i}α) are all non-negative. ⊓⊔
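The convergence aₙ/n → inf aₙ/n is easy to watch numerically. A sketch with the (hypothetical) subadditive sequence aₙ = √n + 2n, for which inf aₙ/n = 2:

```python
import math

a = lambda n: math.sqrt(n) + 2 * n      # subadditive: sqrt is, and 2n is additive

# check subadditivity a_{n+m} <= a_n + a_m on a grid
for i in range(1, 40):
    for j in range(1, 40):
        assert a(i + j) <= a(i) + a(j) + 1e-12

ratios = [a(n) / n for n in (1, 10, 100, 10**4, 10**6)]
assert ratios == sorted(ratios, reverse=True)   # decreasing along this grid
assert abs(ratios[-1] - 2) < 1e-2               # approaching inf a_n / n = 2
```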
H_µ(αₙ) is the average information content in the first n digits of the α–itinerary. Dividing by n gives the average "information per unit time." Thus the entropy measures the maximal rate of information production the system is capable of generating.
It is also possible to think of entropy as a measure of unpredictability. Let's think of T as moving backward in time. Then α₁^∞ := σ(⋃_{n=1}^∞ T^{−n}α) contains the information on the past of the itinerary. Given the future, how unpredictable is the present, on average? This is measured by H_µ(α|α₁^∞).
Theorem 4.2. If H_µ(α) < ∞, then h_µ(T, α) = H_µ(α|α₁^∞), where α₁^∞ = σ(⋃_{i=1}^∞ T^{−i}α).
$$h_\mu(T,\alpha)=\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}H_\mu(\alpha|\alpha_1^k).$$
k=1
But the claim that the integral of the limit is equal to the limit of the integrals requires
justification.
If |α| < ∞, then we can bypass the problem by writing
$$H_\mu(\alpha|\alpha_1^k)=\int\sum_{A\in\alpha}\varphi[\mu(A|\alpha_1^k)]\,d\mu,\quad\text{with }\varphi(t)=-t\log t,$$
and noting that the integrand is bounded (by |α| max ϕ). This allows us to apply the bounded convergence theorem, and deduce that H_µ(α|α₁^k) → H_µ(α|α₁^∞) as k → ∞.
If |α| = ∞ (but Hµ (α) < ∞) then we need to be more clever, and appeal to the
following lemma (proved below):
Lemma 4.2 (Chung–Neveu). Suppose α is a countable measurable partition with
finite entropy, then the function f ∗ := supn≥1 Iµ (α|α1n ) is absolutely integrable.
The result now follows from the dominated convergence theorem. ⊔
⊓
4.3 The Metric Entropy 109
Here is the proof of the Chung–Neveu Lemma. Fix A ∈ α; then we may decompose A ∩ [f* > t] = ⨄_{m≥1} A ∩ Bₘ(t; A), where
Bₘ(t; A) := {x ∈ X : m is the minimal natural number s.t. −log₂ µ(A|α₁ᵐ)(x) > t}.
We have
Summing over m we see that µ(A ∩ [ f ∗ > t]) ≤ 2−t . Of course we also have µ(A ∩
[ f ∗ > t]) ≤ µ(A). Thus µ(A ∩ [ f ∗ > t]) ≤ min{µ(A), 2−t }. R
We now use the following fact from measure theory: if g ≥ 0, then ∫ g dµ = ∫₀^∞ µ[g > t] dt:
$$\int_A f^*\,d\mu=\int_0^\infty\mu(A\cap[f^*>t])\,dt\le\int_0^\infty\min\{\mu(A),2^{-t}\}\,dt$$
$$\le\int_0^{-\log_2\mu(A)}\mu(A)\,dt+\int_{-\log_2\mu(A)}^{\infty}2^{-t}\,dt=-\mu(A)\log_2\mu(A)-\frac{2^{-t}}{\ln 2}\Bigg|_{-\log_2\mu(A)}^{\infty}=-\mu(A)\log_2\mu(A)+\mu(A)/\ln 2.$$
Summing over A ∈ α we get that ∫ f* dµ ≤ H_µ(α) + (ln 2)^{−1} < ∞. ⊓⊔
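The closed-form value of the integral used above can be verified by direct numerical integration (a sanity check, nothing more):

```python
import math

# check: int_0^infty min{p, 2^(-t)} dt = -p log2 p + p / ln 2
def lhs(p, T=60.0, steps=120000):
    """Midpoint-rule approximation of the integral, truncated at t = T."""
    h = T / steps
    return h * sum(min(p, 2.0 ** -(h * (i + 0.5))) for i in range(steps))

for p in (0.5, 0.1, 0.01):
    rhs = -p * math.log2(p) + p / math.log(2)
    assert abs(lhs(p) - rhs) < 1e-3
```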
$$-\frac{1}{n}\log\mu(\alpha_n(x))\xrightarrow[n\to\infty]{}h_\mu(T,\alpha)\quad\text{a.e.}$$
Proof. Let α₀ⁿ := ⋁_{i=0}^{n} T^{−i}α. Recall the basic identity I_µ(α₀^{n−1}) ≡ I_µ(α₁^{n−1} ∨ α) =
$$\lim_{n\to\infty}\frac{1}{n}I_\mu(\alpha_0^n)\equiv\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}I_\mu(\alpha|\alpha_1^k)\circ T^{n-k}$$
$$\stackrel{?}{=}\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}I_\mu(\alpha|\alpha_1^\infty)\circ T^{n-k}\equiv\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}I_\mu(\alpha|\alpha_1^\infty)\circ T^{k}$$
$$=\int I_\mu(\alpha|\alpha_1^\infty)\,d\mu\quad\text{(Ergodic Theorem)}$$
$$=H_\mu(\alpha|\alpha_1^\infty)=h_\mu(T,\alpha).$$
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\int|f_k-f_\infty|\circ T^{n-k}\,d\mu=0.\qquad(4.1)$$
(This implies that the limsup is zero almost everywhere.) Set Fn := supk>n | fk − f∞ |.
Then Fn → 0 almost everywhere. We claim that Fn → 0 in L1 . This is because of
the dominated convergence theorem and the fact that Fn ≤ 2 f ∗ := 2 supm fm ∈ L1
(Chung–Neveu Lemma). Fix some large N, then
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-1}\int|f_{n-k}-f_\infty|\circ T^k\,d\mu=$$
$$=\limsup_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-N-1}\int|f_{n-k}-f_\infty|\circ T^k\,d\mu+\limsup_{n\to\infty}\frac{1}{n}\sum_{k=n-N}^{n-1}\int|f_{n-k}-f_\infty|\circ T^k\,d\mu$$
$$\le\lim_{n\to\infty}\frac{1}{n}\sum_{k=0}^{n-N-1}\int F_N\circ T^k\,d\mu+\limsup_{n\to\infty}\frac{1}{n}\sum_{k=0}^{N-1}\int 2f^*\circ T^k\circ T^{n-N}\,d\mu$$
$$=\int F_N\,d\mu+\limsup_{n\to\infty}\frac{1}{n}\sum_{k=0}^{N-1}\int 2f^*\circ T^k\,d\mu=\int F_N\,d\mu.$$
Proof. We treat the invertible case only, and leave the non-invertible case as an exercise. Fix a finite measurable partition β; we must show that h_µ(T, β) ≤ h_µ(T, α).
Step 1. hµ (T, β ) ≤ hµ (T, α) + Hµ (β |α)
$$\frac{1}{n}H_\mu(\beta_0^{n-1})\le\frac{1}{n}H_\mu(\beta_0^{n-1}\vee\alpha_0^{n-1})=\frac{1}{n}H_\mu(\alpha_0^{n-1})+\frac{1}{n}H_\mu(\beta_0^{n-1}|\alpha_0^{n-1})$$
$$\le\frac{1}{n}\Bigl[H_\mu(\alpha_0^{n-1})+\sum_{k=0}^{n-1}H_\mu(T^{-k}\beta|\alpha_0^{n-1})\Bigr]\quad(\because H_\mu(\xi\vee\eta|\mathscr G)\le H_\mu(\xi|\mathscr G)+H_\mu(\eta|\mathscr G))$$
$$\le\frac{1}{n}\Bigl[H_\mu(\alpha_0^{n-1})+\sum_{k=0}^{n-1}H_\mu(T^{-k}\beta|T^{-k}\alpha)\Bigr]=\frac{1}{n}H_\mu(\alpha_0^{n-1})+H_\mu(\beta|\alpha).$$
Now pass to the limit.
Step 2. For every N, h_µ(T, β) ≤ h_µ(T, α) + H_µ(β|α_{−N}^N).
Repeat the previous step with α_{−N}^N instead of α, and check that h_µ(T, α_{−N}^N) = h_µ(T, α).
Step 3. H_µ(β|α_{−N}^N) → H_µ(β|B) = 0 as N → ∞.
$$H_\mu(\beta|\alpha_{-N}^N)=\int I_\mu(\beta|\alpha_{-N}^N)\,d\mu=-\sum_{B\in\beta}\int 1_B\log\mu(B|\alpha_{-N}^N)\,d\mu=-\sum_{B\in\beta}\int\mu(B|\alpha_{-N}^N)\log\mu(B|\alpha_{-N}^N)\,d\mu$$
$$=\sum_{B\in\beta}\int\varphi[\mu(B|\alpha_{-N}^N)]\,d\mu\ \xrightarrow[N\to\infty]{}\ \sum_{B\in\beta}\int\varphi[\mu(B|\mathcal B)]\,d\mu=0,$$
Proposition 4.5. Let T be a Borel map on a compact metric space (X, d), and sup-
pose αn is a sequence of finite Borel partitions such that diam(αn ) −−−→ 0. Then for
n→∞
every T -invariant Borel probability measure µ, hµ (T ) = limn→∞ hµ (T, αn ).
Proof. We begin, as in the proof of Sinai’s generator theorem, with the inequality
Proof. Fix δ , to be determined later. Write β = {B1 , . . . , Bm } and find compact sets
Ki and open sets Ui such that Ki ⊂ Bi ⊂ Ui and ∑ µ(Ui \ Ki ) < δ . Find δ ′ > 0 so
small that dist (Ki , K j ) > δ ′ for all i ̸= j, and Nδ ′ (Ki ) ⊂ Ui .
Fix N so that n > N ⇒ diam(αn ) < δ ′ /2. By choice of δ ′ , every A ∈ αn can
intersect at most one Ki . This gives rise to a partition γ = {C1 , . . . ,Cm+1 } where
$$C_i:=\bigcup\{A\in\alpha_n:A\cap K_i\neq\varnothing\}\ (i=1,\dots,m),\qquad C_{m+1}:=X\setminus\bigcup_{i=1}^{m}C_i.$$
We now use Claim 2 to choose δ so that ∑_{i=1}^{m+1} µ(Bᵢ △ Cᵢ) < 2δ ⇒ H_µ(β|γ) < ε. Since γ ≥ αₙ, H_µ(β|αₙ) < ε for all n > N.
Step 3. Proof of the proposition: Pick nk → ∞ such that hµ (T, αnk ) → lim inf hµ (T, αn ).
For every finite partition β , hµ (T, β ) ≤ hµ (T, αnk ) + Hµ (β |αnk ).
By step 1, H_µ(β|α_{n_k}) → 0 as k → ∞, so h_µ(T, β) ≤ lim inf_{n→∞} h_µ(T, αₙ). Passing to the supremum over all finite partitions β, we find that
4.4 Examples
Proposition 4.6. The entropy of the Bernoulli shift with probability vector p is −∑ pᵢ log pᵢ. Thus the (1/3, 1/3, 1/3)–Bernoulli scheme and the (1/2, 1/2)–Bernoulli scheme are not isomorphic.
Proof. The partition into 1-cylinders α = {[i]} is a strong generator, and H_µ(α₀^{n−1}) = −∑_{x₀,...,x_{n−1}} p_{x₀}···p_{x_{n−1}} (log p_{x₀} + ··· + log p_{x_{n−1}}) = −n ∑ pᵢ log pᵢ. ⊓⊔
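The two numerical values in the proposition are trivial to compute (base 2 here; the choice of base only rescales entropy by a constant):

```python
import math

def bernoulli_entropy(p):
    """Entropy -sum p_i log2 p_i of a Bernoulli scheme with vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

h_third = bernoulli_entropy([1/3, 1/3, 1/3])
h_half = bernoulli_entropy([1/2, 1/2])
assert abs(h_third - math.log2(3)) < 1e-12
assert abs(h_half - 1.0) < 1e-12
assert h_third != h_half        # so the two schemes cannot be isomorphic
```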
Proposition 4.7. The irrational rotation has entropy zero w.r.t. the Haar measure.
We now claim that α := {A0 , A1 } (the two halves of the circle) is a strong generator.
It is enough to show that for every ε, ⋃_{n≥1} α₀^{n−1} contains covers of the circle by open arcs of diameter < ε (because this forces α₀^∞ to contain all open sets).
It is enough to manufacture one arc of diameter less than ε, because the transla-
tions of this arc by kα will eventually cover the circle.
But such an arc necessarily exists: choose some n s.t. nα mod 1 ∈ (0, ε). Then A₁ \ T^{−n}A₁ = A₁ \ (A₁ − nα) is an arc of diameter less than ε.
Proposition 4.8. Suppose µ is a shift invariant Markov measure with transition ma-
trix P = (pi j )i, j∈S and probability vector (pi )i∈S . Then hµ (σ ) = − ∑ pi pi j log pi j .
$$=-\sum_{j=0}^{n-1}\ \sum_{\xi_0,\dots,\xi_n\in S}p_{\xi_0}p_{\xi_0\xi_1}\cdots p_{\xi_{j-1}\xi_j}\cdot p_{\xi_{j+1}\xi_{j+2}}\cdots p_{\xi_{n-1}\xi_n}\,p_{\xi_j\xi_{j+1}}\log p_{\xi_j\xi_{j+1}}$$
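The Markov entropy formula is easy to evaluate for a concrete chain. A sketch with an arbitrary two-state transition matrix (my illustrative choice), using the standard closed form for the stationary vector of a 2-state chain:

```python
import math

P = [[0.9, 0.1],
     [0.4, 0.6]]                 # a sample stochastic matrix
p01, p10 = P[0][1], P[1][0]
pi = [p10 / (p01 + p10), p01 / (p01 + p10)]   # stationary: pi P = pi
assert abs(sum(pi[i] * P[i][0] for i in range(2)) - pi[0]) < 1e-12

# h_mu(sigma) = -sum_i pi_i sum_j p_ij log p_ij  (base 2 here)
h = -sum(pi[i] * P[i][j] * math.log2(P[i][j])
         for i in range(2) for j in range(2))
assert 0 < h < 1   # strictly between 0 and the entropy of the full 2-shift
```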
Proof. One checks that the elements of α0n−1 are all intervals of length O(λ −n ).
Therefore α is a strong generator, whence
Z
hµ (T ) = hµ (T, α) = Hµ (α|α1∞ ) = Iµ (α|α1∞ )dµ.
We calculate Iµ (α|α1∞ ).
First note that α1∞ = T −1 (α0∞ ) = T −1 B, thus Iµ (α|α1∞ ) = − ∑ 1A log µ(A|T −1 B).
A∈α
We need to calculate E(·|T −1 B). For this purpose, introduce the operator Tb :
dµ
L → L1 given by (Tb f )(x) = ∑Ty=x dµ◦T
1 (y) f (y).
Proof. (Scheller) We prove the theorem in the case when T is invertible. The non-
invertible case is handled by passing to the natural extension.
The idea is to show, for as many partitions α as possible, that hµ (T, α) =
µ(A)hµA (TA , α ∩ A), where α ∩ A := {E ∩ A : E ∈ α}. As it turns out, this is the case
for all partitions s.t. (a) Hµ (α) < ∞; (b) Ac ∈ α; and (c) ∀n, TA [ϕA = n] ∈ σ (α).
Here, as always, ϕA (x) := 1A (x) inf{n ≥ 1 : T n x ∈ A} (the first return time).
To see that there are such partitions, we let
(the coarsest possible) and show that Hµ (ξA ) < ∞. A routine calculation shows that
Hµ (ξA ) = Hµ ({A, Ac }) + µ(A)HµA (TA ηA ) ≤ 1 + HµA (ηA ). It is thus enough to show
that − ∑ pn log2 pn < ∞, where pn := µA [ϕA = n]. This is because ∑ npn = 1/µ(A)
4.5 Abramov’s Formula 117
(Kac formula) and the following fact from calculus: probability vectors with finite
expectations have finite entropy.2
Assume now that α is a partition which satisfies (a)–(c) above. We will use
throughout the following fact:
$$h_\mu(T,\alpha)=H_\mu(\alpha|\alpha_1^\infty)=-\sum_{B\in A\cap\alpha}\int 1_B\log\mu(B|\alpha_1^\infty)\,d\mu-\int 1_{A^c}\log\mu(A^c|\alpha_1^\infty)\,d\mu$$
$$=-\sum_{B\in A\cap\alpha}\int 1_B\log\mu(B|\alpha_1^\infty)\,d\mu,\quad\text{because }A^c\in(\xi_A)_1^\infty\subseteq\alpha_1^\infty$$
$$=-\mu(A)\sum_{B\in A\cap\alpha}\int_A 1_B\log\mu_A(B|\alpha_1^\infty)\,d\mu_A,$$
This implies that hµ (T, α) = µ(A)hµA (TA , A ∩ α). Passing to the supremum over all
α which contain Ac as an atom, we obtain
this list, and so B = ⋂_{k=1}^{N−1} T_A^{−k}(A_{j_k} ∩ [ϕ_A = j_{k+1} − j_k]) ∩ T_A^{−N}[ϕ_A > n − j_N]. Since η_A ≤ α ∩ A, B ∈ ⋁_{i=1}^∞ T_A^{−i}(α ∩ A).
Proof of ⊇: T_A^{−1}(α ∩ A) ≤ A ∩ ⋁_{i=1}^∞ T^{−i}α, because if B ∈ α ∩ A, then
$$T_A^{-1}B=\bigcup_{n=1}^{\infty}T^{-n}(B\cap T_A[\varphi_A=n])\in\bigvee_{n=1}^{\infty}T^{-n}\alpha\quad(\because T_A\eta_A\le\xi_A\le A\cap\alpha).$$
The limit exists because of the subadditivity of N(·): an := log N(U0n−1 ) satisfies
am+n ≤ am + an , so lim an /n exists.
Proof. Eventually everything boils down to the following inequality, which can be checked using Lagrange multipliers: for every probability vector (p₁, . . . , p_k),
$$-\sum_{i=1}^{k}p_i\log_2 p_i\ \le\ \log_2 k,\qquad(4.4)$$
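Inequality (4.4) can be probed numerically: random probability vectors never exceed the bound, and the uniform vector attains it.

```python
import math, random

def H(p):
    """-sum p_i log2 p_i for a probability vector p."""
    return -sum(x * math.log2(x) for x in p if x > 0)

random.seed(1)
k = 7
for _ in range(200):
    w = [random.random() for _ in range(k)]
    s = sum(w)
    p = [x / s for x in w]              # a random probability vector
    assert H(p) <= math.log2(k) + 1e-12

assert abs(H([1 / k] * k) - math.log2(k)) < 1e-12   # equality at uniform
```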
Let B₀ := X \ ⋃_{j=1}^{k} B_j be the remainder (of measure less than kε), and define β := {B₀; B₁, . . . , B_k}.
Step 1 in the proof of Sinai’s theorem says that hµ (T, α) ≤ hµ (T, β ) + Hµ (α|β ).
We claim that Hµ (α|β ) can be made uniformly bounded by a suitable choice of
ε = ε(α):
$$B_0\cup B_j=B_0\cup(A_j\setminus B_0)\quad(\because A_j\cap B_0=A_j\setminus B_j)$$
$$=B_0\cup A_j=B_0\cup\Bigl(X\setminus\bigcup_{i\neq j}A_i\Bigr)=B_0\cup\Bigl(X\setminus\bigcup_{i\neq j}B_i\Bigr).$$
Definition 4.7.
1. sₙ(ε, T) := max{#(F) : F is (n, ε)–separated}.
2. s(ε, T) := lim sup_{n→∞} (1/n) log sₙ(ε, T).
3. h_top(T) := lim_{ε→0⁺} s(ε, T).
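A crude numerical illustration of these definitions (not from the notes): greedily build an (n, ε)-separated set for the doubling map Tx = 2x mod 1 over a finite grid. This only bounds sₙ(ε, T) from below, but the growth rate per iterate is already suggestive of h_top = log 2:

```python
import math

def circle_dist(x, y):
    d = abs(x - y) % 1.0
    return min(d, 1.0 - d)

def dn(x, y, n):
    """Bowen metric d_n(x, y) = max over 0 <= i < n of d(T^i x, T^i y)."""
    best = 0.0
    for _ in range(n):
        best = max(best, circle_dist(x, y))
        x, y = (2 * x) % 1.0, (2 * y) % 1.0
    return best

eps, n = 0.24, 6
F = []
for k in range(4096):               # greedy maximal separated subset of a grid
    x = k / 4096
    if all(dn(x, y, n) > eps for y in F):
        F.append(x)

rate = math.log2(len(F)) / n        # (1/n) log2 #F, about 1 bit per iterate
assert rate > 0.8
```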
Proof. Suppose U is an open cover all of whose elements have diameters less than ε. We claim that N(U₀^{n−1}) ≥ sₙ(ε, T) for all n. To see this suppose F is an (n, ε)–separated set of maximal cardinality. Each x ∈ F is contained in some Uₓ ∈ U₀^{n−1}. Since the d–diameter of every element of U is less than ε, the dₙ–diameter of every element of U₀^{n−1} is less than ε. Thus the assignment x ↦ Uₓ is one-to-one, whence N(U₀^{n−1}) ≥ sₙ(ε, T).
N(U0n−1 ) ≤ sn (ε).
We just proved that for every open cover U with Lebesgue number at least ε,
htop (T, U ) ≤ s(ε). It follows that
We now pass to the limit ε → 0+ . The left hand side tends to the supremum over all
open covers, which is equal to htop (T ). We obtain htop (T ) ≤ lim s(ε). ⊔
⊓
ε→0+
Corollary 4.1. If T is an isometry, then all its invariant probability measures have entropy zero.
Proof. If T is an isometry, then d_n = d for all n; therefore every (n, ε)–separated set is (1, ε)–separated, and s_n(ε, T) = s_1(ε, T) for all n. Hence s(ε, T) = 0 for all ε > 0, and the theorem gives h_top(T) = 0. The corollary follows from Goodwyn's theorem. ⊓⊔
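The mechanism of the proof can be seen numerically for an irrational rotation of the circle (our choice of isometry; the example is not in the text): the Bowen metrics d_n all coincide with d, so separated sets cannot grow with n.

```python
import math, random

def d(x, y):
    """Arc-length metric on the circle R/Z."""
    t = abs(x - y) % 1.0
    return min(t, 1.0 - t)

alpha = math.sqrt(2) - 1             # an irrational rotation number
R = lambda x: (x + alpha) % 1.0      # R is an isometry of the circle

def dn(n, x, y):
    """Bowen metric for the rotation R."""
    best = 0.0
    for _ in range(n):
        best = max(best, d(x, y))
        x, y = R(x), R(y)
    return best

# d_n = d for every n, so s_n(eps, R) does not grow and s(eps, R) = 0:
for _ in range(1000):
    x, y, n = random.random(), random.random(), random.randint(1, 20)
    assert abs(dn(n, x, y) - d(x, y)) < 1e-9
```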
The following theorem was first proved under additional assumptions by Dinaburg,
and then in the general case by Goodman. The proof below is due to Misiurewicz.
Proof. We have already seen that the topological entropy is an upper bound for the
metric entropies. We just need to show that this is the least upper bound.
Fix ε, and let F_n be a sequence of (n, ε)–separated sets of maximal cardinality (so #F_n = s_n(ε, T)). Let
$$\nu_n:=\frac{1}{\#F_n}\sum_{x\in F_n}\delta_x,$$
where δ_x denotes the Dirac measure at x (i.e. δ_x(E) = 1_E(x)). These measures are not invariant, so we set
$$\mu_n:=\frac{1}{n}\sum_{k=0}^{n-1}\nu_n\circ T^{-k}.$$
Any weak star limit of µn will be T –invariant (check).
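To do the "check": for any continuous f, ∫f∘T dµ_n − ∫f dµ_n = (1/n)(∫f∘T^n dν_n − ∫f dν_n), which is at most 2∥f∥_∞/n in absolute value, so every weak-star limit point is invariant. A numerical sketch of this telescoping bound, with an arbitrary atomic measure ν and the map Tx = 3x mod 1 standing in for T (both our choices, not from the text):

```python
import math, random

T = lambda x: (3.0 * x) % 1.0
f = lambda x: math.cos(2 * math.pi * x)       # a test function with sup|f| = 1

pts = [random.random() for _ in range(300)]   # nu = uniform measure on these points

def integral_mu_n(g, n):
    """Integral of g against mu_n = (1/n) sum_{k<n} nu o T^{-k}."""
    total = 0.0
    for x in pts:
        y = x
        for _ in range(n):
            total += g(y)
            y = T(y)
    return total / (len(pts) * n)

n = 200
diff = integral_mu_n(lambda x: f(T(x)), n) - integral_mu_n(f, n)
# telescoping: |int f o T dmu_n - int f dmu_n| <= 2 sup|f| / n
assert abs(diff) <= 2.0 / n + 1e-9
```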
w∗ 1
Fix some sequence nk → ∞ s.t. µnk −−−→ µ and s.t. nk log snk (ε, T ) −−−→ s(ε, T ).
k→∞ n→∞
We show that the entropy of µ is at least s(ε, T ). Since s(ε, T ) −−−→ htop (T ), this
ε→0+
will prove the theorem.
Let α = {A1 , . . . , AN } be a measurable partition of X s.t. (1) diam(Ai ) < ε; and
(2) µ(∂ Ai ) = 0. (Such a partition can be generated from a cover of X by balls of
radius less than ε/2 and boundary of zero measure.) It is easy to see that the dn –
diameter of α0n−1 is also less than ε. It is an exercise to see that every element of
α0n−1 has boundary with measure µ equal to zero.
4.7 Ruelle’s inequality 123
We calculate H_{ν_n}(α_0^{n−1}). Since F_n is (n, ε)–separated and every atom of α_0^{n−1} has d_n–diameter less than ε, α_0^{n−1} has #F_n elements whose ν_n–measure is 1/#F_n, and the other elements of α_0^{n−1} have measure zero. Thus H_{ν_n}(α_0^{n−1}) = log₂ #F_n = log₂ s_n(ε, T).
We now “play” with H_{ν_n}(α_0^{n−1}) with the aim of bounding it by something involving a sum of the form ∑_{i=0}^{n−1} H_{ν_n∘T^{−i}}(α_0^{q−1}). Fix q and j ∈ {0, …, q−1}; then
$$\log_2 s_n(\varepsilon,T)=H_{\nu_n}(\alpha_0^{n-1})\le H_{\nu_n}\Big(\alpha_0^{j-1}\vee\bigvee_{i=0}^{[n/q]-1}T^{-(qi+j)}\alpha_0^{q-1}\vee\alpha_{q([n/q]-1)+j+1}^{n-1}\Big)$$
$$\le\sum_{i=0}^{[n/q]-1}H_{\nu_n\circ T^{-(qi+j)}}(\alpha_0^{q-1})+2q\log_2\#\alpha.$$
It follows that
$$q\log_2 s_n(\varepsilon,T)\le n\cdot\frac{1}{n}\sum_{k=0}^{n-1}H_{\nu_n\circ T^{-k}}(\alpha_0^{q-1})+2q\log_2\#\alpha\le nH_{\mu_n}(\alpha_0^{q-1})+2q\log_2\#\alpha,$$
whence
$$\frac{1}{n_k}\log_2 s_{n_k}(\varepsilon,T)\le\frac{1}{q}H_{\mu_{n_k}}(\alpha_0^{q-1})+\frac{2}{n_k}\log_2\#\alpha,\tag{4.6}$$
where nk → ∞ is the subsequence chosen above.
Since every A ∈ α_0^{n−1} satisfies µ(∂A) = 0, we have µ_{n_k}(A) → µ(A) as k → ∞ for all A ∈ α_0^{n−1}.
It follows that H_{µ_{n_k}}(α_0^{q−1}) → H_µ(α_0^{q−1}) as k → ∞. Passing to the limit k → ∞ in (4.6), we get s(ε, T) ≤ (1/q) H_µ(α_0^{q−1}), and letting q → ∞, s(ε, T) ≤ h_µ(T, α) ≤ h_µ(T). Thus h_µ(T) ≥ s(ε, T). Since s(ε, T) → h_top(T) as ε → 0⁺, the theorem is proved. ⊓⊔
Proposition 4.9. Let e^{χ₁} ≥ ⋯ ≥ e^{χ_d} denote the singular values of an invertible matrix A, listed with multiplicity. The ellipsoid {Av : v ∈ ℝᵈ, ∥v∥₂ ≤ 1} can be inscribed in an orthogonal box with sides 2e^{χ₁}, …, 2e^{χ_d}.
Proof. The matrix √(AᵗA) is symmetric and positive definite, therefore it can be diagonalized orthogonally. Let u₁, …, u_d be an orthonormal basis of eigenvectors so that √(AᵗA) u_i = e^{χ_i} u_i. Using this basis we can represent
$$\big\{Av : v\in\mathbb{R}^d,\ \|v\|_2\le 1\big\}\equiv\Big\{\sum_{i=1}^{d}\alpha_i Au_i : \sum_{i=1}^{d}\alpha_i^2\le 1\Big\}\subset\Big\{\sum_{i=1}^{d}\alpha_i Au_i : |\alpha_i|\le 1\Big\}.$$
The last set is an orthogonal box as requested, because the Au_i are orthogonal vectors with lengths e^{χ_i}: ⟨Au_i, Au_j⟩ = ⟨AᵗAu_i, u_j⟩ = e^{2χ_i}⟨u_i, u_j⟩ = e^{2χ_i}δ_ij.
For v = ∑_{j=1}^{i} β_j u_j in V₀ := span{u₁, …, u_i} with ∥v∥ = 1,
$$\|Av\|^2=\langle A^tAv,v\rangle=\sum_{j=1}^{i}e^{2\chi_j}\beta_j^2\ \ge\ e^{2\chi_i}\sum_{j=1}^{i}\beta_j^2=e^{2\chi_i}.$$
The lower bound is achieved for v = u_i, so min_{v∈V₀, ∥v∥=1} ∥Av∥ = e^{χ_i}, proving that
$$\max_{V\subset\mathbb{R}^d,\ \dim V=i}\ \min_{v\in V,\ \|v\|=1}\|Av\|\ \ge\ \min_{v\in V_0,\ \|v\|=1}\|Av\|=e^{\chi_i}.\qquad\sqcup\!\sqcap$$
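Both facts — the restriction of A to the span of the top i singular directions, and the minimax characterization of the i-th singular value — can be checked with numpy (a numerical illustration of ours; the matrix is random):

```python
import numpy as np

rng = np.random.default_rng(0)
d, i = 6, 3
A = rng.standard_normal((d, d))      # invertible with probability 1

# singular values s_1 >= ... >= s_d; the rows of Vt are the vectors u_1, ..., u_d
U, s, Vt = np.linalg.svd(A)
Vi = Vt[:i].T                        # orthonormal basis of V0 = span{u_1, ..., u_i}

# A restricted to V0 has singular values s_1 >= ... >= s_i, so the minimum of
# ||Av|| over unit vectors v in V0 is exactly s_i, attained at v = u_i:
restricted = np.linalg.svd(A @ Vi, compute_uv=False)
assert np.allclose(restricted, s[:i])
assert np.isclose(np.linalg.norm(A @ Vt[i - 1]), s[i - 1])
```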
such that diam(α_k) → 0 as k → ∞. So h_µ(T, α_k) → h_µ(T, α). Note that it is still the case that B(y_i, ε/k) ⊂ A_i ⊂ B(y_i, 2ε/k).
Step 2. h_µ(f, α_k) ≤ ∑_{A∈α_k} µ(A) (1/n) log₂ K_n(A, k), where K_n(A, k) := #{A′ ∈ α_k : A′ ∩ f^n(A) ≠ ∅}. The aim is to bound K_n(A, k), for all y ∈ A, in terms of the singular values s_i(DF_y^n) of the derivative matrix of f^n at y, expressed in coordinates.
We first explain the method on the heuristic level, and then develop the rigorous
implementation. Suppose n is fixed.
Observe that f^n(A) has diameter at most Lip(f^n) diam(α_k) ≤ Lip(f^n) · 2ε/k, and therefore f^n(A) and all the A′ which intersect it are included in a ball B of radius (Lip(f^n) + 1)ε/k. If ε is very small, this ball lies inside a single coordinate chart ψ₁(ℝᵈ). Similarly, A lies inside a ball of radius diam(α_k) ≤ 2ε/k, which for ε sufficiently small lies inside a coordinate chart ψ₂(ℝᵈ). This allows us to express f^n|_{f^{−n}(B)} in coordinates: F^n := ψ₁^{−1} ∘ f^n ∘ ψ₂ : ℝᵈ → ℝᵈ.
If ε is sufficiently small, F n is very close to its linearization. Suppose for the
purpose of the explanation that F n is precisely equal to its linearization on f −n (B).
From now on we abuse notation and identify f n with its derivative matrix DF n in
coordinates.
Since A is contained in a ball of radius ε/k, DF^n(A) is contained in a box with orthogonal sides whose lengths are 2(ε/k) times the singular values
$$e^{\chi_1(DF^n)}\ge\cdots\ge e^{\chi_d(DF^n)}.$$
The union of the A′ which intersect DF^n(A) is therefore contained in a set of Lebesgue volume at most
$$\mathrm{const.}\ (\varepsilon/k)^d\prod_{i=1}^{d}\big(e^{\chi_i(DF^n)}+1\big).$$
By construction, the A′_i contain pairwise disjoint balls of radius ε/2k, hence of volume const. (ε/k)ᵈ. Thus their number is at most const. ∏_{χ_i(DF^n)≥0}(e^{χ_i(DF^n)} + 1), where the constant comes from the volume of the unit ball in ℝᵈ.
We now explain how to implement this idea when f n is not linear. The details are
tedious, but are all routine.
We begin by recalling some facts from differential topology. Recall that M is a
compact smooth Riemannian manifold of dimension d. Let Tx M denote the tangent
space at x, and let T M denote the tangent bundle. Since M is a Riemannian manifold,
every Tx M comes equipped with an inner product ⟨·, ·⟩x and a norm ∥ · ∥x . We denote
tangent vectors in Tx M by ⃗v, and vectors in Rd by v.
Let expx : Tx M → M denote the exponential map, defined by expx (⃗0) := x and
Here we identified the linear maps ϑx : Tx M → Rd with their derivatives at ⃗0, and
(D f n )x : Tx M → T f n (x) M is the derivative of f n at x.
Claim. For every ε > 0 there is a κ < κ₃ such that for all x ∈ M and ∥t∥ < κ,
$$d\big(F^n_x(t),(DF^n_x)_0\,t\big)<\varepsilon\|t\|.$$
Proof. Write G = F^n_x, fix t, and let g_t(s) := G(st) ∈ ℝᵈ. Since g_t(0) = F^n_x(0) = 0,
$$\|G(t)-DG_0\,t\|=\|g_t(1)-g_t(0)-g_t'(0)\|\le\Big\|\int_0^1 g_t'(s)\,ds-g_t'(0)\Big\|\le\int_0^1\|g_t'(s)-g_t'(0)\|\,ds\le\|t\|\sup_{\|s_1-s_2\|\le\|t\|}\|(DF^n_x)_{s_1}-(DF^n_x)_{s_2}\|.$$
Since f is C1 and ϑ f n (x) , ϑx are isometries, the supremum can be made smaller than
ε by choosing κ sufficiently small.
128 4 Entropy
Claim. Let s_i(Df^n_x) := the i-th singular value of ϑ_{f^n(x)} ∘ Df^n_x ∘ ϑ_x^{−1}. Then there exists κ₄ such that |s_i(Df^n_y) − s_i(Df^n_z)| < 1 whenever d(y, z) < κ₄.
Proof. By the minimax theorem and the fact that ϑ_x, ϑ_{f^n(x)} are isometries, s_i(Df^n_x) can be expressed through the quantities ∥(Df^n)_x ⃗v∥_{f^n(x)}. The C¹–smoothness of f implies that (x, ⃗v) ↦ ∥(Df^n)_x ⃗v∥_{f^n(x)} is uniformly continuous. From this it is routine to deduce that x ↦ s_i(Df^n_x) is uniformly continuous.
We can now begin the estimation of K_n(A, k). Recall that A is contained in some ball B(y_i, ε/k) of radius ε/k. Assume k is so large that (Lip(f^n) + 1)ε/k < κ₃. Using the bound Lip(exp_x^{−1}) ≤ 2, we can write A and A′ in coordinates as follows:
$$A=(\exp_{y_i}\circ\,\vartheta_{y_i}^{-1})(B),\quad\text{where } B\subset B(0,2\varepsilon/k)\subset\mathbb{R}^d;$$
$$A'=(\exp_{f^n(y_i)}\circ\,\vartheta_{f^n(y_i)}^{-1})(B'),\quad\text{where } \operatorname{diam}(B')\le 2\operatorname{diam}(A')\le 4\varepsilon/k.$$
where ρ := 4ε/k.
Claim. Choose k so large that 10ε/k < 1, and suppose ε < 1/2. Then the Lebesgue volume of N_ρ(F^n_{y_i}(B)) is less than 100ᵈ (ε/k)ᵈ ∏_{i=1}^{d} (1 + s_i(ϑ_{f^n(y_i)} Df^n_{y_i} ϑ_{y_i}^{−1})).
balls are disjoint, because the A′ are disjoint. We find that K_n(A, k) is bounded by the maximal number of pairwise disjoint balls of radius ε/4k which can be fit inside a set of volume 100ᵈ (ε/k)ᵈ ∏_{i=1}^{d}(1 + s_i(ϑ_{f^n(y_i)} Df^n_{y_i} ϑ_{y_i}^{−1})). Thus
$$K_n(A,k)\le C\prod_{i=1}^{d}\big[1+s_i(Df^n_{y_i})\big],$$
and
$$|s_i(Df^n_y)-s_i(Df^n_{y_i})|<\tfrac{1}{2}\quad\text{for all } y\in A\subset B(y_i,\varepsilon/k).$$
Passing to the limit k → ∞ gives us that
$$h_\mu(f)\le\sum_{i=1}^{d}\frac{1}{n}\int\log_2\big[1+s_i(Df^n_y)\big]\,d\mu(y)+\frac{1}{n}\log C,$$
whence
$$h_\mu(f)\le\lim_{n\to\infty}\sum_{i=1}^{d}\frac{1}{n}\int\log_2\big[1+s_i(Df^n_y)\big]\,d\mu(y).$$
We analyze the last limit. Let A(x) := ϑ_{f(x)} ∘ Df_x ∘ ϑ_x^{−1}. By the chain rule, ϑ_{f^n(x)} ∘ Df^n_x ∘ ϑ_x^{−1} = A(f^{n−1}(x)) ⋯ A(f(x)) A(x), and therefore s_i(Df^n_x) = the i-th singular value of A(f^{n−1}(x)) ⋯ A(f(x)) A(x). By the multiplicative ergodic theorem, s_i(Df^n_x)^{1/n} → e^{χ_i(x)} as n → ∞, where χ₁(x) ≥ ⋯ ≥ χ_d(x) are the Lyapunov exponents of A(x). Necessarily,
$$\frac{1}{n}\log_2\big[1+s_i(Df^n_x)\big]\xrightarrow[n\to\infty]{}\begin{cases}\chi_i(x)/\ln 2, & \chi_i(x)>0\\ 0, & \chi_i(x)\le 0.\end{cases}$$
Since (1/n) log₂[1 + s_i(Df^n_y)] is uniformly bounded (by (ln 2)^{−1} sup_y ∥Df_y∥), we have by the bounded convergence theorem that
$$h_\mu(f)\le\sum_{i=1}^{d}\lim_{n\to\infty}\frac{1}{n}\int\log_2\big[1+s_i(Df^n_y)\big]\,d\mu(y)=\frac{1}{\ln 2}\int\sum_{i:\chi_i(x)>0}\chi_i(x)\,d\mu.$$
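The convergence s_i(Df^n_x)^{1/n} → e^{χ_i(x)} can be observed numerically with the standard QR method for Lyapunov exponents, sketched below for a constant cocycle A(x) ≡ A with eigenvalues 2 and 1/2, so χ₁ = ln 2 and χ₂ = −ln 2 (the matrix is a hypothetical example of ours, not from the text):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 0.5]])          # eigenvalues 2 and 1/2 => exponents +-ln 2

# QR method: push an orthonormal frame through the cocycle and average
# log|diag(R)|; the i-th average converges to the i-th Lyapunov exponent.
n = 2000
Q = np.eye(2)
acc = np.zeros(2)
for _ in range(n):
    Q, R = np.linalg.qr(A @ Q)
    acc += np.log(np.abs(np.diag(R)))
chi = acc / n

assert abs(chi[0] - np.log(2)) < 1e-6
assert abs(chi[1] + np.log(2)) < 1e-6
```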
Problems
4.1. Prove: H_µ(α|β) = −∑_{B∈β} µ(B) ∑_{A∈α} µ(A|B) log µ(A|B), where µ(A|B) = µ(A∩B)/µ(B).
induces a metric on P.
4.8. Let (X, B, µ, T ) be a ppt. Show that |hµ (T, α) − hµ (T, β )| ≤ Hµ (α|β ) +
Hµ (β |α).
4.9. Use the previous problem to show that
4.10. Suppose α = {A₁, …, A_n} is a finite measurable partition. Show that for every ε > 0 there exists δ = δ(ε, n) such that if β = {B₁, …, B_n} is a measurable partition with µ(A_i △ B_i) < δ for all i, then ρ(α, β) < ε.
4.12. Show that the entropy of the product of two ppt is the sum of their entropies.
4.13. Show that htop (T n ) = nhtop (T ).
Notes to chapter 4
Every compact metric space is Polish. But a Polish space need not be compact, or even locally compact. For example,
$$\mathbb{N}^{\mathbb{N}}:=\{x=(x_1,x_2,x_3,\dots) : x_k\in\mathbb{N}\}$$
Proof. Since X is separable, it contains a countable dense set {xn }n≥1 . Define
U := {B(xn , r) : n ∈ N, r ∈ Q}.
This is a countable collection, and we claim that it satisfies (1). Take some open set
U. For every x ∈ U there are
1. R > 0 such that B(x, R) ⊂ U (because U is open);
2. xn ∈ B(x, R/2) (because {xn } is dense); and
134 A The isomorphism theorem for standard measure spaces
The separability of L^p for 1 ≤ p < ∞ follows from the above and the obvious fact that the collection {∑_{i=1}^{N} α_i 1_{E_i} : N ∈ ℕ, α_i ∈ ℚ, E_i ∈ E} is dense in L^p (prove!). The other direction is left to the reader. ⊓⊔
The following statement will be used in the proof of the isomorphism theorem.
Lemma A.1. Suppose (X, B, µ) is a standard probability space and E is a measur-
able set of positive measure, then there is a point x ∈ X such that µ[E ∩ B(x, r)] ̸= 0
for all r > 0.
A.3 Atoms
Proposition A.2. In a standard space (X, B, µ), every atom is of the form {x} ∪ (a null set) for some x with µ{x} ≠ 0.
Proof. Since every measurable set is a finite or countable disjoint union of sets of diameter less than r (prove!), it is enough to treat sets E such that diam(E) < r.
Standard spaces are regular, so we can find a closed set F₁ ⊂ E such that µ(E \ F₁) < 1/2. If µ(E \ F₁) = 0, then stop. Otherwise apply the argument to E \ F₁ to find a closed set F₂ ⊂ E \ F₁ of positive measure such that µ[E \ (F₁ ∪ F₂)] < 1/2². Continuing in this manner, we obtain pairwise disjoint closed sets F_i ⊂ E such that µ(E \ ⋃_{i=1}^{n} F_i) < 2^{−n} for all n, or until we reach an n such that µ(E \ ⊎_{i=1}^{n} F_i) = 0.
If the procedure did not stop at any stage, then the lemma follows with N := E \ ⋃_{i≥1} F_i.
We show what to do in case the procedure stops after n steps. Set F = Fn , the last
closed set. The idea is to split F into countably many disjoint closed sets, plus a set
of measure zero.
Find an x such that µ[F ∩ B(x, r)] ≠ 0 for all r > 0 (previous lemma). Since X is non-atomic, µ{x} = 0. Since B(x, r) ↓ {x} as r ↓ 0, µ[F ∩ B(x, r)] → 0. Choose r_n ↓ 0 for which µ[F ∩ B(x, r_n)] is strictly decreasing. Define
This is an infinite sequence of closed pairwise disjoint sets of positive measure in-
side F. By the construction of F they are disjoint from F1 , . . . , Fn−1 and they are
contained in E.
Now consider E′ := E \ (⋃_{i=1}^{n−1} F_i ∪ ⋃_i C_i). Applying the argument in the first paragraph to E′, we write it as a finite or countable disjoint union of closed sets plus a null set. Adding these sets to the collection {F₁, …, F_{n−1}} ∪ {C_i : i ≥ 1} gives us the required decomposition of E. ⊓⊔
Proof. Fix a decreasing sequence of positive numbers ε_n tending to zero. Using Lemma A.2, decompose X = ⊎_{j=1}^{∞} F(j) ⊎ N, where the F(j) are pairwise disjoint closed sets of positive measure and diameter less than ε₁, and N is a null set.
Applying Lemma A.2 to each F(j), decompose F(j) = ⊎_{k=1}^{∞} F(j, k) ⊎ N(j), where the F(j, k) are pairwise disjoint closed sets of positive measure and diameter less than ε₂, and N(j) is a null set.
Continuing in this way we obtain a family of sets F(x₁, …, x_n), N(x₁, …, x_n) (n ∈ ℕ, x₁, …, x_n ∈ ℕ) such that
1. F(x₁, …, x_n) are closed, have positive measure, and diam[F(x₁, …, x_n)] < ε_n;
2. F(x₁, …, x_{n−1}) = ⊎_{y∈ℕ} F(x₁, …, x_{n−1}, y) ⊎ N(x₁, …, x_{n−1});
3. µ[N(x₁, …, x_n)] = 0.
Set X′ := ⋂_{n≥1} ⊎_{x₁,…,x_n∈ℕ} F(x₁, …, x_n). It is a calculation to see that µ(X \ X′) = 0.
$$\pi(x)=\cfrac{1}{x_1+\cfrac{1}{x_2+\cdots}}$$
C := {E ∈ B([0,1] \ ℚ) : π^{−1}(E) ∈ B}
contains the cylinders. It is easy to check that C is a σ–algebra. The cylinders generate B([0,1] \ ℚ) (they are intervals whose lengths tend to zero as n → ∞). It follows that C = B([0,1] \ ℚ), and the measurability of π is proved.
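The coding π and its inverse — the continued-fraction digits recovered by iterating the Gauss map t ↦ 1/t mod 1 — can be sketched with exact rational arithmetic (a finite-truncation illustration of ours; the infinite coding in the text applies to irrationals):

```python
from fractions import Fraction
from math import floor

def cf_value(digits):
    """pi(x1, ..., xn) = 1/(x1 + 1/(x2 + ... + 1/xn)), computed exactly."""
    v = Fraction(0)
    for a in reversed(digits):
        v = 1 / (a + v)
    return v

def cf_digits(t):
    """Recover the digits of a rational t in (0,1) by iterating the Gauss map."""
    out = []
    while t != 0:
        a = floor(1 / t)       # the next continued-fraction digit
        out.append(a)
        t = 1 / t - a          # Gauss map: t -> 1/t mod 1
    return out

digits = [1, 2, 3, 4]
t = cf_value(digits)           # Fraction(30, 43)
assert cf_digits(t) == digits  # the coding is invertible on these digits
```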
Next we observe that π[F(a1 , . . . , an ) ∩ X ′ ] = [a1 , . . . , an ], so π −1 : [0, 1] \ Q → X ′
is Borel measurable by an argument similar to the one in the previous paragraph.
It follows that π : (X, B, µ) → ([0,1] \ ℚ, B([0,1] \ ℚ), µ ∘ π^{−1}) is an isomorphism of measure spaces. There is an obvious extension of µ ∘ π^{−1} to B([0,1]), obtained by declaring the extension to vanish on ℚ; let m denote this extension. Then we get an isomorphism between (X, B, µ) and ([0,1], B([0,1]), m), where m is some Borel probability measure on [0,1]. Since µ is non-atomic, m is non-atomic.
1 Hint: apply the transformation x 7→ [1/x] to both sides.
We now claim that ([0, 1], B([0, 1]), m) is isomorphic to ([0, 1], B([0, 1]), λ ),
where λ is the Lebesgue measure.
Consider first the distribution function of m, s 7→ m[0, s). This is a monotone
increasing function (in the weak sense). We claim that it is continuous. Otherwise it
has a jump J at some point x0 :
This means that m{x0 } ≥ J, which cannot be the case since m is non-atomic.
Since F_m(s) = m([0, s)) is continuous, F_m(0) = 0, and F_m(1) = m([0,1]) − m{1} = 1, F_m attains every value 0 ≤ t ≤ 1. So the following definition makes sense:
Notice that Fm (ϑ (t)) = t. We will show that ϑ : ([0, 1], B, m) → ([0, 1], B, λ ) is an
isomorphism of measure spaces.
Step 1. m([0, 1] \ ϑ [0, 1]) = 0.
Proof. s ∈ ϑ[0,1] iff s = ϑ(F_m(s)) iff m[s′, s] > 0 for all s′ < s.
Equivalently, s ∉ ϑ[0,1] iff s belongs to an interval with positive length and zero m–measure. Let I(s) denote the union of all such intervals. This is a closed interval (because m is non-atomic), m(I(s)) = 0, λ(I(s)) > 0, and for any s₁, s₂ ∈ [0,1] \ ϑ[0,1], either I(s₁) = I(s₂) or I(s₁) ∩ I(s₂) = ∅.
There can be at most countably many different intervals I(s). It follows that
[0, 1] \ ϑ [0, 1] is a finite or countable union of closed intervals with zero m-measure.
In particular, m([0, 1] \ ϑ [0, 1]) = 0.
Step 2. ϑ is one-to-one on [0,1].
Proof. F_m ∘ ϑ = id, so ϑ(t₁) = ϑ(t₂) forces t₁ = t₂.
Step 3. ϑ is measurable, with measurable inverse.
Proof. ϑ is measurable, because for every interval (a, b), ϑ −1 (a, b) = {t : t =
Fm (ϑ (t)) ∈ Fm (a, b)} = (Fm (a), Fm (b)), a Borel set.
To see that ϑ −1 is measurable, note that ϑ is strictly increasing, therefore
ϑ (a, b) = (ϑ (a), ϑ (b)) ∩ ϑ ([0, 1]). This set is Borel, because as we saw in the proof
of step 1, ϑ ([0, 1]) is the complement of a countable collection of intervals.
Step 4. m ∘ ϑ = λ.
Proof. By construction, m[0, ϑ(t)) = t, so m[ϑ(s), ϑ(t)) = t − s for all 0 < s < t < 1. Since the semi-algebra of half-open intervals generates the Borel σ–algebra of the interval, m ∘ ϑ = λ.
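Step 4 is the inverse-transform identity m[ϑ(s), ϑ(t)) = F_m(ϑ(t)) − F_m(ϑ(s)) = t − s. A numerical sketch, with a hypothetical non-atomic m of density 6s(1−s), so F_m(s) = 3s² − 2s³ (the measure, and computing ϑ by bisection, are our choices for illustration):

```python
def F(s):
    """Distribution function of the (hypothetical) measure m with density 6s(1-s)."""
    return 3 * s**2 - 2 * s**3

def theta(t, tol=1e-12):
    """theta(t) = smallest s with F(s) = t, found by bisection (F is continuous)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# m[theta(s), theta(t)) = F(theta(t)) - F(theta(s)) = t - s : m o theta = lambda
for s, t in [(0.1, 0.2), (0.25, 0.9), (0.0, 1.0)]:
    assert abs((F(theta(t)) - F(theta(s))) - (t - s)) < 1e-9
```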
Steps 1–4 show that ϑ : ([0, 1], B([0, 1]), m) → ([0, 1], B([0, 1]), λ ) is an isomor-
phism. Composing this with π, we get an isomorphism between (X, B, µ) and the
unit interval equipped with Lebesgue’s measure. ⊔
⊓
A.4 The isomorphism theorem 139
We comment on the atomic case. A standard probability space (X, B, µ) can have
at most countably many atoms (otherwise it will contain an uncountable collection
of pairwise disjoint sets of positive measure, which cannot be the case). Let {xi : i ∈
Λ } be a list of the atoms, where Λ ⊂ N. Then
where µ ′ is non-atomic.
Suppose w.l.o.g. that X ∩ ℕ = ∅. The map
$$\pi:X\to X\cup\Lambda,\qquad \pi(x)=\begin{cases}x, & x\notin\{x_i : i\in\Lambda\}\\ i, & x=x_i\end{cases}$$
Note that the σ –algebra in the definition is the Lebesgue σ –algebra, not the Borel
σ –algebra. (The Lebesgue σ –algebra is the completion of the Borel σ –algebra with
respect to the Lebesgue measure, see problem 1.2.) The isomorphism theorem and
the discussion above say that the completion of a standard space is a Lebesgue space.
So the class of Lebesgue probability spaces is enormous!
Appendix A
The Monotone Class Theorem
Proof. For (1), observe that A = A₁ ⊎ ⊎_{n≥1}(A_{n+1} \ A_n) and use σ–additivity. For (2), fix n₀ s.t. µ(A_{n₀}) < ∞, and observe that A_n ↓ A implies that (A_{n₀} \ A_n) ↑ (A_{n₀} \ A). ⊓⊔
The example A_n = (n, ∞), with µ = Lebesgue measure on ℝ, shows that the condition in (2) cannot be removed.
M′ := {E′ ∈ M(A) : (E′)ᶜ ∈ M(A)}
E ∈ M(A), A ∈ A ⟹ E ∪ A ∈ M(A).
Again, the reason is that the collection M′ of sets with this property contains A, and is a monotone class.
Now fix E ∈ M(A), and consider the collection
M′ := {F ∈ M(A) : E ∪ F ∈ M(A)}.