
101 Illustrated Real Analysis Bedtime Stories

DRAFT July 8, 2016


Contents

Preface i

I The basics 1
1 Sets, functions, numbers, and infinities 3
1 Paradoxes of the smallest infinity . . . . . . . . . . . . . . . . . . 3
2 Uncountability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Cantor's infinite paradise of infinities . . . . . . . . . . . . . . 18
4 The real deal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Reals from rationals . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 The Cantor set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2 Discontinuity 33
7 Guessing function values . . . . . . . . . . . . . . . . . . . . . . . 33
8 The Dirichlet function . . . . . . . . . . . . . . . . . . . . . . . . 38
9 Conway's base-13 function . . . . . . . . . . . . . . . . . . . . . . 41
10 Continuity is uncommon . . . . . . . . . . . . . . . . . . . . . . . 45
11 Thomae's function . . . . . . . . . . . . . . . . . . . . . . . . . . 46
12 Discontinuities of monotone functions . . . . . . . . . . . . . . . 47
13 Discontinuities of indicator functions . . . . . . . . . . . . . . . . 51
14 Sets of discontinuities . . . . . . . . . . . . . . . . . . . . . . . . 52
15 The Baire category theorem . . . . . . . . . . . . . . . . . . . . . 56

3 Series 61
16 Stacking books . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
17 Inserting parentheses and rearranging series . . . . . . . . . . . . 68
18 A Taylor series that converges to the wrong function . . . . . . . 73
19 Misshapen series . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
20 If you torture a series enough, it will converge . . . . . . . . . . . 77

4 Sequences of functions 83
21 Cauchy's wrong theorem . . . . . . . . . . . . . . . . . . . . . . . 83
22 Walrus tusks and nasty pointwise limits . . . . . . . . . . . . . . 85


23 Less nasty pointwise limits . . . . . . . . . . . . . . . . . . . . . . 88


24 Uniform convergence and metric spaces . . . . . . . . . . . . . . 89
25 Polynomials are pretty good at approximating continuous functions 94
26 Variance of dimension . . . . . . . . . . . . . . . . . . . . . . . . 99

5 Differentiation 109
27 Discontinuous derivative . . . . . . . . . . . . . . . . . . . . . . . 109
28 Darboux's theorem . . . . . . . . . . . . . . . . . . . . . . . . . . 111
29 Continuous but nowhere differentiable functions . . . . . . . . . . 111
30 Derivatives at infinity . . . . . . . . . . . . . . . . . . . . . . . . 113
31 Bump functions and partitions of unity . . . . . . . . . . . . . . 115
32 Multivariable limits, derivatives, and local extrema are weird . . 116

6 Measure 119
33 Episode I: The Phantom Measure (Measure is problematic) . . . 119
34 Jordan measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
35 Lebesgue measure . . . . . . . . . . . . . . . . . . . . . . . . . . 127
36 The Smith-Volterra-Cantor set . . . . . . . . . . . . . . . . . . . 129
37 A nonmeager set of measure zero . . . . . . . . . . . . . . . . . . 130
38 Lebesgue's density theorem . . . . . . . . . . . . . . . . . . . . . 135
39 Measure and Minkowski sums . . . . . . . . . . . . . . . . . . . . 136
40 Intersections of measure zero sets and their images . . . . . . . . 137
41 Noise sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
42 Convergence in measure vs. pointwise convergence . . . . . . . . 139
43 Borel measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
44 Measures in general . . . . . . . . . . . . . . . . . . . . . . . . . 143
45 Episode II: Attack of the Clones (Banach-Tarski and paradoxical
decompositions) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

II More advanced stuff 145


7 Integration 147
46 Introduction to the Riemann integral . . . . . . . . . . . . . . . . 147
47 Dirichlet, Thomae, and Lebesgue's criterion . . . . . . . . . . . . 151
48 Episode III: Revenge of the Antiderivatives (Volterra) . . . . . . 153
49 Episode IV: A New Hope for Integration (Lebesgue) . . . . . . . 157
50 Integration on measure spaces . . . . . . . . . . . . . . . . . . . . 162
51 Nice functions and the 3 limit theorems . . . . . . . . . . . . . . 165
52 Detour: Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . 169
53 An introduction to Lp spaces . . . . . . . . . . . . . . . . . . . . 173
54 Be careful with Fubini . . . . . . . . . . . . . . . . . . . . . . . . 176
55 Radon-Nikodym and Riesz representation . . . . . . . . . . . . . 177
56 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
57 Lebesgue's differentiation theorem . . . . . . . . . . . . . . . . . . 181
58 Convoluted convolutions . . . . . . . . . . . . . . . . . . . . . . . 185

59 Rectangles are not so friendly . . . . . . . . . . . . . . . . . . . . 188


60 Weak and strong operators and unhappy rotated rectangles . . . 190
61 The FTC for grown-ups (absolute continuity) . . . . . . . . . . . 193

8 Episode V: Differentiation strikes back 199


62 Monotonic functions are differentiable a.e. . . . . . . . . . . . . . 199
63 The Cantor function and Lebesgue-Stieltjes measures . . . . . . . 201
64 Discontinuity sets of derivatives . . . . . . . . . . . . . . . . . . . 204
65 Fabius function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
66 Pompeiu derivative . . . . . . . . . . . . . . . . . . . . . . . . . . 204
67 Functional derivatives . . . . . . . . . . . . . . . . . . . . . . . . 204

9 Probability and ergodic theory 209


68 What is probability, anyway? . . . . . . . . . . . . . . . . . . . . 209
69 The Cauchy distribution and the false law of large numbers . . . 210
70 The strong law of large numbers . . . . . . . . . . . . . . . . . . 210
71 Birkhoff's ergodic theorem . . . . . . . . . . . . . . . . . . . . . . 210
72 Random walks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
73 Brownian motion . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
74 Khinchin's constant . . . . . . . . . . . . . . . . . . . . . . . 211
75 Normal numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . 214

10 Fourier analysis 215


76 Introduction to Fourier series . . . . . . . . . . . . . . . . . . . . 215
77 Divergence of Fourier series . . . . . . . . . . . . . . . . . . . . . 218
78 Continuous functions make Fourier series cry . . . . . . . . . . . 219
79 Summability is nice . . . . . . . . . . . . . . . . . . . . . . . . . . 223
80 Fourier series finally converge . . . . . . . . . . . . . . . . . . . . 227
81 Introduction to the Fourier Transform . . . . . . . . . . . . . . . 230
82 Fourier series vs. the Fourier transform (and LCA groups) . . . . 235
83 Schwartz functions and trading smoothness for decay . . . . . . . 243
84 Uncertainty principles . . . . . . . . . . . . . . . . . . . . . . . . 245
85 What about L2 ? . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
86 Fourier transforms turn time-translates into frequency modulations . . 252
87 The gradient meets Littlewood-Paley . . . . . . . . . . . . . . . . 254
88 Dirac delta gets kicked out of functions on Rn land again . . . . 257
89 Dirac delta gets Fourier transformed . . . . . . . . . . . . . . . . 259
90 The differential operator finds Sobolev and multiplier friends . . 261

11 Miscellaneous (maybe move these later?) 265


91 Rectifiability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
92 Hausdorff (fractal) dimension . . . . . . . . . . . . . . . . . . . . 265
93 Kakeya sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
94 Cauchy's functional equation and Hamel functions . . . . . . . . 266

12 Acknowledgments 269

A Omitted Details 271


1 Adding books to the top of a stack . . . . . . . . . . . . . . . . . 271

Preface

There's no ulterior practical purpose here. I'm just playing. That's what math is: wondering, playing, amusing yourself with your imagination.

Paul Lockhart [2]

In the genre of entertainment through mathematics education, content is scarce. Hopefully this book helps to remedy that. In each of 101 sections, we present a fun idea from real analysis. The word "story" in the title is meant loosely. Don't expect too many once-upon-a-times. (See Figure 1.)
This is not a real analysis textbook. We de-emphasize math's role as a tool in favor of its role as an art form and more generally as a mode of expression. The things you'll learn from this book might help you solve actual problems, but they're more likely to help you converse with a suitable set of guests at a cocktail party.
Like pants, the book is divided into two parts. You should find Part 1 gentle and accessible, as long as you've learned basic calculus. Maybe you're a curious high school student, or you work in a technical area outside of pure math, or you're currently learning undergraduate analysis. Or maybe you're familiar with the material but you want to read some bedtime stories! We include tidbits of math, history, and philosophy that are absent in standard presentations; hopefully even advanced students will find Part 1 worth reading.
Part 2 will be harder to digest. If you've taken a thorough undergraduate analysis course or two (or three), you should be able to follow along fine. Otherwise, well, you'll understand when you're older. But even if you're a beginner, you should get something out of every section.
As a reader, you should have plenty of food for thought, but we hope you won't have to think so hard that reading this book becomes a chore. Many proofs are omitted or just outlined, since who wants a technical proof in the middle of a bedtime story? We try to provide references for the proofs though. Most can be found in standard undergraduate or graduate texts on real and functional analysis.
The title of this book is inspired by 101 Illustrated Analysis Bedtime Stories [1], a brilliant work of fiction.

Figure 1: The approximate composition of this book: pictures (35%), discussion (30%), definitions/theorems/proofs, actual stories, jokes, and pie charts (0.5%).

This book is both nonfictional and about R, hence the name. We didn't include any complex analysis, because we feel that topics in complex analysis are more like magic tricks than bedtime stories. (Can I get a bounded entire volunteer from the audience? Abracadabra, hocus pocus, ta-da! You're constant.) Happy reading!

Some undergrads
Earth, 2016

References
[1] S. Duvois and C. Macdonald. 101 Illustrated Analysis Bedtime Stories. 2001. URL: http://people.maths.ox.ac.uk/macdonald/errh/.
[2] Paul Lockhart. A Mathematician's Lament. Bellevue Literary Press, New York, 2009.
Part I

The basics


Chapter 1

Sets, functions, numbers, and infinities
OK, dude. Math is the foundation of all human thought, and set theory (countable, uncountable, etc.), that's the foundation of math. So even if this class was about Sanskrit literature, it should still probably start with set theory.

Scott Aaronson [1]

This chapter isn't exactly about real analysis, but it's fun stuff that you need to understand anyway. To appreciate the real analysis stories, you need to know something about the world in which they take place.

1 Paradoxes of the smallest infinity


Are there more even integers or odd integers? How about integers vs. rational numbers? Our goal for this section is to make sense of and answer questions like these by explaining how to compare infinities. (See Figure 1.1.)
A set is just a collection of objects, called the elements of the set.1 A set has neither order nor multiplicity, e.g. {1, 2, 3} = {3, 2, 1, 1}. We write x ∈ X (read "x in X" or "x is in X") to say that x is an element of the set X.
To compare two finite sets, you can simply count the elements in each set. Counting to infinity takes too long though.

1 Are you unsatisfied with this definition? Strictly speaking, we're taking sets as our most primitive objects, so rather than defining them, we should just give some axioms about them which we will assume. We'll be assuming the ZFC axioms. Look them up if you're curious. It shouldn't matter.

Figure 1.1: How can infinities be compared?



Figure 1.2: The centipede can compare two finite sets even though it doesn't know how to count. The centipede puts a clean sock on each foot until it either runs out of socks or runs out of feet. If it runs out of socks with some feet still bare, it can conclude that it has more feet than clean socks, so it's time to do laundry.

As you know from playing musical chairs, to compare two finite sets X and Y, you can avoid counting or numbers and instead use an even more primitive concept: matching. Pair off elements of X with elements of Y one by one. The sets are the same size if and only if you end up with no leftover elements in either set. (See Figure 1.2.)
Armed with this observation, way back in 1638, Galileo declared that infinities cannot be compared. Here's his reasoning. (See Figure 1.3.) Suppose we're interested in comparing the set of natural numbers N = {1, 2, 3, . . . } with the set of perfect squares S = {1, 4, 9, . . . }. On the one hand, obviously, N is bigger than S, because S is a proper subset of N, i.e. every perfect square is a natural number but not vice versa. If we list the natural numbers on the left and the perfect squares on the right, we can match each perfect square n^2 on the right with the copy of that same number n^2 on the left, leaving a lot of lonely unmatched natural numbers.
But on the other hand, instead, we could match each natural number n ∈ N with the perfect square n^2 ∈ S. That would leave no leftovers on either side, suggesting that N and S are the same size! We get two different answers based on two different matching rules. It's as if we play musical chairs twice, with the same set of people and the same set of chairs both times. In the first game, the chairs all fill up, with infinitely many losers still standing. But in the rematch, everybody finds a chair to sit in! Galileo concluded that this is all just nonsense [4]:

So far as I see we can only infer that the totality of all numbers is
infinite, that the number of squares is infinite, and that the number
of their roots is infinite; neither is the number of squares less than the
totality of all the numbers, nor the latter greater than the former;
and finally the attributes "equal," "greater," and "less" are not
applicable to infinite, but only to finite, quantities.

Galileo was on the right track, but he didn't get it quite right. The main takeaway from Galileo's paradox is that we really do need a definition in order to compare infinite sets.

Figure 1.3: Galileo's paradox. With infinite sets, different matching rules can lead to different outcomes.

In the 1800s, Georg Ferdinand Ludwig Philipp Cantor provided a good one, declaring that X has the same cardinality as Y if there is some way to pair off the elements of X with elements of Y, leaving no leftovers in either set.2 So the definition is biased in favor of declaring sets to be the same size. Cantor says, the appropriate way to handle Galileo's paradox is to say "yeah, there really are just as many natural numbers in total as there are perfect squares. Infinity's weird like that."
To explore cardinality properly, we need to be more precise. Galileo's paradox involved two different ways of associating elements of N with elements of S: two different binary relations between N and S.

Definition 1. A binary relation consists of a set X (the domain), a set Y (the codomain), and a set G of ordered pairs (x, y) where x ∈ X and y ∈ Y. (G is called the graph of the relation.)

Definition 2. A function f : X → Y (read "f from X to Y") is a binary relation with domain X and codomain Y whose graph G satisfies the following: For every x ∈ X, there is exactly one y ∈ Y such that (x, y) ∈ G. We write f(x) = y instead of (x, y) ∈ G. Functions are also called maps.

It's sometimes useful to think of f as a machine, which is given the input x and produces the output f(x). But this idea doesn't always make too much sense, because there isn't necessarily a formula or algorithm for figuring out f(x). You're probably most familiar with functions R → R, where R is the set of real numbers (−3/2, 10^9, √2, e, etc.3).

2 Since this was such a brilliant insight of Cantor's, philosophers honor him by referring to it as "Hume's principle."

Figure 1.4: An example of a familiar relation with domain N and codomain N: the ≤ relation.

Figure 1.5: Let Obama be the binary relation with domain R and codomain R whose graph G is depicted above. Obama is not a function, for two reasons. First, for some values of x, there are multiple y so that (x, y) ∈ G. (Obama fails the vertical line test.) Second, for some x, there does not exist a y so that (x, y) ∈ G. (Obama is not entire.)

Figure 1.6: A function f : {1, 2, 3, 4} → {a, b, c}, with e.g. f(4) = c.

Figure 1.7: The function f : N → N defined by f(x) = x + 1 is injective, because no two arrows point to the same number. In contrast, the function depicted in Figure 1.6 is not injective, because 1 and 2 collide.


These real-valued functions of a real argument are going to be the main characters in most of our stories.
A collision of a function f : X → Y is a pair of distinct inputs x_1, x_2 ∈ X such that f(x_1) = f(x_2). A function is injective if it has no collisions. An injective function is lossless: you can recover the input from the output. ("An injection preserves information." Darius Bacon [3].) To put it another way, if X is a set of people and Y is a set of chairs, an injection X → Y is a seating arrangement where each person gets her own chair, possibly leaving some chairs empty.
You should think of the codomain Y as the set of allowed outputs of f. The image of f, denoted f(X), is the set of actual outputs of the function, i.e. f(X) is the set of all f(x) ∈ Y as x ranges over X.4

3 Unsatisfied by this definition as well? We'll discuss what real numbers really are in Section 4. For now, just think of points on a number line, or decimal expansions.
4 You might have heard the term "range" before. The word "range" is ambiguous. Don't use it. When people say "range," sometimes they mean codomain, and other times they mean image.

Figure 1.8: Sir Jective hits everything with his sword. Kevin [9]. See also [8]. [TODO replace picture. The image should depict the fact that the knight hits everything with his sword. Maybe he is stabbing something in the picture, he's missing a leg, the horse is missing a leg; nearby is a slain dragon, a headless chicken, a mailbox cut in half, a chopped-down tree...]

E.g. the image of the function depicted in Figure 1.6 is {b, c}. We say that f is surjective if f(X) = Y. In other words, f is surjective if for every y ∈ Y, there exists an x ∈ X so that f(x) = y. A surjection is a seating arrangement which fills every chair, possibly with many people sharing a single chair.
A function is bijective if it is injective and surjective. A bijection is also called a one-to-one correspondence:5 it is the notion of matching with no leftovers that we were looking for. A bijection is a seating arrangement in which every person is assigned her own chair and every chair is filled. Here's the official version of Cantor's definition.

Definition 3. Suppose X and Y are sets. We say that X has the same cardinality as Y if there exists a bijection X → Y. We write |X| = |Y| in this case.

Example 1. The set of even integers has the same cardinality as the set of odd integers, because f(2k) = 2k + 1 is a bijection between these two sets. (See Figure 1.10.) This should be intuitive, since even and odd seem to be on equal footing.
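To make the definitions of injective, surjective, and bijective concrete, here is a minimal sketch (not from the book) that checks them for a function between finite sets stored as a Python dict; the particular values below are assumptions chosen for illustration, in the spirit of Figure 1.6.

```python
# A minimal sketch (not from the book): checking injectivity and surjectivity
# for a function between finite sets, stored as a dict of input -> output.

def is_injective(f):
    # Injective: no collisions, i.e. all outputs are distinct.
    return len(set(f.values())) == len(f)

def is_surjective(f, codomain):
    # Surjective: every allowed output is actually hit.
    return set(f.values()) == set(codomain)

def is_bijective(f, codomain):
    return is_injective(f) and is_surjective(f, codomain)

# A hypothetical function {1, 2, 3, 4} -> {a, b, c} in the spirit of Figure 1.6.
f = {1: "b", 2: "b", 3: "c", 4: "c"}
print(is_injective(f))                              # False: 1 and 2 collide
print(is_surjective(f, {"a", "b", "c"}))            # False: the image {b, c} misses a
print(is_bijective({1: "a", 2: "b"}, {"a", "b"}))   # True
```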

5 Warning: some mathematicians use this phrase "one-to-one correspondence" to mean bijection, and then in the same breath use the term "one-to-one" to mean injection (note the omission of the word "correspondence.") Some say "f maps X onto Y" to say that f is surjective, while the subtly different "f maps X into Y" merely means that X and Y are the domain and codomain of f! It's a terminological disaster. Much better to stick with the injective/surjective/bijective terms, invented by the group of mathematicians known pseudonymously as Bourbaki.

Figure 1.9: The function f(x) = x^2 is not surjective when thought of as a function R → R, because negative numbers are not part of its image. However, it is surjective if we think of it as a function R → [0, ∞).

Figure 1.10: The bijection from the set of even integers to the set of odd integers.

Example 2. The set of all integers (positive, negative, and 0) has the same cardinality as N (the set of positive integers). To see why, observe that we can reorder the integers as follows (see also Figure 1.11):

0, 1, −1, 2, −2, 3, −3, . . .

The function f(n) which gives the nth element in the list is a bijection from N to the set of integers. We denote the set of all integers by Z (which stands for "Zahlen," the German word for "number"). So what we've just shown is that |Z| = |N|. This is counterintuitive: it feels like there are about twice as many integers as positive integers.
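Here is a minimal sketch (not from the book) of the "fold Z in half" bijection suggested by this list, together with its inverse:

```python
# A minimal sketch (not from the book): the bijection f : N -> Z given by the
# list 0, 1, -1, 2, -2, 3, -3, ..., where N = {1, 2, 3, ...}.

def f(n):
    """The n-th entry of the list."""
    return n // 2 if n % 2 == 0 else -(n - 1) // 2

def f_inverse(z):
    """The position at which the integer z appears in the list."""
    return 2 * z if z > 0 else 1 - 2 * z

print([f(n) for n in range(1, 10)])        # [0, 1, -1, 2, -2, 3, -3, 4, -4]
print(all(f_inverse(f(n)) == n for n in range(1, 1000)))   # True: a genuine bijection
```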

Figure 1.11: To count the integers, we fold Z in half.



Figure 1.12: The proof that |Q| = |N|. We make an infinite table of fractions, with the row index being the denominator and the column index being the numerator. We circle all of the reduced fractions, and then we can make a list of all the rational numbers in the zig-zag order indicated by the arrows. A small simplification made in this illustration is that it omits the nonpositive rational numbers.

Example 3. The set Q of all rational numbers (i.e. fractions of integers) has the same cardinality as N! This seems horribly wrong, because there are infinitely many rational numbers between every two integers. It's sufficiently surprising that at least one high school textbook [2] boldly asserts that Q and N have different cardinalities. But in fact, we can enumerate the rational numbers as follows. Every rational number can be written as a reduced fraction ±p/q, where p is a nonnegative integer and q is a positive integer. First, we list all rational numbers with p + q = 1 (there's just one: zero.)

0/1

Then, we list all rational numbers with p + q = 2:

1/1, −1/1

Then all rational numbers with p + q = 3:

1/2, −1/2, 2/1, −2/1

Etc. etc. Every rational number will eventually be listed. Just like in the case of Z, this reordering immediately gives a bijection between Q and N, showing that |Q| = |N|. (See Figure 1.12.)
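As a sanity check, here is a minimal sketch (not from the book) that produces this stage-by-stage enumeration of Q with Python's Fraction type:

```python
from fractions import Fraction
from math import gcd

# A minimal sketch (not from the book): enumerate Q in stages, where stage k
# emits the reduced fractions +/- p/q with p >= 0, q >= 1, and p + q = k.
def rationals():
    k = 1
    while True:
        for p in range(k):
            q = k - p
            if gcd(p, q) == 1:              # only reduced fractions
                yield Fraction(p, q)
                if p > 0:
                    yield Fraction(-p, q)   # its negative, when nonzero
        k += 1

gen = rationals()
print([next(gen) for _ in range(11)])
# [0, 1, -1, 1/2, -1/2, 2, -2, 1/3, -1/3, 3, -3] (as Fraction objects)
```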
We call a set countably infinite if it has the same cardinality as N. (If you carefully count the elements in a countably infinite set, it's false that you will eventually have counted every element, but it's true that for each element, you will eventually have counted that element.) Countable infinity may be the smallest infinity, but it's got teeth. (See Figure 1.13.)

Figure 1.13: Hilbert's paradox of the Grand Hotel. There are countably infinitely many rooms, all of which are occupied, yet the hotel is still accepting more guests. When a new guest arrives, the hotel asks the patron in room n to move to room n + 1, with the net effect being that room 1 is freed up for the new arrival. Do you see how the hotel can deal with countably infinitely many guests who all arrive simultaneously? (Credit for the "No vacancy, guests welcome" sign: [hilbert-hotel])


2 Uncountability
A simple observation: a set is finite if and only if you can write down the entire set, after giving each element a name. There's a similar characterization of countable sets. (A set is countable if it is either finite or countably infinite.) If X is countable, maybe you can't write down the entire set, but at least you can write down an arbitrary element of X.

Proposition 1. A set X is countable if and only if each element of X can be written down. More precisely, X is countable if and only if there is some finite alphabet Σ and an injection from X to the set Σ* of finite strings of symbols from Σ.

Geology rocks.
Vacuuming sucks.
Don't drink and derive.
Two wrongs can make a riot.
Why is the letter before Z?
Statisticians say mean things.
Your calendar's days are numbered.
A plateau is the highest form of flattery.
...
Two fish are in a tank. One says to the other, "Do you know how to drive this thing?"
...
Bobby Fischer got bored of playing chess with Russians. He asked the association to fix his next match with some other Europeans, writing, "How about a Czech mate?"
...

Figure 1.14: The set of all puns is countable, because every pun can be written down, and hence the puns can be enumerated: we start with the shortest, then move on to longer and longer puns. (We deserve no credit for the puns listed.)

Before the proof, some examples: We can write down an arbitrary element of N using the alphabet Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9} and standard decimal notation. Similarly, to write down elements of Z and Q, just throw in two more symbols, − and /. (That was a much easier proof that Q is countable than the zig-zag argument we did before!)

Proof of Proposition 1. If X is countable, then we can index each element of X with a natural number, which we can think of as its name. Writing down natural numbers is easy enough.
For the converse, we'll show that Σ* is countable. To enumerate Σ*, first list all the length-0 strings (there's only one: the empty string.) Then list all the length-1 strings, then the length-2 strings, etc. There are only finitely many strings of each length, so this gives a bijection N → Σ*. (See Figure 1.14.)
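A minimal sketch (not from the book) of this length-by-length enumeration of Σ*, for a toy two-letter alphabet:

```python
from itertools import count, product

# A minimal sketch (not from the book): list all finite strings over a finite
# alphabet, shortest strings first, as in the proof of Proposition 1.
def all_strings(alphabet):
    for length in count(0):                       # lengths 0, 1, 2, ...
        for letters in product(alphabet, repeat=length):
            yield "".join(letters)                # finitely many per length

gen = all_strings("ab")
print([next(gen) for _ in range(7)])    # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```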
Proposition 1 reveals tons of countable sets: the set of all finite subsets of
Z, the set of all polynomials with integer coefficients, the set of all possible
computer viruses, the set of all possible recipes describing yummy food, the set
of all love notes which can ever be written, the set of all theorems, the set of
all proofs, the set of all stories, the set of all finite mazes, the set of all vague

philosophical questions, the set of all possible digital photographs, the set of all
physical laws that we have any hope of making sense of...
Are there any sets that are uncountable, even bigger than N? Of course, by the that's-why-the-word-countable-was-invented principle. Where do we find one of these super-infinite sets, despised by Count von Count? Proposition 1 gives a hint: it ought to require an infinite amount of information to specify an element of the set. A sequence is a function with domain N, except that if A is a sequence, we write A_n instead of the functional notation A(n).
Theorem 1. Let 2^N denote the set of all sequences of zeroes and ones. Then 2^N is uncountable.

The proof, due to Cantor, is unquestionably one of the greatest proofs of all time. Remember that the definition of cardinality was biased in favor of sets having the same cardinality, which makes it especially tricky to prove that two sets have different cardinalities. We have to prove that there does not exist a bijection N → 2^N. Lots of people like to say that you can't prove a negative [randi09, 11, 7]. But we're about to do exactly that.
Proof. Consider any arbitrary function f : N → 2^N; we will show that f is not a surjection.
We can represent f as a table, like the example in Figure 1.15. Let A be the diagonal sequence, defined by A_n = f(n)_n; that is, the nth term of A is the nth term of the nth sequence. Let B be the opposite of A:

B_n = 0 if A_n = 1, and B_n = 1 if A_n = 0.    (1.1)

By construction, for every n ∈ N, f(n) differs from B in its nth term. Thus, B is not in the image of f, so f is not surjective!
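Here is a minimal sketch (not from the book) of the diagonal construction, applied to the first few rows of the example table in Figure 1.15:

```python
# A minimal sketch (not from the book): flip the diagonal of the first n rows
# of a proposed f : N -> 2^N to get a sequence B that f misses.

def diagonal_escape(f, n):
    A = [f(k)[k - 1] for k in range(1, n + 1)]    # the diagonal: A_k = f(k)_k
    return [1 - a for a in A]                     # B: flip every diagonal entry

# The first four rows of the table in Figure 1.15 (truncated to 10 terms).
table = {1: [0] * 10, 2: [1] * 10, 3: [0, 1] * 5, 4: [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]}
B = diagonal_escape(lambda k: table[k], 4)
print(B)                                                 # [1, 0, 1, 1]
print(all(B[k - 1] != table[k][k - 1] for k in table))   # True: B differs from every row
```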
Here's a more familiar uncountable set:
Theorem 2. The set R of all real numbers is uncountable.

Proof. Define f : 2^N → R by

f(a_1, a_2, . . . ) = the real number represented by 0.a_1 a_2 . . . in base 10.

Then f is injective, and hence f is a bijection between 2^N and f(2^N) ⊆ (−1, 1). This shows that some subset of R is uncountable, which implies that R is uncountable.
The diagonalization argument in the proof of Theorem 1 is extremely clever. Cantor wondered whether there was a bijection between N and R for several years. He asked Richard Dedekind for help, but Dedekind couldn't solve the problem either. Cantor eventually published a more complicated proof that R is uncountable in 1874 (we'll see this early proof in Section ??.) He published his diagonalization argument in 1891 [was-cantor-surprised].

n    f(n)
1    0 0 0 0 0 0 0 0 0 0 ...
2    1 1 1 1 1 1 1 1 1 1 ...
3    0 1 0 1 0 1 0 1 0 1 ...
4    0 1 0 0 1 0 0 0 1 0 ...
5    1 1 1 1 0 1 0 1 0 0 ...
6    0 1 0 0 0 1 1 0 1 0 ...
7    1 0 0 1 0 0 1 0 0 1 ...
8    0 1 1 1 1 1 1 1 1 1 ...
9    1 0 1 1 1 1 1 1 1 1 ...
10   0 0 0 1 1 1 0 0 0 1 ...
A    0 1 0 0 0 1 1 1 1 1 ...
B    1 0 1 1 1 0 0 0 0 0 ...

Figure 1.15: How the proof of Theorem 1 works for one example function f. The sequence B cannot be in the image of f, because for every n, B and f(n) disagree at their nth position.

Theorem 2 is profound. Obviously some numbers, like π, have infinite decimal expansions. We still manage to write down such numbers, by using special notation, like the symbol π. But Theorem 2 tells us that no matter how much notation we make up, there will still be some numbers which cannot be written down! As Shakespeare said,

There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy.

For example, there must exist noncomputable numbers: numbers for which there is no algorithm for listing the digits of the number. Numbers which turn up in the wild tend to be computable (π, e, √2, etc.) But the noncomputable ones are out there!
Notice that in the proof of Theorem 2, we actually showed that the interval (−1, 1) is already uncountable! Intuition suggests that R has a greater cardinality than a puny little interval like (−1, 1), but you've probably learned by now that your intuition can be misleading in this business:

Proposition 2. For any real numbers a < b, |(a, b)| = |R|.

Proof sketch. The function f(x) = tan(x) is a bijection (−π/2, π/2) → R. (See Figure 1.17.) By translating and scaling like in Figure 1.16, you can get a bijection (a, b) → R.
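For concreteness, one explicit bijection of this kind (simply composing the affine rescaling with tan, under the conventions above) is g(x) = tan(π(x − a)/(b − a) − π/2): the inner affine map carries (a, b) onto (−π/2, π/2), and tan then carries that interval onto R.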
Proposition 2 is bizarre, because we like to think of intervals as having different sizes, e.g. (0, 2) should be twice as big as (0, 1). We'll address that idea in depth in Chapter 6.

Figure 1.16: f(x) = 2x is a bijection (0, 1) → (0, 2).

Figure 1.17: Comparing (0, 1) and R.

To illustrate the care that must be taken to show that one set is bigger than
another, we conclude this section with some philosophical nonsense.
SIMPLICIO: I've discovered a proof that there are more bad experiences than good experiences. Take any good experience, and imagine altering it by setting yourself on fire. Now it's a bad experience! So there are at least as many bad experiences as good experiences, and of course there are some bad experiences in which you're not on fire, so the inequality is strict.

SALVIATI: No, that won't do. You've provided a map from the set of good experiences to the set of bad experiences (the set-yourself-on-fire map) which is injective, but not surjective. Your conclusion that there are more bad experiences than good experiences would only be justified if we were dealing with finite sets. (After all, the map f : N → N defined by f(x) = x + 1 is injective but not surjective! You don't think that N is bigger than itself, do you?) But in actual fact, there are infinitely many experiences. Just consider the experience of holding n marbles, for n ∈ N. There's a different experience for each n.

SIMPLICIO: No no, you've misunderstood what I mean by "experience." You thought that I meant a situation, which the subject of the situation would judge to be good or bad. But I meant the perception of that situation. That is, an experience is a brain state, or rather a sequence of brain states, bounded in length by the human lifespan. I claim that there are only finitely many experiences. For example, if n and m are sufficiently large, then holding n marbles is indistinguishable from holding m marbles, and hence they are the same experience. My proof is salvaged.

SALVIATI: Ah, but if "experience" means "sequence of brain states," then the set-yourself-on-fire map is not injective! Consider two good experiences in which you are watching a sunset. In one experience, a squirrel runs by at some distance from you. In the other experience, there is no squirrel. When you set yourself on fire, these cease to be distinct experiences, because you wouldn't notice the squirrel if you were on fire!

3 Cantor's infinite paradise of infinities


Definition 4. For sets X, Y, we say that |X| ≤ |Y| if there exists an injection X → Y.
You'll be happy to know that every two sets can be compared in this way.
Theorem 3. For any two sets X and Y, |X| ≤ |Y| or |Y| ≤ |X| (or both.)
(See [6] for a proof.) You'll also be happy to know that if |X| ≤ |Y| and |Y| ≤ |X|, then |X| = |Y|.
Theorem 4 (Cantor-Bernstein-Schröder). Suppose X and Y are sets. If there is an injection f : X → Y and another injection g : Y → X, then there exists a bijection h : X → Y.
Despite the theorem's name, Dedekind was the first one to prove it, and Cantor never gave a proof for it (though he was the first to state it.) The proof of Theorem 4 is notoriously difficult, but in principle, it requires no deep mathematics education to understand. If you want to confuse yourself, read one of these proofs: [TODO references]
We've seen that there are at least two different kinds of infinity (countable and uncountable.) Obvious question: is there a biggest infinity?
Definition 5. For a set X, the power set of X (denoted P(X)) is the set of all subsets of X. For example, if X = {1, 2, 3}, then

P(X) = {∅, {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}}.

(The symbol ∅ denotes the empty set, the set with no elements.)
Our next theorem implies that no matter how huge a set you come up with,
there is always an even huger set. The proof is just a slightly more abstract
version of the diagonalization argument that revealed uncountability.

Theorem 5 (Cantor's theorem). For every set X, |X| < |P(X)|.

Proof. Consider an arbitrary function f : X → P(X). Let A be the diagonal set, i.e.

A = {x ∈ X : x ∈ f(x)}.

The above bit of notation is read "the set of all x in X such that x is in f(x)," and it means exactly what it sounds like. Let

B = X \ A = {x ∈ X : x ∉ A}.

Fix any x ∈ X; we'll show that f(x) ≠ B. If x ∈ f(x), then x ∈ A, so x ∉ B. If x ∉ f(x), then x ∉ A, so x ∈ B. Either way, f(x) disagrees with B on x. Hence, B is not in the image of f, so f is not surjective.

(Wait, you ask, isn't X = ∅ a counterexample to Cantor's theorem? No, because P(∅) = {∅}, which has one element, whereas ∅ has zero elements.)
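The diagonal set is easy to compute for a small example; here is a minimal sketch (not from the book), with an arbitrarily chosen f:

```python
# A minimal sketch (not from the book): Cantor's diagonal set for a finite X
# and one particular f : X -> P(X); the set B always escapes the image of f.
X = {1, 2, 3}
f = {1: {1, 2}, 2: set(), 3: {1, 3}}        # an arbitrary choice of f

A = {x for x in X if x in f[x]}             # the diagonal set
B = X - A                                   # its complement
print(A, B)                                 # {1, 3} {2}
print(any(f[x] == B for x in X))            # False: B is not f(x) for any x
```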
Cantor's theorem uncovers a rabbit hole to the wonderland of set theory. ("No one shall expel us from the Paradise that Cantor has created." David Hilbert [5]) This book is supposed to be about real analysis, so we're not going to explore the rich landscape of infinities in depth, but we'll visit it as tourists and see some sights, to give you a bit more intuition about infinity.
We've defined expressions like |X| ≤ |Y|, but we haven't actually defined the object |X| by itself. If X is finite, |X| is just the number of elements in the set X. For X infinite, it's a bit trickier to give a suitable definition; suffice it to say that one can be given. These objects |X| are called cardinal numbers. Despite what you may have heard, infinity is a number, or rather many numbers.6 It's just not a real number, the kind of number with which you are most familiar. So what's to be done with all these numbers? Arithmetic!

3.1 Cardinal addition


When you were a newborn baby learning arithmetic of natural numbers, you were taught that n + m is the number of apples you have in total if you combine a pile of n apples with a pile of m apples. Notice that there's a hidden technical assumption, which is that the two piles of apples are disjoint, i.e. they don't share any apples in common! (See Figure 1.18.)
Even with infinitely many apples, the definition still stands. For two sets X and Y, the union X ∪ Y is the set of all x such that x ∈ X or x ∈ Y. The intersection X ∩ Y is the set of all x such that x ∈ X and x ∈ Y. (See Figure 1.19.) If X and Y are disjoint (i.e. X ∩ Y = ∅), we define

|X| + |Y| = |X ∪ Y|.

If X and Y are not disjoint, just rename the elements of each set, giving new sets X′ and Y′ which are disjoint, satisfying |X| = |X′| and |Y| = |Y′|.

6 This is actually one of many senses in which infinity is a number. See also ordinal numbers, hyperreal numbers, surreal numbers.



Figure 1.18: Despite what these two piles of apples may suggest, 5 + 4 ≠ 7.

Figure 1.19: The Boolean set operations: union, intersection, and set difference.

Our "fold in half" proof that |N| = |Z| can easily be adapted to show that

|N| + |N| = |N|.

We also have |R| + |R| = |(0, 1)| + |(0, 1)| ≤ |(0, 2)| ≤ |R|, and hence

|R| + |R| = |R|.

These two calculations are not coincidences: it turns out that for any infinite set X,

|X| + |X| = |X|.

Adding an infinity to itself doesn't do anything!

3.2 Cardinal multiplication


As an infant, you were taught that n · m is the number of apples in an n × m grid of apples. In general, for two sets X and Y, the Cartesian product X × Y is the set of all ordered pairs (x, y) where x ∈ X and y ∈ Y. For example, R × R is the real plane. We define

|X| · |Y| = |X × Y|.

Our zig-zag proof that |N| = |Q| can easily be adapted to show that

|N| · |N| = |N|.

As we'll discuss in Section 26,

|R| · |R| = |R|.

Again, these are not coincidences: for any infinite set X,

|X| · |X| = |X|.

Even multiplying an infinity by itself doesn't do anything!

3.3 Cardinal exponentiation


As an infant, you were taught that n^m is the repeated product n · n · · · n, with n appearing m times. Combinatorially, this is the number of different ways to fill out an m-question multiple choice exam where each question has n options. In other words, if we fix a set N with n elements and a set M with m elements, then n^m is the number of functions M → N.
In general, for two sets X and Y, we define Y^X to be the set of functions X → Y. For example, earlier, we denoted the set of all binary sequences by 2^N. If you identify the number 2 with the two-element set {0, 1}, our notation makes good sense. We define

|Y|^|X| = |Y^X|.

We saw that addition and multiplication are pretty boring for infinite cardinal numbers. Is exponentiation similarly boring? You can specify a subset A ⊆ X by giving its indicator function 1_A : X → {0, 1} defined by

1_A(x) = 0 if x ∉ A, and 1_A(x) = 1 if x ∈ A.

(1_A(x) indicates whether x is in A.) Hence, Cantor's theorem implies that |X| < 2^|X| for every set X. So cardinal exponentiation is not boring: unlike addition and multiplication, exponentiation can actually get you somewhere. You'll be happy to know that standard exponent rules, like (a^b)^c = a^(b·c), hold for cardinal numbers.
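For a finite set, the correspondence between subsets and indicator functions is easy to see explicitly; a minimal sketch (not from the book):

```python
from itertools import product

# A minimal sketch (not from the book): for a finite X, the functions
# X -> {0, 1} correspond exactly to the subsets of X, so |P(X)| = 2^|X|.
X = ["a", "b", "c"]
functions = list(product([0, 1], repeat=len(X)))       # all indicator functions
subsets = [{x for x, bit in zip(X, bits) if bit} for bits in functions]
print(len(functions), 2 ** len(X))                     # 8 8
print(subsets[0], subsets[-1])                         # set() and {'a', 'b', 'c'}
```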
We use the notation ℶ_0 (pronounced "bet nought") to denote |N|, the cardinality of the countably infinite. (Here ℶ is the second7 letter of the Hebrew alphabet.) Then we define ℶ_1 = 2^ℶ_0, and more generally ℶ_{n+1} = 2^ℶ_n, giving a whole sequence of increasingly enormous infinite cardinal numbers. It turns out that |R| = ℶ_1. In normal mathematics (i.e. outside set theory), the only infinite cardinalities you're likely to encounter are ℶ_0, ℶ_1, ℶ_2, and maybe ℶ_3.
That's all we're going to say8 about sets and functions, the basic foundations of math that they ought to teach in middle school. It's time to discuss the foundations of real analysis in particular.

4 The real deal


Real9 numbers seem like awfully familiar friends, but evidence suggests that
different people have different concepts in mind when they talk about numbers.
Back in the 5th century B.C., the Pythagoreans had a confused conception of
number. On the one hand, they thought that all numbers were either integers
or fractions (i.e. rational numbers.) But on the other hand, the Pythagoreans
wanted to use numbers to describe Euclidean geometry, including, of course,
the famous Pythagorean theorem. This led to trouble:

Proposition 3. √2 is irrational. That is, there is no rational number p/q so that (p/q)^2 = 2.

Proof. Assume for a contradiction that p^2 = 2q^2, where p/q is a reduced fraction. Then p^2 is even, which implies that p is even, and hence p^2 is divisible by 4.
7 The cardinality of N is more often denoted ℵ_0; here ℵ is the first letter of the Hebrew alphabet. We prefer the ℶ notation, because the other ℵ numbers are much more confusing. Look up the continuum hypothesis and the generalized continuum hypothesis if you're curious.
8 If you'd like a more detailed exposition of this sort of background material that most math books assume you already know, you might try Chapter 1 of James R. Munkres' book Topology [10]. The rest of that book is quite good too, if you're interested in learning about topology!
9 The word "real" here doesn't actually mean anything. (It's Descartes' fault.) Imaginary numbers and real numbers are equally nonfictional. Real numbers should've been called "continuum numbers" or "line numbers" or something. Too late now.

Figure 1.20: Uh oh, rational numbers and triangles are not friends.

Hence 2q^2 is divisible by 4, so q^2 is even, which implies that q is even. But that contradicts the fact that p/q is reduced.

Legend has it that the irrationality of √2 was discovered by Hippasus while he was at sea, and his fellow Pythagoreans were so enraged that they threw him into the ocean, where he drowned. Unfortunately, as far as we can tell, the legend is just made up.
In modern times, many people claim that 0.999 . . . ≠ 1 when asked. (Over 80% in one small study [12].) Mathematicians, on the other hand, are all confident that 0.999 . . . = 1. (More on this in Section 6.) Are non-mathematicians just not thinking clearly when they say that 0.999 . . . is infinitesimally smaller than 1? It's more reasonable to suggest that they simply were never told clearly what numbers are and how they are represented, so they came up with their own mental model which doesn't match the standard definitions used by mathematicians. Let's clear up these definitional issues now.

Definition 6 (Real numbers). A real number system is a 4-tuple10 (R, +, ·, ≤), where R is a set, + and · are binary operations on R, and ≤ is a binary relation on R, satisfying the real number axioms.

Well that wasn't a very good definition. We'd better tell you what the real number axioms are, eh? Most of them are pretty boring. You should just skim them to get the flavor, except for Axiom 8 which is important. The first four axioms say that arithmetic works like it ought to.

Axiom 1. Addition is commutative (x + y = y + x) and associative ((x + y) + z = x + (y + z).) There is an additive identity 0 (x + 0 = x) and every number x has an additive inverse −x (x + (−x) = 0.)

Axiom 2. Multiplication is commutative and associative, there is a multiplicative identity 1, and every nonzero number x has a multiplicative inverse x^(-1).

Axiom 3. Multiplication distributes over addition (x · (y + z) = x · y + x · z.)

Axiom 4 (Everyone's favorite axiom). 1 ≠ 0.



Figure 1.21: Number systems satisfying Axioms 1 through 4 (with no order structure) are called fields. There are a lot of bizarre fields which are nothing like R. For example, Z/7Z is the field you get by coiling Z up into a circle, pretending that n and n + 7 are the same number for every n. Division in this field is pretty weird, e.g. 1/3 = 5, since 5 · 3 = 15 = 1. The point is, Axioms 5 through 8 are important.

The last four axioms deal with the order structure of R.


Axiom 5. The order ≤ is transitive (x ≤ y and y ≤ z implies x ≤ z), antisymmetric (x ≤ y and y ≤ x implies x = y), and total (for every x, y, either x ≤ y or y ≤ x.)

Axiom 6. If x ≤ y, then x + z ≤ y + z.

Axiom 7. If x ≥ 0 and y ≥ 0, then x · y ≥ 0.
So far, Q has satisfied all of these axioms. From the axioms we've listed so far, you can derive familiar facts like "zero times anything is zero" and "a negative times a negative is a positive." Yawn. There's one more axiom, and it's the fun one. When is the last moment of Sunday? Midnight? No, that's Monday already...
Definition 7. Fix X ⊆ R. We say a ∈ R is an upper bound for X if a ≥ x for all x ∈ X.

Definition 8 (Supremum). Fix X ⊆ R. A number b ∈ R is called the supremum of X if b is the least upper bound of X. That is, b is an upper bound for X, and if a is another upper bound for X, then b ≤ a.
The supremum of X is denoted sup X; the abbreviation "sup" is pronounced like "soup." If a set X has a maximum, then sup X = max X. Some sets, like
(0, 1), have no maximum, but still have a supremum; sup(0, 1) = 1. There is
no last moment of Sunday; the set of moments which are on Sunday does not
have a maximum. But it does have a supremum: midnight. You might say that
midnight is the sup du jour! (Ba dum tss.)
10 An n-tuple is just an ordered list of n objects. So all we're saying is that a real number system has four parts: R, +, ·, and ≤.



Figure 1.22: Q is like Swiss cheese: it's riddled with holes. R is like cheddar cheese: it tastes good grated over scrambled eggs.

Figure 1.23: The countable set S = {1/n : n ∈ N}. This set has a maximum, max S = sup S = 1. It has no minimum, but inf S = 0.

Axiom 8 (Supremum axiom). If X ⊆ R is nonempty and X has an upper bound, then X has a supremum.

The supremum axiom is crucial in real analysis. We prefer R to Q because Q does not satisfy the supremum axiom. To see why, consider the set S = {r ∈ Q : r^2 < 2}. Suppose a is a rational upper bound for S. Then a^2 ≥ 2, and since a is rational, the inequality is strict, i.e. a^2 > 2. But that means that there's another rational number b with a^2 > b^2 > 2, and hence b is a smaller upper bound for S. So no least upper bound for S exists in Q. Hence the need for R. In some sense, Q is riddled with holes. The supremum axiom asserts that all holes are filled in. (See Figure 1.22.)
The infimum of a set X (denoted inf X) is the greatest lower bound of X. Just like the concept of supremum is a generalization of maximum, the concept of infimum generalizes that of minimum. It follows from the real number axioms that any nonempty set with a lower bound has an infimum. (See Figure 1.23.)
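To see the "hole" in Q concretely, here is a minimal sketch (not from the book): starting from any rational upper bound of S, it produces a strictly smaller rational upper bound, so S can have no least upper bound in Q.

```python
from fractions import Fraction

# A minimal sketch (not from the book): improving rational upper bounds of
# S = {r in Q : r^2 < 2}. If a > 0 and a^2 > 2, then b = (a + 2/a)/2 satisfies
# 2 < b^2 < a^2, so b is a strictly smaller rational upper bound of S.
def improve(a):
    b = (a + 2 / a) / 2
    assert 2 < b * b < a * a
    return b

a = Fraction(3, 2)                  # 3/2 is an upper bound, since (3/2)^2 > 2
for _ in range(4):
    a = improve(a)
    print(a, float(a))              # 17/12, 577/408, ... creeping down toward sqrt(2)
```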
So that concludes the definition11 of what a real number system is. But we're not done yet. Usually, people speak of the real number system, and we need to justify that terminology. Does there even exist a real number system? (If some of the real number axioms contradict each other, we are in serious trouble!)

Theorem 6. Thankfully, there does exist a real number system.

We'll sketch a proof of Theorem 6 in Section 5. So there's a real number system, but is it unique? Not quite. Let ⋆ denote some fixed object, e.g. the

11 Mathematicians summarize the whole definition by saying that a real number system is a "complete ordered field."


empty set, or Abraham Lincoln, or radical freedom. Given one real number system (R, +, ·, ≤), we can build another real number system. Our new set of real numbers is R × {⋆}, i.e. the set of all pairs (x, ⋆) where x ∈ R. Arithmetic is defined by (x, ⋆) + (y, ⋆) = (x + y, ⋆) and (x, ⋆) · (y, ⋆) = (x · y, ⋆), and the order is defined by saying that (x, ⋆) ≤ (y, ⋆) if and only if x ≤ y.
But that's dumb. All we did is rename each number x to (x, ⋆), which shouldn't count as building a whole new real number system. A rose by any other name would smell as sweet. This renaming silliness is the only thing that goes wrong; any two real number systems are isomorphic, i.e. each can be obtained from the other by renaming the elements. Precisely:

Theorem 7. The real number system is unique up to ordered field isomorphism. That is, if (R_1, +_1, ·_1, ≤_1) and (R_2, +_2, ·_2, ≤_2) are two real number systems, then there exists a bijection f : R_1 → R_2 so that

For all x, y ∈ R_1, f(x +_1 y) = f(x) +_2 f(y).

For all x, y ∈ R_1, f(x ·_1 y) = f(x) ·_2 f(y).

For all x, y ∈ R_1, x ≤_1 y if and only if f(x) ≤_2 f(y).

(For example, f(x) = (x, ⋆) is an isomorphism R → R × {⋆}.) The proof of Theorem 7 is not too hard but somewhat tedious, so we'll omit it.
So now we can speak of the real number system R, with the slight caveat that you're only allowed to ask questions which can be phrased in terms of +, ·, and ≤. The reals are only defined up to ordered field isomorphism, so questions like "Is √2 ∈ π?" aren't meaningful. You might be thinking, "Duh, √2 and π aren't sets," but it's not that simple. In any particular real number system, √2 and π are sets! Under the hood, everything's a set. The question's not meaningful because in some real number systems (e.g. Dedekind cuts) √2 ∈ π, but in others (e.g. Cauchy sequences) √2 ∉ π. On the other hand, questions like "Does every real number have a real square root?" make perfect sense, because the answer is the same (no) in every real number system.

Figure 1.24: In ancient times (circa 1970), engineers used slide rules to quickly multiply and divide numbers. A simple circular slide rule is depicted; the gray portion rotates relative to the white portion. The depicted position corresponds to multiplication/division by 2. Slide rules work because the exponential function f(x) = e^x is an isomorphism between the additive structure of R and the multiplicative structure of (0, ∞), because of the standard exponent rule e^(x+y) = e^x · e^y. (This is a slightly simpler kind of isomorphism than the ordered field isomorphism of Theorem 7, because here we're just preserving the structure of one operation, whereas in Theorem 7 we preserved the structure of two operations and a relation.)

5 Reals from rationals


Let's start from Q and build a real number system, thereby sketching a proof of Theorem 6. Dedekind observed that between any two real numbers, there is a rational number. Therefore, to specify a point X on a real number line, it suffices to specify the set of rational numbers less than X. So we can just define the real numbers to be the appropriate sets of rational numbers.

Definition 9. A Dedekind cut is a set X ⊆ Q with the following properties:

1. (X is closed downward) If a < b and b ∈ X, then a ∈ X.
2. (X has no maximum) If a ∈ X, there is some b ∈ X with a < b.
3. (X is nontrivial) ∅ ≠ X ≠ Q.

For example, {x ∈ Q : x < 0 or x^2 < 2} is a Dedekind cut, which will be the real number √2. (See Figure 1.25.) To officially define our real number system, we define R to be the set of all Dedekind cuts. We identify each rational number x with the real number X = {y ∈ Q : y < x}. If X and Y are Dedekind cuts, then we set

X + Y = {x + y : x ∈ X, y ∈ Y}.

We say that X ≤ Y if X ⊆ Y. Multiplication is a little more annoying because of minus signs. If X, Y ≥ 0, then we define

X · Y = {x · y : x ∈ X, y ∈ Y, x ≥ 0, y ≥ 0} ∪ {x ∈ Q : x < 0}.

We define −X by

−X = {x − y : x < 0, y ∉ X}.

And now we can extend our definition of multiplication to all reals by setting (−X) · Y = X · (−Y) = −(X · Y) and (−X) · (−Y) = X · Y. It's tedious, but it can be verified that these definitions make (R, +, ·, ≤) a real number system.
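Here is a minimal sketch (not from the book) of cuts as membership predicates on Q; the helper names and the grid-search approximation of + are illustrative assumptions, not part of the official construction.

```python
from fractions import Fraction

# A minimal sketch (not from the book): a Dedekind cut as a membership
# predicate on Q (a function Fraction -> bool).
def cut_of_rational(x):
    return lambda q: q < x                        # the cut {y in Q : y < x}

sqrt2 = lambda q: q < 0 or q * q < 2              # the cut playing the role of sqrt(2)
print(sqrt2(Fraction(7, 5)), sqrt2(Fraction(3, 2)))       # True False

# X + Y = {x + y : x in X, y in Y}. Deciding membership exactly would need an
# unbounded search for a witness x; a coarse grid of rationals is enough here.
def cut_add(X, Y, step=Fraction(1, 1000), lo=-3, hi=3):
    grid = [lo + k * step for k in range(int((hi - lo) / step) + 1)]
    return lambda q: any(X(x) and Y(q - x) for x in grid)

two_sqrt2 = cut_add(sqrt2, sqrt2)                 # behaves like the cut for 2*sqrt(2)
print(two_sqrt2(Fraction(27, 10)), two_sqrt2(Fraction(29, 10)))   # True False
```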
There are lots of alternative constructions of R. The Cauchy sequence construction, discovered by Cantor [TODO cite], is more in the spirit of real analysis. Cantor identified a certain class of sequences of rational numbers which deserve to converge, and then defined the real numbers specifically so that those sequences really do converge. You might appreciate reading about it [TODO reference], but you should read Section ?? first.

Figure 1.25: The Dedekind cut identified with √2 is the set of shaded rational numbers.

6 The Cantor set


Time for our first legitimate real analysis problem. A set E ⊆ R is dense if it intersects every open interval J ⊆ R. For example, Q is dense (this is why it's surprising that Q is countable, and this is why the Dedekind cut construction works). More generally, if I ⊆ R is an open interval, we say that E is dense in I if E intersects every open subinterval J ⊆ I. For example, Q ∩ [0, 1] is dense in I = (0, 1), but not in I = (0, 2).
A set E is nowhere dense if there is no interval I in which E is dense. A nowhere dense set is just like your friend's arguments against your favorite political positions: no matter which part you zoom in on, you can see a gaping hole. For example, Z is nowhere dense. For another example, the set {1/n : n ∈ N} is nowhere dense. (See Figure 1.26.) In some sense, a nowhere dense set is "small." How does this notion of size interact with our earlier notion, cardinality? Q shows that countable does not imply nowhere dense. Does nowhere dense imply countable?

Figure 1.26: The set S = {1/n : n ∈ N} is nowhere dense. Given any interval (such as the blue interval), there is a subinterval (in red) which completely misses S.

Nope! The simplest counterexample, discovered by Henry Smith in 1874, is denoted ∆ and called the Cantor set. (Cantor popularized it.) To construct ∆, we start with the interval [0, 1], and remove a bunch of open intervals. In the first iteration, we remove the middle third interval (1/3, 2/3). This leaves us with the two intervals [0, 1/3] and [2/3, 1]. Next, we remove the middle third interval from each remaining interval, so that we're left with four intervals. We continue the process of removing the middle third of each remaining interval ad infinitum, and then ∆ is everything left over. That is, if we let ∆_n be the set that we have after n iterations of this process, then ∆ = ∩_n ∆_n, the set of points which are in every ∆_n. The first few iterations are shown in Figure 1.27.

Figure 1.27: Construction of the Cantor set.

It's easy to see that ∆ is nowhere dense: for any open interval I that intersects [0, 1], there is a sufficiently large n so that the nth step of constructing ∆ involves removing a subinterval of I. In fact, after removing all those intervals, how much of [0, 1] is left over? The sum of the lengths of the intervals that make up ∆_n is (2/3)^n, so if we take a limit as n → ∞, we see that the total length of ∆ is 0. (We'll come back to this calculation in Chapter 6.) So ∆ must be empty... right? Wrong! For example, 0, 1 ∈ ∆. In fact, ∆ has infinitely many points: any number of the form 1/3^n is in ∆.
But to really understand |∆|, we need to take a detour. So far, we've talked about real numbers in the abstract. When you met R as a child, real numbers were presented to you in the guise of decimal expansions. A decimal expansion is a string, something like 3.14159265 . . . , which (by definition) represents¹² the real number

    3 + 1/10 + 4/100 + 1/1000 + 5/10000 + ⋯

(We haven't talked about limits yet, but since all the terms are nonnegative, you can interpret this infinite sum as the supremum of the set of partial sums.)
Proposition 4. Every real number has a decimal expansion.
(We'll skip the proof.) How about uniqueness? Annoyingly, some real numbers have two different decimal expansions. The real number 1 can also be
¹² You might complain that we haven't explained which real number is referred to by strings like 3, 10, 4, 100, etc.! Well, you understand which integers are referred to by such strings, right? And the real number 1 is part of the axioms. So identify the positive integer n with the real number 1 + 1 + ⋯ + 1 (with n ones.)

Figure 1.28: 0.999 . . . apples are depicted. Or maybe 0.999 . . . apple is depicted?

represented as 0.999 . . . , where there are an infinite number of nines after the decimal point. Do you doubt it? Let's prove it. Certainly 1 is an upper bound on {0.9, 0.99, 0.999, . . . }. If there were a smaller upper bound, say 1 − ε, then ε would be infinitesimal: greater than zero, but smaller than 1/n for every natural number n. Such numbers do not exist:
Theorem 8 (Archimedean Property). For any real ε > 0, there exists a natural number n > 0 so that ε > 1/n.
Proof. Let S = {n ∈ N : n ≤ 1/ε}. Our goal is to show that S ≠ N. If S is empty, we're done. Otherwise, by the supremum axiom, S has a least upper bound sup S. By the minimality of sup S, there exists s ∈ S with s > (sup S) − 1. Then s + 1 > sup S, so sup S is not an upper bound on N. Therefore, S ≠ N.
If you're still in doubt, maybe you'd be convinced by tripling both sides of the equation 1/3 = 0.333 . . . . If you're still uncomfortable, maybe it helps to keep in mind that decimal expansions are just strings, not the numbers themselves.
So is 1 the only two-faced scoundrel in R? Nope, e.g. 97.842 = 97.841999 . . . . Every real number with a finite decimal expansion has a second decimal expansion. But that's the only thing that goes wrong.
Proposition 5. Every real number has at most two decimal expansions. A real number has two decimal expansions if and only if it has a finite decimal expansion.¹³
Now we can finally understand |∆|, by representing numbers in ternary, i.e. base 3. (All of our discussion of decimal expansions applies mutatis mutandis for any integer base b ≥ 2, or even weirder bases like base 2i where i is the imaginary unit.) The interval (1/3, 2/3) that we remove in the first iteration of the construction of ∆ consists of all those real numbers x ∈ [0, 1] whose first ternary digit (after the decimal point¹⁴) is 1. More precisely, (1/3, 2/3) consists of those real numbers x ∈ [0, 1] such that in every ternary representation of x, the first digit is 1. Similarly, in the nth step, we remove those real numbers x such that in every ternary representation of x, the nth digit is 1. So what we're left with, ∆, is the set of real numbers in [0, 1] which can be represented in ternary without using the digit 1. But of course there are uncountably many such real numbers, because every sequence of 0s and 2s represents a distinct such real number!

¹³ Note that for this proposition, we count 3 and 3.0 and +003 as all being the same decimal expansion. If you were trying to be careful, you might disallow leading/trailing zeroes in decimal expansions.
¹⁴ It would really be more appropriate to call it a radix point, but whatever.
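The self-similar description of the stages ∆_n also lends itself to a small computational sketch (mine, not the text's): since ∆_n is a scaled copy of ∆_{n−1} in [0, 1/3] together with one in [2/3, 1], membership of a rational x in ∆_n can be checked by recursion. Being in ∆_n for a large n is strong (though of course not conclusive) evidence of being in ∆ itself.

    from fractions import Fraction

    # x is in Delta_n  iff  x is in [0,1] and (3x is in Delta_{n-1} or 3x - 2 is in Delta_{n-1}).
    def in_stage(x, n):
        x = Fraction(x)
        if x < 0 or x > 1:
            return False
        if n == 0:
            return True
        return in_stage(3 * x, n - 1) or in_stage(3 * x - 2, n - 1)

    print(in_stage(Fraction(1, 4), 40))   # True: 1/4 = 0.020202... in ternary
    print(in_stage(Fraction(1, 3), 40))   # True: 1/3 = 0.0222... avoids the digit 1
    print(in_stage(Fraction(1, 2), 40))   # False: 1/2 = 0.111... needs the digit 1

Note that at most one of the two branches survives the range check at each level, so the recursion is cheap.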
So on the one hand, ∆ is big: it is an uncountable set. But on the other hand, ∆ is small: it is nowhere dense, and it has total length zero. We'll meet ∆ again many times, when these odd properties make it useful.


Chapter 2

Discontinuity

For half a century we have seen a


mass of bizarre functions which
appear to be forced to resemble as
little as possible honest functions
which serve some purpose.

Henri Poincaré [poincare1899]

7 Guessing function values


The heroes of this chapter are functions f : R → R, i.e. functions which eat a number and spit a number back out. You met these functions in school and drew their graphs. (See Figure 2.1.) Roughly speaking, we say that f is continuous if you can draw its graph without ever picking up your pencil. (Euler defined continuity by saying that f is continuous if the graph of f can be described by "freely leading the hand." [TODO cite]) So in Figure 2.1, f is continuous but g isn't.
Consider the function f : R → R defined by

    f(x) = sin(1/x),

shown in Figure 2.2. (For this section, we adopt the convention sin(1/0) = 0.) Is f(x) continuous? How about the function

    g(x) = x sin(1/x),

shown in Figure 2.3?
Evidently, Euler's pencil-never-leaves-the-paper nonsense is not a precise enough definition of continuity! The idea of the true definition, first given by Bolzano [1], is that f is continuous at x if f(x) is exactly what you'd expect it to be, based on how f behaves near x. Continuous functions are predictable. These predictions are, of course, limits.


Figure 2.1: Some boring functions R → R.

Figure 2.2: The topologist's sine curve, f(x) = sin(1/x).

Figure 2.3: The topologist's sine curve after a pliers accident, g(x) = x sin(1/x).

Definition 10 (Limit of a sequence). Suppose x1, x2, . . . is a sequence of real numbers, and L is a real number. We say that xn converges to L if for all ε > 0, there exists an N such that for all n > N,

    |xn − L| < ε.

(See Figure 2.4.) In this situation, we write lim_{n→∞} xn = L, or just xn → L.

Figure 2.4: The definition of the limit of a sequence. For any error margin ε > 0, for all sufficiently large n, xn is within ε of L.

Traditionally, real analysis students find the epsilontics involved in the definition of a limit to be confusing.¹ Maybe a real-life example would help clarify. You are the pilot of a helicopter carrying secret agents. For their secret spy mission, it's important that you hover L feet off the ground. Let xn be the altitude of the helicopter after you've made n adjustments. (It's a digital helicopter.) Then xn → L means that no matter what tolerance ε > 0 your crazy boss demands of you, by making enough careful adjustments, you can eventually guarantee that the helicopter is within ε of L and always will be in the future. Meeting higher standards takes more time, of course: if ε is very small, then N might have to be very big.
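To make the quantifiers concrete, here is a throwaway numerical illustration (not from the text): for the sequence xn = 2 + (−1)^n/n, which converges to L = 2, we exhibit a witness N for a few tolerances ε and spot-check the condition on many later terms.

    import math

    def x(n):
        return 2 + (-1) ** n / n

    L = 2
    for eps in [0.1, 0.01, 0.001]:
        N = math.ceil(1 / eps)          # since |x_n - L| = 1/n, any N >= 1/eps works
        assert all(abs(x(n) - L) < eps for n in range(N + 1, N + 10_000))
        print(f"eps = {eps}: N = {N} works (spot-checked)")

As promised, smaller tolerances force larger values of N.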
Now let's move on to defining continuity. You and your spouse want to go on a trip to the Moon. Your spouse has been obsessively watching the fluctuating rocket ticket prices, trying to get the best possible deal. "I finally bought the tickets just now at time t," your spouse says.
"How much did they end up costing?" you ask.

¹ Steven Krantz reports that when asked to give the ε-δ definition of continuity on a quiz, one student responded: "For every ε > 0 there is a δ > 0 such that you can draw the graph without lifting your pencil from the paper." [TODO cite Mathematical Apocrypha]

"You don't wanna know," your spouse replies. But you really do wanna know, so you ask, "Well how much did they cost at time t − 100?"
"Only $200! We should've bought them then!"
"How about at time t − 10?"
"They shot up to $1000, which scared me."
"And at time t − 1?"
"Down to $600. I thought I'd better grab them soon."
"What about at time t − 0.1?"
"$580." Having learned the prices at times near t, you can extrapolate to guess the price at t, but you'd have to assume that the price doesn't fluctuate too wildly. You keep needling your spouse, learning the prices at times t − 0.01, t − 0.001, t − 0.0001, t − 0.00001... You gain more and more confidence in your extrapolations, because you have to assume less and less about the behavior of the price. After infinitely many questions, you've learned the price at a sequence of times tn with tn → t, so you just have to extrapolate "infinitesimally" to infer the price at t. All you're assuming now is that the price function is continuous at t.

Definition 11 (Continuity). Suppose f : R → R and x ∈ R. We say that f is continuous at x if for every sequence of inputs x1, x2, . . . converging to x, the corresponding sequence of outputs f(x1), f(x2), . . . converges to f(x).

To put it another way, f is discontinuous at x if there is some "misleading" sequence xn → x with f(xn) ↛ f(x). So f(x) = sin(1/x) is discontinuous at 0, because f(0) = 0, yet there is a sequence xn → 0 so that f(xn) = 1 for every n. (Figure 2.5.) Remember, though, we declared f(0) = 0 by fiat. Maybe that was a mistake. Would it be better to choose f(0) = 1? Nope: there's another sequence yn → 0 with f(yn) = −1. (Figure 2.6.) So there's no value of f(0) that would make f continuous at 0.
On the other hand, g(x) = x sin(1/x) is continuous everywhere. (The only worrisome spot is x = 0, but observe that |g(x)| ≤ |x|.) Let's see you draw the graph of that, Euler! If a function is continuous everywhere, we just say that it is continuous. In other words, a continuous function is one which commutes with limits, i.e.

    lim_{n→∞} f(xn) = f( lim_{n→∞} xn ).
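A quick numerical companion (mine, not the book's): along xn = 1/(2πn + π/2) we have sin(1/xn) = 1 for every n, even though xn → 0 and we declared sin(1/0) = 0; the same inputs fed to g(x) = x sin(1/x) give outputs squeezed by |g(x)| ≤ |x|, consistent with continuity at 0.

    import math

    def f(x):
        return 0.0 if x == 0 else math.sin(1 / x)

    def g(x):
        return 0.0 if x == 0 else x * math.sin(1 / x)

    for n in [1, 10, 100, 1000]:
        x_n = 1 / (2 * math.pi * n + math.pi / 2)
        print(f"x_n = {x_n:.2e}   f(x_n) = {f(x_n):+.6f}   g(x_n) = {g(x_n):+.2e}")
    # f(x_n) stays at +1.000000 while x_n -> 0; g(x_n) -> 0 along with x_n.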

We'll end this section with a ridiculous theorem about infinitesimal extrapolation even in the face of discontinuity, from [7]. Let's play a game. We choose a function f : R → R. Then a point x∗ ∈ R is randomly chosen (drawn from, say, a standard normal distribution, or whatever.) We reveal to you the restriction of f to R \ {x∗}. (I.e. you get to know f(x) for every x ≠ x∗.) Then you have to guess what f(x∗) is. You win if you get it right; we win if you get it wrong.
You're probably thinking, "I'll take a limit!" You could find a sequence xn → x∗ with xn ≠ x∗, and evaluate lim_{n→∞} f(xn). If that limit exists, it seems like the obvious guess. If we choose a continuous f and you follow this strategy, you're guaranteed to win.

Figure 2.5: A sequence (in red) showing that f(x) = sin(1/x) is discontinuous at 0. The sequence suggests that f(0) = 1, but in actuality f(0) = 0.

Figure 2.6: Another sequence showing that f(x) = sin(1/x) is discontinuous at 0. This time, the sequence suggests that f(0) = −1.

Figure 2.7: This is the sort of picture that you have to deal with in our guessing game. Every value of the function is revealed except one mysterious point.

But we're not going to make it that easy. We don't make any promises at all about f. Can you still force a guaranteed win? Nah, you're doomed to occasionally give wrong answers. But, absurdly, you can force an almost sure win:
Theorem 9. There is a strategy you can follow which ensures that for any function f we choose, there are only finitely many values of x∗ which lead you to lose. In particular, no matter which f we choose, your probability of winning is 100%.
Proof. Define a binary relation ∼ on the set of all functions R → R by declaring that f ∼ g if f and g agree on all but finitely many points. This relation is an equivalence relation, i.e. it is reflexive (f ∼ f), symmetric (f ∼ g implies g ∼ f) and transitive (f ∼ g, g ∼ h implies f ∼ h.) Therefore, ∼ partitions the set of all functions R → R up into equivalence classes: maximal sets of functions, any two of which agree on all but finitely many points. For each equivalence class C, choose one representative function fC ∈ C.
When you're presented with f with its value at x∗ hidden, figure out which equivalence class f belongs to (call it C.) Then guess that f(x∗) = fC(x∗). For any f, there are only finitely many x∗ causing you to lose, because f ∼ fC!

That proof was our first² encounter with the Axiom of Choice (AC), which is the axiom of set theory which allows the step where we defined fC.

Axiom 9 (Axiom of Choice). Suppose U is a set, F ⊆ P(U), and ∅ ∉ F. Then there exists f : F → U such that for every X ∈ F, f(X) ∈ X. (The function f is called a choice function.)

AC frustrates many people, because it allows for very nonconstructive proofs. Notice that our proof of Theorem 9 doesn't actually explain how you should play, in practice. It just shows that there exists, in the abstract, a strategy with the desired properties. Some mathematicians prefer to avoid AC when possible, but sometimes it is unavoidable. AC will be a recurring character in this book. See [5] for similar, even more ridiculous theorems, also relying heavily on the axiom of choice.

8 The Dirichlet function


Can a function be discontinuous in infinitely many places? Sure, easy peasy: a step function with infinitely many steps, like the floor function. (See Figure 2.8.) How about a function which has uncountably many discontinuities? We'll do even better. Johann Peter Gustav Lejeune Dirichlet (pronounced "deer-ish-lay") discovered a function which is discontinuous everywhere.
Dirichlet realized he could exploit the fact that in every open interval (a, b) ⊆ R, there are both rational and irrational numbers. (Q and R \ Q are both dense.) The Dirichlet function is another name for χ_Q, the indicator function of the rationals. As a reminder, the definition is

    χ_Q(x) = 1 if x ∈ Q,
             0 if x ∉ Q.

(See Figure 2.9.)

Proposition 6. For every x ∈ R, the Dirichlet function is discontinuous at x.


² Actually, several of the results that we stated without proof in Chapter 1 rely on AC.

Figure 2.8: Countably infinitely many discontinuities.

Figure 2.9: The Dirichlet function.



Figure 2.10: Continuous at precisely one point.

Proof. There's a sequence of rational numbers x1, x2, . . . converging to x, and there's another sequence of irrational numbers y1, y2, . . . converging to x. By definition, χ_Q(xn) = 1 and χ_Q(yn) = 0, so χ_Q(xn) and χ_Q(yn) cannot both converge to χ_Q(x).
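Here is a concrete rendering of the two misleading sequences (my own sketch): for the rational point x = 1/3, truncating the decimal expansion gives rationals xn → x, on which χ_Q is identically 1, while yn = x + √2/n gives irrationals converging to x, on which χ_Q is identically 0. The two output sequences cannot both converge to χ_Q(x).

    import math
    from fractions import Fraction

    x = Fraction(1, 3)
    for n in range(1, 6):
        x_n = Fraction(10**n // 3, 10**n)        # 0.3, 0.33, 0.333, ... : rational, -> 1/3
        y_n = float(x) + math.sqrt(2) / n        # irrational, -> 1/3
        print(n, float(x_n), "chi_Q = 1", y_n, "chi_Q = 0")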
An oddity of the Dirichlet function is that it is periodic and nonconstant, yet it has no smallest period: for any rational number r and any real number x, χ_Q(x + r) = χ_Q(x), so χ_Q is periodic with period r.
Dirichlet's idea spawns more monstrosities. Here's a function which is continuous at one point, but discontinuous everywhere else:

    f(x) =  x  if x is rational,
           −x  if x is irrational.

(See Figure 2.10.) Even better, here's a function which is differentiable at one point, but discontinuous everywhere else:

    f(x) = x²  if x is rational,
           0   if x is irrational.

(See Figure 2.11.)
Part of the reason Dirichlet gets his name attached to χ_Q is that it played a role in clarifying the concept of a function. Mathematicians were churning out functions way back in the 1600s in the course of developing calculus. But shockingly, it seems that the now-standard definition of "function" that we gave in Section 1 first appeared in a 1954 book [2]! So how did mathematicians get by in the intervening several hundred years? Well, they played around with many

Figure 2.11: Differentiable at 0, but discontinuous everywhere else.

different notions of function, of varying degrees of rigor. For the first couple hundred years, it was popular to think of functions in terms of "formulas" or "analytic expressions," whatever that means. E.g. in 1748, Euler gave a definition [4]:

    A function of a variable quantity is an analytic expression composed in any way whatsoever of the variable quantity and numbers or constant quantities.

In 1829 [3], Dirichlet gave χ_Q as an example of a function with no integral (see Section ??). Since χ_Q is not really defined by a formula, some infer that Dirichlet had internalized the modern concept of a function, for which they therefore give him credit. But Lakatos correctly points out [6, p. 151] that the credit is undeserved. Dirichlet never gave any such definition.

9 Conway's base-13 function


You might remember the Intermediate Value Theorem from calculus class, which says that if your position is a continuous function of time, then you can't teleport. (See Figures 2.12 and 2.13.)
Theorem 10 (Intermediate Value Theorem). Suppose f is continuous and a < b. Then for any y between f(a) and f(b), there is an x ∈ (a, b) so that f(x) = y.
The IVT seems pretty obvious, because "no teleporting" sounds almost like what we meant by being continuous! At least, it sounds awfully similar to Euler's pencil-never-leaves-the-paper idea of continuity... but remember, that wasn't the actual definition of continuity.

Figure 2.12: The intermediate value theorem: in order for a continuous function to get from one value to another, it must pass through every value in between.

Figure 2.13: When you drive in a car, your distance from Wellington varies continuously. Every point on the Earth's surface which is 550 miles away from Wellington is at sea. So by the IVT, if you want to drive from New Zealand to Australia, you're going to have to build a car that can drive through water. Or a bridge or something.

Hey, maybe now we can argue that Bolzano's formal definition of continuity successfully captures Euler's intuitive idea! A function which satisfies the conclusion of the IVT is called a Darboux function. That is, f is a Darboux function if for every a < b and every y between f(a) and f(b), there is an x ∈ (a, b) so that f(x) = y. Maybe that's a definition of continuity that Euler could get behind! The IVT says that every continuous function is Darboux, so now we just have to prove that every Darboux function is continuous.
There's one small hitch: that last statement is extremely false! The function f(x) = sin(1/x) is Darboux, but discontinuous at one point. It gets worse. We'll give a function f : R → R such that for every open interval (a, b), we have f((a, b)) = R. That is, for every open interval (a, b) and every y ∈ R, there exists x ∈ (a, b) so that f(x) = y. So f is certainly Darboux, but f is not even remotely close to continuous. In fact, it's discontinuous at every point (like the Dirichlet function, but much crazier.)

Figure 2.14: A truncated graph of Conway's base-13 function (in black).

Figure 2.15: A truncated graph of Conway's base-13 function (in white).

Figure 2.14 is a little misleading. The graph of f isn't all of R² (it's a function, after all!) But every disc in R² contains a point in the graph of f. In other words, the graph of f is a dense subset of R².
So what function has this bizarre property? One example is by British mathematician John Horton Conway, who (as of 2015) is still alive, unlike the other mathematicians we've encountered. His idea is to represent numbers in base 13, with these symbols:

    0  1  2  3  4  5  6  7  8  9  ⊕  ⊖  ⊙

(the last three are a plus sign, a minus sign, and a decimal point, each drawn with a circle around it). Every real number has a unique base-13 expansion with no trailing ⊙ symbols (recall Section 6). Conway's base-13 function f is defined with respect to this expansion as follows.
For the interesting case, suppose the base-13 expansion of x is of the form AB, where removing all the circles from the symbols in B yields a sensible base-10 expansion for a real number y. Then set f(x) = y.
Otherwise, just set f(x) = 0.
For example, let x be the real number with base-13 expansion

    x = 6 ⊙ 2 4 3 ⊕ 3 ⊙ 1 4 1 5 9 2 6 . . . ,

where A = 6 ⊙ 2 4 3 and B = ⊕ 3 ⊙ 1 4 1 5 9 2 6 . . . . Notice that if we start at the ⊕ digit, then if we removed the circles, we would get a string y = +3.1415926 . . . , which is a base-10 expansion for the real number π. So we set f(x) = π. For another example, let x be a real number with infinitely many ⊕ symbols in its base-13 expansion. Then f(x) = 0.

Proposition 7. Let f denote Conway's base-13 function. Then for every a < b and every y, there is some x such that a < x < b and f(x) = y.
Proof. Start with the base-13 expansion for the midpoint ½(a + b). If we go out far enough in this base-13 expansion, we can change anything we want and we'll still have a number in (a, b). So in particular, we can replace the sequence of subsequent digits with the circled base-10 expansion of y, to obtain an x ∈ (a, b) such that f(x) = y. (See Figure 2.16.)
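To make the "decoding" step tangible, here is a toy sketch in the spirit of Conway's definition (mine, not the book's), with made-up ASCII stand-ins for the circled symbols: 'P' for ⊕, 'M' for ⊖, 'D' for ⊙. Given the tail B of a base-13 digit string, it checks whether removing the circles yields a sensible signed base-10 expansion and, if so, returns that number; since code only ever sees finitely many digits, the value is necessarily truncated.

    def decode_tail(tail):
        """Return the base-10 value encoded by tail, or None if tail is not of the
        form (P or M) digits D digits."""
        if not tail or tail[0] not in "PM":
            return None
        sign = 1 if tail[0] == "P" else -1
        body = tail[1:]
        if body.count("D") != 1 or any(c not in "0123456789D" for c in body):
            return None
        whole, frac = body.split("D")
        if whole == "" or frac == "":
            return None
        return sign * float(whole + "." + frac)

    # The x from the proof sketch: keep a few base-13 digits of (a+b)/2, then append
    # the circled expansion of y = +3.14159...
    x_digits = "518" + "P3D14159265"
    print(decode_tail(x_digits[3:]))   # 3.14159265

The real definition, of course, depends on the entire infinite tail, which is exactly why no finite computation can evaluate Conway's function.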
So Darboux functions are a lot more complicated than continuous functions. In fact, Darboux functions are absurdly expressive:
Theorem 11 (Sierpiński). For every function f : R → R, there are two Darboux functions g, h so that f = g + h.
Proof. Define an equivalence relation on R by declaring that x ∼ y if x − y ∈ Q. Let E be the set of equivalence classes. Observe that

    |R| = |E × Q| ≤ |E × E| = |E| ≤ |R|




Figure 2.16: The proof of Proposition 7. The figure shows ½(a + b) = 0.5181059... (base 13), the target y = +3.1415926... (base 10), and the modified number x = 0.518 ⊕3⊙1415926... (base 13). The location of the vertical bar in the base-13 expansion of ½(a + b) is chosen based on how big b − a is, to make sure that x ∈ (a, b).

so |E| = |R|. Partition E up into two disjoint sets E = E1 ∪ E2 so that |E1| = |E2| = |R|. There are bijections φ1 : E1 → R and φ2 : E2 → R. Define

    g(x) = φ1([x])           if [x] ∈ E1,
           f(x) − φ2([x])    if [x] ∈ E2;        (2.1)

    h(x) = f(x) − φ1([x])    if [x] ∈ E1,
           φ2([x])           if [x] ∈ E2.        (2.2)

(Here, [x] denotes the equivalence class to which x belongs.) By construction, f = g + h. To show that g and h are Darboux, we'll show that even better, they (like Conway's base-13 function) map every open interval surjectively onto R. Fix a < b and y. Let [x] = φ1⁻¹(y). Since Q is dense in R, we can find x′ ∼ x so that a < x′ < b. Then g(x′) = y, showing that g((a, b)) = R. The same argument works for h.

What's the moral of this story? Is there something wrong with Bolzano's definition of continuity? Nah. Euler would probably agree that Conway's base-13 function does not deserve to be called continuous. The notion of a Darboux function is not a reasonable definition of continuity. It's hard to say what it means for the graph of a function to be "described by freely leading the hand," but it really ought to be more conservative than continuity, not more liberal.

10 Continuity is uncommon
In the previous couple of sections, we saw some really nasty functions with tons of discontinuities. But in everyday life, it seems like we only run into continuous functions. You might be tempted to infer that most functions are continuous. But in truth, in the sense of cardinality, the vast majority of functions are discontinuous!

Figure 2.17: To recover the full graph of f given the values of f on Q, just connect the dots.

Proposition 8. Let C(R, R) be the set of all continuous functions R → R. Then |C(R, R)| = |R| = ℶ₁. (In contrast, note that the set R^R of all functions R → R has cardinality ℶ₂.)

Proof. Since Q is dense, to specify a continuous function f : R → R, it suffices to give the restriction of f to Q. (The value of f at any point x can be recovered from its values on Q, because there's a sequence x1, x2, . . . of rational numbers converging to x, and f(x) = lim_{n→∞} f(xn). See Figure 2.17.) Therefore,

    |C(R, R)| ≤ |R|^|Q| = (2^ℶ₀)^ℶ₀ = 2^(ℶ₀·ℶ₀) = 2^ℶ₀ = ℶ₁.

Constant functions establish the reverse inequality |C(R, R)| ≥ ℶ₁.


Notice that the same basic argument actually shows that the vast majority of functions have uncountably many discontinuities! (To specify a function f with countably many discontinuities, just give f|_Q along with f|_D where D is the set of x values at which f is discontinuous.)
For whatever reason, this is a recurrent phenomenon in mathematics. Usually, the vast majority of cases are pathological (in appropriate senses of "vast majority" and "pathological.")

11 Thomae's function
We've seen some very discontinuous functions. But bigger is not always better. Maybe you're especially fond of some set D ⊆ R. Like discontinuity connoisseurs, we can look for a function which is discontinuous exactly at the x values in D. For now, let's consider the case D = Q. In the 19th century, the German mathematician Carl Johannes Thomae devised his namesake function:

    f(x) = 0    if x is irrational,
           1/q  if x = p/q, with p/q reduced and q > 0.        (2.3)

(See Figure 2.18.)
Proposition 9. Thomae's function is continuous at irrational x and discontinuous at rational x.

Figure 2.18: Thomae's function (in black).

Proof. First, suppose x is rational, so that f(x) > 0. There's a sequence of irrational numbers x1, x2, . . . converging to x, but f(xn) = 0 ≠ f(x), so f is discontinuous at x.
Conversely, for the harder direction, suppose x is irrational. The intuition here is to think about rational approximations to x, and notice that close approximations must have large denominators. So as x′ gets very close to x, f(x′) really will get very close to 0. Now for the proof:
Consider an arbitrary sequence x1, x2, . . . converging to x, and fix an arbitrary ε > 0. There are only finitely many rational numbers within distance 1 of x with denominator no more than 1/ε, so one of them (call it y) is closest to x. If n is sufficiently large, the sequence xn is even closer to x than y is, and hence f(xn) < ε. Since ε was arbitrary, f(xn) → 0, showing that f is continuous at x.
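The "close approximations have large denominators" intuition is easy to see numerically. Here is a direct implementation on exact rationals (my sketch): evaluating Thomae's function along decimal truncations of √2 = 1.41421356..., the reduced denominators grow and the values 1/q shrink toward 0, matching continuity at the irrational √2.

    from fractions import Fraction

    def thomae(x):
        # x is a Fraction, hence rational; Fraction keeps p/q in lowest terms with q > 0.
        return Fraction(1, x.denominator)

    digits_of_sqrt2 = "141421356237309504"
    for k in range(1, 9):
        a = Fraction(int(digits_of_sqrt2[:k + 1]), 10**k)   # 1.4, 1.41, 1.414, ...
        print(a, thomae(a))
    # The printed values 1/q march toward 0 as the approximations improve.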

12 Discontinuities of monotone functions


A function f is monotone increasing if x ≤ x′ implies f(x) ≤ f(x′). Monotone decreasing is defined in the obvious way, and monotone just means either monotone increasing or monotone decreasing. (See Figure 2.19.) The pathological functions we've seen so far have not been monotone. The following theorem, due to Darboux despite its name, gives an excuse:

Theorem 12 (Froda's theorem). Suppose f is monotone. Then f has only countably many discontinuities.

The key to proving Theorem 12 is a recharacterization of continuity.

Figure 2.19: A monotone increasing function on the left and a monotone decreasing function on the right.

Definition 12. Fix f : R → R and c ∈ R. We write lim_{x→c} f(x) = L to mean that for every ε > 0, there exists δ > 0 such that

    0 < |x − c| < δ implies |f(x) − L| < ε.

(It's basically like the definition of the limit of a sequence, with x playing the role of n and δ playing the role of N.) If you just check, you'll see that f is continuous at c if and only if lim_{x→c} f(x) = f(c). So now we can divide the crime of discontinuity into three tiers, depending on how badly lim_{x→c} f(x) fails to equal f(c) (see Figure 2.20):
1. Suppose lim_{x→c} f(x) exists, but it doesn't equal f(c). Then f is charged with having a removable discontinuity at c. For this minor infraction, f is required to enroll in a 12-step program, where it learns how to change its value at c and thereby become continuous.
2. The left limit, denoted lim_{x→c−} f(x) or f(c−), is defined just like lim_{x→c} f(x), except we only pay attention to x < c. Similarly for right limits. Suppose f(c−) and f(c+) both exist, but they're not equal, and hence lim_{x→c} f(x) doesn't exist. Then f is charged with having a jump discontinuity at c. For this misdemeanor, f is incarcerated in a correctional facility, where professionals attempt to decrease f(x) for all x on one side of c, thereby eliminating the jump and restoring continuity.
3. Finally, suppose either f(c−) or f(c+) does not exist. Then f is charged with having an essential discontinuity at c, which is a felony. Making f continuous at c would require fundamentally altering f's character. So f is just sentenced to life imprisonment, to protect society from its incorrigible, deviant behavior.

Proof of Froda's theorem. Without loss of generality, assume f is monotone increasing. Suppose f is discontinuous at c ∈ R. Monotonicity implies that it's a jump discontinuity. Since f(c−) < f(c+), there is some rational number qc with f(c−) < qc < f(c+). The map c ↦ qc is an injection from the set of discontinuities of f to Q.

Figure 2.20: Removable discontinuity, jump discontinuity, and essential discontinuity.

Figure 2.21: You open the right envelope and see 10^6. Do you guess that x1 = 10^6 or x2 = 10^6? Does 10^6 seem like a small number, or a big number? What a dumb question. Surely, all you can do is toss a coin and hope for the best... right? Nope!

Time for a fun application of Froda's theorem in the form of a game. We choose two distinct real numbers x1 < x2 and put each in an unmarked envelope. We shuffle the envelopes and give them to you. You choose an envelope and open it, learning the real number inside. You then guess whether you're looking at x1 or x2. If you're right, you win. If you're wrong, we win. (See Figure 2.21.)
You can trivially achieve a win probability of 50% by just opening a random envelope and saying "x1." Bizarrely, you can beat 50%. Here's what you do: Pick your own third number y randomly, from (say) a standard normal distribution. Choose a random envelope and open it. Assume that y falls between x1 and x2, and guess accordingly.
Here's why it works: Whatever values x1, x2 we choose, there's a positive probability that y falls between them. In that case, you'll win. And in the other case, it all depends on which envelope you open, so you've still got a 50-50 shot. So overall, your probability of winning is

    Pr(win) = Pr(y ∈ (x1, x2)) · 1 + Pr(y ∉ (x1, x2)) · 0.5
            = 0.5 + 0.5 · Pr(y ∈ (x1, x2)) > 0.5.

(See Figure 2.22.)
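If you don't believe the calculation, you can simulate it. Here is a quick Monte Carlo sanity check (my sketch, not the book's), with the adversary's numbers fixed at x1 = 0.3 and x2 = 0.9: open a uniformly random envelope, draw y from a standard normal, and guess "x2" exactly when the revealed number exceeds y.

    import random

    def play_once(x1, x2):
        revealed, revealed_is_x2 = random.choice([(x1, False), (x2, True)])
        y = random.gauss(0.0, 1.0)
        guess_x2 = revealed > y            # i.e. act as if y fell between x1 and x2
        return guess_x2 == revealed_is_x2

    trials = 200_000
    wins = sum(play_once(0.3, 0.9) for _ in range(trials))
    print(wins / trials)   # about 0.5 + 0.5 * Pr(y in (0.3, 0.9)), roughly 0.60

The empirical win rate lands a bit above 50%, exactly as the formula predicts.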
Admittedly, it's a bit anticlimactic. The strategy beats 50%, but only by ε, where we can force ε to be as small as we want by choosing x1 and x2 very close together. Maybe there's a cleverer strategy which guarantees you probability of success at least p where p > 0.5?

Figure 2.22: The strategy which gives you a win probability greater than 50%. The area of the green region is the probability that y really does fall between x1 and x2, in which case you win. If the blue or yellow event occurs, you'll win if and only if you open the envelope containing x2 or x1, respectively.
Nope! Here's why. Fix an arbitrary strategy. Let f(x) be the probability that you guess "x2" given that you observed the number x in the envelope you opened. Then your probability of success is

    Pr(win) = 0.5 · f(x2) + 0.5 · (1 − f(x1))
            = 0.5 + 0.5[f(x2) − f(x1)].

If f is not monotone increasing, we can choose x1 < x2 so that f(x1) > f(x2), putting your win probability below 50%. If f is monotone, then by Froda's theorem, it has a point of continuity, so we can force f(x2) − f(x1) to be smaller than whatever ε > 0 we choose.
How about the converse to Froda's theorem? Yep, countability characterizes the sets of discontinuities of monotone functions!
Theorem 13. Suppose D ⊆ R is countable. Then there is some monotone function f : R → R which is discontinuous precisely at points in D.
Proof. Say D = {d1, d2, . . . }. Define

    f(x) = Σ_{i : di ≤ x} 2^(−i).        (2.4)

(See Figure 2.23.) The sum makes sense, because the terms are all nonnegative, so the order of summation doesn't matter. The sum converges to a finite number between 0 and 1, since Σ_{i=1}^∞ 2^(−i) = 1. It's immediate that f is monotone increasing; as x gets bigger, we add up more and more things. And of course f is discontinuous at di ∈ D, because the value jumps up by 2^(−i) there.
Finally, fix x ∉ D; we must show that lim_{x′→x} f(x′) = f(x). Consider an arbitrary ε > 0. Let N be large enough that Σ_{i=N}^∞ 2^(−i) < ε. Let δ be small enough that the interval [x − δ, x + δ] doesn't contain any of the first N elements of D. Then while traversing this interval [x − δ, x + δ], the value of f changes by at most Σ_{i=N}^∞ 2^(−i) < ε, as desired.
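Here is the jump-sum construction of equation (2.4) written out for a concrete countable set (my own sketch): enumerate the rationals in [0, 1] by increasing denominator as d1, d2, . . . , and sum 2^(−i) over the di ≤ x. Code can only use finitely many terms, so this is really the function built from an initial segment of D, but the jump at each enumerated rational is already visible.

    from fractions import Fraction

    def enumerate_rationals(max_q):
        ds = []
        for q in range(1, max_q + 1):
            for p in range(0, q + 1):
                r = Fraction(p, q)
                if r not in ds:           # keep only the first (reduced) appearance
                    ds.append(r)
        return ds

    def f(x, ds):
        return sum(Fraction(1, 2**i) for i, d in enumerate(ds, start=1) if d <= x)

    ds = enumerate_rationals(6)
    for x in [Fraction(1, 2) - Fraction(1, 1000), Fraction(1, 2), Fraction(1, 2) + Fraction(1, 1000)]:
        print(x, float(f(x, ds)))
    # f jumps as x crosses the rational 1/2, by 2^(-i) for the index i that 1/2 received.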

Figure 2.23: The function f used to prove Theorem 13 in the case D = N.

Notice that this provides another example of a function which is discontinuous at exactly the rationals (like Thomae's function.) But this time, it's monotone! We'll revisit the construction in the proof of Theorem 13 in Section 63 after developing measure theory, and hopefully it will seem more natural then.

13 Discontinuities of indicator functions


For a function f, let D(f) be the set of x values such that f is discontinuous at x. So far, in every example we've seen, D(f) has either been countable or else has contained an interval. Can D(f) be an uncountable nowhere dense set? E.g., is there a function with D(f) = ∆, where ∆ is the Cantor set from Section 6? Yep! Oddly enough, the indicator function χ_∆ is an example! This is in contrast to the situation with Q, whose indicator function is discontinuous everywhere.

Proposition 10. χ_∆ is discontinuous precisely at ∆.

Proof. First, suppose x ∉ ∆. Remember that to construct ∆, we just removed a bunch of open intervals from [0, 1], so there is some open interval I such that x ∈ I and I ∩ ∆ = ∅. Then χ_∆ is 0 on all of I, so it is continuous at x.
Conversely, suppose x ∈ ∆. Remember that ∆ is nowhere dense, so in particular, ∆ does not contain any intervals. Therefore, there are points arbitrarily close to x which are not in ∆, where χ_∆ is 0. Therefore, χ_∆ is discontinuous at x.

Let's generalize, so we can understand what just happened. It's time to introduce you to topology. The definitions are a bit more intuitive in Rⁿ. For a point x ∈ Rⁿ and a radius r > 0, let B_r(x) denote the open ball of radius r centered at x.

Figure 2.24: Let E denote the gray region. Then x is an interior point of E, y is an exterior point of E, and z is a boundary point of E.

Definition 13 (Interior, exterior, boundary). Fix a set E ⊆ Rⁿ and a point x ∈ Rⁿ.

If there's some ε > 0 so that B_ε(x) ⊆ E, we say that x is an interior point of E.

If there's some ε > 0 so that B_ε(x) ⊆ E^c, we say that x is an exterior point of E.

If x is neither an interior point nor an exterior point of E, we say that x is a boundary point of E.

(See Figure 2.24.)

The interior of E, denoted int(E), is the set of interior points of E. The exterior of E is denoted ext(E), and the boundary of E is denoted ∂E. For example, if I is an interval from 0 to 1, then regardless of which endpoints are included, we have int(I) = (0, 1), ∂I = {0, 1}, and ext(I) = (−∞, 0) ∪ (1, ∞). A couple other examples of boundaries: ∂R = ∅, ∂Q = R, ∂Z = Z, and ∂∆ = ∆.

Proposition 11. For any set E ⊆ R, χ_E is discontinuous precisely on ∂E.

Proof. A point x is in ∂E if and only if there are points arbitrarily close to x in E and points arbitrarily close to x in E^c.

14 Sets of discontinuities
Does Thomae's function have a twin? That is, does there exist a function which is continuous at rational points and discontinuous at irrational points?
As in the last section, D(f) is the set of points where f is discontinuous. We've seen examples of messed up functions with D(f) = R, D(f) = Q, D(f) = ∆, etc. We have not, however, seen any hints about how you might rule out the possibility of a function f with some given discontinuity set.
In Section 12, we saw a satisfying theorem: There exists a monotone function f such that D(f) = D if and only if D is countable. In this section, we'll prove an analogous theorem without the "monotone" qualifier.

Figure 2.25: Adolf Hitler does not appreciate the terms "open" and "closed" [hitlertopology].
Definition 14. Fix a set E ⊆ Rⁿ. We say that E is closed if ∂E ⊆ E. We say that E is open if ∂E ⊆ E^c.

For example, thankfully, open intervals are open and closed intervals are closed. An open set is one where each point has some wiggle room. "Fuzzy set" probably would have been a better term for open sets. The term "closed set" is more reasonable, because a set E is (topologically) closed if and only if it is closed under the operation of taking limits. That is, E is closed if and only if whenever xn is a convergent sequence of points in E, we have lim xn ∈ E. Warning: some sets, like [0, 1), are neither open nor closed, and other sets, like ∅, are both open and closed. (See Figure 2.25.)
If ∂E = E, like the case E = ∆, then there's a function f with D(f) = E, namely f = χ_E. By adapting Dirichlet's simple trick, we can handle all closed sets, even the ones with nonempty interiors.
Proposition 12. Suppose E ⊆ R is closed. Then there is some function f with D(f) = E.
Proof. Define

    f(x) =  1  if x ∈ E ∩ Q,
           −1  if x ∈ E \ Q,
            0  if x ∉ E.

(See Figure 2.26.) This is obviously continuous at x ∉ E, because there's a neighborhood around x on which f is constant. Conversely, suppose x ∈ E, so

that f(x) ≠ 0. Let xn, yn be sequences converging to x with xn ∈ Q, yn ∉ Q. Then f(xn) is nonnegative and f(yn) is nonpositive, so they can't both converge to f(x).

Figure 2.26: The function used to prove Proposition 12 in the case E = [−1, 1].

(An alternative way to prove Proposition 12 is to show that every closed subset of R is the boundary of some set.) How about the converse? Do we have our characterization: are discontinuity sets precisely closed sets? Nah, that hypothesis has already been falsified. For example, Q is not closed, but it's the discontinuity set of Thomae's function. The real criterion is slightly more complicated.

Definition 15. A set E ⊆ R is F_σ if it can be written as a countable union of closed sets.

(The term F_σ comes from the French words fermé and somme, meaning "closed" and "union.") For example, any countable set, like Q, is F_σ, because singleton sets are closed. Any closed set, like ∆ or R, is trivially F_σ. The set R \ {0} is F_σ, because

    R \ {0} = ∪_{n∈N} ( (−∞, −1/n] ∪ [1/n, ∞) ).

Notice that every set of discontinuities that we've encountered so far is F_σ! This is no coincidence. Using the basic idea behind Thomae's function, we can tweak the proof of Proposition 12 to handle arbitrary F_σ sets.

Theorem 14. Suppose E is F_σ. Then there exists a function f : R → R with D(f) = E.

Proof. Say E = ∪_n En, where each En is closed. Define

    f(x) =  max{1/n : x ∈ En}   if x ∈ E ∩ Q,
           −max{1/n : x ∈ En}   if x ∈ E \ Q,
            0                   if x ∉ E.

(See Figure 2.27.) First, suppose x ∈ E. The proof used for Proposition 12 still applies, showing that f is discontinuous at x. Conversely, suppose x ∉ E, so f(x) = 0. Suppose xm → x. Since each En is closed and x ∉ En, the sequence xm must eventually escape En and never return. Once xm has escaped E1, . . . , En, we have |f(xm)| ≤ 1/n. So f(xm) → 0, and f is continuous at x.

Figure 2.27: The function used to prove Theorem 14 in the case En = [1/n, 3 − 1/n], which is discontinuous precisely on E = (0, 3).
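For the specific F_σ set of Figure 2.27, the function in the proof is easy to write down explicitly. Here is a sketch (mine, not the book's), which works with exact Fractions so "x is rational" can simply be declared, and which scans only finitely many n, enough here because every point of E = (0, 3) belongs to En for all large n.

    from fractions import Fraction

    def f(x, is_rational, max_n=10**4):
        x = Fraction(x)
        for n in range(1, max_n + 1):
            if Fraction(1, n) <= x <= 3 - Fraction(1, n):
                # max over n of 1/n with x in E_n is attained at the smallest such n
                return Fraction(1, n) if is_rational else -Fraction(1, n)
        return Fraction(0)

    print(f(Fraction(1, 2), True))      # 1/2: the smallest n with x in E_n is n = 2
    print(f(Fraction(1, 2000), True))   # 1/2000: small, as points near the boundary should be
    print(f(3, False))                  # 0, since 3 is not in E = (0, 3)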

And the converse is also true: F_σ-ness characterizes discontinuity sets.

Theorem 15. For any function f : R → R, the set D(f) is F_σ.

Remember how the key to Froda's theorem was to classify discontinuities as more or less severe? That's true here too, in a slightly different sense. The diameter of a set E ⊆ R is defined by

    diam(E) = sup_{x,y∈E} |x − y|.

For a function f : R → R and a set E ⊆ R, the oscillation of f in E is defined by ω_f(E) = diam(f(E)). (See Figure 2.28.) The oscillation of f at a point x is defined by

    ω_f(x) = lim_{ε→0} ω_f(B_ε(x)).

The oscillation of f at x measures the extent to which f is discontinuous at x. For example, if f has a removable discontinuity at x, then ω_f(x) is the distance from the actual value f(x) to the "better" value lim_{x′→x} f(x′). In particular, ω_f(x) = 0 if and only if f is continuous at x.
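The oscillation is also pleasant to estimate numerically. Here is a rough sketch (my own): approximate ω_f(B_ε(x)) by the spread of f over finitely many sample points of the ball, for a few shrinking values of ε.

    import math

    def oscillation(f, x, eps, samples=10_001):
        pts = [x - eps + 2 * eps * k / (samples - 1) for k in range(samples)]
        vals = [f(p) for p in pts]
        return max(vals) - min(vals)

    def sin_one_over_x(x):
        return 0.0 if x == 0 else math.sin(1 / x)

    def step(x):
        return 0.0 if x < 0 else 1.0

    for eps in [1.0, 0.1, 0.01]:
        print(eps, oscillation(sin_one_over_x, 0.0, eps), oscillation(step, 0.0, eps))
    # The estimates hover near 2 for sin(1/x) at 0 (an essential discontinuity) and near 1
    # for the step function at 0 (a jump); at a point of continuity they would shrink to 0.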

Figure 2.28: The oscillation of f in E is the height of the smallest box that contains the graph of the restriction of f to E. For example, the oscillation of sin(1/x) in any interval containing 0 is 2.

Proof sketch of Theorem 15. We can write

    D(f) = ∪_{n∈N} { x ∈ R : ω_f(x) ≥ 1/n }.        (2.5)

If you just check, you'll see that {x : ω_f(x) ≥ ε} is a closed set.


Theorems 14 and 15 help a lot toward understanding which sets are discontinuity sets. For example, the vast majority of sets are not discontinuity sets.

Proposition 13. Let D denote the set of all F_σ subsets of R. Then |D| = |R| (which is smaller than |P(R)| by Cantor's theorem.)

Proof sketch. It turns out that every open set U ⊆ R can be written as a countable union of disjoint open intervals. A closed set is just a complement of an open set, so a closed set can be specified by a sequence of real numbers. Hence, an arbitrary element of D is specified by a sequence of sequences of reals. Therefore,

    |D| ≤ (|R|^|N|)^|N| = |R|^|N| = |R|.

But the story so far isn't entirely satisfying, because it's not obvious how to identify examples of sets which are not F_σ. Can the set R \ Q be written as a countable union of closed sets? It's difficult to say! (That's the thing about characterization theorems. You're never really sure when you're done.) Stay tuned, we'll answer this question in Section 15.

15 The Baire category theorem


Our goal in this section is to prove that Thomae's function does not have a twin. That is, R \ Q is not F_σ. On the way, we'll meet the meager sets. Meagerness might seem like a technical, awkward concept. At the very least, it's a useful tool. And meager sets are actually pretty fun to hang out with, once you get to know them.

In Section 2, we saw Cantor's famous 1891 diagonal argument, which proved that R is uncountable. Diagonalization is a great trick to have up your sleeve; we saw in Section ?? that it can be used to prove that |P(S)| > |S| for every set S. Historically, diagonalization was not the first technique used to prove that R is uncountable. Let's take a look at (a slight variant of) Cantor's original proof that R is uncountable, from 1874. The older proof is actually more real-analysis-ish than the slick diagonalization trick, and if you understand the proof, you'll be ready to meet meager sets. A set E ⊆ R is bounded if diam(E) < ∞.

Theorem 16 (Cantor's intersection theorem). Suppose E1 ⊇ E2 ⊇ . . . is a nested sequence of closed, bounded, nonempty sets. Then ∩_n En ≠ ∅.

Proof. Closedness implies that each En has a minimum xn = min En. Then xn is a bounded, monotone increasing sequence, so it has a finite limit x = sup_n xn. For any Em, the sequence xn is eventually in Em, so closedness implies that x ∈ Em. (See Figure 2.30.)

Figure 2.29: The hypotheses of Cantor's intersection theorem are important. On the left, the nested sequence of closed, nonempty, unbounded sets En = [n, ∞) has empty intersection. On the right, the nested sequence of open, nonempty, bounded sets En = (0, 1/n) has empty intersection.

Figure 2.30: The proof of Cantor's intersection theorem in the case En = [−1/n, 1/n]. The left endpoints limit to 0, the sole element of ∩_n En.

Theorem 17. R is uncountable.

Proof from 1874. Let x1, x2, . . . be an arbitrary sequence. Inductively define closed, bounded intervals I1 ⊇ I2 ⊇ . . . , with xn ∉ In. (See Figure 2.31.) By Cantor's intersection theorem, there is some x ∈ ∩_n In. Then x ≠ xn for every n. Since the sequence was arbitrary, we can conclude that no sequence exhausts all of R.
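Unlike the guessing-game theorem, this argument is perfectly constructive, and we can act it out (my sketch, not the book's): given any finite list of reals (rationals here, for exactness), repeatedly shrink a closed interval so that it avoids the next listed point; any point of the final interval is missing from the list. With an infinite sequence, the intersection theorem finishes the job.

    from fractions import Fraction

    def avoid(points, lo=Fraction(0), hi=Fraction(1)):
        for x in points:
            third1 = lo + (hi - lo) / 3
            third2 = hi - (hi - lo) / 3
            # x cannot lie in both closed thirds [lo, third1] and [third2, hi];
            # keep whichever one x misses.
            if lo <= x <= third1:
                lo = third2
            else:
                hi = third1
        return lo, hi

    listed = [Fraction(1, 2), Fraction(1, 9), Fraction(2, 25), Fraction(3, 100)]
    lo, hi = avoid(listed)
    witness = (lo + hi) / 2
    print(witness, all(witness != x for x in listed))   # a number missing from the list, True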

Figure 2.31: Cantor's original proof that R is uncountable. Having already defined In−1 (in black), we can find a subinterval In (in blue) which misses the single point xn (in red.)

Uncountability is a sort of bigness. In 1899, René-Louis Baire realized³ that by tweaking Cantor's proof, we can show that R is big in a stronger sense than mere uncountability. Recall from Section 6 that a set E ⊆ R is nowhere dense if for every open interval I ⊆ R, there is an open subinterval J ⊆ I so that E ∩ J = ∅. In the proof that R is uncountable, we avoided the sequence of points x1, x2, . . . , but the argument actually allows us to avoid a sequence of sets E1, E2, . . . , as long as each En is nowhere dense. This idea led Baire to classify subsets of R as falling into two categories.
Definition 16. A set E ⊆ R is meager, or first category, if it can be written as a countable union of nowhere dense sets.
You should think of "meager" as meaning "small" (though this is a more relaxed sense of smallness than nowhere dense or countable.) Sometimes, people describe meager sets as "thin." Some examples: Any countable set (like Q) is meager, because a singleton set {x} is nowhere dense. Any nowhere dense set (like ∆) is meager. Let ∆ + Q = {δ + q : δ ∈ ∆, q ∈ Q}. (In words, we put a copy of ∆ at every rational number. This is called the Minkowski sum of ∆ and Q.) Then ∆ + Q is meager, even though it's uncountable and dense.
Definition 17. A set E ⊆ R is nonmeager, or second category, if it isn't meager.

Theorem 18 (The Baire category theorem). R is nonmeager.

Proof. Just repeat the proof that R is uncountable, replacing the sequence x1, x2, . . . of real numbers with a sequence E1, E2, . . . of nowhere dense sets.

The Baire category theorem opens the door to a host of nonmeager sets. A comeager set is the complement of a meager set.⁴ Since the union of two meager sets is meager, Baire's category theorem implies that every comeager set, such as R \ Q, is nonmeager.
We promised that all this Baire category stuff was going to help us to show that R \ Q is not F_σ. Maybe F_σ sets are always meager? Nah, that's not true. There's really no reasonable sense in which F_σ sets are small, because R itself is F_σ! The true connection is a little subtler: F_σ sets are either small, or big, but never medium! Precisely:

³ Dunno if this was actually Baire's thought process. But it's a reasonable guess.
⁴ The "co-" prefix convention for complements is especially popular in mathematical logic. It's useful for constructing low-quality jokes. E.g., a coconut should have just been called a "nut."

Figure 2.32: Baire categories.

Proposition 14. Suppose E is F_σ. Then either E is meager (E is small) or else E contains an interval (E is big.) In particular, R \ Q is not F_σ.
Proof. Say E = ∪_n En, where every En is closed. If E doesn't contain an interval, then each En is nowhere dense (if it were dense in I, it would contain I by closedness.) So E is meager. (And R \ Q is neither: it contains no interval, since Q is dense, and it is nonmeager, since it is comeager. So R \ Q is not F_σ.)
A couple other examples: The set of transcendental numbers is not F_σ. The set of noncomputable numbers is not F_σ. (∆ + Q)^c is not F_σ.

References
[1] Bernard Bolzano. Rein analytischer Beweis des Lehrsatzes, daß zwischen je zwey Werthen, die ein entgegengesetztes Resultat gewähren, wenigstens eine reelle Wurzel der Gleichung liege. Gottlieb Haase, 1817.
[2] Nicolas Bourbaki. Theory of Sets. Berlin: Springer, 2006. isbn: 3540340343.
[3] P. G. L. Dirichlet. "On the convergence of trigonometric series which serve to represent an arbitrary function between two given limits". In: Journal für die reine und angewandte Mathematik 4 (1829), pp. 157–169.
[4] L. Euler. Introductio in analysin infinitorum. Vol. 1. 1748.
[5] Christopher S. Hardin and Alan D. Taylor. "A peculiar connection between the Axiom of Choice and predicting the future". In: American Mathematical Monthly (2008), pp. 91–96.
[6] Imre Lakatos. Proofs and Refutations: The Logic of Mathematical Discovery. Cambridge and New York: Cambridge University Press, 1976. isbn: 0521290384.
[7] "Set Theory and Weather Prediction". url: http://xorshammer.com/2008/08/23/set-theory-and-weather-prediction/.

Chapter 3

Series

On the whole, divergent series are the work of the devil, and it's a shame that one dares base any demonstration upon them.

Niels Henrik Abel [5], possibly mistranslated

16 Stacking books
How far over the edge of a table can a stack of books protrude without toppling? (Figure 3.1)

Figure 3.1: The book stacking problem with N = 4. We are interested in maximizing the overhang d.

To be more precise, we have N identical unit-length books, and in our stack, no two books may have the same vertical position. How large can the horizontal distance be between the edge of the table and the right edge of the rightmost book? For N = 1, you can achieve an overhang of d = 1/2, but if you push the book any farther than that it will fall off the table. (Figure 3.2)
What about general N? So far, this is a physics question, but we can turn it into a math question by trusting Newton: we'll assume that the stack falls if and only if for some n ≤ N, the center of mass (COM) of the top n books is not above the surface on which those n books rest. (Figure 3.3)


Figure 3.2: An optimal stack of 1 book.

If you're in the mood to solve this puzzle yourself, close this book now and ponder. Otherwise, read on for the solution.

Figure 3.3: The stack on the left is unbalanced and will topple over. The COM of the entire stack (marked in the figure) is over the table like it should be, but the COM of the top two books (also marked) is to the right of the third book. The top two books will pivot about the top right corner of the third book, as shown on the right. Note: We assume that the books have uniform density, so the COM of a set of books is just the average of their spatial centers.

Definition 18. The harmonic stack is defined inductively as follows. To build a harmonic stack of N books, first place a book on the table poking over the edge a distance of 1/(2N). Then build a harmonic stack of N − 1 books, treating that first book as if it were the table. For example, the harmonic stack of 4 books is depicted in Figure 3.4.

Figure 3.4: A harmonic stack of 4 books.

Proposition 15. Harmonic stacks do not topple over.

Proof. Say there are N books in the stack. By induction, we can assume that if you held the bottom book steady, the stack wouldn't fall over. The only thing to worry about is the horizontal component of the COM of the whole stack compared to the edge of the table. Put the origin at the upper right corner of the table, so that the COM of the top N − 1 books is at most 1/(2N) (by induction.) Hence, the COM of all the books is at most

    (1/N) [ (1/(2N) − 1/2) + (N − 1) · (1/(2N)) ] = 0,

where 1/(2N) − 1/2 is the COM of the bottom book.

The overhang achieved by the harmonic stack of N books is

    d = (1/2) Σ_{n=1}^{N} 1/n.

You should recognize the famous harmonic series. (See Figure 3.5.)
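Before going on, here is a quick computational check (my own sketch, not part of the proof): compute the overhang d = (1/2) Σ_{n=1}^N 1/n and verify the balance condition that, for every k, the COM of the top k books lies over the book (or table) below them.

    def harmonic_stack_right_edges(N):
        """Right-edge positions of the books, top book first, with the table edge at 0."""
        edges = []
        x = 0.5 * sum(1.0 / n for n in range(1, N + 1))    # right edge of the top book
        for n in range(1, N + 1):                           # n-th book from the top
            edges.append(x)
            x -= 0.5 / n                                    # each book sticks out 1/(2n) past the next
        return edges

    def balanced(edges):
        N = len(edges)
        centers = [e - 0.5 for e in edges]
        for k in range(1, N + 1):
            com_top_k = sum(centers[:k]) / k
            support = edges[k] if k < N else 0.0            # right edge of the book below, or the table
            if com_top_k > support + 1e-9:
                return False
        return True

    for N in [1, 4, 52]:
        edges = harmonic_stack_right_edges(N)
        print(N, round(edges[0], 3), balanced(edges))
    # N = 52 gives an overhang of about 2.27 book-lengths, matching Figure 3.8.

In fact the check shows the harmonic stack is critically balanced: for every k, the COM of the top k books sits exactly over the supporting edge.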

Figure 3.5: The harmonic series is not to be confused with the harmonica series.

A series is an expression¹ of the form Σ_{n=1}^∞ an, where a1, a2, . . . is a sequence of real numbers (the terms of the series.) The sequence of partial sums of the series is the sequence S1, S2, . . . where SN = Σ_{n=1}^N an. We say that the series converges/diverges if the sequence of partial sums converges/diverges. Series can diverge because the limit is infinite, e.g. 1 + 1 + 1 + ⋯, or because the limit does not exist, e.g. 1 − 1 + 1 − 1 + ⋯.

Proposition 16. The harmonic series Σ_{n=1}^∞ 1/n diverges, i.e.

    lim_{N→∞} Σ_{n=1}^N 1/n = ∞.

Proof. This proof was discovered by the philosopher Nicole Oresme in the 1300s. (Figure 3.6) We'll make the series a little smaller, and show that it still diverges. Replace each term with the next power of two to appear:

    Σ_{n=1}^∞ 1/n = 1/1 + 1/2 + 1/3 + 1/4 + 1/5 + 1/6 + 1/7 + 1/8 + 1/9 + ⋯
                  ≥ 1/1 + 1/2 + 1/4 + 1/4 + 1/8 + 1/8 + 1/8 + 1/8 + 1/16 + ⋯

(the group 1/4 + 1/4 sums to 1/2, the group 1/8 + 1/8 + 1/8 + 1/8 sums to 1/2, and so on).

¹ Notice that strictly speaking, two series Σ_{n=1}^∞ an and Σ_{n=1}^∞ bn are equal only if they are equal termwise, i.e. only if an = bn for every n. But typically, when Σ_{n=1}^∞ an appears in a mathematical expression, it stands for the value of the series, lim_{N→∞} Σ_{n=1}^N an, rather than for the series itself. So for example, even though Σ 2^(−n) and Σ 2·3^(−n) are two different series, we still write Σ 2^(−n) = Σ 2·3^(−n) = 1.

Figure 3.6: Other than his proof that the harmonic series diverges, Oresme's main
contribution to the world may have been the invention of bar charts.

Figure 3.7: The proof that the harmonic series $\frac{1}{2} + \frac{1}{3} + \frac{1}{4} + \ldots$ diverges (the first
term 1 is not drawn). We divide the infinitely many terms of the series into blocks,
and alternatingly color the blocks gray and red. Each block has only finitely many
terms (twice as many as the previous block), yet each block has a total width of at
least $\frac{1}{2}$.

Grouping together like powers of two gives
$$\sum_{n=1}^{\infty} \frac{1}{n} \geq \frac{1}{1} + \frac{1}{2} + 2\cdot\frac{1}{4} + 4\cdot\frac{1}{8} + 8\cdot\frac{1}{16} + \cdots = \frac{1}{1} + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \cdots = \infty.$$
(See Figure 3.7.)
The implications for book stacking are astounding. The overhang of the
harmonic stack of N books limits to $\infty$ as $N \to \infty$! So for any distance d,
no matter how large (a mile, a million miles, $10^{10^{10}}$ miles, anything), you
could, in principle, build a precariously balanced stack of books which hangs a
distance d over the edge of the table! (Figures 3.8, 3.9)
However, the harmonic series diverges slowly. Oresme's proof that the
harmonic series diverges suggests that $\sum_{n=1}^{N} \frac{1}{n}$ scales like $\log N$.
Figure 3.8: A harmonic stack of 52 books, which achieves an overhang of about 2.27.

Figure 3.9: You can get near the theoretical optimal overhang with a deck of 52
playing cards.

Figure 3.10: Euler discovers the Euler-Mascheroni constant.

In fact, it turns out that there is a constant $\gamma \approx 0.58$ called the Euler-Mascheroni constant such
that $\sum_{n=1}^{N} \frac{1}{n} \approx \gamma + \ln N$ in the sense that
$$\lim_{N\to\infty}\left[\sum_{n=1}^{N} \frac{1}{n} - \ln N\right] = \gamma.$$
(Figures 3.10, 3.11.)
To paraphrase² Daniel Shanks, ln N goes to infinity with great dignity. Turning
things around, the number of books you'd need to achieve an overhang of
d using a harmonic stack grows very rapidly with d; it scales like $e^{2d}$. Even
for smallish distances like d = 30, you would need far more books than can be
found on Earth.
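To get a feel for the numbers, here is a rough Python sketch (the function names and sample values are ours, not from the text): it computes the overhang straight from the formula $d = \frac{1}{2}\sum_{n=1}^{N}\frac{1}{n}$ and estimates how many books a given overhang requires.

import math

def overhang(N):
    # Overhang of the harmonic stack of N books (book length 1).
    return 0.5 * sum(1.0 / n for n in range(1, N + 1))

def books_needed(d):
    # Smallest N whose harmonic stack overhangs at least d.
    # Only feasible for small d: the answer grows roughly like e^(2d).
    N, H = 1, 1.0
    while H < 2 * d:
        N += 1
        H += 1.0 / N
    return N

print(overhang(52))      # about 2.27, matching Figure 3.8
print(books_needed(3))   # already a few hundred books
# books_needed(30) would be on the order of e^60 books; don't bother running it.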
So the harmonic stack isn't as exciting as it seemed. Unfortunately, the
harmonic stack is optimal:
Proposition 17. The maximum overhang that can be achieved by a stack of N
books is that achieved by the harmonic stack of N books.
(The proof, which is elementary, is omitted.) One way to get around this
annoyance is to relax the model by allowing multiple books at each vertical
position, side by side. (Figure 3.12) It turns out that in this new model, the
2 The original quote: log log log x goes to infinity with great dignity.

Figure 3.11: The harmonic stack is shaped like the exponential function (or the
natural log function if your head is sideways). The distance marked $a$ is approximately
$\gamma/2$, where $\gamma$ is the Euler-Mascheroni constant.

number of books needed to reach a distance d scales like $d^3$ instead of like $e^{2d}$.
[TODO cite https://math.dartmouth.edu/~pw/papers/maxover.pdf] Much
more practical.
Finally, we'll address two misconceptions about book stacking. Misconception
one: Some people mistakenly summarize our discussion of harmonic stacks
by saying, "You can build a stack of books that reaches infinitely far away from
the table." But "infinitely far" is much different than "arbitrarily far." (What
physics is even supposed to apply to an infinite stack of books?)
Misconception two: Some people mistakenly believe that you can add books
to the top of an ever-growing stack, one by one, in such a way that the overhang

Figure 3.12: When you allow books to be side by side (unlike our original problem),
new possibilities open up. A harmonic stack of 9 books achieves an overhang of
$d \approx 1.41$, but this simple diamond stack of 9 books achieves a superior overhang of
$d = 1.5$.

goes to $\infty$ as time progresses. Our discussion of harmonic stacks did not prove
this claim; notice that to get from a harmonic stack of N books to a harmonic
stack of N + 1 books, you have to add another book to the bottom of the
stack! And in fact, in the model where no two books can have the same vertical
position, the claim is false. Since this point is a little bit subtle, and it isn't
discussed anywhere outside this book to the best of our knowledge, we give a
fairly detailed statement and proof in Appendix 1.

17 Inserting parentheses and rearranging series


In Oresme's proof that the harmonic series diverges, there was a step where we
grouped together like powers of two:
$$\frac{1}{1} + \frac{1}{2} + \frac{1}{4} + \frac{1}{4} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \cdots = \frac{1}{1} + \frac{1}{2} + 2\cdot\frac{1}{4} + 4\cdot\frac{1}{8} + \cdots$$
Seems pretty true. But how can we legitimately justify this move? Effectively,
we are inserting parentheses into the series, so that e.g. the two terms $\frac{1}{4}, \frac{1}{4}$ in
the original series are replaced with a single term $(\frac{1}{4} + \frac{1}{4})$ in the new series.
This smells like the familiar associative law for addition, which says that we
can insert and remove parentheses in finite sums without changing the value,
e.g. $a + (b + c) = (a + b) + c$. Does associativity still hold for infinite sums
(series)?
Nope! For an easy counterexample, let's look at Grandi's series,
$$1 - 1 + 1 - 1 + 1 - 1 + \cdots, \tag{3.1}$$
which diverges since its partial sums form the divergent sequence 1, 0, 1, 0, . . ..
Now insert some parentheses to help it along:
$$(1 - 1) + (1 - 1) + (1 - 1) + \cdots \tag{3.2}$$

This series is just $0 + 0 + 0 + 0 + \cdots$, which converges to 0. We can even insert
parentheses a different way and evaluate
$$1 + (-1 + 1) + (-1 + 1) + \cdots, \tag{3.3}$$
which then converges to 1. (Figure 3.13) Evidently, we can't get associativity
for infinite sums in general. Uh oh. Does Oresme's proof have a gaping hole in
it? It seemed so convincing!
No, not a gaping hole, just a tiny technicality to address. It is true that if
$\sum a_n$ converges, then we can insert parentheses wherever we want and it will still
converge to the same thing. Proof: Inserting parentheses amounts to looking
at a subsequence of the sequence of partial sums. (Figure 3.14) If the sequence
of partial sums converges to begin with, then all subsequences also converge
to the same thing, so we can add parentheses to the series willy-nilly and the
sum won't change. Adding parentheses can only help the series converge. So

Figure 3.13: Guido Grandi experiences an identity crisis. Actually, the paradox
didn't bother Grandi at all. He found it theologically illuminating: "By putting
parentheses into the expression $1 - 1 + 1 - 1 + \ldots$ in different ways, I can, if I want, obtain
0 or 1. But then the idea of the creation ex nihilo is perfectly plausible." [TODO cite
Bagni's Appunti di Didattica della Matematica]

Figure 3.14: The sequence of partial sums for Grandi's series oscillates and diverges.
But the subsequence consisting of just the blue dots converges to 1, and the subsequence
consisting of just the black dots converges to 0.

Oresme's proof works, because³ if the harmonic series did converge, then the
series $1 + \frac{1}{2} + \frac{1}{2} + \frac{1}{2} + \ldots$ would have to converge to something smaller.
Now that we've seen that associativity does not generalize to infinite series,
we'll look at commutativity (a + b = b + a). Can we rearrange terms of a series
without affecting the sum?
The answer is, again, no in general. As an example, let's rearrange the
alternating harmonic series, $\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n}$. Using Taylor series, we can evaluate:
$$\sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} = 1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \cdots = \log 2.$$

By the way, that's a natural logarithm.⁴⁵ Now let's rearrange the series like
this:
$$S = 1 - \frac{1}{2} - \frac{1}{4} + \frac{1}{3} - \frac{1}{6} - \frac{1}{8} + \frac{1}{5} - \frac{1}{10} - \frac{1}{12} + \cdots,$$
which is a pattern of an odd denominator followed by two consecutive even
denominators. This new series does not converge to log 2. If it did, we could
insert parentheses without altering the sum, but:
$$\left(1 - \frac{1}{2}\right) - \frac{1}{4} + \left(\frac{1}{3} - \frac{1}{6}\right) - \frac{1}{8} + \left(\frac{1}{5} - \frac{1}{10}\right) - \frac{1}{12} + \cdots$$
$$= \frac{1}{2} - \frac{1}{4} + \frac{1}{6} - \frac{1}{8} + \frac{1}{10} - \frac{1}{12} + \cdots$$
$$= \frac{1}{2}\left(1 - \frac{1}{2} + \frac{1}{3} - \frac{1}{4} + \frac{1}{5} - \frac{1}{6} + \cdots\right)$$
$$= \frac{1}{2}\log 2.$$
(Figure 3.15.)
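A quick numerical check of this (a Python sketch; the block-by-block bookkeeping is ours) sums the rearranged series in groups of three terms and compares with $\frac{1}{2}\log 2$:

import math

def rearranged_sum(num_blocks):
    # Block k of the rearranged series: 1/(2k-1) - 1/(4k-2) - 1/(4k).
    total = 0.0
    for k in range(1, num_blocks + 1):
        total += 1.0 / (2 * k - 1) - 1.0 / (4 * k - 2) - 1.0 / (4 * k)
    return total

print(rearranged_sum(100000))   # 0.34657...
print(0.5 * math.log(2))        # 0.34657..., i.e. (log 2)/2, not log 2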
How far can we push this madness? Which series have sums which depend
on the order of summation? And which values can such a series be made to sum
to?
We'd better clarify what it means to "rearrange" the terms of a series. Intuitively,
we just want to add up the terms in a different order. But of course,
$$1 + \frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} + \ldots$$
should not count as a rearrangement of the harmonic series, because some
terms of the harmonic series will never appear. We want every term of the
original series to appear exactly once in the new series. To make this precise,
note that a permutation of a set S is a bijection $\sigma : S \to S$.
3 Another way to justify Oresme's argument: Every term of the harmonic series is nonnegative,
so the sequence of partial sums is monotone. Every subsequence of a monotone sequence
has the same convergence behavior as the original sequence.
4 In analysis, when the base of a logarithm isn't specified, you should assume it's base e.
This is in contrast to e.g. computer science, where logs are base 2 by default.
5 Here's a joke: What do analysts and number theorists throw into the fireplace? Answer:
Natural logs!

Figure 3.15: You can think about a series such as $\sum \frac{(-1)^{n+1}}{n}$ financially. The positive
terms of the series are paychecks and the negative terms are bills. When you get a
paycheck, you immediately deposit it, and when you get a bill, you immediately pay
it off. The series converges to log 2, which means that as time progresses, your bank
account balance will converge to log 2. The paychecks sum to infinite wealth, and the
bills sum to infinite debt, so your bank account balance converging is the result of a
careful balancing act. Each paycheck puts your bank account balance a little above
log 2, and each bill puts your bank account balance a little below log 2. It should
make sense that if you start getting two bills for every paycheck, you won't be able to
maintain such a high bank account balance.

Definition 19. A rearrangement of the series $\sum_{n=1}^{\infty} a_n$ is a series of the form
$\sum_{n=1}^{\infty} a_{\sigma(n)}$, where $\sigma$ is a permutation of $\mathbb{N}$.

Recall that a convergent series $\sum_{n=1}^{\infty} a_n$ is conditionally convergent if $\sum_{n=1}^{\infty} |a_n| = \infty$.
For example, the alternating harmonic series is conditionally convergent.

Theorem 19 (Riemann's rearrangement theorem). Let $\sum a_n$ be a conditionally
convergent series. Then for any $L \in \mathbb{R} \cup \{\pm\infty\}$, there is a permutation
$\sigma(n)$ so that $\sum a_{\sigma(n)} = L$.

Apparently, conditionally convergent series are so weak-willed that they can
be persuaded to converge to anything at all by permuting the terms! Let's get
started with the proof.

Lemma 1. Suppose $\sum_{n=1}^{\infty} a_n$ is conditionally convergent. Then the sum of the
positive terms diverges (to $+\infty$) and the sum of the negative terms diverges (to
$-\infty$).

Proof. Let $a_n^+ := \max\{a_n, 0\}$ and $a_n^- := \min\{a_n, 0\}$, so that $\sum a_n^+$ is the sum
of the positive terms and $\sum a_n^-$ is the sum of the negative terms. Since $\sum a_n$
converges, either $\sum a_n^+$ and $\sum a_n^-$ both converge, or they both diverge. But they

Figure 3.16: The rearrangement of the alternating harmonic series that the proof of
Theorem 19 constructs for the target sum L = 1.2:
$1.2 = 1 + \frac{1}{3} - \frac{1}{2} + \frac{1}{5} + \frac{1}{7} + \frac{1}{9} - \frac{1}{4} + \frac{1}{11} + \frac{1}{13} - \frac{1}{6} + \cdots$

can't both converge, because that would imply that $\sum |a_n| = \sum a_n^+ - \sum a_n^-$
converges.⁶
Proof sketch of Theorem 19. First suppose $L \in \mathbb{R}$. Without loss of generality,
assume $L \geq 0$. By the lemma, our positive terms are worth $+\infty$ and our negative
terms are worth $-\infty$, so let's use them! Add a bunch of positive terms until our
partial sum exceeds L. Then throw in some negative terms until we drop below
L, then back to positive terms, etc. We switch to adding terms of the other
sign as soon as we pass L. In this way, we use up all the terms in the series,
and the error between our partial sum and L goes to 0 as time progresses, since
$a_n \to 0$ as $n \to \infty$. (Figure 3.16)
Now suppose $L = +\infty$. The idea is to add up a lot of positive terms, then
a negative term, then a lot of positive terms, then a negative term, etc. By the
lemma, we can always add enough positive terms to more than make up for the
negative term. The $L = -\infty$ case is symmetric.
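The greedy procedure in the proof sketch is easy to simulate. Here is a rough Python sketch (the target values and the number of terms used are ours) that rearranges the alternating harmonic series toward a chosen limit:

import math

def rearrange_to(target, num_terms=10**6):
    # Greedily use the positive terms 1, 1/3, 1/5, ... and the negative terms
    # -1/2, -1/4, ...: add a positive term while at or below the target,
    # otherwise add a negative term.  Each term is used at most once.
    total, next_odd, next_even = 0.0, 1, 2
    for _ in range(num_terms):
        if total <= target:
            total += 1.0 / next_odd
            next_odd += 2
        else:
            total -= 1.0 / next_even
            next_even += 2
    return total

print(rearrange_to(1.2))        # close to 1.2 (the target of Figure 3.16)
print(rearrange_to(math.pi))    # close to pi
print(math.log(2))              # the sum in the original order, for contrast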
P P
If n=1 |an | converges, we say that n=1 an is absolutely convergent. Heres
a converse to Riemanns rearrangement theorem. Dirichlet showed that rear-
ranging an absolutely convergent series never changes the sum:

Theorem 20 (Dirichlet). Any rearrangement of an absolutely convergent series


converges to the same thing as the original series.
6 Here were using the easy fact that rearranging a series with nonnegative terms does not

affect the sum.



Proof. Let $\sum_{n=1}^{\infty} a_n = L$. The key fact is that since $\sum_{n=1}^{\infty} |a_n|$ converges, there
is some N so that the tail $\sum_{n=N+1}^{\infty} |a_n|$ is $\leq \varepsilon$. For this same N,
$$\left|L - \sum_{n=1}^{N} a_n\right| = \left|\sum_{n=N+1}^{\infty} a_n\right| \leq \sum_{n=N+1}^{\infty} |a_n| \leq \varepsilon.$$
Now we wait for our permutation $\sigma(n)$ to hit all the numbers $\{1, \ldots, N\}$ (this
will happen in finite time, since there's only finitely many numbers we need to
hit). If it takes time t to get all of them, so $\{\sigma(1), \ldots, \sigma(t)\} \supseteq \{1, \ldots, N\}$, then
$$\left|L - \sum_{n=1}^{\infty} a_{\sigma(n)}\right| \leq \left|L - \sum_{n=1}^{t} a_{\sigma(n)}\right| + \left|\sum_{n=t+1}^{\infty} a_{\sigma(n)}\right| \leq 3\varepsilon,$$
since $\left|\sum_{n=t+1}^{\infty} a_{\sigma(n)}\right| \leq \sum_{n=N+1}^{\infty} |a_n|$ and
$$\left|L - \sum_{n=1}^{t} a_{\sigma(n)}\right| \leq \left|L - \sum_{n=1}^{N} a_n\right| + \underbrace{|\text{more terms beyond } a_N|}_{\leq \sum_{N+1}^{\infty} |a_n|} \leq 2\varepsilon.$$

18 A Taylor series that converges to the wrong function
If we have an infinitely differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can hope to get a
Taylor series centered at 0 and write
$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} x^n, \quad \text{for } |x| < R, \text{ some } R > 0. \tag{3.4}$$
Some examples you may be familiar with are $e^x = \sum_{n=0}^{\infty} \frac{x^n}{n!}$ for all $x \in \mathbb{R}$
($R = \infty$) and $\frac{1}{1-x} = \sum_{n=0}^{\infty} x^n$ for $|x| < 1$ ($R = 1$). In these examples, the
Taylor series is equal to the original function wherever the series converges.
But in general, if the Taylor series converges for $|x| < R$, must f be equal to its
Taylor series there? We'll answer this question negatively by answering a weaker
question: If the Taylor series converges for all $x \in \mathbb{R}$, is there a neighborhood
of $x = 0$ such that (3.4) holds for all x in that neighborhood? Such a function
is called real analytic, meaning it is locally equal to a convergent power series.
Surprisingly, even this weaker statement is false. Not only can a convergent
Taylor series fail to converge to f everywhere, but it doesn't even need to converge
to f in any neighborhood of the center! In other words, the Taylor series
of f can converge to the wrong function in every neighborhood of the center
point.

Define f by
$$f(x) = \begin{cases} e^{-1/x}, & x > 0 \\ 0, & x \leq 0, \end{cases}$$
whose graph is shown in Figure 3.17.

Figure 3.17: Plot of $e^{-1/x}$ near zero. It is super flat at zero but then ever so slowly
makes its way up away from the x-axis.

The function f is infinitely differentiable at x = 0, but its Taylor series at
0 is just the zero function $T(x) \equiv 0$, which does not converge to f for any
$x > 0$. (If you don't like the fact that the series does converge to f for $x < 0$,
take a look at the function in Remark 1. We choose f here instead because the
computations are a little simpler.)
Since $f(x) = 0$ for $x < 0$, the left hand limit at zero is easy to compute:
$$f_-^{(n)}(0) = \lim_{x \to 0^-} \frac{f^{(n-1)}(x) - f^{(n-1)}(0)}{x - 0} = 0.$$
For the right hand limit at zero, we have the following lemma, which can be
proved using induction.
Lemma 2. For $x > 0$,
$$f^{(n)}(x) = p_n(1/x)\,e^{-1/x}, \tag{3.5}$$
where $p_n$ is a polynomial of degree at most 2n.


Using this, the definition of the derivative, and the fact that exponentials
dominate polynomials, we can show that $f_+^{(n)}(0) = 0$, so that $f^{(n)}(0) = 0$. Then
the Taylor series for f centered at $x = 0$ is simply
$$T(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(0)}{n!} x^n = \sum_{n=0}^{\infty} 0 = 0,$$
which is certainly convergent for all $x \in \mathbb{R}$. But since $f(x) \neq 0$ for $x > 0$, no
open interval around $x = 0$ exists such that $T(x) = 0$ is equal to $f(x)$.
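A quick numerical look (a Python sketch, not from the text) at just how flat f is near 0 makes the conclusion believable: f is wildly smaller than any power of x there, yet strictly positive, while the Taylor series predicts exactly 0.

import math

# f(x) = exp(-1/x) for x > 0; its Taylor series at 0 is T(x) = 0.
for x in [0.5, 0.1, 0.05, 0.01]:
    print(x, math.exp(-1.0 / x))   # tiny but positive, while T(x) = 0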
Remark 1. f has a relative $g(x) = \begin{cases} e^{-1/x^2}, & x \neq 0 \\ 0, & x = 0 \end{cases}$ with similar properties.
(Figure 3.18.)

Figure 3.18: The seagull function $g(x) = e^{-1/x^2}$ is infinitely differentiable, but all
of its derivatives at 0 are 0, so its Taylor series converges to the function $T(x) \equiv 0$.

Remark 2. Maybe you hope these non real analytic functions are rare. Too
bad! Let's say we start with a function h that is real analytic; i.e. it is locally
given by a convergent power series. Then add f to it, which will guarantee that
h + f is not real analytic. So for every real analytic function h, we can construct
a unique non real analytic function h + f! In fact, even the set of smooth but
nowhere analytic functions on R is second category in $C^\infty(\mathbb{R})$! (See [1].)

Remark 3. This nonsense disappears in complex analysis. Complex differen-


tiability is equivalent to (complex) analyticity (and infinite differentiability).

Here's a joke: Recall that a Maclaurin series is just a Taylor series centered
at zero. So, why do Maclaurin polynomials fit the original function so well?
Because they are Taylor-made! In light of this section, Maclaurin polynomials
may not actually work so well, but it's just a joke.

19 Misshapen series
So far, we've investigated standard series, of the form $\sum_{n=1}^{\infty} a_n$. But standards
are for chumps. How about a two-sided series? E.g.
$$\sum_{n=-\infty}^{\infty} 2^{-|n|} = \cdots + 2^{-2} + 2^{-1} + 2^{0} + 2^{-1} + 2^{-2} + \cdots$$
It seems pretty clear that this series should converge to $1 + 2\sum_{n=1}^{\infty} 2^{-n} = 3$.
(Figure 3.19.)

Figure 3.19: Summing a two-sided series.

How about a two-dimensional series?
$$\begin{array}{cccccc}
2^{-1} & +\,2^{-2} & +\,2^{-3} & +\,2^{-4} & +\,2^{-5} & +\,2^{-6} + \cdots \\
 & +\,2^{-2} & +\,2^{-3} & +\,2^{-4} & +\,2^{-5} & +\,2^{-6} + \cdots \\
 & & +\,2^{-3} & +\,2^{-4} & +\,2^{-5} & +\,2^{-6} + \cdots \\
 & & & +\,2^{-4} & +\,2^{-5} & +\,2^{-6} + \cdots \\
 & & & & +\,2^{-5} & +\,2^{-6} + \cdots \\
 & & & & & +\,2^{-6} + \cdots \\
 & & & & & \vdots
\end{array}$$
This one ought to converge to $\sum_{n=1}^{\infty} n\,2^{-n} = 2$. (Figure 3.20.)

Figure 3.20: Summing a two-dimensional series.
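Here is a small numerical check (a Python sketch; the truncation depth is ours) that the two ways of adding up this triangular array agree: entry (i, j) is $2^{-j}$ for $j \geq i \geq 1$, and both the raw double sum and the column-by-column sum approach 2.

def two_dim_sum(J):
    # Add the entries 2^(-j), j >= i, over the finite triangle i, j <= J.
    total = 0.0
    for i in range(1, J + 1):
        for j in range(i, J + 1):
            total += 2.0 ** (-j)
    return total

print(two_dim_sum(30))                             # 1.99999...
print(sum(n * 2.0 ** (-n) for n in range(1, 31)))  # same, summed by columns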

More generally, if we have a countable index set I, we can make sense of the
series via a bijection $\mathbb{N} \to I$. As we saw in Section 17, the value of the sum
might depend on which bijection we choose. But there are no discrepancies if
all the terms are positive. [TODO clarify relationship between this idea and the
next idea]
You might remember that back in Section 12, we actually found it useful for
a proof to use a series with terms that were not indexed by $\mathbb{N}$! The definition

we used there generalizes nicely: Suppose I is some arbitrary index set, and for
$i \in I$, $a_i$ is a nonnegative real number. Then we define
$$\sum_{i \in I} a_i = \sup\left\{ \sum_{i \in J} a_i : J \text{ is a finite subset of } I \right\}.$$

For example, if $I = \mathbb{N}$, this definition matches the standard notion of convergence
for series. By taking $I = \mathbb{R}$, we can add up all the values of some
nonnegative function $\mathbb{R} \to \mathbb{R}$! But uncountable sums are not as exciting as you
might hope:
Proposition 18. Suppose that for uncountably many $i \in I$, $a_i > 0$. Then
$\sum_{i \in I} a_i = \infty$.
Proof. For $n \in \mathbb{N}$, let $E_n = \{i \in I : a_i > \frac{1}{n}\}$, so that $\bigcup_n E_n$ is uncountable.
A countable union of finite sets is countable, so some $E_n$ must be infinite.
Therefore,
$$\sum_{i \in I} a_i \geq \frac{1}{n} + \frac{1}{n} + \frac{1}{n} + \cdots = \infty.$$

What if our series has negative terms?
$$\cdots - 3 - 2 - 1 + 0 + 1 + 2 + 3 + \cdots = \text{???}$$
Sometimes we can say something about such a series. We can separate our series
$S = \sum_{i \in I} a_i$ into the positive part and the negative part:
$$S^+ = \sum_{\substack{i \in I \\ a_i \geq 0}} a_i, \qquad S^- = \sum_{\substack{i \in I \\ a_i \leq 0}} a_i.$$
Both of these series make sense by our earlier definition, and if at most one of
$S^+$ and $S^-$ is infinite, then we can define $S = S^+ + S^-$. But if $S^+ = \infty$ and
$S^- = -\infty$, we just leave S undefined. All of these ideas are generalized by
measure theory and the Lebesgue integral. But that's a story for Chapter ??.

20 If you torture a series enough, it will converge
Earlier, we saw Grandi's series $S = 1 - 1 + 1 - 1 + \cdots$, which can be made to sum
to 0 or 1 by judiciously inserting parentheses. Mathematicians are sane, clear-thinking
folk, so of course, all the great mathematicians of history understood
that Grandi's series obviously diverges, and thus it simply doesn't have a sum:
it's not 0, it's not 1, and it's certainly not anything else. ...Right?
[TODO get quote from Grandi]
(Leibniz, 1674) [TODO cite] $\frac{1}{1+1} = 1 - \frac{1}{1+1}$. Ergo $\frac{1}{1+1} = 1 - 1 +
1 - 1 + 1 - 1$ etc.

(Leibniz, 1713) [TODO cite] ...And now since from that one [Gerolamo
Cardano] who wrote of the values of the gambling games, it had
been shown that when the average between two even quantities is
found by calculation, the arithmetic mean ought to be found, which
is one-half of the sum, and in such a way this nature of things attends
to the same law of righteousness; hence although $1 - 1 + 1 - 1 + 1 - 1 +$
etc is 0 in the case with a finite even number of elements, in the
case with a finite odd number of elements it is equal to 1; it follows
that in the case with both sides vanishing into multitude of infinite
elements, where the law is confounded by the presence of both evens
and odds, and there is such a great sum on both sides, that $\frac{0+1}{2} = \frac{1}{2}$
emerges, which is what has been proposed.

Gambling games? Law of righteousness? What was Leibniz smoking?


But he's in good company!

(Euler, 1760) [TODO cite] For if in a calculation I arrive at this
series $1 - 1 + 1 - 1 + 1 - 1$ etc. and if in its place I substitute 1/2,
no one will rightly impute to me an error, which however everyone
would do had I put some other number in the place of this series.
Whence no doubt can remain that in fact the series $1 - 1 + 1 - 1 + 1 - 1$
+ etc. and the fraction 1/2 are equivalent quantities and that it is
always permitted to substitute one for the other without error.

[TODO give a bunch of quotes from famous mathematicians, all asserting
that Grandi's series sums to 1/2]
[TODO: Rewrite from scratch, with the history in mind. Apparently there's
a bunch to say about this.]
Let's go back to series of the form $\sum_{n=1}^{\infty} a_n$, with terms indexed by natural
numbers. Remember how by inserting parentheses, we were able to make
Grandi's series $S = 1 - 1 + 1 - 1 + \ldots$ sum to 0 or 1? Here's an argument that
it sums to $\frac{1}{2}$:
$$\begin{aligned}
S + S &= (1 - 1 + 1 - 1 + \cdots) \\
&\phantom{=}\ + (1 - 1 + 1 - \cdots) \quad \text{(shifted one slot to the right)} \\
2S &= 1 + 0 + 0 + 0 + \cdots \\
S &= \frac{1}{2}.
\end{aligned}$$
Hopefully you've learned to not take such an argument very seriously. But in
this case, the answer $\frac{1}{2}$ is correct:
P
Definition 20. Suppose a series n=1 an has partial sums S1 , S2 , . . . .The
Cesaro sum of the series is the limit of the arithmetic mean of the first m
partial sums: Pm
N =1 SN
C = lim .
m m

Proposition 19. If $\sum_{n=1}^{\infty} a_n = L \in \mathbb{R}$, then the Cesàro sum of $\sum_{n=1}^{\infty} a_n$ is L.
Proof. Fix an arbitrary $\varepsilon > 0$. Choose $N_0$ large enough so that for every
$N > N_0$, $|S_N - L| < \varepsilon$. Then apply the triangle inequality a few times:
$$\left|\frac{\sum_{N=1}^{m} S_N}{m} - L\right| \leq \left|\frac{\sum_{N=1}^{N_0} S_N}{m} - \frac{N_0}{m}L\right| + \left|\frac{\sum_{N=N_0+1}^{m} S_N}{m} - \frac{m-N_0}{m}L\right|$$
$$\leq \frac{1}{m}\cdot(\text{no } m \text{ dependence}) + \frac{m-N_0}{m}\,\varepsilon
\leq 2\varepsilon \quad \text{for } m \text{ sufficiently large.}$$

So Cesàro provides the right answer when you give him a convergent series.
But sometimes, he even gives an answer if you give him a divergent series! The
partial sums of Grandi's series are 1, 0, 1, 0, 1, 0, . . . , hence the Cesàro sum of
the series is $\frac{1}{2}$.
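Here is a small Python sketch (the number of terms is ours) that averages the partial sums of Grandi's series and watches the Cesàro mean settle at $\frac{1}{2}$:

def cesaro_means(terms):
    # Running arithmetic means of the partial sums S_1, S_2, ...
    partial, running, means = 0.0, 0.0, []
    for n, a in enumerate(terms, start=1):
        partial += a
        running += partial
        means.append(running / n)
    return means

grandi = [(-1) ** n for n in range(10000)]   # 1, -1, 1, -1, ...
print(cesaro_means(grandi)[-1])              # approximately 0.5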
A summation method is an assignment of a real value to some series. For example,
the standard summation method assigns to $\sum_{n=1}^{\infty} a_n$ the value $\lim_{N\to\infty} \sum_{n=1}^{N} a_n$.
[TODO discuss Borel summation, Ramanujan summation, Abel summation,
etc. Be sure to mention that silly $1 + 2 + \cdots = -\frac{1}{12}$ thing. Cesàro summation
is mentioned in Fourier analysis (summability is nice).]
We can sum a geometric series,
$$1 + x + x^2 + x^3 + \cdots = \frac{1}{1-x},$$
provided that $|x| < 1$. However, the right side of the equation, $\frac{1}{1-x}$, makes sense
for any $x \neq 1$. This lets us make fun claims like
$$1 + 2 + 4 + 8 + \cdots = -1$$
$$1 - 1 + 1 - 1 + \cdots = \frac{1}{2}.$$
[TODO analytic continuation; remark that there are other possible values for the sums
(analytically continue a different function); these values are not unique, for example...]
One sum that everyone loves is
$$1 + 2 + 3 + 4 + \cdots = -\frac{1}{12} \quad (?!?)$$
At the end of this section, we'll talk about how this can come from analytic
continuation of the Riemann zeta function. Beware, this uses some complex
analysis. To avoid that, here is a quick alternative derivation from [TODO cite].
First define
$$S(x) := 1 + 2x + 3x^2 + 4x^3 + \cdots = \frac{d}{dx}\left(x + x^2 + x^3 + \cdots\right) = \frac{d}{dx}\,\frac{x}{1-x} = \frac{1}{(1-x)^2}.$$

We'd like to plug in $x = 1$ but we can't yet. So we clear some denominators
and verify the identity
$$\frac{1}{(1+x)^2} = S(x) - 4xS(x^2).$$
In particular, formally plugging $x = 1$ into this identity gives $\frac{1}{4} = S(1) - 4S(1) = -3S(1)$, so
$$S(1) = 1 + 2 + 3 + 4 + \cdots = -\frac{1}{12}.$$
Yay!

Warning: Complex analysis in use

We are going to need analytic continuation from complex analysis. If you're
already familiar with this, feel free to skip a few paragraphs down to the part
about the Riemann zeta function. The idea is to take a (complex-)differentiable
function defined on some set $S \subseteq \mathbb{C}$ with an accumulation point and try to
extend it to a complex-differentiable function on a larger set $\Omega \supseteq S$. For
example, $f(z) := \sum_{n=0}^{\infty} z^n$ is equal to $g(z) := \frac{1}{1-z}$ for $|z| < 1$ but is not defined
for $|z| > 1$. But the function g, which agrees with f on the unit disk, is defined
and complex-differentiable on the larger set $\mathbb{C} \setminus \{1\}$, and we say that g is an
analytic continuation of f.
Remark 4. You may recall that an analytic function is one that is locally given
by a convergent power series. It turns out that one time differentiable, infinitely
differentiable, and analytic are all equivalent in complex analysis! Complex
differentiability is such a strong requirement that it gets you infinite differentiability
and analyticity.
Analytic continuation is incredibly useful because of the uniqueness of analytic
functions:
Theorem 21 (uniqueness of analytic functions). If $f, g : \Omega \to \mathbb{C}$ are complex-differentiable
on a connected open set $\Omega \subseteq \mathbb{C}$ and $f(z) = g(z)$ for all z in a
sequence of distinct points with an accumulation point, then $f(z) = g(z)$ on all
of $\Omega$.
This is an amazing result! It says there is only one analytic continuation: if
f and g agree on some smaller set S with an accumulation point (for example,
any open set), then they must be equal on the entire set $\Omega$. So we can talk
about the analytic continuation. Complex analysis is of course magic. For more
magic, see a complex analysis book like [4].

The Riemann zeta function $\zeta(s) := \sum_{n=1}^{\infty} \frac{1}{n^s}$ is important in analytic number
theory. This particular expression is defined (converges) for $\operatorname{Re} s > 1$, but we
would really like to analytically continue it to all of $\mathbb{C} \setminus \{1\}$. (Sorry harmonic
series, you still don't converge.) We won't prove the analytic continuation, but
we'll try to give some ideas for a proof.

The Riemann zeta function has a friend called the Gamma function,
$$\Gamma(s) := \int_0^\infty e^{-t}\,t^{s-1}\,dt, \quad \operatorname{Re} s > 0.$$
Gamma may look a bit scary, but by integrating by parts, we can verify the
functional equation $\Gamma(s+1) = s\Gamma(s)$ and conclude that $\Gamma(n+1) = n!$ for
$n \in \mathbb{N}$. (And yes, that is a factorial, not just us being very excited.) We can
analytically continue the gamma function using the functional equation to copy-paste
everything to the left one unit at a time. The only issue is the singularity
at $s = 0$, which gets translated to all the negative integers.
TODO picture of gamma function plot on R?
Gamma plays nicely with Riemann zeta. One way to prove the analytic
continuation of Riemann zeta is to form the auxiliary function
$$\xi(s) := \pi^{-s/2}\,\Gamma(s/2)\,\zeta(s),$$
and then prove analytic continuation of $\xi$. This involves the Poisson summation
formula and the theta series $\vartheta(t) := \sum_{k \in \mathbb{Z}} e^{-t k^2}$ along with some sums, integrals,
and computations. The end result is an integral formula for $\xi$ that works for
$s \in \mathbb{C}$, $s \neq 0, 1$. Then we get a formula for $\zeta$ by dividing by $\pi^{-s/2}\Gamma(s/2)$.
Theorem 22 (analytic continuation of Riemann zeta). $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$, $\operatorname{Re}(s) > 1$,
extends to a meromorphic function on $\mathbb{C}$ with only a simple pole at $s = 1$.
Now that we have analytic continuation of $\zeta$, we can have fun assigning
values to sums like $\zeta(-1) = 1 + 2 + 3 + \cdots$. By comparing poles and residues
of $\Gamma$ and $\zeta$, it turns out
$$1 + 2 + 3 + \cdots = \zeta(-1) = -\frac{1}{12}$$
$$\text{and} \quad 1 + 1 + 1 + \cdots = \zeta(0) = -\frac{1}{2}.$$
Tada! If you sum up all the positive integers you get $-\frac{1}{12}$, and if you just sum
1 a bunch of times you get $-\frac{1}{2}$.
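If you want to see these values fall out of the analytic continuation numerically, one option is the mpmath library (a Python sketch; this assumes you have mpmath installed, and it is not a reference used by this book):

from mpmath import zeta, nsum, inf

print(zeta(2))                               # 1.6449... = pi^2/6; the series converges here
print(nsum(lambda n: 1 / n**2, [1, inf]))    # same value, straight from the series
print(zeta(-1))                              # -0.08333... = -1/12, via analytic continuation
print(zeta(0))                               # -0.5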
https://cornellmath.wordpress.com/2007/07/30/sum-divergent-series-ii/
References: [3], complex analysis [4], analytic continuation of the Riemann
zeta function [2]

References
[1] R. B. Darst. "Most infinitely differentiable functions are nowhere analytic".
In: Canad. Math. Bull. 16 (Jan. 1973), pp. 597-598. doi: 10.4153/cmb-1973-098-3.
url: http://dx.doi.org/10.4153/CMB-1973-098-3.
[2] Anton Deitmar. A First Course in Harmonic Analysis. Springer, 2005.
[3] G. H. Hardy. Divergent Series. 1948.
[4] Elias Stein and Rami Shakarchi. Complex Analysis. Vol. 3. Princeton Lectures
in Analysis. Princeton University Press, 2003.
[5] Arild Stubhaug and Richard H. Daly. Niels Henrik Abel and his times:
called too soon by flames afar. Springer, 2000.

Chapter 4

Sequences of functions

Hopefully, at this point you're pretty comfortable with sequences of numbers,
and limits of such sequences. But another natural concept is a sequence of
functions. For example, we might want to try to use a sequence of polynomials,
which are super nice, to approximate an arbitrary continuous function, which
may not be as nice. Or we might want to use the sequence of Taylor polynomials
to approximate some $C^\infty(\mathbb{R})$ function. Whatever the motivation, sequences of
functions are useful, important, and can produce some interesting results (or
bedtime stories)!
[TODO: it might be nice to include some things about exchanging limits and
integrals in this chapter. The reader doesn't know about integrals yet, strictly
speaking, but maybe just a brief teaser.]

21 Cauchy's wrong theorem

The most straightforward sense in which a sequence of functions $f_1, f_2, \ldots$ could
converge to a function f is pointwise convergence, which just means that at
each point x, the sequence $f_1(x), f_2(x), \ldots$ converges to $f(x)$. For example, the
sequence 1.1x, 1.01x, 1.001x, . . . converges pointwise to the function $f(x) = x$.
(Figure 4.1.)
Now if we take a limit of a sequence of continuous functions, will we always
get a continuous function? That sounds pretty intuitive, and there's a version of
it which is true. But first, the version which is false: Cauchy's Wrong Theorem¹.
False Theorem 1 (Cauchy, 1821). Suppose $f_1, f_2, \ldots$ is a sequence of continuous
functions, which converges pointwise to the function f. Then f is continuous.
Actually, Cauchy said that if a series of continuous functions converges
pointwise, then the sum is continuous. But consideration of partial sums shows
that this is essentially the same thing as asserting Wrong Theorem 1.
1 A. Cauchy, Cours d'Analyse (1821).


Figure 4.1: Pointwise convergence: for each fixed $x_0$, the sequence $f_1(x_0), f_2(x_0), \ldots$
converges to $f(x_0)$.

In 1826, Abel saw this theorem and said, "...it seems to me that this theorem
admits exceptions."² Indeed, a counterexample to this theorem is the simple
sequence of functions depicted in Figure 4.2. But in 1833, in a book³, Cauchy
once again asserted False Theorem 1. It wasn't until 1853 that Cauchy finally
conceded that the theorem is wrong, saying⁴ [TODO is this citation correct?]:
As has been remarked by MM. Bouquet and Briot, this theorem
is verified for ordered series according to the ascending powers of
a variable. But for other series, it can not be admitted without
restriction. [TODO replace with actual translation this is from
Google Translate!!!]

Figure 4.2: Pointwise convergence of continuous functions to a discontinuous function.
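If you prefer a formula to a picture, a standard counterexample of the same flavor (not the sequence drawn in Figure 4.2; a Python sketch with sample points of our choosing) is $f_n(x) = x^n$ on [0, 1], whose pointwise limit jumps at x = 1:

# Each f_n(x) = x^n is continuous on [0, 1]; the pointwise limit is 0 for
# x < 1 and 1 at x = 1, which is discontinuous.
for x in [0.5, 0.9, 0.99, 1.0]:
    print(x, [round(x ** n, 6) for n in (1, 10, 100, 1000)])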

It's difficult to comprehend the fact that Cauchy made such an elementary
mistake, since he was usually quite careful. Indeed, some historians try to argue
that attributing Wrong Theorem 1 to Cauchy misinterprets his assertions in 1821
2 N. H. Abel, "Untersuchungen über die Reihe: $1 + \frac{m}{1}x + \frac{m(m-1)}{1\cdot 2}x^2 + \frac{m(m-1)(m-2)}{1\cdot 2\cdot 3}x^3 +$
u.s.w.", Journal für die reine und angewandte Mathematik, 1 (1826), 311-339. Quote (in
German) on pg. 9 of http://name.umdl.umich.edu/ABW7150.0001.001.
3 A. Cauchy, Résumés analytiques (1833).
4 A. Cauchy, "Note sur les séries convergentes dont les divers termes sont des fonctions continues
d'une variable réelle ou imaginaire, entre des limites données." In Oeuvres complètes,
Series 1, Vol. 12, pp. 30-36. Paris: Gauthier-Villars, 1900.

and 1833. They say that Cauchy actually meant a different, correct theorem.⁵
But this position seems untenable, considering Cauchy's own concession.
There's a tiny grain of truth in Cauchy's Wrong Theorem, though. If you
take a pointwise limit of continuous functions, what you end up with might be
discontinuous, but it won't be catastrophically discontinuous.

Definition 21. A function $f : \mathbb{R} \to \mathbb{R}$ is Baire class one if there is a sequence
of continuous functions $f_n : \mathbb{R} \to \mathbb{R}$ so that $f_n \to f$ pointwise.

For example, every continuous function is Baire class one. Figure 4.2 shows
that $\mathbb{1}_{[0,\infty)}$ is Baire class one. It is immediate from the definition of the derivative
that if f is differentiable, then $f'$ is Baire class one.

Theorem 23. Fix $D \subseteq \mathbb{R}$. Then D is the discontinuity set of some Baire class
one function if and only if D is meager and $F_\sigma$.

[TODO give pointers for proof. Note that the if direction comes from the
stronger fact that if D is meager and $F_\sigma$, then there is some derivative with
discontinuity set D.]

22 Walrus tusks and nasty pointwise limits


This section and the next are based on an exercise in Rudin [4].
[TODO we should give this and the next section catchier names. We should
come up with names for these sequences :D wmh] [dangerous thumbtacks (for
pointing up), safe thumbtacks (for pointing down), canine teeth, eye teeth,
walrus tusks, maxillary canines, mandibular canines,... -lhs]
[TODO check some of the statements, the irrational vs rational might be
mixed up...]
We've seen pointwise limits of continuous functions need not be continuous.
But we can make the limit fail to be continuous in spectacular ways. Here we will
construct a sequence of continuous nonnegative functions whose pointwise
values are bounded precisely at rational points! For reasons that will become
obvious, we'll call these functions Walrus tusk functions. At each rational r,
$f_n(r)$ is bounded, but at each irrational x, $f_n(x)$ is unbounded!
The idea is to start with the constant function n, which is unbounded as
$n \to \infty$. After enumerating the rationals, we want to make downward spikes
(that look like walrus tusks, Figure 4.3) at each rational number to ensure that
$f_n(r)$ is bounded. These spikes also need to keep getting skinnier and skinnier,
so that the irrational numbers near the spikes get to have their time at the level
n, instead of being stuck at low numbers because of the downward spike.
We have to be a little careful about how we define $f_n$. We don't want
overlapping spikes, and we need to make sure that the spikes do get skinny
enough fast enough to ensure that $f_n$ is unbounded on irrational numbers. So
5 For example, see D. Laugwitz, "Infinitely Small Quantities in Cauchy's Textbooks", Historia
Mathematica 14 (1987), pp. 258-274.



Figure 4.3: Walrus tusks at various locations. (a) Walrus tusks at $r_1$ and $r_2$.
(b) Walrus tusks at Cape Peirce, Alaska.

what we do is this: At each step, we put in a maximum of one new spike.
Suppose we're trying to construct $f_n$ from $f_{n-1}$ and we've inserted $m - 1 \leq n - 1$
spikes so far.

1. Look at the next rational $r_m$ in our enumeration. See if the spike centered
on $r_m$ with width $\frac{1}{n}$ fits without overlapping any current spikes. If so,
then insert it there. If not, do nothing. (This waiting is a key step, and
will ensure that $f_n(x)$ is unbounded on irrational numbers.)

2. Shrink all spikes inserted so far so they have width $\frac{1}{n}$.

Since we'll eventually get spikes at every rational r, $f_n(r)$ will be bounded (in
fact, have limit 0). At irrationals x, we'll get $f_n(x) = n$ infinitely often: If
$f_n(x) \neq n$, then x is in some downward spike. But because the spikes are
shrinking, eventually x will move back up to n, when it's sufficiently far away
from the rational number that had the spike. This is where the waiting is key:
x cannot go from being in one spike to being in a new spike without first going
back up to n. A new spike will only be inserted into $f_n$ if it doesn't overlap any
of the spikes in $f_{n-1}$. So eventually x will be out of the original spike, and it
can only go back below n (in a new spike) if it was not in a spike in the previous
iteration, i.e. it went back up to n in between.
You might complain that though $(f_n(x))$ is unbounded for irrational x, we
can't say much meaningful about a pointwise limit as $n \to \infty$. But we can fix
this:

Claim 1. There exists a sequence of continuous nonnegative functions $f_n$ with
$\lim_{n\to\infty} f_n(x) = +\infty$ if x is irrational, and $\lim_{n\to\infty} f_n(r) =: f(r) < \infty$ if r is
rational.

We'll use the same idea that made Thomae's function continuous precisely
at irrational numbers. The function $f_n$ is defined as follows: at each of the first
n rational numbers $\frac{p}{q}$, we set $f_n(p/q) = q$. We extend $f_n$ to all of $\mathbb{R}$ by linearly
interpolating, making it horizontal outside the interval containing the first n
rational numbers.

Figure 4.4: The function $f_3$ in the construction of a sequence $f_n$ such that $f_n(r)$ is
bounded (actually eventually constant) if r is rational and $f_n(x) \to \infty$ if x is irrational.

If x is rational, then $f_n(x)$ is eventually constant, so there are no problems
there. Suppose x is irrational. Intuitively, we should expect that $f_n(x)$ blows
up as $n \to \infty$, because we'll keep taking into account more and more rational
numbers which do a better and better job of approximating x and hence have
larger and larger denominators. A more careful proof: Fix an arbitrary $Q \in \mathbb{N}$;
we'll show that $f_n(x)$ is eventually larger than Q. Let $\varepsilon > 0$ be the smallest
distance from x to a rational number with denominator smaller than Q. There
are rational numbers $c_1, c_2$ which are within distance $\varepsilon$ of x satisfying $c_1 < x < c_2$.
Say $c_1$ and $c_2$ are among the first N rational numbers. Then for all $n > N$,
$f_n(x) \geq Q$.

Figure 4.5: The proof that $f_n(x) \to \infty$ if x is irrational. For each $n > N$, the value
of $f_n(x)$ is chosen by linearly interpolating between values which are both bigger than
Q, so $f_n(x) \geq Q$.

23 Less nasty pointwise limits


In the last section, we had a sequence of continuous nonnegative functions whose
pointwise limit was bounded precisely at rational points. How about the reverse?
Can we construct such a sequence whose sec:baire
pointwise limit is bounded precisely
at irrational points? Recall from Section 15 that it is possible to have a func-
tion discontinuous precisely at the rationals, but not the reverse. Is something
like that true here? Iffig:seqfcns-badlimits2
we wanted to construct such a sequence, we might try
something like Figure 4.6, which inverts the spikes from the first example in the
previous section.

Figure 4.6: Upside-down walrus tusks in various locations. (a) Upside-down walrus
tusks at $r_1$ and $r_2$. (b) More upside-down walrus tusks.



But this isn't good enough! For x irrational, there will always be some rational
close enough to x that will make $f_n(x)$ jump up, forcing $\{f_n(x)\}$ unbounded.
And in fact, it is impossible for such a sequence to exist.
Claim 2. There does not exist a sequence of continuous nonnegative functions
$\{f_n\}$ that is bounded precisely for irrational x.
The proof uses similar ideas as in Section 15. The key is to use Baire, and
try to write $\mathbb{R} \setminus \mathbb{Q}$ as a countable union of closed sets ($F_\sigma$), which is impossible.
We'll take the sets to be
$$Y_k := \{x \in \mathbb{R} : \forall n,\ f_n(x) \leq k\},$$
each of which is closed, and $Y = \bigcup_{k \in \mathbb{N}} Y_k$, so that $x \in Y$ if and only if $\{f_n(x)\}$ is
bounded. But then we cannot have $\mathbb{R} \setminus \mathbb{Q} = Y$, since by Baire, $\mathbb{R} \setminus \mathbb{Q}$ cannot be
written as a countable union of closed sets. So indeed such a task is impossible.
[TODO: obvious question: if we want a sequence of continuous nonnegative
functions whose pointwise values are bounded precisely on E, is it necessary and
sufficient that E is $F_\sigma$?]
Next, to get more confusing, let's ask the analogue of the second problem
from the previous section: Does there exist a sequence of continuous nonnegative
functions with $\lim_{n\to\infty} f_n(x) = +\infty$ if and only if x is rational? We just saw that
getting such a sequence with $\{f_n(x)\}$ unbounded iff x is rational is impossible,
but this one's different. It turns out that yes, we can. In fact, the example we
considered in Figure 4.6 works.
In summary:
1. There exists a sequence of continuous nonnegative functions $f_n$ with $\{f_n(x)\}$
bounded if and only if x is rational.
2. There does not exist a sequence of continuous nonnegative functions $\{f_n\}$
that is bounded precisely for irrational x.
3. There exists a sequence of continuous nonnegative functions $f_n$ with $\lim_{n\to\infty} f_n(x) = +\infty$
iff x is irrational, and a sequence with $\lim_{n\to\infty} g_n(x) = +\infty$ iff x is
rational.

24 Uniform convergence and metric spaces


Cauchys wrong theorem indicates that pointwise convergence is somehow too
weak. The pointwise limit of a sequence of functions doesnt really have behavior
which is the limit of the behaviors of the functions in the sequence, because
pointwise limits only pay attention to one point at a time.
A much stronger sort of convergence is uniform convergence. We say that
f1 , f2 , . . . uniformly converges to f if for every > 0, there is some N so that
for all n > N , for all x simultaneously, |fn (x) f (x)| < . In other words, the
distance between f and fn must go to zero, where the distance between two
functions is measured as the maximum vertical distance between their graphs.

Figure 4.7: The idea of uniform convergence. In this picture, for all x, g(x) lies
between $f(x) - \varepsilon$ and $f(x) + \varepsilon$, i.e. for all x, $|g(x) - f(x)| < \varepsilon$.

Recall the sequence 1.1x, 1.01x, 1.001x, . . . from Figure 4.1, which converges
pointwise to the identity function. But it does not converge uniformly to $f(x) = x$,
because for every n, the distances $|f_n(x) - f(x)|$ can get arbitrarily large by
taking x large.
But what if we consider functions with a compact domain, say functions
$[0, 1] \to \mathbb{R}$? Considered thus, the sequence above actually does converge uniformly,
because $|f_n(x) - f(x)| \leq 10^{-n}$ for all x. This might make you hope
that maybe for continuous functions defined on [0, 1], pointwise and uniform
convergence coincide... but no such luck. Consider the sequence of functions
depicted in Figure 4.8.

Figure 4.8: Pointwise convergence without uniform convergence.

These functions converge pointwise to the function $f(x) = 0$, because at
each $x > 0$, for large enough n, the spikes are always to the left of x. But they
don't converge uniformly to anything. To see why, first note that if they did,
it would have to be to $f(x) = 0$, since they converge pointwise to 0. But the
distance between $f_n$ and the zero function is 1 for every n.
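Here is a rough numerical version of this (a Python sketch; the triangular formula for the spike and the grid resolution are ours): the value at any fixed x tends to 0, but the sup distance to 0 never budges from 1.

def spike(n, x):
    # A tent of height 1 supported on [0, 1/n], peaking at x = 1/(2n).
    return max(0.0, 1.0 - abs(2 * n * x - 1.0))

grid = [i / 10**5 for i in range(10**5 + 1)]     # fine grid on [0, 1]
for n in (1, 10, 100):
    print(n, spike(n, 0.3), max(spike(n, x) for x in grid))
# spike(n, 0.3) -> 0 as n grows (pointwise convergence to 0),
# but the maximum over [0, 1] stays 1 (no uniform convergence).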


Now, we're ready to fix Cauchy's mistake. The correct version of Cauchy's
wrong theorem is: Every uniform limit of continuous functions is again continuous.
Proposition 20 (Uniform limit theorem for R). Let $f_n : \mathbb{R} \to \mathbb{R}$ be a sequence
of functions that converges uniformly to some $f : \mathbb{R} \to \mathbb{R}$. Assume that for some
$x_0 \in \mathbb{R}$, each $f_n$ is continuous at $x_0$. Then f is continuous at $x_0$.
Proof. This is sometimes remembered as the "$\varepsilon/3$ argument", since we write
$$|f(x) - f(x_0)| \leq |f(x) - f_N(x)| + |f_N(x) - f_N(x_0)| + |f_N(x_0) - f(x_0)|, \tag{4.1}$$
then show each term on the right side is smaller than $\varepsilon/3$ for x close to $x_0$ and
sufficiently large N.
By uniform convergence, choose N so that for all $n \geq N$,
$$\forall x : |f(x) - f_n(x)| < \varepsilon/3. \tag{4.2}$$
This takes care of both the first and third terms on the right side of (4.1)!
Since $f_N$ is continuous at $x_0$, choose a $\delta > 0$ so that
$$|x - x_0| < \delta \implies |f_N(x) - f_N(x_0)| < \varepsilon/3. \tag{4.3}$$
So now we've taken care of the second term, and we get in total,
$$|f(x) - f(x_0)| < \varepsilon,$$
for $|x - x_0| < \delta$, so f is continuous at $x_0$.


Whenever we talk about convergence of functions $(f_n)$, we need to specify
the type of convergence. In the beginning, we looked at pointwise convergence.
This is easy: plug in an x, and see if the sequence of real numbers $(f_n(x))$
converges to $f(x)$ or not. This seems like the obvious choice of convergence,
but it's really not the most natural one. Remember we can get some pretty weird
pointwise limits, and that a pointwise limit of continuous functions need not be
continuous. On the other hand, uniform convergence seems quite nice. Every
uniformly convergent sequence of continuous functions converges to a continuous
function. If we want to talk about the "space" of continuous functions and
convergence of sequences of these functions, we're going to want to use uniform
convergence. But first, we need the idea of a metric space:
Definition 22. Let M be a set. A function $d : M \times M \to [0, \infty) \subseteq \mathbb{R}$ is called
a metric on M if
1. $d(x, y) = 0 \iff x = y$
2. $d(x, y) = d(y, x) \quad \forall x, y \in M$
3. $d(x, y) \leq d(x, z) + d(z, y) \quad \forall x, y, z \in M$ (triangle inequality)

The pair (M, d) is called a metric space. A sequence $(x_n)_{n\in\mathbb{N}} \subseteq M$ converges if
there is an $x \in M$ such that
$$\forall \varepsilon > 0,\ \exists N \in \mathbb{N}\ \forall n \geq N : d(x_n, x) \leq \varepsilon. \tag{4.4}$$
Here's some examples:
• $M = \mathbb{R}$ and $d(x, y) = |x - y|$. The condition for convergence, (4.4), then
just becomes the usual definition of convergence of a sequence of real
numbers.
• Perhaps you were hoping the triangle inequality involved triangles? Take
the usual Euclidean metric on $\mathbb{R}^2$: the distance between two points is
$d((x_1, y_1), (x_2, y_2)) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. Now we get actual triangles
in the triangle inequality (Figure 4.9).

Figure 4.9: Matt has to walk further to get home if he stops to buy math books.
Matt is a nerd.

• Taxicab or Manhattan metric on $\mathbb{R}^2$: Here $d((x_1, y_1), (x_2, y_2)) = |x_1 - x_2| + |y_1 - y_2|$.
The name is inspired by streets laid out in a grid in Manhattan.
(Figure 4.10.) This is also called an $\ell^1$ norm.
• Let M be any set. We have the discrete metric $d(x, y) = \begin{cases} 0, & \text{if } x = y \\ 1, & \text{if } x \neq y. \end{cases}$
Everyone is so lonely and far away from everyone else! By taking $\varepsilon < 1$,
we see that the only convergent sequences are the ones that are eventually
constant!

Figure 4.10: A taxicab in a US city can only drive horizontally and vertically (red,
blue, yellow lines). The diagonal green line shows the usual Euclidean distance.

• Let X be a compact space, and let C(X) be the space of continuous functions
$X \to \mathbb{R}$. Equip it with the metric $d(f, g) = \sup_{x \in X} |f(x) - g(x)|$.
Convergence in this metric is uniform convergence: for all $\varepsilon > 0$, there
exists $N \in \mathbb{N}$ so that for all $n \geq N$, for all $x \in X$, $|f_n(x) - f(x)| < \varepsilon$.
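To make the examples above concrete, here is a short Python sketch (the chosen points and functions are ours) computing one distance in each of these metrics:

import math

p, q = (0.0, 0.0), (3.0, 4.0)
euclidean = math.hypot(p[0] - q[0], p[1] - q[1])   # 5.0
taxicab = abs(p[0] - q[0]) + abs(p[1] - q[1])      # 7.0
discrete = 0.0 if p == q else 1.0                  # 1.0

# Sup metric on C([0, 1]), approximated on a grid:
f = lambda x: x
g = lambda x: x ** 2
grid = [i / 1000 for i in range(1001)]
sup_dist = max(abs(f(x) - g(x)) for x in grid)     # 0.25, attained at x = 1/2

print(euclidean, taxicab, discrete, sup_dist)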
Remark 5. Recall from Section 13 the open balls $B_r(x)$ of radius r centered at
$x \in \mathbb{R}^n$. In general, for a metric space M, $B_r(x)$ is the set $\{y \in M : d(x, y) < r\}$;
all we did is generalize the notion of distance.
The uniform limit theorem for R (Proposition 20) generalizes to the same
result for metric spaces, utilizing basically the same proof.
Proposition 21 (Uniform limit theorem). Let X and Y be metric spaces, and
let $f_n : X \to Y$ be a sequence of functions that converges uniformly to some
$f : X \to Y$. Assume that for some $x_0 \in X$, each $f_n$ is continuous at $x_0$. Then
f is continuous at $x_0$.
One way to construct a metric is using norms.
Definition 23. Let V be an $\mathbb{R}$ or $\mathbb{C}$ vector space. A function $\|\cdot\| : V \to [0, \infty)$
is called a norm on V if
1. $\|x\| = 0 \iff x = 0$
2. $\|\lambda x\| = |\lambda|\,\|x\| \quad \forall x \in V, \lambda \in \mathbb{C}$
3. $\|x + y\| \leq \|x\| + \|y\| \quad \forall x, y \in V$
The pair $(V, \|\cdot\|)$ is called a normed space. A normed space becomes a metric
space by setting $d(x, y) = \|x - y\|$.
You're probably most familiar with the normed space $\mathbb{R}^n$ equipped with the
Euclidean norm, $\|x - y\| = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$. But the space C(X)

of continuous functions on a compact metric space X can also become a normed
space. The uniform or supremum norm is
$$\|f\|_\infty := \sup_{x \in X} |f(x)|, \tag{4.5}$$
and gives rise to the metric we've been using all along,
$$d(f, g) = \|f - g\|_\infty = \sup_{x \in X} |f(x) - g(x)|. \tag{4.6}$$
[TODO put norms on finite dim equivalent here?]
References: [2, 1]

25 Polynomials are pretty good at approximating continuous functions

While continuous functions seem pretty nice, we've seen that they still do weird
things like admit the nowhere differentiable Weierstrass function. So, which
kinds of functions are really really nice? (Other than constant functions.) Why,
polynomials! And conveniently, polynomials are actually quite good at approximating
continuous functions on an interval [a, b] (yes, even oddities like the
Weierstrass function or any other continuous function that looks nothing like a
polynomial)⁶. In fact, we have:
Theorem 24 (Weierstraß approximation). Let $f \in C([a, b], \mathbb{C})$. Then for every
$\varepsilon > 0$, there is a polynomial p with $\|f - p\|_\infty := \sup_{x \in [a,b]} |f(x) - p(x)| < \varepsilon$.
Thus there is a sequence of polynomials $(p_n)$ converging uniformly (i.e. in $\|\cdot\|_\infty$
norm) to f on [a, b].
Are you thinking the proof is going to be some kind of nonconstructive
density proof? And that that would be kind of disappointing? Fear not, we're
going to do it constructively!
First simplify to $f \in C([0, 1], \mathbb{C})$; for [a, b], just scale and translate appropriately.
We're going to use Bernstein polynomials to approximate f; these will
be our explicit $p_n$! Given $f \in C([0, 1])$, define its Bernstein approximation by
$$B_n(f)(x) := \sum_{k=0}^{n} f\left(\frac{k}{n}\right)\binom{n}{k} x^k (1 - x)^{n-k}.$$
What exactly are we doing here? We're sampling f at spacing $\frac{1}{n}$, and trying
to interpolate between those points with a polynomial. As n increases, the
sampling points get finer, and we'd hope the approximation gets better. (Figure 4.11.)
The proof that indeed $\|B_n(f) - f\|_\infty \to 0$ involves a decent amount of real
analysis and inequalities (not the best bedtime story material), but we'll outline
the idea. [TODO outline]
6 Power series are not enough, even for $C^\infty$ functions! cf. Section 18

Figure 4.11: (Above) Bernstein approximations of degrees 1, 2, 10, and 50 for the
function $|\frac{1}{2}x\cos(5x) + \sin(20x) + 2.5x|$ (shown in blue). (Below) Bernstein approximations
of degrees 100 and 1000.
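Here is a short Python sketch for experimenting with Bernstein approximations (the degrees and grid are ours; the test function is the one plotted in Figure 4.11). It evaluates $B_n(f)$ straight from the definition:

import math

def bernstein(f, n, x):
    # Evaluate the degree-n Bernstein approximation B_n(f)(x) on [0, 1].
    return sum(f(k / n) * math.comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

f = lambda x: abs(0.5 * x * math.cos(5 * x) + math.sin(20 * x) + 2.5 * x)
grid = [i / 200 for i in range(201)]
for n in (2, 10, 50, 200):
    err = max(abs(bernstein(f, n, x) - f(x)) for x in grid)
    print(n, round(err, 4))   # the (grid) sup-norm error shrinks, slowly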

Next up: How about approximating continuous functions on R instead of
just [a, b]? We don't want to use polynomials since they're not bounded and we
could have bounded continuous functions. A good compromise is $C^\infty(\mathbb{R})$, the
space of infinitely differentiable functions on R.
Theorem 25. Every function $f \in C(\mathbb{R})$ can be uniformly approximated (i.e. in
$\|\cdot\|_\infty$) by functions in $C^\infty(\mathbb{R})$.
Claim 3. Let $g(x) = \begin{cases} e^{-1/x^2}, & x \neq 0 \\ 0, & x = 0. \end{cases}$ Then $g \in C^\infty(\mathbb{R})$ and $g^{(n)}(0) = 0$ for
all $n \in \mathbb{N}$.
We'll revisit this function in Section 18 and actually prove the claim there.
The idea in the proof though is to use induction and the fact that $m^n e^{-m^2} \to 0$
as $m \to \infty$ for every fixed n.
Assuming the claim, take $h(x) = \begin{cases} g(x-1)\,g(x+1), & |x| \leq 1 \\ 0, & \text{otherwise.} \end{cases}$ Then $h \in C^\infty(\mathbb{R})$
with $h(x) = 0$ for $|x| \geq 1$, and $h \geq 0$ always. (Figure 4.12.)

Figure 4.12: Plot of $h(x) = e^{-\frac{1}{(x-1)^2} - \frac{1}{(x+1)^2}}\,\mathbb{1}_{[-1,1]}$.

The idea is to use h to make a smooth partition of unity (just a way of
writing 1 in a complicated way), which will be used to construct a $C^\infty$ function
$\tilde{f}$ close to f. If we didn't care about smoothness of $\tilde{f}$, we could just use the
rough cut-offs $\mathbb{1}_{[n,n+1]}$ as follows: On each interval [n, n+1] for $n \in \mathbb{Z}$, we would
just use the classical Weierstraß approximation theorem to get a polynomial
$p_n(x)$ with $|p_n(x) - f(x)| \leq \varepsilon$ on [n, n+1], then set
$$\tilde{f}(x) := \sum_{n \in \mathbb{Z}} p_n(x)\,\mathbb{1}_{[n,n+1]}(x).$$
This is roughly the idea, but since we need a $C^\infty$ function, we need smooth
cut-offs to form our partition of unity instead of rough cut-offs.
Start by defining
$$H(x) := \sum_{n \in \mathbb{Z}} h(x - n) > 0,$$
so we just put a copy of h at each integer and let them overlap a bit. Then
$H \in C^\infty(\mathbb{R})$, and for each x, the sum $\sum_{n \in \mathbb{Z}} h(x - n)$ is a finite sum. (Figure 4.13.)
Finally, we get to the actual cut-offs. We had to do all that work to make
sure these smooth cut-offs formed a partition of unity. Define
$$\widetilde{\Phi}_m(x) := \frac{h(x - m)}{H(x)} \in C^\infty(\mathbb{R}).$$
Then $\widetilde{\Phi}_m$ vanishes in $(-\infty, m-1] \cup [m+1, \infty)$, is positive in $(m-1, m+1)$,
and is $\leq 1$. (Figure 4.14.)

Figure 4.13: Plot of $H(x) = \sum_{n \in \mathbb{Z}} h(x - n)$. Note it is 1-periodic.

Figure 4.14: The smooth cut-offs $\widetilde{\Phi}_{-1}$, $\widetilde{\Phi}_0$, and $\widetilde{\Phi}_1$.

The important thing is that these smooth cut-offs $\widetilde{\Phi}_m$ are rigged so that for
all $x \in \mathbb{R}$,
$$\sum_{m \in \mathbb{Z}} \widetilde{\Phi}_m(x) = \frac{\sum_m h(x - m)}{H(x)} = 1.$$
This is a nice partition of unity.
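Here is a small numerical sanity check (a Python sketch; the sample points are ours) that the cut-offs built from g, h, and H really do sum to 1 at every point:

import math

def g(x):
    return math.exp(-1.0 / x**2) if x != 0 else 0.0

def h(x):
    return g(x - 1) * g(x + 1) if abs(x) <= 1 else 0.0

def H(x):
    # Only the integers within distance 1 of x contribute to the sum.
    n0 = math.floor(x)
    return sum(h(x - n) for n in (n0 - 1, n0, n0 + 1, n0 + 2))

def Phi(m, x):
    return h(x - m) / H(x)

for x in (0.0, 0.3, 2.71, -5.5):
    total = sum(Phi(m, x) for m in range(math.floor(x) - 2, math.floor(x) + 3))
    print(x, round(total, 12))   # always 1.0, as promised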


Now as mentioned earlier, we use the classical Weierstra theorem to get
polynomials pn with maxx[n1,n+1] |pn (x) f (x)| . We take

X
fe(x) := e n (x),
pn (x) (4.7)
nZ

which is in $C^\infty(\mathbb{R})$ and which for any $x \in \mathbb{R}$ satisfies
$$|\tilde{f}(x) - f(x)| = \left|\sum_{n \in \mathbb{Z}} (p_n(x) - f(x))\,\widetilde{\Phi}_n(x)\right| \leq \sum_{n \in \mathbb{Z}} \underbrace{|p_n(x) - f(x)|}_{\leq\,\varepsilon}\,\widetilde{\Phi}_n(x) \leq \varepsilon,$$
since $\max_{x :\, \widetilde{\Phi}_n(x) \neq 0} |p_n(x) - f(x)| \leq \varepsilon$. So Theorem 25 is proved.
Finally, instead of approximating $f \in C([0, 1])$ by polynomials, we can approximate
it by trigonometric polynomials of the form
$$T_N(x) = \sum_{j=-N}^{N} c_j\,e^{2\pi i j x}, \quad c_j \in \mathbb{C}. \tag{4.8}$$
We use $\mathbb{T}$ to denote the 1-dimensional torus or circle $\mathbb{R}/\mathbb{Z}$, which is also often
identified with the interval [0, 1). Then we have:

Theorem 26 (Weierstraß trig approximation). Let $f \in C(\mathbb{T})$. Then for every
$\varepsilon > 0$, there is a trigonometric polynomial T with $\|f - T\|_\infty < \varepsilon$. Thus there is
a sequence of trig polynomials $(T_n)$ converging uniformly to f on $\mathbb{T}$.
We'll revisit trigonometric polynomials in Chapter 10 on Fourier analysis.
The fast proof of this theorem is to use the classical Weierstraß theorem on
F, where $F(e^{2\pi i x}) := f(x)$. Then the polynomial approximating $F(y)$ becomes
a trig polynomial approximating $f(x)$ via the map $y = e^{2\pi i x}$. An alternative
proof without the classical Weierstraß theorem uses the Fejér kernel from Fourier
analysis.
Finally, we'll mention there is a really good generalization of the (classical)
Weierstraß approximation theorem. All the previous approximation theorems
in this section can be proved fairly quickly from this one.⁷

Theorem 27 (Stone-Weierstraß, real version). Let X be a compact metric
space. Let A be a sub-algebra of $C(X, \mathbb{R})$ (i.e. A is a vector space of continuous
functions which is closed under multiplication), which separates points
and vanishes at no point. Then $\overline{A} = C(X, \mathbb{R})$.

Okay, so there were some words there we haven't defined yet. But the
idea is if we have a set $A \subseteq C(X, \mathbb{R})$ satisfying certain properties, then it's
automatically dense in $C(X, \mathbb{R})$. Also, there is a complex version for $C(X, \mathbb{C})$.
We recommend looking at the relevant sections in nearly any real analysis book,
like [6], [1], or [3], if you are interested in learning about the Stone-Weierstraß
theorem.
7 One of the standard proofs of the Stone-Weierstraß theorem uses the classical Weierstraß
theorem, but it's only a little more work to avoid using it in the proof of Stone-Weierstraß.
26. VARIANCE OF DIMENSION 99

26 Variance of dimension
variance-of-dimension
What is it that makes planes, spheres, and Mobius strips two-dimensional, while
lines, circles, and knots are one-dimensional? A typical answer one hears is: it
takes two coordinates to specify a point in a two-dimensional space (e.g. x and
y for the plane, or latitude and longitude for the sphere) whereas it only takes
one coordinate to specify a point in a one-dimensional space. [TODO include
picture]
Cantor realized that there is something to be proven here. As he put it in
an 1877 letter to Dedekind8

For several years I have followed with interest the efforts that have
been made, building on Gauss, Riemann, Helmholtz, and others,
towards the clarification of all questions concerning the ultimate
foundations of geometry. It struck me that all the important in-
vestigations in this field proceed from an unproven presupposition
which does not appear to me self-evident, but rather to need a justifi-
cation. I mean the presupposition that a -fold extended continuous
manifold needs independent real coordinates for the determina-
tion of its elements, and that for a given manifold this number of
coordinates can neither be increased nor decreased.
This presupposition became my view as well, and I was almost con-
vinced of its correctness. The only difference between my standpoint
and all the others was that I regarded that presupposition as a theo-
rem which stood in great need of a proof; and I refined my standpoint
into a question that I presented to several colleagues, in particular
at the Gauss Jubilee in Gottingen. The question was the following:
Can a continuous structure of dimensions, where > 1, be related
one-to-one with a continuous structure of one dimension so that to
each point of the former there corresponds one and only one point
of the latter?

In todays terminology, Cantor was wondering, how does the cardinality of [0, 1]
compare to the cardinality of [0, 1]2 ? When someone says that two coordinates
are necessary to specify an arbitrary point in the unit square [0, 1]2 , they seem
to implicitly be claiming that |[0, 1]2 | > |[0, 1]|. Surprisingly, they are wrong:

al-square-cardinality Proposition 22. |[0, 1]| = |[0, 1]2 |.

Proof. There is an obvious injection establishing that |[0, 1]| |[0, 1]2 |, so it
suffices to show the reverse inequality. Define f : [0, 1]2 [0, 1] by interleaving
digits! That is, if x = 0.x1 x2 x3 . . . and y = 0.y1 y2 y3 . . . , then we set

f (x, y) = 0.x1 y1 x2 y2 x3 y3 . . .
8 F. Gouvea, Was Cantor Surprised?, The American Mathematical Monthly (March

2011).
100 CHAPTER 4. SEQUENCES OF FUNCTIONS

(For the purposes of this definition, if x or y has multiple decimal expansions,


use the one with trailing nines.) This is an injection, because to recover x and
y, we just write down the decimal expansion of f (x, y) (choosing the one with
trailing nines if there is a choice) and then read off the digits of x and y.
So a spheres two-dimensionalness certainly does not come from the fact that
it takes two coordinates to specify a point on a sphere. Anything that can be
specified with two numbers can be specified with one number! (Similar remarks
apply to theprop:interval-square-cardinality
concept of degrees of freedom.) Naturally, Cantor interpreted
Proposition 22 as problematic for the foundations of geometry. His letter to
Dedekind continues:

Most of those to whom I presented this question were extremely


puzzled that I should ask it, for it is quite self-evident that the
determination of a point in an extension of dimensions always
needs independent coordinates. But whoever penetrated the sense
of the question had to acknowledge that a proof was needed to show
why the question should be answered with the self-evident no. As
I say, I myself was one of those who held it for the most likely that
the question should be answered with a no until quite recently I
arrived by rather intricate trains of thought at the conviction that
the answer to that question is an unqualified yes. Soon thereafter I
found the proof which you see before you today.
So one sees what wonderful power lies in the ordinary real and irra-
tional numbers, that one is able to use them to determine uniquely
the elements of a -fold extended continuous manifold with a single
coordinate. I will only add at once that their power goes yet further,
in that, as will not escape you, my proof can be extended without
any great increase in difficulty to manifolds with an infinitely great
dimension-number, provided that their infinitely-many dimensions
have the form of a simple infinite sequence.
Now it seems to me that all philosophical or mathematical deduc-
tions that use that erroneous presupposition are inadmissible. Rather
the difference that obtains between structures of different dimension-
number must be sought in quite other terms than the number of
independent coordinates the number that was hitherto held to be
characteristic.

Dedekind realized that Cantors discovery posed a serious challenge to the


concept of dimensionality, but also that this challenge must be met: we should
not simply give up and declare that dimension is a meaningless concept.
Dedekind responded, saying9
...I declare (despite your theorem, or rather in consequence of reflec-
tions which it stimulated) my conviction or my faith (I have not yet
9 W. Ewald, From Kant to Hilbert, Volume 2, Clarendon Press Oxford (2007).
26. VARIANCE OF DIMENSION 101

had time even to make an attempt at a proof) that the dimension-


number of a continuous manifold remains its first and most impor-
tant invariant, and I must defend all previous writers on this subject.
To be sure, I gladly concede that the constancy of the dimension-
number is thoroughly in need of proof, and so long as this proof has
not been furnished one may doubt. But I do not doubt this con-
stancy, although it appears to have been annihilated by your theo-
rem. For all authors have clearly made the tacit, completely natural
presupposition that in a new determination of the points of a con-
tinuous manifold by new coordinates, these coordinates should also
(in general) be continuous functions of the old coordinates... Now,
for the time being, I believe the following theorem: If it is possible
to establish a reciprocal, one-to-one, and complete correspondence
between the points of a continuous manifold A of a dimensions and
the points of a continuous manifold B of b dimensions, then this
correspondence itself, if a and b are unequal, is necessarily utterly
discontinuous.
[TODO explain why Cantors surjection is not continuous. Also discuss Z-
order curves.]
In other words, Dedekind conjectured that there is no continuous bijection
[0, 1]n [0, 1]m with n 6= m. As Brower proved in 1911, Dedekinds conjecture
is correct. The phenomenon described by this theorem and similar theorems is
called Invariance of Dimension.
variance-of-dimension Theorem 28 (Invariance of Dimension). If n 6= m, then there is no continuous
bijection [0, 1]n [0, 1]m .
It would take us too far afield to discuss any of the ideas that go into the
thm:invariance-of-dimension
proof of Theorem 28; the theorem is really about topology, not analysis. (See
[TODO cite Hatcher] for a proof.) [TODO mention the people who tried and
failed to prove it see The problem of the invariance of dimension in the
growth of modern topology. Thomae, Netto, Cantor all published incorrect
proofs. Cantor, in fact, claimed a proof that there are no continuous surjections
[0, 1]n [0, 1]m if m > n, which is false. Brower finally published a correct
thm:invariance-of-dimension
proof.] Instead, well discuss the surprising fact that Theorem 28 is not robust.
That is, on first reading the theorem, youre probably tempted to draw the
inference that from the perspective of continuous maps, [0, 1]2 is bigger than
[0, 1]. But Peano discovered in 1890 that this is not true!
m:space-filling-curve Theorem 29. There exists a continuous surjection [0, 1] [0, 1]2 .
A continuous map f : [0, 1] [0, 1]2 is also called a curve we write f (t)
and think of t as time. For example, the curve

f (t) = (cos(2t), sin(2t))


traces out a circle at unit speed in the counterclockwise direction. Intuitively,
one would expect that (like in the case of a circle) the image f ([0, 1]) R2 of
102 CHAPTER 4. SEQUENCES OF FUNCTIONS

thm:space-filling-curve
any curve must be a thin, one-dimensional sort of an object. But Theorem 29
tells us that there are curves which fill up the entire unit square; such curves
are called space-filling curves.
So how does the proof go? Well sketch a slight variant of Peanos original
proof, due to Hilbert. Hilbert devised a sequence of curves f1 , f2 , . . . , each of
which gets a little closer to filling up the unit square than the last. We wont
fig:hilbert-curve
bother giving a careful definition of fn ; take a look at Figure ?? and you should
see the pattern. Then, we let f be the limit of the fn curves.
fig:hilbert-curve

Figure 4.15: The first four curves in the sequence defining the Hilbert curve.

To finish the proof, we need to show two things. First, we need to show
that the sequence of curves is uniformly convergent. (That way, we can be sure
that f is continuous, i.e. is a curve.) [TODO I dont think weve mentioned
Cauchy sequences or complete metric spaces in this whole book... but its kinda
unavoidable here. Maybe well just skip the proof?] Now, we need to show that
f ([0, 1]) = [0, 1]2 . Certain elementary topological considerations10 imply that
it is sufficient to show that for every open set U [0, 1]2 , the curve f passes
through U , i.e. f ([0, 1]) U 6= . But this is clear from the pictures: for all
sufficiently large n, fn ([0, 1]) intersects U , so f ([0, 1]) does as well.
Space-filling curves are seriously messed up. Invariance of dimension tells
us that it is impossible to parameterize the unit square by a single coordinate t
in such a way that the point associated with t varies continuously with t, and
10 The continuous image of a compact set is compact, so f ([0, 1]) is a closed subset of [0, 1]2 .
26. VARIANCE OF DIMENSION 103

every point in the square has exactly one associated t. But the existence of
space-filling curves means that a parameterization is possible where every point
in the square has at least one associated t! That seems like it should have been
the hard part!
The weirdness doesnt stop here. A Jordan curve is a continuous injection
[0, 1] [0, 1]2 . (So the curve never crosses over itself.) Invariance of dimension
tells us that if f is a Jordan curve, then f ([0, 1]) 6= [0, 1]2 . At this point, youre
probably tempted to draw the inference that if f is a Jordan curve, then f ([0, 1])
has to be small the reason that f ([0, 1]) cannot equal [0, 1]2 is that the square
is just too big. But thats not right either! In TODO, Osgood discovered that
[0, 1]2 is not too big for a Jordan curve to fill up. Rather, [0, 1]2 is just shaped
wrong! Precisely:

Theorem 30. There exists a Jordan curve f : [0, 1] [0, 1]2 whose image
f ([0, 1]) [0, 1]2 has positive area (Lebesgue measure.)

Proof sketch. Even better, well show that for any < 1, there is a curve (an
Osgood curve) whose image has Lebesgue measure at least . (Well define
sec:lebesuge-measure
Lebesgue measure formally in Section ??, but for now, just think of it as the good
definition of area.) [TODO the reader doesnt know about measure yet...?
TODO also cite 2D Lebesgue measure?] Like the Hilbert curve, well construct
the Osgood curve by giving a sequence of fig:osgood-2
curves.
fig:osgood-1
Again,
fig:osgood-3
we wont fig:osgood-6
fig:osgood-4
bother
fig:osgood-5
with an explicit definition. See figures 4.16, 4.17, 4.18, 4.19, 4.20, and 4.21 for
f1 , f2 , f3 , f4 , f5 , f6 ; you should be able to see the general pattern for fn . (The
curve fn is in black.) The crucial parameter is the thickness of the gray grate.
When forming fn+1 , we choose the new grate to have thickness

1
.
4 6n1
With this choice, a straightforward calculation shows that the set of points
which never appear in a grate has total area .
As with the Hilbert curve, we should show that the sequence fn converges
uniformly. [TODO discuss.] Let f = lim fn . Now, well show that f ([0, 1]) has
Lebesgue measure at least , by showing that every point p which never appears
in a grate is in the image of f . Just like when we showed that the Hilbert curve
was surjective, it suffices to show that f passes through every neighborhood of
p, which is quite clear from the pictures.
Finally, we need to show that f is injective. This is also fairly clear from
the pictures. Fix t1 6= t2 , and consider the sequences fn (t1 ) and fn (t2 ). If both
sequences are eventually in the grate, then at that time, they will not intersect
(since every fn is injective) and the sequences will be constant from then on.
Hence, f (t1 ) 6= f (t2 ). If one sequence is eventually in the grate but the other
avoids the grate forever, then f (t1 ) 6= f (t2 ), because one of f (t1 ), f (t2 ) is in
the grate while the other is not. (Note that we need to exclude the grates
boundary in the definition of the grate to make this argument work.) Finally, if
both sequences avoid the grate forever, then they will eventually be in distinct
104 CHAPTER 4. SEQUENCES OF FUNCTIONS

fig:osgood-1 Figure 4.16: The first curve in the sequence defining the Osgood curve.

white squares, and clearly this implies that they will not converge to the same
thing, so again, f (t1 ) 6= f (t2 ).

Sagan
For further reading about space-filling curves, see the beautiful little book
[5].
26. VARIANCE OF DIMENSION 105

fig:osgood-2 Figure 4.17: The second curve in the sequence defining the Osgood curve.

fig:osgood-3 Figure 4.18: The third curve in the sequence defining the Osgood curve.
106 CHAPTER 4. SEQUENCES OF FUNCTIONS

fig:osgood-4 Figure 4.19: The fourth curve in the sequence defining the Osgood curve.
26. VARIANCE OF DIMENSION 107

fig:osgood-5 Figure 4.20: The fifth curve in the sequence defining the Osgood curve.
108 CHAPTER 4. SEQUENCES OF FUNCTIONS

fig:osgood-6 Figure 4.21: The sixth curve in the sequence defining the Osgood curve.
refsection:6

Chapter 5

Differentiation

I turn away with fright and


horror at the lamentable plague
of functions without derivatives.

Charles Hermite [TODO citation]

27 Discontinuous derivative
sec:discont-deriv
Lets start with a quick review of derivatives. Differentiability is a stronger sort
of predictability than continuity. Roughly, we say that a function f : R R is
differentiable at x if there is some linear transformation T : R R such that
for all small x, we have f (x + x) f (x) + T (x). The function T is called
the Frechet derivative of f at x. Since we are working in just one dimension, T
is necessarily of the form T (x) = x; the number is called the derivative of
f at x and denoted f (x). [TODO picture. Also make less confusing and more
enlightening] The exact definition:

Definition 24 (Differentiability). We say f : R R is differentiable at x = c


(for c R) if
f (c + h) f (c)
lim
h0 h
exists. In this case, the limit is denoted by f (c), and called the derivative at c.
If f is differentiable at all c R, then we say f is differentiable.

Its easy to come up with a continuous function which fails to be differentiable


somewhere, e.g. f (x) = |x|. The derivative of this function exists everywhere
fig:abs-val-deriv
except 0, where it jumps from 1 to +1. (See Figure 5.1.)
But what can we say if we require that f be differentiable everywhere? Can
the derivative still jump? Must the derivative be continuous?

109
110 CHAPTER 5. DIFFERENTIATION

0 0

fig:abs-val-deriv Figure 5.1: The absolute value function and its derivative.

The answer to the latter question is no. Heres well get a function that is
differentiable everywhere, but its derivative is not continuous at x = 0. Let
( 
x2 sin x1 , x 6= 0
f (x) = .
0, x=0
fig:discont-deriv
Its graph near zero is shown in Figure 5.2.

fig:discont-deriv Figure 5.2: Plot of f (x) = x2 sin(x1 ) near 0.

Using the limit definition and squeeze theorem, we can show that f (0) = 0.
But, for all x 6= 0, we can use the product rule to compute,
   
1 1
f (x) = 2x sin cos .
x x

However, lim f (x) = lim (2x sin(1/x) cos(1/x)) does not exist because of
x0 x0
oscillation! Therefore, f exists but is not continuous at x = 0.
Remark 6. A relative of f has the following property: g is differentiable with
g (0) > 0, but there is no open interval I containing 0 with g (x) > 0 for all
x I. Somehow, g manages to oscillate so quickly that every neighborhood of
zero has points with negative derivative! We let
(
x2 sin(1/x) + 0.001x, x 6= 0
g(x) = .
0, x=0
28. DARBOUXS THEOREM 111

Similar to how we showed things for f , we can show that g (0) = 0.001 > 0, but
1
that if x = 2k for k Z, then g (x) = 0.999 < 0.

28 Darbouxs theorem
sec:conway-base-13
Recall from Section 9 that a Darboux function is a function which satisfies the
conclusion of the intermediate value theorem. That is, f is Darboux if for every
a < b and every y between f (a) and f (b), there is some x (a, b) so that
f (x) = y. The IVT says that every continuous function is Darboux. Conways
base 13 function is a discontinuous function which is Darboux. A step function
with two steps is an example of a function which is not Darboux.
In the last section, we saw that derivatives can be discontinuous. But not ev-
ery function is a derivative! In particular, one necessary condition for a function
to be a derivative is that it is Darboux:
Theorem 31 (Darbouxs theorem). Suppose f : R R is differentiable. Then
f is Darboux.
Proof. Without loss of generality, assume that f (a) > y > f (b). Let g(x) =
f (x) yx. Let x be the point in [a, b] where g(x) is maximized. Since g
is increasing at a and decreasing at b, we must have x (a, b). Therefore,
g (x) = 0, and hence f (x) = y as desired.
[TODO draw picture]
[TODO this section is very short. maybe combine with something like pre-
vious section, or mention that characterizing derivatives is hard? -lhs]

29 Continuous but nowhere differentiable func-


tions
We know that continuous functions can fail to be differentiable at points, but
can we find a continuous function that is differentiable at no points? Turns out,
yes! The classic example is the Weierstrass function; Karl Weierstrass proved
that the function
X
f (x) = an cos(bn x),
n=0

with 0 < a < 1 and b a positive odd integer such that ab > 3
2 + 1, is everywhere
continuous and nowhere differentiable. As an example, the function

X
g(x) = (3/4)n cos(9n x)
n=0

satisfies the requirements on a and b. All the partial sums are differentiable
(finite sum of cosines), but it turns out that in the infinite sum, f oscillates so
rapidly at each point that it is nowhere differentiable!
112 CHAPTER 5. DIFFERENTIATION

4 2 0 2 4
P5 n
fig:cont-not-diff2 Figure 5.3: The partial sum n=0 (3/4) cos(9n x) looks quite messy.

Heres another example of a nowhere differentiable function (same idea, but


a little simpler). Let g(x) := minnZ |x n|; its graph is shown below.

1
2

2 1 1 2

Then g is differentiable exactly at R \ 12 Z (everywhere except integers and half-


integers), with absolute value of the derivative 1 where it exists. g is continuous
and is 0 exactly at Z. Define

X
1
f (x) := n
g(2n x).
n=0
2

PN 1 n
Each partial sum fN (x) = n=0 2n g(2 x) is continuous, and


X
1 1 X 1 N
kfN f k kgk 0,
2n 2 2n
n=N +1 n=N +1

so f is continuous (apply TODO cite uniform convergence thm on C([0, 1]) and
use periodicity). Now pick some point x R, and choose

j j+1
un := n
x < n =: vn . (5.1)
2 2
30. DERIVATIVES AT INFINITY 113

f (vn ) f (un )
If f was differentiable at x, then the ratio would converge to
v n un
f (x) as n . But in fact,
n1
f (vn ) f (un ) X g( 2nk ) g( 2nk ) X 1 g(2kn (j + 1)) g(2kn j)
j+1 j
= + ,
v n un 2k (vn un ) 2k v n un
k=0 | {z } k=n | {z }
=:dn =0 by periodicity of g
(2kn (j+1),2kn (j)Z)
(5.2)
where dn {1} since g always looks like x or x on [0, 1]. But then
n1
f (vn ) f (un ) X
= dn
vn un
k=0

does not converge. So no derivative anywhere!


[TODO mention that the coordinate functions of the Hilbert curve from
sec:space-filling-curves
Section ?? are also examples of continuous but nowhere differentiable functions.
Maybe even prove it]
[TODO continuous but nowhere differentiable functions are 2nd category]

30 Derivatives at infinity
Suppose f is differentiable and limx f (x) = L exists and is finite. What
can we say about limx f (x)? Intuitively, it seems like f should go to zero
since f has a limit so probably looks pretty flat for large x. But this intuition
is wrong! It is possible for f to have no limit as x .
The easiest way to construct such an example is to make f oscillate a lot as
x , but in decreasing amplitude so that f has a limit. Then the oscillations
will hopefully prevent f from limiting to zero. In fact, this is exactly what
sin(x2 )
happens with f (x) = . Although limx f (x) = 0, we can calculate the
x
derivative to be
d  1  sin(x2 )
f (x) = x sin(x2 ) = + 2 cos(x2 ),
dx x2
which has no limit as x . The function f oscillates faster and faster
(because of the x2 ) as x increases, but at the same time, the x1 factor squeezes
fig:deriv-at-infinity-osc1
f to zero. The derivative, however, oscillates worse and worse! (Figure 5.4.)
3
We even have the function g(x) = sin(x x
)
, which oscillates so fast that the
3
derivative g (x) = 3x cos(x3 ) sin(x
fig:deriv-at-infinity-osc2 x2
)
is not even bounded as x +! (Fig-
ure 5.5.)
Okay, so we just saw that oscillation at can prevent limx f (x) from
existing even if limx f (x) exists and is finite. But what if we impose some
conditions on f so it cant oscillate? How about we require f to be monotonic?
That seems like a pretty good way to ensure that f looks really flat at .
114 CHAPTER 5. DIFFERENTIATION

0.5

0.5
0 5 10 15 20
2
Figure 5.4: The function f (x) = sin(x x
)
oscillates a lot. Even though it gets squeezed
fig:deriv-at-infinity-osc1 to a limit, its derivative is out of control!

0.5

0.5

0 5 10 15 20
3 2
Figure 5.5: The graph of sin(xx
)
looks similar to that of sin(x
x
)
, but its derivative
is even more out of control! TODO increase samples to more like 3000 or 4000 when
fig:deriv-at-infinity-osc2 actually compiling

But even assuming this, we still cannot ensure limx f (x) = 0! We can rig
a function so that it increasing and differentiable, but near +, it increases in
small discrete bursts that prevent f from having a limit! Let g be the following
tent-post function,where each tent-post is a triangle centered at an integer
n N. Each triangle has height 1, and base length 21n . This way, the triangle
31. BUMP FUNCTIONS AND PARTITIONS OF UNITY 115

1
centered at n N has area 2n+1 .

1 2 3

Define Z x
f (x) := g(t) dt,
0

so f = g since g is continuous (fundamental theorem of calculus). Then


Z
X 1 1
lim f (x) = g(t) dt = = , (5.3)
x 0 n=1
2n+1 2

but limx f (x) = limx g(x) does not exist!

31 Bump functions and partitions of unity


Can you think of an infinitely differentiable function that zero outside of (1, 1)
and positive on (1, 1)? Certainly something like this must exist [TODO why???
Ive always found it surprising that these things exist! -wmh], but we run in
to some problems if we start trying to glue standard funtions like polynomials
together at 1 and 1. We can make it have say k derivatives, but infinitely
many???
But heres an explicit formula for such a function:
(
1
e 1x2 , 1 < x < 1
h(x) = .
0, |x| 1

This function h is a called a bump function, and it is identically zero outside


(1, 1), positive only on (1, 1), and infinitely differentiable. See the graph of
fig:bump
h in Figure 5.6.
TODO Rn
TODO partitions of unity
116 CHAPTER 5. DIFFERENTIATION

fig:bump Figure 5.6: A speed bump function. Also known as h.

32 Multivariable limits, derivatives, and local


extrema are weird
Definition 25. directional derivative, differentiability

Claim 4. There exists a function R2 R with no limit at (0, 0), even though
all the directional limits exist and are equal.

clarify directional limit

Claim 5. There exists a function R2 R whose directional derivatives all exist


yet is not differentiable.

differentiable vs equal mixed partials (gelbaum pg.120) i.e. why those re-
quirements in Clairauts theorem are necessary. Also include a discussion about
why Clairauts theorem is true at all! (Maybe involving Fubinis theorem.)

Claim 6. There exists

(i) a differentiable function R2 R which has exactly one critical point,


which is a local minimum but not a global minimum

(ii) a differentiable function R2 R which has two local maxima but no local
minimum

Remark 7. True R R.

both example functions below from Stewarts calculus

f (x, y) = 3xey x3 e3y ,

has only one critical point (a local maximum) but has no absolute extrema (it
is unbounded). Since f is differentiable, we find critical points by solving


f = 3ey 3x2 , 3xey 3e3y = 0.

The only critical point turns out to be (1, 0), which is a local maximum by the
second partials test. (D = fxx fyy (fxy )2 , and D(1, 0) = (6)(39)9 = 27 >
0 while fxx (1, 0) = 6 < 0.) We can evaluate f (1, 0) = 1. However, looking at
f (x, 0) = 3x x3 1 as x , we see f (x, 0) gets arbitrarily large so that
32. MULTIVARIABLE LIMITS, DERIVATIVES, AND LOCAL EXTREMA ARE WEIRD117

f has no absolute maximum value. Thus the local maximum we found is not a
global maximum.
TODO GRAPH
TODO single variable

f (x, y) = (x2 1)2 (x2 y x 1)2 ,


has two local maximums but no local minimums.critical points (1, 2) and (1, 0).
second partials test to show that they are both local maximums.
118 CHAPTER 5. DIFFERENTIATION
refsection:7

Chapter 6

Measure
chap:measure
Then you also, perhaps you have
some faults? I do not believe
so.

Borel and Lebesgue [TODO cite


Neyman]

33 Episode I: The Phantom Measure (Measure


is problematic)
sec:vitali
What does area mean? The area of a plane region E R2 somehow measures
the size of E, but this is a different notion of size than cardinality, as even
Aristotle realized back in the fourth century B.C. while trying to resolve Zenos
paradoxes [TODO cite Physics, Book V]:
For there are two ways in which length and time and generally any-
thing continuous are called infinite: they are called so either in re-
spect of divisibility or in respect of their extremities.
At least one high school geometry book1 defines the area of a plane region to be
the number of nonoverlapping unit squares which fill the region. And then, in
almost the same breath, the book asserts (with no justification) that the area
of a triangle is 12 bh, where b is the length of its base and h is its height. But
you cant fill a triangle with nonoverlapping unit squares!
It helps to recall where the formula 12 bh comes from (at least in a special
case.) We put the triangle in a b h rectangle, and then draw a vertical line
fig:area-of-triangle
segment down from the top point to the base. (See Figure 6.2.) The rectangle
is divided into two pairs of congruent pieces, with one piece from each pair
inside the triangle. So we conclude that the triangle fills up half the area of the
1 Well omit the citation, to spare the book from embarrassment.

119
120 CHAPTER 6. MEASURE

Figure 6.1: Duh, you cant fill a triangle with nonoverlapping unit squares.

Figure 6.2: A proof that the area of a triangle is 12 bh. Note that this argument
only makes any sense if the triangle actually fits inside the rectangle, i.e. only if the
fig:area-of-triangle top vertex of the triangle is above the base.

rectangle. This argument made several assumptions about how area works, like
that congruent shapes have the same area, and that you can calculate the area
of a region by decomposing it into subregions and adding up the areas of those
subregions. There are some nontrivial claims to be proven here!
Maybe you remember the classic and silly missing square paradox, where
a triangle appears to have two different areas based on two different decom-
fig:missing-square
positions. (See Figure 6.3.) Of course, the missing square paradox is just a
simple illusion, as it ought to be. But have you ever seen a real proof that the
phenomenon purportedly exemplified by the missing square paradox can never
actually happen?
And how do you handle something like the area of a disk? Do you plan to
somehow cut up finitely many squares into finitely many pieces and perfectly
cover a disk?2 Archimedes calculated the familiar area formula r2 for a disk
of radius r as follows: We can cut the disk into N congruent wedges, with N
very large. The wedges are essentially just triangles with height r and base
length (2r)/N , so the area of each wedge is approximately r2 /N . Adding up,
we calculate that the area of the disk is approximately r2 . The point is that
as we take N , this approximation gets better and better. But how can we
justify this sort of limiting argument? (What is it that we are approximating,
exactly?) [TODO discuss Euclids treatment of area]
And theres another difficulty with this informal treatment of area. When
were cutting up these squares, what sorts of cuts are we allowed to make,
anyway? Can we cut off the set of points with rational coordinates?
We hope weve convinced you by now that defining area is quite a tricky
business! Of course, theres nothing special about two dimensions. Volume
2 Note that weirdly enough, this particular task can actually be performed, in a sense.

Look upsec:banach-tarski
Tarskis circle-squaring problem. You might appreciate it more after first reading
Section 45.
33. EPISODE I: THE PHANTOM MEASURE (MEASURE IS PROBLEMATIC)121

Figure 6.3: The missing square paradox: it appears that two different decompositions
of the same triangle lead to two different area calculations. An alternative description
is that it appears that by translating the pieces in the lower decomposition, you can end
up with the same overall triangle we started with plus one free unit square. The non-
profound resolution to the paradox is that the small blue triangle and the large green
triangle are not similar, so neither overall figure is actually a triangle (the hypotenuses
fig:missing-square are slightly bent.) The two overall figures are not congruent.

Figure 6.4: How Archimedes calculated the area of a disk. The approximation
2
with N = 20 is shown; the area of each wedge is approximately r 20
. Another way
to understand this argument is to rearrange the wedges so that they form a shape
which is approximately rectangular; the width of this rectangle is approximately
half the circumference, and the height is approximately the radius, giving a total area
of 12 (2r) r = r2 . But what was Archimedes calculating exactly?
122 CHAPTER 6. MEASURE

is similarly tricky, and in fact, length is already tricky. (Wed like to make
sec:cantor-set
sense of our claim in Section 6 that the Cantor set has total length zero!) In
general, the problem of giving legitimate definitions for length, area, volume,
etc. is called the problem of measure. For each dimension d, wed like a measure
function md , defined on subsets of Rd . (E.g. m2 is area.) What do we want
from this function?
Well, there should be no paradoxical decompositions, except silly illusions
like the missing square paradox. If you decompose a region in two different
ways even if you use countably infinitely many pieces [TODO motivate this
specifically] you should get the same measure calculation. For starters, lets
just try to solve the one-dimensional problem of measure, where these ideas are
crystallized into four basic Laws of Lengthodynamics that we would like our
measure function m = m1 to satisfy:

1. (The Zeroth Law of Lengthodynamics: Totality) Every subset of R has


a well-defined measure, which is either a nonnegative real number or else
. That is, m is a function P(R) [0, ].

2. (The First Law of Lengthodynamics: Countable Additivity) Measure can


neither be created nor destroyed by breaking a set up into Pdisjoint pieces.
That is, if E1 , E2 , R are disjoint, then m(n En ) = n m(En ).

3. (The Second Law of Lengthodynamics: Translation Invariance) Measure


can neither be created nor destroyed by translating a set. That is, for
any E R and any R, if we let E + = {x + : x E}, then
m(E + ) = m(E).

4. (The Third Law of Lengthodynamics: Normalization) The closed unit


interval has unit measure. That is, m([0, 1]) = 1.

Surprisingly, and sadly, the Laws of Lengthodynamics were meant to be broken.


They are incompatible!

thm:vitali Theorem 32 (Vitali). There does not exist a function m : P(R) [0, ]
which satisfies normalization, countable additivity, and translation invariance.
thm:vitali
Theorem 32 is every conspiracy theorists dream. The doubts we expressed
about the definition and basic properties of area were not just the idle specula-
tions of an overzealous reductionist. Measure is seriously messed up!

Proof sketch. Let S 1 denote the unit circle, i.e.

S 1 = {(x, y) R2 : x2 + y 2 = 1} = {(cos , sin ) : R}.

Well pull off a variant of the missing square paradox with the unit circle, but this
time, instead of just being a cheap trick, it will be real magic. In particular,
well decompose S 1 into countably infinitely many disjoint pieces, and then
use translations and rotations to rearrange those pieces into two copies of the
33. EPISODE I: THE PHANTOM MEASURE (MEASURE IS PROBLEMATIC)123

Figure 6.5: All of the depicted points are in the same equivalence class, along with
infinitely many points which are not depicted. The points depicted are those reachable
from the point (1, 0) by walking counterclockwise an integer distance between 0 and
6; the entire equivalence class includes all points reachable by walking any arbitrary
integer distance in either direction.

original circle! Well omit the relatively boring last step of the proof, where we
ought to show that this paradoxical decomposition implies Vitalis theorem.
For a point p = (cos , sin ) S 1 and a number R, define

p + = (cos( + ), sin( + )).

Define an equivalence relation on S 1 by saying that p p + n for every n Z.


In words, we say that a point p S 1 is equivalent to all the points you can get
to from p by traveling an integer distance around the circumference in either
direction. Note that because there are 2 radians in a circle (an irrational
number), going an integer distance never gets you back where you started, so
each equivalence class is a countably infinite subset of S 1 .
Let V0 be a set which contains one representative from each equivalence class.
(Note that formally, we invoke the Axiom of Choice to obtain V0 . Our definition
has not actually pinned down a unique set V0 ; the answers to questions such as
Is (1, 0) V0 ? are undefined.) For n Z, define Vn = V0 + n, i.e. Vn is a
copy of V0 rotated by n radians. The definition of immediately implies that
the sets Vn partition S 1 . [TODO maybe try to illustrate this... could be done
I suppose?]
Now we just exploit the paradoxical nature of infinity to finish the job. For
every even number 2n, rotate V2n into the position of Vn . For every odd number
2n + 1, rotate V2n+1 into the position of Vn , and then translate it over to the
right by 3 units. As promised, we now have two disjoint copies of the circle,
each congruent to the S 1 that we started with!
sec:banach-tarski
As well see in Section 45, there are many other paradoxical decompositions,
the most dramatic of which is the Banach-Tarski paradox: in three dimensions,
the solid unit ball can be decomposed into five pieces, which are then rearranged
into two copies of the original unit ball. (Five pieces is much more impressive
than the countable infinity of pieces that we used for the circle!)
124 CHAPTER 6. MEASURE

thm:vitali
But Theorem 32 already shows that the problem of measure (as posed) is
unsolvable. The problem of measure is quite a big problem indeed! It looks like
length is not as well-defined as you might hope. Stay tuned though, not all is
lost...3

34 Jordan measure
sec:vitali
We saw in Section 33 that there is no reasonable way to assign a measure to
every subset of R. But thats unacceptable. Surely, we shouldnt just declare
that ideas like length, area, and volume are meaningless gibberish! No, it is our
duty to make sense of them, so we will just have to shoot for a partial solution
to the problem of measure. This section will be about an early partial solution,
called Jordan measure. Well write md (E) to denote the Jordan measure of
a set E Rd . (So m1 is length, m2 is area, m3 is volume, etc.) Following
Archimedes, to define md (E), we want to take a sort of a limit. But you have
to be careful! Well start with how the definition does NOT go.
Given a set E Rd , we might try to estimate the measure of E by sampling
from Rd and seeing how many points we pick from E. To be more precise, we
could count how many points there are in E with all integer coordinates. To
get a better estimate, we could count how many points there are in E with
half-integer coordinates (and then wed have to divide by 2d to normalize.) In
general, our kth estimate is given by
1 n a o

md (E, k) = d a Zd : E .
k k
And then our temptation is to set md (E) = limk md (E, k). The limit may
not always be defined, of course, but were just shooting for a partial solution
here anyway. Let us call this quantity m e d (E) instead, because it is not the true
Jordan measure.
Unfortunately, m e 1 breaks the Second Law of Lengthodynamics: its not
translation invariant! To see why, observe that the set E1 = Q [0, 1] has
e 1 (E1 ) = 1: we only sample at rational points, so E1 looksjust like [0, 1] to our
m
sampling procedure. But its translate, E2 = (Q [0, 1]) + 2, has m e 1 (E2 ) = 0:
to our sampling procedure, E2 looks empty!
Its inevitable that we break at least one Law, but translation invariance is
too good to give up. You might think we could patch m e d up by just sampling
at irrational points, too. The formula that defines md (E, k) makes perfect sense
even if k is non-integral, so we could just try altering our definition of m e d (E)
by taking our limit over arbitrary real values k.
But this is no good either. We still violate the 2D analog of the Second Law:
e 2 is not rotation invariant! To see why, define
m
n y o
E1 = (x, y) [0, 1] [0, 1] : x = 0 or Q
x
3 This is a cliffhanger, in case you hadnt noticed.
34. JORDAN MEASURE 125
g:fake-jordan-measure

Figure 6.6: The k = 4 approximation of the area of a unit-radius disk D. Since


there are 49 dots that landed inside the disk, we estimate the area of the disc as
m2 (D, 4) = 4492 = 3.0625, which is not too far off from the true area of .

That is, E1 is the set of lines passing through the origin with rational slope,
intersected with the unit square. Then with our new definition, m e 2 (E1 ) = 1:
we still only sample at points with rational slope, so E1 looks just like the unit
square to our sampling procedure. But if we define E2 to be E1 rotated by an
irrational fraction of a full turn, then m e 2 (E2 ) = 0.
So lets move on to the actual definition of md (E). For the easiest case, if E
is a box, we define its measure to be the product of its side lengths. That is, if
I1 , . . . , Id are sets of real numbers of the form [ai , bi ], [ai , bi ), (ai , bi ], or (ai , bi ),
then we define

md (I1 I2 Id ) = (b1 a1 ) (b2 a2 ) (bd ad ).

Next, suppose E can be written as a finite union E = B1 B2 Bn ,


where the Bk s are disjoint boxes. (Well
Pn call such a set E an elementary set.)
Then wed like to define md (E) = k=1 md (Bk ). But theres a difficulty
what if E can be written as a finite disjoint union of boxes in two different
ways? Maybe our measure isnt well-defined!
Proposition 23. Dont worry, its well-defined.
Phew. (Well skip the boring proof.) Now, for the general case, we make the
following definition:
Definition 26. Fix a bounded set E Rd . We define the inner Jordan measure
of E by

m
d (E) = sup{md (A) : A is an elementary subset of E}. (6.1)

We define the outer Jordan measure of E by

m+
d (E) = inf{md (A) : A is an elementary superset of E}. (6.2)
126 CHAPTER 6. MEASURE

11
12
21 31
24
21

11 31

Figure 6.7: Two different ways of writing the same elementary set in the plane as
a disjoint union of boxes (rectangles.) Each suggests the same area: 1 + 8 + 2 =
3 + 2 + 3 + 2 + 1.

Figure 6.8: An elementary subset of the unit disk D with area 1.36, and an el-
ementary superset of D with area 4.69. By definition (assuming that D is Jordan
measurable, which it is) this shows that 1.36 m2 (D) 4.69. Archimedes would
agree: 1.36 4.69.

We say that E is Jordan measurable if m +


d (E) = md (E), in which case we
+
define its Jordan measure by md (E) = md (E) = md (E).

Technically, we should show that the general definition of Jordan measure


agrees with the original definition for elementary sets. (Dont worry, it does.)
Now we can make our claim about the Cantor set precise:

Proposition 24. The Cantor set R is Jordan measurable, and m1 () = 0.

Proof. Recall that at the nth step in the construction of , we have an elemen-
tary superset n , and limn m1 (n ) = 0. Obviously 0 m 1 ()
m+1 (), so we are done.

It turns out that lots of regions in the plane, like polygons, disks, and an-
nuluses, are all Jordan measurable, and their Jordan measures are equal to the
35. LEBESGUE MEASURE 127

areas predicted by standard formulae. Archimedes argument can be turned


into a real proof that the Jordan measure of a disk of radius r is r2 .
But not all sets are Jordan measurable. For example, in our definition, we
explicitly required that E be bounded. Unbounded sets can be dealt with by
using a sort of improper Jordan measure; we can set
md (E) := lim md (E a ball of radius r centered at the origin). (6.3)
r

But even bounded sets can fail to be Jordan measurable. For example, take
E = Q [0, 1]. The inner Jordan measure of E is of course 0, since E does
not contain any intervals. But the outer Jordan measure of E is 1: Suppose
E is contained in the elementary set I1 In . Without loss of generality,
assume these intervals are ordered, so that Ik is to the left of Ik+1 . Then the
right endpoint of Ik must be equal to the left endpoint of Ik+1 , since Q is dense.
Hence, m(I1 ) + + m(In ) 1.
Time to evaluate the performance of m1 at solving the problem of measure
in one dimension. Jordan adheres to the Second and Third Laws of Lengthody-
namics: m1 is translation invariant, and m1 ([0, 1]) = 1. And Jordan measure is
finitely additive, i.e. if A and B are Jordan measurable and disjoint, then A B
is Jordan measurable with m(A B) = m(A) + m(B).
But Jordan is guilty of badly breaking the Zeroth Law: tons of sets are not
Jordan measurable. And Jordan breaks the First Law (countable additivity) as
well, because the collection of Jordan measurable sets is not even closed under
countable disjoint unions! E.g. for each rational number q, the singleton set {q}
is Jordan measurable, but their union, Q, is not Jordan measurable. This is the
only thing that goes wrong; if E1 , E2 , . . . are JordanP
measurable and disjoint,
and n En is Jordan measurable, then md (n En ) = n md (En ). One way to
prove this last fact is to use Lebesgue measure, which well discuss next section.
[TODO talk a little more about the history of this stuff]

35 Lebesgue measure
sec:lebesgue-measure
In the early 1900s, Henri Lebesgue (pronounced luh beg) realized that we
can come much closer to a full solution to the problem of measure than Jor-
dan measure. Well focus on the one dimensional case; the generalization to d
dimensions is straightforward. To begin, well define Lebesgue outer measure,
which is like Jordan outer measure, but a little more sophisticated. If E R,
a countable covering by intervals of E is a (countable) sequence I1 , I2 , . . . of
disjoint open intervals such that E n In . Intuitively, if we let (In ) denote
the length of In P
(i.e. ((a, b)) = b a), then such a covering should provide an
upper bound of n (In ) on the measure of E. So we define the Lebesgue outer
measure of E, denoted (E), as follows:
( )
X

(E) = inf (In ) : (In ) is a countable covering of E by intervals .
n
(6.4)
128 CHAPTER 6. MEASURE

(The stands for ebesgue.) So for example, if E is an interval (open or closed),


then (E) = (E). For another example, (R) = . [TODO include picture]

Definition 27 (Lebesgue measurability). A set E is Lebesgue measurable if for


every interval [a, b] R, ([a, b] E) + ([a, b] \ E) = b a.

[TODO verify that this is actually the correct definition, haha. Its not stan-
dard, but I think its much more natural than the standard ones.] [TODO in-
clude a little intuition and a picture] [TODO contrast the definition of Lebesgue
measure with that of Jordan measure] [TODO mention some other ways to de-
fine/construct Lebesgue measure]
Let M denote the set of Lebesgue measurable sets. We define the Lebesgue
measure : M R by (E) = (E). (So the only difference between Lebesgue
measure and Lebesgue outer measure is that Lebesgue measure is defined
on fewer sets.) It can be verified [TODO reference] that Lebesgue measure
satisfies the First, Second, and Third Laws of Lengthodynamics (countable ad-
ditivity, translation invariance, and normalization), along with various other
nice, intuitive properties. On Jordan measurable sets, Lebesgue measure and
Jordan measure coincide. Note that any individual point has Lebesgue mea-
sure zero, so by countable additivity, any countable set (e.g. Q) has Lebesgue
measure zero.
How about the Zeroth Law? It turns out that tons and tons of sets are
Lebesgue measurable, e.g. open sets, closed sets (such as the Cantor set), sets
with outer measure zero, and Jordan measurable sets. In fact, the collection of
Lebesgue measurable sets has the same cardinality as the set of all subsets of
R: the Cantor set has outer measure zero, so every subset of is Lebesgue
measurable, and || = |R|. So there are just as many measurable sets as there
are sets total. It also turns out that the set of Lebesgue measurable sets is
very robust, e.g. it is closed under countable union, countable intersection,
complementation, translation, reflection, continuous transformations [TODO
verify], and any other reasonable operation. Reality is approximated quite well
by pretending that all sets are Lebesgue measurable.
We know that the Zeroth Law must be broken, though, since Lebesgue
adheres to the other three Laws. But you might reasonably deny that weve
seen an actual example thm:vitali
of a set which is not Lebesgue measurable. If you recall,
our proof of Theorem 32 relied heavily on the axiom of choice, so we didnt
really pin down any particular nonmeasurable set.
It turns out, this was not just an artifact of our proof! If you believe some
abstruse set-theoretic assumptions, then the axiom of choice is unavoidable in
any proof of the existence of nonmeasurable sets.4 This is a bit frustrating,
but there is a silver lining. The necessity of the axiom of choice in proving the
existence of nonmeasurable sets suggests that you dont have to worry about
nonmeasurability too much, because if youve constructed some set E, as long
as your construction is nice and explicit, you can expect that E is measur-
able. Indeed, there are theorem versions of this last claim. See [TODO cite
4 Look up the Solovay model for the details.
36. THE SMITH-VOLTERRA-CANTOR SET 129

http://mathoverflow.net/questions/211507/measurability-and-axiom-of-choice] and
[TODO cite Large cardinals imply that every reasonably definable set of reals
is Lebesgue measurable].
[TODO some discussion of the history here e.g. which came first, Lebesgue
measure or Vitali sets? (I actually dont know the answer to this, but Im
guessing Lebesgue measure came first, which raises the further question, what
was Lebesgues inspiration for only defining his measure on a subset of P(R)?)]

36 The Smith-Volterra-Cantor set


sec:svc-set
If E is a Lebesgue measurable set which contains an interval (a, b), then certainly
(E) b a, so in particular, (E) > 0. What about the converse? If a set has
positive measure, must it contain an interval? Nope! An easy counterexample
is the set of irrational numbers.
sec:cantor-set
Okay heres a better question, of a similar flavor. We saw in Section 6 that
there exists a nowhere dense uncountable set. Does there exist a nowhere dense
set of positive measure?
Indeed there does. One example is the Smith-Volterra-Cantor set (SVC), a
type of fat Cantor set. To construct SVC, we start with the interval [0, 1].
First we remove the middle fourth interval,  an open interval of length 1/4  3in
3 5
the center
5  of [0, 1]. That interval is ,
8 8 , leaving us with two intervals, 0, 8
1 1
and 8 , 1 . In the second step, we remove an open interval of length 42 = 16
from the middle of each of these two remaining intervals, leaving us with four
(shorter)
fig:svc
intervals. The first two steps of constructing SVC are shown in Figure
6.9.

0 1

0 3 5 1
8 8

0 5 7 3 5 25 27 1
32 32 8 8 32 32

fig:svc Figure 6.9: Construction of SVC.

In each successive step, we remove an interval of length 41n from the middle of
each of the remaining 2n1 intervals. SVC is the set of all the points remaining
as we continue the remove middle fourth intervals. (It is all the points that are
never eventually removed.)
The length of all the points in SVC, that is, the points remaining after
130 CHAPTER 6. MEASURE

Figure 6.10: SVC is also called the barcode set. [TODO self-cite]

removing all the intervals, is given by


X X
1 n1 1 2n1
1 2 = 1
n=1
4n n=1
4 4n1
1 1 1
=1 1 =
4 1 2
2

In fact, SVC has Lebesgue measure 1/2. But SVC does not contain any intervals.
To see why, observe that in the nth step of the construction, we have a disjoint
union of several intervals, all of the same length (n). The point is that (n) 0
as n (since (n + 1) < 12 (n)), so for any a < b, the interval (a, b) is not
contained in SVC, because eventually, (n) < b a. Since SVC is a countable
intersection of countable sets, it is closed, and hence it is nowhere dense.
Another fun fact about SVC:
Proposition 25. SVC is not Jordan measurable.
Proof. The inner Jordan measure of SVC is zero, since SVC does not contain
any intervals. But the outer measure must be at least 21 , since outer Jordan
measure is an upper bound on Lebesgue measure.
The general criterion for Jordan measurability is as follows:
thm:jordan-measurability Theorem 33. A bounded set E R is Jordan measurable if and only if its
boundary E has Lebesgue measure zero.
For example, since SVC is closed, it is its own boundary; thus, the fact that
it has positive Lebesgue measure implies that it is not Jordan measurable. For
another example, the boundary of Q [0, 1] is all of thm:jordan-measurability
[0, 1], so Q [0, 1] is not
Jordan measurable. Well omit
thm:lebesgue
the proof of Theorem 33, because it is a special
sec:riemann-integrability
case of Theorem 39, which well discuss in Section 47.

37 A nonmeager set of measure zero


[TODO we should give this set a fun name.]
At this point, weve seen (at least) four different notions of a small set of
real numbers: countable sets, measure zero sets, nowhere dense sets, and meager
sets. How do all of these notions interact? Weve already seen how every one
of these notions implies or fails to imply all the others, with one exception: we
have not yet resolved the question of whether measure zero implies meager. It
doesnt!
37. A NONMEAGER SET OF MEASURE ZERO 131

r
ge
ea Meager
nm
no
o-
er
z
e-
ur
as

Definition
6 p:me

Q
n 2 pro

De
fin
SVC

itio
tio


osi

n
op
Pr

Nowhere
Dense
Q

SVC Q
Measure
Countable
Zero Definition

Figure 6.11: The relationships between being countable, being meager, being
nowhere dense, and having measure zero. A solid arrow represents an implication,
while a dashed arrow represents a lack of implication. Each arrow is labeled with its
proof/counterexample.
132 CHAPTER 6. MEASURE

Well prove slightly more than this by showing that there is a comeager set
with measure zero. (A comeager set is the complement of a meager set; by
Baires category theorem, any comeager set is nonmeager.)
prop:measure-zero-nonmeager Proposition 26. There exists a comeager set E with (E) = 0.
Proof. For > 0, let U be an open set of measure containing a neighborhood
of every rational number. Then Uc is nowhere dense, because if I is an interval,
then there is a rational q I, and Uc misses an interval around q. Therefore,
the set \
E= U1/n
nN

is comeager. But on the other hand, for every n, (E) (U1/n ) = 1/n, so
(E) = 0.

Corollary 1. Every set A R can be written as a union of a meager set and


a measure-zero set.
prop:measure-zero-nonmeager
Notice that the set E from Proposition 26 is an example of a measure-zero
set which is not the discontinuity set of any function. Proof:E c is full
cor:comeager-codense
measure,
sec:baire
c
so it is dense, and E is meager, so by Corollary ?? from Section 15, E is not
F .
Well now give a slightly more concrete example of a comeager set of measure
zero, called . Let Hn R2 be a horn shape consisting of all those points
beneath the curve x = 2n |y|3 . [TODO graph just this] [TODO also thats not
even the right formula...] Define Un R2 by putting a copy of Hn at (p/q, 1/q)
for each rational number p/q. Let Vn R be the intersection of Un with the
x-axis. Then we set = n Vn .
For more about the interactions between Baire category and Lebesgue mea-
sure, see [TODO cite Measure and Category by Oxtoby].
37. A NONMEAGER SET OF MEASURE ZERO 133

Figure 6.12: The first step in the construction of .

Figure 6.13: The second step in the construction of .


134 CHAPTER 6. MEASURE

Figure 6.14: The third step in the construction of .

Figure 6.15: The fourth step in the construction of .


38. LEBESGUES DENSITY THEOREM 135

38 Lebesgues density theorem


sec:lebesgue-density
[TODO make more of a story.] [TODO maybe mention conditional probability]
[TODO maybe? mention the set that partially fills each interval. cannot
have
m(E I)
< <1
m(I)
for all intervals I ]
Fix two measurable sets E, U R, with (U ) > 0. We define the density of
E in U by
(E U )
U (E) = .
(U )
In words, U (E) is the fraction of U which is filled up by E. Now, for a
measurable set E and a point x, we define the metric density of E at x by

(E, x) = lim (x,x+) (E).


0

(Note that (E, x) might not be defined, if the limit is not defined.)

Example 4. If (E) = 0, then (E, x) = 0 for every x. For example, (Q, x) =


0 for every x. This is confusing, since Q is dense in the topological sense of
the word.

Example 5. Let E be an interval from 0 to 1 (include whichever endpoints you


want.) Then


1 if x (0, 1)
(E, x) = 12 if x {0, 1}


0 otherwise.

Example 6. TODO nontrivial example of Lebesgues density theorem?

The main theorem here is Lebesgues Density Theorem, which tells us that
TODO was not a coincidence.

Theorem 34 (Lebesgues Density Theorem). Suppose (E) > 0. Then for


almost every x E, (E, x) exists and equals 1, and for almost every x 6 E,
(E, x) exists and equals 0.

In other words, Lebesgues density theorem says that (E, x) = E (x) for al-
sec:lebesgue-diferentiation
most every x. [TODO give reference for proof Section ??] One nifty consequence
is that if we define
(E) = {x : (E, x) = 1},
then picks out one representative from each equivalence class of the equiv-
alence relation on Lebesgue-measurable sets defined by saying that A B iff
(AB) = 0.
136 CHAPTER 6. MEASURE

39 Measure and Minkowski sums


Heres a fun application of Lebesgues density theorem. For two sets A, B R,
recall that their Minkowski sum A + B is defined by

A + B = {a + b : a A, b B}.

For example, Q + Q = Q, since Q is closed under addition. On the other hand,
[0, 1] + [0, 1] = [0, 2]. One way you can think about it is that the set A + B is
formed by making one copy of A for each point in B, and translating each copy
of A over to the corresponding point in B. For another way to think about it,
let −B = {−x : x ∈ B}. Then A + B consists of precisely those points c such
that −B + c intersects A. Finally, for yet another way of thinking about it,
note that A + B consists of precisely those points c such that the graph of the
line y = −x + c intersects A × B. [TODO illustrate all three of these ways of
thinking about A + B]
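As a small aside (ours, not the book's), the descriptions above are easy to play
with on finite sets; the helper names below are made up for this sketch.

```python
# Minkowski sum of two finite sets, plus the "line x + y = c meets A x B" view.

def minkowski_sum(A, B):
    """Return {a + b : a in A, b in B}."""
    return {a + b for a in A for b in B}

def c_in_sum_via_product(A, B, c):
    """c is in A + B iff the line x + y = c passes through a point of A x B."""
    return any((c - a) in B for a in A)

if __name__ == "__main__":
    A = {0, 1, 4}
    B = {0, 10, 20}
    S = minkowski_sum(A, B)
    print(sorted(S))                                          # 0, 1, 4, 10, ..., 24
    print(all(c_in_sum_via_product(A, B, c) for c in S))      # True
    print(c_in_sum_via_product(A, B, 7))                      # False: 7 not in A + B
```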

Theorem 35. Suppose A and B both have positive Lebesgue measure. Then
A + B contains an interval.

Proof. By Lebesgue's density theorem, there are points a ∈ A, b ∈ B so that
δ(A, a) = δ(B, b) = 1. Therefore, by the definition of density, there is some
ε > 0 so that

    λ(A ∩ (a − ε, a + ε)) ≥ 3ε/2

and

    λ(B ∩ (b − ε, b + ε)) ≥ 3ε/2.

Let A′ = A ∩ (a − ε, a + ε), and let B′ = B ∩ (b − ε, b + ε). We'll show that
A′ + B′ contains an interval centered at a + b.
Suppose a + b + t ∉ A′ + B′. Then A′ does not intersect a + b + t − B′. But
both of these sets are contained in the interval (a − ε − |t|, a + ε + |t|), which has
measure 2ε + 2|t|, and both of these sets have measure at least 3ε/2. Therefore,
we must have |t| ≥ ε/2. In other words, the interval (a + b − ε/2, a + b + ε/2) is
contained in A′ + B′, as desired. [TODO illustrate this proof. Also shorten it,
possibly sacrificing a small amount of rigor in favor of enlightenment.]
Observe that in the case that B = −A, the proof of Theorem 35 implies that
A − A contains an interval centered at 0. This result is called the Steinhaus
theorem:

Theorem 36 (Steinhaus). Suppose E ⊆ R has positive measure. Then E − E
contains a neighborhood of 0.

One nifty corollary of the Steinhaus theorem is that if E has positive mea-
sure, then |E| = |R|. (It's obvious that E must be uncountable.) How about
a converse to the Steinhaus theorem? If E − E contains a neighborhood of 0,
must E have positive measure? Nope!

Figure 6.16: The line y = x + c intersecting Γₙ × Γₙ, with c = 0.4 and n = 0, 1, 2.

Proposition 27. Let Γ denote the Cantor set. Then Γ − Γ = [−1, 1], despite
the fact that λ(Γ) = 0.

Proof. It's obvious that Γ − Γ ⊆ [−1, 1], since Γ ⊆ [0, 1]. For the reverse
inclusion, observe that c ∈ Γ − Γ if and only if the graph of the line y = x + c
intersects Γ × Γ. Let Γₙ be the nth set in the construction of Γ, so that
Γ × Γ = ⋂ₙ (Γₙ × Γₙ). It should be inductively clear from Figure 6.16 that
y = x + c intersects every Γₙ × Γₙ. Hence, if we let L denote the graph of
y = x + c, then L ∩ (Γₙ × Γₙ) is a decreasing sequence of nonempty compact
subsets of R². An application of Cantor's intersection theorem [TODO we should
state this somewhere] completes the proof.
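For the skeptical reader, here is a quick computational check (our own sketch) of
the picture behind this proof: Γₙ is a union of 2ⁿ closed intervals, and the line
y = x + c meets Γₙ × Γₙ exactly when some pair of those intervals I, J satisfies
min J − max I ≤ c ≤ max J − min I.

```python
# Verify numerically that the line y = x + c meets Gamma_n x Gamma_n
# for every n up to some bound, for several values of c in [-1, 1].

def cantor_intervals(n):
    """The 2^n closed intervals making up the nth Cantor approximation."""
    intervals = [(0.0, 1.0)]
    for _ in range(n):
        refined = []
        for a, b in intervals:
            third = (b - a) / 3.0
            refined.append((a, a + third))
            refined.append((b - third, b))
        intervals = refined
    return intervals

def line_meets_product(c, n):
    """Does the line y = x + c intersect Gamma_n x Gamma_n?"""
    ivs = cantor_intervals(n)
    return any(jlo - ihi <= c <= jhi - ilo
               for (ilo, ihi) in ivs for (jlo, jhi) in ivs)

if __name__ == "__main__":
    for c in [-1.0, -0.4, 0.0, 0.17, 0.4, 1.0]:
        print(c, all(line_meets_product(c, n) for n in range(10)))  # True each time
```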

40 Intersections of measure zero sets and their images

[Combine with above? (both are kind of short)]

41 Noise sets

[TODO: think more about this name.] As in Section 38, for two measurable
sets E, U ⊆ R, let δ_U(E) denote the density of E in U, i.e. the fraction of U
which is filled up by E. We're going to define a noise set to be a measurable set
E such that for every nonempty interval I,

    0 < δ_I(E) < 1.

So a noise set partially fills every interval. You might be imagining that E
partially fills R uniformly, i.e. δ_I(E) is constant at some value between 0
and 1. (So a picture of E would just be gray.) But that can't happen! By the
Lebesgue density theorem, if E is a noise set, then for every interval I, there
are subintervals of I in which E has density arbitrarily close to 1, and there
are subintervals of I in which E has density arbitrarily close to 0. So E has
black and white patches all over the place, like static on your TV (hence the
name noise set.) (See Figure 6.18.) In light of the Lebesgue density theorem,

Figure 6.17: In this example, we used X = SVC, and A1, B1, A2, B2 are shown. A1
is in red, B1 is in blue, A2 is in green, and B2 is in purple.

Figure 6.18: A noise set formed as in the proof of Theorem 37 using X = SVC(0.3).
Here, SVC(α) is the set formed via the SVC construction with the interval removed
at step n having length αⁿ. (So in Section ??, we discussed SVC(1/4).)

it's really rather surprising that noise sets exist at all! But they do. For a fun
bonus, they can even have finite measure.

Theorem 37. There exists a noise set E with finite measure.

Proof sketch. Let X be your favorite nowhere dense subset of [0, 1] with positive
measure (e.g. the Smith-Volterra-Cantor set from Section 36.) The idea is to
look at all intervals with rational endpoints (there are only countably many
of these) and squeeze two shrunk copies of X into each interval (qn, rn) (call
them An and Bn.) Since X is nowhere dense, we can ensure that An and Bn do
not intersect each other or any of the previous Am, Bm. (See Figure 6.17.) Then
we take just the An's to form our set E, i.e. E = ⋃_{n∈N} An.
The resulting set has the desired properties: since any interval I contains
some rational interval (qn, rn) inside it which contains the positive-measure set
An ⊆ E, we get λ(E ∩ I) ≥ λ(An) > 0. Also, (qn, rn) contains the positive-
measure set Bn which is not in E, so λ(E ∩ I) ≤ λ(I) − λ(Bn) < λ(I). Finally,
we can take the measures of An decreasing rapidly in n (e.g. λ(An) ≤ 2⁻ⁿ) so
that E = ⋃_{n∈N} An has finite measure.

[TODO point out that our set is F_σ, so it's a discontinuity set. Point out
that this is somewhat surprising, since we proved that there is a sense in which
F_σ sets are never medium-sized, and this set seems very medium-sized]

42 Convergence in measure vs. pointwise convergence

Recall that we can talk about convergence of functions, fn → f almost every-
where, which means that fn(x) converges pointwise to f(x) except possibly on
a measure zero set. With the notion of measure, we can talk about a differ-
ent type of convergence, convergence in measure. (This is particularly useful in
probability, where it's called convergence in probability.)

Definition 28. We say that fn → f in measure if for all ε > 0, the measure
λ({|fn − f| > ε}) → 0 as n → ∞.

In other words, for a fixed ε, the measure of the set where fn differs from
f by more than ε goes to 0 as n → ∞. Of course, now we want to figure out
whether these notions of convergence are the same, or if one type of convergence
implies the other.

Proposition 28. If fn → f pointwise almost everywhere, then fn → f locally
in measure.

Convergence locally in measure is a weaker version of convergence in measure.
Suppose our infinite-measure space, like R, can be broken up into a countable
number of finite-measure sets; for example, R = ⋃_{n∈N} [−n, n]. (This property
is called σ-finiteness.) Then we can define convergence locally in measure.

Definition 29. Write R = ⋃_{k∈N} Ck where λ(Ck) < ∞. We say fn converges
to f locally in measure if for all k ∈ N and ε > 0, λ({|fn − f| > ε} ∩ Ck) → 0
as n → ∞.

Note that if we already started with a finite measure set like [0, 1], then conver-
gence in measure is the same as convergence locally in measure.

Proof (of Proposition 28). Fix k ∈ N, and suppose fn → f a.e. on Ck. For each
ε > 0, the sets where fn − f is small, E_N := {|fn − f| ≤ ε for all n ≥ N} ∩ Ck,
increase to Ck (a.e.). Thus the complement, where fn − f is large, i.e. Ck \ E_N,
decreases to the empty set (a.e.) as N → ∞. By some continuity properties of measures
(which uses that λ(Ck) < ∞), this implies λ(Ck \ E_N) → 0 as N → ∞. Finally,

    Ck \ E_N = {x ∈ Ck : |fn − f| > ε for some n ≥ N} ⊇ {x ∈ Ck : |f_N − f| > ε}.

So, at least, pointwise convergence a.e. implies convergence locally in measure.
The converse is false, as we will see shortly, but we can at least salvage something:

Proposition 29. If fn → f locally in measure, then there is a subsequence
(fnk) of (fn) such that fnk converges to f pointwise a.e.

Claim 7. This cannot be improved to the full sequence.

The idea to construct a sequence of functions converging in measure but
not pointwise a.e. is to move the set {x : |fn(x) − f(x)| ≥ ε} around a lot.
The measure of the set has to decrease if we want to converge in measure, but
convergence in measure doesn't care where the set is. Pointwise convergence
does care a lot though! If we can move the bad set repeatedly over all the
points (while the set is still shrinking), then we might have a chance at failing
to converge pointwise a.e.!
Let's look at [0, 1] and some dyadic rationals (rational numbers with denom-
inator a power of 2). Now given m ∈ N, which will be our index for the functions
fm, write m = 2ⁿ + k, n ∈ N₀ := N ∪ {0}, 0 ≤ k < 2ⁿ. Define

    fm(x) = 1 if x ∈ [k/2ⁿ, (k+1)/2ⁿ],  and  fm(x) = 0 otherwise.

What fm does is this: Partition [0, 1] into the intervals of length 1/2ⁿ. On the kth
interval (starting from 0), make fm = 1. Everywhere else, make fm = 0. Each
time k increases as m increases, the interval where fm = 1 gets shifted over
by 1/2ⁿ. Every time n gets incremented, we repartition [0, 1] into finer intervals
and repeat, shifting the intervals one step at a time across [0, 1]. As a result,
fm(x) = 1 infinitely often for every x, even though λ({|fm| > ε}) = 2⁻ⁿ ≤ 2/m → 0
as m → ∞.
TODO: picture
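In the meantime, here is a short computational sketch (ours) of the typewriter
sequence just described; the function name is made up for this illustration.

```python
import math

def typewriter(m, x):
    """The m-th typewriter function on [0, 1].

    Writes m = 2**n + k with 0 <= k < 2**n and returns 1 exactly when
    x lies in [k / 2**n, (k + 1) / 2**n].
    """
    n = int(math.log2(m))
    k = m - 2**n
    return 1 if k / 2**n <= x <= (k + 1) / 2**n else 0

if __name__ == "__main__":
    # The measure of {f_m = 1} is 2**(-n) <= 2/m, so f_m -> 0 in measure...
    for m in [1, 2, 5, 17, 1000]:
        n = int(math.log2(m))
        print(f"m = {m:5d}: measure of support = {2**(-n):.6f}  (<= {2/m:.6f})")

    # ...but at any fixed x, f_m(x) = 1 for infinitely many m, so the sequence
    # does not converge to 0 pointwise.
    x = 0.3
    hits = [m for m in range(1, 2000) if typewriter(m, x) == 1]
    print(f"f_m({x}) = 1 for {len(hits)} values of m below 2000, e.g. {hits[:6]}")
```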

43 Borel measure

Recall that a countable union of closed sets is not necessarily closed or open. We
called such sets F_σ sets. Similarly, a countable intersection of open sets is called
a G_δ set. The same phenomenon happens again, up one level: a countable union
of F_σ sets is again an F_σ set, but a countable intersection of F_σ sets might be
neither F_σ nor G_δ. We call such sets F_σδ, and their complements are the G_δσ
sets.
These are the first few layers of the Borel hierarchy. In order to generalize,
we'll introduce some less cumbersome notation. We say that a set is Σ⁰₁ if it
is open, and we say that a set is Π⁰₁ if it is closed. (We also think of Σ⁰₁ as
denoting the set of open sets, and similarly with Π⁰₁, etc.) The Σ⁰₂ sets are
the F_σ sets (i.e. countable unions of closed sets.) Their complements are the
Π⁰₂ sets. We continue like this, setting Σ⁰_{n+1} to be the collection of countable
unions of Π⁰ₙ sets and Π⁰_{n+1} to be the collection of complements of Σ⁰_{n+1} sets.
Notice that Σ⁰ₙ, Π⁰ₙ ⊆ Σ⁰_{n+1}, Π⁰_{n+1}. One can prove that for each n, the sets Σ⁰ₙ
and Π⁰ₙ are distinct, so Σ⁰ₙ is not a good stopping point for any n ∈ N. [TODO
include standard simple picture of inclusions between these things]
But maybe if we let B₀ = ⋃ₙ Σ⁰ₙ, then B₀ is a nice collection of sets? (Maybe
we can stop at infinity?) It's closed under complement. But it's not closed
under countable unions: there are sequences E₁, E₂, . . . with Eₙ ∈ Σ⁰ₙ so that
⋃ₙ Eₙ ∉ B₀. So we still have not reached a good stopping point. Like Buzz
Lightyear, we must continue! We'll just have to call a set Σ⁰_ω if it is a countable
union of sets which are Σ⁰ₙ for some finite n. Taking complements gives Π⁰_ω.

But we're still not done! We've got to go on to ω + 1, ω + 2, ω + 3, . . . , all the
way to ω + ω, and then onward to ω + ω + ω, ω + ω + ω + ω, . . . , onward to ω · ω,
and then to ω^ω, and then beyond! When will the madness end?!
To find a good stopping point (and also to make sense of the preceding
infinitary arithmetic), we'll introduce ordinal numbers. (The Borel hierarchy is
one of the few cases where analysts need to know how to count past infinity.) A
well ordering ≤ on a set X is a binary relation between X and itself, satisfying:

(Totality) For every x, y ∈ X, either x ≤ y or y ≤ x.

(Transitivity) If x ≤ y and y ≤ z, then x ≤ z.

(Antisymmetry) If x ≤ y and y ≤ x, then x = y.

(Wellfoundedness) If S ⊆ X is nonempty, then there is some least element
x ∈ S, i.e. an x ∈ S so that x ≤ y for every y ∈ S.

For example, the standard ordering of N is a well ordering, but the standard
ordering of Z is not. [TODO give some intuition about how the well orderings
are all just sequences... except not necessarily familiar ones]
Naturally, we say that two ordered sets (X, ≤_X) and (Y, ≤_Y) are isomorphic
if there is a bijection φ : X → Y so that x ≤_X x′ ⟺ φ(x) ≤_Y φ(x′). Roughly
speaking, ordinal numbers are canonical representatives for the isomorphism
types of well orders. The precise definition is a bit confusing.

Definition 30. An ordinal number is a set S satisfying:

The relation ∈ is a well ordering on S.

Every element of S is a subset of S.

For example, ∅ is an ordinal number, which we identify with 0. If n is an
ordinal number, then so is n ∪ {n}, which we identify with n + 1. So 1 = {0},
2 = {0, 1}, and 3 = {0, 1, 2} are all ordinal numbers. But there are more: N is
also an ordinal number, typically written ω in this setting. As is ω + 1, ω + 2,
etc. TODO finish
Now we can define Σ⁰_α and Π⁰_α for arbitrary ordinals α, using transfinite
induction: Σ⁰₁ is the open sets, Π⁰_α is the collection of complements of Σ⁰_α sets,
and Σ⁰_α is the collection of all countable unions of sets in Π⁰_β for β < α. It
can be shown that if α is a countable ordinal, then Σ⁰_α ≠ Π⁰_α. [TODO give
reference for proof] But right after all the countable ordinals is a good stopping
point. Let B be the union of all the Σ⁰_α sets, where α ranges over all countable
ordinals. This collection B is the collection of all Borel sets, and B is closed
under complements and countable unions. [TODO explain more]
There's an alternative definition of the Borel sets, which avoids all these
transfinite shenanigans: we can cut out the non-Borel sets from P(R), instead
of building up the different layers of the Borel hierarchy. This definition is more
understandable, but less concrete.

Definition 31. If Ω is a nonempty set, a σ-algebra on Ω is a collection F ⊆
P(Ω) of sets of points, such that

1. ∅ ∈ F.
2. If E ∈ F, then so is E^c = Ω \ E.
3. If E1, E2, . . . is a countable sequence of sets in F, then ⋃_{n∈N} En ∈ F.

It follows from this definition that a σ-algebra is closed under countable inter-
sections.
Some examples of σ-algebras: For any set Ω, the power set P(Ω) is a σ-
algebra. The set of Lebesgue measurable sets is a σ-algebra on R. The two-set
collection F = {∅, Ω} is a rather trivial σ-algebra. [TODO possibly discuss
some more intuition about σ-algebras, and how they encode a state of knowl-
edge. Maybe this should wait for the probability section, where it makes a lot
more sense.]
A quick check will verify that if F and F′ are both σ-algebras on Ω, then so
is F ∩ F′. More generally, if S is a collection of σ-algebras on Ω, then ⋂S is also
a σ-algebra on Ω. (Oof, that's confusing... S is a set of sets of sets of points!⁵)
This allows us to speak sensibly of the smallest σ-algebra on Ω containing all
the sets in a given collection U ⊆ P(Ω). Now, we can just define B to be the
smallest σ-algebra on R containing all open sets.
So now that we've given a couple of definitions of the Borel sets, we'd like
to understand which sets are Borel. The first answer is: not very many!

Proposition 30. |B| = |R|.

The proof is just a simple generalization of our proof that |Σ⁰₂| = |R| from
Section 15.

Proof sketch. We can assign each Borel set a Borel code which uniquely identifies
it, based on its position in the Borel hierarchy. [TODO finish]
Every Borel set is Lebesgue measurable, since the collection of Lebesgue
measurable sets forms a σ-algebra containing the open sets. But recall that
the collection of Lebesgue measurable sets has cardinality ℶ₂, so there must be
sets which are Lebesgue measurable but not Borel. In fact, we can give a fairly
explicit example, due to Nikolai Luzin: Let E be the set of all numbers x with
a continued fraction

    x = a0 + 1/(a1 + 1/(a2 + 1/(a3 + · · · ))),

where every an is an integer and there is some infinite subsequence an1, an2, . . .
such that ani divides ani+1 for every i. It turns out [TODO give reference for
proof] that E is Lebesgue measurable but not Borel.

⁵ We could make it more confusing. What we're trying to say is, the intersection of the
set of those sets of sets of subsets of a nonempty set whose intersections are not σ-algebras
with the set of sets of those sets of subsets of the same nonempty set which are σ-algebras is
empty.
But the Borel sets form a pretty good sample of the Lebesgue measurable
sets!

Theorem 38. For every Lebesgue measurable set A, there exists a Borel set B
so that λ(A Δ B) = 0. In fact, even better, there exist Borel sets B1, B2 so that
B1 ⊆ A ⊆ B2, but λ(B1) = λ(B2).

[TODO give reference for proof] [TODO discuss history]

44 Measures in general

As we discussed to motivate all this measure stuff, length is just one of many
intuitive notions which is best understood via measure theory. In general, a
measure space consists of a set of points, a collection of measurable sets of
points, and an assignment of a measure to each measurable set. Of course, the
collection of measurable sets and the measure itself have to satisfy some axioms.

Definition 32. A measure space is a triple (Ω, F, μ), such that:

Ω is a nonempty set (the set of points.)

F is a σ-algebra on Ω (the collection of measurable sets.) (So we require
that the empty set is measurable and the collection of measurable sets is
closed under complement, countable union, and countable intersection.)

μ : F → [0, ∞] is a function (the measure function) which satisfies

1. μ(∅) = 0.

2. If E1, E2, . . . is a countable sequence of disjoint measurable sets, then
μ(⋃ₙ Eₙ) = Σₙ μ(Eₙ). (The measure is countably additive.)
Example 7. Let M denote the collection of Lebesgue-measurable subsets of R.
Let λ : M → [0, ∞] denote the Lebesgue measure. Then (R, M, λ) is a measure
space.

Example 8. Let B denote the collection of Borel subsets of R. Let λ now
denote the restriction of Lebesgue measure to B. Then (R, B, λ) is a measure
space.

Example 9. One can extend the definitions of Lebesgue outer measure and
Lebesgue measurable to subsets of Rⁿ easily enough (just replace intervals with
boxes.) This yields a measure space (Rⁿ, M, λ). In the case n = 2, this is area,
and in the case n = 3, this is volume.

Example 10. If Ω is any nonempty finite set, define # : P(Ω) → R≥0 by setting
#(E) to be the number of elements in E. Then (Ω, P(Ω), #) is a measure space.
We call # the counting measure on Ω.

Example 11. Any situation with uncertainty, such as tossing two dice, or pre-
dicting the weather, or taking a difficult multiple-choice exam, can be modeled
by a measure space (Ω, F, Pr), where Ω is the set of possible outcomes, F is
some suitable collection of events (sets of outcomes), and the measure func-
tion Pr(E) gives the probability that the event E occurs. For this reason, we
say that a probability space is a measure space (Ω, F, μ) with μ(Ω) = 1. We'll
meet all these characters again in Chapter 9.

Example 12. The appropriate way to formalize surface area (e.g. the area of
a sphere) is of course also in terms of measure theory. But we don't want to use
the Lebesgue measure on R³; that's volume (so the measure of a sphere is zero.)
We can't use the Lebesgue measure on R² (a sphere is not a region in the plane.)
One suitable measure is called the Hausdorff measure. For each dimension d
and each Euclidean space Rⁿ, there is a Hausdorff measure H^d_n which gives the
d-dimensional measure of a measurable subset of Rⁿ. The definition even makes
sense if d is not an integer! [TODO give reference]

Example 13. Sometimes, physicists talk about point charges, but other times,
they talk about charge densities. Both of these are part of a more general
phenomenon: a charge distribution is properly represented by a measure space
(R³, F, q), where q(E) gives the total amount of charge enclosed in E ⊆ R³.
Similarly, mass is best described with measure theory.

Non-Example 1. Jordan measure is not a true measure. The class of Jordan
measurable sets does not form a σ-algebra, e.g. because every singleton set is
Jordan measurable but there are countable sets which are not Jordan measurable
(such as Q ∩ [0, 1].) Maybe it should have been called Jordan pseudo-measure.
Oh well.

[TODO make this section sound less like a textbook. We don't even have
any pictures!]

45 Episode II: Attack of the Clones (Banach-Tarski and paradoxical decompositions)

[TODO write. This should maybe be split into a couple of sections. There's a
lot to talk about, e.g. Tarski's circle-squaring problem, Banach measure, von
Neumann's paradox.]

Part II

More advanced stuff

Chapter 7

Integration

46 Introduction to the Riemann integral

You might not realize that there are different kinds of integrals. The one you
probably first learned about is the Riemann integral, named for the German
mathematician Georg Friedrich Bernhard Riemann.
The Riemann integral is closely related to Jordan measure. Suppose we
have a function f : [a, b] → R. We would like to approximate the area under the
graph of f by sampling f at finitely many locations. Then, we'd like to define
the integral ∫ₐᵇ f(x) dx by taking some sort of a limit as the number of samples
goes to infinity.

Figure 7.1: Approximating the integral of a nice function by left rectangles. We
partition [a, b] into several intervals, then sample f at the left endpoint of each interval.
This function is nice enough that a roller coaster car can drive on it.

Just like with the definition of Jordan measure, we have to be careful with
this limit. We'll start by giving the wrong definition, analogous to the definition
of fake Jordan measure:

Definition 33. If f : R → R is a function and a < b, then the fake Riemann
integral is defined by

    ∫ₐᵇ f(x) dx = lim_{n→∞} (b − a)/n · Σ_{i=1}^{n} f(a + i(b − a)/n).    (7.1)

That is, by definition, f is fake Riemann integrable if the above limit exists, in
which case its integral is the value of that limit.

This is a fine definition as far as it goes, but it is not the familiar integral.
For example, the standard rule ∫ₐᵇ f(x) dx + ∫ᵇᶜ f(x) dx = ∫ₐᶜ f(x) dx is not
true if ∫ is interpreted as a fake Riemann integral! For proof, let f be the
Dirichlet function (i.e. the indicator function of Q), let a = 0, let c = 2, and
let b = √2. Then ∫ₐᵇ f(x) dx = 0, because we only evaluate f at irrational values
in the definition of the integral. Similarly ∫ᵇᶜ f(x) dx = 0. But ∫ₐᶜ f(x) dx = 2,
because we only evaluate f at rational values for the calculation of this last
integral!
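To see the failure concretely, here is a small exact-arithmetic sketch (ours, not the
book's). Each sample point is stored as p + q√2 with rational p, q, so a point is
rational exactly when q = 0, and we can evaluate the fake Riemann sums for the
Dirichlet function on [0, √2], [√2, 2], and [0, 2] without floating-point guesswork.

```python
from fractions import Fraction

# A number p + q*sqrt(2) is represented by the pair (p, q) with p, q rational.
# Such a number is rational exactly when q == 0 (since sqrt(2) is irrational).

def dirichlet(point):
    p, q = point
    return 1 if q == 0 else 0

def fake_riemann_sum(a, b, n):
    """((b-a)/n) * sum_{i=1}^n f(a + i (b-a)/n) for f = Dirichlet."""
    dp = (b[0] - a[0]) / n          # rational part of (b - a)/n
    dq = (b[1] - a[1]) / n          # sqrt(2) part of (b - a)/n
    total = sum(dirichlet((a[0] + i * dp, a[1] + i * dq)) for i in range(1, n + 1))
    return (dp + dq * 1.41421356) * total   # numeric value, for display only

zero  = (Fraction(0), Fraction(0))
root2 = (Fraction(0), Fraction(1))   # sqrt(2)
two   = (Fraction(2), Fraction(0))

for n in [10, 100, 1000]:
    s1 = fake_riemann_sum(zero, root2, n)   # -> 0
    s2 = fake_riemann_sum(root2, two, n)    # -> 0
    s3 = fake_riemann_sum(zero, two, n)     # -> 2
    print(f"n = {n:4d}: [0,√2] ≈ {s1:.4f}, [√2,2] ≈ {s2:.4f}, [0,2] ≈ {s3:.4f}")
```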
One of the ways to approach the definition of the true Riemann integral is
by using Darboux sums. The upper Darboux sum will overestimate the definite
integral, while the lower Darboux sum will underestimate the definite integral.
As we refine these upper and lower Darboux sums, their values will become
closer and closer to the actual value of the integral. It can be shown that
Darboux integration is equivalent to Riemann integration, so we often use the
upper and lower (Darboux) sums to introduce the Riemann integral.

Definition 34. A partition P of the interval [a, b] is a set {x0, . . . , xn}, such
that a = x0 < x1 < · · · < xn = b.

Definition 35. Let f : [a, b] → R. The upper Darboux sum for a partition
P = {x0, . . . , xn} on [a, b] is defined as

    U(f, P) := Σ_{i=1}^{n} Mi (xi − xi−1),

where Mi := sup_{[xi−1, xi]} f(x), the supremum of f on the subinterval [xi−1, xi].
Similarly, the lower Darboux sum for P is

    L(f, P) := Σ_{i=1}^{n} mi (xi − xi−1),

where mi := inf_{[xi−1, xi]} f(x), the infimum of f on the subinterval [xi−1, xi].

These sums are essentially the same as the upper and lower approxima-
tions used in many calculus classes. The upper and lower sums correspond to
summing the areas of the upper rectangles (U (f, P )) and lower rectangles
Figure 7.2: Upper and lower Darboux sums.

(L(f, P )), so U (f, P ) and L(f, P ) trap the value of the integral between them.
(Figure 7.2.) How can we improve our estimate? By taking smaller and smaller
subintervals [xi1 , xi ], we can better approximate f on each subinterval, which
will give a better approximation for the integral. As we get finer and finer parti-
tions, the upper sums U (f, P ) decrease, while the lower sums L(f, P ) increase,
squeezing the value of the integral between them. Using this idea, we can define
the upper and lower Darboux integrals.
Definition 36. The upper Darboux integral U (f ) is defined as

U (f ) := inf{U (f, P ) : P a partition of [a, b]}.

Similarly, the lower Darboux integral L(f ) is defined as

L(f ) := sup{L(f, P ) : P a partition of [a, b]}.

If U(f) = L(f), then we say f is (Darboux-)integrable on [a, b], and write

    ∫ₐᵇ f = ∫ₐᵇ f(x) dx = U(f) = L(f).
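As a quick illustration (our own sketch, under the simplifying assumption that f is
increasing so that sups and infs sit at the endpoints of each subinterval), here is how
the upper and lower sums squeeze the integral of f(x) = x² on [0, 1].

```python
def darboux_sums(f, partition):
    """Upper and lower Darboux sums for an increasing function f.

    For monotone increasing f the supremum on [x_{i-1}, x_i] is f(x_i) and the
    infimum is f(x_{i-1}); this is a sketch, not the fully general computation.
    """
    upper = sum(f(partition[i]) * (partition[i] - partition[i - 1])
                for i in range(1, len(partition)))
    lower = sum(f(partition[i - 1]) * (partition[i] - partition[i - 1])
                for i in range(1, len(partition)))
    return upper, lower

if __name__ == "__main__":
    f = lambda x: x * x
    for n in [4, 16, 64, 256]:
        P = [i / n for i in range(n + 1)]
        U, L = darboux_sums(f, P)
        print(f"n = {n:4d}:  L(f,P) = {L:.5f}  <=  1/3  <=  U(f,P) = {U:.5f}")
    # Both sums squeeze toward the true integral 1/3 as the partition refines.
```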

Now that we've finished defining the Darboux integral, you're probably won-
dering about the definition of the actual Riemann integral. The main difference
from upper and lower Darboux integrals is that instead of the sup or inf on
[xj, xj+1], we allow any function value f(tj), for tj ∈ [xj, xj+1].

Definition 37. A tagged partition of the interval [a, b] is a set {x0, . . . , xn}
with a = x0 < x1 < . . . < xn = b, along with a set {t0, . . . , tn−1} with tj ∈
[xj, xj+1]. The mesh of a partition is the length of the largest subinterval,
max_{1≤j≤n}(xj − xj−1).

Definition 38. The Riemann integral ∫ₐᵇ f exists and equals S iff for all ε > 0,
there exists a δ > 0 so that all tagged partitions with mesh < δ satisfy

    | Σ_{j=1}^{n} f(tj)(xj − xj−1) − S | < ε.    (7.2)

It can be shown that this is the same as the Darboux integral. But this
is a rather unwieldy definition, which is why we like to use the Darboux in-
tegral definition. From now on, well use Darboux and Riemann integrability
interchangeably.
The relationship between the Riemann integral and Jordan measure is two-
fold. On the one hand, the Riemann integral generalizes one-dimensional Jordan
measure. To see how, for a set E ⊆ R, we define χ_E, the indicator function of
E, by

    χ_E(x) := 0 if x ∉ E,  1 if x ∈ E.    (7.3)

(χ_E(x) indicates whether x is in E.)

Proposition 31. Fix a set E ⊆ R. Suppose E is bounded, say E ⊆ [a, b],
so that we can think of χ_E as being a function [a, b] → R. Then E is Jordan
measurable if and only if χ_E is Riemann integrable. In this case,

    m(E) = ∫ₐᵇ χ_E(x) dx.    (7.4)

But on the other hand, the Riemann integral is a special case of two-
dimensional Jordan measure:

Proposition 32. Fix a bounded function f : [a, b] → R. Let A+ be the plane
region under the positive part of the graph of f, and let A− be the plane region
above the negative part of the graph of f. That is,

    A+ := {(x, y) ∈ R² : 0 ≤ y ≤ f(x)},    (7.5)
    A− := {(x, y) ∈ R² : f(x) ≤ y ≤ 0}.    (7.6)

Then f is Riemann integrable if and only if A+ and A− are both Jordan mea-
surable. In this case,

    ∫ₐᵇ f(x) dx = m(A+) − m(A−).    (7.7)

So the Riemann integral and Jordan measure are two sides of the same coin.

47 Dirichlet, Thomae, and Lebesgue's criterion

Recall the Dirichlet function χ_Q and the Thomae function

    f(x) = 0 if x is irrational,  1/q if x = p/q with p/q reduced and q > 0,

from Sections 8 and 11.
The Dirichlet function χ_Q is not Riemann-integrable on any interval [a, b],
a < b, because the set Q ∩ [a, b] is not Jordan measurable. We can also prove
this directly from the definition of the Riemann integral. Pick a partition P
of [a, b]. Each interval in the partition contains both rational and irrational
numbers, so f takes exactly the values 0 and 1 in each interval. Thus the upper
Darboux sum is U(f, P) = 1 · (b − a) while the lower sum is L(f, P) = 0, for
every partition P. As a result, the upper and lower integrals are b − a and 0
respectively, so the Dirichlet function is not Riemann-integrable.
Perhaps more surprisingly, the Dirichlet function's relative, the Thomae
function, is actually Riemann-integrable on [a, b]. For simplicity, we'll just look
at [0, 1]. Just like we argued for the Dirichlet function, since every interval con-
tains irrational numbers, the lower integral is zero. Now for the upper integral:
The key observation is that for any ε > 0, there are only finitely many rational
numbers r1, . . . , rN with f(rj) = 1/qj > ε. (Figure 7.3.)

Figure 7.3: The key observation to prove the Thomae function is Riemann-integrable:
For any ε > 0, there are only finitely many rational numbers rj with f(rj) = 1/qj > ε.
Note there are finitely many denominators qk with qk < 1/ε, and then only finitely
many numerators to go with each qk.

Now, use a partition P so that the intervals containing these rj have small
total length ≤ ε. (Figure 7.4.) Then the total contribution to the upper sum
from these intervals Ij is at most height × length ≤ 1 · ε. The total contribution
from all other intervals in the partition is at most height × length ≤ ε · 1. Thus

Figure 7.4: We cover all the points rj ∈ (0, 1) such that f(rj) > ε with the yellow
intervals of total length < ε. The yellow intervals' contribution to the upper Darboux
sum is then < ε. On the remaining part of [0, 1], shown in turquoise, the function f
takes values ≤ ε, and so the turquoise intervals' contribution to the upper Darboux
sum is also ≤ ε.

we can force the upper sum to be ≤ 2ε for any ε > 0, so the upper integral and
hence the Riemann integral are just zero.
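The key observation is easy to verify by brute force. Here is a small sketch (ours):
for a given ε it lists the finitely many reduced fractions p/q in (0, 1) with 1/q > ε.

```python
from fractions import Fraction
from math import gcd

def thomae_large_points(eps):
    """All rationals p/q in (0, 1) in lowest terms with 1/q > eps."""
    points = []
    q = 1
    while 1 / q > eps:
        for p in range(1, q):
            if gcd(p, q) == 1:
                points.append(Fraction(p, q))
        q += 1
    return sorted(points)

if __name__ == "__main__":
    for eps in [0.5, 0.2, 0.1]:
        pts = thomae_large_points(eps)
        print(f"eps = {eps}: {len(pts)} points, namely {pts}")
    # For eps = 0.1 only denominators up to 9 appear, so the "large" points
    # can be covered by intervals of tiny total length, as in the proof.
```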
What's the difference between the Dirichlet and Thomae functions that
makes one Riemann integrable and one not? Recall a key fact about conti-
nuity of these functions: the Dirichlet function is continuous nowhere, while the
Thomae function is continuous at every irrational (and discontinuous at every
rational). Could this be related to the fact that the Thomae function is Riemann
integrable, while the Dirichlet function is not?
It turns out yes, because there is the following criterion for Riemann integrability,
commonly known as Lebesgue's criterion.

Theorem 39 (Lebesgue's criterion for Riemann integrability). A bounded func-
tion f on [a, b] is (Riemann) integrable if and only if its set of discontinuities
has (Lebesgue) measure zero.

To use Lebesgue's criterion, we just need to find everywhere f is discon-
tinuous and compute the (Lebesgue) measure of that set. From this, we can
immediately see that the Dirichlet function is not Riemann integrable since it
is discontinuous everywhere, while the Thomae function is.
Theorem 39 can be interpreted probabilistically. (Probability will be for-
mally introduced in Chapter 9.) A bounded function f is Riemann integrable
on [a, b] if and only if when we pick a point x0 ∈ [a, b] uniformly at random, then
f is continuous at x0 almost surely, i.e. with probability one. [TODO rephrase
or rewrite to make clearer] Basically, pick some point in [a, b] randomly, and
check if f is continuous there. The function f is going to be continuous there
with probability one iff it is Riemann integrable.
We end this section with a brief proof outline of Lebesgue's criterion. First,
we need Riemann's condition for integrability.


Theorem 40 (Riemann's condition for integrability). A bounded function f on
[a, b] is integrable if and only if for every ε > 0, there exists a partition P of
[a, b] such that

    U(f, P) − L(f, P) ≤ ε.

This theorem is useful because it's generally fairly difficult to explicitly cal-
culate the upper and lower Darboux integrals U(f) and L(f). With this theorem,
to show Riemann integrability, it suffices to show that given some ε > 0, we can
construct a partition P with U(f, P) − L(f, P) ≤ ε.
Proof sketch (of Theorem 39, Lebesgue's criterion). Recall that when we look at
discontinuities, we like to look at the oscillation of f,

    ω_f(x) := inf_{I∋x} ω(f; I) := inf_{I∋x} sup_{s,t∈I} |f(s) − f(t)|,

where I ∋ x is an open interval. Then the set of discontinuities is

    D(f) = {x ∈ [a, b] : ω_f(x) > 0} = ⋃_{n∈N} {ω_f(x) ≥ 1/n}.

For the (⇒) direction, which is easier, we need to show that m({ω_f(x) ≥ 1/n}) = 0
for each n ∈ N. For the (⇐) direction, we use m({ω_f(x) ≥ 1/n}) = 0 for each
n ∈ N to construct a partition P with U(f, P) − L(f, P) < ε.
The key relation between oscillation and the Riemann integral is

    U(f, P) − L(f, P) = Σ_{j=1}^{n} ω(f; [xj−1, xj]) Δxj.    (7.8)

For (⇒), choose a partition P so that U(f, P) − L(f, P) < ε/n.
For (⇐), split up [a, b] into two kinds of intervals, Ii where the oscillation is
large, and Jj where the oscillation is small. Since m({ω_f ≥ 1/k}) = 0 for all
k ∈ N, the total measure of the Ii's is small.
Choose k so that 1/k < ε, and use compactness to cover {ω_f ≥ 1/k} with finitely
many (disjoint) intervals Ii, i ∈ A, with Σ m(Ii) ≤ ε. Let Jj be the intervals in
[a, b] \ ⋃ Ii, and take the partition P to consist of all the endpoints of the Ii. On
Jj, use the bound ω_f < 1/k < ε, which one can show implies ω(f; Jj) < ε. On Ii,
use the bound ω_f ≤ 2‖f‖_∞.

48 Episode III: Revenge of the Antiderivatives (Volterra)

Everyone remembers the fundamental theorem of calculus, which goes some-
thing like

    ∫ₐᵇ f(x) dx = F(b) − F(a),    (7.9)

where F is an antiderivative of f. But what conditions do we need to impose on
f to make sure that (7.9) works? One useful requirement is that F should exist
so we can make sense of the equation. But do we need to worry about existence
of ∫ₐᵇ f(x) dx? If an antiderivative F exists, do we need to say anything about
∫ₐᵇ f(x) dx existing?
It's not too hard to see that the answer is yes. Even if f has an antiderivative,
it might not be integrable. For example, let

    F(x) = x² sin(1/x²) if x ≠ 0,  and  F(0) = 0.    (7.10)

Figure 7.5: Plot of F(x).

Observe that F is differentiable, with derivative F′ = f given by

    f(x) = 2x sin(1/x²) − (2/x) cos(1/x²) if x ≠ 0,  and  f(0) = 0.    (7.11)

Figure 7.6: A truncated plot of F′(x).

This f is not technically Riemann integrable over the interval [−1, 1], for the
boring reason that it is unbounded. Although in this case, one could get
around this technicality with improper integration.
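The unboundedness is easy to see numerically: at the points x = 1/√(2πk) the sine
term vanishes and the cosine term equals 1, so f(x) ≈ −2/x. Here is a small check
(our own sketch).

```python
import math

def f(x):
    """Derivative of x^2 sin(1/x^2) away from 0."""
    return 2 * x * math.sin(1 / x**2) - (2 / x) * math.cos(1 / x**2)

if __name__ == "__main__":
    for k in [1, 10, 100, 10000]:
        x = 1 / math.sqrt(2 * math.pi * k)
        print(f"x = {x:.6f}:  f(x) = {f(x):,.2f}")
    # The values behave like -2/x = -2*sqrt(2*pi*k), so f is unbounded near 0
    # and cannot be Riemann integrable on [-1, 1].
```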

So what if we know that f is bounded and f has an antiderivative F? Surely
then we can conclude that f is Riemann-integrable, and f and F are related
by the FTC? It turns out, we cannot: there is a bounded function f with an
antiderivative F, such that f is not Riemann-integrable, i.e. (7.9) makes no
sense even though F exists!
In 1881, Volterra presented a function V(x) that is differentiable with bounded
derivative V′(x) that is not Riemann integrable. That is, ∫₀¹ V′(x) dx ≠ V(1) −
V(0), simply because the definite Riemann integral on the left does not exist.
The function V(x) is called Volterra's function.
Recall the Smith-Volterra-Cantor set from Section 36, a subset of [0, 1] which
does not contain an interval yet which has Lebesgue measure 1/2. Volterra's
function V is constructed so that its derivative V′ is discontinuous at every
point in SVC. Since the measure of the discontinuities of V′ will then be 1/2 > 0,
Lebesgue's criterion (Theorem 39) will imply that V′ is not Riemann-integrable.
Recall also the function f from Section 27 equal to x² sin(1/x) for x ≠ 0 and
0 for x = 0. It is everywhere differentiable, but f′ is not continuous at x = 0.
Now consider its rescaled relative

    r(x) = x² sin(π/x) for x ≠ 0,  and  r(0) = 0.

Take the portion of r on [0, 1/2], and rotate this portion around the point (1/2, 0)
to form the function g : [0, 1] → R. (Figure 7.7.)

Figure 7.7: This function g is the basis for constructing Volterra's function.

This function g is differentiable, but g′ is not continuous at x = 0 or x = 1,
since it behaves just like the function f. To form Volterra's function V(x), we
first shrink (while preserving the aspect ratio) the function g and slide it into
the first interval removed when forming SVC. We call this function g1. Next we
shrink and slide 2 more copies of g into the 2 intervals removed in the second
step of forming SVC and call this g2.
We continue this process by shrinking and sliding copies of g into every
open interval that was removed when forming the Smith-Volterra-Cantor set.

Figure 7.8: The first and second steps of forming Volterra's function.

When we do this for all intervals removed from [0, 1] (i.e. in the limit), we get
Volterra's function, V(x). More formally, we set

    V(x) = Σ_{n=1}^{∞} gn(x),

where gn(x) consists of 2^{n−1} shrunk copies of g slid into the 2^{n−1} intervals
removed in the nth step of forming SVC.
It turns out that V is continuous and differentiable¹ on [0, 1], and that V′
is bounded but discontinuous at every point in SVC. Since SVC has Lebesgue
measure 1/2, the set of discontinuities of V′ on [0, 1] has measure 1/2, so
V′ is not Riemann-integrable on [0, 1] by Lebesgue's criterion.
How did something like this sneak by the fundamental theorem of calculus?
Exactly for which functions does the fundamental theorem of calculus actually
hold? We'll revisit this problem later, in Sections 57 (we take derivatives of
integrals) and 61 (the FTC for grown-ups).

Remark 8. Although V′ is not Riemann-integrable on [0, 1], it is integrable
on many other intervals. For example, if we integrate over one of the inter-
vals removed from SVC, then V′ is continuous except at the endpoints, and the
fundamental theorem of calculus holds. However, we can construct wilder func-
tions that are not Riemann-integrable on any interval. One such example is a
Pompeiu derivative, which will be introduced in Section 64.

¹ Differentiability is easy away from SVC. For SVC, differentiability from one side is easy,
and differentiability from the other side can be shown using the quadratic bounds on each gn.

49 Episode IV: A New Hope for Integration (Lebesgue)

[TODO maybe add a discussion of history of R vs L integral?]
From Carothers [1]:

In our time, the Riemann integral has surely become the workhorse
of calculus. While this noble beast is a faithful and true servant, it
is not without its shortcomings; not entirely flawed, mind you, just
less than perfect. One such blemish, if you will, is that the Riemann
integral is not defined for as many functions as we might hope.

Figure 7.9: The Riemann integral is the workhorse of calculus.

One issue is that any unbounded function is automatically not Riemann
integrable. The interval in the partition that has the unbounded part is going
to contribute +∞ to the upper integral or −∞ to the lower integral. It doesn't
matter how fine you make the partition, ∞ still counts as ∞.² This might not
seem like an issue; after all, we can use improper Riemann integrals, but having
to use them hints that maybe the original Riemann integral isn't as great as we
hoped. Moreover, what if we start with a nice function f, and then simply redefine
f(x0) := +∞ at a single point x0? Then this new function isn't bounded and
so isn't Riemann integrable. But c'mon, it's just a single point, it's not like we
added any area, right?

² In fact, we can formalize the rules of working with infinity, which we'll do for Lebesgue
integrals and measurable functions in Section 50.
In addition, we have oddities like Volterra's function. The fundamental
theorem of calculus for Riemann integrals requires overly restrictive conditions
that we don't like. Even simply being Riemann-integrable requires continuity
a.e. by Lebesgue's criterion.
Finally, there's also the failure of limits and integrals to commute. For
example, take fn increasing to the Dirichlet function (throw one rational up to 1
each time); then ∫₀¹ χ_Q(x) dx = ∫₀¹ lim_{n→∞} fn(x) dx doesn't exist as a Riemann
integral. It certainly isn't lim_{n→∞} ∫₀¹ fn(x) dx = 0, and we don't like that. We
would also like to exchange derivatives with integrals, but that is difficult when
limits and integrals don't necessarily commute.
In the early 1900s, Lebesgue addressed the issue of integration, and devel-
oped a better method of integration now called Lebesgue integration. It allows
for more functions to be integrated and has nice properties that the Riemann
integral lacks. Informally, the idea of the Lebesgue integral is to partition the
y-axis instead of the x-axis. Instead of summing the height times the width of
rectangles partitioned along the x-axis, the idea is to sum the y-value times a
weight corresponding to how often the function takes values close to that
y-value. There's a quote from Lebesgue that tries to capture this idea [TODO
citation]:

I have to pay a certain sum, which I have collected in my pocket.
I take the bills and coins out of my pocket and give them to the
creditor in the order I find them until I have reached the total sum.
This is the Riemann integral. But I can proceed differently. After I
have taken all the money out of my pocket I order the bills and coins
according to identical values and then I pay the several heaps one
after the other to the creditor. This is my integral. (Figure 7.10)

Now we flesh out some details: for an interval of the y-axis [y0, y0 + Δy], we
look at x ∈ [a, b] such that f(x) ∈ [y0, y0 + Δy]. What we need to know is the
measure (λ) of the set

    S_{y0} := {x ∈ [a, b] : f(x) ∈ [y0, y0 + Δy]}.    (7.12)

(Figure 7.11.) The idea is that the region where y0 ≤ f(x) ≤ y0 + Δy contributes
roughly y0 · λ(S_{y0}) to the integral. Of course, the set S_{y0} can in general be much
more complicated than the intervals in Figure 7.11, but we're fine as long as it's
a Lebesgue measurable set. In this way, the problem of integration has been
simplified to a problem of measure!
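Here is a toy numerical sketch (ours) of this partition-the-y-axis idea for f(x) = x²
on [0, 1]: each sample point contributes its bit of measure to the slab containing f(x),
and the integral is the sum of slab height times slab measure. It lands close to the
familiar value 1/3.

```python
def lebesgue_style_integral(f, a, b, n_y=1000, n_x=100000):
    """Approximate the integral of f >= 0 on [a, b] by slicing the y-axis.

    Each sample point x contributes dx to the measure of the slab containing
    f(x); the integral is then sum over slabs of (slab height) * (slab measure).
    A pedagogical sketch, not an efficient quadrature rule.
    """
    dx = (b - a) / n_x
    values = [f(a + (i + 0.5) * dx) for i in range(n_x)]
    y_max = max(values)
    dy = y_max / n_y
    measure = [0.0] * (n_y + 1)   # measure[j] ~ m({x : f(x) in [j*dy, (j+1)*dy)})
    for v in values:
        measure[min(int(v / dy), n_y)] += dx
    return sum(j * dy * m for j, m in enumerate(measure))

if __name__ == "__main__":
    print(lebesgue_style_integral(lambda x: x * x, 0.0, 1.0))   # close to 1/3
```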
Conveniently, with Lebesgue integration, we get rid of problems like not
being able to integrate Volterra's function (V′ is Lebesgue-integrable) and not
being able to integrate the Dirichlet function (this is also Lebesgue-integrable).

Figure 7.10: How Riemann counts his money vs. how Lebesgue counts his money.
Either way, it is clear that inventing new types of integrals is not very profitable.
(Oriel illustrated this)

Figure 7.11: The idea of Lebesgue integration. Here S_{y0} = S1 ∪ S2 ∪ S3.

Additionally, it will turn out that the Lebesgue integral agrees with the Riemann
integral whenever the latter is defined.

Theorem 41. If f : [a, b] → R is Riemann integrable, then f is Lebesgue
integrable on [a, b] and the two integrals agree.

Finally, just as Riemann integration related the area under a curve to 2D
Jordan measure (cf. Section 46), so Lebesgue integration relates the area
under a curve to 2D Lebesgue measure. (Figure 7.12.)
Proposition 33. Let f ≥ 0 be measurable and let A be the plane region under
the graph of f,

    A := {(x, y) : a ≤ x ≤ b and 0 ≤ y ≤ f(x)},

where we allow a = −∞ or b = +∞. Then the set A is Lebesgue measurable,
and

    ∫ₐᵇ f dx = m(A),

where m denotes the 2-dimensional Lebesgue measure.
We say f : R → R is measurable if {f > α} is measurable for every α ∈ R.
Pretty much any function on R that you can think of is Lebesgue measurable.
Proof. The proof is actually quite short if we assume Theorem 47 (Fubini-
Tonelli) from Section ??, which allows us to use iterated integrals freely if f ≥ 0.
Then

    ∫ₐᵇ f dx = ∫ₐᵇ ∫_R χ_{[0,f(x)]}(y) dy dx = ∫_{R²} χ_{[a,b]×[0,f(x)]}(x, y) dm
             = ∫_{R²} χ_A(x, y) dm = m(A).

For measurability, which we probably should have shown first, note that it is
easy if f is a nice simple step function (i.e. it only takes on finitely many
y-values), since then the area under the curve is a nice union of products of the
form (measurable set) × [y1, y2]. For general measurable f, we'll see shortly that
we can approximate it from below by simple functions. Then the area under the
curve will just be the union of the areas under the simple functions.

Figure 7.12: The integral of (part of) the Columbia River equals Oregon (mostly),
the region below the river. [Feel free to edit/replace if this isn't what you had in mind.
The file is in gimp format (xcf). -lhs]

50 Integration on measure spaces

[TODO I think the difficulty just jumped quite a bit? -lhs]
We could go ahead and define the Lebesgue integral using the Lebesgue
measure, but let's go more general. Though if it makes you happier, you can
imagine that we're on R and all our measures are the Lebesgue measure. More
generally, once we have a measure space, we can talk about integration. So fix
your favorite measure space (Ω, Σ, μ). We're going to want to integrate measur-
able functions, i.e. f : Ω → R such that {f > α} ∈ Σ for all α. (Figure 7.13.)
We can accommodate f : Ω → C by requiring Re f and Im f to be measurable.

Figure 7.13: A function f : Ω → R is measurable if {f > α} is a measurable set for
all α ∈ R. In this picture, Ω = R⁺. Continuous functions are always Borel measurable.

There's some algebraic structure: measurable functions form an algebra (vec-
tor space with multiplication) and a lattice (we can take sups and infs). Moreover,
measurable functions are closed with respect to pointwise limits.
Let's start with the easiest functions to integrate. How about a function
that takes a constant value c on some measurable set A, and is zero everywhere
else? This is just the scaled characteristic function f = c·χ_A. We'd like the
integral to just be c · μ(A).

Figure 7.14: The integral of c·χ_A is just c · μ(A), which agrees with the nice simple
case where A ⊆ R is a nice union of a few intervals.

Using just this idea, we're going to define integrals starting with simple func-
tions. These are measurable functions φ : Ω → R or C such that card(φ(Ω)) < ∞,
i.e. φ only takes on finitely many values. Maybe more clearly, such a func-
tion always has a canonical representation in terms of characteristic functions,

    φ = Σ_{i=1}^{N} ci χ_{Ai},  with ci ≠ cj and Ai ∩ Aj = ∅ for i ≠ j.

Note that if it weren't for ci ≠ cj, this wouldn't be unique, since we could further
decompose the Ai. Equivalently, we can just require that Ai = φ⁻¹(ci).

Figure 7.15: Some simple functions look simple. Be careful that the sets Aj may
be rather wild though! Those dotted lines on the left that make the picture look like
not-a-function represent something like the Dirichlet function or one of its relatives.

Just like in the really simple case of characteristic functions, define the in-
tegral of a simple function by setting

    ∫ φ dμ := Σ_{i=1}^{N} ci μ(Ai).

Now that we have integration figured out for simple functions, we turn to mea-
surable functions in general. We're going to start with measurable functions
f ≥ 0. Once we have this, it will be easy to extend to general measurable
f : Ω → C by breaking f up into real and imaginary parts and then positive
and negative parts. We will use simple functions to approximate f ≥ 0. And
we're going to approximate it pointwise since that's convenient. This is the idea
of cutting up the y-axis.
Lemma 3 (Approximation lemma). Let f ≥ 0 be measurable. Then there exists
a sequence of simple functions φn ≥ 0 such that φn increases to f pointwise.

Proof. The idea is to cut up the interval [0, 2ⁿ] on the y-axis into segments of
length 1/2ⁿ. (Figure 7.16.) Then we set

    φn = Σ_{j=0}^{2^{2n}−1} (j/2ⁿ) χ_{{j/2ⁿ ≤ f < (j+1)/2ⁿ}}.
Figure 7.16: The approximation lemma.
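A pointwise version of this construction is easy to code. The following sketch (ours)
uses the common variant φn(y) = min(⌊2ⁿ y⌋/2ⁿ, 2ⁿ), which also caps large values
at 2ⁿ, and shows the values climbing up to f(x).

```python
import math

def phi(n, y):
    """Simple-function approximation evaluated pointwise.

    Chop [0, 2^n] into steps of height 2^-n, round y down to the nearest step,
    and cap at 2^n.
    """
    return min(math.floor(y * 2**n) / 2**n, 2**n)

if __name__ == "__main__":
    f = math.exp
    x = 1.7                        # f(x) = e^1.7 ≈ 5.4739
    for n in range(1, 8):
        print(f"n = {n}:  phi_n(f(x)) = {phi(n, f(x)):.5f}")
    print(f"f(x)          = {f(x):.5f}")
    # The printed values increase with n and converge up to f(x),
    # as the approximation lemma promises.
```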

Now we define the integral of a measurable function f ≥ 0.

Definition 39. Let f ≥ 0 be measurable. Then

    ∫ f dμ := sup { ∫ φ dμ : 0 ≤ φ ≤ f, φ simple } ∈ [0, +∞].

For f real-valued and measurable but not necessarily nonnegative, write
f = f₊ − f₋, where f₊ := max{f, 0} ≥ 0 and f₋ := max{−f, 0} ≥ 0, and
define

    ∫ f dμ := ∫ f₊ dμ − ∫ f₋ dμ

whenever both ∫ f₊ dμ < ∞ and ∫ f₋ dμ < ∞. If this happens, then f is called
integrable. Note that |f| = f₊ + f₋, so f is integrable iff |f| is integrable. If
only one of f₊, f₋ is integrable, then we can still assign meaning to ∫ f dμ by
allowing it to be ±∞. Finally, for complex-valued measurable functions, we
break up the integral into real and imaginary parts and demand that each part
is integrable.
Remark 9. Here's a fun fact: sums are integrals in disguise! For example, take
the measure space (R, P(R), μ_N), where μ_N is the measure that puts an atom of
weight one at each natural number. Then

    ∫ f(x) dμ_N = Σ_{n=1}^{∞} f(n).    (7.13)

This means all the theorems we'll prove later about Lp spaces or integration on
measure spaces carry over to sums. Of course, the special case for sums usually
has its own simpler proof, but it's fun to hit it with the general-measure-theory
hammer.

51 Nice functions and the 3 limit theorems

Integrable functions don't have to be very nice. Remember, measurable sets
don't have to be very nice, and we can always take the indicator function of
some wacky set with finite measure and integrate it. But fortunately, we have
all sorts of nice functions living inside integrable-function-land that we can use
to approximate any general integrable function. To do this, first we're going to
introduce the three well-known convergence theorems for Lebesgue integration.
They let us interchange limits and integrals under certain conditions. We do
not have such nice conditions for the Riemann integral.
Let (X, Σ, μ) be a measure space. First we need a quick lemma, the proof
of which follows from the definitions in the previous section.

Lemma 4.

(i) If 0 ≤ f ≤ g, then ∫_X f dμ ≤ ∫_X g dμ.

(ii) If A ⊆ B and f ≥ 0, then ∫_A f dμ ≤ ∫_B f dμ.

(iii) If f ≥ 0 and c ∈ R, then ∫_X cf dμ = c ∫_X f dμ.

Theorem 42 (Monotone convergence theorem (MCT)). Given fn ≥ 0 measur-
able increasing to f pointwise, i.e. f1(x) ≤ f2(x) ≤ · · · → f(x), then

    ∫_X f(x) dμ = lim_{n→∞} ∫_X fn(x) dμ.    (7.14)

Figure 7.17: We can switch limit and integral if the sequence of functions is pointwise
increasing.

Proof. First, f = sup fn is measurable since {f > α} = ⋃_{n∈N} {fn > α}, and
each set in the union is measurable.³

³ More generally, the pointwise limit of measurable functions is measurable; use lim =
lim sup and the facts that sups and infs of measurable functions are measurable.

By Lemma 4, the sequence ∫_X fn dμ is increasing, so it converges to some limit,
say ℓ. Since fn ≤ f,

    ℓ ≤ ∫_X f dμ.

For the other inequality, approximate f by a measurable simple function s ≤ f.
Let 0 < α < 1 and set Xn := {x ∈ X : fn(x) ≥ α s(x)}. Since α < 1, we have
Xn ↑ X. Also

    ∫_X fn dμ ≥ ∫_{Xn} fn dμ ≥ α ∫_{Xn} s dμ,  for all n ∈ N.

Letting n → ∞ and using continuity of measure yields

    ℓ ≥ α ∫_X s dμ.

Since this holds for all α < 1 and all simple functions s ≤ f, this implies

    ℓ ≥ ∫_X f dμ.

Theorem 43 (Fatou). Given fn ≥ 0 measurable, then

    ∫_X lim inf_{n→∞} fn(x) dμ ≤ lim inf_{n→∞} ∫_X fn(x) dμ.    (7.15)

Note that the integral of fn does not need to be finite, and fn may not
converge. For example, look at fn(x) ≡ 1/n on R. Fatou's lemma tells us that
when we're integrating ∫_X fn, the integral can only drop down as we go to ∫_X f.
(Figure 7.18.) The proof is by applying MCT to gn := inf_{k≥n} fk, since gn is
increasing and lim inf_{n→∞} fn = lim_{n→∞} gn.

Theorem 44 (Dominated convergence theorem (DCT)). Suppose (fn) is a
sequence of measurable functions such that |fn(x)| ≤ g(x) for all n ∈ N, with g
integrable, i.e. ∫_X |g(x)| dμ < ∞. If, in addition, fn → f pointwise, then

    ∫_X f(x) dμ = lim_{n→∞} ∫_X fn(x) dμ.    (7.16)

In fact,

    ∫_X |f(x) − fn(x)| dμ → 0.    (7.17)

Proof. Apply Fatou to 2g − |fn − f| ≥ 0 and subtract ∫_X 2g dμ < ∞ from both
sides to obtain

    lim sup_{n→∞} ∫_X |fn − f| dμ ≤ 0,

Figure 7.18: The integral can only drop down in the limit.

Figure 7.19: The Dominated Convergence Theorem. TODO make the graph look
like a bicep :P or have the graph outlining the top of a bicep

using lim inf(−h) = − lim sup h. Note that (7.16) and (7.17) are equivalent,
using⁴ |∫_X f dμ| ≤ ∫_X |f| dμ and applying (7.16) to |f − fn|.

The class of integrable functions gets a name, L¹(X, Σ, μ). It consists of all
measurable f such that ∫_X |f| dμ < ∞. We'll talk more about this in Section 53.
If we identify integrable functions that differ only on a measure zero set, then
L¹ is a Banach space with the norm

    ‖f‖_{L¹} := ∫_X |f| dμ.    (7.18)

⁴ To prove |∫_X f dμ| ≤ ∫_X |f| dμ, prove it for simple functions and then approximate. Alterna-
tively, let α ∈ C, |α| = 1, be such that |∫_X f dμ| = α ∫_X f dμ. Then

    |∫_X f dμ| = ∫_X α f dμ = Re(∫_X α f dμ) = ∫_X Re(α f) dμ ≤ ∫_X |f| dμ.

Using the convergence theorems, we can prove nice things about nice func-
tions in L¹. For example,

Proposition 34. Simple functions are dense in L¹(X, Σ, μ).

This is easy to prove using MCT or DCT and the approximation lemma
(Lemma 3).
Simple functions are simple, but maybe not particularly nice. But we also
have:

Proposition 35. Let X be a locally compact metric space and μ a regular Borel
measure. Then the continuous compactly supported functions, denoted Cc(X),
are dense in L¹(X, Σ, μ).

Proof sketch. The idea is to use Urysohn's lemma, which states that if X is
a normal space and A, B are closed disjoint subsets of X, then there exists
a continuous function f : X → [0, 1] with f(x) = 0 if x ∈ A and f(x) = 1
if x ∈ B. Since simple functions are dense in L¹, we only need to show that χ_K
Figure 7.20: Urysohn's lemma, applied to A = K and B = Oᶜ, with μ(O \ K) ≤ ε.
The proof of Urysohn's lemma is actually quite difficult, although it is easy for metric
spaces.

can be approximated by continuous functions, for any K compact. Use (outer)
regularity to get an open set O ⊇ K with μ(O \ K) ≤ ε, and then apply
Urysohn's lemma. (Figure 7.20.)

Remark 10. The theorem holds more generally, e.g. for X a locally compact,
σ-compact Hausdorff space and μ a Radon (locally finite, inner regular) measure.

In Rⁿ, we can do even better:

Proposition 36. Let U ⊆ Rⁿ be open. The space of infinitely differentiable
compactly supported functions, denoted C_c^∞(U), is dense in L¹(U, dx).

The idea is to convolve a Cc function with an approximate delta function,
which produces a nice C_c^∞ function. Convolutions will be discussed in Sec-
tion 58.

Remark 11. In Section 53, we will introduce Lp spaces, which have norm
‖f‖_{Lp} := (∫_X |f|^p dμ)^{1/p}. Propositions 34, 35, and 36 still hold for 1 ≤ p < ∞,
with essentially the same proof. Proposition 34 holds for p = ∞ using the
approximation lemma, but Propositions 35 and 36 do not hold for p = ∞.

References: [4, 7]. Urysohn's lemma: [3]. [TODO also reference https://
terrytao.wordpress.com/2009/03/02/245b-notes-12-continuous-functions-on-locally-compact-hausdorf
for Cc functions.]

52 Detour: Convexity

[TODO make less like a textbook 26 June 2016 -lhs]
Convexity is a pretty useful topic in analysis. You've probably heard of
convex functions before (maybe the terms "concave up" and "concave down"
are familiar to you from first year calculus), but we're going to give a proper
definition here.
Before talking about convex functions, we need to talk about convex sets in
a vector space (e.g. think of convex polygons). So let V be an R- or C-vector
space.

Definition 40. C ⊆ V is called convex if for every x1, x2 ∈ C, one has

    [x1, x2] = {t x2 + (1 − t)x1 : 0 ≤ t ≤ 1} ⊆ C.

In other words, the line segment between any two points in C is contained in
C. (Figure 7.21.)

Figure 7.21: A convex set.

Now we can define convex functions.

Definition 41. Let f : C → R be a function, where C ⊆ V is convex. Then, f
is convex if its epigraph, epi(f) = {(x, t) ∈ C × R : t ≥ f(x)}, is a convex set in
V × R. (Figure 7.22.)

Figure 7.22: The epigraph of a function. In the picture, C = [−2, 1].

(We can't help ourselves; note that e^x is convex.)

This definition is equivalent to: f is convex ⟺

    for all x1, x2 ∈ C and t ∈ [0, 1]:  f(t x2 + (1 − t)x1) ≤ t f(x2) + (1 − t) f(x1).    (7.19)

In other words, the function value is ≤ the secant value (Figure 7.23). The
function f is called strictly convex if Equation (7.19) becomes a strict inequality.
Sometimes (7.19) is taken as the definition of a convex function.

Figure 7.23: The function value is below the secant line value.
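As a quick sanity check of (7.19), here is a small sketch (ours) that tests the secant
inequality for f = exp at randomly chosen points; the helper name is made up.

```python
import math
import random

def secant_inequality_holds(f, x1, x2, t):
    """Check f(t*x2 + (1-t)*x1) <= t*f(x2) + (1-t)*f(x1), cf. (7.19)."""
    left = f(t * x2 + (1 - t) * x1)
    right = t * f(x2) + (1 - t) * f(x1)
    return left <= right + 1e-12          # tiny slack for rounding

if __name__ == "__main__":
    random.seed(0)
    trials = [(random.uniform(-3, 3), random.uniform(-3, 3), random.random())
              for _ in range(10000)]
    print(all(secant_inequality_holds(math.exp, x1, x2, t) for x1, x2, t in trials))
    # prints True: exp is convex.  Replacing exp with math.sin makes it fail.
```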

You're probably most familiar with the case V = R and C = I a (finite)
interval, and maybe with the definition that f is convex iff f″(x) ≥ 0, or
maybe that f is convex iff f′ is increasing. Of course, these require that f be
twice differentiable or at least differentiable. Conveniently, we have:

Proposition 37. Let f : I → R be differentiable. Then, f is convex ⟺ f′
is increasing. Moreover, f is strictly convex ⟺ f′ is strictly increasing.

Proof. () This comes from looking at the geometry of convex functions, which
well do right after this.
() Let x0 < x1 , x0 , x1 I. Define xt := t x1 + (1 t)x0 , 0 t 1. By the
52. DETOUR: CONVEXITY 171

mean value theorem,


f (xt ) f (x0 ) = f (1 )(xt x0 ) = f (1 )t(x1 x0 ) (7.20) eqn:mvt1

f (x1 ) f (xt ) = f (2 )(x1 xt ) = f (2 )(1 t)(x1 x0 ) (7.21) eqn:mvt2
eqn:mvt1 eqn:mvt2
Adding (1 t) times equation (7.20) and t times equation (7.21) yields,
(1 t)f (x0 ) + tf (x1 ) f (xt ) = (1 t)t(x1 x0 )(f (2 ) f (1 )) 0,
with a strict inequality if f is strictly increasing.

As promised, now well look at the geometry of convex functions. The key
lemma is the 3-chord lemma. Were not going to prove it here, but drawing
fig:3-chord-lemma
pictures (Figure 7.24) should make it somewhat convincing.
Lemma 5 (3-chord lemma). Let I R be an interval, f : I R be convex.
Then
f (b) f (a) f (c) f (a) f (c) f (b)
,
ba ca cb
for a < b < c and a, b, c I.

x
a b c

fig:3-chord-lemma Figure 7.24: 3-chord lemma

Heres some nice consequences of the 3-chord lemma.


1. Right and left derivatives exist. Moreover, (D f )(x) (D+ f )(x).
Using the first inequality in the 3-chord lemma, we have
f (x + h) f (x) f (x + h) f (x)
inf = lim+ = (D+ f )(x) < .
h>0
| h
{z } h0 h
decr in h>0 by lemma

Similarly, using the second inequality in the lemma,


f (x) f (x h) f (x) f (x h)
sup = lim = (D f )(x) > .
h>0 | h
{z } h0 h
incr in h>0 by lemma
172 CHAPTER 7. INTEGRATION

x
a b c d

fig:convex-increasing Figure 7.25: Monotonicity picture.

2. (D+ f ) and (D f ) are increasing.


Let a < c; then
f (b) f (a) f (c) f (b) f (d) f (c)
(D+ f )(a) ,
ba cb dc
f (d)f (c) fig:convex-increasing
so (D+ f )(a) dc . Then take d c+ . (Figure 7.25.)
3. f is locally Lipschitz and hence continuous. Fix an interval [a, d] and let
a b < c d. Then
f (c) f (b)
(D+ f )(a) (D f )(d),
cb
so
|f (c) f (b)| max{|(D+ f )(a)|, |(D f )(d)|} |c b|.
By locally Lipschitz, we mean that for every point, there is an open neigh-
borhood on which f is Lipschitz. We need it to be open to avoid functions
like f (x) = 0 on [0, 1) with f (1) = 1.
4. D+ f is right continuous and D f is left continuous. Thus, x0 is a point
of discontinuity for (D+ f ) iff x0 is a point of discontinuity for (D f ).
(Use that for x < y, (D+ f )(x) (D f )(y).)
As a result, f is differentiable at all but at most countably many points. For
this, if (D f )(x) < (D+ f )(x), index it by qx Q where (D f )(x) < qx <
(D+ f )(x). If x < y, then (D f )(x) (D+ f )(x) (D f )(y) (D+ f )(y)
ensures that all the qx are distinct, so there can only be countably many
discontinuities.
What does all this about convexity have to do with integration? It turns
out we can use convexity
sec:Lp
to prove things about the Lebesgue Lp spaces, which
well do in Section 53. Anyway, heres the first useful connection to integration:
Theorem 45 (Jensens inequality). Fix a probability space (, , ), i.e. () =
1. If f L1 (, ) is real-valued and a < f (x) < b (a = and b = are
allowed) for all x , then for any convex function on (a, b),
Z  Z
f d ( f ) d. (7.22)

53. AN INTRODUCTION TO LP SPACES 173

Dont forget the requirement of having a probability space! The proof of


Jensen relies on the geometry of convex functions, in particular:

Lemma 6. For every x0 C, there exists t : R R affine such that

(i) t(x0 ) = f (x0 )

(ii) t(x) f (x), for all x C.

fig:jensen-key-lemma Figure 7.26: The key lemma in Jensens inequality.

TODO proof sketch! If f is strictly convex, it can be shown that t(x) satisfies
t(x) < f (x) for all x 6= x0 , which can be used
R to show that if f is strictly convex,
then we get equality in Jensen iff f (x) = f d a.e.
One immediate application of Jensens inequality is the well-known AM-GM
inequality.

TheoremP46 (Generalized AM-GM). Let a1 , . . . , an 0 and p1 , . . . , pn > 0


n
such that j=1 pj = 1. Then
n
X
ap11 apnn pj a j , (7.23) eqn:am-gm
j=1

with equality iff a1 = = an .

The proof is to look at (x) = ex and = {1, 2, . . . , n}, ({j}) = pj ,


f (aj ) = log aj . Then

Xn n
X
exp pj log aj pj a j ,
j=1 j=1

eqn:am-gm
which is (7.23). Garling
References: [2]

53 An introduction to Lp spaces
sec:Lp
[TODO check when do we use sigma-finite?]
174 CHAPTER 7. INTEGRATION

Fix (, , ) a measure space. For 0 < p < , the Lebesgue Lp (, , )


space (well also just write Lp or Lp () or Lp () or Lp (, ) or whatever is
convenient and clear from context) is
Z
p
L (, , ) := {f : C measurable : |f |p d < }/ a.e..

Basically, we take all measurable functions whose pth powers are integrable, and
then mod out by functions that are the same everywhere. Functions that are
equal a.e. have identical integrals, and we really dont care if functions differ
on just a set of measure zero. So Lp is really a bunch of equivalence classes of
functions. Nevertheless, well generally still talk about elements of Lp as usual
functions since its convenient; just keep the equivalence class idea in the back
of your mind. We also define for 1 p < ,
Z 1/p
kf kp := |f |p d(x) , (7.24) eqn:Lp-norm

which we have yet to show is a norm. In fact, this is not a norm for p < 1. Note
eqn:Lp-norm
that in order for (7.24) to have any hope of being a norm, we need functions
that are zero a.e. to be identified with the zero element in Lp . So indeed we
need theoe equivalence classes.
Before going on, note that L1 is just Lebesgue-integrable functions, or more
precisely, the set of equivalence classes of integrable functions under the relation
equal a.e.. We have yet to define L (, , ). It gets a slightly different
definition:

L (, , ) := {f measurable : f is bounded a.e.}/ a.e.,

and
kf k = esssup |f | = inf{C : |f (x)| C a.e.}.

The L norm is basically the supremum of f , but not counting sets of measure
zero where f might be large: if you go crazy and set f = on a set of measure
zero, we just ignore it. Hence the term essential supremum.
You might wonder, why do we call it L ? Heres some motivation: If has
finite measure, then L () Lp () and

lim kf kp = kf k (7.25)
p

for any bounded measurable function f . Note, we really mean that there is some
function in the equivalence class that is bounded. Equivalently, kf k < . The
idea is that anything raised to the power 1/p is going to get killed as p .
Let S := {x : |f (x)| kf k }, with 0 < < kf k . Then
Z 1/p
p
kf kp (kf k = (kf k ) (S )1/p , (7.26)
S
53. AN INTRODUCTION TO LP SPACES 175

so lim inf p kf kp kf k . On the other hand,

kf kp kf k ()1/p , (7.27)

so lim supp kf kp kf k .
Remark 12. One cool thing about Lp spaces is that they include the little p
sequence spaces. For example, p (N) = Lp (R, P(R), N ), where N is the mea-
sec:integration-measure
sure that puts an atom of weight one at each point in N. cf. end of Section 50,
remark about sums are integrals in disguise.
Here are some useful properties of Lp spaces, 1 p < . Were omitting
teschl-fa
the proofs, but see [7] or any real analysis book that covers Lebesgue integration
if youre interested.
eqn:Lp-norm
Lp is a normed vector space with respect to (7.24). This comes down to
the triangle inequality (Minkowski), which says,

kf + gkp kf kp + kgkp .

One possible proof uses the fact that t 7 tp is convex on [0, ) for p 1.
Lp is complete (Riesz-Fischer). The proof involves showing that a Cauchy
sequence in Lp has a subsequence converging pointwise a.e. to some mea-
surable function, and then using Fatou.
1 1
Holders inequality. Let p, q > 1 with p + q = 1. Then

kf gk1 kf kp kgkp .

The case when p = q = 2 becomes the Cauchy-Schwarz inequality. The


proof of Holder can be done using Jensen (or AM-GM).
Now what happens if 0 < p < 1? Well, the biggest problem is that kkp from
eqn:Lp-norm
(7.24) is not a norm, since t 7 tp is strictly concave on [0, ) if 0 < p < 1. (We
get just the reverse triangle inequality instead!) But
R not everything is lost! We
can define a metric on Lp by setting dp (f, g) := |f g|p d(x). The triangle
inequality boils down to the fact (a+b)p ap +bp for a, b 0 and 0 < p < 1.5 In
fact, (Lp , dp ) is a complete metric space (same completeness proof as for p 1).
But we normally care about 1 p more.
The last thing well mention in this section is the relationship between Lp
spaces for different p. In general, theres no nice inclusion property, although we
can try to say some things using interpolation inequalities. Lyapunov and Lit-
tlewood are the names of two such examples that come from Holder. However,
in the following two special cases, we can say a lot.
5 Since p 1 < 0,
(a + b)p = a (a + b)p1 + b(a + b)p1 a ap1 + b bp1 .
176 CHAPTER 7. INTEGRATION

1. Finite measure spaces (() < ). In this case, Lp (, ) is decreasing in


p [1, +], i.e.
L1 (, ) L (, ).
Moreover, if 1 p < q , then
()1/p
kf kp kf kq .
()1/q
The proof is via Holder.
2. Sequence spaces. We have the opposite result here. The sequences spaces
p are increasing in p [1, +], i.e.
1 (N) (N).
Moreover, if 1 p < q , then kf kq kf kp .
Heres some intuition: On a finite measure space, the way you fail to be inte-
grable is if you blow up to too fast somewhere. Raising large numbers to
the pth power for p 1 just makes the integral even bigger, so L1 (which has
the smallest p) is going to contain the most functions. For a concrete example,
think of something like 1x on [0, 1].
For the sequence spaces, the way you fail to be summable is if your tail is too
large. If you are in some p space, then the tail terms must tend to zero. Raising
these small numbers to the pth power for p 1 makes them even smaller, so
higher pth powers help with convergence, meaning that as p increases, p will
contain more things. (Of course, the jump from p , 1 p < to seems
quite large!)

54 Be careful with Fubini


sec:fubini
Warm-up: The following iterated integrals are not equal:
Z 1Z 1
x2 y 2
I1 = 2 2 2
dy dx
0 0 (x + y )
Z 1Z 1
x2 y 2
I2 = 2 2 2
dx dy.
0 0 (x + y )

Using the substitution y = x tan , we can evaluate I1 = 4 . But note by


changing dummy variables x y,
Z 1Z 1 Z 1Z 1
x2 y 2 x2 y 2
I2 = 2 2 2
dx dy = 2 2 2
dy dx = I1 .
0 0 (x + y ) 0 0 (x + y )

Why doesnt Fubinis theorem work here? Heres the general measure theory of
Fubinis thoerem (combined with Tonelli, which is for nonnegative functions).
Were glossing over the details about forming product measures.
thm:fubini Theorem 47 (Fubini-Tonelli).
TODO: non-warm-up. irrational stuff?
55. RADON-NIKODYM AND RIESZ REPRESENTATION 177

55 Radon-Nikodym and Riesz representation


sec:radon-nikodym
Once upon a time there were two -finite measures , on a common measurable
space (, ). One day was wondering: Does have a density wrt , i.e. does
there exist an f 0 (density) such that
Z
(A) = f (x) d(x), A ?
A

If such a density exists, then (A) = 0 (A) = 0. More importantly, then


could plot with f to simply forget about the measure all-together, and
convince himself hes better than !
Definition 42. Call absolutely continuous wrt (write ) if (A) = 0
implies (A) = 0. Call , mutually singular (write ) if there exist
1 , 2 such that = 1 2 with (1 ) = 0 and (2 ) = 0. In other
words, lives on 2 and lives on 1 .
Now poor had no idea what was up to. Eventually, ran across the
Radon-Nikodym-Lebesgue decomposition theorem, who told him how to decom-
pose a measure with respect to himself.
thm:radon-nikodym Theorem 48. Let , be finite measures on (, ). Then
(i) Lebesgue decomposition theorem: There exists f 0 measurable and a
measure s such that
Z
(A) = f d + s (A), A .
A | {z }
| {z }

(ii) Radon-Nikodym theorem:


R if and only if there exists f 0 measur-
able such that (A) = A f d for all A .
Note that (ii) follows from (i), since s implies s lives on (supported on)
a set of -measure 0.6 The proof of (i) can be done using the Riesz representation
theorem. While we wont prove the Lebesgue decomposition theorem here, we
will state the Riesz representation theorem later in this section since its a pretty
good theorem.
The above theorem is called Lebesgue decomposition because of the special
case where is a finite Borel measure on R and = dx is the Lebesgue
measure. Then there is f 0 measurable so that

d(x) = f (x) dx + ds (x) .


| {z } | {z }
=:dac (x) dx
dx

6 Write A = A A with A = A and A = A , and ( ) = 0, ( ) = 0.


1 2 1 1 2 2 s 1 2
Then s (A) = s (A2 ), and so (A2 ) = 0 (A2 ) = 0 s (A2 ) = (A) = 0 by Lebesgue
decomposition.
178 CHAPTER 7. INTEGRATION

But wait! We can decompose ds (x) further. We have a weight function (x) :=
s ((, x]) which is increasing and right-continuous (Lebesgue-Stieltjes mea-
sure). So (x) has at most countably many discontinuities xk , k N, which we
split off: X
(x) = [xk ,+) (x) ((xk ) (x
k )) +sc (x),
kN
| {z }
=:ck >0

where sc stands for singular continuous. We get


X
ds = d(x) = ck xk + dsc ,
kN

where xk is the atom measure with atom at xk . In summary, we can decompose


into three mutually singular measures:
X
d(x) = dac (x) + ck xk (x) + dsc (x) .
| {z }
kN
| {z } singular
continuous
pure point component

The singular continuous part s has no atoms (points xk where ({xk }) > 0).
The Riemann-Stieltjes/Lebesgue-Stietjes
sec:cantor-function
measure for the Cantor function from
Section 63 is an example of a singular continuous measure.
Now onto the Riesz representation theorem.
Definition 43. A Hilbert space H is a complete inner product space. In other
words, its a vector space equipped with a complete inner product h, i satisfying
1. Sesquilinear form: Linear in the second term, conjugate linear in the first7 .
2. Positive definite: For x 6= 0, hx, xi > 0.
3. hx, yi = hy, xi.
The easy examples of Hilbert spaces are Cn . One important infinite dimen-
sional Hilbert space is
Z
L2 (R) = {f : R C measurable : |f |2 dx < }/equality a.e.,
R

which is equipped with the inner product


Z
hf, gi = f (x)g(x) dx.
R

Definition 44. The dual space H of a Hilbert space H is the set of all
continuous linear functionals on H , i.e.

H = { : H C| linear and continuous}.


7 This is the physics convention. The typical math convention is linear in the first and

conjugate linear in the second.


56. DUALITY 179

Example 14. Fix H . The map : H C that sends 7 h, i


is a continuous linear functional. (Use Cauchy-Schwarz for Hilbert spaces for
continuity.)

Question: Can we think of any continuous linear functionals that are not of
the above form for some H ?
Answer: No; every continuous linear functional looks like for some H .

Theorem 49 (Riesz representation). For all H , there exists a unique


H such that () = h, i; moreover, kkH = kkH . Hence H H
(self-duality).

Proof. WLOG assume H \{0} (statement trivial for = 0). Then ker (
H is a closed subspace of H (continuity of 8 ). Hence, decompose H =
ker (ker ) . We have (ker ) Range() = C, e.g. by the first isomorphism
theorem for vector spaces9 . Thus (ker ) = span{0 } for some 0 H \{0}.
(0 )
Take = k0 k2 0 :

If = 0 (ker ) , then
* +
(0 )
h, i = 0 , 0 = (0 ) = ((0 )) = ().
k0 k2

If ker , then h, i = 0 = () since (ker ) .

To show uniqueness: Suppose there exist p, e H such that () = h, i =


e i for all H . This happens iff h ,
h, e i = 0 for all H . For
e e 2
= , this implies k k = 0 = .e

56 Duality
From the Riesz representation theorem in the previous section, we know that
Hilbert spaces are self-dual, and in particular that L2 is self-dual. What about
the other Lp spaces? We only defined dual spaces for Hilbert spaces last time,
but theyre the same thing for normed vector spaces.

Definition 45. Given a normed C-vector space (V, k k), we can consider its
dual, V := { : V C linear and continuous}, endowed with the operator
norm kkV := sup |(x)|.
kxk1

Its not clear that the dual space is non-trivial in general (for infinite dimen-
sional non-Hilbert spaces). Linearity is no problem since we can just define on
8 continuity preimage of 0 is closed, linearity subspace
9 Alternatively,note that a single vector 0 (ker ) is enough, since any vector H
() ()
can be written = ( ( ) 0 ) + ( ) 0 ; this shows (ker ) is 1-dimensional.
0 0
180 CHAPTER 7. INTEGRATION

Figure 7.27: A vector space and its dual. TODO fix latex resolution

a basis, but the continuity requirement could cause some problems. However,
conveniently, for Lp spaces, we have it easy. First well spoil the surprise and
tell you what the dual space of Lp is, for 1 < p < : If we define the dual
exponent q via p1 + 1q = 1, then the dual space of Lp is Lq ! The definition of
the dual exponent should remind you of Holders inequality, kf gk1 kf kp kgkq
where p1 + 1q = 1. In fact, we can use Holders inequality to get some prototypical
examples of elements in the dual space of Lp . So for us, there are no worries
about the dual space being trivial. Let g Lq ; then the map g : Lp C
defined by Z
g (f ) := gf d (7.28) eqn:dual-linear-funct

is bounded by kf kp kgkq , so g (Lp ) and kg k kgkq . We actually get


kg k = kgkq by taking TODO
Now that we have an easy way to get a linear functional on Lp , we have to
eqn:dual-linear-functional
wonder, is this it? Are all linear functions of the form (7.28) for some g Lq ?
We already know the answer is yes for p = 2 (Riesz), but it turns out this holds
for all 1 < p < .

Theorem 50 (Duality). TODO

L1 and L are different.


We can quickly see the dual of .... is more than ...
For the Lp spaces, it was easy to verify (Lp ) 6= {0} because we had g
(Lp ) . In general, the Hahn-Banach theorem saves the day. Hahn-Banach is
one of the big theorems in functional analysis, and will imply that V ) {0},
even if dim V = .
57. LEBESGUES DIFFERENTATION THEOREM 181

Theorem 51 (Hahn-Banach (real version)). Let U V be a subspace, : U


R linear such that |(x)| p(x), x U , for some p : v R sublinear. Then
there exists L : V R linear such that L|U = and |L(x)| p(x) for all x V .
TODO Hahn-Banach complex version, write stuff
Proposition 38. Given 0 6= x0 V , there exists V such that (x0 ) =
6 0 and kkV = 1.
kx0 k =
Proof. Use the Hahn-Banach theorem for U = span{x0 }, p() = k k. For
y = x0 U , (y) := kx0 k by linearity. Then
|(x0 )| = |kx0 k| = ||kx0 k = kx0 k = p(x0 ).

This also implies V = (V ) is nontrivial, and:


Corollary 2. We can consider V V , in the sense of an isometric embedding
given by the evaluation map, x b0 : V R, 7 (x0 ) which is linear, and
x0 kV kx0 k. From the above proposition, kb
|(x0 )| kkkx0 k = kb x0 kV =
kx0 k.

57 Lebesgues differentation theorem


esgue-differentiation
Lets say youre British and play cricket, and you want to make your batting
average look as good as possible. In particular, lets say youre G.H. Hardy or
J.E. Littlewood and love cricket. One thing you could try to do is to calculate
your batting average over some small interval of time. For example, if you
played six games and averaged 80, 75, 50, 85, 90, and 70, then your overall
batting average average would be 75. But over the last two games, you have a
higher average, 80. And if you average over the last three games, your batting
average is 81.67. How much better can you make your score by choosing how
far back to go?
More generally, suppose we have an integrable function f L1 (Rn ). For a
fixed point x, we want to study the Hardy-Littlewood maximal function,
Z
1
(MHL f )(x) := sup |f (y)| dn y.
r>0 |Br (x)| Br (x)

We select a ball Br (x) of radius r > 0, average |f | over that ball, and then see
how large we can make that value. In terms of cricket batting averages, we fix a
point in time x, select how far back to go r, calculate the batting average over
than time interval, and then see how large we can make our batting average.
It turns out studying the Hardy-Littlewood maximal function will allow us to
generalize the fundamental theorem of calculus. Recall in 1D, for f continuous,
Z x Z
1 x+h
F (x) := f (t) dt is differentiable, and F (x) = f (x) = lim f (t) dt.
x0 h0 h x
(7.29) eqn:ftc-again
182 CHAPTER 7. INTEGRATION

Figure 7.28: Mathematicians v. The Rest of the World: Hardys team of math-
ematicians go to play cricket in 1926. [TODO check I think this is public domain
now?]

R x+h
The last expression limh0 h1 x f (t) dt is related to the averages we take in
the Hardy-Littlewood maximal function. eqn:ftc-again
What conditions do we need to impose on f to ensure that (7.29) holds?
Continuity is certainly enough, but we want it to hold for more functions. The
Lebesgue differentiation theorem generalizes this to hold for f L1loc (Rn ) (f is
integrable on all compact subsets of Rn ), although we relax pointwise conver-
gence to just a.e. pointwise convergence.
thm:lebesgue-differentiation Theorem 52. If f L1loc (Rn , dn x), then
Z
1 r0+
f (y) dn y f (x), a.e. x Rn .
|Br (x)| Br (x)
In R, this just becomes,
Z x+h
1 h0+
f (y) dy f (y), a.e. x R.
2h xh
You might complain, thats not exactly the derivative since were averaging over
an interval (x h, x + h) instead of (x, x + h) or (x h, x). Good news is, the
result still holds if you replace (x h, x + h) with (x, x + h) or (x h, x). More
generally, you can rudin
replace Br (x) with any sequence of nicely shrinking sets.
sec:rectangles
See Section 59 or [4, 7.9].
A useful and common method for proving pointwise convergence a.e. is via
maximal functions (like the Hardy-Littlewood maximal function) and maximal
inequalities.
Theorem 53 (magic of maximal functions). Let X be a Banach space. Suppose
we have a family of subadditive operators indexed by A, T : (X, k k)
57. LEBESGUES DIFFERENTATION THEOREM 183

meas. functions on (, , ) to C. Define (T x)() := supA |(T x)()|, a


maximal function.
[TODO define limit...]
Suppose
(i) There exists X0 dense in X such that for all x X0 , limA (T x)()
exists -a.e. .
(ii) For all x X and > 0:
 
kxk
((T x) > ) f , (7.30) eqn:maximal

eqn:maximal
for some function f with lim0+ f () = 0. The inequality (7.30) is called
a maximal inequality.
Then for all x X, we have limA (T x)() exists -a.e. .
[TODO proof outline]
od-maximal-inequality Theorem 54 (Hardy-Littlewood maximal inequality). Let f L1 (Rn , dx) and
> 0. Then
kf k1
|{(MHL f ) > }| 3n .

Proof. Recall that the Lebesgue measure on Rn is inner regular, i.e. for a Borel
set A, we have |A| = sup |K|. So we can be lazy and only consider
compact KA
compact K {(MHL f ) > }.
We want to show that |K| 3n kfk1 . From the definition of MHL , note that
for all x K, there exists rx > 0 such that
Z
1
< |f (y)| dn y.
|Brx (x)| Brx
SN
By compactness, take xj K, 1 j N , such that K j=1 Brxj (xj ). Now
if somehow it magically happened that {Brxj }1jN was pairwise disjoint, then:

N
X N Z pairwise Z
1X disjoint 1 1
|K| |Brxj (xj )| |f (y)| dy == |f (y)| dy kf k1 .
j=1
j=1 Br x N
j=1 Brx (xj )

j j

But of course theyre probably not going to be pairwise disjoint, so were going
to need the following lemma.
Lemma 7 (Vitali covering lemma). Given a finite family of open balls {Brj }1jN
in a metric space, then there is a nonempty S {1, . . . , N } such that
Brj Brl = , if l 6= j S
SN S
j=1 Brj jS B3rj
184 CHAPTER 7. INTEGRATION

This scaling by 3 gives rise to the factor 3n in the HL-maximal inequality.


The key idea of the proof is the geometry of balls. Order the balls ac-
cording to descending radius, so r1 r2 . All the fig:vitali-covering
balls intersecting the
largest ball Br1 are contained within B3r1 (x1 ). (Figure 7.29.) Discard these,
and repeat with the next largest ball until were done.

rl
xl

r1
rj
3r1 x1 xj

fig:vitali-covering Figure 7.29: Balls intersecting Br1 (x1 ) are contained in B3r1 (x1 ).

S
So we can get K jS B3rj (xj ). Thus

X X Z
n 3n X
|K| |B3rxj (xj )| = 3 |Brxj (xj )| |f (y)| dy
Brx (xj )
jS jS jS j

{Brx (xj ) }jS Z


j
n
pairwise disjiont 3 3n
== |f (y)| dy kf k1 .
jS Brx (xj )
j

sec:lebesgue-density
Finally, as promised in Section 38, we can now prove Lebesgues density
theorem about metric density.

Corollary 3 (Lebesgues density theorem). Let E Rn be a measurable set,


and suppose (E) > 0. Then
(
(E B (x)) 1, a.e. x E
(E, x) := lim = . (7.31)
0 (B (x) 0, a.e. x 6 E
thm:lebesgue-differentiation
Proof. Its just a special case of Theorem 52, when f = E .
rudin
Simon
References: [4], [5, Book 3]
58. CONVOLUTED CONVOLUTIONS 185

58 Convoluted convolutions
sec:convolutions
Convolution is a kind of averaging or smearing operation. If we start with
f L1 which is not very nice at x, we can use convolution with a nice function
to smear out or average f around x.
For f, g L1 (Rn ), their convolution f g is
Z
(f g)(x) := f (y)g(x y) dn y. (7.32)
Rn

Often, we have f L1 , and g a nice smooth function with a large peak at the
origin, like Br (0) except smooth. Then g(x y) acts like a smooth cut-off or
window function for f , and (f g)(x) is a sort of average of f around x.
For example, we can take g to be a family of -function-wanna-bes. More
formally, well call them approximate or nascent -functions. Given some
0 with support in [1, 1] and integral kk1 = 1, we can make a family of
approximate -functions by setting

n (x) := n (nx).
 
Then supp n 1 1
n , n and still kn k1 = 1. As n , f n (x) averages f
on smaller and smaller regions around x. As we average on smaller regions, we
might hope that f n (x) f (x), so that the sequence of approximate delta
functions behaves like actual Dirac delta.
[TODO remark about f n being nice; Dirac delta.]

1 1

Figure 7.30: Approximate -functions. They look up to Dirac delta, both literally
(at the origin) and figuratively.

thm:convolutions Theorem 55. Let n be a family of approximate -functions.


(i) If f C(Rn ), then f n f uniformly on compact subsets of U .
a.e.
(i) If f L1 (Rn ), then f n f as n .
186 CHAPTER 7. INTEGRATION

[TODO is this the theorem we want?]


[TODO check does (ii) just follow from Lebesgue differentiation in like 2
lines?]
[TODO stick with either R or Rn ]
[TODO remark about working for U Rn ]
The second statement is about pointwise a.e. convergence, so were going to
use maximal inequalities.
R First, the dense set comes from (i): For f C(Rn ),
note f (x) = f (x) n (x y) dn y, so
Z
|f n (x) f (x)| |f (y) f (x)| |n (x y)| dy kn k1 ,
| {z } | {z }
<, if n suff. large =1

where we get |f (y)f (x)| < by only considering integration where n (xy) 6=
0 and using uniform continuity of f on compact subsets of Rn .
Next, we need the maximal inequality so we can apply the magic of maximal
inequalities. This gets a little complicated depending on what looks like, but
if we assume an easy case then were good to go:
lem:symmetric Lemma 8. If L1 is symmetric decreasing and f L1 (R), then |(f
)(x)| (MHL f )(x) kk1 .
thm:convolutions
Proof (of Theorem 55). If is symmetric and decreasing, then by the lemma,
we have |f n | kn k1 (MHL f ). Thus, defining the maximal operator
| {z }
=1
(T f )(x) := supnN |f n (x)|, we obtain
kf k1
|{(T f ) > }| |{MHL f > }| C . (7.33)

If is not symmetric and decreasing, then we can form its decreasing sym-
metrization, e
(x) := sup|y||x| (y), which is decreasing and symmetric (Fig-
fig:decreasing-sym
ure 7.31).

Figure 7.31: The decreasing symmetrization is shown in light blue. For 1D, sym-
fig:decreasing-sym metric just means even.

e
Since ,
lemma
fn
|f n | |f | n |f | (MHL f )kf
n k1 .
58. CONVOLUTED CONVOLUTIONS 187

e
We still have (x) fn (nx), so kf
=n e 1 . So
n k1 = kk

 kf k1

|{T f > }| MHL f > C e 1.
kk
e 1
kk

The last thing we need to do is prove the lemma.

lem:symmetric
Proof (offig:layer-cake
Lemma 8). We need a fun fact called the layer cake representation
(Figure 7.32): For measurable g 0,
Z
g(x) = {g>} (x) d. (7.34)
0

The proof is quick: Note {x:g(x)>} (x) = [0,g(x)) (), so

Z g(x) Z Z
g(x) = 1 d = [0,g(x)) () d = {g>} (x) dt. (7.35)
0 R R

g
1

Figure 7.32: Layer cake representation. {g>} (x) = 1 for in the green zone, and
then {g>} (x) = 0 for in the red zone. The integral along the green zone just gives
fig:layer-cake us the height g(x).

Now let 0 L1 be symmetric and 1


R decreasing and f L , and remem-
ber we want to bound |(f )(x)| =
f (y)(x y) dy . By the layer cake
representation,
Z Z
(x y) = {(xy)>} = Br d, (7.36) eqn:layer-cake-phi
(x)
0 0

where we used a key fact: since is symmetric and decreasing, we must have
188 CHAPTER 7. INTEGRATION

eqn:layer-cake-phi
{ > } = Br (0) for some r > 0. Then using (7.36),
Z Z Z !

Fubini
|f (y)|(x y) dy == d |f (y)| dy
0 Br (x)
Z Z !

1
= d f (y) dy |Br (x)|
0 |Br (x)| Br (x)
| {z }
(MHL f )(x)
Z
(MHL f )(x) |Br (x)| d = (MHL f )(x) kk1 .
0 | {z }
=|Br (0)|
=|{>|}

59 Rectangles are not so friendly


sec:rectangles thm:lebesgue-differentiation
In the Lebesgue differentiation theorem (Theorem 52), we took averages over
balls in Rn , and we used the Vitali covering lemma to handle the geometry of
these balls. But there are other friendly shapes in Rn that want to be included
too.
Definition 46. A sequence {Ej } of Borel sets in Rn shrinks nicely to x Rn
if there exists > 0 and a sequence of balls Brj (x) Ej with rj 0 such that

m(Ej ) m(Brj ), n N.

We require Ej takes up a decent amount of space in some ball centered at


fig:nicely-shrinking
x. (Figure 7.33.)

E4 Br4 (x)

x x x x x
E5 Br5 (x)
E3 Br3 (x)
E2 Br2 (x)

E1 Br1 (x)

Figure 7.33: Nicely shrinking sets with say = 1/4. Each set Ej in orange takes up
fig:nicely-shrinking at least of the area of some ball Brj (x) containing it.

So shapes like half-intervals in R, squares in R2 , or half-balls in Rn all work.


However, rectangles in R2 do not work. As a specific example, the sequence of
59. RECTANGLES ARE NOT SO FRIENDLY 189

rectangles with edge lengths 1j and j12 is not a nicely shrinking set. As far as
the Lebsgue differentiation theorem cares, nicely shrinking sets are just as fine
to work with as balls:
Theorem 56. Let x Rd , and suppose {Ej (x)} shrinks nicely to x. If f
L1loc (Rn ), then
Z
1 n
f (y) dn y f (x), a.e. x Rn .
m(Ej (x)) Ej (x)
Proof.
Z Z
(x) n 1
|f (y) f (x)| d y |f (y) f (x)| dn y 0 a.e.
m(Ej (x)) Ej (x) m(Brj (x)) Br j

Now the rectangles in R2 were starting to feel left out and became sad. They
wondered why they werent included. Was this just an oversight? Or was it an
actual problem? Could there exist an f L1 (R2 ) for which the conclusion of
the Lebesgue differentiation theorem failed for families of rectangles? How bad
could it be?

Figure 7.34: The rectangles are sad.

They soon found out, and it made them cry.


Proposition 39. Let R := {(a, a) (b, b) : a, b > 0}. There exists f
L1 (R2 ) such that
Z
1
lim f (y x) dy = + a.e. x R2 .
diam(R)0 |R| R
RR

(We just did a small change of variables to get the x y, since our rectangles
R are centered at the origin rather than at x.)
TODO proof idea, Borel-Cantelli idearudin Stein-osc
References: Nicely shrinking sets: [4]; rectangles: [6]
190 CHAPTER 7. INTEGRATION

60 Weak and strong operators and unhappy ro-


tated rectangles
TODO picture weak, strong
To make the rectangles a little less sad, some functions in higher Lp (R2 )
spaces decided to try to comfort them. First though, they had to introduce the
rectangles to some special types of operators.

Definition 47. Let (X, ) and (Y, ) be measure spaces, and let T : Lp (X, )
{f : Y C meas.}. We say T is of strong type (p, q) if it is bounded from
Lp (X, ) to Lq (Y, ). We say T is of weak type (p, q), q < , if it satisfies the
weak-type inequality
 q
Ckf kp
({y Y : |T f (y)| > }) , some C > 0.

For q = , T is of weak type (p, ) iff it is of strong type (p, ).

Remarks.

1. For (X, ) = (Y, ) and T the identity, the weak (p, p) inequality is the
Chebyshev-Markov inequality,

kf kpp
(|f | > ) .
p

2. Strong type implies weak type, since


Z Z  q
T f (x) q kT f kqq Ckf kp
({|T f | > }) = d
d q
.
{|T f |>} {|T f |>}

Recall the Hardy-Littlewood maximal operator MHL was defined by


Z
1
(MHL f )(x) = sup |f (y)| dy.
r>0 |Br (x)| Br (x)

This isthm:hardy-littlewood-maximal-inequality
a weak type (1, 1) operator (Hardy-Littlewood maximal inequality, The-
orem 54). It is also strong-type (, ) since kMHL f k kf k . What about
1 < p < ? It turns out we can use the Marcinkiewicz interpolation theorem
to show us that MHL is strong type (p, p) for all 1 < p .

Theorem
( 57 (Marcinkiewicz interpolation). Let 1 p1 < p2 . Let T :
L (X, d) Lpw1 (Y, d)
p1
be subadditive and satisfy kT f kpi ,w Ci kf kpi for
Lp2 (X, d) Lpw2 (Y, d)
i = 1, 2. Then T extends as T : Lp (X, d) Lp (Y, d) with kT f kp Cp kf kp
for p1 < p < p2 .
60. WEAK AND STRONG OPERATORS AND UNHAPPY ROTATED RECTANGLES191

The main tool in the proof is to decompose

f = f |> + f |f |
|{z} | |f
{z } | {z }
Lp Lp1 Lp2

and use Fubini-Tonelli and distribution functions.


Now since MHL is weak type (1, 1) and (, ), the Marcinkiewicz interpo-
lation theorem implies that MHL is strong type (p, p), for every p (1, ). The
endpoints in general cannot be included. For example, MHL is not strong type
(1, 1). For every 0 6 f L1 (Rn ), one can show there is some constant c(f ) > 0
such that (MHL f )(x) c|x|n for |x| sufficiently large.
By now, the rectangles were getting a little impatient. So the Lp functions
decided to speed things up a bit. Okay, define the rectangle maximal operator
Z
1
Mr f (x) = sup |f (x y)| dy
RR |R| yR

for f : R2 C. Were going to show that Mr maps Lp (R2 ) boundedly to itself,


i.e. kMr f kp Cp kf kp , for all 1 < p , which means its strong type (p, p).
You guys know from the Marcinkiewicz interpolation theorem that the Hardy-
Littlewood maximal function (with intervals) is a bounded operator on Lp (R)
for 1 < p . So what were going to do is to use the 1D result for intervals
to get the 2D result for rectangles. Define
Z
(1) 1
M f (x1 , x2 ) := sup |f (x1 + t, x2 )| dt
2
Z
1
M (2) f (x1 , x2 ) := sup |f (x1 , x2 + s)| ds.
2

Then just note that


Z bZ a Z b
1 1
|f (x1 +t, x2 +s)| dt dx M (1) f (x1 , x2 +s) ds M (2) M (1) f (x1 , x2 ),
(2a)(2b) b a 2b b

so that kMr kp cp kM (1) f kp Cp kf kp by the result for 1D (intervals) Hardy-


Littlewood.
Yay! celebrated the rectangles. L1 (R2 ) is just messed up, but everyone
else likes us! And the rectangles lived happily ever after near Lp land, p > 1.
Now that things were settled for the normal vertical/horizontal rectangles,
all the rotated rectangles showed up and wanted to know things. However, they
were sorely disappointed and did not get to live happily ever after.
Proposition 40. Let R e be the set of all rectangles centered at the origin with
any orientation. For each 1 p < , there exists f Lp (R2 ) such that
Z
1
lim f (y x) dy = + a.e. x R2 .
diam(R)0 |R| R
e
RR
192 CHAPTER 7. INTEGRATION

/
/ /

Figure 7.35: The rotated rectangles are even more sad.

Basically, vertical/horizontal rectangles break things in L1 but not in higher


p
L spaces, while rotated rectangles just break everything. (Except, of course,
L , where the maximal function is still a strong type (, ) operator.)Stein-osc
The proof that rotated rectangles break everything can be found in [6]. The
key thing to prove is a continuity principle that basically tells us that if the
maximal operator M fails to satisfy the weak-type (p, p) inequality, then there
is some f Lp with (M f )(x) = on a set of positive measure. [TODO
check: get a.e. in R2- translate?] With the rotated rectangles, the fact that the
maximal operator is not of weak type (p, p) for 1 p < comes from a lemma
about Besicovitch sets (which is actually a pretty lengthy construction):
Lemma 9 (Besicovitch sets). Given any > 0, there exists an integer N = N ,
and 2N rectangles R1 , . . . , R2N , each having side lengths 1 and 2N , so that
S N
2
(i) j=1 Rj < . This (small) union of rectangles is called a Besicovitch
set.
(ii) The reaches R ej (translates of Rj by two units in the positive direction
fig:rectangles-reach
along the longer side of Rj ; Figure 7.36) are mutually disjoint so that
S2N e
j=1Rj = 1.

Remark 13. TODO Besicovitch sets and the Kakeya needle problem, ref to
other section
rudin
Stein-osc
References: [4], [6]
61. THE FTC FOR GROWN-UPS (ABSOLUTE CONTINUITY) 193

e1
R

R1
R2 e2
R

R3

e3
R

ej of Rj .
Figure 7.36: The reaches R
fig:rectangles-reach

61 The FTC for grown-ups (absolute continu-


ity)
c:absolute-continuity
The Fundamental Theorem of Calculus says (under some assumptions on f )
that Z x
f (x) = f (a) + f (t) dt.
a
R
sec:volterra
We saw in Section ?? that if is a Riemann integral, then it does not sufficeR
to assume that f is differentiable. But lets be generous and say that is a
Lebesgue integral. Still, since f appears in the conclusion of the FTC, we might
think we need to require that f is differentiable. However, observe that f only
appears inside an integral in the conclusion of the FTC, so it might be ok for
f to not exist in a few places!
Heres a hint about the appropriate assumption in the FTC. Think of f as
playing the role of g:

prop:integral-epsilon Proposition 41. Suppose g L1 (R). Then for every > 0, there exists a
> 0 such that Z
m(E) = g(x) dx .
E

The proof is just some basic measure theory. Basically, the statement is easy
for bounded functions, and continuity of measures along with g L1 (R) ensuresprop:integral-epsilon
that the measure of {g > N } cant beR too large. The point is, Proposition 41
x
smells an awful lot like continuity of a g dt. Its a really strong kind of con-
Ry
tinuity, stronger than uniform. Instead of just caring that x g dt is small for
194 CHAPTER 7. INTEGRATION

(x, y) a small interval, we care about any measurable


Rx set E with small measure.
Since we want f to satisfy f (x) f (a) = a g(t) dt for some g, it looks like the
appropriate assumption in the FTC shouldnt be differentiability, but rather a
strong sort of continuity.
In fact, using what is called absolute continuity, we can characterize all func-
tions f for which the FTC holds without talking about integration or measurable
sets. We just need a slightly stronger version of continuity.

Definition 48. A function f : [a, b] C is absolutely continuous (write f


AC([a, b])) if for every > 0, there exists a > 0 with the following property:
For any finite collection ofPclosed intervals {[aj , bj ]}jJ
P with mutually disjoint
interiors and total length jJ (bj aj ) , we have jJ |f (bj ) f (aj )| .

Absolute continuity is like uniform continuity, except stronger because we


allow any finite collection of intervals and then sum over the differences on all
fig:abs-cont
those intervals. (Figure 7.37.)

fig:abs-cont Figure 7.37: Pick whichever (non-overlapping) intervals you want!

Interestingly enough, it turns out any absolutely continuous function f :


[a, b] R is differentiable almost everywhere (!), and
Z x
f (x) = f (a) + f (t) dt, x [a, b]. (7.37)
a

Moreover,
Z x
AC([a, b]) = {f : [a, b] C : f (x) = f (a) + g(t) dt, some integrable g},
a
(7.38)
so absolutely continuous functions are exactly the ones for which the fundamen-
tal theorem of calculus works. Well state this in a theorem.
61. THE FTC FOR GROWN-UPS (ABSOLUTE CONTINUITY) 195

thm:abs-cont Theorem 58 (The Fundamental Theorem of Calculus for Grown-ups). A func-


tion f : [a, b] C is is absolutely continuous if and only if it is of the form
Z x
f (x) = f (a) + g(t) dt (7.39)
a

for some integrable g. Moreover, in this case, f is differentiable Lebesgue-a.e.


with f (x) = g(x).
The proof typically uses the Lebesgue differentiation theorem and Radon-
Nikodym-Lebesgue decomposition of measures; well give the proof at the end
of this section. Before that, some questions and examples:
Question: Are absolutely continuous functions necessarily differentiable ev-
erywhere?
Answer: No, f (x) = |x| is absolutely continuous but not differentiable at 0.
Notice that we can write Z x
|x| = g(t) dt,
0

where g(t) is 1 for t < 0 and 1 for t > 0.


Question: Are differentiable functions necessarily absolutely continuous?
Answer: No, f (x) = x2 sin(x4 ) is differentiable but not absolutely con-
tinuous on [0, 1]. This can be shown by looking at points an = (n)1/4 and
bn = ((n + 1/2))1/2 . And indeed, f x3 which is not integrable near zero.
We do however have the inclusions

C 1 Lipschitz absolutely continuous bounded variation differentiable a.e.


thm:abs-cont
(7.40)
Before proving Theorem 58, we need to define functions of bounded variation.
Definition 49. For a partition P = {a = x0 , . . . , xn = b}, the variation of f is
n
X
V (f, P ) := |f (xj ) f (xj1 )|. (7.41)
j=1

The total variation is

Vab (f ) := sup V (f, P ). (7.42)


partitions P
of [a,b]

A function f : [a, b] C is of bounded variation (write f BV([a, b])) if its


total variation is finite.

TODO tikz picture

fig:bv Figure 7.38

Functions of bounded variation have a Jordan decomposition:


196 CHAPTER 7. INTEGRATION

thm:bv-jordan Theorem 59 (Jordan). If f BV([a, b]), then f can be decomposed as


1 x
f (x) = f+ (x) f (x), f (x) = (V (f ) f (x)), (7.43)
2 a
where f are increasing.
Proof. For x y, use

f (y) f (x) |f (y) f (x)| Vxy (f ) = Vay (f ) Vax (f ) (7.44)

to show f+ is increasing.
thm:abs-cont
Proof (of Theorem 58).
1. AC([a, b]) C([a, b]) BV([a, b]). For (uniform) continuity, just take a
single interval. For bounded variation, take corresponding to = 1 in
the definition of ac. Then the total variation on an interval of length
is 1, and we can break up [a, b] into ba
such intervals, so the total
variation is ba
.
2. For f AC([a, b]) increasing, its associated Lebesgue-Stieltjes measure is
ac w.r.t. to the Lebesgue measure: The associated LS-measure is

X [
LS (E) = inf f (bj ) f (aj ) : E (aj , bj ) . (7.45)

jN jN

Let E be compact with Lebesgue measure (E) = 0. Since E has Lebesgue


SN
measure zero, there is a finite covering S := j=1 (aj , bj ) E with (S) <
. Break up S into disjoint intervals (aj , bj ) and use the definite of absolute
PM
continuity to get j=1 |f (bj ) f (aj )| < , so LS (E) = 0.
If E is not compact, use the (inner) regularity of LS , which follows from
nearly the same proof as for the Lebesgue measure.10 Thus we may choose
compact K E with LS (K) + > LS (E), and apply the result for
compact K to conclude LS (E) = 0.
3. We claim all f AC([a, b]) are differentiable a.e. and satisfy
Z x
f (x) = f (a) + f (t) dt, x [a, b]. (7.46)
a

If f is not increasing, decompose f = p n where R xp and n are increasing


(Jordan decomposition for BV); then if p(x) = a p d a.e and n(x) =
Rx Rx Rx
a
n d a.e., then f = a (p n ) d = a f d a.e. So now assume
R f is
increasing, so by Radon-Nikodym, f (x) f (a) = LS ([a, x]) = [a,x] h d
10 We also have outer regularity; use the continuity of f to show that for any interval

Ik with interior (a, b), there exists an open interval Ik = (a , b + ) Ik such that
f (b + ) f (a ) f (b) f (a) + 2k .
61. THE FTC FOR GROWN-UPS (ABSOLUTE CONTINUITY) 197

for some measurable h 0. By the Lebesgue differentiation theorem (for


one-sided maximal functions),
Z x+r
f (x + r) f (x) 1
f+ (x) = lim+ = lim+ h(x) d h(x), Lebesgue-a.e.,
r0 r r0 r x


and the same thing works for f (x) with
R x [x r, x]. Thus, f (x) = h(x)
exists Lebesuge-a.e., and f (x) = f (a) + a f (t) dt.
prop:integral-epsilon
4. Finally, by Proposition 41, we get the inclusion in
Z x
AC([a, b]) = {f : [a, b] R : f (x) = f (a)+ g(t) dt, some g : [a, b] R}.
a
(7.47)

teschl-fa
References: [7]
198 CHAPTER 7. INTEGRATION
refsection:9

Chapter 8

Episode V: Differentiation
strikes back

62 Monotonic functions are differentiable a.e.


sec:monotone-discontinuities
Recall from Section 12 that monotonic functions have countably many discon-
tinuities, and thus are continuous a.e. But what about differentiability? How
often must a monotonic function be differentiable? It turns out, pretty often:

thm:monotone-diff Theorem 60. Let f : [a, b] R be increasing. Then f is differentiable a.e.


and Z b
f (x) dx f (b) f (a).
a

The reverse inequality holds for f decreasing.

Note this almost looks like the fundamental theorem of calculus, except we
have an inequality. We can easily see the inequality is necessray in general: If f
then the integral of f notice, but fsec:cantor-function
has a jump discontinuity,fig:monotone-jump (b)f (a) certainly
will be affected (Figure ??). As we will see later (Section 63), discontinuities
arent the only case where the inequality is necessary. We can have f continuous
and yet require the strict inequality, basically because f manages to do all of
its increasing on a set of Lebesgue measure zero.
The result that f is differentiable a.e. reminds us of the same result for
sec:absolute-continuity
absolutely continuous
thm:monotone-diff
functions from Section 61. It turns out the proof of The-
thm:abs-cont
orem 60 is quite similar to the proof of Theorem 58 (FTC for Grown-ups);
well use Lebesgue-Stieltjes measures, the Lebesgue differentiation theorem, and
Radon-Nikodym-Lebesgue, just like before. The difference is that well need to
deal with differentiating a measure that is singular with respect to the Lebesgue
measure. With absolutely continuous functions, we only had to deal with a
Lebesgue-Stieltjes measure that was absolutely continuous with respect to the
Lebesgue measure.

199
200 CHAPTER 8. EPISODE V: DIFFERENTIATION STRIKES BACK

f (b)

f has no idea

f (a)
a b

fig:montone-jump Figure 8.1: If f has a jump discontinuity, then f has no idea how big the jump is.

Proof. Were going to be assuming that f is increasing (just take f for decreas-
ing). First, if f is absolutely continuous
abs-cont
and increasing, then we do the same
thing as in the proof of Theorem ??: LS ([a, x]) := f (x) f (a) is absolutely
continuous with respect to the Lebesgue measure dx, so by Radon-Nikodym,
Z
f (x) f (a) = h dx,
[a,x]

for some measurable h 0. By the Lebesgue differentiation theorem (for one-


sided maximal functions), then
Z
f (x + r) f (x) 1 x+r
f+ (x) = lim+ = lim+ h(x) dx h(x) Lebesgue-a.e.
r0 r r0 r x


Since the same thing works for f with [x r, x], theres no problems here.
Now if we just have an increasing function f , decompose the associated
Lebesgue-Stieltjes measure1 LS ([a, x]) := f (x) f (a) via the Radon-Nikodym-
thm:radon-nikodym
Lebesgue decomposition theorem (Theorem 48): LS = ac + sing , so
Z
f (x) f (a) = ac ([a, x]) + sing ([a, x]) = g dx + sing ([a, x]) (8.1) eqn:decomp-sing
[a,x]

for some measurable g 0. Rx


Lebesgue differentiation will take care of a g dx, so now we just need to
figure out how to differentiate sing . We claim it has derivative zero.
lem:singular-diff Lemma 10. Define the upper derivative of a Borel measure at x Rn by
(B (x))
(D)(x) := lim sup . (8.2)
0 |B (x)|

Then (D) is measurable and


|{x A : (D)(x) > 0}| = 0 whenever (A) = 0. (8.3)
1 We actually need to be a little more careful with defining this Lebesgue-Stieltjes measure
sec:cantor-function
since we arent dealing with continuous functions; see the end of Section 63.
63. THE CANTOR FUNCTION AND LEBESGUE-STIELTJES MEASURES201

In particular, if is singular wrt Lebesgue measure dx, then

(B (x))
(D)(x) := lim = 0, a.e. x Rn .
0 |B (x)|

Proof. TODO
Now suppose is singular wrt dx. Take to be a support for sing with
|| = 0, and then apply the first part to A = R \ . Thus we can remove the
sup in lim sup to get (D)(x) := lim0 (B (x))
|B (x)| = 0 for a.e. x R.

We can adapt this for nicely shrinking sets as we did in the Lebesgue differ-
entiation theorem [TODO check this] to get

sing ([x, x + r])


lim = 0 Lebesgue-a.e., (8.4)
r0 r

and the same result for [x r, x].


eqn:decomp-sing
Then in total, we get from (8.1),

LS ([x, x + h])
f (x) = lim = g(x) Lebesgue-a.e.
h0 h
So f is differentiable Lebesgue-a.e., and
Z Z b
f (b) f (a) = LS ([a, b]) ac ([a, b]) = g(x) dx = f dx.
[a,b] a

thm:monotone-diff
A strict inequality in Theorem 60 occurs when we have nonzero contribution
from sing . The measure sing includes jumps (pure point), but also more exotic
measures called singular continuous (i.e. singular but not pure point). Well
sec:cantor-function
see examples of these in Section 63, and theyll give us examples of monotone,
thm:monotone-diff
continuous f with a strict inequality in Theorem 60.thm:montone-diff
As a final application, we can extend Theorem ??sec:abs-cont
to functions of bounded
variation (recall these were just defined in Sectionthm:bv-jordan
??). Since functions f
BV([a, b]) have a Jordan decomposition (Theorem 59) f = f+ f where f
are increasing, we get:

Corollary 4. Let f : [a, b] R be of bounded variation. Then f is differentiable


a.e.

63 The Cantor function and Lebesgue-Stieltjes


measures
sec:cantor-function
sec:cantor-set
Using the Cantor set (Section 6), we can construct the Cantor function, which
is a continuous function on [0, 1] which is constant almost everywhere but still
202 CHAPTER 8. EPISODE V: DIFFERENTIATION STRIKES BACK

manages to creep up from 0 to 1. Its also called the devils staircase, even
though its not particularly evil. We could have defined this function way earlier
in this book, but we waited until now because we want to make a connection to
Lebesgue-Stieltjes measures.
More specifically, the Cantor function F : [0, 1] [0, 1] is continuous and
surjective, and it has derivative 0 everywhere except on the Cantor set. That
is, it has derivative 0 on all the gaps in the Cantor set, and manages to creep
from 0 up to 1 by moving only on the Cantor set! The function F is defined
iteratively, by making Fn constant on the gaps In (the gaps in the nth step of
constructing the fig:cantor-function
Cantor set) and linear in-between. The first few iterations are
shown in Figure 8.2.

1 1 1
3 3
4 4
1 1 1
2 2 2
1 1
4 4

1 2 1 1 2 1 2 7 8 1 1 2 1 2 7 8 1
3 3 9 9 3 3 9 9 9 9 3 3 9 9

Figure 8.2: Constructing the Cantor function, also known as the devils staircase.
fig:cantor-function On the jth gap in In , Fn (x) = 2jn . Outside the gaps, we connect linearly.

Then we define F (x) := limn Fn (x). Well actually prove F has the de-
sired properties using a different construction of the Cantor function via decimal
expansions. First, we have
X ak
In = {x [0, 1] : x = with ak {0, 1, 2} and aj {0, 2} for 1 j n}.
3
kN

(Any x with some ak = 1 comes from a removed middle third.) Take some time
to convince yourself of this characterization.
So each x is uniquely represented by a sequence of zeros and twos. We
just write x in base 3 without any 1s (there are some ambiguities and details to
work out here, but well skip them; they come from non-uniqueness of decimal
expansions, like 0.1 (base 3) = 0.02222 (base 3)). Then define f : [0, 1]
via
ak f X  ak  1

X
x=
7 .
3k 2 2k
k=1 k=1

All were doing here is taking the base 3 expansion of x, replacing 2s with 1s,
and pretending the expansion is in base 2. For example, 23 = 0.2 (base 3) 7
0.1 (base 2) = 12 .
63. THE CANTOR FUNCTION AND LEBESGUE-STIELTJES MEASURES203

This function f is increasing on , and we want to extend it to [0, 1]. We


define the Cantor function F by setting

F (x) := sup{f (y) : y x}.

Informally, this means to fill in the gaps (removed middle thirds), we just find
the largest value L to the left of the gap, and then make the function constant,
equal to L, along this gap. We have F (x) = f (x) if x and F (x) = 2kn if x
is in the kth gap of In (1 k 2n1 ). Note F (and f ) are surjective onto [0, 1]
(base 2 expansions). Since F is monotone, this means it must be continuous
(the only discontinuities allowed are jump discontinuities, but then it wouldnt
be surjective).
sec:radon-nikodym
Remark 14 (Lebesgue-Stieltjes meaures). Recall from Section 55 that we can
decompose a measure into three mutually singular measures,
X
d(x) = dac (x) + ck xk (x) + dsc (x) ,
| {z }
kN
| {z } singular
continuous
pure point component

where the singular continuous part sc has no atoms (points xk where ({xk }) >
0). The Lebesgue-Stieltjes measure for the Cantor function,

([a, x]) := F (x) F (a),

is an example of a singular continuous measure; there are no atoms and it is


singular wrt Lebesgue measure, yet somehow still manages to satisfy ([0, 1]) =
1.
Next, lets one-up the Cantor function:
Proposition 42. There exists a surjective continuous function f : [0, 1] [0, 1]
which is strictly increasing and which has derivative 0 almost everywhere.
sec:monotone-discontinuities
Proof. We saw this function before, in Section 12. Heres a different definition,
in measure-theoretic terms. Let B be the collection of Borel subsets of [0, 1].
Define a measure : B [0, ] by putting an atom at the nth element of
Q [0, 1] of weight 2n . Define
Z x
f (x) := ([0, x]) = d.
0

The idea is that we put bumps at all the rationals, and then we add up all the
bumps we run over on the way from 0 to x. The measure ([0, x]) is a singular
lem:singular-diff
measure (wrt the Lebesgue measure dx), so Theorem 10 implies

df d([0, x])
(x) = = 0 Lebesgue-a.e.
dx dx
204 CHAPTER 8. EPISODE V: DIFFERENTIATION STRIKES BACK

So far, weve been pretty sloppy about defining these Lebesgue-Stieltjes


measures. Weve only actually ever looked at them on intervals, by setting
([a, x]) = f (x) f (a) for some increasing function f : R R. Is this actually
well-defined? The first thing is that instead of f (x) f (a), we actually want


f (x+) f (a+), I = (a, x]

f (x+) f (a), I = [a, x]
(I) = , (8.5) eqn:measure-intervals

f (x) f (a+), I = (a, x)


f (x) f (a), I = [a, x)

to handle f not necessarily continuous. This way, things make sense and we have
({a}) = f (a+) f (a), the size of the jump. Note also that the actual value
of f at a discontinuity is irrelevant. A useful theorem in measure theory (which
also gives us a way to define Lebesgue measure) ensures these Lebesgue-Stieltjes
measures are well-defined:

Theorem 61 (Caratheodory extension). For every increasing function f : R


eqn:measure-intervals
R, there exists a unique Borel measure which extends (8.5). Two functions
give rise to the same measure if and only if they differ by a constant away from
the discontinuities.

Remark 15. Caratheodorys extension theorem is actually more general than


the result stated here; the general statement is that a pre-measure can be ex-
teschl-fa
tended to a measure. For a proof, see [1].

64 Discontinuity sets of derivatives


sec:discontinuity-sets
TODO write. See http://math.stackexchange.com/questions/112067/how-discontinuous-can-a-
and http://math.stackexchange.com/questions/292275/discontinuous-derivative

65 Fabius function
TODO write. [combine with Taylor series section? or combine with cantor func-
tion section] http://mathoverflow.net/questions/43462/existence-of-a-smooth-function-with-

66 Pompeiu derivative
TODO write. [TODO combine with discontinuity sets of derivatives? -lhs]

67 Functional derivatives
This section is going to be a little unconventional, if that phrase even means
anything for this book. Our starting place is with objects called functionals.
67. FUNCTIONAL DERIVATIVES 205

Definition 50. A functional is a map from functions to real (or complex)


numbers.
For example, the dual space of the function space Lp (Rn ) is all the continuous
linear functionals Lp (Rn ) C. Funcitonals seem complicated, so lets think
about something simpler.
Definition 51. A discrete functional is a map from real-valued vectors in Rn
to real numbers.
As an example, X
f (

x)= x2i ,
i

where xi are the components of the vector x Rn , is a discrete functional.


Holding all but one of these components fixed, we get a function of a single
variable from R to R. In these terms, we can talk about the continuity and
differentiability of f in any of its variables. Our example above though is a bit
boring: its differentiable and continuous in every variable. On the other hand,
we could have chosen a messier case, such as
X
g(
x ) = x1
1 + x2i .
i>1

This function is continuous and differentiable in all but the first variable. Of
course if we restrict our domain to not allow the first variable to be zero, we
once more get a continuous and differentiable object, but wheres the fun in
that?
Now that we have a handle on discrete functionals, lets move on to the
continuum version. For example,
Z 1
F (h) = h(x)2 dx
0

is a functional. It takes as input a function h(x) which is square-integrable over


[0, 1] and returns a real number. This is the continuum limit of f , in which we
associate the vector components {xi } with the values of a function at different
points. To be concrete, suppose that we let
x be a vector of dimension N and
n
fN := f : R R. Then we let
 
i1
xi = h .
N
For sufficiently well-behaved functions h, we may write

fN (

x)
lim = F (h).
N N
This is the connection between the discrete and continuum cases, along the
same lines as the connection between finite differences and derivatives.
206 CHAPTER 8. EPISODE V: DIFFERENTIATION STRIKES BACK

x3 h
x2
xN

x1

0 1 2

N 1 1
N N N

Figure 8.3: Passing from the discrete to the continuum.

So now there's the messy business of derivatives in the continuum. Ordinarily, we differentiate functions along a direction in their domain. For functions on R, this is the same as ordinary differentiation. For functions on R^N, this means picking a unit vector ê in R^N and evaluating the derivative along that direction. That is,
$$D_{\hat{e}} f(\vec{x}) = \hat{e} \cdot \nabla f(\vec{x}) := \lim_{\epsilon \to 0} \frac{f(\vec{x} + \epsilon \hat{e}) - f(\vec{x})}{\epsilon}.$$
Given this, the natural extension is to differentiate functionals along different function directions (Figure 8.4). That is,
$$\frac{\delta F(h)}{\delta w} := \lim_{\epsilon \to 0} \frac{F(h + \epsilon w) - F(h)}{\epsilon} = \frac{d}{d\epsilon} F(h + \epsilon w) \Big|_{\epsilon = 0}.$$

Figure 8.4: Comparing h and h + εw.

This is all pretty abstract, so let's get back to our example. Let
$$F(h) := \int_0^1 h(x)^2 \, dx$$
as usual. Then
$$\frac{\delta F(h)}{\delta w} = \lim_{\epsilon \to 0} \int_0^1 \frac{(h(x) + \epsilon w(x))^2 - h(x)^2}{\epsilon} \, dx = \lim_{\epsilon \to 0} \int_0^1 2h(x)w(x) + \epsilon w(x)^2 \, dx = \int_0^1 2h(x)w(x) \, dx.$$

This is remarkably similar to the derivative of the corresponding discrete func-


tional, or even single-variable function.
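If you like, you can check this numerically. Here is a minimal sketch in Python (not from the text; the choices h = sin, w(x) = x², and the crude Riemann-sum integrator are just illustrative): the finite difference (F(h + εw) − F(h))/ε should land close to ∫₀¹ 2h(x)w(x) dx.

```python
# Numerical sanity check of the functional derivative of F(h) = \int_0^1 h^2 dx.
# Everything here (test functions, grid size) is an arbitrary illustration.
import numpy as np

x = np.linspace(0.0, 1.0, 20_001)
dx = x[1] - x[0]

def integrate(values):
    return float(np.sum(values) * dx)   # crude Riemann sum, fine for a sketch

def F(h_values):
    return integrate(h_values ** 2)

h = np.sin(x)       # the "point" in function space where we differentiate
w = x ** 2          # the "direction" w
eps = 1e-6

finite_difference = (F(h + eps * w) - F(h)) / eps
exact = integrate(2 * h * w)
print(finite_difference, exact)         # both roughly 0.4465
```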
This is a beautiful idea. We can differentiate over much higher dimensional spaces than before, spaces with uncountably many dimensions. The structure that functionals impose on these spaces provides a clue as to the kinds of functions that inhabit them, and gives us a way to pick out ones which extremize particular quantities, a homing beacon to interesting functions. On what we hope is an even more profound note, the calculus of functionals is the language in which, as far as modern science can tell, the laws of Nature are written. Quantum Field Theory and all of its extensions, which collectively represent the current best explanation for the fundamental structure of Nature, are written in and intimately tied to the language of functional analysis.
As a simple example from physics, we can examine the Euler-Lagrange equa-
tion and principle of least action from classical mechanics. [TODO standard pic-
ture of minimizing path] [TODO also Euler-Lagrange for Schroedinger, Sobolev]

Chapter 9

Probability and ergodic theory
68 What is probability, anyway?
[TODO I just realized we should open with Bertrands paradox]
Making sense of the intuitive idea of probability is a tricky business, and the mathematical theory of probability appeared pretty late in history. In 1933, Kolmogorov made the following definition: A probability space is a measure space (Ω, F, μ) such that μ(Ω) = 1. In this context, Ω is called the set of outcomes. Measurable sets (i.e. sets in F) are called events. The measure function μ is called the probability function, and it is usually denoted Pr. Some examples:

Example 15. Imagine tossing two dice, hoping that they will sum to 10. We model this experiment by the probability space (Ω, F, Pr), where Ω = {1, 2, . . . , 6}², F = P(Ω), and Pr(S) = |S|/36. The probability that the two dice sum to 10 is Pr({(a, b) : a + b = 10}) = 1/12.
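For the skeptical reader, here is a quick sanity check in Python (not part of the text; it just brute-forces the sample space Ω):

```python
from itertools import product

outcomes = list(product(range(1, 7), repeat=2))      # Omega, |Omega| = 36
favorable = [w for w in outcomes if sum(w) == 10]    # (4,6), (5,5), (6,4)
print(len(favorable) / len(outcomes))                # 0.0833... = 1/12
```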

Example 16. Suppose you are terrible at tossing darts; you always hit the dartboard, but you have no tendency to hit any particular part of the dartboard. What is the probability that you will get a bullseye? We model this experiment by the probability space (Ω, F, Pr), where Ω ⊆ R² is the unit disc, F is the set of Borel subsets of Ω, and Pr is the Lebesgue measure (renormalized so that Pr(Ω) = 1). The probability in question is the ratio of the area of the bullseye region to the area of the entire dartboard.

Example 17. Imagine repeatedly tossing a fair coin. What is the probability
that the sequence of outcomes, interpreted as zeroes and ones, is the binary
expansion of a transcendental number? We model this experiment by the so-
called product space of the probability space for a single coin with itself countably
infinitely many times. It turns out that the number whose binary expansion is


given by the sequence of outcomes is uniformly distributed on [0, 1], so the


probability in question is 1.
There's a great deal of mathematics (of which we will discuss only a tiny amount) regarding the properties of probability spaces. But you might also wonder: in what sense do these probability spaces model the physical experiments described? This is a current area of research in philosophy: a lot of people have had a lot of ideas, and none of them is too satisfactory!
[TODO briefly discuss frequentism, subjectivism, propensity theory. Include the sleeping beauty problem just because it's awesome]

69 The Cauchy distribution and the false law of large numbers

[TODO write]

False Theorem 2. Suppose X₁, X₂, . . . is a sequence of i.i.d. random variables. Then there is some number μ so that
$$\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu \tag{9.1}$$
with probability 1.
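To see why this "theorem" deserves its label, here is a numerical sketch (not from the text): running sample means of i.i.d. standard Cauchy draws keep wandering no matter how large n gets, while standard normal sample means settle down. The Cauchy distribution has no mean; in fact, the average of n i.i.d. Cauchy variables is again Cauchy.

```python
import numpy as np

rng = np.random.default_rng(0)
cauchy = rng.standard_cauchy(10**6)
normal = rng.standard_normal(10**6)

for k in (10**2, 10**4, 10**6):
    print(k, cauchy[:k].mean(), normal[:k].mean())
# The normal running means approach 0; the Cauchy ones never settle.
```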

70 The strong law of large numbers


71 Birkhoff's ergodic theorem
[Give probabilistic formulation, and point out that SLLN is a special case] [pos-
sibly combine with SLLN? unless both of these sections end up being quite long
by themselves -lhs]
[mention mean ergodic theorem, maximal ergodic theorem]
Definition 52 (Measure preserving and ergodic maps).
(i) A map T : X → X is a measure preserving transformation (MPT) if μ(T⁻¹(E)) = μ(E) for every measurable E ⊆ X. [Examples: translation on R^d, rotation on the circle R/Z = [0, 1)]
(ii) An MPT T : X → X is ergodic if E measurable and μ(T⁻¹(E) △ E) = 0 implies μ(E) = 0 or μ(E^c) = 0. (The sets that don't change under T⁻¹ are either measure 0 or full measure.)
[TODO pictures of the examples] Examples.
(i) Translation in R^n: T_v : x ↦ x + v is an MPT but not ergodic. E.g. for T_{e₁}, take E = ⋃_{z∈Z} B_{1/2}(z e₁). Then T_{e₁}⁻¹(E) = E, but m(E) is neither measure zero nor full measure.

(ii) Rotations in R^n: MPT but not ergodic (e.g. take E an annulus).
(iii) Rotation on the circle/torus T = R/Z = [0, 1): T_α : x ↦ x + α, i.e. x + α mod 1. This is an MPT, and is ergodic iff α is irrational. If α = p/q is rational, put mass spaced at intervals of length 1/q. If α is irrational, proof using Fourier series or the equidistribution theorem, and an alternate characterization of ergodicity, e.g. see [BrinStuck].
(iv) Doubling map on T: T : x ↦ 2x is an MPT and ergodic. (Draw picture: y = 2x for 0 ≤ x < 1/2, 2x − 1 for 1/2 ≤ x < 1.) MPT is easy to see; for ergodicity see [BrinStuck] (proves mixing).
Theorem 62 (Birkhoff ergodic theorem). Let T be an ergodic MPT in a finite measure space (X, Σ, μ) and let f ∈ L¹(X, μ). Then
$$\lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} f(T^k(x)) = \frac{1}{\mu(X)} \int_X f(x) \, d\mu, \qquad \mu\text{-a.e. } x \in X.$$
It means "time average = space average" a.e. If we take f = χ_A, then the left side is how often T^k x lands in A, and the right side is the measure of A (normalized). Special case with X = [0, 1], T(x) = x + α: the equidistribution theorem (f continuous, T uniquely ergodic, then we get everywhere convergence).
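Here is a quick illustration of "time average = space average" in Python (not from the text): for the irrational rotation T(x) = x + α mod 1 and f the indicator of A = [0, 0.1), the fraction of the orbit that lands in A approaches the length of A. The choices α = √2 and A are arbitrary.

```python
import math

alpha = math.sqrt(2)            # an irrational rotation number
x, hits, n = 0.0, 0, 100_000
for _ in range(n):
    if x < 0.1:                 # f = indicator of A = [0, 0.1)
        hits += 1
    x = (x + alpha) % 1.0
print(hits / n)                 # close to 0.1, the measure of A
```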

72 Random walks
73 Brownian motion
[Discuss physics motivation, give a definition, state some basic analysis-ish facts
like the fact that it is continuous everywhere but differentiable nowhere a.s.]

74 Khinchin's constant
[TODO: rewrite/edit this, I mainly just copy-pasted from my math club talk
notes -lhs]
An example:
$$\pi = 3 + \cfrac{1}{7 + \cfrac{1}{15 + \cfrac{1}{1 + \cfrac{1}{292 + \cdots}}}}.$$
Define
$$[a_1, a_2, \ldots, a_n] := \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cdots + \cfrac{1}{a_n}}}, \qquad a_1, \ldots, a_n \in \mathbb{N}.$$

To study continued fraction expansions, it is useful to introduce the Gauss map,
$$\tau(x) = \frac{1}{x} - \left\lfloor \frac{1}{x} \right\rfloor, \qquad \tau(0) = 0, \quad x \in [0, 1).$$
For x ∈ (0, 1), write x = 1/(⌊1/x⌋ + τ(x)), so a₁ = ⌊1/x⌋, which leaves us τ(x) = 1/(a₂ + ⋯). Write τ(x) = 1/(⌊1/τ(x)⌋ + τ²(x)), so a₂ = ⌊1/τ(x)⌋. Continuing, if we set
$$a_j = \lfloor 1/\tau^{j-1}(x) \rfloor, \qquad 1 \le j \le n,$$
then
$$x = \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cdots + \cfrac{1}{a_n + \tau^n(x)}}}.$$
An irrational x ∈ [0, 1) has the infinite continued fraction expansion
$$x = [a_1, a_2, \ldots] = \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cfrac{1}{a_3 + \cdots}}},$$
where
$$a_n(x) = \left\lfloor \frac{1}{\tau^{n-1}(x)} \right\rfloor.$$
Question: Do the a_n(x) tend to favor certain numbers? Note that a_n(x) = k iff 1/(k+1) < τ^{n−1}(x) ≤ 1/k. So we want to look at the distribution of τ^n(x) over all n.
Recall the Gauss map
$$\tau(x) = \frac{1}{x} - \left\lfloor \frac{1}{x} \right\rfloor, \qquad x \in [0, 1),$$
and the continued fraction expansion x = [a₁, a₂, . . .], where a_n(x) = ⌊1/τ^{n−1}(x)⌋ and a_n(x) = k iff 1/(k+1) < τ^{n−1}(x) ≤ 1/k. Define the Gauss measure
$$\mu(E) := \frac{1}{\log 2} \int_E \frac{1}{1+x} \, dx.$$
Proposition 43. The Gauss map is an ergodic MPT for the Gauss measure μ, i.e. μ(τ⁻¹(E)) = μ(E) for measurable E ⊆ [0, 1), and μ(τ⁻¹(E) △ E) = 0 implies μ(E) = 0 or μ(E^c) = 0.
Theorem 63 (Khinchin 1935). For a.e. x ∈ [0, 1] (with respect to Gauss or Lebesgue measure), the following hold:
(i) Each k ∈ N appears in the sequence a₁(x), a₂(x), . . . with asymptotic frequency
$$\frac{1}{\log 2} \log\left(1 + \frac{1}{k(k+2)}\right).$$
(ii) The arithmetic mean of the partial quotient coefficients is infinite,
$$\lim_{n \to \infty} \frac{1}{n}\left(a_1(x) + \cdots + a_n(x)\right) = \infty.$$
(iii) The geometric mean of the partial quotient coefficients has limit
$$\lim_{n \to \infty} \left(a_1(x) a_2(x) \cdots a_n(x)\right)^{1/n} = \prod_{j=1}^{\infty} \left(1 + \frac{1}{j(j+2)}\right)^{\log j / \log 2} =: K_0 \approx 2.685.$$
K₀ is called Khinchin's constant.
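Part (iii) is fun to watch numerically. The sketch below (not from the text) iterates the Gauss map to read off continued fraction coefficients of a pseudorandom x and prints the geometric mean of the first 40 of them; floating point only yields a few dozen reliable coefficients, so expect a noisy value in the general vicinity of K₀ rather than a precise one.

```python
import math, random

random.seed(1)
x = random.random()                  # a "typical" point of [0, 1)
logs, n = 0.0, 40
for _ in range(n):
    a = math.floor(1.0 / x)          # a_j = floor(1 / tau^{j-1}(x))
    logs += math.log(a)
    x = 1.0 / x - a                  # apply the Gauss map
print(math.exp(logs / n))            # fluctuates around K_0 = 2.685...
```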

Remark 16. The theorem only holds for a.e. x ∈ [0, 1). In particular, the conclusions do not hold for quadratic irrationals (irrational roots of quadratic equations with rational coefficients). For example,
$$\sqrt{2} = [1; 2, 2, 2, \ldots], \qquad \frac{1 + \sqrt{5}}{2} = [1; 1, 1, 1, \ldots].$$
Proposition 44 (Lagrange 1770). The continued fraction expansion for x is eventually periodic if and only if x is a quadratic irrational.
Also, the nth geometric mean of the continued fraction coefficients for e = 2.71828 . . . grows like n^{1/3}. Open: π and the Euler-Mascheroni¹ constant γ? (Numerically the answer seems to be yes for Khinchin's constant.)
¹ γ := lim_{n→∞} (−log n + Σ_{j=1}^{n} 1/j).

[TODO like get rid of this proof]

Proof (of theorem). (i) Let f := χ_{(1/(k+1), 1/k]}, so that a_n(x) = k iff f(τ^{n−1}(x)) = 1. Apply the Birkhoff ergodic theorem.
(ii) Let f(x) := ⌊1/x⌋ = a₁(x), so a_n(x) = f(τ^{n−1}(x)) and the Nth arithmetic average is
$$\frac{1}{N}\left(a_1(x) + \cdots + a_N(x)\right) = \frac{1}{N} \sum_{j=0}^{N-1} f(\tau^j(x)).$$
We'd like to apply Birkhoff's ergodic theorem, but
$$\int_0^1 \frac{f(x)}{1+x} \, dx = \infty,$$
since² f(x) = ⌊1/x⌋ > (1 − x)/x and ∫₀¹ (1 − x)/(x(1 + x)) dx = ∞. Set
$$f_N(x) := \begin{cases} f(x), & f(x) \le N \\ 0, & \text{else} \end{cases}$$
and note f_N ∈ L¹([0, 1), μ). Apply the Birkhoff ergodic theorem to f_N: for a.e. x ∈ [0, 1),
$$\liminf_{N \to \infty} \frac{1}{N} \sum_{j=0}^{N-1} f(\tau^j(x)) \ge \liminf_{N \to \infty} \frac{1}{N} \sum_{j=0}^{N-1} f_N(\tau^j(x)) = \frac{1}{\log 2} \int_0^1 \frac{f_N(x)}{1+x} \, dx \to \infty$$
as N → ∞.
(iii) We're going to take logarithms. Let f(x) = log⌊1/x⌋, so
$$\frac{1}{N}\left(\log a_1(x) + \cdots + \log a_N(x)\right) = \frac{1}{N} \sum_{j=0}^{N-1} f(\tau^j(x)).$$
One can show f ∈ L¹([0, 1), dμ). Then apply the Birkhoff ergodic theorem and exponentiate.
² f(x) = ⌊1/x⌋ jumps at the points 1/n; check those regions and use that (1 − x)/x is decreasing.

75 Normal numbers
[TODO: write. Discuss definition of normality, give an example in base 10, talk
about the fact that almost all numbers are normal. Also talk about the fact
that there ARE specific known examples of numbers which are normal in every
base, contrary to common wisdom...]


Chapter 10

Fourier analysis
Ya got the wrong eigenvalue!!
Put a Fourier transform on it.
PRONTO!!

alleged or misquoted quote from


the 1987 movie No Way Out

76 Introduction to Fourier series


Once upon a time there was a function f that was periodic and lived on [0, 1]. Some engineers and physicists found this function and interpreted it as a signal, like a radio wave or a sound wave (Figure 10.1). They wanted to decompose

TODO picture for radio or sound waves

Figure 10.1: radio. mention AM, FM

the function into its distinct sine and cosine frequencies by writing
$$f(x) = a_0 + \sum_{n=1}^{\infty} a_n \cos(2\pi n x) + \sum_{n=1}^{\infty} b_n \sin(2\pi n x), \qquad a_n, b_n \in \mathbb{R}.$$

Now f was a bit confused. Why in the world did these engineers and physicists expect that he could be decomposed into a bunch of sines and cosines? So he asked them about this, and one of them explained: "Well, we think you're a sound wave (that talks!), and we want to determine the amplitudes of each of the frequencies that make you up. Besides telling us lots of useful things, once we know all the amplitudes for all the frequencies, we can synthesize a wave by just adding up a bunch of pure sine and cosine terms with the correct amplitudes."
"Okay, but that's not exactly answering my question; why should sines and cosines be enough? Why don't you try to write other functions in terms of me instead?" asked the function, who was starting to get jealous of sin and cos. The physicist started to say something about the set {e^{2πinx}}_{n∈Z} forming an orthonormal (Schauder) basis for the function space L²([0, 1]), then remembered he was a physicist and proceeded to give an example and some formulas instead.
"Forget the sines and cosines; it's often easier to deal with this if we use complex exponentials instead. Write
$$f(x) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i n x}, \tag{10.1}$$
where now the coefficients a_n can be complex. Here's the physics method to determine the a_n: Let's assume that (10.1) holds, and integrate it against e^{-2πimx} for some arbitrary m ∈ Z. Then we get
$$\int_0^1 f(x) e^{-2\pi i m x} \, dx = \int_0^1 \sum_{n \in \mathbb{Z}} a_n e^{2\pi i n x} e^{-2\pi i m x} \, dx.$$
Now we use orthonormality of e^{2πinx}: If we integrate e^{2πinx} e^{-2πimx} on [0, 1] we'll get zero from periodicity, unless n = m, in which case the integrand is just 1. So after assuming we can interchange the sum and integral ('cause we're physicists), all the terms on the right side vanish except for the n = m term, and we get
$$\int_0^1 f(x) e^{-2\pi i m x} \, dx = \sum_{n \in \mathbb{Z}} a_n \underbrace{\int_0^1 e^{2\pi i (n-m) x} \, dx}_{=0 \text{ unless } n = m} = \int_0^1 a_m \, dx = a_m.$$
Thus, a_n = ∫₀¹ f(x) e^{−2πinx} dx. So now, in some sense, we have f equal to its Fourier series (10.1), and we have a formula for the coefficients a_n, and everything works out."
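Here is the physicists' recipe carried out numerically in Python (a sketch, not from the text; the square wave and the Riemann-sum approximation of the integral are our own choices):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 4096, endpoint=False)
f = np.where(x < 0.5, 1.0, -1.0)                     # a square wave on [0, 1]

def a(n):
    # Riemann-sum approximation of a_n = \int_0^1 f(x) e^{-2 pi i n x} dx
    return np.mean(f * np.exp(-2j * np.pi * n * x))

partial = sum(a(n) * np.exp(2j * np.pi * n * x) for n in range(-9, 10))
print(a(1))                              # about -2i/pi for this square wave
print(np.max(np.abs(partial.imag)))      # essentially zero: the sum is real
print(np.max(np.abs(partial.real - f)))  # largest error sits near the jumps
```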
"Hmmm..." began the function before he was interrupted by "Here's some examples and pretty pictures of the process working! See Figures 10.2 and 10.3. Just like the square wave and the saw tooth wave, you too can get a Fourier series!"
"Hmmm..." began the function again. This time he was interrupted by a friendly operator called the I-can-compute-Fourier-coefficients operator. This operator explained the math way to do Fourier series to the poor confused function: For f ∈ L¹([0, 1]), define the nth Fourier coefficient by
$$\widehat{f}_n = \int_0^1 f(x) e^{-2\pi i n x} \, dx,$$
and then write
$$f(x) \sim \sum_{n \in \mathbb{Z}} \widehat{f}_n e^{2\pi i n x}$$

Figure 10.2: The first few partial sums of the Fourier series for the square wave (shown in black).

Figure 10.3: The first few partial sums of the Fourier series for the saw tooth wave (shown in black).

as a formal sum. Of course, we'd like to know in what sense f is actually equal to its Fourier series. There are different types of convergence we could look at. One is pointwise convergence: for fixed f and x ∈ [0, 1], does f(x) = Σ_{n∈Z} f̂_n e^{2πinx}? In general, since we're looking at f ∈ L¹([0, 1]), we can only get almost everywhere convergence at best, since functions that are equal a.e. are identified together in L¹, and will have the same Fourier coefficients. So we can ask if we always get pointwise almost everywhere convergence, and we can ask if we can get everywhere convergence if we add some conditions to f, like continuity. The other kind of convergence we care about is norm convergence, that is, convergence in the L^p norm.
"Hmmm..." said the function again. "Ah ah ah," said the I-can-compute-Fourier-coefficients operator. "I know what you're thinking. I can tell you your Fourier coefficients, but I'm not going to say anything about whether your Fourier series converges or not. I'm just the I-can-compute-Fourier-coefficients operator. My job is easy. There's a formula and I just compute it. I give you the sequence of Fourier coefficients {f̂_n}_{n∈Z} and then I'm done. You're going to have to do more work to figure out if the resulting Fourier series actually converges or not." And with that the I-can-compute-Fourier-coefficients operator ran away before f could protest further.

77 Divergence of Fourier series

Once upon a time there was a function named f who lived on the torus T = R/Z (Figure 10.4). Basically, think of f as a periodic function on [0, 1], or just a function on [0, 1] with f(0) = f(1).

Figure 10.4: The torus T = R/Z is really the circle, since we're in one dimension. Points in R that differ by an integer are identified together, so we end up with [0, 1] with 0 and 1 identified, which forms the circle.

Now f was absolutely integrable, that is, ∫_T |f| dx < ∞, so f ∈ L¹(T), but
other than that he wasn't a very nice function. One day a new operator came to visit the torus. His name was the Turn-me-into-a-Fourier-series operator. For the right price, this operator would give a function an alternate identity in terms of a sum of complex exponentials. In other words, it told a function its Fourier series. After seeing some of his friends like the square wave and sawtooth wave visit the Turn-me-into-a-Fourier-series operator and come back with a totally new and exciting Fourier series representation, f decided to try it out for himself. So he asked the Turn-me-into-a-Fourier-series operator to turn him into a Fourier series. The Turn-me-into-a-Fourier-series operator looked at him and paused. "Hmmm, you're an L¹(T) function at least... I can compute your Fourier series coefficients... but when I try to create the Fourier series, it doesn't look like you." Now f, being in L¹, was used to actually representing an equivalence class of functions in L¹ that differed only by a set of zero Lebesgue measure, so he replied, "Okay, okay, I know it might not look like me, but I represent an equivalence class of pretty similar functions, so as long as it looks like someone in my equivalence class, it's fine." But the Turn-me-into-a-Fourier-series operator sighed and said, "I'm sorry, but I'm looking at the

resulting Fourier series, and it doesn't converge anywhere. Like, you're asking for pointwise almost everywhere convergence, but I can't even get you convergence at any point! I'm going to have to ban functions like yourself! How'd you even get yourselves into L¹, you're so ugly!" At this, f ran away back to his home on T and cried himself to sleep. All his friends¹ could get nice Fourier series converging to themselves at least almost everywhere, and here he was, his Fourier series diverging everywhere! As he fell asleep, he secretly blamed the great mathematician Andrey Kolmogorov for all his woes.
Why did f blame Kolmogorov? In fact, f was quite correct in placing his blame. In 1923, Kolmogorov shattered the physicists' dreams when he showed that there is an L¹ function whose Fourier series diverges almost everywhere. Well, he didn't really shatter their dreams, since the physicists didn't actually really care. Anyway, three years later in 1926, he constructed an L¹ function f whose Fourier series diverges everywhere. It was f's unfortunate luck that he was indeed this function. Things weren't looking very good for f or for the Turn-me-into-a-Fourier-series operator.
¹ Well, not all of them.
Functions started to wonder how bad this could get. Could there exist a continuous function whose Fourier series diverged a.e.? What kind of condition would guarantee convergence a.e.? (For spoilers, skip to Section 80.)
TODO For the last part of this section, we'll look at the construction of such a function. TODO Picture????
Reference: Details for Kolmogorov's functions with a.e. and everywhere divergent Fourier series: [2, 14]

78 Continuous functions make Fourier series cry

After the disaster in Section 77 with the L¹ function f, the Turn-me-into-a-Fourier-series operator decided to be a bit more restrictive about which functions he promised to give a Fourier series to. The functions in Kolmogorov's examples from Section 77, while in L¹, were very discontinuous. The Turn-me-into-a-Fourier-series operator figured if he only allowed nice enough functions, he'd be able to ensure everywhere convergence of the Fourier series. So he thought for a bit and figured that only allowing continuous functions would fix his problem. Eventually, he thought maybe he could weaken this requirement, but for now, he was only going to compute Fourier series for continuous functions.
But the Turn-me-into-a-Fourier-series operator was so wrong. There was a continuous function g who lived on T and heard about this operator. Being continuous, he was excited to find out his Fourier series from the Turn-me-into-a-Fourier-series operator. Being continuous, he also was somewhat arrogant and would only accept a Fourier series converging to himself everywhere. Sure, as an L¹ function he often was identified as an equivalence class of functions, but he was certain he was the most important member of that equivalence class, just because he was continuous. So he didn't care about convergence almost everywhere (i.e. convergence to someone in his equivalence class); he only wanted everywhere convergence to himself.
Anyway, he visited the Turn-me-into-a-Fourier-series operator one day, and the operator gladly computed his Fourier series coefficients and Fourier series and told g. At first, g was happy. But then he noticed something wrong with the Fourier series. "Hey... you gave me the wrong Fourier series; this one doesn't converge to me at x = 0. In fact, it diverges at x = 0. This is definitely wrong. I demand everywhere convergence!" The Turn-me-into-a-Fourier-series operator was shocked and apologized profusely and recomputed the Fourier series. But he got the same result as before. So he triple-checked and quadruple-checked, but got the same result each time. Eventually, the Turn-me-into-a-Fourier-series operator gave up and apologized again. He tried to get g to accept the Fourier series and leave; after all, the Fourier series did converge to g pointwise almost everywhere, but g refused and stormed out. Needless to say, g was mad. How did he, of all nice continuous functions, end up with a bad Fourier series? But as he continued fuming about it, the Turn-me-into-a-Fourier-series operator realized that this problem with the Fourier series of g wasn't an isolated problem. There were a bunch of continuous functions whose Fourier series didn't converge everywhere:

Proposition 45. For every x ∈ T, there is a dense G_δ set of functions F_x ⊆ C(T) such that sup_n |S_n(f)(x)| = ∞ for every f ∈ F_x.

So not only would the Turn-me-into-a-Fourier-series operator fail to get everywhere convergence for some continuous function, but given any x ∈ T, there are a lot of continuous functions whose Fourier series diverge at x.
The Turn-me-into-a-Fourier-series operator decided to once and for all come up with a sufficient condition for the Fourier series to converge everywhere. He didn't really care about necessary conditions, since he figured he could get enough business if he found a weak enough sufficient condition. This plan was going to have to wait until Section 80 though, because right now he wanted to figure out how the Fourier series of continuous functions could manage to behave so badly at certain points! In other words, he wanted to prove Proposition 45.
To talk about convergence of Fourier series, he had to look at the partial sums,
$$S_n(f)(x) := \sum_{|j| \le n} \widehat{f}_j e^{2\pi i j x}.$$

In order to better understand these partial sums, the Turn-me-into-a-Fourier-series operator traveled far and wide.
Eventually, he met the Dirichlet kernel D_n(x) := Σ_{|j|≤n} e^{2πijx}. At first, he thought the Dirichlet kernel was pretty useless; it almost looked like the partial sums but it was missing the f̂_j term on each exponential. But the Dirichlet kernel assured him he was mistaken; using D_n, the partial sums could be written as a convolution,
$$S_n(f) = f * D_n.$$

Figure 10.5: The Turn-me-into-a-Fourier-series operator traveled far and wide.

Figure 10.6: Plot of the Dirichlet kernel D_n(x) for 1 ≤ n ≤ 5. The central peak rises as n increases. By summing the geometric series, the Dirichlet kernel can be written as D_n(x) = sin((2n+1)πx)/sin(πx).

So the Turn-me-into-a-Fourier-series operator agreed to take the Dirichlet kernel D_n with him on his travels. After explaining the problem with all these continuous functions having Fourier series that diverge somewhere, the Dirichlet kernel said, "I think I can help you there. Let me introduce you to my friend the Banach-Steinhaus theorem. He's got a fancy name, but lots of my friends just call him the uniform boundedness principle. He usually lives in functional-analysis-land." "Uhh, what do all those words mean? And how is he gonna help?" replied the Turn-me-into-a-Fourier-series operator. "Well," said the Dirichlet kernel, "the key parts in the proof are using the growth of the L¹ norm ‖D_n‖₁ combined with the uniform boundedness principle. Let me first introduce you to him, and then give you the lemma you need about ‖D_n‖₁."

Theorem 64 (Banach-Steinhaus). Let X be a Banach space, Y a normed linear space, and {A_α} a collection of bounded linear transformations X → Y. Then either ‖A_α‖ ≤ C is uniformly bounded, or there is a dense G_δ set X' ⊆ X with sup_α ‖A_α x‖ = ∞ for all x ∈ X'.

The uniform boundedness principle didn't say much, other than that he really liked Baire and that his two best friends were Hahn-Banach and the open mapping theorem.
Lemma 11. The L¹ norm of the Dirichlet kernel D_n(x) = Σ_{|j|≤n} e^{2πijx} grows like log n. More specifically,
$$\frac{4}{\pi^2} \log n < \|D_n\|_1 < 3 + \log n.$$
We only actually need ‖D_n‖₁ ≳ log n, and this we can get by using |sin x| ≤ |x| and estimating
$$\|D_n\|_1 \ge 2 \int_0^{1/2} \frac{|\sin((2n+1)\pi t)|}{\pi t} \, dt = \frac{2}{\pi} \int_0^{(n+\frac{1}{2})\pi} \frac{|\sin x|}{x} \, dx > \frac{2}{\pi} \sum_{k=1}^{n} \frac{1}{k\pi} \int_{(k-1)\pi}^{k\pi} |\sin x| \, dx \ge \frac{4}{\pi^2} \log n.$$
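The growth of ‖D_n‖₁ is easy to see numerically. A sketch (not from the text), using a midpoint rule so the grid never hits the removable singularity at x = 0:

```python
import numpy as np

N = 400_000
x = (np.arange(N) + 0.5) / N - 0.5     # midpoint grid on (-1/2, 1/2)
dx = 1.0 / N

for n in (10, 100, 1000):
    Dn = np.sin((2 * n + 1) * np.pi * x) / np.sin(np.pi * x)
    print(n, np.sum(np.abs(Dn)) * dx, 4 / np.pi**2 * np.log(n))
# The first number grows like log n and stays above the (4/pi^2) log n bound.
```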

The Dirichlet kernel continued: "Now we look at x = 0, and then I introduce you to a family of operators, A_n(f) := S_n(f)(0), which just spits out the value of the Fourier partial sum at x = 0. Recall that under the sup norm, C(T) is a Banach space, and then note that each A_n is a bounded linear functional C(T) → C with norm
$$\|A_n\| = \sup_{\|f\|_\infty = 1} |f * D_n(0)| \le \int_T |f(y) D_n(y)| \, dy \le \int_T |D_n(y)| \, dy = \|D_n\|_1.$$
In fact, we get equality ‖A_n‖ = ‖D_n‖₁ by using continuous f_j ∈ C(T) to approximate sgn(D_n(y)) pointwise. For by the dominated convergence theorem (recall Section 51), we can interchange the limit j → ∞ and the integral to get
$$\lim_{j \to \infty} A_n(f_j) = \int_T \mathrm{sgn}(D_n(y)) D_n(y) \, dy = \int_T |D_n(y)| \, dy = \|D_n\|_1.$$
But ‖D_n‖₁ ≳ log n → ∞, so ‖A_n‖ is certainly not uniformly bounded, so the uniform boundedness principle implies there is a dense G_δ set F ⊆ C(T) with
$$\sup_n |S_n(f)(0)| = \infty$$
for all f ∈ F. This works for any other x ∈ T as well, so we can always find a lot of continuous functions whose Fourier series diverge at a fixed x ∈ T."
"Hmm, this seems pretty bad. But at least it can't really get any worse than that, right?" asked the Turn-me-into-a-Fourier-series operator. And he was wrong again. The Dirichlet kernel replied,
"Well, um... in fact, it gets a bit worse. How about instead of just picking a single x ∈ T where we want the Fourier series to diverge, we pick several? Or countably infinitely many? Or just any measure zero set? It turns out that given any measure zero set E ⊆ T, there is some continuous function whose Fourier series diverges precisely on E and nowhere else!"
"Arrrrrrrrrrrgggggggggggghhhhhhhhh!!!!!!!!!!!" exclaimed the Turn-me-into-a-Fourier-series operator as he ran around screaming and complaining about how much he hated certain continuous functions, "Why? why? why? why? why? why? I hate these functions!"
References: Banach-Steinhaus, Dirichlet kernel, divergence of Fourier series: [8, 14]. Any measure zero set: [4].
[TODO set E. Do we get the G_δ or denseness?]

79 Summability is nice

After all these disasters with the Turn-me-into-a-Fourier-series operator failing to produce nice Fourier series for everyone, the functions on T were losing faith in Fourier series. They searched far and wide and eventually found a similar operator that called himself the Average-me-into-a-Fourier-series operator and promised better results.
The idea of the Average-me-into-a-Fourier-series operator was this: One thing we can do if a sequence (a_n) doesn't converge is to take averages of the sequence elements (Cesaro summability). Instead of looking at a₀, a₁, a₂, . . ., we look at a₀, ½(a₀ + a₁), ⅓(a₀ + a₁ + a₂), . . .. That is, we look at the new sequence given by b_n := (1/n) Σ_{j=0}^{n−1} a_j. Conveniently, if the original sequence (a_n) converged to, say, a, then so does the new sequence of averages (b_n). The idea is that eventually you'll just be adding a ton of terms that look like a to the average, so the sequence of averages will have to tend towards a. The beginning few terms where you might be far away from a don't matter as that fraction 1/n starts to kick in.
What's useful, though, is that sometimes (b_n) converges even when (a_n) does not. For example, 1, −1, 1, −1, 1, −1, . . . doesn't converge, but the sequence of its averages does. This is just the sequence
$$1, 0, \tfrac{1}{3}, 0, \tfrac{1}{5}, 0, \tfrac{1}{7}, 0, \ldots,$$
which converges to zero.
To apply this to Fourier series, the Average-me-into-a-Fourier-series operator got some help from the Fejer kernel, F_n(x) := (1/n) Σ_{j=0}^{n−1} D_j(x). Recall we have a sequence of partial sums (S_n(f)) that doesn't behave nicely, and we hope that averaging them will make things converge nicely. Computation shows
$$f * F_n = \frac{1}{n} \sum_{j=0}^{n-1} S_j(f),$$

which means that the Fejer kernel takes care of the averaging for us. Additionally, the Fejer kernel is a summability kernel (sometimes called an approximate identity or a good kernel), meaning it's a sequence that satisfies
1. ∫_T F_n(x) dx = 1.
2. ∫_T |F_n(x)| dx ≤ M (uniformly bounded).
3. For every δ > 0, ∫_{δ ≤ |x| ≤ 1/2} |F_n(x)| dx → 0 as n → ∞. (Here we're looking at integration on [−1/2, 1/2].)
The Fejer kernel proudly remarked that the Dirichlet kernel is not a summability kernel since it fails condition 2; ‖D_n‖₁ ≳ log n is not bounded. The Fejer kernel does satisfy condition 2, though, since we can write F_n(x) = (1/n) [sin(nπx)/sin(πx)]² ≥ 0. Condition 3 says all the mass gets concentrated near the origin as n → ∞. Although the Dirichlet kernel may seem to get more and more concentrated as n increases since the central peak rises, it actually violates this condition.²

Figure 10.7: Plot of the Fejer kernel F_n(x) for 1 ≤ n ≤ 6. Note F_n(x) ≥ 0, unlike D_n(x).

The great thing about the Fejer kernel is that the averaging is good enough to get convergence. Recall f * F_n = (1/n) Σ_{j=0}^{n−1} S_j(f), and note that this involves a convolution, which we like when we're trying to prove a.e. convergence. (Recall Section 57.) We claim:
² Recall the computation we did to show ‖D_n‖₁ ≳ log n? We can do something similar here. Take δ = 1/4:
$$\int_{1/4}^{1/2} \frac{|\sin((2n+1)\pi t)|}{\pi t} \, dt = \frac{1}{\pi} \int_{\frac{1}{2}(n+\frac{1}{2})\pi}^{(n+\frac{1}{2})\pi} \frac{|\sin x|}{x} \, dx > \frac{1}{\pi} \sum_{k = \lceil n/2 \rceil}^{n} \frac{1}{k\pi} \int_{(k-1)\pi}^{k\pi} |\sin x| \, dx \gtrsim \log\frac{n}{n/2} \not\to 0.$$


Proposition 46. If f ∈ L¹(T), then f * F_n = (1/n) Σ_{j=0}^{n−1} S_j(f) → f almost everywhere.
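Before moving on, here is a small numerical comparison (not from the text) of the raw partial sums S_n(f) and the Fejer averages for the square wave from earlier: the averages tame the overshoot that the partial sums exhibit near the jump (the Gibbs phenomenon), which is exactly the smoothing the Fejer kernel buys us.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 4096, endpoint=False)
f = np.where(x < 0.5, 1.0, -1.0)
fhat = {n: np.mean(f * np.exp(-2j * np.pi * n * x)) for n in range(-64, 65)}

def S(n):                                # partial sum S_n(f)
    return sum(fhat[j] * np.exp(2j * np.pi * j * x) for j in range(-n, n + 1)).real

N = 64
dirichlet = S(N)
fejer = sum(S(k) for k in range(N)) / N  # (S_0 + ... + S_{N-1}) / N = f * F_N
print(dirichlet.max(), fejer.max())      # about 1.18 (overshoot) vs. at most 1
```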

With this Proposition, the Average-me-into-a-Fourier-series operator was able to convince the L¹ functions living on T that convergence of the Fourier series averages was going to work. Of course, the Average-me-into-a-Fourier-series operator could only guarantee almost everywhere convergence, but for most L¹ functions, this was good enough.
The Average-me-into-a-Fourier-series operator also made a note to try to include more functions: Since T has finite measure, L¹(T) ⊇ L²(T) ⊇ · · · ⊇ L^∞(T), and so the proposition holds for the other L^p spaces as well. The Fejer kernel also remarked that it is important that D_n is not a summability kernel, else D_n could attempt to use the same proof to show that f * D_n also converged to f a.e. (The Fejer kernel liked to point out how he was better than the Dirichlet kernel.)

F_n(x) = (1/n) [sin(nπx)/sin(πx)]²        D_n(x) = sin((2n+1)πx)/sin(πx)

[TODO pic of bad/burned/moldy corn kernels]

Good kernels        Bad kernels

Figure 10.8: Good kernels and bad kernels.

Figure 10.9: While we're discussing kernels, here's two jokes for your amusement. How do algebraists eat their corn? Answer: By modding out the kernels! And a classic: Why can't you grow corn in Z/6Z? Answer: Because it's not a field! The first joke may be in reference to a blog post by the user bentilly on blogspot about how algebraists and analysts actually eat corn, concluding that algebraists tend to eat it in rows while analysts tend to eat it in spirals. [TODO cite sources for jokes?]

Convolving f with the Fejer kernel or the Dirichlet kernel acts like a cut-off on the terms in the Fourier series. Convolving with the Dirichlet kernel is a sharp cut-off at |n| = N to form the Nth partial sum. However, convolving with the Fejer kernel features a smoother cut-off because of the averaging, and this ends up producing better convergence properties. (Figure 10.10.)

Figure 10.10: How D_N and F_N act as multipliers on the terms in the Fourier series. D_N just takes all the terms |n| ≤ N with a sharp cut-off, while F_N averages the contributions and ends up with a continuous multiplier (though they really only act at integer points). This increased regularity makes for nicer convergence properties.

A.e. pointwise convergence of averages isn't as good as simply getting S_n(f) → f a.e., but it's basically the best we can do for L¹. The Average-me-into-a-Fourier-series operator also happily pointed out that this convergence implies that if the Fourier series does converge on some E with positive measure, then it at least converges a.e. to the correct function f on E, since Cesaro summation converges to the original limit if it existed.
The proof of Proposition 46 proceeds by showing F_n is a summability kernel with sup_{n∈N} ‖F̃_n‖₁ < ∞, where F̃_n is the decreasing symmetrization of F_n (i.e., F̃_n(x) := sup_{|y|≥|x|} F_n(y), which is decreasing and symmetric; recall Section 58), and then using the consequence of the Lebesgue differentiation theorem for convolutions from Section 58.
TODO picture of bounding; n for |x| < 1/(2n), (1/n)(1/(2x))² else.

Figure 10.11: We can bound F_n(x) by symmetric decreasing functions F̃_n(x) with uniform L¹ bound.

The proposition has a nice corollary about uniqueness:

Corollary 5. Let f, g ∈ L¹(T). If f̂_n = ĝ_n for all n ∈ Z, then f = g a.e.

This follows from f * F_n = Σ_{|j|≤n−1} (1 − |j|/n) f̂_j e^{2πijx} (Figure 10.12), since then we see g * F_n = f * F_n, with f * F_n → f a.e. and g * F_n → g a.e.

TODO picture of Fejer kernel thing

Figure 10.12: The equality f * F_n = (1/n) Σ_{k=0}^{n−1} S_k(f)(x) = Σ_{|j|≤n−1} (1 − |j|/n) f̂_j e^{2πijx}. The x-axis indicates the kth term S_k(f)(x) = Σ_{|m|≤k} f̂_m e^{2πimx} in the sum Σ_{k=0}^{n−1} S_k(f)(x). The y-axis indicates the jth Fourier series term f̂_j e^{2πijx}. In the sum f * F_n = (1/n) Σ_{k=0}^{n−1} S_k(f)(x), we sum along the k-variable on the x-axis. If we sum along the j-variable on the y-axis instead, we obtain f * F_n = Σ_{|j|≤n−1} (1 − |j|/n) f̂_j e^{2πijx}.

There was one final thing the Fejer kernel noted to demonstrate his usefulness: If f ∈ C(T), we can show f * F_n → f in the ‖·‖_∞ norm, which gives a proof of the classical Weierstrass theorem from Section 25, that trigonometric polynomials (e.g. f * F_n) are dense in C(T). For the proof that f * F_n → f uniformly, use uniform continuity of f to get a δ, and then split up the integral between |t| < δ and |t| ≥ δ and use the properties of a summability kernel.
References: [9, Book 3], TODO others

80 Fourier series finally converge

Eventually, the original Turn-me-into-a-Fourier-series operator returned to the torus T = R/Z with some new conditions on the functions for which he would compute the Fourier series. Recall, continuity wasn't good enough to get the Fourier series to converge, but the Turn-me-into-a-Fourier-series operator made still stronger conditions on f that guarantee pointwise convergence everywhere. Here were some of his sufficient conditions:
1. (Dini) Define the modulus of continuity ω_f(t) := sup_{h>0} |f(t) − f(t + h)|. If ∫_T ω_f(t)/t dt < ∞, then the Fourier series S_n(f) converges pointwise to f.
2. (Jordan) If f is of bounded variation in a neighborhood of x, then S_n(f)(x) converges to ½[f(x⁺) + f(x⁻)].
3. We can even get uniform convergence(!). For 0 < α ≤ 1, let Lip_α(T) be the space of functions f : T → C such that |f(x) − f(y)| ≤ C|x − y|^α for some C. This is the Hölder-Lipschitz condition. If f ∈ Lip_α(T), then the Fourier series S_n(f) converges pointwise to f, uniformly on T. Note this covers Lipschitz functions, and hence continuously differentiable functions.
There were also some conditions invoking absolute summability of the Fourier coefficients, but no one remembered them after the Turn-me-into-a-Fourier-series operator stated them.
[TODO maybe prove Jordan (2) but not prove all the tauberian stuff for it]
Those were kind of restrictive conditions (compared to all the integrable functions in L¹) though, so the functions demanded the Turn-me-into-a-Fourier-series operator go back to almost everywhere convergence and accept a lot more

functions. Recall that Kolmogorov showed we can't get a.e. convergence for all of L¹(T). But at least the Turn-me-into-a-Fourier-series operator eventually found out about a good positive result for L^p spaces, p > 1.

Theorem 65 (Carleson-Hunt 1960s). If f ∈ L^p(T), p > 1, then S_n(f)(x) → f(x) pointwise a.e.

Think we're going to include the proof? Not a chance! The theorem is considered one of the hardest theorems in analysis. We're certainly not including the proof.³ Showing a.e. convergence of Fourier series in L^p(T) is equivalent to showing the Carleson operator sup_n |S_n f| is a bounded operator
$$\sup_n |S_n f| : L^p(T) \to L^{p,\infty}(T),$$
i.e. a weak-type (p, p) operator. This is what Carleson proved in 1966 for p = 2. Since the Carleson operator is a maximal function, satisfying the weak-type inequality gives us a.e. convergence. (Recall Section 57. We already have convergence on the dense set of trig polynomials.)

Figure 10.13: Believe it or not, this function (in black) is in L²(T) and hence has a Fourier series that converges to itself pointwise a.e. The first few partial sums don't look so great, though.

Anyway, after this result, the functions on the torus were satisfied and allowed the Turn-me-into-a-Fourier-series operator to stay. Well, the functions in L¹ but no other L^p space weren't very happy, but there wasn't much they could do.
³ There is a simpler proof of Carleson's theorem by Lacey and Thiele [6] from 2000, but it still requires much more harmonic analysis and technical details than we want here. The goal of a quarter-long class at Caltech, which had graduate analysis as a co/pre-requisite, was to go over the harmonic analysis background related to the proof.

So they sought out a new operator, who called himself the Turn-me-into-a-Fourier-series-I-luv-L^p operator. This new operator promised L^p convergence of Fourier series⁴, that is, convergence in L^p norm, but only for 1 < p < ∞. Note, convergence in L^p norm implies that some subsequence of the Fourier series converges pointwise almost everywhere. The functions only in L¹ and no other L^p space were again pretty annoyed and jealous.
Of all the L^p spaces, our favorite is L² (it's the Hilbert space), so we'll start there. First note that {e^{2πinx}}_{n∈Z} is an orthonormal (Schauder) basis for L²(T): Since continuous functions, which are dense in L², can be approximated uniformly by trig polynomials, we get that trig polynomials are also dense in L²(T). Orthonormality of {e^{2πinx}} is easy to check by computation.

Theorem 66 (Parseval). If f, g ∈ L²(T), then
$$\int_T \overline{f(x)}\, g(x) \, dx = \sum_{k \in \mathbb{Z}} \overline{\widehat{f}_k}\, \widehat{g}_k.$$
In particular, the map f ↦ {f̂_k} is an isometry L² → ℓ², i.e.,
$$\|f\|_2^2 = \sum_{k \in \mathbb{Z}} |\widehat{f}_k|^2.$$
This immediately gives us convergence in L² norm, since
$$\|f - S_n(f)\|_2^2 = \sum_{|k| > n} |\widehat{f}_k|^2 \xrightarrow{\; n \to \infty \;} 0.$$
More generally, if {u_j}_{j∈J} is an orthonormal basis for a Hilbert space H, then for every f ∈ H,
$$\|f\|^2 = \sum_{j \in J} |\langle u_j, f \rangle|^2.$$
This comes from decomposing f into its basis elements, f = Σ_{j∈J} ⟨u_j, f⟩ u_j (or, if we are physicists, f = Σ_{j∈J} |u_j⟩⟨u_j|f⟩).
[TODO is there stuff to prove here?]

Remark 17. A quick detour: One fun thing we can do with Parseval is to sum the infinite series Σ_{n=1}^∞ 1/n². We need to find a function f whose Fourier coefficients look like 1/n. It turns out our friend the saw-tooth wave f(x) = x on [0, 1] works: After integrating by parts,
$$\widehat{f}_n = \int_T x e^{-2\pi i n x} \, dx = \frac{i}{2\pi n}, \qquad n \ne 0,$$
and f̂₀ = 1/2. Thus by Parseval,
$$\frac{1}{3} = \|f\|_2^2 = \frac{1}{4} + \sum_{n=1}^{\infty} \frac{2}{4\pi^2 n^2},$$
so
$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}.$$
This method extends to summing Σ_{n=1}^∞ 1/n^{2k}, k ∈ N, by using x^k on [0, 1] along with results for all the lower even powers.
⁴ This new operator computed the same Fourier series as the original Turn-me-into-a-Fourier-series operator, but guaranteed L^p convergence instead of pointwise or pointwise a.e. convergence. So it's technically the same operator, but since we're personifying them, this one cares about L^p convergence while the other one cares about pointwise or pointwise a.e. convergence.
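(And a one-line numerical check, not from the text, that nothing went wrong in the bookkeeping:)

```python
import math
print(sum(1.0 / n**2 for n in range(1, 100_001)), math.pi**2 / 6)
# 1.6449240668... vs 1.6449340668...: the truncated sum trails by about 1/100000.
```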

Now back to convergence in norm: What about other L^p spaces? First we have a useful theorem:

Theorem 67. S_n f converges to f in L^p norm, 1 ≤ p < ∞, if and only if the S_n are uniformly bounded as operators on L^p, i.e. there is K so that for all f ∈ L^p and n > 0,
$$\|S_n(f)\|_{L^p} \le K \|f\|_{L^p}.$$
TODO proof outline, uniform boundedness principle

From this, we can see we don't get convergence in L¹ norm, since sup_{‖f‖=1} ‖S_n(f)‖_{L¹} = ‖D_n‖_{L¹} ≳ log n → ∞ as n → ∞, by using f = F_k the Fejer kernel⁵. Norm convergence in L^∞ (uniform convergence) also certainly doesn't hold, since the uniform limit of continuous functions is continuous. But it turns out, norm convergence holds for 1 < p < ∞!
⁵ Details: sup_{‖f‖₁ = 1} ‖S_n f‖₁ ≥ ‖D_n * F_k‖₁ → ‖D_n‖₁, since D_n * F_k converges uniformly to D_n. The upper bound sup ‖S_n f‖₁ ≤ ‖D_n‖₁ comes from using the convolution and Fubini.

Theorem 68. S_n f converges to f in L^p norm if 1 < p < ∞.

The proof is much more work and involves a lot more harmonic analysis than the case p = 2, so we'll omit it. The moral of the story is to avoid purely L¹ functions, who aren't always very friendly.
References: Pointwise convergence: [4], Norm convergence: [2, 4, 1]

81 Introduction to the Fourier Transform

The functions in L^p(T), p > 1, were satisfied with the results about their Fourier series from Section 80. They were proud of living on the torus T = R/Z, and of how their periodicity enabled them to determine their Fourier series (with the help of the Turn-me-into-a-Fourier-series or Turn-me-into-a-Fourier-series-I-luv-L^p operator). One day, some of them decided to venture away from the torus R/Z and visit their relatives on R. At first, the functions from T = R/Z bragged to their relatives about how they could figure out their Fourier series because they lived on the torus, and how sad it must be that integrable functions on R couldn't do this. But the functions in L¹(R) retorted, "We don't need your Fourier series. We have the Fourier transform!"

Figure 10.14: The Fourier transform is not to be confused with the Fourier trans-
former, who turns into a car with four wheels, four seats, four windows, and four-wheel
drive. It is also not to be confused with the Fouriest transform, which is where you
write a number in the base system in which it has the most fours. (Check out SMBC
2874.)

Figure 10.15: What transform do you apply to turn a sphynx cat into a Norwegian
forest cat? Answer: The furrier transform!

"...what's that?" replied the functions from the torus.
"Well, based on what you've told us about Fourier series, the Fourier transform is something like a continuous version of Fourier series. We still try to decompose a function in terms of its frequencies. (Figure 10.16.)
"It's very widely used in engineering, particularly electrical engineering, to analyze signals and waves. In fact, you can even view it as a generalization of Fourier series. The Fourier transform doesn't have to be on R; you can Fourier transform on any locally compact abelian group. (Section 82.) The cases R

Figure 10.16: Viewing a function in the time domain (red) and the frequency domain (blue). The idea of Fourier series and the Fourier transform is to take a function in time and decompose it into its various frequency parts.

and your home R/Z are just specific examples of spaces where you can Fourier transform. On R/Z, the Fourier transform just gives you your Fourier series."
The functions from the torus were amazed! And also pretty disappointed. They had thought they were so special since they had Fourier series, and now they learned that pretty much any function that anyone cared about could get one. Plus they didn't know what a locally compact abelian group was.
The functions on R went on to explain the greatness of the Fourier transform. The convention they chose to use is the one where the exponent has ipx instead of 2πipx. This is not the good convention, but it should agree with most physics conventions, at least in quantum mechanics. The good convention is to put the 2π in the exponent instead, which makes a lot of the annoying factors in various places disappear. With the physics convention, for f ∈ L¹(R^n), its Fourier transform is
$$\widehat{f}(p) := \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} f(x) e^{-ipx} \, d^n x, \tag{10.2}$$
where px = p · x = p₁x₁ + · · · + pₙxₙ. The Fourier transform f̂ is a function of a new variable p ∈ R^n, so sometimes we talk about "moving to Fourier space" to mean taking a Fourier transform and getting a function of p, where p lives in Fourier space.
When analyzing signals, the Fourier transform takes you from time space to frequency space. In quantum mechanics, the Fourier transform takes you from position (x) space to momentum (p) space.
position (x) space to momentum (p) space.
Despite some confused stares from the functions from the torus, the func-
tions on R went on to describe some fun facts about the Fourier transform that
make it especially useful. Roughly, the Fourier transform turns differentiation
into multiplication by p and translations into modulations (multiplication by a
fig:fourier-translates
phase), and vice versa. (Figure 10.18).sec:riemann-lebesgue
It also turns convolutions into multipli-
cation, which well address in Section 83.
81. INTRODUCTION TO THE FOURIER TRANSFORM 233

Figure 10.17: The 2D box function f(x, y) = χ_{[−1,1]×[−1,1]}, along with the magnitude of its Fourier transform f̂(p₁, p₂) = (2/π) sin(p₁) sin(p₂)/(p₁ p₂).

Lemma 12. For f, g ∈ L¹(R^n) and a ∈ R^n,
$$\widehat{f(\cdot + a)}(p) = e^{iap} \widehat{f}(p), \tag{10.3}$$
$$\widehat{e^{ixa} f(x)}(p) = \widehat{f}(p - a). \tag{10.4}$$
If we also have f ∈ C¹(R^n) with lim_{|x|→∞} f(x) = 0 and the partial derivative ∂_j f ∈ L¹(R^n), then
$$\widehat{\partial_j f}(p) = i p_j \widehat{f}(p). \tag{10.5}$$
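Here is a quick numerical sketch of these properties in one dimension (not from the text; the box function, the sample point p, and the shift a are arbitrary choices), using the physics convention (10.2):

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 200_001)
dx = x[1] - x[0]
box = ((x >= -1) & (x <= 1)).astype(float)             # chi_[-1, 1]

def ft(values, p):
    # physics convention: (2 pi)^(-1/2) * integral of f(x) e^{-ipx} dx
    return np.sum(values * np.exp(-1j * p * x)) * dx / np.sqrt(2 * np.pi)

p, a = 1.7, 0.3
print(ft(box, p), np.sqrt(2 / np.pi) * np.sin(p) / p)   # closed form, same value
shifted = ((x + a >= -1) & (x + a <= 1)).astype(float)  # the translate f(x + a)
print(ft(shifted, p), np.exp(1j * a * p) * ft(box, p))  # (10.3): they agree
```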

The functions on R went on and on about how useful the Fourier transform is: So if we need to do something complicated like differentiation, we can instead move to Fourier space and just multiply by ip_j. Or if we need to do, for example, a translation in x space, then we can move to Fourier space and multiply by a phase e^{iap} instead. We'll give an application of this relationship in Section 86.
Finally, after watching most of the functions on T fall asleep, they decided to answer one very important question: The Fourier transform gets you from x
space to p space, but how do you get back?
Theorem 69 (Fourier inversion). The Fourier transform is a bounded injective map L¹(R^n) → C₀(R^n) (the space of continuous functions vanishing at ∞). It has inverse
$$f(x) = \lim_{\epsilon \to 0} \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{ipx - \epsilon|p|^2/2} \widehat{f}(p) \, d^n p, \tag{10.6}$$
where the limit is taken in L¹(R^n). In particular, if f, f̂ ∈ L¹(R^n), then
$$(\widehat{f})^{\vee} = f, \tag{10.7}$$
where
$$\check{f}(p) = \frac{1}{(2\pi)^{n/2}} \int_{\mathbb{R}^n} e^{ipx} f(x) \, d^n x = \widehat{f}(-p). \tag{10.8}$$

Figure 10.18: A translation in space x creates modulation in frequency p, and vice versa. The left column shows the translation or modulation of the box function f from Figure 10.17, and the right column shows the magnitude of the resulting Fourier transform. Phase is shown in color, with red corresponding to real values. The top row shows a translation of f, the middle row modulation, and the bottom row a combination of translation and modulation.

So for sufficiently nice f (e.g. Schwartz, Section 83), we have Equation 10.8, which implies f̂̂(x) = f(−x), so that the Fourier transform has period four. Its inverse is then just applying the Fourier transform three times.
Now we have a way to get from Fourier space back into regular space, although we do need f̂ ∈ L¹ if we want to use the nice and simple Fourier inversion formula (10.8). The proof of the Fourier inversion theorem uses the fact that the Fourier transform of a Gaussian is another Gaussian. We add in the Gaussian (2π)^{−n/2} e^{−ε|p|²/2} to make convergence work, then use Fubini, modulation, and L¹ convergence with convolutions to get the formula (10.6).
Finally, a note on the normalization conventions for the Fourier transform: Whenever we do anything involving physics, we'll use the definition that we gave here. But when we get to more hard-core harmonic analysis, we might put the 2π in the exponent instead, which is the good place to put it and gets rid of the normalization constant (2π)^{−n/2}. To avoid some confusion, we'll use ξ instead of p when we do this.
References: [Teschl-fa]

82 Fourier series vs. the Fourier transform (and LCA groups)

After spending some time visiting each other, the functions on T and the functions on R sat down and started talking about the relationship between Fourier series and the Fourier transform. Even though we only defined Fourier series for f on R, it can be defined similarly for f on R^n, just like the Fourier transform. (Figure 10.19.)

Figure 10.19: A plot of Re(e^{2πix} e^{2πi·2y}), with the imaginary part indicated by the color. In 2D, we integrate against e^{2πi n·x} = e^{2πi n₁ x} e^{2πi n₂ y}, so our 2D waves are products of waves in 1D.

To start off the discussion, the functions on R began with the analogue of

Parseval's relation. Recall, this was in Section 80. One form said
$$\|f\|_2^2 \equiv \int_T |f(x)|^2 \, dx = \sum_{n \in \mathbb{Z}} |\widehat{f}_n|^2 \equiv \|\widehat{f}\|_2^2.$$
For the Fourier transform, we get the following (which follows quickly from Fubini's theorem): [TODO check proof?]

Theorem 70 (Plancherel). If f, f̂ ∈ L¹(R^n), then f, f̂ ∈ L²(R^n) and
$$\|f\|_2^2 \equiv \int_{\mathbb{R}^n} |f(x)|^2 \, d^n x = \int_{\mathbb{R}^n} |\widehat{f}(p)|^2 \, d^n p \equiv \|\widehat{f}\|_2^2. \tag{10.9}$$
[TODO proof for f in L², TODO Plancherel with \bar{g}f instead of |f|²]


Other results for Fourier series have an analogue for the Fourier transform also. For example, in Section 83, we'll look at the Riemann-Lebesgue lemma for both Fourier series and the Fourier transform, and in Section 85, we'll quickly revisit Carleson's theorem, this time for the Fourier transform. Really though, the most important shared property is Fourier inversion. For T, that was just the Fourier series, f(x) = Σ_{n∈Z} f̂_n e^{2πinx}. For R^n, modulo conventions for where the 2π's go, it is f(x) = ∫_{R^n} f̂(ξ) e^{2πiξx} d^nξ. The similarity of these formulas is not a coincidence! For this, we'll investigate how the Fourier series and Fourier transform are related. When convenient, we'll work in R (rather than R^d) for this.

82.1 Expanding the interval


This is the hand-wave-y, not-as-general way. In some sense, the Fourier transform can be viewed as the limit of a Fourier series as the period T gets large. When we did Fourier series earlier, we worked on the torus R/Z, or equivalently, with 1-periodic functions on R. We can instead look at T-periodic functions, and modify our definition to be
$$a_n = \frac{1}{T} \int_{-T/2}^{T/2} f(x) e^{-2\pi i n x / T} \, dx,$$
and try to write
$$f(x) = \sum_{n \in \mathbb{Z}} a_n e^{2\pi i n x / T}. \tag{10.10}$$
For the Fourier transform, we don't want to try to Fourier transform a periodic function on R, since it won't be in L¹ unless it's zero a.e. So we'll look at f supported on [−T/2, T/2]. Then if we integrate over R, it reduces to an integral over [−T/2, T/2],
$$\widehat{f}(\xi) = \int_{-T/2}^{T/2} f(x) e^{-2\pi i \xi x} \, dx.$$

In fact we find that the Fourier coefficients satisfy a_n = (1/T) f̂(n/T). (Figure 10.20.) If T = 1, which corresponds to R/Z, then we see that the Fourier series coefficients are the Fourier transform sampled at the integers. We can rewrite (10.10) as
$$f(x) = \frac{1}{T} \sum_{n \in \mathbb{Z}} \widehat{f}\left(\frac{n}{T}\right) e^{2\pi i (n/T) x}.$$
This is something like a Riemann sum (on the unbounded interval R though) with spacing Δξ = 1/T, so letting T become large, we get something like
$$f(x) = \int_{\mathbb{R}} \widehat{f}(\xi) e^{2\pi i \xi x} \, d\xi,$$
which is the Fourier inversion formula!
which is the Fourier inversion formula!

Figure 10.20: The imaginary part of the Fourier transform of f(x) = x·χ_{[−T/2, T/2]}, multiplied by 1/T. (The real part is zero.) Evaluation at the points n/T, n ∈ Z, and multiplication by i retrieves the Fourier series coefficients a_n = (1/T) f̂(n/T). TODO replace with an L¹ function. Then comment: As T → ∞, the sampled points get closer and closer, making the Fourier series sum look more and more like the Fourier transform.

82.2 Fourier transform on LCA groups


The actual good way to look at the relationship between Fourier series and the
Fourier transform is to look at the Fourier transform on more general spaces.
The Fourier series and the Fourier transform on Rd will then just be special
examples. The key idea of both Fourier series and the Fourier transform was
to decompose the function in terms of special nice functions, exponentials in
these cases: A Fourier series takes a periodic function and expresses it in terms
of a discrete sum of exponentials, while the Fourier transform takes a function
and expresses it as a continuous sum (integral) of exponentials via the Fourier
inversion theorem. More generally, we can try a similar decomposition on any
locally compact abelian (LCA) group. This is an abelian group with a topology where the group operation and taking inverses are continuous, and the space is locally compact and Hausdorff. R^n and T with the normal Euclidean topology are examples. Our plan is to do Fourier analysis on LCA groups.

[TODO tree with Fourier stuff]

Figure 10.21: The most general we go here is the Fourier transform on LCA groups.

Recall that for f : T → C, the Fourier transform (Fourier series coefficients) f̂_n lives on Z, and that for f : R^d → C, the Fourier transform f̂ lives on R^d. In general, for an LCA group G, the Fourier transform lives on the dual group Ĝ. What is Ĝ?

Definition 53. Let G be a topological group (a group with a Hausdorff topology where the group operation and taking inverses are continuous). A character of G is a continuous homomorphism χ : G → S¹. The dual group Ĝ is the space of all characters of G.

For example, χ_n : x ↦ e^{2πinx} for n ∈ Z is a character of the torus T = R/Z ≅ S¹, and χ_ξ : x ↦ e^{2πiξ·x} for ξ ∈ R^d is a character of R^d. In fact, these are the only characters for these spaces!

Proposition 47. The characters of T are exactly the χ_n, n ∈ Z, and the characters of R^d are exactly the χ_ξ, ξ ∈ R^d.
Proof. TODO proof
Theorem 71 (Pontryagin duality).
We need one last thing: remember for T or R^d, we recover f from f̂ by integrating against characters (Fourier series, Fourier inversion). For T̂ = Z, the measure was the counting measure so we had a sum, while for R̂^d = R^d, the measure was the Lebesgue measure so we had an integral. How do we choose which measure to integrate against?

which measure to integrate against?
Theorem 72 (Existence of Haar measure). Every LCA group G has a unique
(up to scalar multiplication) nontrivial, regular, translation invariant Borel mea-
sure that is finite on compact sets. This is called Haar measure.
Conveniently, for Z, this is the counting measure (up to a scalar multiple),
while for Rd , this is the Lebesgue measure (up to a scalar multiple).
TODO: idea of proof
Now we can define the Fourier transform:
Definition 54 (Fourier transform). Let G be an LCA group with Haar measure
. The Fourier transform on L1 (G) is defined by
Z
b
f : G C, 7 f (x)(x) (x). (10.11)
G
1
If G is finite, sometimes there is a normalization factor |G| .
82. FOURIER SERIES VS. THE FOURIER TRANSFORM (AND LCA GROUPS)239

Figure 10.22: What is a pirates favorite measure on a locally compact topological


group? Answer: Haaaarrrrrr measure! [TODO cite http://www.math.utoronto.ca/
ashao/ for joke?]

TODO define dual measure

Theorem 73 (Plancherel). Let G be an LCA group with Haar measure and


dual measure . The Fourier transform is a unitary operator on L2 (G), and
Z Z
2
|f (x)| d = |fb(x)|2 d, f L2 (G). (10.12)
G G

TODO Thm inversion


Finally, theres some other Fourier transform friends we havent met yet,
but that are important in areas like computer science and electrical engineer-
ing. These are the discrete time Fourier transform and the discrete Fourier
transform. Heres a table summarizing what they really are (fun fact: The dual
group of a compact group is discrete, and the dual group of a discrete group is
compact):
G G
Fourier series T Z
Fourier transform R R
Discrete time Fourier transform Z T
Discrete Fourier transform Z/nZ Z/nZ
The 2D discrete Fourier transform is often used in image processing. Roughly,
low frequencies correspond to large, slowly changing objects in the image, while
high frequencies correspond to sharp changes and contrasts. Some notable ap-
plications:
240 CHAPTER 10. FOURIER ANALYSIS

Noise removal: Sometimes pictures have periodic noise. By taking the


Fourier transform, we can identify these undesired periodic elements easily,
remove them in Fourier space, and then use the inverse Fourier transform
fig:fourier-noise1
to produce the corrected image. For an example, see Figures 10.26 and
fig:fourier-noise2
10.27.

High and low pass filtering: High pass filtering can be used for edge detec-
tion since rapid changes correspond to high frequencies. Low pass filtering
can be used to blur an image.
Changing contrast: We can change the contrast of an image by raising the
magnitude of the Fourier transform to a power. This will emphasize high
or low frequencies over the other.
Image processing software like the GMIC plugin for Gimp [TODO link] will let
you experiment with Fourier transforming an image.

Figure 10.23: Original image (left) and power spectrum (right, aka magnitude of
the Fourier transform [TODO or norm squared??]) for some simple pictures. The grid
looks like a 2D wave and has just a few (hard-to-see) components in frequency space.

[TODO public domain panda?]


SteinShakarchi-fa Katznelson
References: Expanding interval: [10]; Fourier transform on LCA groups: [4],
others!!
82. FOURIER SERIES VS. THE FOURIER TRANSFORM (AND LCA GROUPS)241

Figure 10.24: Rotations: Original image (left), power spectrum (middle), phase
(right).

Figure 10.25: Rectangles, translations, and more rotations


242 CHAPTER 10. FOURIER ANALYSIS

Figure 10.26: We start with a picture where every other horizontal line has been
replaced with white pixels. We will use the discrete Fourier transform to remove the
lines, essentially by blurring them. The magnitude of the discrete Fourier transform
is shown on the right. The bright areas at the top and bottom indicate a periodicity
in the original image; these came from the horizontal lines. In a normal picture, the
fig:fourier-noise1 bright area is generally concentrated at the center only.

Figure 10.27: We correct the image by coloring over the bright areas at the top and
bottom. The phase of the Fourier transform (not shown) is unchanged. Applying the
inverse Fourier transform yields the picture on the right, in which the horizontal lines
fig:fourier-noise2 are removed.
83. SCHWARTZ FUNCTIONS AND TRADING SMOOTHNESS FOR DECAY243

83 Schwartz functions and trading smoothness


for decay
sec:riemann-lebesgue
[TODO this section needs a lot of work -lhs] [Should this section even exist?
should i just comment out the whole thing... -lhs 2 July 2016]
Think of your favorite family of nice functions on Rn . Now imagine they are
all the infinitely differentiable functions that decay rapidly as |x| . These
are known as the Schwartz functions. Fourier analysis on Schwartz functions
works wonderfully.

Definition 55. Let f C (Rn ) (f has partial derivatives of arbitrary order),


and let = (1 , . . . , n ) Nn0 . Set

|| f
f := , x = x n
1 xn ,
1
|| = 1 + + n ,
x
1
1
x
n
n

i.e. we take 1 partial derivatives wrt x1 , 2 partial derivatives wrt x2 , etc.


The Schwartz space is all functions f C (Rn ) whose derivatives of all orders
decay faster than any inverse power of x:

S(Rn ) := {f C (Rn ) : sup |x ( f )(x)| < , , Nn0 }. (10.13)


x

It is a dense subspace of Lp (Rn ), 1 p < , since it contains Cc (Rn ).

TODO picture of Schwartz function

2
Figure 10.28: One of our favorite Schwartz functions on R is f (x) = ex . Its
2
relative f (x) = e|x| is a Schwartz function on Rn . And of course any Cc function
is Schwartz. Schwartz functions are very friendly.

Heres some convenient facts about the Fourier transform on the space of
Schwartz functions:

Theorem 74. The Fourier transform is a bijection b: S(Rn ) S(Rn ).

Proof. First, the Fourier transform maps Schwartz functions to Schwartz func-
lem:fourier-properties
tions: Recall Lemma 12, which told us (j f ) (p) = ipj fb(p), and use that
p ( fb)(p) = i|||| ( x f (x) (p) is bounded since x f (x) S(Rn ). For
thm:fourier-inversion
the bijection part, use Fourier inversion (Theorem 69), which we conveniently
didnt prove.

Using Fubinis theorem, we also get:

Proposition 48. Let f, g S(Rn ). Then f g S(Rn ) and

(f g) (p) = (2)n/2 fb(p)b


g (p).
244 CHAPTER 10. FOURIER ANALYSIS

Proof.
Z Z
1
dx dt f (x t)g(t)eipx
(2)n/2 Rn Rn
Z Z
1
= dt dx f (x t)g(t)eip(xt) eipt
(2)n/2
= (2)n/2 fb(p)b
g (p).

It turns out the Fourier transform takes smoothness of f and turns it into
fast decay for fb. As we just saw, for Schwartz functions f that are infinitely
differentiable, the Fourier transform fb is again Schwartz and decays quickly.
The general idea is that
smoothness of f fast decay of fb.
[TODO vice versa? Paley-Wiener?] Well start with no smoothness, just as-
suming f L1 , and then work our way up to C k and C .
Lemma 13 (Riemann-Lebesgue). If f L1 (Rn ), then |fb(p)| 0 as |p| .
In fact, the Fourier transform maps L1 (Rn ) into C0 (Rn ), the space of continuous
functions decaying at infinity.
We actually already stated (without proof) a stronger version of this lemma
thm:fourier-inversion
in Theorem 69 about Fourier inversion.
dk f
Corollary 6. Let f C k (R) L1 (R), and suppose dxk
L1 . Then

fb(p) = o(pk ).
k
d k kb 1
TODO check this ( dx k f (x)) (p) = i p f (p) 0 since f L .

Proposition 49. TODO prop about decay of f-hat corresponding to derivatives


[TODO temporary removal]
The same principle of linking smoothness of f to decay of fb holds for Fourier
series as well.
Lemma 14 (Riemann-Lebesgue for Fourier series). If f L1 (T), then |fbn | 0
as n .
The proof idea is to first show it for continuous functions using some results
about Fourier series and Fejers kernel, then use density to extend to all L1 .
The Riemann-Lebesgue lemma gives us fun results, like for any measurable
set A [0, 1],
Z Z
n+
sin(2nx) dx = cos(2nx) dx 0.
A A

This is nice, but we can do better: just like for the Fourier transform, the rate
of decay depends on the smoothness of f .
84. UNCERTAINTY PRINCIPLES 245

Proposition 50. If f C k (T), then fbn = o(nk ). In particular, if f C (T),


then fbn decays faster than any polynomial in n.
The proof is by integrating by parts and applying Riemann-Lebesgue.
Remark 18 (TODO Method of stationary phase, maybe do function asym
expansion).
teschl-fa
References: Fourier transform: [13]

84 Uncertainty principles
One famous result in quantum mechanics is the Heisenberg uncertainty principle.
Roughly, it says that you cant know both the position and momentum of a
small particle to high accuracy. Morally, the better you know one of them,
the less you can know the other. Were going to ignore all the talk about
measurement in quantum mechanics and just do the math version, which we
can obtain from properties of thesec:quantum
Fourier transform. For some background in
quantum mechanics, see Section ??.
Once upon a time there was a quantum mechanical particle running around
in Rn . His position (location) wasnt given by a single point in Rn since he was
a quantum, not a classical,
R particle. His position was instead given by a wave
function (x) with Rd |(x)|2 dx = 1. In other words, (x) L2 (Rn ) with
kk2 = 1. The probability of finding this particle in a region E Rn is
Z
Prob(particle E) = |(x)|2 dx.
E
So were never entirely sure exactly where the particle is, but we have an idea
of where its likely to be. The particle himself was a little annoyed at this
uncertainty. For example, it made it difficult to set an address to meet other
particles or receive mail, but he realized he could localize himself to a small
region E if his wavefunction was very large on E and small outside of E.
Then he could say that he lives in E, and everyone would know where to find
him.
One day, the Fourier transformer was visiting quantum-mechanics-land and
ran into this particle. The particle explained his great plan to localize his
position. The Fourier transform looked skeptical. He asked the particle, But
what about your momentum? If you localize position too much, youll force your
momentum wavefunction to spread out and no one will have any idea where or
how fast youre going! What? replied the particle, whats this momentum
wavefunction? Why cant I just localize that too?
You cant just do that, explained the Fourier transformer. Your position
and momentum wavefunctions are related by the Fourier transform. Given a
position wavefunction (x), you can move to momentum space and get the
momentum wavefunction (p) b via the Fourier transform,
Z
b 1
(p) = (x)eipx dn x.
(2)n/2 Rn
246 CHAPTER 10. FOURIER ANALYSIS

Figure 10.29: The particle wants to live in E.

R
Then Prob(momentum K) = K |(p)| b 2
dp. Im ignoring some of the con-
stants since I dont visit quantum-mechanics-land that often and I dont like ~,
but you get the idea. But then there are uncertainty principles regarding the
Fourier transform that dont let you localize both space and momentum simul-
taneously. Theyre often stated and proved using operators and expected values,
but since Im currently visiting quantum-mechanics-land, Ill express them in
terms of the Fourier transform.
Theorem 75 (Heisenberg uncertainty principle). Suppose S(R) with kk2 =
1. Then for any x0 , p0 R,
Z  Z 
b 1
(x x0 )2 |(x)|2 dx (p p0 )2 |(p)| 2
dp . (10.14)
R R 4
What does this have to do with uncertainty? The expected value of an
operator A in the state is given by

E (A) := hA, i = h, Ai R. (10.15)

For the
R position operator X defined by X(x) = x(x), the expected value is
x = R x|(x)|2 dx. The variance, or uncertainty, is given by
Z
x2 = (x x)2 |(x)|2 dx,
R

and this is the quantity we are looking at in the Heisenberg uncertainty principle.
If is highly localized near x, then x2 will be small. The uncertainty principle
tells us that we cannot make both x2 and p2 very small at the same time, i.e. we
cannot localize on very small sets in both x and p space. If position is localized
to a region of size roughly R, then momentum cannot be localized on a scale
much smaller than R1 .
84. UNCERTAINTY PRINCIPLES 247

Proof (of HUP). First, we can replace (x) with eixp0 (x + x0 ) and change
variables, which lets us assume x0 = p0 = 0. Then we integrate by parts, using
S(R) and ||2 = :
Z Z Z
d d
1= 2 2
|(x)| dx = x |(x)| dx = 2 Re x(x) (x) dx.
R R dx R dx
Then we apply Cauchy-Schwarz to get
b
1 2kx(x)k2 kk2 = 2kx(x)k2 kp(p)k2.

The inequality is sharp, with equality holding for modulated Gaussians


2
Ce2ip0 x ea(xx0 ) . To show these are the only functions for which equality
holds, use the condition for equality in Cauchy-Schwarz.
We canfig:heisenberg-box
look at localization visually by looking at the time-frequency plane
in Figure 10.30. The visualizing only really works well for R, since otherwise
wed need Rd Rd .
Remark 19. Weve been talking about localizing in a pretty vague sense. We
generally dont mean literally supported on some set E, but we do mean that
most of the mass of the function lives in E. For localizing to an interval I, we
mean maybe something like,
C
|f (x)|  100 ,
|xc(I)|
|I|N 1 + |I|

for some C, N and where c(I) is the center of the interval I. (This is called C-
adapated of order N to I, [TODO cf cite].) But since were going to talk about
localization in a pretty vague sense, we wont concern ourselves with trying to
find a precise definition.
Heres another kind of uncertainty principle. If you localize yourself to a
bounded region E with probability 1, then youre guaranteeing that your mo-
mentum is pretty spread out for any magnitude you can think of, theres going
to be nonzero probability that your momentum is larger than that.
Theorem 76. There is no nonzero f L1 (R) which is compactly supported
whose Fourier transform is also compactly supported.
Proof. Theres a really nice proof that uses some complex analysis. You can
prove it without complex analysis, but the complex analysis proof is so nice
were going to use that one.
If f has compact support, then
R we can extend the Fourier transform fb to
6 b 1 itz
an entire function f (z) = 2 R e f (t) dt. The zeros of a nonzero entire
6 Say f is supported on [a, b]; then
Z bX X (iz)n Z b
izt)n
Z
eitz f (t) dt = f (t) dt = tn f (t) dt,
R a n n! n
n! a

where interchange of sum and integral is justified since everything converges absolutely.
248 CHAPTER 10. FOURIER ANALYSIS

p p

x x

Figure 10.30: A Heisenberg box in the time-frequency plane (here the space-
momentum plane). A small width corresponds to localization in position space, while
a small height corresponds to localization in momentu0m space. By the Heisenberg
uncertainty principle, the area of the rectangle must be at least 12 . One fun thing
to do is to take a Heisenberg box, and see how it changes under transformations like
fig:heisenberg-box dilation, modulation, translation, etc. TODO make a separate pictuer for this

2 y

1.5

0.5
x
2 1 1 2

Figure 10.31: Some Gaussian wavefunctions. The black curve is more localized at 0
than the blue curve. Gaussians obtain equality in the Heisenberg uncertainty principle.
TODO add a picture of their Fourier transforms spreading out

function must be discrete, so fb cannot be compactly supported. Tada! Complex


analysis is magic.

On a related note, a physicist might say that if you take the dirac delta
function, which is the most compactly supported function (supported on
a single point), then its Fourier transform is the most un-compactly supported
function, a plane wave eikx , which doesnt even get close to zero. Although the
dirac delta function is not an actual function on Rn , we can make sense of
it by viewing it as a distribution, and then we can still Fourier transform it in
the distributional sense to make sense of what the physicsts mean when they
say they are Fourier transforming the dirac delta. We will tell the tale of Dirac
85. WHAT ABOUT L2 ? 249

sec:distributions
sec:dirac-fourier
delta in Sections 88 and 89.
teschl-fa
SteinShakarchi-fa
tao-notes
References: Uncertainty principles: [13], [10], [11]. More quantum mechan-
teschl-quantum
ics: [12]
[TODO localization reference? Thiele wave packets?]

85 What about L2 ?
sec:fourier-L^2
Weve defined the Fourier transform for f L1 (Rn ), but what about f
L2 (Rn )? This is, for example, the space that physicists tend to care about in
quantum mechanics. Obviously theres no problem for f L1 (Rn ) L2 (Rn )
since we originally defined the Fourier transform for L1 . Conveniently, this space
is dense in L2 (Rn ) since it contains Cc (Rn ). So there exists hope to extend
the Fourier transform to L2 .
By now, the L2 functions were getting rather impatient, since they knew they
were in the best Lp space and wanted to know their Fourier transformss. Recall
thm:plancharel
Theorem 77 (Plancharel), which said that if f, fb L1 (Rn ), then f, fb L2 (Rn )
and kf k22 = kfbk22 . In particular, kf k2 = kfbk2 holds for f S(Rn ). Now we
try to extend the Fourier transform from some dense set, say S(Rn ), to all of
L2 (Rn ) by defining
fb := lim fbN , (10.16) eqn:fourier-L2
N

where fN is a sequence in S(Rn ) converging to f in the L2 norm. We need to


show the limit fb exists though.

thm:plancharel Theorem 77 (Plancharel). The Fourier transform extends to a unitary opera-


torb: L2 (Rn ) L2 (Rn ).

An operator U is unitary if it is surjective and preserves norm, i.e. kU xk =


kxk. Equivalently, U is unitary if U U = U U = I, where U is the adjoint.
Unitary operators are generally quite nice.

Proof. First we need the BLT theorem:

Theorem 78 (BLT). Let A : D(A) X Y be a bounded linear operator and


let Y be a Banach space. If D(A) is dense, then there is a unique continuous
extension of A to X which has the same operator norm.

The proof is to note that a bounded operator maps Cauchy sequences to


Cauchy sequences, so well need Af := limn Afn , fn D(A). The rest of
the proof is to show the definition is independent of the sequence fn f , and
then use continuity of norm to show that the norm is the same.
Anyway, with our friend the BLT theorem, we immediately get that the
Fourier transform extends uniquely to a bounded operator on L2 (Rn ). Plancherels
identity is still true by continuity of the norm. Since the Fourier transform on
the dense space S(Rn ) is a bijection, the range is dense, and also closed since
250 CHAPTER 10. FOURIER ANALYSIS

Figure 10.32: The BLT theorem has nothing to do with a BLT (bacon, lettuce, and
tomato) sandwich. It is instead about extending a bounded linear transformation from
a dense space to the entire space.

a Cauchy sequence in the range translates to a Cauchy sequence in the domain


L2 . So we have a surjective isometry, which must be unitary.7

Now the functions in L2 (Rn ) were extremely happy! They really liked uni-
tary operators since unitary operators are really friendly and nice. It was also
convenient that for f L1 L2 , this Fourier transform on L2 agrees with the
Fourier transform definition for L1 . In practice, its often easier to just use
eqn:fourier-L2
(10.16) with fN L1 L2 not necessarily in S(Rn ), since then we can take
cut-offs fN := f BN (0) as the approximations.

Remark 20. L2 convergence and everything is fine, but what about pointwise
a.e. convergence? Could we just be lazy and set
Z N
1
fb(k) = lim f (x)eikx dx?
N 2 N

TODO: more info, existence


This is very closely related to Carlesons result about pointwise a.e. conver-
gence of Fourier series. Both are sometimes called Luzins conjecture. It holds
for Schwartz functions, and Carleson showed (in the 1950s/60s) that it extends
to all f L2 (R).

Example 18. Heres a well-known example of computing the Fourier transform


of a function in L2 \ L1 . Once upon a time there was a function f (x) = sinx x .
Now f didnt live in L1 (R) since | sin x| spends a decent amount of time near 1
or at least 12 , but f did live in L2 (R). Now f wanted to figure out his own
Fourier transform, but was mad that he couldnt use the pointwise formula for
L1 functions.
What f needed was an approximation by functions in L1 L2 . Letting
fN (x) := sinx x |x|N L1 (R) L2 (R), we see fN f in the L2 norm since x12
7 Injectivity follows from being an isometry, since 0 = kU (f g)k = kf gk.
85. WHAT ABOUT L2 ? 251

1.2
1.0
0.8
0.6
0.4
0.2

-20 -10 10 20
-0.2


Figure 10.33: Plots of x1 and sinx x . One way to see sinx x is not integrable over R
is to note that | sin x| 12 on [ 6 , 5 ] + 2Z. Trying to integrate even just over this
6
subset of R gives infninity, since sinx x looks like at least 2|x|
1
, which is not integrable
5
at infinity even if we only integrate over [ 6 , 6 ] + 2Z.

has a small tail to integrate. For each fN ,


Z N
b 1 sin x ipx
fN (p) = e dx.
2 N x

We can compute the pointwise limit of fbN using some trig identities (separate
into cases for |p| = 1, |p| < 1, |p| > 1) to get

0, |p| > 1
1
lim fbN (p) = 2 , |p| = 1
.
N 2

, |p| 1

TODO check normalization TODO finish, convergence in L2 , equivalence class


And f lived happily ever after.

fb(p)

1 1 p

Figure 10.34: f (x) = sin x


x
was glad to know its Fourier transform. Note that fb is
not continuous.

teschl-fa Lacey
References: Fourier transform on L2 : [13]. Carlesons theorem: [5]
252 CHAPTER 10. FOURIER ANALYSIS

86 Fourier transforms turn time-translates into


frequency modulations
sec:hrt
The functions in L2 (R) wanted to know how special they were. In particular,
they wanted to know how they compared to their time-frequency translates.
These relatives were of the form Mb Ta f , where Ta is translation by a, Ta f (t) :=
f (t a), and Mb is modulation by b, Mb (t) := e2ibt f (t). They were worried
that by taking a linear combination of some of their time-frequency translates,
they could be entirely reconstructed (up to a set of measure zero)! They werent
actually sure if this was a good or bad thing, but for some reason they all wanted
to make sure this wasnt the case. (This probably had to do with the fact that
they were looking at the Gabor system

G(f, ) := {Mb Ta f : (a, b) R R},

and hoping for independence since Gabor systems are often basis-like.) Any-
way, these L2 functions were really interested in finite linear independence since
it seemed simpler than allowing infinite sums. That is, they P were wondering
N
whether or not there exist c1 , . . . , cN C, not all zero, such that j=1 cj Mbj Taj f (x) =
N
0 a.e. The set of points = {(aj , bj )}j=1 can be represented in hte phase space
RR, where the x-axis is time and the y-axis is frequency. This let the functions
talk about the geometry of the points in .

Figure 10.35: Some points in time-frequency space. These four points form a trape-
zoid.

While the functions didnt know for sure whether all possible finite Gabor
systems were independent, they could at least obtain some results for simpler
cases based on the geometry of . The full statement they hoped for was this:
hrt
Conjecture 1 (Heil-Ramanathan-Topiwala [3]). If 0 6 f L2 (R) and R
R is a finite set of distinct points, then G(f, ) is finitely linearly independent.

One easy result they could prove was linearly independence of frequency
86. FOURIER TRANSFORMS TURN TIME-TRANSLATES INTO FREQUENCY MODULATIONS253

translates. The proof: Suppose we have the points {(0, k )}N


k=1 and the relation

XN
cj e2ij x g(x) = 0 a.e.
j=1

PN
If g 6 0, then j=1 ck e2ik x = 0 on a set of positive measure (where g(x) 6=
0). But if we extend this trig polynomial to x C, we get cj = 0 for all
j = 1, . . . , N by uniqueness of analytic functions8 (uncountable subsets of R
have an accumulation point). So we cant get linear dependence with only
frequency modulations.
Additionally,
Proposition 51. Time-translates g(x aj ) are finitely linearly independent.
This seems a bit harder than frequency modulations, which are easy to deal
with:
This seems a bit harder than frequency modulations, which were easy to
deal with. But, if we use the Fourier transform, we can change all of our time
translations into frequency modulations! Since were now doing time-frequency
analysis instead of physics, well take a different normalization for the Fourier
transform where we put the 2 in the exponent,
Z
b
f () := f (x)e2ix dx.
R
tao-notes
This is actually the good place to put the 2, according to Terence Tao [11].
This is where well put the 2 for the rest of this chapter, since were done with
physics for a bit. Well also use instead of p in hopes of avoiding confusion.
Now suppose we have the relation
N
X
cj g(x j ) = 0
j=1

in L2 (R). Since the Fourier transform on L2 is unitary, it preserves norms and


we can apply it to the above equation, using Td f () = e
2ix b
f (), to get
N
X
cj e2ix gb() = 0.
j=1

But from what we had for just frequency modulations, this implies gb() = 0, so
g 0 by Fourier inversion! 
To deal with the time-translates, we effectively took the points {(j , 0)} and
rotated them 90 counterclockwise to become {(0, j )}, which correspond to
fig:hrt-rotate
frequency-translates. (Figure 10.36.)
8 Sorry some complex analysis snuck in there.
254 CHAPTER 10. FOURIER ANALYSIS

TODO picture of rotating points on a line to the y-axis

fig:hrt-rotate Figure 10.36

In general, there are several things we can always do to to try to simplify


its geometry. For any linear transformation A : R2 R2 with det A = 1, there
is a unitary map UA : L2 (R) L2 (R) so that

G(f, ) independent G(UA g, A()) independent.

Such transforms are called metaplectic transforms. We can also translate g


Md Tc g and (c, d). In our case, we rotated by 90 counterclockwise,
which corresponds to the unitary map that is the Fourier transform.
There are several other special cases for which HRT is known to hold. For
example, in 1999, Linnell proved that HRT holds if is a lattice of the form
A(Z2 ), where A is any invertible
hrt
matrix R2 R2 .
References: [hrt-survey, 3].

87 The gradient meets Littlewood-Paley


As stated in the previous section, were going to use the Fourier transform where
we put the 2 in the exponent,
Z
b
f () := f (x)e2ix dx.
R

Once upon a time, there was a differential operator whose name was
Nabla. As expected, Nabla took a nice function f and returned its gradient,
f . Now Nabla was a fairly complicated operator, and so one day decided to
try to find out how to make himself simpler. Nabla was jealous of the really
simple multiplication operators like M : f 7 M f for a constant M R. But
unfortunately, Nabla didnt see an easy way to make himself simpler other than
doing something like only operating on constant functions.
One day, Nabla met the Fourier transformer. The Fourier transformer
promised to turn Nabla into a multiplication operator. Not as simple as the
multiplication by a constant operators, but at least it looked better than differ-
entiation. Nabla was excited! Except then the Fourier transformer transformed
him into Fourier space, using

(f ) () = 2i fb(). (10.17) eqn:fourier-nabla

Now youre just multiplication by 2i, said the Fourier transformer. Hey!
I dont want to have to live in Fourier space, complained Nabla. And he ran
away from the Fourier transformer (after finding the inverse Fourier transformer
to undo the transformation).
Now Nabla continued to wander around until eventually he came to 1-
dimensional R land and met the Littlewood-Paley (LP) square function. (You
87. THE GRADIENT MEETS LITTLEWOOD-PALEY 255

can imagine the same conversation occuring in Rn land with some minor mod-
ifications.) The LP square function takes a function f L2 (R) and spits out a
sequence {Pk f }kZ , where each Pk restricts f to frequencies near 2k .
What exactly are these Pk s? Let () be a radial bump function supported
on [2, 2] and equal to 1 on [1, 1], and let
() := () (2),
fig:fourier-lp
so is supported on {1/2 || 2}. See Figure 10.37 for possible graphs of
and .

()
1

0.5

-3 -2 -1 1 2 3

()
1

0.5

-3 -2 -1 1 2 3

fig:fourier-lp Figure 10.37: Plots of () (top) and () = () (2) (bottom) in R.

Using telescoping sums,


X
(/2k ) = 1,
kZ
fig:fourier-lp
for 6= 0, so we have a partition of unity. (Figure 10.37.) Each (/2k ) is
supported near || 2k . The Littlewood-Paley projection operator Pk is the
Fourier multiplier defined by
Pd k b
k f () = (/2 )f (). (10.18)
Since (/2k ) is supported near || 2k , Pk effectively localizes f to frequency
levels near 2k .
Because of the partition of unity, the LP square functions friend the LP
decomposer takes the pieces Pk f and decomposes f into a sum of them,
X
f= Pk f, (10.19) eqn:fourier-L-P-decomposition
kZ
256 CHAPTER 10. FOURIER ANALYSIS

2k
1

0.5

-3 -2 -1 1 2 3



fig:fourier-lp-all Figure 10.38: Plots of 2k
for k = 3, 2, . . . , 2.

where convergence is in L2 . We also get pointwise convergence since for any


fixed 6= 0, there are only finitely many terms where (/2k ) is nonzero. Each
term Pk f is a function with frequency (Fourier transform) localized around 2k
away from the origin.
The LP square function told Nabla that together, they could nearly fulfill
Nablas dream of become a nice multiplication operator. Or at least a sum of
things like that.
For each Pk f , the LP square function explained, fb is localized near || =
k
has frequencies near 2k . Recall in Fourier space, looks like
2 , i.e. f only eqn:fourier-nabla
multiplication (10.17). But Pk f localizes fb near = 2k , [TODO || vs ...] so
the multiplication 2i Pd kd k
k f () looks pretty much like 2i2 Pk f for near 2 .
Transforming back to normal space, we get roughly,

Pk f 2k Pk .

In other words, taking a derivative of a function that is highly localized in


frequency space is just like multiplying by the frequency.
Since theeqn:fourier-L-P-decomposition
LP square function (with his friend the LP decomposer) can make
f look like (10.19), Nabla can roughly have the decomposition,
X
2k Pk . (10.20) eqn:fourier-L-P-nabla
kZ

Nabla was pleased with this result, although he wanted a bit more precise result
eqn:fourier-L-P-nabla
to back up the LP square functions claim and (10.20). So the LP square
function gave him a proposition:
Proposition 52. For 1 p ,

kPk f kp 2k kPk f kp . (10.21)

(The notation A B means there are constants c1 , c2 so that A c1 B and


B c2 A.) This tao-notes
satisfied Nabla. See the reference for the proof.
References: [11, notes 2]
88. DIRAC DELTA GETS KICKED OUT OF FUNCTIONS ON RN LAND AGAIN257

88 Dirac delta gets kicked out of functions on


Rn land again
sec:distributions
Once upon a time there was this pesky Dirac delta function.
( Dirac delta often
, x = 0
visits physics-land, where he claims he is the function , obtained
0, x 6= 0
say by taking the limit of normalized Gaussians as the variance goes to zero
and the central peak goes to at zero. This definition as a function on Rn
would not be a problem, except that Dirac delta ( claims he has integral equal
, x = 0
to one. If it were literally just the function , then it would have
0, x 6= 0
integral 0 since we can ignore sets of measure zero like {0} when integrating.
So there is a problem, and it is really that dirac delta is not a function on R (or
Rn ).

0 x

Figure 10.39: Dirac delta tries to disguise himself as a function on R. Good luck.

1
Whenever dirac delta( visits a function-land like L (R), he tries to sneak
, x = 0
in by pretending to be . But as soon as anyone tries to integrate
0, x 6= 0
, they get suspicious. Heres a function that is zero a.e. whose integral is
nonzero...wait a minute, thats impossible, we have an imposter! And then they
kick out.
So dirac delta gave up trying to enter function-lands like L1 (Rn ) and wan-
dered far and wide until coming to measure-land. Here, he was accepted as an
atom measure. But this was quite restrictive! As an atom measure, dirac delta
always had to stay inside the integral. The other measures in measure-land
viewed x0 (x) := (x x0 ) as the atom measure that puts weight 1 at the point
x0 , and weight 0 everywhere else, i.e.
(
1, x0 E
x0 (E) = .
0, x0 6 E
258 CHAPTER 10. FOURIER ANALYSIS

R
Then the integral9 R f (x)(x x0 ) dx is just an integral with respect to the
(non-Lebesgue)R measure (x x0 ), and this produces the same answers that the
physicists get, R f (x)(x x0 ) dx = f (x0 ).
But wanted to be free of these restraints and live outside an integral.
So he kept wandering around to find a place where he could be accepted as
himself. He wanted to find a place where he could be some sort of function,
live outside an integral and not only as a measure, and compute derivatives
D. Eventually, made his way to distribution-land, and this is where he
decided to stay. The distributions in distribution-land explained how they were
defined. They started with the space of test functions, Cc () for some open
nonempty Rn , i.e. compactly supported, infinitely differentiable functions
on . These are super nice functions!nice enough that they come and visit all
the distributions in distribution-land all the time. The topology on this space
that we want is essentially uniform convergence of all derivatives on a certain
compact set. More precisely, m converges to iff there is a compact set K
so that m is supported in K for all m and for each Nn0 , m
uniformly on K.
Definition 56. A distribution is an element in the dual space of Cc (), i.e. a
continuous linear functional Cc () C.
Dirac delta immediately saw that he fit right in here as the distribution
: 7 (0), for Cc (R), and was quite happy with this fact. Of course
was still worried about differentiation all these physicists kept assigning
homework problems about his derivative, but he had no idea how to define it
himself! The other distributions in distribution-land assured him there was no
problem. The functions in Cc (Rn ) were so nice that whenever a distribution
needed to be differentiated, they put the differentiation on themselves instead.
For T a distribution,

(D T )() := (1)|| T (D ),

where Nn0 , || = 1 + + n , and D := (/x1 )1 (/xn )n . The


(1)|| makes the definition work with integration by parts. Basically, if theres
a derivative operator on a distribution, just put it on the test function instead.
Dirac delta thought this wasnt a very nice way to treat the test functions
in Cc (Rn ), but the other distributions assured him it wasnt a problem. The
functions in Cc (Rn ) had infinitely many derivatives; they didnt mind taking
a few whenever a distribution needed to take a derivative. Dirac delta had yet
to ask the test functions himself how they felt about this.
Later one day, dirac delta was wandering around in distribution-land when
he ran into some old friends (or enemies?) that he thought he recognized
from L1loc -land (locally integrable functions). What are you guys doing here?
Youre functions on Rn ! You kicked me out of your function-land earlier!
9 The notation (x x ) dx might seem to suggest that is ac wrt the Lebesgue measure
0
dx, but this is false. However, we think the notation d((x x0 )) looks worse and is probably
more confusing.
89. DIRAC DELTA GETS FOURIER TRANSFORMED 259

The L1loc functions explained, Sure, we have citizenship in both L1loc -land and
distribution-land. You see, every L1loc function f can be identified with a distri-
bution Tf , where Z
Tf () := f dx. (10.22)

Sometimes we even say that the distribution Tf is the function f . And if f is
C 1 , then the classical derivative agrees with the weak (distributional) derivative.
But of course, youre not like us. You arent a function at all. Dirac delta was
a bit sad at this, but then he realized that being only a distribution was ok
whats the point in having distribution-land if everyone was already in L1loc -
land? He also pointed out that they were wrong; dirac delta was indeed a
function, just not a function on Rn .
Later, Dirac delta also ran into some old friends from measure land. He
found some Radon measures (locally finite and inner regular measures), who
explained how they are citizens of both measure-land and distribution-land.
For a Radon measure , the corresponding distribution is
Z
T () := d. (10.23)

If m where sec:radon-nikodym
m is the Lebesgue measure, then d = f dm by Radon-
Nikodym (Section 55), and the distribution can be identified with the function
f . Dirac delta realized he too could become a dual citizen of measure-land and
distribution-land since he corresponds to the atom measure! Dirac delta was
quite satisfied.
Finally, you might be wondering why this is in the Fourier transform section.
It turns out we can Fourier transform distributions. And the Fourier transform
on Lp , p > 2 hasliebloss
to be viewed as a distribution. Well address this next.
References: [7]

89 Dirac delta gets Fourier transformed


sec:dirac-fourier
We lied a bit in the last section. Not every distribution can be Fourier trans-
formed. [TODO note why need tempered distributions.] But we can define the
Fourier transform for all tempered distributions, which form a subspace of the
space of distributions. We define them by enlarging the space of test functions.
Instead of taking Cc (Rn ), we take Schwartz functions S(Rn ), with the semi-
norms kx ( f )(x)k which make it into something called a Frechet space (but
not a Banach space). The space of tempered distribution is then the dual space
of the Schwartz functions, i.e. all continuous linear functions S C. Well
denote this space by S .
Just like when the distributions are lazy and put derivatives on the test
functions, theyre also lazy and put the Fourier transform on the test functions.
If T : S C is a (tempered) distribution, then its Fourier transform is also a
(tempered) distribution defined by
b
Tb() := T (), S(Rn ). (10.24)
260 CHAPTER 10. FOURIER ANALYSIS

For example the Fourier transform of dirac delta is


Z
b b b
() = () = (0) = (x) dx,
R

which we identify with the function 1. So b = 1. Similarly, if we translate


R
d
dirac delta and look at (x a), we get T b
a () = (a) = R (x)e
2ixa
dx, so
d
Ta = e 2ixa
, which agrees (up to the normalization convention) with what
we get Rif we pretend were physicists and plug into the pointwise formula
fb() = f (x)e2ix dx.
This pointwise formula makes more sense for if view it as a measure. Note
that we can Fourier transform a finite Radon measure to get
Z
b() =
e2ix d(x), (10.25) eqn:fourier-stieltjes
Rn

in the sense that


Z Z Z
c () =
T b d(y) = d(y) d (x)e2iy
Rn Rn Rn
Z Z
= d (x) e2iy d(y),
Rn Rn
eqn:fourier-stieltjes
which we identify with the function in (10.25). This is known as the Fourier-
Stieltjes transform of a finite Radon measure.
Anyway, we can do pretty much anything we want with Dirac delta, like
taking derivatives, Fourier transforms, integrals, etc. by using distributions. At
least now we can do all10 those questions about the Dirac delta function that
show up in physics classes.
Moving on to the Fourier transform on Lp , p > 2: If f Lp , then f L1loc ,
so we can define the Fourier transform for p > 2 to be the distribution
Z
fb() := f b dx. (10.26)
Rn

Its convenient that Fourier transforming a distribution corresponds to Fourier


transforming the test function. So we can use properties about the Fourier
transform on Schwartz functions to prove things about the Fourier transform
on distributions.
[TODO add discussion (see Demeter notes?)]
Theorem 79. The Fourier transform is a bounded linear bijection S S
with bounded inverse.
Proof. For boundedness, if Tn T in S , then for f S,

Tbn (f ) = Tn (fb) T (fb) = Tb(f ),


10 Well, not quite all; there are still some questions that make no sense.
90. THE DIFFERENTIAL OPERATOR FINDS SOBOLEV AND MULTIPLIER FRIENDS261

so the Fourier transform is continuous.


sec:riemann-lebesgue
Now recall from Section 83 that the Fourier transform is 4 periodic on S.
So if we apply the Fourier transform three times to Tb S , well get T back,
since the Fourier tranform just acts on the Schwartz test function. This inverse
is also continuous since the Fourier transform is continuous and we just apply
it 3 times.
We also get inversion:
Proposition 53. If Tb L1 (Rn ), then
Z
T (x) = Tb()e2ix d, (10.27)
Rn

which is bounded and continuous.


e
Proof. Define Te(f ) := T (fe), where fe(x) = f (x). Then (Tb) = T . Identifying
Tb with Tb() L , we get,
1

Z Z
b eb b
T () = T () = d T () dx (x)e2ix
n n
ZR ZR
= dx (x) d Tb()e2ix .
Rn Rn
R
So we identify T with the function T (x) = Rn
Tb()e2ix d.
liebloss
duo
References: [7, 1]

90 The differential operator finds Sobolev and


multiplier friends
sec:sobolev
[TODO this section needs a lot of work]
TODO check normalizations in this section...
The differential operator wasnt very happy. He was annoyed that he
could only differentiate smooth functions. He noted that when he visited Fourier-
land, however, he ended up just being multiplication by (2i) since ( f ) () =
(2i) fb(). Multiplication by some power of doesnt seem to require much
to do with differentiability, and so hoped he could take more derivatives by
defining them Fourier side. As long as fb() was still in L2 , he could Fourier
invert and just claim that as the derivative f .
Definition 57. For r 0, the Sobolev space H r (Rn ) is

H r (Rn ) := {f L2 (Rn ) : ||r fb() L2 (Rn )}. (10.28)

It is a Hilbert space with the inner product


Z
hf, gi = fb() gb()(1 + 2||2 )r d. (10.29)
Rn
262 CHAPTER 10. FOURIER ANALYSIS

So for f H r (Rn ), found a nice way to take partial derivatives.

f := ((2i) fb()) , f H r (Rn ), || r, Nn0 . (10.30)

for C || functions, this agreed with the usual notion of derivative


At least lem:fourier-properties
(Lemma 12). But now had more functions he could differentiate. He won-
dered how this related to weak derivatives: Was this definition of f better
than the weak derivative definition?
In fact, the Sobolev spaces capture exactly the functions in L2 whose weak
partial derivatives are also in L2 . Recall for Nn0 , a weak derivative of f is
an L1loc function D f such that
Z Z
(D f ) dx = (1)|| ( )f dx, Cc (Rn ). (10.31) eqn:weak-derivative
Rn Rn
thm:plancharel
By using Plancherel (Theorem 77) to move the integral to Fourier space so we
eqn:weak-derivative
can use the definition of d r n
f , we can show (10.31) holds for , f H (R ), so
that f is the weak derivative of f . So if m N0 , then we have the alternate
characterization

H m (Rn ) = {f L2 (Rn ) : f has weak partial derivatives


up to order m which are in L2 (Rn )}. (10.32)

In this case, we also have the equivalent norm


X
kf k2H m := k f k22 . (10.33)
||m

Since Sobolev spaces involve weak derivatives, we might want to investigate


how this relates to actual derivatives. We have the slight problem that func-
tions living in H m are really equivalence classes of functions, but that wont be
an actual problem; well just send the entire equivalence class to some single
function.

Theorem 80 (Sobolev embedding). Let r > k + n2 for some k N0 . Then we


have the continuous embedding H r (Rn ) Cbk (Rn ) with

k f k Cn,r kf kH r , || k. (10.34)

[TODO: proof outline+check statement]


A useful special case is that H 1 (R) embeds into Cb0 (R), the space of bounded
continuous functions on R.
Other than , who cares about Sobolev spaces? Quantum particles and op-
erators do! So do partial differential equations (PDE)! (Of humans, this might
correspond to mathematical physicists and PDE people.) In quantum mechan-
ics, differential operators like the Hamiltonian H and momentum operator P
have Sobolev spaces for their domains. [TODO: check the actual domains...]
REFERENCES 263

[TODO reference Sobolev inequality in hydrogen stability]


[TODO multiplier symbols, commuting with translations] Recall the Fourier
side definition of f via

d b
f () = (2i) f (),

where we simply multiplied fb by a nice function (2i) .


We can define a large class of operators Tcf := mK fb, where mK L and
K is the distribution such that T f = f K. [TODO hmm not bounded so
maybe not the best intro...]
[TODO section on singular integrals, Hilbert transform?]

References
duo [1] Javier Duoandikoetxea. Fourier Analysis. Graduate Studies in Mathemat-
ics 29. American Mathematical Society, 2000.
Grafakos [2] Loukas Grafakos. Classical Fourier Analysis. 3rd ed. Graduate Texts in
Mathematics 249. Springer, 2014.
hrt [3] Christopher Heil, Jayakumar Ramanathan, and Pankaj Topiwala. Linear
independence of time-frequency translates. In: Proc. Amer. Math. Soc
124 (1996), pp. 27872795. url: http://www.ams.org/journals/proc/
1996-124-09/S0002-9939-96-03346-1/S0002-9939-96-03346-1.pdf.
Katznelson [4] Yitzhak Katznelson. An Introduction to Harmonic Analysis. 3rd ed. Cam-
bridge University Press, 2004.
Lacey [5] Michael Lacey. Carlesons Theorem: Proof, Complements, Variations.
Publ. Mat. 48 (2004), no. 2, 251307. 2003. eprint: arXiv:math/0307008.
LaceyThiele [6] Michael Lacey and Christoph Thiele. A proof of boundedness of the Car-
leson operator. In: Mathematical Research Letters 7.4 (2000), pp. 361
370. doi: 10.4310/mrl.2000.v7.n4.a1. url: http://dx.doi.org/10.
4310/mrl.2000.v7.n4.a1.
liebloss [7] Elliott Lieb and Michael Loss. Analysis. Graduate Studies in Mathematics
14. American Mathematical Society, 2001.
rudin [8] Walter Rudin. Real and Complex Analysis. McGraw-Hill Education, 1986.
Simon [9] Barry Simon. A Comprehensive Course in Analysis. American Mathemat-
ical Society, 2015.
SteinShakarchi-fa [10] Elias Stein and Rami Shakarchi. Fourier Analysis: An Introduction. Vol. 1.
Princeton Lectures in Analysis. Princeton University Press, 2003.
tao-notes [11] Terence Tao. Math 254a Harmonic analysis in the phase plane. 2001. url:
https://www.math.ucla.edu/~tao/254a.1.01w/.
teschl-quantum [12] Gerald Teschl. Mathematical Methods in Quantum Mechanics. 2nd ed.
Graduate Studies in Mathematics 157. American Mathematical Society,
2014.
264 CHAPTER 10. FOURIER ANALYSIS

teschl-fa [13] Gerald Teschl. Topics in Real and Functional Analysis. 2015. url: https:
//www.mat.univie.ac.at/~gerald/ftp/book-fa/fa.pdf.
Zygmund [14] Antoni Zygmund. Trigonometric Series. 3rd ed. Cambridge University
Press, 2003.
refsection:12

Chapter 11

Miscellaneous (maybe move


these later?)

91 Rectifiability
[maybe combine with something?]

92 Hausdorff (fractal) dimension


[have some fun pictures here: fractals + picture illustrating Hausdorff dimension
definition]
Define the Hausdorff meausre at dimension d to be

X
H d (E) := lim inf (Q)d : C covering of E by cubes Q of length <
0
QC

The Hausdorff dimension of E is

dimH (E) := inf{d : H d (E) = 0}.

93 Kakeya sets
sec:kakeya
[TODO write. Maybe also mention Nikodym sets.]

265
266 CHAPTER 11. MISCELLANEOUS (MAYBE MOVE THESE LATER?)

94 Cauchys functional equation and Hamel func-


tions
sec:additive-but-not-linear
Recall that we say that a function f : R R is additive if it satisfies Cauchys
functional equation for all x, y R:
f (x + y) = f (x) + f (y).
We say that f is linear if it is additive and in addition, for all , x R, we have
f (x) = f (x). Are additive functions necessarily linear? Almost:
Proposition 54. Suppose f is additive. Then for every x R and every Q,
f (x) = f (x).
Proof. First, we handle the case = 0. We have f (0) = f (0 + 0) = f (0) + f (0),
and hence f (0) = 0. Now, well tackle the case = 1. We have 0 = f (0) =
f (x + x) = f (x) + f (x), showing that f (x) = f (x). Now suppose = n
with n N. We have f (nx) = f (x + x + + x) with x repeated n times; by
n applications of additivity, this implies that f (nx) = nf (x). Finally, for the
general nonzero case, suppose = pq with p, q N. Then
   
p p
f x = f x (11.1)
q q
 
q p
= f x (11.2)
q q
 
p q
= f x (11.3)
q q
p
= f (x). (11.4)
q

But not quite!


prop:hamel-function Theorem 81. There exists a function f which is additive but not linear.
Proof. Think of R as being a vector space over the field of scalars Q. Every
vector space has a basis, so R does too, call it B. (A basis for R as a vector
space over Q is often called
P a Hamel basis.) Then every real number x can
be uniquely written as bB ab b, where each ab is a rational number and only
finitely many ab s are nonzero. We define f : R R by
!
X X
f ab b = ab . (11.5)
bB bB

It is immediate that this function is additive. To see that it is not linear, note
that its image is contained in Q! Furthermore, for b B, we have f (b) = 1.
Let be your favorite irrational number. Then f (b) is irrational, so f (b) 6=
f (b).
94. CAUCHYS FUNCTIONAL EQUATION AND HAMEL FUNCTIONS267

Functions which are additive but not linear areprop:hamel-function


called Hamel functions. The
Hamel function given in the proof of Proposition 81 is pretty nasty. (You might
recall that the statement that every vector space has a basis is equivalent to the
axiom of choice.) It turns out that all additive nonlinear functions are horribly
nasty:

Theorem 82. Suppose f : R R is additive and Lebesgue measurable. Then


f is linear.

Proof. TODO: give credit to Horst Herrlichs Axiom of Choice. Also TODO:
somehow illustrate this proof. Consider replacing with the proof at http://web.stanford.edu/ ck-
hend/additive.pdf, which proceeds directly rather than contrapositively, but
which is slightly less elementary. Also TODO: consider just omitting the proof
entirely, since its not that simple...
First, assume that f : R R is additive and nonlinear, and assume that
f (x0 ) = 0 and f (x1 ) = 1, where x0 is nonzero. For n Z, let qn be a rational
number so that |nx1 qn x0 | < 12 . Define An = f 1 ([n, n + 1)). Define B0 =
A0 [ 12 , 32 ], and define

Bn = B0 + nx1 qn x0 .

Now, if x An [0, 1], then by additivity, x (nx1 qn x0 ) B0 , and


hence (by definition) x Bn . So An [0, 1] Bn . Taking unions, we see that
[0, 1] n Bn . On the other hand, by the definitions of B0 and qn , every Bn is
contained in [1, 2], so n Bn [1, 2]. Therefore,

1 + (n Bn ) 3.

Now, observe that the sets Bn are pairwise disjoint. Indeed, if Bn Bm 6= ,


then there exist x, y B0 so that

x + nx1 qn x0 = y + mx1 qm x0 .

But applying f to both sides and using additivity gives

f (x) + n = f (y) + m.

By the definition of B0 , |f (x) f (y)| < 1, so n = m as desired. Furthermore,


the Bn s are translates of one another. So if B0 were measurable, by countable
additivity, n Bn would either have measure 0, or measure +. This would
contradict our earlier bounds on the outer measure of n Bn , so B0 must not be
measurable, and hence f is not measurable either.
Finally, for the general case, just assume that f : R R is additive and
nonlinear. Since f is nonlinear, there are nonzero x0 , x1 so that f (x 0) f (x1 )
x0 6= x1 .
Define g : R R by
x0 f (x) xf (x0 )
g(x) = .
x0 f (x1 ) x1 f (x0 )
268 CHAPTER 11. MISCELLANEOUS (MAYBE MOVE THESE LATER?)

This new function g is additive:

x0 f (x) xf (x0 ) + x0 f (y) yf (x0 )


g(x + y) = (11.6)
x0 f (x1 ) x1 f (x0 )
x0 f (x + y) (x + y)f (x0 )
= . (11.7)
x0 f (x1 ) x1 f (x0 )

Furthermore, by construction, g(x0 ) = 0, and g(x1 ) = 1. And of course, the


fact that g(x0 ) = 0 implies that g is nonlinear. Therefore, by the first part of
the proof, g is nonmeasurable, and hence f is nonmeasurable as desired.
[TODO theres plenty more to say about Cauchys functional equation.
Maybe we should just give a short list of all the crazy results that people have
cooked up, not bothering to prove any more than what we already have here.
E.g. apparently there is an additive function f such that f (I) = R for every
interval I, Conway-style...]
[TODO mention how these results translate to the equation f (x + y) =
f (x)f (y)]
refsection:13

Chapter 12

Acknowledgments

Many Bothans died to bring us


this information.
jedi
Mon Mothma [2]

We thank many people for their inspirations, suggestions, and contribu-


tions to this project. Heres a list of some of them, in no particular order:
bedtime
Sunshine Dobois and Colin Macdonald [1], William Hoza, Laura Shou, Oriel
Humes, Adam Jermyn, Eric Bobrow, George Washington, Xander Rudelis,
Shival Dasu, Kathleen Hoza, Nancy Zhang, Algae Elbaum, Molly Weasley, Alex
Bourzutschky, Paul Zhang, Zachary Chase, Helen Xue.

References
bedtime [1] S. Duvois and C. Macdonald. 101 Illustrated Analysis Bedtime Stories.
2001. url: http://people.maths.ox.ac.uk/macdonald/errh/.
jedi [2] George Lucas. Star Wars Episode VI: The Return of the Jedi. 1983.

269
270 CHAPTER 12. ACKNOWLEDGMENTS
refsection:14

Appendix A

Omitted Details

1 Adding books to the top of a stack


apx:book-stacking sec:book-stacking
This section elaborates on the last paragraph of Section ??.

Definition 58. A plan is a sequence x1 , x2 , . . . of real numbers. (Think of xn


as the horizontal position of the center of book n, where the origin is at the
upper right corner of the table. Remember, no two books can have the same
vertical position.)

Definition 59. A plan x1 , x2 , . . . is sound if for every N N, if you were to


place N books at positions x1 , . . . , xN (with book 1 on the bottom and book N
on the top), then that stack of N books would not topple. (In other words, if
you were to build an ever-growing stack by adding books to the top onefig:sound-plan
by one
in positions x1 , x2 , . . . , then the stack would never topple.) See Figure A.1.

Proposition 55. Suppose x1 , x2 , . . . is a sound plan. Then for every n, xn 21 .


(So the overhang of the stack always satisfies d 1.)
1
Proof. Suppose x1 , x2 , . . . is a plan with xn = 2 + for some n N and some

..
.

Table

Figure A.1: The plan 0, 12 , 0, 0, 0, 0, 0, . . . is sound. It achieves an overhang of


fig:sound-plan d = 12 .

271
272 APPENDIX A. OMITTED DETAILS

> 0. Well show that the plan is not sound. For each N > n, define
N
1 X
aN = xi
N i=1
XN
1
bN = xi .
N n i=n+1

(So aN is the COM of books 1, . . . , N , and bN is the COM of books n+1, . . . , N .)


Then we have
Pn
(N n)bN + i=1 xi
aN =
N
n
n 1 X
= bN bN + xi .
N N i=1

First, suppose that |bN xn | > 21 for some N . Then when the stack has N
books, it will topple over, pivoting about one of the top corners of book n.
1 n
Pm that |bN xn | 2 for all N . Then N bN 0 as
Therefore, assume instead
1
N , and of course N i=1 xi 0 as N , so |aN bN | 0 as N .
Choose N large enough that |aN bN | < . Then by the triangle inequality,
|aN xn | < 12 + , so aN > 0. Therefore, when the stack has N books, it will
topple, pivoting about the upper right corner of the table.

You might also like