
LECTURE NOTES ON INFORMATION THEORY

Preface
“There is a whole book of readymade, long and convincing, lavishly composed telegrams for all occasions. Sending such a telegram costs only twenty-five cents. You see, what gets transmitted over the telegraph is not the text of the telegram, but simply the number under which it is listed in the book, and the signature of the sender. This is quite a funny thing, reminiscent of Drugstore Breakfast #2. Everything is served up in a ready form, and the customer is totally freed from the unpleasant necessity to think, and to spend money on top of it.”
Little Golden America. Travelogue by I. Ilf and E. Petrov, 1937.
[Pre-Shannon encoding, courtesy of M. Raginsky]

These notes provide a graduate-level introduction to the mathematics of Information Theory.


They were created by Yury Polyanskiy and Yihong Wu, who used them to teach at MIT (2012,
2013 and 2016), UIUC (2013, 2014) and Yale (2017). The core structure and flow of the material
are largely due to Prof. Sergio Verdú, whose wonderful class at Princeton University [Ver07]
shaped our own perception of the subject. Specifically, we follow Prof. Verdú’s style in
relying on single-shot results, Feinstein’s lemma and information spectrum methods. We
have added a number of technical refinements and new topics, which correspond to our own
interests (e.g., modern aspects of finite blocklength results and applications of information
theoretic methods to statistical decision theory and combinatorics).
Compared to the more popular “typicality” and “method of types” approaches (as in
Cover-Thomas [CT06] and Csiszár-Körner [CK81b]), these notes prepare the reader to
consider delay-constraints (“non-asymptotics”) and to simultaneously treat continuous and
discrete sources/channels.
We are especially thankful to Dr. O. Ordentlich, who contributed a lecture on lattice
codes. The initial version was typed by Qingqing Huang and Austin Collins, who also created
many graphics. Rachel Cohen has also edited various parts. Aolin Xu, Pengkun Yang,
Ganesh Ajjanagadde, Anuran Makur, Jason Klusowski, and Sheng Xu have contributed
suggestions and corrections to the content. We are indebted to all of them.
Y. Polyanskiy <[email protected]>
Y. Wu <[email protected]>
18 Aug 2016

This version: August 18, 2017


Contents

Contents 2

Notations 8

I Information measures 9

1 Information measures: entropy and divergence 10


1.1 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.2 Entropy: axiomatic characterization . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.3 History of entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4* Entropy: submodularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5* Entropy: Han’s inequality and Shearer’s Lemma . . . . . . . . . . . . . . . . . . . 16
1.6 Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.7 Differential entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2 Information measures: mutual information 23


2.1 Divergence: main inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Conditional divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Mutual information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Conditional mutual information and conditional independence . . . . . . . . . . . 30
2.5 Strong data-processing inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.6* How to avoid measurability problems? . . . . . . . . . . . . . . . . . . . . . . . . . 32

3 Sufficient statistic. Continuity of divergence and mutual information 34


3.1 Sufficient statistics and data-processing . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Geometric interpretation of mutual information . . . . . . . . . . . . . . . . . . . 36
3.3 Variational characterizations of divergence: Donsker-Varadhan . . . . . . . . . . . 37
3.4 Variational characterizations of divergence: Gelfand-Yaglom-Perez . . . . . . . . . 38
3.5 Continuity of divergence. Dependence on σ-algebra. . . . . . . . . . . . . . . . . . 39
3.6 Variational characterizations and continuity of mutual information . . . . . . . . . 42

4 Extremization of mutual information: capacity saddle point 44


4.1 Convexity of information measures . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2* Local behavior of divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3* Local behavior of divergence and Fisher information . . . . . . . . . . . . . . . . . 48
4.4 Extremization of mutual information . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5 Capacity = information radius . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6 Existence of caod (general case) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

4.7 Gaussian saddle point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5 Single-letterization. Probability of error. Entropy rate. 58


5.1 Extremization of mutual information for memoryless sources and channels . . . . 58
5.2* Gaussian capacity via orthogonal symmetry . . . . . . . . . . . . . . . . . . . . . . 59
5.3 Information measures and probability of error . . . . . . . . . . . . . . . . . . . . 60
5.4 Entropy rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.5 Entropy and symbol (bit) error rate . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6 Mutual information rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.7* Toeplitz matrices and Szegö’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . 65

6 f -divergences: definition and properties 67


6.1 f -divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.2 Data processing inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.3 Total variation and hypothesis testing . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.4 Motivating example: Hypothesis testing with multiple samples . . . . . . . . . . . 72
6.5 Inequalities between f -divergences . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

7 Inequalities between f -divergences via joint range 76


7.1 Inequalities via joint range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
7.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.3 Joint range between various divergences . . . . . . . . . . . . . . . . . . . . . . . . 83

II Lossless data compression 84

8 Variable-length Lossless Compression 85


8.1 Variable-length, lossless, optimal compressor . . . . . . . . . . . . . . . . . . . . . 85
8.2 Uniquely decodable codes, prefix codes and Huffman codes . . . . . . . . . . . . . 94

9 Fixed-length (almost lossless) compression. Slepian-Wolf. 99


9.1 Fixed-length code, almost lossless, AEP . . . . . . . . . . . . . . . . . . . . . . . . 99
9.2 Linear Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
9.3 Compression with side information at both compressor and decompressor . . . . . 106
9.4 Slepian-Wolf (Compression with side information at decompressor only) . . . . . . 107
9.5 Multi-terminal Slepian Wolf . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
9.6* Source-coding with a helper (Ahlswede-Körner-Wyner) . . . . . . . . . . . . . . . 110

10 Compressing stationary ergodic sources 113


10.1 Bits of ergodic theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
10.2 Proof of Shannon-McMillan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
10.3* Proof of Birkhoff-Khintchine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
10.4* Sinai’s generator theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

11 Universal compression 124


11.1 Arithmetic coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
11.2 Combinatorial construction of Fitingof . . . . . . . . . . . . . . . . . . . . . . . . . 125
11.3 Optimal compressors for a class of sources. Redundancy. . . . . . . . . . . . . . . 127
11.4* Approximate minimax solution: Jeffreys prior . . . . . . . . . . . . . . . . . . . . 128

11.5 Sequential probability assignment: Krichevsky-Trofimov . . . . . . . . . . . . . . . 130
11.6 Individual sequence and universal prediction . . . . . . . . . . . . . . . . . . . . . 131
11.7 Lempel-Ziv compressor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

III Binary hypothesis testing 136

12 Binary hypothesis testing 137


12.1 Binary Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
12.2 Neyman-Pearson formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
12.3 Likelihood ratio tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
12.4 Converse bounds on R(P, Q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
12.5 Achievability bounds on R(P, Q) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
12.6 Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

13 Hypothesis testing asymptotics I 147


13.1 Stein’s regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
13.2 Chernoff regime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
13.3 Basics of Large deviation theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

14 Information projection and Large deviation 157


14.1 Large-deviation exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
14.2 Information Projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
14.3 Interpretation of Information Projection . . . . . . . . . . . . . . . . . . . . . . . . 161
14.4 Generalization: Sanov’s theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163

15 Hypothesis testing asymptotics II 164


15.1 (E0 , E1 )-Tradeoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
15.2 Equivalent forms of Theorem 15.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
15.3* Sequential Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

IV Channel coding 173

16 Channel coding 174


16.1 Channel Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
16.2 Basic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
16.3 General (Weak) Converse Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
16.4 General achievability bounds: Preview . . . . . . . . . . . . . . . . . . . . . . . . . 180

17 Channel coding: achievability bounds 181


17.1 Information density . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
17.2 Shannon’s achievability bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
17.3 Dependence-testing bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
17.4 Feinstein’s Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185

18 Linear codes. Channel capacity 187


18.1 Linear coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
18.2 Channels and channel capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191

18.3 Bounds on C ; Capacity of Stationary Memoryless Channels . . . . . . . . . . . . 193
18.4 Examples of DMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
18.5* Information Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198

19 Channels with input constraints. Gaussian channels. 201


19.1 Channel coding with input constraints . . . . . . . . . . . . . . . . . . . . . . . . . 201
19.2 Capacity under input constraint: C(P) ?= Ci (P) . . . . . . . . . . . . . . . . . . . . 203
19.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
19.4* Non-stationary AWGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
19.5* Stationary Additive Colored Gaussian noise channel . . . . . . . . . . . . . . . . . 209
19.6* Additive White Gaussian Noise channel with Intersymbol Interference . . . . . . . 210
19.7* Gaussian channels with amplitude constraints . . . . . . . . . . . . . . . . . . . . 210
19.8* Gaussian channels with fading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

20 Lattice codes (by O. Ordentlich) 212


20.1 Lattice Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
20.2 First Attempt at AWGN Capacity . . . . . . . . . . . . . . . . . . . . . . . . . . . 215
20.3 Nested Lattice Codes/Voronoi Constellations . . . . . . . . . . . . . . . . . . . . . 216
20.4 Dirty Paper Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
20.5 Construction of Good Nested Lattice Pairs . . . . . . . . . . . . . . . . . . . . . . 220

21 Channel coding: energy-per-bit, continuous-time channels 222


21.1 Energy per bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
21.2 What is N0 ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
21.3 Capacity of the continuous-time band-limited AWGN channel . . . . . . . . . . . 227
21.4 Capacity of the continuous-time band-unlimited AWGN channel . . . . . . . . . . 228
21.5 Capacity per unit cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231

22 Advanced channel coding. Source-Channel separation. 235


22.1 Strong Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
22.2 Stationary memoryless channel without strong converse . . . . . . . . . . . . . . . 239
22.3 Channel Dispersion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
22.4 Normalized Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
22.5 Joint Source Channel Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242

23 Channel coding with feedback 246


23.1 Feedback does not increase capacity for stationary memoryless channels . . . . . . 246
23.2* Alternative proof of Theorem 23.1 and Massey’s directed information . . . . . . . 249
23.3 When is feedback really useful? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252

24 Capacity-achieving codes via Forney concatenation 258


24.1 Error exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
24.2 Achieving polynomially small error probability . . . . . . . . . . . . . . . . . . . . 259
24.3 Concatenated codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
24.4 Achieving exponentially small error probability . . . . . . . . . . . . . . . . . . . . 260

V Lossy data compression 262

25 Rate-distortion theory 263


25.1 Scalar quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
25.2 Information-theoretic vector quantization . . . . . . . . . . . . . . . . . . . . . . . 269
25.3* Converting excess distortion to average . . . . . . . . . . . . . . . . . . . . . . . . 272

26 Rate distortion: achievability bounds 274


26.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
26.2 Shannon’s rate-distortion theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
26.3* Covering lemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281

27 Evaluating R(D). Lossy Source-Channel separation. 284


27.1 Evaluation of R(D) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
27.2* Analog of saddle-point property in rate-distortion . . . . . . . . . . . . . . . . . . 287
27.3 Lossy joint source-channel coding . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
27.4 What is lacking in classical lossy compression? . . . . . . . . . . . . . . . . . . . . 295

VI Advanced topics 296

28 Applications to statistical decision theory 297


28.1 Fano, LeCam and minimax risks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
28.2 Mutual information method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301

29 Multiple-access channel 306


29.1 Problem motivation and main results . . . . . . . . . . . . . . . . . . . . . . . . . 306
29.2 MAC achievability bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
29.3 MAC capacity region proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310

30 Examples of MACs. Maximal Pe and zero-error capacity. 313


30.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
30.2 Orthogonal MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
30.3 BSC MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
30.4 Adder MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
30.5 Multiplier MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
30.6 Contraction MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
30.7 Gaussian MAC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318
30.8 MAC Peculiarities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319

31 Random number generators 323


31.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
31.2 Converse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
31.3 Elias’ construction of RNG from lossless compressors . . . . . . . . . . . . . . . . 324
31.4 Peres’ iterated von Neumann’s scheme . . . . . . . . . . . . . . . . . . . . . . . . . 325
31.5 Bernoulli factory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
31.6 Related problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329

32 Entropy method in combinatorics and geometry 331

32.1 Binary vectors of average weights . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
32.2 Shearer’s lemma & counting subgraphs . . . . . . . . . . . . . . . . . . . . . . . . 332
32.3 Brégman’s Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
32.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney . . . . . . . . . . . 336

Bibliography 338

Notations

• a ∧ b = min{a, b} and a ∨ b = max{a, b}.

• Leb: Lebesgue measure on Euclidean spaces.

• For p ∈ [0, 1], p̄ , 1 − p.

• x+ = max{x, 0}.

• int(·), cl(·), co(·) denote interior, closure and convex hull.

• N = {1, 2, . . .}, Z+ = {0, 1, . . .}, R+ = {x : x ≥ 0}.

• Standard big-O notation is used: e.g., for any positive sequences {an} and {bn}, an = O(bn) if there is an absolute constant c > 0 such that an ≤ c·bn; an = Ω(bn) if bn = O(an); an = Θ(bn) if both an = O(bn) and an = Ω(bn); an = o(bn) or bn = ω(an) if an ≤ εn·bn for some εn → 0.

Part I

Information measures

§ 1. Information measures: entropy and divergence

Review: Random variables

• Two methods to describe a random variable (RV) X:

1. a function X : Ω → X from the probability space (Ω, F, P) to a target space X .


2. a distribution PX on some measurable space (X , F).

• Convention: capital letter – RV (e.g. X); small letter – realization (e.g. x0 ).

• A RV X is discrete if there exists a countable set X = {xj : j ≥ 1} such that \sum_{j≥1} P_X(xj) = 1. The set X is called the alphabet of X, x ∈ X are atoms, and P_X(x) is the probability mass function (pmf).

• For discrete RV, support supp(PX ) = {x : PX (x) > 0}.

• Vector RVs: X_1^n ≜ (X1, . . . , Xn), also denoted simply by X^n.

• For a vector RV X n and S ⊂ {1, . . . , n} we denote XS = {Xi , i ∈ S}.

1.1 Entropy
Definition 1.1 (Entropy). Let X be a discrete RV with distribution PX . The entropy (or Shannon
entropy) of X is
    H(X) = E\left[\log \frac{1}{P_X(X)}\right] = \sum_{x∈X} P_X(x) \log \frac{1}{P_X(x)}.

Definition 1.2 (Joint entropy). Let X^n = (X1, X2, . . . , Xn) be a random vector with n components. Its joint entropy is
    H(X^n) = H(X1, . . . , Xn) = E\left[\log \frac{1}{P_{X_1,...,X_n}(X_1, . . . , X_n)}\right].
Note: This is not really a new definition: Definition 1.2 is consistent with Definition 1.1 by treating
X n as a RV taking values on the product space.
Definition 1.3 (Conditional entropy).
    H(X|Y) = E_{y∼P_Y}[H(P_{X|Y=y})] = E\left[\log \frac{1}{P_{X|Y}(X|Y)}\right],
i.e., the entropy of P_{X|Y=y} averaged over P_Y.

Note:

• Q: Why such definition, why log, why entropy?


The name comes from thermodynamics. The definition is justified by theorems in this course
(e.g., operationally by compression), but also by a number of experiments. For example, one
can measure the time it takes for scout ants to describe the location of food to worker ants.
It was found that when the nest is placed at the root of a full binary tree of depth d and the food at
one of the leaves, the time is proportional to log 2^d = d, the entropy of the random variable
describing the food location. It was estimated that ants communicate at about 0.7–1 bit/min.
Furthermore, the communication time decreases if there are regularities in the path description
(e.g., paths like “left, right, left, right, left, right” were described faster). See [RZ86] for more.

• We agree that 0 · log(1/0) = 0 (by continuity of x ↦ x log(1/x)).

• Also write H(PX ) instead of H(X) (abuse of notation, as customary in information theory).

• Basis of log — units

log2 ↔ bits
loge ↔ nats
log256 ↔ bytes
log ↔ arbitrary units, base always matches exp

Example (Bernoulli): X ∈ {0, 1}, P[X = 1] = P_X(1) ≜ p and P[X = 0] = P_X(0) ≜ p̄. Then
    H(X) = h(p) ≜ p \log\frac{1}{p} + p̄ \log\frac{1}{p̄},
where h(·) is called the binary entropy function.
[Figure: plot of h(p) on [0, 1], with maximum log 2 at p = 1/2.]

Proposition 1.1. h(·) is continuous, concave on [0, 1], and
    h′(p) = \log\frac{p̄}{p},
with infinite slope at 0 and 1.
Example (Geometric): X ∈ {0, 1, 2, . . .}, P[X = i] = P_X(i) = p · p̄^i. Then
    H(X) = \sum_{i=0}^{∞} p·p̄^i \log\frac{1}{p·p̄^i} = \sum_{i=0}^{∞} p·p̄^i \left( i \log\frac{1}{p̄} + \log\frac{1}{p} \right) = \log\frac{1}{p} + p \cdot \log\frac{1}{p̄} \cdot \frac{1−p}{p^2} = \frac{h(p)}{p}.

Example (Infinite entropy): Can H(X) = +∞? Yes: take P[X = k] = \frac{c}{k \ln^2 k}, k = 2, 3, · · ·
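As a quick numerical companion (a sketch of ours, not part of the original notes), the following Python snippet computes H(X) in bits for a given pmf and checks the Bernoulli and geometric examples above; the helper names are ours.

import numpy as np

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as a sequence of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                                   # convention: 0 * log(1/0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

def h(p):
    """Binary entropy function h(p)."""
    return entropy([p, 1.0 - p])

print(h(0.11))                                     # approx 0.5 bit

# Geometric example: P[X = i] = p * (1-p)^i, i = 0, 1, 2, ...; H(X) = h(p)/p
p = 0.25
geometric = [p * (1 - p) ** i for i in range(2000)]    # truncation error is negligible
print(entropy(geometric), h(p) / p)                    # both approx 3.245 bits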

Review: Convexity

• Convex set: A subset S of some vector space is convex if x, y ∈ S ⇒ αx + ᾱy ∈ S


for all α ∈ [0, 1]. (Notation: ᾱ , 1 − α.)
e.g., unit interval [0, 1]; S = {probability distributions on X }, S = {PX : E[X] = 0}.

• Convex function: f : S → R is

– convex if f (αx + ᾱy) ≤ αf (x) + ᾱf (y) for all x, y ∈ S, α ∈ [0, 1].
– strictly convex if f (αx + ᾱy) < αf (x) + ᾱf (y) for all x 6= y ∈ S, α ∈ (0, 1).
– (strictly) concave if −f is (strictly) convex.
e.g., x ↦ x log x is strictly convex; the mean P ↦ ∫ x dP is convex but not strictly convex; the variance is concave (Q: is it strictly concave? Think of zero-mean distributions.).

• Jensen’s inequality: For any S-valued random variable X,
  – f is convex ⇒ f(EX) ≤ E f(X);
  – f is strictly convex ⇒ f(EX) < E f(X), unless X is a constant (X = EX a.s.).

Famous puzzle: A man says, ”I am the average height and average weight of the
population. Thus, I am an average man.” However, he is still considered a little overweight.
Why?
Answer: Weight is roughly proportional to volume, which, for us three-dimensional beings, is roughly proportional to the third power of height. Let PX denote the distribution of height in the population. Since x ↦ x³ is strictly convex on x > 0, Jensen’s inequality gives (EX)³ < E[X³], regardless of the distribution of X.
Source: [Yos03, Puzzle 94] or online [Har].

Theorem 1.1 (Properties of H).

1. (Positivity) H(X) ≥ 0 with equality iff X is a constant (no randomness).

2. (Uniform distribution maximizes entropy) For finite X , H(X) ≤ log |X |, with equality iff X is
uniform on X .

3. (Invariance under relabeling) H(X) = H(f (X)) for any bijective f .

4. (Conditioning reduces entropy)
H(X|Y ) ≤ H(X), with equality iff X and Y are independent.

5. (Small chain rule)


H(X, Y ) = H(X) + H(Y |X) ≤ H(X) + H(Y )
6. (Entropy under functions) H(X) = H(X, f (X)) ≥ H(f (X)) with equality iff f is one-to-one
on the support of PX ,
7. (Full chain rule)
    H(X_1, . . . , X_n) = \sum_{i=1}^n H(X_i | X^{i−1}) ≤ \sum_{i=1}^n H(X_i),    (1.1)
with equality iff X_1, . . . , X_n are mutually independent.
Proof. 1. Expectation of positive function
2. Jensen’s inequality
3. H only depends on the values of P_X, not on their locations: relabeling the atoms leaves the multiset of probabilities, and hence H, unchanged.

4. Later (Lecture 2)
5. E\left[\log\frac{1}{P_{XY}(X,Y)}\right] = E\left[\log\frac{1}{P_X(X) · P_{Y|X}(Y|X)}\right] = \underbrace{E\left[\log\frac{1}{P_X(X)}\right]}_{H(X)} + \underbrace{E\left[\log\frac{1}{P_{Y|X}(Y|X)}\right]}_{H(Y|X)}.

6. Intuition: (X, f (X)) contains the same amount of information as X. Indeed, x 7→ (x, f (x)) is
one-to-one. Thus by 3 and 5:
H(X) = H(X, f (X)) = H(f (X)) + H(X|f (X)) ≥ H(f (X))
The bound is attained iff H(X|f (X)) = 0 which in turn happens iff X is a constant given
f (X).
7. Telescoping:
PX1 X2 ···Xn = PX1 PX2 |X1 · · · PXn |X n−1
then take the log.
Note: To give a preview of the operational meaning of entropy, let us play the game of 20
Questions. We are allowed to make queries about some unknown discrete RV X by asking yes-no
questions. The objective of the game is to guess the realized value of the RV X. For example,
X ∈ {a, b, c, d} with P [X = a] = 1/2, P [X = b] = 1/4, and P [X = c] = P [X = d] = 1/8. In this
case, we can ask “X = a?”. If not, proceed by asking “X = b?”. If not, ask “X = c?”, after
which we will know for sure the realization of X. The resulting average number of questions is
1/2 + 1/4 × 2 + 1/8 × 3 + 1/8 × 3 = 1.75, which equals H(X) in bits. An alternative strategy is to
ask “Is X ∈ {a, b} or {c, d}?” in the first round and then determine the exact value in the second
round; this always requires two questions and does worse on average.
It turns out (Section 8.2) that the minimal average number of yes-no questions to pin down the
value of X is always between H(X) bits and H(X) + 1 bits. In this special case the above scheme
is optimal because (intuitively) it always splits the probability in half.
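The following sketch (ours, not from the notes) verifies that the greedy strategy above indeed uses 1.75 = H(X) questions on average, and checks the chain rule H(X, Y) = H(X) + H(Y|X) on an arbitrary small joint pmf.

import numpy as np

def entropy(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# 20-questions example: X in {a, b, c, d} with probabilities 1/2, 1/4, 1/8, 1/8.
probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
questions = {'a': 1, 'b': 2, 'c': 3, 'd': 3}           # "X=a?", then "X=b?", then "X=c?"
print(entropy(list(probs.values())))                   # 1.75 bits
print(sum(probs[x] * questions[x] for x in probs))     # 1.75 questions on average

# Chain rule H(X,Y) = H(X) + H(Y|X) on a small joint pmf P_XY:
P_XY = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
P_X = P_XY.sum(axis=1)
H_Y_given_X = sum(P_X[i] * entropy(P_XY[i] / P_X[i]) for i in range(len(P_X)))
print(np.isclose(entropy(P_XY), entropy(P_X) + H_Y_given_X))   # True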

1.2 Entropy: axiomatic characterization
One might wonder why entropy is defined as H(P) = \sum_i p_i \log\frac{1}{p_i} and whether there are other possible definitions.
Indeed, the information-theoretic definition of entropy is related to entropy in statistical physics.
It also arises as the answer to specific operational problems, e.g., the minimum average number of bits needed to describe a random variable, as discussed above. Therefore it is fair to say that it is not pulled out of thin air.
In his 1948 paper, Shannon also showed that entropy can be defined axiomatically, as a function
satisfying several natural conditions. Denote a probability distribution on m letters by P =
(p1, . . . , pm) and consider a functional Hm(p1, . . . , pm). If Hm obeys the following axioms:

a) Permutation invariance

b) Expansible: Hm (p1 , . . . , pm−1 , 0) = Hm−1 (p1 , . . . , pm−1 ).

c) Normalization: H2 ( 21 , 21 ) = log 2.

d) Subadditivity: H(X, Y) ≤ H(X) + H(Y). Equivalently, H_{mn}(r_{11}, . . . , r_{mn}) ≤ H_m(p_1, . . . , p_m) + H_n(q_1, . . . , q_n) whenever \sum_{j=1}^n r_{ij} = p_i and \sum_{i=1}^m r_{ij} = q_j.

e) Additivity: H(X, Y) = H(X) + H(Y) if X ⊥⊥ Y. Equivalently, H_{mn}(p_1 q_1, . . . , p_m q_n) = H_m(p_1, . . . , p_m) + H_n(q_1, . . . , q_n).

f) Continuity: H2 (p, 1 − p) → 0 as p → 0.
then H_m(p_1, . . . , p_m) = \sum_{i=1}^m p_i \log\frac{1}{p_i} is the only possibility. The interested reader is referred to [CT06, p. 53] and the references therein.

1.3 History of entropy


In the early days of the industrial age, engineers wondered whether it was possible to construct a perpetual
motion machine. After many failed attempts, a law of conservation of energy was postulated: a
machine cannot produce more work than the amount of energy it consumes from the ambient world
(this is also called the first law of thermodynamics). The next round of attempts was to
construct a machine that would draw energy in the form of heat from a warm body and convert it
to an equal (or approximately equal) amount of work. An example would be a steam engine. However,
it was again observed that all such machines were highly inefficient: the amount of work
produced by absorbing heat Q was far less than Q. The remainder of the energy was dissipated to the
ambient world in the form of heat. After many rounds of attempting various designs, Clausius
and Kelvin proposed another law:

Second law of thermodynamics: There does not exist a machine that operates in a cycle
(i.e. returns to its original state periodically), produces useful work and whose only
other effect on the outside world is drawing heat from a warm body. (That is, every
such machine, should expend some amount of heat to some cold body too!)1

Equivalent formulation is: There does not exist a cyclic process that transfers heat from a cold
body to a warm body (that is, every such process needs to be helped by expending some amount of
external work).
1
Note that the reverse effect (that is converting work into heat) is rather easy: friction is an example.

Notice that there is something annoying about the second law as compared to the first law. In
the first law there is a quantity that is conserved, and this is somehow logically easy to accept. The
second law seems a bit harder to believe in (and some engineers did not, and only their recurrent
failures to circumvent it finally convinced them). So Clausius, building on the ingenious work of
S. Carnot, figured out that there is an “explanation” for why any cyclic machine should expend
heat. He proposed that there must be some hidden quantity associated with the machine, its entropy
(translated as “transformative content”), whose value must return to its original state. Furthermore,
under any reversible (i.e. quasi-stationary, or “very slow”) process operated on this machine, the
change of entropy is proportional to the ratio of the absorbed heat and the temperature of the machine:
    ∆S = \frac{∆Q}{T}.    (1.2)
If heat Q is absorbed at temperature T_hot, then to return to the original state one must give back some
amount of heat Q′, where Q′ can be significantly smaller than Q but never zero if Q′ is returned
at a temperature 0 < T_cold < T_hot.2 Further logical arguments can convince one that for an irreversible
cyclic process the change of entropy at the end of the cycle can only be positive, and hence entropy
cannot decrease.
There were a great many experimentally verified consequences of the second law. What is surprising,
however, is that the mysterious entropy had no formula of its own (unlike, say, energy),
and thus had to be computed indirectly on the basis of relation (1.2). This changed with the
revolutionary work of Boltzmann and Gibbs, who provided a microscopic explanation of the second
law based on the principles of statistical physics and showed that, e.g., for a system of n independent
particles (as in an ideal gas) the entropy of a given macro-state can be computed as
    S = k n \sum_{j=1}^{ℓ} p_j \log \frac{1}{p_j},
where k is the Boltzmann constant, we assume that each particle can only be in one of ℓ
molecular states (e.g. spin up/down, or if we quantize the phase volume into ℓ subcubes), and p_j is
the fraction of particles in the j-th molecular state.

1.4* Entropy: submodularity



Recall that [n] denotes the set {1, . . . , n}, \binom{S}{k} denotes the collection of subsets of S of size k, and 2^S denotes all subsets of S. A set function f : 2^S → R is called submodular if for any T1, T2 ⊂ S

f (T1 ∪ T2 ) + f (T1 ∩ T2 ) ≤ f (T1 ) + f (T2 ) (1.3)

Submodularity is similar to concavity, in the sense that “adding elements gives diminishing returns”.
Indeed, consider T′ ⊂ T and b ∉ T. Then
    f(T ∪ b) − f(T) ≤ f(T′ ∪ b) − f(T′).

Theorem 1.2. Let X n be discrete RV. Then T 7→ H(XT ) is submodular.

Proof. Let A = XT1 \T2 , B = XT1 ∩T2 , C = XT2 \T1 . Then we need to show

H(A, B, C) + H(B) ≤ H(A, B) + H(B, C) .


2
See for example https://en.wikipedia.org/wiki/Carnot_heat_engine#Carnot.27s_theorem.

This follows from a simple chain

H(A, B, C) + H(B) = H(A, C|B) + 2H(B) (1.4)


≤ H(A|B) + H(C|B) + 2H(B) (1.5)
= H(A, B) + H(B, C) (1.6)
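As a quick numerical sanity check of Theorem 1.2 (a sketch of ours, not part of the notes), one can draw a random joint pmf on {0,1}³ and verify H(A, B, C) + H(B) ≤ H(A, B) + H(B, C):

import numpy as np

def H(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
P = rng.random((2, 2, 2))
P /= P.sum()                                   # random joint pmf of (A, B, C) on {0,1}^3

lhs = H(P) + H(P.sum(axis=(0, 2)))             # H(A,B,C) + H(B)
rhs = H(P.sum(axis=2)) + H(P.sum(axis=0))      # H(A,B)   + H(B,C)
print(lhs <= rhs + 1e-12)                      # True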

Note that entropy is not only submodular, but also monotone:

T1 ⊂ T2 =⇒ H(XT1 ) ≤ H(XT2 ) .

So, fixing n, let us denote by Γn the set of all non-negative, monotone, submodular set-functions on [n]. Note that via an obvious enumeration of all non-empty subsets of [n], Γn is a closed convex cone in R_+^{2^n − 1}. Similarly, let us denote by Γ*_n the set of all set-functions corresponding to distributions on X^n, and by Γ̄*_n the closure of Γ*_n. It is not hard to show, cf. [ZY97], that Γ̄*_n is also a closed convex cone and that
    Γ*_n ⊂ Γ̄*_n ⊂ Γn.
The astonishing result of [ZY98] is that

    Γ*_2 = Γ̄*_2 = Γ_2    (1.7)
    Γ*_3 ⊊ Γ̄*_3 = Γ_3    (1.8)
    Γ*_n ⊊ Γ̄*_n ⊊ Γ_n,   n ≥ 4.    (1.9)

This follows from a fundamental new information inequality not implied by the submodularity of
entropy (and thus called a non-Shannon inequality). Namely, [ZY98] showed that for any 4-tuple of
discrete random variables:
    I(X_3; X_4) − I(X_3; X_4|X_1) − I(X_3; X_4|X_2) ≤ \frac{1}{2} I(X_1; X_2) + \frac{1}{4} I(X_1; X_3, X_4) + \frac{1}{4} I(X_2; X_3, X_4)
(to translate into an entropy inequality, see Theorem 2.4).

1.5* Entropy: Han’s inequality and Shearer’s Lemma


Theorem 1.3 (Han’s inequality). Let X^n be a discrete n-dimensional RV and denote by H̄_k(X^n) = \binom{n}{k}^{-1} \sum_{T∈\binom{[n]}{k}} H(X_T) the average entropy of a k-subset of coordinates. Then \frac{H̄_k}{k} is decreasing in k:
    \frac{1}{n} H̄_n ≤ · · · ≤ \frac{1}{k} H̄_k ≤ · · · ≤ H̄_1.    (1.10)
Furthermore, the sequence H̄k is increasing and concave in the sense of decreasing slope:

H̄k+1 − H̄k ≤ H̄k − H̄k−1 . (1.11)


Proof. Denote for convenience H̄_0 = 0. Note that \frac{H̄_m}{m} is an average of differences:
    \frac{1}{m} H̄_m = \frac{1}{m} \sum_{k=1}^m (H̄_k − H̄_{k−1}).

Thus, it is clear that (1.11) implies (1.10) since increasing m by one adds a smaller element to the
average. To prove (1.11) observe that from submodularity

H(X1 , . . . , Xk+1 ) + H(X1 , . . . , Xk−1 ) ≤ H(X1 , . . . , Xk ) + H(X1 , . . . , Xk−1 , Xk+1 ) .

Now average this inequality over all n! permutations of indices {1, . . . , n} to get

H̄k+1 + H̄k−1 ≤ 2H̄k

as claimed by (1.11).
Alternative proof: Notice that by “conditioning decreases entropy” we have

H(Xk+1 |X1 , . . . , Xk ) ≤ H(Xk+1 |X2 , . . . , Xk ) .

Averaging this inequality over all permutations of indices yields (1.11).

Theorem 1.4 (Shearer’s Lemma). Let X^n be a discrete n-dimensional RV and let S ⊆ [n] be a random variable independent of X^n and taking values in subsets of [n]. Then
    H(X_S | S) ≥ H(X^n) · \min_{i∈[n]} P[i ∈ S].    (1.12)

Remark 1.1. In the special case where S is uniform over all subsets of cardinality k, (1.12) reduces to Han’s inequality \frac{1}{n} H(X^n) ≤ \frac{1}{k} H̄_k. The case of n = 3 and k = 2 can be used to give an entropy proof of the following well-known geometric result relating the size of a 3-D object to the sizes of its 2-D projections: place N points in R³ arbitrarily, and let N_1, N_2, N_3 denote the number of distinct points projected onto the xy-, xz- and yz-plane, respectively. Then N_1 N_2 N_3 ≥ N², as sketched below.
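Here is a sketch of the entropy argument behind this projection inequality (our reconstruction; the notes state the result without proof). Let (X, Y, Z) be uniform over the N points. Then
    2\log N = 2H(X,Y,Z) \le H(X,Y) + H(X,Z) + H(Y,Z) \le \log N_1 + \log N_2 + \log N_3,
where the first inequality is Han’s inequality (1.10) with n = 3, k = 2 (equivalently, Shearer’s lemma), and the second holds because, e.g., (X, Y) takes at most N_1 distinct values. Exponentiating gives N² ≤ N_1 N_2 N_3.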

Proof. We will prove an equivalent (by taking a suitable limit) version: if C = (S_1, . . . , S_M) is a list (possibly with repetitions) of subsets of [n], then
    \sum_j H(X_{S_j}) ≥ H(X^n) · \min_i \deg(i),    (1.13)

where deg(i) ≜ #{j : i ∈ S_j}. Let us call C a chain if all subsets can be rearranged so that S_1 ⊆ S_2 ⊆ · · · ⊆ S_M. For a chain, (1.13) is trivial, since the minimum on the right-hand side is either zero (if S_M ≠ [n]) or equals the multiplicity of S_M in C,3 in which case we have
    \sum_j H(X_{S_j}) ≥ H(X_{S_M}) · #{j : S_j = S_M} = H(X^n) · \min_i \deg(i).

For the case where C is not a chain, consider a pair of sets S_1, S_2 that are not related by inclusion and replace them in the collection with S_1 ∩ S_2, S_1 ∪ S_2. Submodularity (1.3) implies that the sum on the left-hand side of (1.13) does not decrease under this replacement, while the values deg(i) are unchanged. Since the total number of pairs not related by inclusion strictly decreases with each replacement, we must eventually arrive at a chain, for which (1.13) has already been shown.
3
Note that, consequently, for X n without constant coordinates, and if C is a chain, (1.13) is only tight if C consists
of only ∅ and [n] (with multiplicities). Thus if degrees deg(i) are known and non-constant, then (1.13) can be improved,
cf. [MT10].

Note: Han’s inequality holds for any submodular set-function. Shearer’s lemma holds for any
submodular set-function that is also non-negative.
Example: Another submodular set-function is

S 7→ I(XS ; XS c ) .

Han’s inequality for this one reads
    0 = \frac{1}{n} I_n ≤ · · · ≤ \frac{1}{k} I_k ≤ · · · ≤ I_1,
where I_k = \binom{n}{k}^{-1} \sum_{S : |S|=k} I(X_S; X_{S^c}) measures the amount of k-subset coupling in the random vector X^n.

1.6 Divergence

Review: Measurability

In this course we will assume that all alphabets are standard Borel spaces. Some of the
nice properties of standard Borel spaces:

• All complete separable metric spaces, endowed with Borel σ-algebras are standard
Borel. In particular, countable alphabets and Rn and R∞ (space of sequences) are
standard Borel.
• If X_i, i = 1, 2, . . . are standard Borel, then so is \prod_{i=1}^∞ X_i.

• Singletons {x} are measurable sets

• The diagonal {(x, x) : x ∈ X } is measurable in X × X

• (Most importantly) for any probability distribution P_{XY} on X × Y there exists a transition probability kernel (also called a regular branch of a conditional distribution) P_{Y|X} s.t.
    P_{XY}[E] = \int_X P_X(dx) \int_Y P_{Y|X=x}(dy) 1\{(x, y) ∈ E\}.

Intuition: Divergence (also known as information divergence, Kullback-Leibler (KL) divergence, or relative entropy) D(P‖Q) gauges the dissimilarity between P and Q.

Definition 1.4 (Divergence). Let P, Q be distributions on an alphabet A.

• A = discrete alphabet (finite or countably infinite):
    D(P‖Q) ≜ \sum_{a∈A} P(a) \log \frac{P(a)}{Q(a)},
where we agree:
(1) 0 · \log\frac{0}{0} = 0;
(2) ∃a : Q(a) = 0, P(a) > 0 ⇒ D(P‖Q) = ∞.

• A = R^k, P and Q have densities p and q:
    D(P‖Q) = \int_{R^k} p(x) \log\frac{p(x)}{q(x)} \, dx  if Leb{p > 0, q = 0} = 0, and +∞ otherwise.

• A = general measurable space:
    D(P‖Q) = E_P\left[\log\frac{dP}{dQ}\right] = E_Q\left[\frac{dP}{dQ}\log\frac{dP}{dQ}\right]  if P ≪ Q, and +∞ otherwise.

Notes:

• (Radon-Nikodym theorem) Recall that for two measures P and Q, we say P is absolutely continuous w.r.t. Q (denoted by P ≪ Q) if Q(E) = 0 implies P(E) = 0 for all measurable E. If P ≪ Q, then there exists a function f : X → R_+ such that for any measurable set E,
    P(E) = \int_E f \, dQ.   [change of measure]
Such f is called a relative density (or a Radon-Nikodym derivative) of P w.r.t. Q, denoted by \frac{dP}{dQ}. Usually, \frac{dP}{dQ} is simply the likelihood ratio:
  – For discrete distributions, we can just take \frac{dP}{dQ}(x) to be the ratio of pmfs.
  – For continuous distributions, we can take \frac{dP}{dQ}(x) to be the ratio of pdfs.

• (Infinite values) D(P‖Q) can be ∞ even when P ≪ Q, but the two cases of D(P‖Q) = +∞ are consistent since D(P‖Q) = \sup_Π D(P_Π‖Q_Π), where Π ranges over finite partitions of the underlying space A (Theorem 3.7).

• (Asymmetry) D(P‖Q) ≠ D(Q‖P). Therefore divergence is not a distance. However, asymmetry can be very useful. Example: P(H) = P(T) = 1/2, Q(H) = 1. Upon observing HHHHHHH, one tends to believe it is Q but can never be absolutely sure; upon observing HHT, one knows for sure it is P. Indeed, D(P‖Q) = ∞, D(Q‖P) = 1 bit (see Lecture 13).

• (Pinsker’s inequality) There are many other measures of dissimilarity, e.g., total variation (L1-distance)
    TV(P, Q) ≜ \sup_E P[E] − Q[E] = \frac{1}{2} \int |dP − dQ|    (1.14)
            = \frac{1}{2} \sum_x |P(x) − Q(x)|   (discrete)
            = \frac{1}{2} \int |p(x) − q(x)| \, dx   (continuous)
Total variation is symmetric and is in fact a distance. The famous Pinsker (or Pinsker-Csiszár) inequality relates D and TV (see Theorem 6.5):
    TV(P, Q) ≤ \sqrt{\frac{1}{2\log e} D(P‖Q)}.    (1.15)

• (Other divergences) A general class of divergence-like measures was proposed by Csiszár. Fixing a convex function f : R_+ → R with f(1) = 0, we define the f-divergence D_f as
    D_f(P‖Q) ≜ E_Q\left[ f\!\left( \frac{dP}{dQ} \right) \right].    (1.16)
This encompasses total variation, χ²-distance, Hellinger distance, Tsallis divergence, etc. Inequalities between various f-divergences, such as (1.15), were once an active field of research. A complete solution was obtained by Harremoës and Vajda [HV11], who gave a simple method for obtaining the best possible inequalities between any pair of f-divergences.

Theorem 1.5 (H vs. D). If distribution P is supported on a finite set A, then
    H(P) = \log |A| − D(P ‖ U_A),
where U_A is the uniform distribution on A.

Example (Binary divergence): A = {0, 1}; P = [p, p̄]; Q = [q, q̄]. Then
    D(P‖Q) = d(p‖q) ≜ p \log\frac{p}{q} + p̄ \log\frac{p̄}{q̄}.
[Figure: d(p‖q) as a function of p for fixed q (zero at p = q, equal to −log q̄ at p = 0 and −log q at p = 1), and as a function of q for fixed p (zero at q = p, blowing up as q → 0 or q → 1).]
Quadratic lower bound (homework):
    d(p‖q) ≥ 2(p − q)² \log e.
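A quick numerical illustration (ours, not part of the notes) of d(p‖q), the quadratic lower bound and Pinsker’s inequality (1.15), with all divergences in bits:

import numpy as np

def D(P, Q):
    """KL divergence in bits between two pmfs on the same finite alphabet."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    if np.any(Q[mask] == 0):
        return np.inf
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

def d(p, q):                                   # binary divergence d(p||q)
    return D([p, 1 - p], [q, 1 - q])

def TV(P, Q):
    return 0.5 * float(np.abs(np.asarray(P, float) - np.asarray(Q, float)).sum())

p, q = 0.3, 0.6
print(d(p, q) >= 2 * (p - q) ** 2 * np.log2(np.e))           # quadratic lower bound
P, Q = [0.3, 0.7], [0.6, 0.4]
print(TV(P, Q) <= np.sqrt(D(P, Q) / (2 * np.log2(np.e))))    # Pinsker (1.15)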

Example (Real Gaussian): A = R

1 σ 2 1 h (m1 − m0 )2 σ12 i
D(N (m1 , σ12 )kN (m0 , σ02 )) = log 02 + + − 1 log e (1.17)
2 σ1 2 σ02 σ02

Example (Vector Gaussian): A = R^k, assuming det Σ_0 ≠ 0,
    D(N(m_1, Σ_1) ‖ N(m_0, Σ_0)) = \frac{1}{2}\Big[ \log\det Σ_0 − \log\det Σ_1 + (m_1 − m_0)^T Σ_0^{−1} (m_1 − m_0) \log e + \operatorname{tr}(Σ_0^{−1} Σ_1 − I) \log e \Big].    (1.18)
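The vector formula (1.18) is straightforward to evaluate numerically. A minimal numpy sketch (ours; divergences in nats, i.e. natural log), which also reproduces the scalar formula (1.17):

import numpy as np

def kl_gauss(m1, S1, m0, S0):
    """D( N(m1,S1) || N(m0,S0) ) in nats for real Gaussian vectors, following (1.18)."""
    m1, m0 = np.asarray(m1, float), np.asarray(m0, float)
    S1, S0 = np.asarray(S1, float), np.asarray(S0, float)
    k = len(m1)
    S0inv = np.linalg.inv(S0)
    dm = m1 - m0
    return 0.5 * (np.log(np.linalg.det(S0) / np.linalg.det(S1))
                  + dm @ S0inv @ dm
                  + np.trace(S0inv @ S1) - k)

# Scalar sanity check against (1.17): D(N(1, 2) || N(0, 1))
print(kl_gauss([1.0], [[2.0]], [0.0], [[1.0]]))   # = 0.5*(log(1/2) + 1 + 2 - 1), approx 0.653 nats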

Example (Complex Gaussian): A = C. The pdf of N_c(m, σ²) is \frac{1}{πσ^2} e^{−|x−m|^2/σ^2}, or equivalently
    N_c(m, σ²) = N\left( [\mathrm{Re}(m), \mathrm{Im}(m)], \begin{pmatrix} σ^2/2 & 0 \\ 0 & σ^2/2 \end{pmatrix} \right),
and
    D(N_c(m_1, σ_1²) ‖ N_c(m_0, σ_0²)) = \log\frac{σ_0^2}{σ_1^2} + \left[ \frac{|m_1 − m_0|^2}{σ_0^2} + \frac{σ_1^2}{σ_0^2} − 1 \right] \log e,
which follows from (1.18).
More generally, for the vector space A = C^k, assuming det Σ_0 ≠ 0,
    D(N_c(m_1, Σ_1) ‖ N_c(m_0, Σ_0)) = \log\det Σ_0 − \log\det Σ_1 + (m_1 − m_0)^H Σ_0^{−1} (m_1 − m_0) \log e + \operatorname{tr}(Σ_0^{−1} Σ_1 − I) \log e.

Note: The definition of D(P‖Q) extends verbatim to measures P and Q (not necessarily probability measures), in which case D(P‖Q) can be negative. A sufficient condition for D(P‖Q) ≥ 0 is that P is a probability measure and Q is a sub-probability measure, i.e., ∫ dQ ≤ 1 = ∫ dP.

1.7 Differential entropy


The notion of differential entropy is simply the divergence with respect to the Lebesgue measure:
Definition 1.5. The differential entropy of a random vector X^k is
    h(X^k) = h(P_{X^k}) ≜ −D(P_{X^k} ‖ Leb).    (1.19)
In particular, if X^k has probability density function (pdf) p, then h(X^k) = E \log \frac{1}{p(X^k)}; otherwise h(X^k) = −∞. The conditional differential entropy is h(X^k | Y) ≜ E \log \frac{1}{p_{X^k|Y}(X^k|Y)}, where p_{X^k|Y} is a conditional pdf.
Example: Gaussian. For X ∼ N(µ, σ²),
    h(X) = \frac{1}{2} \log(2πeσ^2).    (1.20)
More generally, for X ∼ N(µ, Σ) in R^d,
    h(X) = \frac{1}{2} \log((2πe)^d \det Σ).    (1.21)
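A Monte Carlo sanity check of (1.20) (a sketch of ours; entropies in nats):

import numpy as np

mu, sigma = 1.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=200_000)
# h(X) = E[log 1/p(X)], estimated by averaging -log p(X) over samples:
log_pdf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
print(-log_pdf.mean(), 0.5 * np.log(2 * np.pi * np.e * sigma**2))   # both approx 2.11 nats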
Warning: Even for a continuous random variable X, h(X) can be positive, negative, equal to ±∞, or even undefined.4
Nevertheless, differential entropy shares many properties with the usual Shannon entropy:
Theorem 1.6 (Properties of differential entropy). Assume that all differential entropies appearing
below exist and are finite (in particular all RVs have pdfs and conditional pdfs).
1. (Uniform distribution maximizes differential entropy) If P[X n ∈ S] = 1 then h(X n ) ≤
log Leb(S),5 with equality iff X n is uniform on S.
4 For an example, consider a piecewise-constant pdf taking value e^{(−1)^n n} on the n-th interval, of width ∆_n = \frac{c}{n^2} e^{−(−1)^n n}.
5 Here Leb(S) is the same as the volume vol(S).

2. (Scaling and shifting) h(X^k + x) = h(X^k), h(αX^k) = h(X^k) + k \log|α|, and for an invertible matrix A, h(AX^k) = h(X^k) + \log|\det A|.

3. (Conditioning reduces differential entropy) h(X|Y ) ≤ h(X) (here Y could be arbitrary, e.g.
discrete)

4. (Chain rule)
n
X
n
h(X ) = h(Xk |X k−1 ) .
k=1

5. (Submodularity) The set-function T 7→ h(XT ) is submodular.


6. (Han’s inequality) The function k ↦ \frac{1}{k}\binom{n}{k}^{-1}\sum_{T∈\binom{[n]}{k}} h(X_T) is decreasing in k.

§ 2. Information measures: mutual information

2.1 Divergence: main inequality


Theorem 2.1 (Information Inequality).

D(P kQ) ≥ 0 ; D(P kQ) = 0 iff P = Q

Proof. Let ϕ(x) ≜ x \log x, which is strictly convex, and use Jensen’s inequality:
    D(P‖Q) = \sum_x P(x)\log\frac{P(x)}{Q(x)} = \sum_x Q(x)\,ϕ\!\left(\frac{P(x)}{Q(x)}\right) ≥ ϕ\!\left(\sum_x Q(x)\frac{P(x)}{Q(x)}\right) = ϕ(1) = 0.
2.2 Conditional divergence


The main objects in our course are random variables. The main operation for creating new
random variables, and also for defining relations between random variables, is that of a random
transformation:

Definition 2.1. Conditional probability distribution (aka random transformation, transition prob-
ability kernel, Markov kernel, channel) K(·|·) has two arguments: first argument is a measurable
subset of Y, second argument is an element of X . It must satisfy:

1. For any x ∈ X : K( · |x) is a probability measure on Y

2. For any measurable set A: x 7→ K(A|x) is a measurable function on X .

In this case we will say that K acts from X to Y. In fact, we will abuse notation and write P_{Y|X} instead of K to suggest what the spaces X and Y are.1 Furthermore, if X and Y are connected by the random transformation P_{Y|X}, we will write X \xrightarrow{P_{Y|X}} Y.

Remark 2.1. (Very technical!) Unfortunately, condition 2 (standard in probability textbooks) will frequently not be strong enough for the purposes of this course. The main reason is that we want Radon-Nikodym derivatives such as \frac{dP_{Y|X=x}}{dQ_Y}(y) to be jointly measurable in (x, y). See Section 2.6* for more.

Example:

1. deterministic system: Y = f (X) ⇔ PY |X=x = δf (x)


1
Another reason for writing PY |X is that from any joint distribution PXY (on standard Borel spaces) one can
extract a random transformation by conditioning on X.

2. decoupled system: Y ⊥⊥ X ⇔ P_{Y|X=x} = P_Y
3. additive noise (convolution): Y = X + Z with Z ⊥⊥ X ⇔ P_{Y|X=x} = P_{x+Z}.
We will use the following notations extensively:
• Multiplication: take P_X and P_{Y|X} (i.e. X \xrightarrow{P_{Y|X}} Y) to get the joint distribution P_{XY} = P_X P_{Y|X}:
    P_{XY}(x, y) = P_{Y|X}(y|x) P_X(x).

• Composition (marginalization): P_Y = P_{Y|X} ◦ P_X, that is, P_{Y|X} acts on P_X to produce P_Y:
    P_Y(y) = \sum_{x∈X} P_{Y|X}(y|x) P_X(x).
We will also write P_X \xrightarrow{P_{Y|X}} P_Y.
Definition 2.2 (Conditional divergence).
    D(P_{Y|X} ‖ Q_{Y|X} | P_X) = E_{x∼P_X}[D(P_{Y|X=x} ‖ Q_{Y|X=x})]
        = \sum_{x∈X} P_X(x) D(P_{Y|X=x} ‖ Q_{Y|X=x})   (X discrete)
        = \int dx\, p_X(x) D(P_{Y|X=x} ‖ Q_{Y|X=x})   (X continuous)

Theorem 2.2 (Properties of Divergence).


1. D(PY |X kQY |X |PX ) = D(PX PY |X kPX QY |X )
2. (Simple chain rule) D(PXY kQXY ) = D(PY |X kQY |X |PX ) + D(PX kQX )
3. (Monotonicity) D(PXY kQXY ) ≥ D(PY kQY )
4. (Full chain rule)
n
X
D(PX1 ···Xn kQX1 ···Xn ) = D(PXi |X i−1 kQXi |X i−1 |PX i−1 )
i=1
Q
In the special case of QX n = i QXi we have
n
X
D(PX1 ···Xn kQX1 · · · QXn ) = D(PX1 ···Xn kPX1 · · · PXn ) + D(PXi kQXi )
i=1

5. (Conditioning increases divergence) Let P_{Y|X} and Q_{Y|X} be two kernels, and let P_Y = P_{Y|X} ◦ P_X and Q_Y = Q_{Y|X} ◦ P_X. Then
    D(P_Y ‖ Q_Y) ≤ D(P_{Y|X} ‖ Q_{Y|X} | P_X),
with equality iff D(P_{X|Y} ‖ Q_{X|Y} | P_Y) = 0.

Pictorially: [Diagram: the same input P_X is pushed through the two kernels P_{Y|X} and Q_{Y|X}, producing P_Y and Q_Y.]

6. (Data-processing for divergences) Let
    P_Y = \int P_{Y|X}(·|x)\, dP_X  and  Q_Y = \int P_{Y|X}(·|x)\, dQ_X   ⟹   D(P_Y ‖ Q_Y) ≤ D(P_X ‖ Q_X).    (2.1)
Pictorially: [Diagram: P_X and Q_X are pushed through the same channel P_{Y|X} to give P_Y and Q_Y; then D(P_X ‖ Q_X) ≥ D(P_Y ‖ Q_Y).]

Proof. We only illustrate these results for the case of finite alphabets. General case follows by doing
a careful analysis of Radon-Nikodym derivatives, introduction of regular branches of conditional
probability etc. For certain cases (e.g. separable metric spaces), however, we can simply discretize
alphabets and take the granularity of discretization to zero. This method will become clearer in
Lecture 3, once we understand continuity of D.
h i
P
1. Ex∼PX [D(PY |X=x kQY |X=x )] = EXY ∼PX PY |X log QY |X PPX X
Y |X

h i h i
PXY PY |X PX
2. Disintegration: EXY log QXY
= EXY log Q + log QX
Y |X

3. Apply 2. with X and Y interchanged and use D(·‖·) ≥ 0.


Q Q
4. Telescoping PX n = ni=1 PXi |X i−1 and QX n = ni=1 QXi |X i−1 .

5. Inequality follows from monotonicity. To get conditions for equality, notice that by the chain
rule for D:

    D(P_{XY} ‖ Q_{XY}) = D(P_{Y|X} ‖ Q_{Y|X} | P_X) + \underbrace{D(P_X ‖ P_X)}_{=0}
                       = D(P_{X|Y} ‖ Q_{X|Y} | P_Y) + D(P_Y ‖ Q_Y)

and hence we get the claimed result from positivity of D.

6. This again follows from monotonicity.

Corollary 2.1.
    D(P_{X_1···X_n} ‖ Q_{X_1} · · · Q_{X_n}) ≥ \sum_i D(P_{X_i} ‖ Q_{X_i}),
with equality iff P_{X^n} = \prod_{j=1}^n P_{X_j}.

Note: In general we can have D(PXY kQXY ) ≶ D(PX kQX ) + D(PY kQY ). For example, if X = Y
under P and Q, then D(PXY kQXY ) = D(PX kQX ) < 2D(PX kQX ). Conversely, if PX = QX and
PY = QY but PXY 6= QXY we have D(PXY kQXY ) > 0 = D(PX kQX ) + D(PY kQY ).

Corollary 2.2. Y = f (X) ⇒ D(PY kQY ) ≤ D(PX kQX ), with equality if f is 1-1.

Note: D(PY kQY ) = D(PX kQX ) 6⇒ f is 1-1. Example: PX = Gaussian, QX = Laplace, Y = |X|.

Corollary 2.3 (Large deviations estimate). For any subset E ⊂ X we have

d(PX [E]kQX [E]) ≤ D(PX kQX )

Proof. Consider Y = 1{X∈E} .

Note: This method will be very useful in studying large deviations (Section 13.1 and Section 14.1), when applied to an event E which is highly likely under P but highly unlikely under Q.
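A small numerical illustration of Corollary 2.3 (a sketch of ours, not part of the notes): for any event E, the binary divergence between P[E] and Q[E] is dominated by D(P‖Q).

import numpy as np

def D(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    m = P > 0
    return np.inf if np.any(Q[m] == 0) else float(np.sum(P[m] * np.log2(P[m] / Q[m])))

P = np.array([0.5, 0.3, 0.1, 0.1])
Q = np.array([0.25, 0.25, 0.25, 0.25])
E = np.array([True, True, False, False])       # the event E = {1, 2}

# d(P[E] || Q[E]) is the divergence between the laws of the indicator Y = 1{X in E}:
lhs = D([P[E].sum(), 1 - P[E].sum()], [Q[E].sum(), 1 - Q[E].sum()])
print(lhs <= D(P, Q))                          # True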

2.3 Mutual information


Definition 2.3 (Mutual information).

I(X; Y ) = D(PXY kPX PY )

Note:

• Intuition: I(X; Y ) measures the dependence between X and Y , or, the information about X
(resp. Y ) provided by Y (resp. X)

• Defined by Shannon (in a different form), in this form by Fano.

• This definition is not restricted to discrete RVs.

• I(X; Y ) is a functional of the joint distribution PXY , or equivalently, the pair (PX , PY |X ).

Theorem 2.3 (Properties of I).

1. I(X; Y ) = D(PXY kPX PY ) = D(PY |X kPY |PX ) = D(PX|Y kPX |PY )

2. Symmetry: I(X; Y ) = I(Y ; X)

3. Positivity: I(X; Y) ≥ 0; I(X; Y) = 0 iff X ⊥⊥ Y

4. I(f (X); Y ) ≤ I(X; Y ); f one-to-one ⇒ I(f (X); Y ) = I(X; Y )

5. “More data ⇒ More info”: I(X1 , X2 ; Z) ≥ I(X1 ; Z)

Proof. 1. I(X; Y) = E\log\frac{P_{XY}}{P_X P_Y} = E\log\frac{P_{Y|X}}{P_Y} = E\log\frac{P_{X|Y}}{P_X}.

2. Apply data-processing inequality twice to the map (x, y) → (y, x) to get D(PXY kPX PY ) =
D(PY X kPY PX ).

3. By definition and Theorem 2.1.

4. We will use the data-processing property of mutual information (to be proved shortly, see
Theorem 2.5). Consider the chain of data processing: (x, y) 7→ (f (x), y) 7→ (f −1 (f (x)), y).
Then
I(X; Y ) ≥ I(f (X); Y ) ≥ I(f −1 (f (X)); Y ) = I(X; Y )

5. Consider f (X1 , X2 ) = X1 .

Theorem 2.4 (I v.s. H).


1. I(X; X) = H(X) if X is discrete, and I(X; X) = +∞ otherwise.

2. If X is discrete, then
I(X; Y ) = H(X) − H(X|Y ).
If both X and Y are discrete, then

I(X; Y ) = H(X) + H(Y ) − H(X, Y ).

3. If X, Y are real-valued vectors, have joint pdf and all three differential entropies are finite
then
I(X; Y ) = h(X) + h(Y ) − h(X, Y )
If X has marginal pdf pX and conditional pdf pX|Y (x|y) then

I(X; Y ) = h(X) − h(X|Y ) .

4. If X or Y are discrete then I(X; Y ) ≤ min (H(X), H(Y )), with equality iff H(X|Y ) = 0 or
H(Y |X) = 0, i.e., one is a deterministic function of the other.

Proof. 1. By definition, I(X; X) = D(P_{X|X} ‖ P_X | P_X) = E_{x∼P_X} D(δ_x ‖ P_X). If P_X is discrete, then D(δ_x ‖ P_X) = \log\frac{1}{P_X(x)} and I(X; X) = H(X). If P_X is not discrete, then let A = {x : P_X(x) > 0} denote the set of atoms of P_X. Let ∆ = {(x, x) : x ∉ A} ⊂ X × X. Then P_{X,X}(∆) = P_X(A^c) > 0, but since
    (P_X × P_X)(E) ≜ \int_X P_X(dx_1) \int_X P_X(dx_2) 1\{(x_1, x_2) ∈ E\},
we have, by taking E = ∆, that (P_X × P_X)(∆) = 0. Thus P_{X,X} is not absolutely continuous w.r.t. P_X × P_X, and therefore
    I(X; X) = D(P_{X,X} ‖ P_X P_X) = +∞.


2. Telescoping: E\log\frac{P_{XY}}{P_X P_Y} = E\left[\log\frac{1}{P_X} + \log\frac{1}{P_Y} − \log\frac{1}{P_{XY}}\right].

3. Similarly, when P_{XY} and P_X P_Y have densities p_{XY} and p_X p_Y, we have
    D(P_{XY} ‖ P_X P_Y) = E\left[\log\frac{p_{XY}}{p_X p_Y}\right] = h(X) + h(Y) − h(X, Y).

4. Follows from 2.

Corollary 2.4 (Conditioning reduces entropy). For discrete X: H(X|Y) ≤ H(X), with equality iff X ⊥⊥ Y.
Intuition: the amount of entropy reduction equals the mutual information.
Example: It is important to note that conditioning reduces entropy on average, not per realization. Let X = U OR Y, where U, Y are i.i.d. Bern(1/2). Then X ∼ Bern(3/4) and H(X) = h(1/4) < 1 bit = H(X|Y = 0), i.e., conditioning on Y = 0 increases entropy. But on average, H(X|Y) = P[Y = 0] H(X|Y = 0) + P[Y = 1] H(X|Y = 1) = 1/2 bit < H(X), by the strict concavity of h(·).
Note: Information, entropy and Venn diagrams:
1. The following Venn diagram illustrates the relationship between entropy, conditional entropy,
joint entropy, and mutual information.
[Venn diagram: two overlapping circles H(X) and H(Y) inside a region of total area H(X, Y); the intersection is I(X; Y), and the non-overlapping parts are H(X|Y) and H(Y|X).]

2. If you do the same for 3 variables, you will discover that the triple intersection corresponds to
H(X1 ) + H(X2 ) + H(X3 ) − H(X1 , X2 ) − H(X2 , X3 ) − H(X1 , X3 ) + H(X1 , X2 , X3 ) (2.2)
which is sometimes denoted by I(X; Y ; Z). It can be both positive and negative (why?).
3. In general, one can treat random variables as sets (so that the Xi corresponds to set Ei and
the pair (X1 , X2 ) corresponds to E1 ∪ E2 ). Then we can define a unique signed measure µ
on the finite algebra generated by these sets so that every information quantity is found by
replacing I/H → µ, the separator “;” → ∩, the comma “,” → ∪, and the conditioning bar “|” → \ (set difference).
As an example, we have
H(X1 |X2 , X3 ) = µ(E1 \ (E2 ∪ E3 )) , (2.3)
I(X1 , X2 ; X3 |X4 ) = µ(((E1 ∪ E2 ) ∩ E3 ) \ E4 ) . (2.4)
By inclusion-exclusion, quantity (2.2) corresponds to µ(E1 ∩ E2 ∩ E3 ), which explains why µ
is not necessarily a positive measure. For an extensive discussion, see [CK81a, Chapter 1.3].

Example: Bivariate Gaussian. Let X, Y be jointly Gaussian. Then
    I(X; Y) = \frac{1}{2} \log\frac{1}{1 − ρ_{XY}^2},
where ρ_{XY} ≜ \frac{E[(X−EX)(Y−EY)]}{σ_X σ_Y} ∈ [−1, 1] is the correlation coefficient. [Figure: I(X; Y) as a function of ρ ∈ [−1, 1]; it equals 0 at ρ = 0 and diverges as ρ → ±1.]
Proof. WLOG, by shifting and scaling if necessary, we can assume EX = EY = 0 and EX² = EY² = 1. Then ρ = E[XY]. By joint Gaussianity, Y = ρX + Z for some Z ∼ N(0, 1 − ρ²) ⊥⊥ X. Then, using the divergence formula (1.17) for Gaussians, we get
    I(X; Y) = D(P_{Y|X} ‖ P_Y | P_X) = E\, D(N(ρX, 1 − ρ²) ‖ N(0, 1))
            = E\left[ \frac{1}{2}\log\frac{1}{1−ρ^2} + \frac{\log e}{2}\left( (ρX)^2 + 1 − ρ^2 − 1 \right) \right]
            = \frac{1}{2}\log\frac{1}{1−ρ^2}.
Alternatively, use the differential entropy representation (Theorem 2.4) and the entropy formula (1.20) for Gaussians:
    I(X; Y) = h(Y) − h(Y|X) = h(Y) − h(Z) = \frac{1}{2}\log(2πe) − \frac{1}{2}\log(2πe(1 − ρ^2)) = \frac{1}{2}\log\frac{1}{1−ρ^2}.
Note: Similar to mutual information, the correlation coefficient also measures, in a certain sense, the dependence between random variables that are real-valued (or, more generally, take values in an inner-product space). However, mutual information is invariant to bijections and is more general: it can be defined not just for numerical random variables, but also for apples and oranges.

Example: Additive white Gaussian noise (AWGN) channel. Let X ⊥⊥ N be independent Gaussians and Y = X + N. Then
    I(X; X + N) = \frac{1}{2}\log\left(1 + \frac{σ_X^2}{σ_N^2}\right),
where σ_X^2/σ_N^2 is the signal-to-noise ratio (SNR).

Example: Gaussian vectors. Let X ∈ R^m, Y ∈ R^n be jointly Gaussian. Then
    I(X; Y) = \frac{1}{2}\log\frac{\det Σ_X \det Σ_Y}{\det Σ_{[X,Y]}},
where Σ_X ≜ E[(X − EX)(X − EX)'] denotes the covariance matrix of X ∈ R^m, and Σ_{[X,Y]} denotes the covariance matrix of the random vector [X, Y] ∈ R^{m+n}.
In the special case of additive noise, Y = X + N with N ⊥⊥ X, we have
    I(X; X + N) = \frac{1}{2}\log\frac{\det(Σ_X + Σ_N)}{\det Σ_N},
since (why?) \det Σ_{[X,X+N]} = \det\begin{pmatrix} Σ_X & Σ_X \\ Σ_X & Σ_X + Σ_N \end{pmatrix} = \det Σ_X \det Σ_N.
Example: Binary symmetric channel (BSC). [Figure: the usual crossover diagram, in which the input bit is flipped with probability δ; equivalently, Y = X + N with N ⊥⊥ X.]
    X ∼ Bern(1/2),   N ∼ Bern(δ),   Y = X + N,
    I(X; Y) = \log 2 − h(δ).

Example: Addition over finite groups. X is uniform on G and independent of Z. Then

I(X; X + Z) = log |G| − H(Z)

Proof. Show that X + Z is uniform on G regardless of Z.
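The examples above are easy to reproduce numerically. The following sketch (ours, not part of the notes) computes I(X;Y) from a joint pmf via I(X;Y) = H(X) + H(Y) − H(X,Y) and checks the BSC formula I(X;Y) = log 2 − h(δ), in bits:

import numpy as np

def H(pmf):
    p = np.asarray(pmf, dtype=float).ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mutual_information(P_XY):
    P_XY = np.asarray(P_XY, float)
    P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
    return H(P_X) + H(P_Y) - H(P_XY)           # valid for discrete X, Y (Theorem 2.4)

delta = 0.11
# Joint pmf of (X, Y) for X ~ Bern(1/2) sent through BSC(delta):
P_XY = 0.5 * np.array([[1 - delta, delta],
                       [delta, 1 - delta]])
print(mutual_information(P_XY), 1 - H([delta, 1 - delta]))   # both approx 0.5 bits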

2.4 Conditional mutual information and conditional independence
Definition 2.4 (Conditional mutual information).

I(X; Y |Z) = D(PXY |Z kPX|Z PY |Z |PZ ) (2.5)


= Ez∼PZ [I(X; Y |Z = z)] . (2.6)

where the product of two random transformations is (PX|Z=z PY |Z=z )(x, y) , PX|Z (x|z)PY |Z (y|z),
under which X and Y are independent conditioned on Z.

Note: I(X; Y |Z) is a functional of PXY Z .

Remark 2.2 (Conditional independence). A family of distributions can be represented by a directed


acyclic graph. A simple example is a Markov chain (path graph), which represents distributions
that factor as {PXY Z : PXY Z = PX PY |X PZ|Y }.


The conditional independence (Markov chain) notation admits several equivalent characterizations:
X → Y → Z ⇔ PXZ|Y = PX|Y · PZ|Y
          ⇔ PZ|XY = PZ|Y
          ⇔ PXY Z = PX · PY|X · PZ|Y
          ⇔ X, Y, Z form a Markov chain
          ⇔ X ⊥⊥ Z | Y
          ⇔ X ← Y → Z, i.e., PXY Z = PY · PX|Y · PZ|Y
          ⇔ Z → Y → X

Theorem 2.5 (Further properties of Mutual Information).

1. I(X; Z|Y ) ≥ 0, with equality iff X → Y → Z

2. (Kolmogorov identity or small chain rule)

I(X, Y ; Z) = I(X; Z) + I(Y ; Z|X)


= I(Y ; Z) + I(X; Z|Y )

3. (Data Processing) If X → Y → Z, then

a) I(X; Z) ≤ I(X; Y ), with equality iff X → Z → Y .

30
b) I(X; Y |Z) ≤ I(X; Y )

4. If X → Y → Z → W , then I(X; W ) ≤ I(Y ; Z)

5. (Full chain rule)


I(X^n; Y) = Σ_{k=1}^n I(X_k; Y | X^{k−1})

6. f and g are one-to-one ⇒ I(f (X); g(Y )) = I(X; Y )

Proof. 1. By definition and Theorem 2.3.3.

2.
P_{XYZ}/(P_{XY}·P_Z) = [P_{XZ}/(P_X·P_Z)] · [P_{Y|XZ}/P_{Y|X}]

3. Apply Kolmogorov identity to I(Y, Z; X):

I(Y, Z; X) = I(X; Y) + I(X; Z|Y) = I(X; Z) + I(X; Y|Z),
where I(X; Z|Y) = 0 by the Markov chain X → Y → Z.

4. I(X; W ) ≤ I(X; Z) ≤ I(Y ; Z)

5. Recursive application of Kolmogorov identity.

Note: In general, I(X; Y |Z) ≷ I(X; Y ). Examples:


a) “>”: Conditioning does not always decrease M.I. To find counterexamples when X, Y, Z do
not form a Markov chain, notice that there is only one directed acyclic graph non-isomorphic to
X → Y → Z, namely X → Y ← Z. Then a counterexample is

Let X, Z ∼ Bern(1/2) i.i.d. and Y = X ⊕ Z. Then
I(X; Y) = 0, since X ⊥⊥ Y,
I(X; Y|Z) = I(X; X ⊕ Z|Z) = H(X) = 1 bit.
(A numerical check of this example is given after this list.)

b) “<”: Z = Y . Then I(X; Y |Y ) = 0.
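The following is our own numerical check of example a) (not from the original notes): with X, Z i.i.d. Bern(1/2) and Y = X ⊕ Z we compute I(X; Y) and I(X; Y|Z) directly from the joint pmf.

```python
import numpy as np

def mi(pxy):
    """Mutual information (bits) from a joint pmf matrix."""
    px = pxy.sum(axis=1, keepdims=True); py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask]/(px @ py)[mask])).sum())

# Joint pmf of (X, Y, Z) with X, Z ~ Bern(1/2) i.i.d. and Y = X xor Z
p = np.zeros((2, 2, 2))
for x in range(2):
    for z in range(2):
        p[x, x ^ z, z] = 0.25

I_xy = mi(p.sum(axis=2))                                           # I(X;Y) = 0
I_xy_given_z = sum(0.5 * mi(p[:, :, z] / 0.5) for z in range(2))   # I(X;Y|Z) = 1 bit
print(f"I(X;Y)   = {I_xy:.4f} bits")
print(f"I(X;Y|Z) = {I_xy_given_z:.4f} bits")
```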


Note: (Chain rule for I ⇒ chain rule for H) Set Y = X^n. Then H(X^n) = I(X^n; X^n) = Σ_{k=1}^n I(X_k; X^n|X^{k−1}) = Σ_{k=1}^n H(X_k|X^{k−1}), since H(X_k|X^n, X^{k−1}) = 0.

Remark 2.3 (Data processing for mutual information via data processing of divergence). We
proved data processing for mutual information in Theorem 2.5 using Kolmogorov’s identity. In fact,
data processing for mutual information is implied by the data processing for divergence:

I(X; Z) = D(PZ|X kPZ |PX ) ≤ D(PY |X kPY |PX ) = I(X; Y ),

where we note that, for each x, applying the kernel P_{Z|Y} maps P_{Y|X=x} to P_{Z|X=x} and maps P_Y to P_Z. Therefore, if we have a bivariate functional of distributions D(P‖Q) which satisfies data processing, then we can define an “M.I.-like” quantity via I_D(X; Y) ≜ D(P_{Y|X}‖P_Y|P_X) ≜ E_{x∼P_X} D(P_{Y|X=x}‖P_Y), which will satisfy data processing on Markov chains. A rich class of examples arises by taking D = D_f (an f-divergence, defined in (1.16)). That f-divergences satisfy data processing will be proved in Remark 4.2.

2.5 Strong data-processing inequalities
For many random transformations PY |X , it is possible to improve the data-processing inequality (2.1):
For any PX , QX we have
D(PY kQY ) ≤ ηKL D(PX kQX ) ,
where ηKL < 1 and depends on the channel PY |X only. Similarly, this gives an improvement in the
data-processing inequality for mutual information: For any PU,X we have

U →X→Y =⇒ I(U ; Y ) ≤ ηKL I(U ; X) .

For example, for PY |X = BSC(δ) we have ηKL = (1 − 2δ)2 . Strong data-processing inequalities
quantify the intuitive observation that noise inside the channel PY |X must reduce the information
that Y carries about the data U , regardless of how smart the mapping U 7→ X is.
This is an active area of research, see [PW17] for a short summary.
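The BSC case is easy to probe numerically. The experiment below is our own (not from the notes): for randomly chosen P_X, Q_X on {0, 1} it verifies that D(P_Y‖Q_Y) ≤ (1 − 2δ)² D(P_X‖Q_X) and that the sampled ratios approach but do not exceed η_KL.

```python
import numpy as np

def kl(p, q):
    """KL divergence in bits between two pmfs on the same finite alphabet."""
    mask = p > 0
    return float((p[mask] * np.log2(p[mask]/q[mask])).sum())

delta = 0.2
W = np.array([[1-delta, delta],
              [delta, 1-delta]])        # BSC(delta) as a row-stochastic matrix
eta = (1 - 2*delta)**2                  # contraction coefficient for the BSC

rng = np.random.default_rng(2)
worst_ratio = 0.0
for _ in range(10_000):
    p = rng.dirichlet([1, 1]); q = rng.dirichlet([1, 1])
    num, den = kl(p @ W, q @ W), kl(p, q)
    if den > 1e-9:
        worst_ratio = max(worst_ratio, num/den)

print(f"max sampled D(P_Y||Q_Y)/D(P_X||Q_X) = {worst_ratio:.4f}")
print(f"eta_KL = (1-2*delta)^2             = {eta:.4f}")
```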

2.6* How to avoid measurability problems?


As we mentioned in Remark 2.1 conditions imposed by Definition 2.1 on PY |X are insufficient.
Namely, we get the following two issues:
1. Radon–Nikodym derivatives such as (dP_{Y|X=x}/dQ_Y)(y) may not be jointly measurable in (x, y)

2. The set {x : P_{Y|X=x} ≪ Q_Y} may not be measurable.

The easiest way to avoid all such problems is the following:

Agreement A1: All conditional kernels PY |X : X → Y in these notes will be assumed


to be defined by choosing a σ-finite measure µ₂ on Y and a measurable function ρ(y|x) ≥ 0 on X × Y such that
P_{Y|X}(A|x) = ∫_A ρ(y|x) µ₂(dy)
for all x and measurable sets A, and ∫_Y ρ(y|x) µ₂(dy) = 1 for all x.

Notes:

1. Given another kernel Q_{Y|X} specified via ρ′(y|x) and µ′₂, we may first replace µ₂ and µ′₂ by µ″₂ = µ₂ + µ′₂ and thus assume that both P_{Y|X} and Q_{Y|X} are specified in terms of the same dominating measure µ″₂. (This modifies ρ(y|x) to ρ(y|x)·(dµ₂/dµ″₂)(y).)

2. Given two kernels P_{Y|X} and Q_{Y|X} specified in terms of the same dominating measure µ₂ and functions ρ_P(y|x) and ρ_Q(y|x), respectively, we may set
dP_{Y|X}/dQ_{Y|X} ≜ ρ_P(y|x)/ρ_Q(y|x)
outside of {ρ_Q = 0}. When P_{Y|X=x} ≪ Q_{Y|X=x}, the above gives a version of the Radon–Nikodym derivative, which is automatically measurable in (x, y).

3. Given Q_Y specified as dQ_Y = q(y) dµ₂, we may set
A₀ = {x : ∫_{{q=0}} ρ(y|x) µ₂(dy) = 0}.
This set plays the role of {x : P_{Y|X=x} ≪ Q_Y}. Unlike the latter, A₀ is guaranteed to be measurable by the Fubini theorem [Ç11, Prop. 6.9]. By “plays the role” we mean that it allows one to prove statements like: for any P_X,
P_{XY} ≪ P_X Q_Y ⇐⇒ P_X[A₀] = 1.

So, while our agreement resolves the two measurability problems above, it introduces a new
one. Indeed, given a joint distribution PXY on standard Borel spaces, it is always true that one
can extract a conditional distribution PY |X satisfying Definition 2.1 (this is called disintegration).
However, it is not guaranteed that PY |X will satisfy Agreement A1. To work around this issue as
well, we add another agreement:

Agreement A2: All joint distributions P_{XY} are specified by means of data: µ₁, µ₂ – σ-finite measures on X and Y, respectively, and a measurable function λ(x, y) such that
P_{XY}(E) ≜ ∫_E λ(x, y) µ₁(dx) µ₂(dy).

Notes:

1. Again, given a finite or countable collection of joint distributions PXY , QX,Y , . . . satisfying A2
we may without loss of generality assume they are defined in terms of a common µ1 , µ2 .

2. Given P_{XY} satisfying A2 we can disintegrate it into a conditional (satisfying A1) and a marginal:
P_{Y|X}(A|x) = ∫_A ρ(y|x) µ₂(dy),   ρ(y|x) ≜ λ(x, y)/p(x),   (2.7)
P_X(A) = ∫_A p(x) µ₁(dx),   p(x) ≜ ∫_Y λ(x, η) µ₂(dη),   (2.8)
with ρ(y|x) defined arbitrarily for those x for which p(x) = 0.

Remark 2.4. The first problem can also be resolved with the help of Doob’s version of the Radon–Nikodym theorem [Ç11, Chapter V.4, Theorem 4.44]: If the σ-algebra on Y is separable (satisfied whenever Y is a Polish space, for example) and P_{Y|X=x} ≪ Q_{Y|X=x}, then there exists a jointly measurable version of the Radon–Nikodym derivative
(x, y) ↦ (dP_{Y|X=x}/dQ_{Y|X=x})(y).

§ 3. Sufficient statistic. Continuity of divergence and mutual information

3.1 Sufficient statistics and data-processing


Definition 3.1 (Sufficient Statistic). Let

• PXθ be a collection of distributions of X parameterized by θ

• PT |X be some probability kernel. Let PTθ , PT |X ◦ PXθ be the induced distribution on T for
each θ.

We say that T is a sufficient statistic (s.s.) of X for θ if there exists a transition probability kernel
PX|T so that PXθ PT |X = PTθ PX|T , i.e., PX|T can be chosen to not depend on θ.

Note:

• With T known, one can forget X (T contains all the information that is sufficient to make inferences about θ). This is because X can be simulated from T alone without knowing θ, hence keeping X provides no extra information.

• Obviously any one-to-one transformation of X is sufficient. Therefore the interesting case is


when T is a low-dimensional recap of X (dimensionality reduction)

• θ need not be a random variable (the definition does not involve any distribution on θ)

Theorem 3.1. Let θ → X → T . Then the following are equivalent

1. T is a s.s. of X for θ.

2. ∀Pθ , θ → T → X.

3. ∀Pθ , I(θ; X|T ) = 0.

4. ∀Pθ , I(θ; X) = I(θ; T ), i.e., data processing inequality for M.I. holds with equality.

Proof. Apply Theorem 2.5.

Theorem 3.2 (Fisher’s factorization criterion). For all θ ∈ Θ, let PXθ have a density pθ with respect
to a measure µ (e.g., discrete – pmf, continuous – pdf ). Let T = T (X) be a deterministic function
of X. Then T is a s.s. of X for θ iff

pθ (x) = gθ (T (x))h(x)

for some measurable functions gθ and h, ∀θ ∈ Θ.

34
Proof. We only give the proof in the discrete case (in the continuous case, replace Σ by ∫ dµ). Let t = T(x).
“⇒”: Suppose T is a s.s. of X for θ. Then
p_θ(x) = P_θ(X = x) = P_θ(X = x, T = t) = P_θ(X = x|T = t) P_θ(T = t) = P(X = x|T = T(x)) · P_θ(T = T(x)),
where the first factor serves as h(x) (free of θ) and the second as g_θ(T(x)).
“⇐”: Suppose the factorization holds. Then
P_θ(X = x|T = t) = p_θ(x) / Σ_{x′} 1{T(x′)=t} p_θ(x′) = g_θ(t)h(x) / Σ_{x′} 1{T(x′)=t} g_θ(t)h(x′) = h(x) / Σ_{x′} 1{T(x′)=t} h(x′),
which is free of θ.

Example:
1. Normal mean model. Let θ ∈ R and let the observations X_i ∼ N(θ, 1) be independent, i ∈ [n]. Then the sample mean X̄ = (1/n) Σ_j X_j is a s.s. of X^n for θ. [Exercise: verify that P^θ_{X^n} factorizes.]
2. Coin flips. Let B_i be i.i.d. ∼ Bern(θ). Then Σ_{i=1}^n B_i is a s.s. of B^n for θ (see the numerical check after this list).
3. Uniform distribution. Let U_i be i.i.d. ∼ uniform[0, θ]. Then max_{i∈[n]} U_i is a s.s. of U^n for θ.
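For the coin-flip example, Theorem 3.1(4) can be verified numerically. The snippet below is our own illustration and uses an arbitrary two-point prior on θ that we chose; it checks I(θ; B^n) = I(θ; Σ B_i) exactly for a small n.

```python
import itertools
import numpy as np
from math import comb

def mi(pxy):
    """Mutual information (bits) of a joint pmf matrix."""
    px = pxy.sum(axis=1, keepdims=True); py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log2(pxy[mask]/(px @ py)[mask])).sum())

n = 5
thetas, prior = [0.3, 0.7], [0.5, 0.5]      # two-point prior on theta (our choice)

# Joint pmf of (theta, B^n): rows = theta, columns = binary strings of length n
P_full = np.array([[prior[i] * th**sum(b) * (1-th)**(n-sum(b))
                    for b in itertools.product([0, 1], repeat=n)]
                   for i, th in enumerate(thetas)])

# Joint pmf of (theta, T) with T = sum of the B_i
P_T = np.array([[prior[i] * comb(n, t) * th**t * (1-th)**(n-t)
                 for t in range(n+1)]
                for i, th in enumerate(thetas)])

print(f"I(theta; B^n) = {mi(P_full):.6f} bits")
print(f"I(theta; T)   = {mi(P_T):.6f} bits")
```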

Example: Binary hypothesis testing. Θ = {0, 1}. Given θ = 0 or 1, X ∼ P_X or Q_X. Then Y – the output of P_{Y|X} – is a s.s. of X for θ iff D(P_{X|Y}‖Q_{X|Y}|P_Y) = 0, i.e., P_{X|Y} = Q_{X|Y} holds P_Y-a.s.
Indeed, the latter means that for kernel QX|Y we have

PX PY |X = PY QX|Y and QX PY |X = QY QX|Y ,

which is precisely the definition of s.s. when θ ∈ {0, 1}. This example explains the condition for
equality in the data-processing for divergence:

[Diagram: X, distributed as P_X or Q_X, is passed through the channel P_{Y|X}, producing Y distributed as P_Y or Q_Y, respectively.]
Then assuming D(PY kQY ) < ∞ we have:

D(PX kQX ) = D(PY kQY ) ⇐⇒ Y is a s.s. for testing PX vs. QX

Proof. Let Q_{XY} = Q_X P_{Y|X} and P_{XY} = P_X P_{Y|X}. Then
D(P_X‖Q_X) = D(P_{XY}‖Q_{XY})   (since D(P_{Y|X}‖Q_{Y|X}|P_X) = 0, both joints sharing the kernel P_{Y|X})
           = D(P_{X|Y}‖Q_{X|Y}|P_Y) + D(P_Y‖Q_Y)
           ≥ D(P_Y‖Q_Y),
with equality iff D(P_{X|Y}‖Q_{X|Y}|P_Y) = 0, which is equivalent to Y being a s.s. for testing P_X vs Q_X, as desired.

3.2 Geometric interpretation of mutual information
Mutual information can be understood as the “weighted distance” from the conditional distribution
to the output distribution. For example, for discrete X:
I(X; Y) = D(P_{Y|X}‖P_Y|P_X) = Σ_x P_X(x) D(P_{Y|X=x}‖P_Y).

Theorem 3.3 (Golden formula). ∀QY such that D(PY kQY ) < ∞

I(X; Y ) = D(PY |X kQY |PX ) − D(PY kQY )


Proof. I(X; Y) = E log[ (P_{Y|X}/Q_Y) · (Q_Y/P_Y) ]; group P_{Y|X} with Q_Y and take expectations of the two factors separately.

Corollary 3.1 (Mutual information as center of gravity).

I(X; Y) = min_Q D(P_{Y|X}‖Q|P_X),
achieved at Q = P_Y.
Note: This representation is useful to bound mutual information from above by choosing some Q.
Theorem 3.4 (mutual information as distance to product distributions).

I(X; Y) = min_{Q_X, Q_Y} D(P_{XY}‖Q_X Q_Y)

Proof. Write I(X; Y) = E log[ (P_{XY}/(Q_X Q_Y)) · (Q_X Q_Y/(P_X P_Y)) ]; group P_{XY} with Q_X Q_Y and lower-bound the marginal divergences D(P_X‖Q_X) and D(P_Y‖Q_Y) by zero.

Note: Generalization to conditional mutual information.

I(X; Z|Y) = min_{Q_{XYZ}: X→Y→Z} D(P_{XYZ}‖Q_{XYZ})

Proof. By chain rule,

D(P_{XYZ}‖Q_X Q_{Y|X} Q_{Z|Y})
 = D(P_{XYZ}‖P_X P_{Y|X} P_{Z|Y}) + D(P_X‖Q_X) + D(P_{Y|X}‖Q_{Y|X}|P_X) + D(P_{Z|Y}‖Q_{Z|Y}|P_Y)
 = D(P_{XYZ}‖P_Y P_{X|Y} P_{Z|Y}) + . . .
 = D(P_{XZ|Y}‖P_{X|Y} P_{Z|Y}|P_Y) + . . . ,
where the first term in the last line equals I(X; Z|Y).

Interpretation: The most general graphical model for the triplet (X, Y, Z) is a 3-clique (triangle).
What is the information flow on the edge X → Z? To answer, notice that removing this edge
restricts possible joint distributions to a Markov chain X → Y → Z. Thus, it is natural to ask what
is the minimum distance between a given PX,Y,Z and the set of all distributions QX,Y,Z satisfying
the Markov chain constraint. By the above calculation, optimal QX,Y,Z = PY PX|Y PZ|Y and hence
the distance is I(X; Z|Y ). It is natural to interpret this number as the information flowing on the
edge X → Z.

3.3 Variational characterizations of divergence:
Donsker-Varadhan
Why variational characterization (sup- or inf-representation): F (x) = supλ∈Λ fλ (x)

1. Regularity, e.g., recall

a) Pointwise supremum of convex functions is convex


b) Pointwise supremum of lower semicontinuous (lsc) functions is lsc

2. Give bounds by choosing a (suboptimal) λ

Theorem 3.5 (Donsker-Varadhan). Let P, Q be probability measures on X and let C denote the
set of functions f : X → R such that EQ [exp{f (X)}] < ∞. If D(P kQ) < ∞ then for every f ∈ C
expectation EP [f (X)] exists and furthermore

D(P‖Q) = sup_{f∈C} { E_P[f(X)] − log E_Q[exp{f(X)}] }.   (3.1)

Proof. “≤”: take f = log(dP/dQ).
“≥”: Fix f ∈ C and define a probability measure Q_f (a tilted version of Q) via Q_f(dx) ≜ exp{f(x)}Q(dx) / ∫_X exp{f(x)}Q(dx), or equivalently,
Q_f(dx) = exp{f(x) − Z_f} Q(dx),   Z_f ≜ log E_Q[exp{f(X)}].
Then, obviously, Q_f ≪ Q and we have
E_P[f(X)] − Z_f = E_P[ log(dQ_f/dQ) ] = E_P[ log( (dP/dQ)·(dQ_f/dP) ) ] = D(P‖Q) − D(P‖Q_f) ≤ D(P‖Q).

Remark 3.1. 1. What is Donsker-Varadhan good for? By setting f(x) = ε·g(x) with ε ≪ 1 and linearizing exp and log, we can see that when D(P‖Q) is small, expectations under P can be approximated by expectations under Q (change of measure): E_P[g(X)] ≈ E_Q[g(X)]. This holds for all functions g with finite exponential moment under Q. Total variation distance provides a similar bound, but for a narrower class of bounded functions:
|E_P[g(X)] − E_Q[g(X)]| ≤ ‖g‖_∞ TV(P, Q).

2. More formally, inequality EP [f (X)] ≤ log EQ [exp f (X)] + D(P kQ) is useful in estimating
EP [f (X)] for complicated distribution P (e.g. over large-dimensional vector X n with lots of
weak inter-coordinate dependencies) by making a smart choice of Q (e.g. with iid components).

3. In the next lecture we will show that P 7→ D(P kQ) is convex. A general method of obtaining
variational formulas like (3.1) is by Young-Fenchel duality. Indeed, (3.1) is exactly this
inequality since the Fenchel-Legendre conjugate of D(·kQ) is given by a convex map f 7→ Zf .
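As a toy illustration of (3.1) (our own sketch, not from the notes): for P = N(1, 1) and Q = N(0, 1) the true divergence is 1/2 nat, and restricting the supremum to affine f(x) = a + bx already attains it, since the optimizer log(dP/dQ) = x − 1/2 (in nats) is itself affine. The code below estimates the Donsker-Varadhan objective by Monte Carlo and does a small grid search over (a, b); up to sampling error the best value never exceeds 1/2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000
xP = rng.normal(1.0, 1.0, n)     # samples from P = N(1, 1)
xQ = rng.normal(0.0, 1.0, n)     # samples from Q = N(0, 1)

def dv_bound(a, b):
    """Donsker-Varadhan objective E_P[f] - log E_Q[exp f] for f(x) = a + b*x (nats)."""
    return (a + b * xP).mean() - np.log(np.exp(a + b * xQ).mean())

grid = np.linspace(-2.0, 2.0, 41)
best = max(dv_bound(a, b) for a in grid for b in grid)

print(f"best affine DV lower bound : {best:.4f} nats")
print(f"true D(P||Q)               : {0.5:.4f} nats")
```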
Theorem 3.6 (Weak lower-semicontinuity of divergence). Let X be a metric space with Borel
σ-algebra H. If Pn and Qn converge weakly (in distribution) to P , Q, then

D(P kQ) ≤ lim inf D(Pn kQn ) . (3.2)


n→∞

Proof. First method: On a metric space X bounded continuous functions (Cb ) are dense in the set
of all integrable functions. Then in Donsker-Varadhan (3.1) we can replace C by Cb to get
D(Pn kQn ) = sup EPn [f (X)] − log EQn [exp{f (X)}] .
f ∈Cb

Recall Pn → P weakly if and only if EPn f (X) → EP f (X) for all f ∈ Cb . Taking the limit concludes
the proof.
Second method (less mysterious): Let A be the algebra of Borel sets E whose boundary has
zero (P + Q) measure, i.e.
A = {E ∈ H : (P + Q)(∂E) = 0} .
By the property of weak convergence Pn and Qn converge pointwise on A. Thus by (3.10) we have
D(PA kQA ) ≤ lim D(Pn,A kQn,A )
n→∞

If we show A is (P + Q)-dense in H, we are done by (3.9). To get an idea, consider X = R. Then


open sets are (P + Q)-dense in H (since finite measures are regular), while the algebra F generated
by open intervals is (P + Q)-dense in the open sets. Since there are at most countably many points
a ∈ X with P (a) + Q(a) > 0, we may further approximate each interval (a, b) whose boundary has
non-zero (P + Q) measure by a slightly larger interval from A.
Note: In general, D(P‖Q) is not continuous in either P or Q. Example: Let B₁, . . . , B_n be i.i.d. uniform on {±1}. Then, by the central limit theorem, S_n = (1/√n) Σ_{i=1}^n B_i converges in distribution to N(0, 1). But
D(P_{S_n} ‖ N(0, 1)) = ∞
for all n, since P_{S_n} is discrete while N(0, 1) is continuous. Note that this is an example of strict inequality in (3.2).
Note: Why do we care about continuity of information measures? Let’s take divergence as an
example.
1. Computation. For complicated P and Q direct computation of D(P kQ) might be hard. Instead,
one may want to discretize them and compute numerically. Question: Is this procedure stable,
i.e., as the quantization becomes finer, does this procedure guarantee to converge to the true
value? Yes! Continuity w.r.t. discretization is guaranteed by the next theorem.
2. Estimating information measures. In many statistical setups, oftentimes we do not know P or
Q, if we estimate the distribution from data (e.g., estimate P by empirical distribution P̂n
from n samples) and then plug in, does D(P̂n kQ) provide a good estimator for D(P kQ)? Well,
note from the first example that this is a bad idea if Q is continuous, since D(P̂n kQ) = ∞
for all n. In fact, if one convolves the empirical distribution with a tiny bit of, say, Gaussian
distribution, then it will always have a density. If we allow the variance of the Gaussian to
vanish with n appropriately, we will have convergence. This leads to the idea of kernel density
estimators. All these need regularity properties of divergence.

3.4 Variational characterizations of divergence:


Gelfand-Yaglom-Perez
The point of the following theorem is that divergence on general alphabets can be defined via
divergence on finite alphabets and discretization. Moreover, as the quantization becomes finer, we
approach the value of divergence.

38
Theorem 3.7 (Gelfand-Yaglom-Perez [GKY56]). Let P, Q be two probability measures on X with σ-algebra F. Then
D(P‖Q) = sup_{{E₁,...,E_n}} Σ_{i=1}^n P[E_i] log( P[E_i]/Q[E_i] ),   (3.3)
where the supremum is over all finite F-measurable partitions: ∪_{j=1}^n E_j = X, E_j ∩ E_i = ∅ for i ≠ j, and 0 log(0/q) = 0 and p log(p/0) = ∞ per our usual convention.
Remark 3.2. This theorem, in particular, allows us to prove all general identities and inequalities
for the cases of discrete random variables.
Proof. “≥”: Fix a finite partition E1 , . . . En . Define a function (quantizer/discretizer) f : X →
{1, . . . , n} as follows: For any x, let f (x) denote the index j of the set Ej to which X belongs. Let
X be distributed according to either P or Q and set Y = f (X). Applying data processing inequality
for divergence yields

D(P‖Q) = D(P_X‖Q_X) ≥ D(P_Y‖Q_Y)   (3.4)
       = Σ_i P[E_i] log( P[E_i]/Q[E_i] ).

“≤”: To show D(P‖Q) is indeed achievable, first note that if P ≪̸ Q, then by definition there exists B such that Q(B) = 0 < P(B). Choosing the partition E₁ = B and E₂ = Bᶜ, we have D(P‖Q) = ∞ = Σ_{i=1}^2 P[E_i] log(P[E_i]/Q[E_i]). In the sequel we assume that P ≪ Q, hence the likelihood ratio dP/dQ is well-defined. Let us define a partition of X by partitioning the range of log(dP/dQ): E_j = {x : log(dP/dQ) ∈ ε·[j − n/2, j + 1 − n/2)} for j = 1, . . . , n − 1, and E_n = {x : log(dP/dQ) < ε(1 − n/2) or log(dP/dQ) ≥ εn/2}. Note that on E_j we have log(dP/dQ) ≤ ε(j + 1 − n/2) ≤ log(P(E_j)/Q(E_j)) + ε. Hence
Σ_{j=1}^{n−1} ∫_{E_j} dP log(dP/dQ) ≤ Σ_{j=1}^{n−1} [ εP(E_j) + P(E_j) log(P(E_j)/Q(E_j)) ] ≤ ε + Σ_{j=1}^{n} P(E_j) log(P(E_j)/Q(E_j)) + P(E_n) log(1/P(E_n)).
In other words,
Σ_{j=1}^{n} P(E_j) log(P(E_j)/Q(E_j)) ≥ ∫_{E_nᶜ} dP log(dP/dQ) − ε − P(E_n) log(1/P(E_n)).
Let n → ∞ and ε → 0 be such that εn → ∞ (e.g., ε = 1/√n). The proof is complete by noting that P(E_n) → 0 and
∫ 1{|log(dP/dQ)| ≤ εn} dP log(dP/dQ) → ∫ dP log(dP/dQ) = D(P‖Q)   as n → ∞.
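The convergence of the quantized divergence to the true value is easy to see numerically. The snippet below is our own illustration (not from the notes): for P = N(1, 1) and Q = N(0, 1), D(P‖Q) = 1/2 nat, and the partition sum over n equal-width cells on [−8, 8] plus two tails increases towards it as n grows.

```python
import numpy as np
from math import erf, sqrt, log

def Phi(x, mu=0.0):
    """CDF of N(mu, 1)."""
    return 0.5 * (1 + erf((x - mu) / sqrt(2)))

def quantized_divergence(n_bins, L=8.0):
    """sum_i P[E_i] log(P[E_i]/Q[E_i]) for n_bins cells on [-L, L] plus two tail cells."""
    edges = np.concatenate(([-np.inf], np.linspace(-L, L, n_bins + 1), [np.inf]))
    total = 0.0
    for a, b in zip(edges[:-1], edges[1:]):
        p = Phi(b, mu=1.0) - Phi(a, mu=1.0)   # P = N(1, 1)
        q = Phi(b, mu=0.0) - Phi(a, mu=0.0)   # Q = N(0, 1)
        if p > 0 and q > 0:                   # skip numerically empty tail cells
            total += p * log(p / q)
    return total

for n in (2, 8, 32, 128, 512):
    print(f"{n:4d} bins: {quantized_divergence(n):.6f} nats (true D = 0.5)")
```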

3.5 Continuity of divergence. Dependence on σ-algebra.


For a finite alphabet X it is easy to establish the continuity of entropy and divergence:
Proposition 3.1. Let X be finite. Fix a distribution Q on X with Q(x) > 0 for all x ∈ X . Then
the map
P 7→ D(P kQ)
is continuous. In particular,
P 7→ H(P ) (3.5)
is continuous.
Footnote 1 (intuition for the proof of Theorem 3.7): The main idea is to note that the loss in the inequality (3.4) is in fact D(P_X‖Q_X) = D(P_Y‖Q_Y) + D(P_{X|Y}‖Q_{X|Y}|P_Y), and we want to show that the conditional divergence is small. Note that P_{X|Y=j} = P_{X|X∈E_j} and Q_{X|Y=j} = Q_{X|X∈E_j}. Hence dP_{X|Y=j}/dQ_{X|Y=j} = (dP/dQ)·(Q(E_j)/P(E_j))·1_{E_j}. Once we have partitioned the likelihood ratio sufficiently finely, these two conditional distributions are very close to each other.

Warning: Divergence is never continuous in the pair, even for finite alphabets, e.g., d(1/n ‖ 2^{−n}) ↛ 0.

Proof. Notice that
D(P‖Q) = Σ_x P(x) log( P(x)/Q(x) )
and each term is a continuous function of P(x).

Our next goal is to study continuity properties of divergence for general alphabets. First,
however, we need to understand dependence on the σ-algebra of the space. Indeed, divergence
D(P kQ) implicitly depends on the σ-algebra F defining the measurable space (X , F). To emphasize
the dependence on F we will write
D(PF kQF ) .
Shortly, we will prove that divergence is continuous under monotone limits:

Fn % F =⇒ D(PFn kQFn ) % D(PF kQF ) (3.6)


Fn & F =⇒ D(PFn kQFn ) & D(PF kQF ) (3.7)

For establishing the first result, it will be convenient to extend the definition of the divergence
D(PF kQF ) to any algebra of sets F and two positive additive set-functions P, Q on F. For this we
take (3.3) as the definition. Note that when F is not a σ-algebra or P, Q are not σ-additive, we do
not have Radon-Nikodym theorem and thus our original definition is not applicable.
Corollary 3.2 (Measure-theoretic properties of divergence). Let P, Q be probability measures on
the measurable space (X , H). Assume all algebras below are sub-algebras of H. Then:

• (Monotonicity) If F ⊆ G then

D(PF kQF ) ≤ D(PG kQG ) . (3.8)


• Let F₁ ⊆ F₂ ⊆ . . . be an increasing sequence of algebras and let F = ∪_n F_n be their limit; then
D(P_{F_n}‖Q_{F_n}) ↗ D(P_F‖Q_F).

• If F is (P + Q)-dense in G then2

D(PF kQF ) = D(PG kQG ) . (3.9)

• (Monotone convergence theorem) Let F₁ ⊆ F₂ ⊆ . . . be an increasing sequence of algebras and let F = ∨_n F_n be the σ-algebra generated by them; then
D(P_{F_n}‖Q_{F_n}) ↗ D(P_F‖Q_F).

In particular,
D(PX ∞ kQX ∞ ) = lim D(PX n kQX n ) .
n→∞
Footnote 2: F is µ-dense in G if for every E ∈ G and ε > 0 there exists E′ ∈ F such that µ[E∆E′] ≤ ε.

• (Lower-semicontinuity of divergence) If Pn → P and Qn → Q pointwise on the algebra F,
then3
D(PF kQF ) ≤ lim inf D(Pn,F kQn,F ) . (3.10)
n→∞

Proof. Straightforward applications of (3.3) and the observation that any algebra F is µ-dense in
the σ-algebra σ{F} it generates, for any µ on (X , H).4

Note: Pointwise convergence on H is weaker than convergence in total variation and stronger than
convergence in distribution (aka “weak convergence”). However, (3.10) can be extended to this
mode of convergence (see Theorem 3.6).
Finally, we address the continuity under the decreasing σ-algebra, i.e. (3.7).
Proposition 3.2. Let Fn & F be a sequence of decreasing σ-algebras and P, Q two probability
measures on F0 . If D(PF0 kQF0 ) < ∞ then we have

D(PFn kQFn ) & D(PF kQF ) (3.11)

The condition D(PF0 kQF0 ) < ∞ can not be dropped, cf. the example after (3.16).
Proof. Let X₋ₙ = dP_{F_n}/dQ_{F_n}. Since X₋ₙ = E_Q[ dP_{F_0}/dQ_{F_0} | F_n ], we have that (. . . , X₋₁, X₀) is a uniformly integrable martingale. By the martingale convergence theorem in reversed time, cf. [Ç11, Theorem 5.4.17], we have almost surely
X₋ₙ → X₋∞ ≜ dP_F/dQ_F.   (3.12)
We need to prove that
EQ [X−n log X−n ] → EQ [X−∞ log X−∞ ] .
We will do so by decomposing x log x as follows

x log x = x log+ x + x log− x ,

where log+ x = max(log x, 0) and log− x = min(log x, 0). Since x log− x is bounded, we have from
the bounded convergence theorem:

EQ [X−n log− X−n ] → EQ [X−∞ log− X−∞ ]

To prove a similar convergence for log+ we need to notice two things. First, the function

x 7→ x log+ x

is convex. Second, for any non-negative convex function φ s.t. E[φ(X0 )] < ∞ the collection
{Zn = φ(E[X0 |Fn ]), n ≥ 0} is uniformly integrable. Indeed, we have from Jensen’s inequality

P[Z_n > c] ≤ (1/c) E[φ(E[X₀|F_n])] ≤ E[φ(X₀)]/c
Footnote 3: P_n → P pointwise on some algebra F if ∀E ∈ F: P_n[E] → P[E].
Footnote 4: This may be shown by transfinite induction: to each ordinal ω associate an algebra F_ω generated by monotone limits of sets from F_{ω′} with ω′ < ω. Then σ{F} = F_{ω₀}, where ω₀ is the first ordinal for which F_ω is a monotone class. But F is µ-dense in each F_ω by transfinite induction.

and thus P[Zn > c] → 0 as c → ∞. Therefore, we have again by Jensen’s

E[Zn 1{Zn > c}] ≤ E[φ(X0 )1{Zn > c}] → 0 c → ∞.

Finally, since X₋ₙ log⁺ X₋ₙ is uniformly integrable, we have from (3.12)
E_Q[X₋ₙ log⁺ X₋ₙ] → E_Q[X₋∞ log⁺ X₋∞]

and this concludes the proof.

3.6 Variational characterizations and continuity of mutual


information
Again, similarly to Proposition 3.1, it is easy to show that in the case of finite alphabets mutual
information is continuous in the distribution:
Proposition 3.3. Let X and Y be finite alphabets. Then

PX,Y 7→ I(X; Y )

is continuous.

Proof. Apply representation

I(X; Y ) = H(X) + H(Y ) − H(X, Y )

and (3.5).

Further properties of mutual information follow from I(X; Y ) = D(PXY kPX PY ) and correspond-
ing properties of divergence, e.g.

1.
I(X; Y ) = sup E[f (X, Y )] − log E[exp{f (X, Ȳ )}] ,
f

where Ȳ is a copy of Y , independent of X and supremum is over bounded, or even bounded


continuous functions.
d
2. If (Xn , Yn ) → (X, Y ) converge in distribution, then

I(X; Y ) ≤ lim inf I(Xn ; Yn ) . (3.13)


n→∞

• Good example of strict inequality: X_n = Y_n = (1/n)Z. In this case (X_n, Y_n) → (0, 0) in distribution, but I(X_n; Y_n) = H(Z) > 0 = I(0; 0).
• Even crazier example: Let (Xp , Yp ) be uniformly distributed on the unit `p -ball on the
d
plane: {x, y : |x|p + |y|p ≤ 1}. Then as p → 0, (Xp , Yp ) → (0, 0), but I(Xp ; Yp ) → ∞!
(Homework)

3. I(X; Y) = sup_{{E_i}×{F_j}} Σ_{i,j} P_{XY}[E_i × F_j] log( P_{XY}[E_i × F_j] / (P_X[E_i] P_Y[F_j]) ),
where the supremum is over finite partitions of the spaces X and Y.⁵

4. (Monotone convergence I):

I(X ∞ ; Y ) = lim I(X n ; Y ) (3.14)


n→∞
I(X ∞ ; Y ∞ ) = lim I(X n ; Y n ) (3.15)
n→∞

This implies that all mutual information between two-processes X ∞ and Y ∞ is contained in
their finite-dimensional projections, leaving nothing for the tail σ-algebra.
5. (Monotone convergence II): Let X_tail be a random variable such that σ(X_tail) = ∩_{n≥1} σ(X_n^∞). Then
I(X_tail; Y) = lim_{n→∞} I(X_n^∞; Y),   (3.16)

whenever the right-hand side is finite. This is a consequence of Prop. 3.2. Without the finiteness
i.i.d.
assumption the statement is incorrect. Indeed, consider Xj ∼ Bern(1/2) and Y = X0∞ . Then
each I(Xn∞ ; Y ) = ∞, but Xtail = const a.e. by Kolmogorov’s 0-1 law, and thus the left-hand
side of (3.16) is zero.

5
To prove this from (3.3) one needs to notice that algebra of measurable rectangles is dense in the product
σ-algebra.

§ 4. Extremization of mutual information: capacity saddle point

4.1 Convexity of information measures


Theorem 4.1. (P, Q) 7→ D(P kQ) is convex.
Proof. First proof : Let X ∼ Bern(λ). Define two conditional kernels:
PY |X=0 = P0 , PY |X=1 = P1
QY |X=0 = Q0 , QY |X=1 = Q1
Conditioning increases divergence, hence
λ̄D(P0 kQ0 ) + λD(P1 kQ1 ) = D(PY |X kQY |X |PX ) ≥ D(PY kQY ) = D(λ̄P0 + λP1 kλ̄Q0 + λQ1 ).
Second proof: (p, q) ↦ p log(p/q) is convex on R²₊ [verify by computing the Hessian matrix and showing that it is positive semidefinite].¹
Third proof : By the Donsker-Varadhan variational representation,
D(P kQ) = sup EP [f (X)] − log EQ [exp{f (X)}] .
f ∈C

where for fixed f , P 7→ EP [f (X)] is affine, Q 7→ log EQ [exp{f (X)}] is concave. Therefore (P, Q) 7→
D(P kQ) is a pointwise supremum of convex functions, hence convex.
Remark 4.1. The first proof shows that for an arbitrary measure of similarity D(P kQ) convexity
of (P, Q) 7→ D(P kQ) is equivalent to “conditioning increases divergence” property of D. Convexity
can also be understood as “mixing decreases divergence”.
Remark 4.2 (f -divergences). Any f -divergence, cf. (1.16), satisfies all the key properties of the
usual divergence: positivity, monotonicity, data processing (DP), conditioning increases divergence
(CID) and convexity in the pair. Indeed, by previous remark the last two are equivalent. Furthermore,
proof of Theorem 2.2 showed that DP and CID are implied by monotonicity. Thus, consider PXY
and QXY and note
  
D_f(P_{XY}‖Q_{XY}) = E_{Q_{XY}}[ f( P_{XY}/Q_{XY} ) ]   (4.1)
                  = E_{Q_Y}[ E_{Q_{X|Y}}[ f( (P_Y/Q_Y)·(P_{X|Y}/Q_{X|Y}) ) ] ]   (4.2)
                  ≥ E_{Q_Y}[ f( P_Y/Q_Y ) ],   (4.3)
where the inequality follows by applying Jensen’s inequality to the convex function f. Finally, positivity follows from Jensen’s inequality, recalling that f(1) = 0 by assumption and that E_{Q_Y}[P_Y/Q_Y] = 1.
 
Footnote 1: This is a general phenomenon: for a convex f(·) the perspective function (p, q) ↦ q f(p/q) is convex too.

Theorem 4.2 (Entropy). PX 7→ H(PX ) is concave.

Proof. If P_X is on a finite alphabet, then the proof is complete by H(X) = log|X| − D(P_X‖U_X). Otherwise, set P_{X|Y=0} = P₀ and P_{X|Y=1} = P₁ with P(Y = 0) = λ, and apply H(X|Y) ≤ H(X).

Recall that I(X, Y ) is a function of PXY , or equivalently, (PX , PY |X ). Denote I(PX , PY |X ) =


I(X; Y ).
Theorem 4.3 (Mutual Information).

• For fixed PY |X , PX 7→ I(PX , PY |X ) is concave.

• For fixed PX , PY |X 7→ I(PX , PY |X ) is convex.

Proof.

• First proof: Introduce θ ∼ Bern(λ). Define P_{X|θ=0} = P_X⁰ and P_{X|θ=1} = P_X¹. Then θ → X → Y and P_X = λ̄P_X⁰ + λP_X¹. Now I(X; Y) = I(X, θ; Y) = I(θ; Y) + I(X; Y|θ) ≥ I(X; Y|θ), which is our desired I(λ̄P_X⁰ + λP_X¹, P_{Y|X}) ≥ λ̄I(P_X⁰, P_{Y|X}) + λI(P_X¹, P_{Y|X}).
Second proof : I(X; Y ) = minQ D(PY |X kQ|PX ) – pointwise minimum of affine functions is
concave.
Third proof : Pick a Q and use the golden formula: I(X; Y ) = D(PY |X kQ|PX ) − D(PY kQ),
where PX 7→ D(PY kQ) is convex, as the composition of the PX 7→ PY (affine) and PY 7→
D(PY kQ) (convex).

• I(X; Y ) = D(PY |X kPY |PX ) and D is convex in the pair.

4.2* Local behavior of divergence


Due to the smoothness of the function (p, q) ↦ p log(p/q) at (1, 1), it is natural to expect that the
functional
P 7→ D(P kQ)
should also be smooth as P → Q. Due to non-negativity and convexity, it is then also natural to
expect that this functional decays quadratically. Next, we show that the decay is always sublinear;
furthermore, under the assumption that χ2 (P kQ) < ∞ it is indeed quadratic.
Proposition 4.1. When D(P‖Q) < ∞, the one-sided derivative at λ = 0 vanishes:
(d/dλ) D(λP + λ̄Q‖Q) |_{λ=0} = 0.
If we exchange the arguments, the criterion is even simpler:
(d/dλ) D(Q‖λP + λ̄Q) |_{λ=0} = 0 ⇐⇒ P ≪ Q.   (4.4)

Proof. We have
(1/λ) D(λP + λ̄Q‖Q) = E_Q[ (1/λ)(λf + λ̄) log(λf + λ̄) ],
where f = dP/dQ. As λ → 0 the function under the expectation decreases to (f − 1) log e monotonically. Indeed, the function
λ ↦ g(λ) ≜ (λf + λ̄) log(λf + λ̄)
is convex and equals zero at λ = 0. Thus g(λ)/λ is increasing in λ. Moreover, by the convexity of x ↦ x log x,
(1/λ)(λf + λ̄) log(λf + λ̄) ≤ (1/λ)(λ f log f + λ̄ · 1 · log 1) = f log f,
and by assumption f log f is Q-integrable. Thus the Monotone Convergence Theorem applies.
To prove (4.4), first notice that if P ≪̸ Q then there is a set E with p = P[E] > 0 = Q[E]. Applying data processing for divergence to X ↦ 1_E(X), we get
D(Q‖λP + λ̄Q) ≥ d(0‖λp) = log( 1/(1 − λp) ),
and the derivative is non-zero. If P ≪ Q, then let f = dP/dQ and notice the simple inequalities
log λ̄ ≤ log(λ̄ + λf) ≤ λ(f − 1) log e.
Dividing by λ and assuming λ < 1/2, we get, for some absolute constants c₁, c₂,
|(1/λ) log(λ̄ + λf)| ≤ c₁ f + c₂.
Thus, by the dominated convergence theorem, we get
(1/λ) D(Q‖λP + λ̄Q) = −∫ dQ (1/λ) log(λ̄ + λf) → ∫ dQ (1 − f) log e = 0   as λ → 0.

Remark 4.3. More generally, under suitable technical conditions,
(d/dλ) D(λP + λ̄Q‖R) |_{λ=0} = E_P[ log(dQ/dR) ] − D(Q‖R)
and
(d/dλ) D(λ̄P₁ + λQ₁ ‖ λ̄P₀ + λQ₀) |_{λ=0} = E_{Q₁}[ log(dP₁/dP₀) ] − D(P₁‖P₀) + E_{P₁}[ 1 − dQ₀/dP₀ ] log e.

The message of Proposition 4.1 is that the function

λ 7→ D(λP + λ̄QkQ) ,

is o(λ) as λ → 0. In fact, in most cases it is quadratic in λ. To make a precise statement, we need


to define the concept of χ2 -divergence – a version of f -divergence (1.16):
χ²(P‖Q) ≜ ∫ dQ ( dP/dQ − 1 )².

This is a very popular measure of distance between P and Q, frequently used in statistics. It has
many important properties, but we will only mention that χ2 dominates KL-divergence:

D(P kQ) ≤ log(1 + χ2 (P kQ)) .

Our second result about local properties of KL-divergence is the following:


Proposition 4.2 (KL is locally χ2 -like). We have

lim inf_{λ→0} (1/λ²) D(λP + λ̄Q‖Q) = (log e / 2) χ²(P‖Q),   (4.5)
where both sides are finite or infinite simultaneously.

Proof. First, we assume that χ²(P‖Q) < ∞ and prove
D(λP + λ̄Q‖Q) = (λ² log e / 2) χ²(P‖Q) + o(λ²),   λ → 0.
To that end notice that D(P‖Q) = E_Q[ g(dP/dQ) ], where
g(x) ≜ x log x − (x − 1) log e.
Note that x ↦ g(x)/((x − 1)² log e) = ∫₀¹ s ds / (x(1 − s) + s) is decreasing in x on (0, ∞). Therefore
0 ≤ g(x) ≤ (x − 1)² log e,
and hence
0 ≤ (1/λ²) g(λ̄ + λ dP/dQ) ≤ (dP/dQ − 1)² log e.
By the dominated convergence theorem (which is applicable since χ²(P‖Q) < ∞) we have
lim_{λ→0} E_Q[ (1/λ²) g(λ̄ + λ dP/dQ) ] = (g″(1)/2) E_Q[ (dP/dQ − 1)² ] = (log e / 2) χ²(P‖Q).
Second, we show that unconditionally
lim inf_{λ→0} (1/λ²) D(λP + λ̄Q‖Q) ≥ (log e / 2) χ²(P‖Q).   (4.6)
Indeed, this follows from Fatou’s lemma:
lim inf_{λ→0} E_Q[ (1/λ²) g(λ̄ + λ dP/dQ) ] ≥ E_Q[ lim inf_{λ→0} (1/λ²) g(λ̄ + λ dP/dQ) ] = (log e / 2) χ²(P‖Q).

Therefore, from (4.6) we conclude that if χ2 (P kQ) = ∞ then so is the LHS of (4.5).
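A quick numerical check of (4.5) on a finite alphabet (our own illustration; the two pmfs below are arbitrary): working in nats, D(λP + λ̄Q‖Q)/λ² should approach χ²(P‖Q)/2 as λ → 0.

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.3, 0.3, 0.4])

def kl(p, q):
    """KL divergence in nats."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask]/q[mask])).sum())

chi2 = float(((P - Q)**2 / Q).sum())          # chi^2(P||Q)
for lam in (0.1, 0.03, 0.01, 0.003):
    mix = lam * P + (1 - lam) * Q
    print(f"lambda={lam:6.3f}: D/lambda^2 = {kl(mix, Q)/lam**2:.5f}")
print(f"chi^2(P||Q)/2 = {chi2/2:.5f}")
```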

4.3* Local behavior of divergence and Fisher information
Consider a parameterized set of distributions {Pθ , θ ∈ Θ} and assume Θ is an open subset of Rd .
Furthermore, suppose that the distributions Pθ are all given in the form
Pθ (dx) = f (x|θ)µ(dx) ,
where µ is some common dominating measure (e.g., Lebesgue or counting). If, for a fixed x, the functions θ ↦ f(x|θ) are smooth, one can define the Fisher information matrix with respect to the parameter θ as
 
JF (θ) , EX∼Pθ V V T , V , ∇θ log f (X|θ) . (4.7)
Under suitable regularity conditions, the Fisher information matrix has several equivalent expressions:
J_F(θ) = cov_{X∼P_θ}[ ∇_θ log f(X|θ) ]   (4.8)
       = (4 log e) ∫ µ(dx) (∇_θ √f(x|θ)) (∇_θ √f(x|θ))ᵀ   (4.9)
       = −(log e) E_θ[ Hess_θ(log f(X|θ)) ],   (4.10)


where the latter is obtained by differentiating
0 = ∫ µ(dx) f(x|θ) (∂/∂θᵢ) log f(x|θ)
in θⱼ.
The trace of this matrix is called the Fisher information and similarly can be expressed in a variety of forms:
tr J_F(θ) = ∫ µ(dx) ‖∇_θ f(x|θ)‖² / f(x|θ)   (4.11)
          = 4 ∫ µ(dx) ‖∇_θ √f(x|θ)‖²   (4.12)
          = −(log e) · E_{X∼P_θ}[ Σ_{i=1}^d (∂²/∂θᵢ²) log f(X|θ) ],   (4.13)

The significance of the Fisher information matrix arises from the fact that it gauges the local behaviour of divergence for smooth parametric families. Namely, we have (again under suitable technical conditions):²
D(P_{θ₀}‖P_{θ₀+ξ}) = (1/(2 log e)) ξᵀ J_F(θ₀) ξ + o(‖ξ‖²),   (4.14)
which is obtained by integrating the Taylor expansion
log f(x|θ₀ + ξ) = log f(x|θ₀) + ξᵀ ∇_θ log f(x|θ₀) + (1/2) ξᵀ Hess_θ(log f(x|θ₀)) ξ + o(‖ξ‖²).
Property (4.14) is of paramount importance in statistics. We should remember it as: Divergence
is locally quadratic on the parameter space, with Hessian given by the Fisher information matrix.
Footnote 2: To illustrate the subtlety here, consider a scalar location family, i.e. f(x|θ) = f₀(x − θ) with f₀ some density. In this case the Fisher information J_F(θ₀) = (log e)² ∫ (f₀′)²/f₀ does not depend on θ₀ and is well-defined even for compactly supported f₀, provided f₀′ vanishes at the endpoints sufficiently fast. But at the same time the left-hand side of (4.14) is infinite for any ξ > 0. In such cases, a better interpretation of Fisher information is as the coefficient of the expansion D(P_{θ₀} ‖ ½P_{θ₀} + ½P_{θ₀+ξ}) = (ξ²/(8 log e)) J_F + o(ξ²).

Remark 4.4. It can be seen that if one introduces another parametrization θ̃ ∈ Θ̃ by means of a
smooth invertible map Θ̃ → Θ, then Fisher information matrix changes as

JF (θ̃) = AT JF (θ)A , (4.15)

where A = dθ/dθ̃ is the Jacobian of the map. So we can see that J_F transforms similarly to the metric
tensor in Riemannian geometry. This idea can be used to define a Riemannian metric on the space
of parameters Θ, called the Fisher-Rao metric. This is explored in a field known as information
geometry [AN07].

Example: Consider Θ to be the interior of the simplex of all distributions on a finite alphabet {0, . . . , d}. We will take θ₁, . . . , θ_d as free parameters and set θ₀ = 1 − Σ_{i=1}^d θᵢ, so all derivatives are with respect to θ₁, . . . , θ_d only. Then we have
P_θ(x) = f(x|θ) = θ_x for x = 1, . . . , d,   and   f(0|θ) = 1 − Σ_{x≠0} θ_x,
and for the Fisher information matrix we get
J_F(θ) = (log² e) { diag(1/θ₁, . . . , 1/θ_d) + (1/(1 − Σ_{i=1}^d θᵢ)) 1·1ᵀ },   (4.16)
where 1·1ᵀ is the d × d matrix of all ones. For future reference, we also compute the determinant of J_F(θ). To that end notice that det(A + xyᵀ) = det A · det(I + A⁻¹xyᵀ) = det A · (1 + yᵀA⁻¹x), where we used the identity det(I + AB) = det(I + BA). Thus, we have
det J_F(θ) = (log e)^{2d} Π_{x=0}^d (1/θ_x) = (log e)^{2d} (1/(1 − Σ_{x=1}^d θ_x)) Π_{x=1}^d (1/θ_x).   (4.17)
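The local-quadratic property (4.14) is easy to verify numerically in the simplest case d = 1, i.e. the Bernoulli family, where (4.16) gives (working in nats, so the log e factors disappear) J_F(θ) = 1/θ + 1/(1 − θ). The snippet below is our own illustration:

```python
import numpy as np

def kl_bern(a, b):
    """D(Bern(a) || Bern(b)) in nats."""
    return a*np.log(a/b) + (1-a)*np.log((1-a)/(1-b))

theta = 0.3
JF = 1/theta + 1/(1-theta)   # Fisher information of Bern(theta), eq. (4.16) with d = 1, in nats

for xi in (0.1, 0.03, 0.01, 0.003):
    approx = 0.5 * JF * xi**2
    exact = kl_bern(theta, theta + xi)
    print(f"xi={xi:6.3f}: D = {exact:.3e},  (1/2) J_F xi^2 = {approx:.3e}")
```

The ratio of the two printed columns tends to 1 as ξ → 0, as (4.14) predicts.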

4.4 Extremization of mutual information


Two problems of interest:
• Fix P_{Y|X} → max_{P_X} I(X; Y) — channel coding (Part IV).
  Note: This maximum is called the “capacity” of the set of distributions {P_{Y|X=x}, x ∈ X}.
• Fix P_X → min_{P_{Y|X}} I(X; Y) — lossy compression (Part V).

Theorem 4.4 (Saddle point). Let P be a convex set of distributions on X . Suppose there exists
PX∗ ∈ P such that
sup_{P_X∈P} I(P_X, P_{Y|X}) = I(P_X*, P_{Y|X}) ≜ C,
and let P_Y* be the output distribution induced by P_X* through P_{Y|X}. Then for all P_X ∈ P and for all Q_Y, we have

D(PY |X kPY∗ |PX ) ≤ D(PY |X kPY∗ |PX∗ ) ≤ D(PY |X kQY |PX∗ ). (4.18)

Note: PX∗ (resp., PY∗ ) is called a capacity-achieving input (resp., output) distribution, or a caid
(resp., the caod ).

Proof. Right inequality: obvious from C = I(PX∗ , PY |X ) = minQY D(PY |X kQY |PX∗ ).
Left inequality: If C = ∞, then trivial. In the sequel assume that C < ∞, hence I(PX , PY |X ) <
∞ for all P_X ∈ P. Let P_X^λ = λP_X + λ̄P_X* ∈ P by convexity of P, and introduce θ ∼ Bern(λ), so
that PXλ |θ=0 = PX∗ , PXλ |θ=1 = PX , and θ → Xλ → Yλ . Then

C ≥ I(X_λ; Y_λ) = I(θ, X_λ; Y_λ) = I(θ; Y_λ) + I(X_λ; Y_λ|θ)
  = D(P_{Y_λ|θ}‖P_{Y_λ}|P_θ) + λI(P_X, P_{Y|X}) + λ̄C
  = λD(P_Y‖P_{Y_λ}) + λ̄D(P_Y*‖P_{Y_λ}) + λI(P_X, P_{Y|X}) + λ̄C
  ≥ λD(P_Y‖P_{Y_λ}) + λI(P_X, P_{Y|X}) + λ̄C.

Since I(PX , PY |X ) < ∞, we can subtract it to obtain

λ(C − I(PX , PY |X )) ≥ λD(PY kPYλ ).

Dividing both sides by λ, taking the lim inf and using lower semicontinuity of D, we have

C − I(PX , PY |X ) ≥ lim inf D(PY kPYλ ) ≥ D(PY kPY∗ )


λ→0
=⇒ C ≥ I(PX , PY |X ) + D(PY kPY∗ ) = D(PY |X kPY |PX ) + D(PY kPY∗ ) = D(PY |X kPY∗ |PX ).

Here is an even shorter proof:

C ≥ I(Xλ ; Yλ ) = D(PY |X kPYλ |PXλ )


= λD(PY |X kPYλ |PX ) + λ̄D(PY |X kPYλ |PX∗ )
≥ λD(PY |X kPYλ |PX ) + λ̄C
= λD(PX,Y kPX PYλ ) + λ̄C ,

where inequality is by the right part of (4.18) (already shown). Thus, subtracting λ̄C and dividing
by λ we get
D(PX,Y kPX PYλ ) ≤ C
and the proof is completed by taking lim inf as λ → 0 and applying the lower semicontinuity of divergence.

Corollary 4.1. In addition to the assumptions of Theorem 4.4, suppose C < ∞. Then caod PY∗ is
unique. It satisfies the property that for any PY induced by some PX ∈ P (i.e. PY = PY |X ◦ PX ) we
have
D(PY kPY∗ ) ≤ C < ∞ (4.19)
and in particular P_Y ≪ P_Y*.

Proof. The statement is: I(PX , PY |X ) = C ⇒ PY = PY∗ . Indeed:

C = D(PY |X kPY |PX ) = D(PY |X kPY∗ |PX ) − D(PY kPY∗ )


≤ D(PY |X kPY∗ |PX∗ ) − D(PY kPY∗ )
= C − D(PY kPY∗ ) ⇒ PY = PY∗

Statement (4.19) follows from the left inequality in (4.18) and “conditioning increases divergence”.

Remark 4.5. • Finiteness of C is necessary. Counterexample: Consider the identity channel
Y = X, where X takes values on integers. Then any distribution with infinite entropy is caid
or caod.

• Non-uniqueness of caid. Unlike the caod, the caid need not be unique. Let Z₁ ∼ Bern(1/2). Consider Y₁ = X₁ ⊕ Z₁ and Y₂ = X₂. Then max_{P_{X₁X₂}} I(X₁, X₂; Y₁, Y₂) = log 2, achieved by P_{X₁X₂} = Bern(p) × Bern(1/2) for any p. Note that the caod is unique: P*_{Y₁Y₂} = Bern(1/2) × Bern(1/2).

Review: Minimax and saddle point

Suppose we have a bivariate function f. Then we always have the minimax inequality:

inf_y sup_x f(x, y) ≥ sup_x inf_y f(x, y).

When does it hold with equality?

1. It turns out minimax equality is implied by the existence of a saddle point (x*, y*), i.e.,
   f(x, y*) ≤ f(x*, y*) ≤ f(x*, y)   ∀x, y.
   Furthermore, minimax equality also implies the existence of a saddle point if the inf and sup are achieved for all x, y (cf. [BNO03, Section 2.6]) [straightforward to check; see the proof of the corollary below].

2. There are a number of known criteria establishing
   inf_y sup_x f(x, y) = sup_x inf_y f(x, y).
   They usually require some continuity of f, compactness of the domains, concavity in x and convexity in y. One of the most general versions is due to M. Sion [Sio58].

3. The mother result of all this minimax theory is a theorem of von Neumann on bilinear functions: Let A and B have finite alphabets, and let g(a, b) be arbitrary; then
   min_{P_A} max_{P_B} E[g(A, B)] = max_{P_B} min_{P_A} E[g(A, B)].
   Here (x, y) ↔ (P_A, P_B) and f(x, y) ↔ Σ_{a,b} P_A(a)P_B(b)g(a, b).

4. A more general version is: if X and Y are compact convex domains in Rⁿ and f(x, y) is continuous in (x, y), concave in x and convex in y, then
   max_{x∈X} min_{y∈Y} f(x, y) = min_{y∈Y} max_{x∈X} f(x, y).

Applying Theorem 4.4 to conditional divergence gives the following result.


Corollary 4.2 (Minimax). Under assumptions of Theorem 4.4, we have

max_{P_X∈P} I(X; Y) = max_{P_X∈P} min_{Q_Y} D(P_{Y|X}‖Q_Y|P_X)
                    = min_{Q_Y} max_{P_X∈P} D(P_{Y|X}‖Q_Y|P_X)

Proof. This follows from saddle-point trivially: Maximizing/minimizing the leftmost/rightmost
sides of (4.18) gives

min_{Q_Y} max_{P_X∈P} D(P_{Y|X}‖Q_Y|P_X) ≤ max_{P_X∈P} D(P_{Y|X}‖P_Y*|P_X) = D(P_{Y|X}‖P_Y*|P_X*)
  ≤ min_{Q_Y} D(P_{Y|X}‖Q_Y|P_X*) ≤ max_{P_X∈P} min_{Q_Y} D(P_{Y|X}‖Q_Y|P_X),
but by definition min max ≥ max min.

4.5 Capacity = information radius

Review: Radius and diameter

Let (X, d) be a metric space. Let A be a bounded subset.

1. Radius (aka Chebyshev radius) of A: the radius of the smallest ball that covers A,
i.e., rad (A) = inf y∈X supx∈A d(x, y).

2. Diameter of A: diam (A) = supx,y∈A d(x, y).

3. Note that the radius and the diameter both measure how big/rich a set is.

4. From definition and triangle inequality we have


(1/2) diam(A) ≤ rad(A) ≤ diam(A)

5. In fact, the rightmost upper bound can frequently be improved:

• A result of Bohnenblust [Boh38] shows that in Rⁿ equipped with any norm we always have rad(A) ≤ (n/(n+1)) diam(A).
• For Rⁿ with Euclidean distance, Jung proved rad(A) ≤ √(n/(2(n+1))) diam(A), attained by the simplex. The best constant is sometimes called the Jung constant of the space.
• For Rⁿ with the ℓ∞-norm the situation is even simpler: rad(A) = (1/2) diam(A); such spaces are called centrable.

The next simple corollary shows that capacity is just the radius of the set of distributions {PY |X=x , x ∈
X } when distances are measured by divergence (although, we remind, divergence is not a metric).
Corollary 4.3. For a fixed kernel P_{Y|X}, let P = {all distributions on X} with X finite. Then
max_{P_X} I(X; Y) = max_x D(P_{Y|X=x}‖P_Y*)
                  = D(P_{Y|X=x}‖P_Y*)   for every x such that P_X*(x) > 0.

The last corollary gives a geometric interpretation to capacity: it equals the radius of the smallest
divergence-“ball” that encompasses all distributions {PY |X=x : x ∈ X }. Moreover, PY∗ is a convex
combination of some PY |X=x and it is equidistant to those.
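This equidistance property can be verified numerically. One standard way to compute the caid/caod of a discrete channel is the Blahut–Arimoto iteration (a well-known algorithm that is not covered in these notes); the sketch below is our own, uses an arbitrary 3×3 channel matrix that we made up, and then checks that D(P_{Y|X=x}‖P_Y*) = C for every x with P_X*(x) > 0.

```python
import numpy as np

def blahut_arimoto(W, n_iter=2000):
    """Capacity (nats), caid, caod of a channel given by a row-stochastic matrix W[x, y]."""
    m = W.shape[0]
    p = np.full(m, 1.0/m)                        # start from the uniform input
    for _ in range(n_iter):
        q = p @ W                                # current output distribution
        d = np.array([np.sum(W[x]*np.log(W[x]/q)) for x in range(m)])  # D(P_{Y|X=x}||q)
        p = p * np.exp(d)
        p /= p.sum()
    q = p @ W
    d = np.array([np.sum(W[x]*np.log(W[x]/q)) for x in range(m)])
    return float(p @ d), p, q, d

W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])                  # an arbitrary example channel

C, p_star, q_star, dists = blahut_arimoto(W)
print(f"capacity C = {C:.6f} nats")
for x in range(3):
    print(f"x={x}: P_X*(x)={p_star[x]:.4f},  D(P_Y|X=x || P_Y*)={dists[x]:.6f}")
```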

The following is the information-theoretic version of “radius ≤ diameter” (in KL divergence) for
arbitrary input space:
Corollary 4.4. Let {P_{Y|X=x} : x ∈ X} be a set of distributions. Prove that
C = sup_{P_X} I(X; Y) ≤ inf_Q sup_{x∈X} D(P_{Y|X=x}‖Q) ≤ sup_{x,x′∈X} D(P_{Y|X=x}‖P_{Y|X=x′}),
where the middle quantity is the radius and the right-hand quantity is the diameter of the set.
Proof. By the golden formula (Corollary 3.1), we have
I(X; Y) = inf_Q D(P_{Y|X}‖Q|P_X) ≤ inf_Q sup_{x∈X} D(P_{Y|X=x}‖Q) ≤ inf_{x′∈X} sup_{x∈X} D(P_{Y|X=x}‖P_{Y|X=x′}).

4.6 Existence of caod (general case)


We have shown above that the solution to

C = sup I(X; Y )
PX ∈P

can be a) interpreted as a saddle point; b) written in the minimax form and c) that caod PY∗ is
unique. This was all done under the extra assumption that supremum over PX is attainable. It
turns out, properties b) and c) can be shown without that extra assumption.
Theorem 4.5 (Kemperman). For any PY |X and a convex set of distributions P such that

C = sup I(PX , PY |X ) < ∞ (4.20)


PX ∈P

there exists a unique PY∗ with the property that

C = sup D(PY |X kPY∗ |PX ) . (4.21)


PX ∈P

Furthermore,

C = sup min D(PY |X kQY |PX ) (4.22)


PX ∈P QY
= min sup D(PY |X kQY |PX ) (4.23)
QY PX ∈P

= min sup D(PY |X=x kQY ) , (if P = {all PX }.) (4.24)


QY x∈X

Note: Condition (4.20) is automatically satisfied if there is any QY such that

sup D(PY |X kQY |PX ) < ∞ . (4.25)


PX ∈P

Example: Non-existence of caid. Let Z ∼ N (0, 1) and consider the problem

C = sup_{P_X : E[X]=0, E[X²]=P, E[X⁴]=s} I(X; X + Z).   (4.26)

If we remove the constraint E[X⁴] = s, the unique caid is P_X = N(0, P), see Theorem 4.6. When s ≠ 3P², such P_X is no longer inside the constraint set P. However, for s > 3P² the maximum
C = (1/2) log(1 + P)
is still attainable. Indeed, we can add a small “bump” to the Gaussian distribution as follows:
P_X = (1 − p) N(0, P) + p δ_x,
where p → 0, px² → 0 but px⁴ → s − 3P² > 0. This shows that for the problem (4.26) with s > 3P² the caid does not exist, while the caod P_Y* = N(0, 1 + P) exists and is unique, as Theorem 4.5 postulates.
Proof of Theorem 4.5. Let P′_{X_n} be a sequence of input distributions achieving C, i.e., I(P′_{X_n}, P_{Y|X}) → C. Let P_n be the convex hull of {P′_{X_1}, . . . , P′_{X_n}}. Since P_n is a finite-dimensional simplex, the concave function P_X ↦ I(P_X, P_{Y|X}) attains its maximum at some point P_{X_n} ∈ P_n, i.e.,

In , I(PXn , PY |X ) = max I(PX , PY |X ) .


PX ∈Pn

Denote by P_{Y_n} the output distribution induced by P_{X_n}. We then have:

D(PYn kPYn+k ) = D(PY |X kPYn+k |PXn ) − D(PY |X kPYn |PXn ) (4.27)


≤ I(PXn+k , PY |X ) − I(PXn , PY |X ) (4.28)
≤ C − In , (4.29)

where in (4.28) we applied Theorem 4.4 to (Pn+k , PYn+k ). By the Pinsker-Csiszár inequality (1.15)
and since In % C, we conclude that the sequence PYn is Cauchy in total variation:

sup TV(PYn , PYn+k ) → 0 , n → ∞.


k≥1

Since the space of probability distributions is complete in total variation, the sequence must have a
limit point PYn → PY∗ . By taking a limit as k → ∞ in (4.29) and applying the lower semi-continuity
of divergence (Theorem 3.6) we get

D(PYn kPY∗ ) ≤ lim D(PYn kPYn+k ) ≤ C − In ,


k→∞

and therefore, PYn → PY∗ in the (stronger) sense of D(PYn kPY∗ ) → 0. By Theorem 3.3,

D(PY |X kPY∗ |PXn ) = In + D(PYn kPY∗ ) → C . (4.30)


Take any P_X ∈ ∪_{k≥1} P_k. Then P_X ∈ P_n for all sufficiently large n and thus by Theorem 4.4

D(PY |X kPYn |PX ) ≤ In ≤ C , (4.31)

which, by the lower semi-continuity of divergence, implies

D(PY |X kPY∗ |PX ) ≤ C . (4.32)

To prove that (4.32) holds for arbitrary PX ∈ P, we may repeat the argument above with Pn
replaced by P̃n = conv({PX } ∪ Pn ), denoting the resulting sequences by P̃Xn , P̃Yn and the limit
point by P̃Y∗ , and obtain:

D(PYn kP̃Yn ) = D(PY |X kP̃Yn |PXn ) − D(PY |X kPYn |PXn ) (4.33)


≤ C − In , (4.34)

where (4.34) follows from (4.32) since PXn ∈ P̃n . Hence taking limit as n → ∞ we have P̃Y∗ = PY∗
and therefore (4.32) holds.
Finally, to see (4.23), note that by definition capacity as a max-min is at most the min-max, i.e.,

C = sup min D(PY |X kQY |PX ) ≤ min sup D(PY |X kQY |PX ) ≤ sup D(PY |X kPY∗ |PX ) = C
PX ∈P QY QY PX ∈P PX ∈P

in view of (4.30) and (4.31).

Corollary 4.5. Let X be countable and P a convex set of distributions on X. If sup_{P_X∈P} H(X) < ∞, then
sup_{P_X∈P} H(X) = min_{Q_X} sup_{P_X∈P} Σ_x P_X(x) log( 1/Q_X(x) ) < ∞
and the optimizer Q_X* exists and is unique. If Q_X* ∈ P, then it is also the unique maximizer of H(X).

Proof. Just apply Kemperman’s Theorem 4.5 to the identity channel Y = X.


Example: Max entropy. Assume that f : Z → R is such that Σ_{n∈Z} exp{−λf(n)} < ∞ for all λ > 0. Then
max_{X: E[f(X)]≤a} H(X) ≤ inf_{λ>0} { λa + log Σ_n exp{−λf(n)} }.
This follows from taking Q(n) = c exp{−λf(n)}. This bound is often tight and is achieved by P_X(n) = c exp{−λ*f(n)} with λ* the minimizer, known as the Gibbs distribution for the energy function f.

4.7 Gaussian saddle point


For additive noise, there is also a different kind of saddle point between PX and the distribution of
noise:
Theorem 4.6. Let X_g ∼ N(0, σ²_X), N_g ∼ N(0, σ²_N), with X_g ⊥⊥ N_g. Then:

1. “Gaussian capacity”:
   C = I(X_g; X_g + N_g) = (1/2) log(1 + σ²_X/σ²_N).

2. “Gaussian input is the best for Gaussian noise”: For all X ⊥⊥ N_g with var X ≤ σ²_X,
   I(X; X + N_g) ≤ I(X_g; X_g + N_g),
   with equality iff X is distributed as X_g.

3. “Gaussian noise is the worst for Gaussian input”: For all N such that E[X_g N] = 0 and E[N²] ≤ σ²_N,
   I(X_g; X_g + N) ≥ I(X_g; X_g + N_g),
   with equality iff N is distributed as N_g and is independent of X_g.

Interpretations:

1. For the AWGN channel, Gaussian input is the most favorable. Indeed, immediately from the second statement we have
   max_{X: var X ≤ σ²_X} I(X; X + N_g) = (1/2) log(1 + σ²_X/σ²_N),
   which is the capacity formula for the AWGN channel.

2. For Gaussian source, additive Gaussian noise is the worst in the sense that it minimizes the
mutual information provided by the noisy version.

Proof. WLOG, assume all random variables have zero mean. Let Y_g = X_g + N_g. Define
f(x) = D(P_{Y_g|X_g=x}‖P_{Y_g}) = D(N(x, σ²_N)‖N(0, σ²_X + σ²_N)) = (1/2) log(1 + σ²_X/σ²_N) + (log e/2) · (x² − σ²_X)/(σ²_X + σ²_N),
where the first term equals C.

1. Compute I(Xg ; Xg + Ng ) = E[f (Xg )] = C

2. Recall the inf-representation (Corollary 3.1) I(X; Y ) = minQ D(PY |X kQ|PX ). Then

I(X; X + Ng ) ≤ D(PYg |Xg kPYg |PX ) = E[f (X)] ≤ C < ∞ .

Furthermore, if I(X; X + N_g) = C, then uniqueness of the caod, cf. Corollary 4.1, implies P_Y = P_{Y_g}. But P_Y = P_X ∗ N(0, σ²_N). Then it must be that X ∼ N(0, σ²_X), simply by considering characteristic functions:
Ψ_X(t) · e^{−σ²_N t²/2} = e^{−(σ²_X+σ²_N) t²/2}  ⇒  Ψ_X(t) = e^{−σ²_X t²/2}  ⇒  X ∼ N(0, σ²_X).

3. Let Y = X_g + N and let P_{Y|X_g} be the respective kernel. [Note that here we only assume that N is uncorrelated with X_g, i.e., E[N X_g] = 0, not necessarily independent.] Then
I(X_g; X_g + N) = D(P_{X_g|Y}‖P_{X_g}|P_Y)
  = D(P_{X_g|Y}‖P_{X_g|Y_g}|P_Y) + E_{X_g,Y} log[ P_{X_g|Y_g}(X_g|Y)/P_{X_g}(X_g) ]
  ≥ E log[ P_{X_g|Y_g}(X_g|Y)/P_{X_g}(X_g) ]   (4.35)
  = E log[ P_{Y_g|X_g}(Y|X_g)/P_{Y_g}(Y) ]   (4.36)
  = C + (log e/2) E[ Y²/(σ²_X + σ²_N) − N²/σ²_N ]   (4.37)
  = C + (log e/2) · (σ²_X/(σ²_X + σ²_N)) · (1 − E[N²]/σ²_N)   (4.38)
  ≥ C,   (4.39)
where
• (4.36): P_{X_g|Y_g}/P_{X_g} = P_{Y_g|X_g}/P_{Y_g};
• (4.38): E[X_g N] = 0 and E[Y²] = E[N²] + E[X_g²];
• (4.39): E[N²] ≤ σ²_N.

Finally, the conditions for equality in (4.35) say
D(P_{X_g|Y}‖P_{X_g|Y_g}|P_Y) = 0.
Thus, P_{X_g|Y} = P_{X_g|Y_g}, i.e., X_g is conditionally Gaussian: P_{X_g|Y=y} = N(by, c²) for some constants b and c. In other words, under P_{X_g Y}, we have
X_g = bY + cZ,   Z Gaussian, Z ⊥⊥ Y.
But then Y must be Gaussian itself, by Cramér’s theorem or simply by considering characteristic functions:
Ψ_Y(t) · e^{ct²} = e^{c′t²}  ⇒  Ψ_Y(t) = e^{c″t²}  ⇒  Y is Gaussian.
Therefore, (X_g, Y) must be jointly Gaussian and hence N = Y − X_g is Gaussian. Thus we conclude that it is only possible to attain I(X_g; X_g + N) = C if N is Gaussian of variance σ²_N and independent of X_g.

§ 5. Single-letterization. Probability of error. Entropy rate.

5.1 Extremization of mutual information for memoryless sources


and channels
Theorem 5.1 (Joint M.I. vs. marginal M.I.).
(1) If P_{Y^n|X^n} = Π P_{Y_i|X_i}, then
    I(X^n; Y^n) ≤ Σ I(X_i; Y_i)   (5.1)
    with equality iff P_{Y^n} = Π P_{Y_i}. Consequently,
    max_{P_{X^n}} I(X^n; Y^n) = Σ_{i=1}^n max_{P_{X_i}} I(X_i; Y_i).
(2) If X₁ ⊥⊥ . . . ⊥⊥ X_n, then
    I(X^n; Y^n) ≥ Σ I(X_i; Y_i)   (5.2)
    with equality iff P_{X^n|Y^n} = Π P_{X_i|Y_i} P_{Y^n}-almost surely.¹ Consequently,
    min_{P_{Y^n|X^n}} I(X^n; Y^n) = Σ_{i=1}^n min_{P_{Y_i|X_i}} I(X_i; Y_i).
Proof. (1) Use I(X^n; Y^n) − Σ I(X_j; Y_j) = D(P_{Y^n|X^n}‖Π P_{Y_i|X_i}|P_{X^n}) − D(P_{Y^n}‖Π P_{Y_i}).
(2) Reverse the roles of X and Y: I(X^n; Y^n) − Σ I(X_j; Y_j) = D(P_{X^n|Y^n}‖Π P_{X_i|Y_i}|P_{Y^n}) − D(P_{X^n}‖Π P_{X_i}).

Note: The moral of this result is that

1. For product channel, the MI-maximizing input is a product distribution

2. For product source, the MI-minimizing channel is a product channel

This type of result is often known as single-letterization in information theory, which tremendously
simplifies the optimization problem over a high-dimensional (multi-letter) problem to a scalar (single-
letter) problem. For example, in the simplest case where X n , Y n are binary vectors, optimizing
I(X n ; Y n ) over PX n and PY n |X n entails optimizing over 2n -dimensional vectors and 2n × 2n matrices,
whereas optimizing each I(Xi ; Yi ) individually is easy.
Example:
Footnote 1: That is, if P_{X^n,Y^n} = P_{Y^n} Π P_{X_i|Y_i} as measures.

1. (5.1) fails for non-product channels. Let X₁ ⊥⊥ X₂ ∼ Bern(1/2) on {0, 1} = F₂ and
   Y₁ = X₁ + X₂,   Y₂ = X₁.
   Then I(X₁; Y₁) = I(X₂; Y₂) = 0, but I(X²; Y²) = 2 bits.
2. Strict inequality in (5.1). Let Y_k = X_k = U ∼ Bern(1/2) for all k, so I(X_k; Y_k) = 1 bit. Then
   I(X^n; Y^n) = 1 bit < Σ I(X_k; Y_k) = n bits.
3. Strict inequality in (5.2). Let X₁ ⊥⊥ . . . ⊥⊥ X_n and
   Y₁ = X₂, Y₂ = X₃, . . . , Y_n = X₁, so that I(X_k; Y_k) = 0. Then
   I(X^n; Y^n) = Σ H(X_i) > 0 = Σ I(X_k; Y_k).
5.2* Gaussian capacity via orthogonal symmetry


Multi-dimensional case (WLOG assume X₁ ⊥⊥ . . . ⊥⊥ X_n i.i.d.): if Z₁, . . . , Z_n are independent, then
max_{E[Σ X_k²] ≤ nP} I(X^n; X^n + Z^n) ≤ max_{E[Σ X_k²] ≤ nP} Σ_{k=1}^n I(X_k; X_k + Z_k).

Given a distribution P_{X₁} · · · P_{X_n} satisfying the constraint, form the “average of marginals” distribution P̄_X = (1/n) Σ_{k=1}^n P_{X_k}, which also satisfies the single-letter constraint E[X²] = (1/n) Σ_{k=1}^n E[X_k²] ≤ P. Then, from the concavity in P_X of I(P_X, P_{Y|X}),
I(P̄_X, P_{Y|X}) ≥ (1/n) Σ_{k=1}^n I(P_{X_k}, P_{Y|X}).

So P̄X gives the same or better MI, which shows that the extremization above ought to have the form
nC(P) where C(P) is the single-letter capacity. Now suppose Y^n = X^n + Z_G^n, where Z_G^n ∼ N(0, I_n). Since an isotropic Gaussian is rotationally symmetric, for any orthogonal transformation U ∈ O(n) the additive noise has the same distribution, Z_G^n ∼ U Z_G^n, so that P_{UY^n|UX^n} = P_{Y^n|X^n}, and
I(P_{X^n}, P_{Y^n|X^n}) = I(P_{UX^n}, P_{UY^n|UX^n}) = I(P_{UX^n}, P_{Y^n|X^n}).
From the “average of marginals” argument above, averaging over many rotations of X^n can only make the mutual information larger. Therefore, the optimal input distribution P_{X^n} can be chosen to be invariant under orthogonal transformations. Consequently, the (unique!) capacity-achieving output distribution P*_{Y^n} must be rotationally invariant. Furthermore, from the conditions for equality in (5.1) we conclude that P*_{Y^n} must have independent components. Since the only product distribution satisfying the power constraints and having rotational symmetry is an isotropic Gaussian, we conclude that P*_{Y^n} = (P_Y*)^n with P_Y* = N(0, 1 + P).

For the other direction in the Gaussian saddle point problem:


min_{P_N : E[N²]=1} I(X_G; X_G + N).

This uses the same trick, except here the input distribution is automatically invariant under
orthogonal transformations.

5.3 Information measures and probability of error
Let W be a random variable and Ŵ be our prediction. There are three types of problems:

1. Random guessing: W ⊥
⊥ Ŵ .

2. Guessing with data: W → X → Ŵ .

3. Guessing with noisy data: W → X → Y → Ŵ .

We want to draw converse statements, e.g., if the uncertainty of W is too high or if the information
provided by the data is too scarce, then it is difficult to guess the value of W .
Theorem 5.2. Let |X| = M < ∞ and P_max ≜ max_{x∈X} P_X(x). Then
H(X) ≤ (1 − P_max) log(M − 1) + h(P_max) ≜ F_M(P_max),   (5.3)
with equality iff P_X = (P_max, (1 − P_max)/(M − 1), . . . , (1 − P_max)/(M − 1)), where the value (1 − P_max)/(M − 1) is repeated M − 1 times.

Proof. First proof: Write RHS − LHS as a divergence. Let P = (P_max, P₂, . . . , P_M) and introduce Q = (P_max, (1 − P_max)/(M − 1), . . . , (1 − P_max)/(M − 1)). Then RHS − LHS = D(P‖Q) ≥ 0, with equality iff P = Q.
Second proof: Given any P = (P_max, P₂, . . . , P_M), apply a random permutation π to the last M − 1 atoms to obtain the distribution P_π. Averaging P_π over all permutations π gives Q. Then use the concavity of entropy or “conditioning reduces entropy”: H(Q) ≥ H(P_π|π) = H(P).
Third proof: Directly solve the convex optimization max{H(P) : 0 ≤ p_i ≤ P_max, i = 1, . . . , M, Σ_i p_i = 1}.
Fourth proof: Data processing inequality. Later.

Note: Similar to the Shannon entropy H, P_max is also a reasonable measure of the randomness of P. In fact, log(1/P_max) is known as the Rényi entropy of order ∞, denoted by H_∞(P). Note that H_∞(P) = log M iff P is uniform; H_∞(P) = 0 iff P is a point mass.
Note: [Plot of the function F_M(p) on [0, 1]: it rises from log(M − 1) at p = 0 to its maximum log M at p = 1/M, then decreases to 0 at p = 1.] The function F_M is concave with maximum log M at the maximizer 1/M, but it is not monotone. However, P_max ≥ 1/M and F_M is decreasing on [1/M, 1]. Therefore (5.3) gives a lower bound on P_max in terms of entropy.

Interpretation: Suppose one is trying to guess the value of X without any information. Then
the best bet is obviously the most likely outcome (mode), i.e., the maximal probability of success
among all estimators is
max_{X̂ ⊥⊥ X} P[X = X̂] = P_max.   (5.4)

Thus (5.3) means: It is hard to predict something of large entropy.


Conceptual question: Is it true (for every predictor X̂ ⊥
⊥ X) that

H(X) ≤ FM (P[X = X̂]) ? (5.5)

This is not obvious from (5.3) and (5.4) since p ↦ F_M(p) is not monotone. To show (5.5), consider the data processor (X, X̂) ↦ 1{X = X̂}, applied to P_{XX̂} = P_X P_X̂ versus Q_{XX̂} = U_X P_X̂. Under P the output is Bern(P_S) with P_S ≜ P[X = X̂], while under Q it is Bern(1/M), since Q[X = X̂] = 1/M. Hence, by data processing for divergence,
d(P_S ‖ 1/M) ≤ D(P_{XX̂}‖Q_{XX̂}) = log M − H(X),
which rearranges to (5.5).


The benefit of this proof is that it trivially generalizes to (possibly randomized) estimators X̂(Y ),
which may depend on some observation Y correlated with X. It is clear that the best predictor for
X given Y is the maximum posterior (MAP) rule, i.e., posterior mode: X̂(y) = argmaxx PX|Y (x|y).
Theorem 5.3 (Fano’s inequality). Let |X | = M < ∞ and X → Y → X̂. Then

H(X|Y ) ≤ FM (P[X = X̂]) = P[X 6= X̂] log(M − 1) + h(P[X 6= X̂]). (5.6)

Thus, if in addition X is uniform, then

I(X; Y ) = log M − H(X|Y ) ≥ P[X = X̂] log M − h(P[X 6= X̂]). (5.7)

Proof. This is a direct corollary of Theorem 5.2: average H(X|Y = y) ≤ FM(P[X = X̂|Y = y])
over PY and use the concavity of FM.
For a standalone proof, apply data processing to PXY X̂ = PX PY |X PX̂|Y vs. QXY X̂ = UX PY PX̂|Y
and the data processor (kernel) (X, Y ) 7→ 1{X6=X̂} (note that PX̂|Y is fixed).

Remark: We can also derive Fano’s inequality as follows. Let ε = P[X ≠ X̂]. Apply data processing
for M.I.:

    I(X; Y) ≥ I(X; X̂) ≥ min_{PZ|X} { I(PX, PZ|X) : P[X = Z] ≥ 1 − ε }.

This minimum will not be zero since if we force X and Z to agree with some probability, then I(X; Z)
cannot be too small. It remains to compute the minimum, which is a nice convex optimization
problem. (Hint: look for invariants that the matrix PZ|X must satisfy under permutations (X, Z) 7→
(π(X), π(Z)) then apply the convexity of I(PX , ·)).
Theorem 5.4 (Fano inequality: general). Let X, Y ∈ X , |X | = M and let QXY = PX PY , then

    I(X; Y) ≥ d(P[X = Y] ‖ Q[X = Y])
            ≥ P[X = Y] log (1/Q[X = Y]) − h(P[X = Y])
            ( = P[X = Y] log M − h(P[X = Y]) if PX or PY is uniform)

Proof. Apply data processing to PXY and QXY. Note that if PX or PY is uniform, then Q[X = Y] = 1/M
always.

The following result is useful in providing converses for statistics and data transmission.
Corollary 5.1 (Lower bound on average probability of error). Let W → X → Y → Ŵ and W is
uniform on [M ] , {1, . . . , M }. Then

    Pe ≜ P[W ≠ Ŵ] ≥ 1 − (I(X; Y) + h(Pe)) / log M    (5.8)
                   ≥ 1 − (I(X; Y) + log 2) / log M.   (5.9)

Proof. Apply Theorem 5.3 and the data processing for M.I.: I(W ; Ŵ ) ≤ I(X; Y ).
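As a sanity check, the sketch below (ours, not from the notes) evaluates the weak bound (5.9) and
also the smallest Pe consistent with (5.8), which it finds by a grid search since Pe appears on both
sides through h(Pe):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits, elementwise."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    m = (p > 0) & (p < 1)
    out[m] = -p[m] * np.log2(p[m]) - (1 - p[m]) * np.log2(1 - p[m])
    return out

def fano_lower_bounds(I_bits, M):
    """Two lower bounds on Pe = P[W != What] for W uniform on [M]; I(X;Y) given in bits."""
    weak = max(1 - (I_bits + 1.0) / np.log2(M), 0.0)        # eq. (5.9)
    p = np.linspace(0, 1, 100001)
    ok = p[p * np.log2(M) + h2(p) >= np.log2(M) - I_bits]   # Pe values compatible with (5.8)
    strong = float(ok[0]) if ok.size else 1.0
    return weak, strong

print(fano_lower_bounds(I_bits=2.0, M=16))   # roughly (0.25, 0.29)
```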

5.4 Entropy rate


Definition 5.1. The entropy rate of a process X = (X1 , X2 , . . .) is
    H(X) ≜ lim_{n→∞} (1/n) H(X^n)    (5.10)

provided the limit exists.


Stationarity is a sufficient condition for the entropy rate to exist. Essentially, stationarity means
invariance w.r.t. time shift. Formally, X is stationary if (X_{t1}, . . . , X_{tn}) has the same distribution as
(X_{t1+k}, . . . , X_{tn+k}) for any t1, . . . , tn, k ∈ N.
Theorem 5.5. For any stationary process X = (X1 , X2 , . . .)

1. H(Xn | X^{n−1}) ≤ H(X_{n−1} | X^{n−2})

2. (1/n) H(X^n) ≥ H(Xn | X^{n−1})

3. (1/n) H(X^n) ≤ (1/(n−1)) H(X^{n−1})

4. H(X) exists and H(X) = lim_{n→∞} (1/n) H(X^n) = lim_{n→∞} H(Xn | X^{n−1}).

5. For a double-sided process X = (. . . , X_{−1}, X_0, X_1, X_2, . . .), H(X) = H(X_1 | X^0_{−∞}) provided that
   H(X_1) < ∞.

Proof.

1. Further conditioning + stationarity: H(Xn | X^{n−1}) ≤ H(Xn | X^{n−1}_2) = H(X_{n−1} | X^{n−2}).

2. Using the chain rule: (1/n) H(X^n) = (1/n) Σ_i H(Xi | X^{i−1}) ≥ H(Xn | X^{n−1}).

3. H(X^n) = H(X^{n−1}) + H(Xn | X^{n−1}) ≤ H(X^{n−1}) + (1/n) H(X^n).

4. n ↦ (1/n) H(X^n) is a decreasing sequence and lower bounded by zero, hence has a limit H(X).
   Moreover, by the chain rule, (1/n) H(X^n) = (1/n) Σ_{i=1}^n H(Xi | X^{i−1}). Then H(Xn | X^{n−1}) → H(X). Indeed,
   from part 1, lim_n H(Xn | X^{n−1}) = H′ exists. Next, recall from calculus: if a_n → a, then the
   Cesàro mean (1/n) Σ_{i=1}^n a_i → a as well. Thus, H′ = H(X).
5. Assuming H(X_1) < ∞ we have from (3.14):

    lim_{n→∞} [ H(X_1) − H(X_1 | X^0_{−n}) ] = lim_{n→∞} I(X_1; X^0_{−n}) = I(X_1; X^0_{−∞}) = H(X_1) − H(X_1 | X^0_{−∞}).

Example: (Stationary processes)


1. X − iid source ⇒ H(X) = H(X1 )

2. X − mixed sources: Flip a coin with bias p at time t = 0, if head, let X = Y, if tail, let X = Z.
Then H(X) = pH(Y) + p̄H(Z).

3. X − stationary Markov chain X1 → X2 → X3 → · · · :

    H(Xn | X^{n−1}) = H(Xn | X_{n−1})  ⇒  H(X) = H(X2 | X1) = Σ_{a,b} µ(a) P_{b|a} log (1/P_{b|a}),

   where µ is an invariant measure (possibly non-unique; unique if the chain is ergodic). A small
   numerical sketch appears after this list.

4. X − hidden Markov chain: Let X1 → X2 → X3 → · · · be a Markov chain. Fix PY|X and let Yi be the
   output of the channel PY|X applied to Xi. Then Y = (Y1, . . .) is a stationary process. Therefore H(Y)
   exists, but it is very difficult to compute (no closed-form solution to date), even if X is a binary
   Markov chain and PY|X is a BSC.
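As promised in the Markov-chain example above, here is a small numerical sketch (ours, not from
the notes) computing H(X) = H(X2 | X1) for a two-state chain from its transition matrix and an
invariant distribution:

```python
import numpy as np

def markov_entropy_rate(P):
    """Entropy rate (bits/symbol) of a stationary Markov chain with transition
    matrix P, where P[a, b] = P_{b|a}; uses the stationary distribution mu."""
    w, v = np.linalg.eig(P.T)
    mu = np.real(v[:, np.argmin(np.abs(w - 1))])
    mu = mu / mu.sum()
    logs = np.log2(np.where(P > 0, P, 1.0))   # log2 P_{b|a}, set to 0 where P_{b|a} = 0
    return float(-(mu[:, None] * P * logs).sum())

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
print(markov_entropy_rate(P))   # = 0.8*h(0.1) + 0.2*h(0.4), about 0.57 bits
```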

5.5 Entropy and symbol (bit) error rate


In this section we show that the entropy rates of two processes X and Y are close whenever they
can be “coupled”. Coupling of two processes means defining them on a common probability space
so that average distance between their realizations is small. In our case, we will require that the
symbol error rate be small, i.e.,

    (1/n) Σ_{j=1}^n P[Xj ≠ Yj] ≤ ε.    (5.11)

Notice that if we define the Hamming distance as

    dH(x^n, y^n) ≜ Σ_{j=1}^n 1{xj ≠ yj},

then indeed (5.11) corresponds to requiring

    E[dH(X^n, Y^n)] ≤ nε.

Before showing our main result, we show that Fano’s inequality Theorem 5.3 can be tensorized:
Proposition 5.1. Let Xk take values on a finite alphabet X. Then

    H(X^n | Y^n) ≤ n F_{|X|}(1 − δ),    (5.12)

where

    δ = (1/n) E[dH(X^n, Y^n)] = (1/n) Σ_{j=1}^n P[Xj ≠ Yj].

Proof. For each j ∈ [n] consider X̂j(Y^n) = Yj. Then from (5.6) we get

    H(Xj | Y^n) ≤ FM(P[Xj = Yj]),    (5.13)

where we denoted M = |X|. Then, upper-bounding the joint entropy by the sum of marginals, cf. (1.1),
and combining with (5.13) we get

    H(X^n | Y^n) ≤ Σ_{j=1}^n H(Xj | Y^n)            (5.14)
                 ≤ Σ_{j=1}^n FM(P[Xj = Yj])          (5.15)
                 ≤ n FM( (1/n) Σ_{j=1}^n P[Xj = Yj] ),  (5.16)

where in the last step we used concavity of FM and Jensen’s inequality. Noticing that

    (1/n) Σ_{j=1}^n P[Xj = Yj] = 1 − δ

concludes the proof.

Corollary 5.2. Consider two processes X and Y with entropy rates H(X) and H(Y). If

    P[Xj ≠ Yj] ≤ ε

for every j and if X takes values on a finite alphabet of size M, then

    H(X) − H(Y) ≤ FM(1 − ε).

If both processes have alphabets of size M then

    |H(X) − H(Y)| ≤ ε log M + h(ε) → 0 as ε → 0.

Proof. There is almost nothing to prove:

H(X n ) ≤ H(X n , Y n ) = H(Y n ) + H(X n |Y n )

and apply (5.12). For the last statement just recall the expression for FM .

5.6 Mutual information rate


Definition 5.2 (Mutual information rate).
    I(X; Y) = lim_{n→∞} (1/n) I(X^n; Y^n)

provided the limit exists.

Example: Gaussian processes. Consider X, N two stationary Gaussian processes, independent of
each other. Assume that their auto-covariance functions are absolutely summable and thus there
exist continuous power spectral density functions fX and fN. Without loss of generality, assume all
means are zero. Let cX(k) = E[X1 X_{k+1}]. Then fX is the Fourier transform of the auto-covariance
function cX, i.e., fX(ω) = Σ_{k=−∞}^{∞} cX(k) e^{iωk}. Finally, assume fN ≥ δ > 0. Then recall from
Lecture 2:

    I(X^n; X^n + N^n) = (1/2) log [ det(Σ_{X^n} + Σ_{N^n}) / det Σ_{N^n} ]
                      = (1/2) Σ_{i=1}^n log σi − (1/2) Σ_{i=1}^n log λi,

where σj, λj are the eigenvalues of the covariance matrices Σ_{Y^n} = Σ_{X^n} + Σ_{N^n} and Σ_{N^n}, which are
all Toeplitz matrices, e.g., (Σ_{X^n})_{ij} = E[Xi Xj] = cX(i − j). By Szegö’s theorem (see Section 5.7*):

    (1/n) Σ_{i=1}^n log σi → (1/2π) ∫_0^{2π} log fY(ω) dω.

Note that cY(k) = E[(X1 + N1)(X_{k+1} + N_{k+1})] = cX(k) + cN(k) and hence fY = fX + fN. Thus,
we have

    (1/n) I(X^n; X^n + N^n) → I(X; X + N) = (1/4π) ∫_0^{2π} log [ (fX(ω) + fN(ω)) / fN(ω) ] dω.

Maximizing this over fX(ω) leads to the famous water-filling solution f∗_X(ω) = |T − fN(ω)|^+.

5.7* Toeplitz matrices and Szegö’s theorem


Theorem 5.6 (Szegö). Let f : [0, 2π) → R be the Fourier transform of a summable sequence {ak},
that is

    f(ω) = Σ_{k=−∞}^{∞} e^{ikω} ak,    Σ_k |ak| < ∞.

Then for any φ : R → R continuous on the closure of the range of f, we have

    lim_{n→∞} (1/n) Σ_{j=1}^n φ(σ_{n,j}) = (1/2π) ∫_0^{2π} φ(f(ω)) dω,

where {σ_{n,j}, j = 1, . . . , n} are the eigenvalues of the Toeplitz matrix Tn = {a_{ℓ−m}}_{ℓ,m=1}^n.

Proof sketch. The idea is to approximate φ by polynomials, while for polynomials the statement
can be checked directly. An alternative interpretation of the strategy is the following: roughly
speaking, we want to show that the empirical distribution of the eigenvalues (1/n) Σ_{j=1}^n δ_{σ_{n,j}} converges
weakly to the distribution of f(W), where W is uniformly distributed on [0, 2π]. To this end, let us
check that all moments converge. Usually this does not imply weak convergence, but on compact
intervals it does.

For example, for φ(x) = x² we have

    (1/n) Σ_{j=1}^n σ²_{n,j} = (1/n) tr T_n²
                             = (1/n) Σ_{ℓ,m=1}^n (Tn)_{ℓ,m} (Tn)_{m,ℓ}
                             = (1/n) Σ_{ℓ,m} a_{ℓ−m} a_{m−ℓ}
                             = (1/n) Σ_{ℓ=−(n−1)}^{n−1} (n − |ℓ|) a_ℓ a_{−ℓ}
                             = Σ_{x ∈ (−1,1) ∩ (1/n)Z} (1 − |x|) a_{nx} a_{−nx}.

Substituting a_ℓ = (1/2π) ∫_0^{2π} f(ω) e^{iωℓ} dω we get

    (1/n) Σ_{j=1}^n σ²_{n,j} = (1/(2π)²) ∫∫ f(ω) f(ω′) θn(ω − ω′) dω dω′,    (5.17)

where

    θn(u) = Σ_{x ∈ (−1,1) ∩ (1/n)Z} (1 − |x|) e^{−inux}

is a Fejér kernel and converges to a δ-function: θn(u) → 2πδ(u) (in the sense of convergence of
Schwartz distributions). Thus from (5.17) we get

    (1/n) Σ_{j=1}^n σ²_{n,j} → (1/(2π)²) ∫∫ f(ω) f(ω′) 2πδ(ω − ω′) dω dω′ = (1/2π) ∫_0^{2π} f²(ω) dω

as claimed.
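The statement is also easy to test numerically. The sketch below (ours, not from the notes) builds
a Toeplitz matrix from a simple absolutely summable autocovariance a_k = ρ^{|k|} and compares
(1/n) Σ_j φ(σ_{n,j}) with the limiting integral for φ(x) = x²:

```python
import numpy as np

rho, n = 0.5, 400
idx = np.arange(n)
Tn = rho ** np.abs(np.subtract.outer(idx, idx))   # Toeplitz matrix with a_k = rho^{|k|}
eigs = np.linalg.eigvalsh(Tn)
lhs = np.mean(eigs ** 2)                          # (1/n) sum_j sigma_{n,j}^2

w = np.linspace(0, 2 * np.pi, 20001)
f = sum(rho ** abs(k) * np.cos(k * w) for k in range(-200, 201))  # f(w) = sum_k a_k e^{ikw} (real)
rhs = np.mean(f ** 2)                             # (1/2pi) int_0^{2pi} f(w)^2 dw via a uniform grid
print(lhs, rhs)   # the two numbers agree to a few decimals; the gap shrinks as n grows
```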

§ 6. f -divergences: definition and properties

6.1 f -divergences
In Section 1.6 we introduced the KL divergence that measures the dissimilarity between two
distributions. This turns out to be a special case of the family of f -divergence between probability
distributions, introduced by Csiszár [Csi67]. Roughly speaking, all f -divergences quantify the
difference between a pair of distributions, each with different operational meaning.
Definition 6.1 (f-divergence). Let f : (0, ∞) → R be a convex function which is strictly convex¹
at 1 and f(1) = 0. Let P and Q be two probability distributions on a measurable space (X, F).
The f-divergence between P and Q is defined as

    Df(P‖Q) ≜ EQ [ f( (dP/dµ) / (dQ/dµ) ) ],    (6.1)

where µ is any dominating probability measure (e.g., µ = (P + Q)/2) of P and Q, i.e., both P ≪ µ
and Q ≪ µ, with the understanding that

• f(0) = f(0+),

• 0 f(0/0) = 0, and

• 0 f(a/0) = lim_{x↓0} x f(a/x) for a > 0.

Remark 6.1. It is useful to consider the case when P is not absolutely continuous w.r.t. Q, in which
case P and Q are “very dissimilar.” For example, we will show later that TV(P, Q) = 1 iff P ⊥ Q.
When P ≪ Q, we have

    Df(P‖Q) ≜ EQ [ f(dP/dQ) ].    (6.2)

Similar to Definition 1.4, when X is discrete, Df(P‖Q) = Σ_{x∈X} Q(x) f(P(x)/Q(x)); when X is a
Euclidean space and P, Q have densities p, q, Df(P‖Q) = ∫_X q(x) f(p(x)/q(x)) dx.

The following are the common f -divergences:

• Kullback–Leibler (KL) divergence: aka relative entropy, f(x) = x log x,

    D(P‖Q) ≜ EQ [ (P/Q) log (P/Q) ] = EP [ log (P/Q) ].

  This has been extensively discussed in Section 1.6. It is worth noting that, in general,
  D(P‖Q) ≠ D(Q‖P). When f(x) = − log x, we obtain Df(P‖Q) = EQ [ − log (P/Q) ] = D(Q‖P).
¹ By strict convexity at 1, we mean for all s, t ∈ (0, ∞) and α ∈ (0, 1) such that αs + ᾱt = 1, we have
αf(s) + (1 − α)f(t) > f(1).

• Total variation: f(x) = ½|x − 1|,

    TV(P, Q) ≜ ½ EQ | P/Q − 1 | = ½ ∫ |dP − dQ|.

  Moreover, TV(·, ·) is a metric on the space of probability distributions, and hence it is a
  symmetric function of P and Q.

• χ²-divergence: f(x) = (x − 1)²,

    χ²(P‖Q) ≜ EQ [ (P/Q − 1)² ] = ∫ (P − Q)²/Q = ∫ P²/Q − 1.

  Note that we can also choose f(x) = x² − 1. Indeed, different f can lead to the same divergence.

• Squared Hellinger distance: f(x) = (1 − √x)²,

    H²(P, Q) ≜ EQ [ (1 − √(P/Q))² ] = ∫ (√P − √Q)².

  Note that H²(P, Q) = H²(Q, P).

• Le Cam distance [LC86, p. 47]: f(x) = (1 − x)²/(2x + 2),

    LC(P‖Q) = ½ ∫ (P − Q)²/(P + Q).

  Note that this is also a metric.

• Jensen–Shannon divergence (symmetrized KL): f(x) = x log (2x/(x+1)) + log (2/(x+1)),

    JS(P‖Q) = D(P ‖ (P+Q)/2) + D(Q ‖ (P+Q)/2).
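For concreteness, here is a small sketch (ours, not from the notes) computing several of the above
f-divergences for two pmfs on a common finite alphabet; to keep the code short it assumes Q(x) > 0
everywhere, and KL/JS are returned in nats:

```python
import numpy as np

def f_divergences(P, Q):
    """KL, TV, chi^2, squared Hellinger, Le Cam, JS for pmfs with Q > 0 (KL, JS in nats)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    kl    = float(np.sum(P[P > 0] * np.log(P[P > 0] / Q[P > 0])))
    tv    = 0.5 * float(np.abs(P - Q).sum())
    chi2  = float(np.sum((P - Q) ** 2 / Q))
    hel2  = float(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))
    lecam = 0.5 * float(np.sum((P - Q) ** 2 / (P + Q)))
    M = (P + Q) / 2
    js = float(np.sum(P[P > 0] * np.log(P[P > 0] / M[P > 0])) +
               np.sum(Q[Q > 0] * np.log(Q[Q > 0] / M[Q > 0])))
    return dict(KL=kl, TV=tv, chi2=chi2, H2=hel2, LC=lecam, JS=js)

print(f_divergences([0.34, 0.34, 0.32], [0.85, 0.10, 0.05]))
```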

Theorem 6.1 (Properties of f -divergences).


• Non-negativity: Df (P kQ) ≥ 0 with equality if and only if P = Q.

• Joint convexity: (P, Q) 7→ Df (P kQ) is a jointly convex function. Consequently, P 7→


Df (P kQ) and Q 7→ Df (P kQ) are also convex functions.

• Conditioning increases f-divergence: Define the conditional f-divergence

    Df(PY|X ‖ QY|X | PX) ≜ E_{X∼PX} [ Df(PY|X ‖ QY|X) ].

  Let PY be the output of the channel PY|X with input PX, and let QY be the output of QY|X with
  the same input PX. Then

    Df(PY ‖ QY) ≤ Df(PY|X ‖ QY|X | PX).

Note: For the last property, one can view PY and QY as the output distributions after passing PX
through the channel transition matrices PY |X and QY |X respectively. The above relation tells us
that the average f -divergence between the corresponding channel transition rows is at least the
f -divergence between the output distributions.
Proof. • Df(P‖Q) = EQ[ f(P/Q) ] ≥ f( EQ[P/Q] ) = f(1) = 0, where the inequality follows from
Jensen’s inequality. By strict convexity at 1, equality holds if and only if P = Q.

• For any convex function f on R+, the map (p, q) ↦ q f(p/q) is convex on R²+ (the perspective of f).
Since Df(P‖Q) = EQ[ f(P/Q) ], Df(P‖Q) is jointly convex.

• This follows directly from the joint convexity of Df(P‖Q) and Jensen’s inequality.

Recall the definition of f -divergences from last time. If a function f : R+ → R satisfies the
following properties:

• f is a convex function.

• f (1) = 0.

• f is strictly convex at x = 1, i.e., for all x, y, α such that αx + ᾱy = 1, we have
  f(1) < αf(x) + ᾱf(y).

Then the functional that maps pairs of distributions to R+ defined by

    Df(P‖Q) ≜ EQ [ f(dP/dQ) ]

is an f-divergence.

6.2 Data processing inequality


The data processing inequality for KL divergence (Theorem 2.2) extends to all f -divergences.
Theorem 6.2. Consider a channel that produces Y given X based on the law PY |X (shown below).
If PY is the distribution of Y when X is generated by PX and QY is the distribution of Y when X
is generated by QX , then for any f -divergence Df (·k·),

Df (PY kQY ) ≤ Df (PX kQX ).

[Diagram: X is passed through the channel PY|X; input PX produces output PY, and input QX produces output QY.]

One interpretation of this result is that processing the observation x makes it more difficult to
determine whether it came from PX or QX .

Proof.

    Df(PX‖QX) = EQX[ f(PX/QX) ] (a)= EQXY[ f(PXY/QXY) ] = EQY EQX|Y[ f(PXY/QXY) ]
              ≥ EQY[ f( EQX|Y[ PXY/QXY ] ) ]          (Jensen’s inequality)
              (b)= EQY[ f( PY/QY ) ] = Df(PY‖QY).

Note that (a) means Df(PX‖QX) = Df(PXY‖QXY); (b) can be alternatively understood by noting
that EQ[ PXY/QXY | Y ] is precisely the relative density PY/QY, by checking the definition of change of measure,
i.e., EP[g(Y)] = EQ[ g(Y) PXY/QXY ] = EQ[ g(Y) E[PXY/QXY | Y] ] for any g.

Remark 6.2. PY |X can be a deterministic map so that Y = f (X). More specifically, if f (X) =
1E (X) for any event E, then Y is Bernoulli with parameter P (E) or Q(E) and the data processing
inequality gives
Df (PX kQX ) ≥ Df (Bern(P (E))kBern(Q(E))). (6.3)
This is how we will prove the converse direction of large deviation (see Lecture 14).
Example: If X = (X1 , X2 ) and f (X) = X1 , then we have

Df (PX1 X2 kQX1 X2 ) ≥ Df (PX1 kQX1 ).

As seen from the proof of Theorem 6.2, this is in fact equivalent to data processing inequality.
Remark 6.3. If Df(P‖Q) is an f-divergence, then Df̃(P‖Q) with f̃(x) ≜ x f(1/x) is also an f-divergence
and Df(P‖Q) = Df̃(Q‖P). Example: Df(P‖Q) = D(P‖Q), then Df̃(P‖Q) = D(Q‖P).

Proof. First we verify that f̃ has all three properties required for Df̃(·‖·) to be an f-divergence.

• For x, y ∈ R+ and α ∈ [0, 1] define c = αx + ᾱy, so that (αx)/c + (ᾱy)/c = 1. Observe that

    f̃(αx + ᾱy) = c f(1/c) = c f( (αx/c)(1/x) + (ᾱy/c)(1/y) ) ≤ c (αx/c) f(1/x) + c (ᾱy/c) f(1/y) = α f̃(x) + ᾱ f̃(y).

  Thus f̃ : R+ → R is a convex function.

• f̃(1) = f(1) = 0.

• For x, y ∈ R+, α ∈ [0, 1], if αx + ᾱy = 1, then by strict convexity of f at 1,

    0 = f̃(1) = f(1) = f( αx·(1/x) + ᾱy·(1/y) ) < αx f(1/x) + ᾱy f(1/y) = α f̃(x) + ᾱ f̃(y).

  So f̃ is strictly convex at 1 and thus Df̃(·‖·) is a valid f-divergence.

Finally,

    Df(P‖Q) = EQ [ f(P/Q) ] = EP [ (Q/P) f(P/Q) ] = EP [ f̃(Q/P) ] = Df̃(Q‖P).

6.3 Total variation and hypothesis testing
Recall that the choice of f(x) = ½|x − 1| gives rise to the total variation distance,

    Df(P‖Q) = ½ EQ | P/Q − 1 | = ½ ∫ |dP − dQ|,

where ∫|dP − dQ| is a shorthand understood in the usual sense, namely ∫ |dP/dµ − dQ/dµ| dµ, where µ is a
dominating measure, e.g., µ = P + Q, and the value of the integral does not depend on µ.
We will denote total variation by TV(P, Q).
We will denote total variation by TV(P, Q).
Theorem 6.3. The following definitions for total variation are equivalent:

1.
    TV(P, Q) = sup_E P(E) − Q(E),    (6.4)
   where the supremum is over all measurable sets E.

2. 1 − TV(P, Q) is the minimal sum of Type-I and Type-II error probabilities for testing P versus
   Q, and²
    TV(P, Q) = 1 − ∫ P ∧ Q.    (6.5)

3. Provided the diagonal {(x, x) : x ∈ X} is measurable,

    TV(P, Q) = inf_{PXY : PX=P, PY=Q} P[X ≠ Y].    (6.6)

4. Let F = {f : X → R, ‖f‖∞ ≤ 1}. Then

    TV(P, Q) = ½ sup_{f∈F} EP f(X) − EQ f(X).    (6.7)

Remark 6.4 (Variational representation). The equation (6.4) and (6.7) provide sup-representation
of total variation, which will be extended to general f -divergences (later). Note that (6.6) is an
inf-representation of total variation in terms of couplings, meaning total variation is the Wasserstein
distance with respect to Hamming distance. The benefit of variational representations is that
choosing a particular coupling in (6.6) gives an upper bound on TV(P, Q), and choosing a particular
f in (6.7) yields a lower bound.
Remark 6.5 (Operational meaning). In the binary hypothesis test for H0 : X ∼ P or H1 : X ∼ Q,
Theorem 6.3 shows that 1 − TV(P, Q) is the sum of false alarm and missed detection probabilities.
This can be seen either from (6.4), where E is the decision region for deciding P, or from (6.5), since
the optimal test (for average probability of error) is the likelihood ratio test dP/dQ > 1. In particular,

• TV(P, Q) = 1 ⇔ P ⊥ Q, the probability of error is zero since essentially P and Q have disjoint
supports.

• TV(P, Q) = 0 ⇔ P = Q and the minimal sum of error probabilities is one, meaning the best
thing to do is to flip a coin.
² Here again ∫ P ∧ Q is a shorthand understood in the usual sense, namely ∫ (dP/dµ ∧ dQ/dµ) dµ, where µ is any
dominating measure.

6.4 Motivating example: Hypothesis testing with multiple
samples
Observation:
1. Different f-divergences have different operational significance. For example, as we saw in
   Section 6.3, testing two hypotheses boils down to total variation, which determines the
   fundamental limit (minimum average probability of error). For estimation under quadratic
   loss the f-divergence LC(P‖Q) = ½ ∫ (P − Q)²/(P + Q) is useful.

2. Some f-divergences are easier to evaluate than others. For example, for product distributions,
   Hellinger distance and χ²-divergence tensorize in the sense that they are easily expressible
   in terms of those of the one-dimensional marginals; however, computing the total variation
   between product measures is frequently difficult. Another example: computing the
   χ²-divergence from a mixture of distributions to a simple distribution is convenient.
Therefore the punchline is that it is often fruitful to bound one f -divergence by another and this
sometimes leads to tight characterizations. In this section we consider a specific useful example
to drive this point home. Then in the next lecture we develop inequalities between f -divergences
systematically.
Consider a binary hypothesis test where the data X = (X1, . . . , Xn) are i.i.d. drawn from either P
or Q and the goal is to test

    H0 : X ∼ P^{⊗n}  vs.  H1 : X ∼ Q^{⊗n}.

As mentioned before, 1 − TV(P^{⊗n}, Q^{⊗n}) gives the minimal sum of Type-I and Type-II error probabilities, achieved by
the maximum likelihood test. By the data processing inequality, TV(P^{⊗m}, Q^{⊗m}) ≤ TV(P^{⊗n}, Q^{⊗n})
for m < n. From this we see that TV(P^{⊗n}, Q^{⊗n}) is an increasing sequence in n (and bounded by 1
by definition) and hence converges. One would hope that as n → ∞, TV(P^{⊗n}, Q^{⊗n}) converges to 1
and consequently the probability of error in the hypothesis test converges to zero. It turns out that
if the distributions P, Q are independent of n, then large deviation theory (see Corollary 15.1) gives

    TV(P^{⊗n}, Q^{⊗n}) = 1 − exp(−nC(P, Q) + o(n)),    (6.8)

where the constant C(P, Q) = − log inf_{0≤α≤1} ∫ P^α Q^{1−α} is the Chernoff information of P, Q. It
is clear from this that TV(P^{⊗n}, Q^{⊗n}) → 1 as n → ∞, and, in fact, exponentially fast.
However, as frequently encountered in high-dimensional statistical problems, if the distributions
P = Pn and Q = Qn depend on n, then the large-deviation approach that leads to (6.8) is no longer
valid. In such a situation, total variation is still relevant for hypothesis testing, but its behavior as
n → ∞ is not obvious nor easy to compute. In this case, understanding how a more computationally
tractable f -divergence is related to total variation may give insight on hypothesis testing without
needing to directly compute the total variation. It turns out Hellinger distance is precisely suited
for this task – see Theorem 6.4 below.  q 2 
Recall that the squared Hellinger distance, H 2 (P, Q) = EQ 1 − Q P
is an f -divergence

with f (x) = (1 − x)2 , which provides a sandwich bound for total variation
r
1 2 H 2 (P, Q)
0 ≤ H (P, Q) ≤ TV(P, Q) ≤ H(P, Q) 1 − ≤ 1. (6.9)
2 4
This can be proved by elementary manipulations and a systematic proof will be explained in the
next lecture. Direct consequences of this bound are:

• H 2 (P, Q) = 2, if and only if TV(P, Q) = 1.

• H 2 (P, Q) = 0 if and only if TV(P, Q) = 0.

• Hellinger consistency ⇐⇒ TV consistency, namely H 2 (Pn , Qn ) → 0 ⇐⇒ TV(Pn , Qn ) → 0.

Theorem 6.4. For any sequence of distributions Pn and Qn, as n → ∞,

    TV(Pn^{⊗n}, Qn^{⊗n}) → 0  ⟺  H²(Pn, Qn) = o(1/n),
    TV(Pn^{⊗n}, Qn^{⊗n}) → 1  ⟺  H²(Pn, Qn) = ω(1/n).

Proof. Because the observations X = (X1, X2, . . . , Xn) are i.i.d., the joint distribution factors:

    H²(Pn^{⊗n}, Qn^{⊗n}) = 2 − 2 E_{Qn^{⊗n}} [ √( ∏_{i=1}^n (Pn/Qn)(Xi) ) ]
                         = 2 − 2 ∏_{i=1}^n E_{Qn} [ √( (Pn/Qn)(Xi) ) ]    (by independence)
                         = 2 − 2 ( E_{Qn} [ √(Pn/Qn) ] )^n
                         = 2 − 2 ( 1 − ½ H²(Pn, Qn) )^n.

Then TV(Pn^{⊗n}, Qn^{⊗n}) → 0 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 0, which happens precisely when
H²(Pn, Qn) = o(1/n).
Similarly, TV(Pn^{⊗n}, Qn^{⊗n}) → 1 if and only if H²(Pn^{⊗n}, Qn^{⊗n}) → 2, which happens precisely when
H²(Pn, Qn) = ω(1/n).
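The dichotomy in Theorem 6.4 is easy to see numerically: the sketch below (ours, not from the
notes) uses the tensorization identity from the proof together with the sandwich bound (6.9) to
bracket TV(P^{⊗n}, Q^{⊗n}) without ever forming the product distributions.

```python
import numpy as np

def hellinger2(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))

def tv_product_bounds(P, Q, n):
    """Bracket TV of the n-fold products via H^2 tensorization and the sandwich bound (6.9)."""
    h2_n = 2 - 2 * (1 - hellinger2(P, Q) / 2) ** n          # H^2 of the n-fold products
    lower = h2_n / 2
    upper = np.sqrt(h2_n) * np.sqrt(max(1 - h2_n / 4, 0.0))
    return lower, upper

P, Q = [0.5, 0.5], [0.51, 0.49]          # H^2(P,Q) is of order 1e-4
for n in (10, 1000, 100000):
    print(n, tv_product_bounds(P, Q, n))  # near 0 for small n, near 1 once n >> 1/H^2
```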

Remark 6.6. The proof of Theorem 6.4 relies on two ingredients:

1. Sandwich bound (6.9).

2. Tensorization properties of Hellinger:

    H²( ∏_{i=1}^n Pi , ∏_{i=1}^n Qi ) = 2 − 2 ∏_{i=1}^n ( 1 − H²(Pi, Qi)/2 ).    (6.10)

Note that there are other f-divergences that are also tensorizable, e.g., the χ²-divergence:

    χ²( ∏_{i=1}^n Pi , ∏_{i=1}^n Qi ) = ∏_{i=1}^n ( 1 + χ²(Pi, Qi) ) − 1;    (6.11)

however, no sandwich inequality like (6.9) exists for TV and χ², and hence there is no χ²-version of
Theorem 6.4. Asserting the non-existence of such inequalities requires understanding the relationship
between these two f-divergences (see next lecture).

6.5 Inequalities between f -divergences
We will discuss two methods for finding inequalities between f -divergences.

• ad hoc approach: case-by-case proof using results like Jensen’s inequality, max ≤ mean ≤ min,
Cauchy-Schwarz, etc.

• systematic approach: joint range of f -divergences.

Definition 6.2. The joint range between two f -divergences Df (·k·) and Dg (·k·) is the range of the
mapping (P, Q) 7→ (Df (P kQ), Dg (P kQ)), i.e., the set R ⊂ R+ × R+ where (x, y) ∈ R if there exist
distributions P, Q on some common measurable space such that x = Df (P kQ) and y = Dg (P kQ).

[Figure: a schematic joint range region (shown in green) in the plane with Df on the horizontal axis and Dg on the vertical axis.]

The green region in the above figure shows what a joint range between Df (·k·) and Dg (·k·) might
look like. By definition of R, the lower boundary gives the sharpest lower bound of Df in terms of
Dg, namely:

    Df(P‖Q) ≥ V(Dg(P‖Q)),  where V(t) ≜ inf{ Df(P‖Q) : Dg(P‖Q) = t };

similarly, the upper boundary gives the best upper bound. As will be discussed in the next lecture,
the sandwich bound (6.9) corresponds precisely to the lower and upper boundaries of the joint range
of H² and TV, and is therefore not improvable. It is important to note, however, that R may be an
unbounded region and some of the boundaries may not exist, meaning it is impossible to bound one
by the other, such as χ2 versus TV.
To gain some intuition, we start with the ad hoc approach by proving Pinsker’s inequality, which
bounds total variation from above by the KL divergence.

Theorem 6.5 (Pinsker’s inequality).

    D(P‖Q) ≥ 2 log e · TV²(P, Q).    (6.12)

Proof. First we show that, by the data processing inequality, it suffices to prove the result for
Bernoulli distributions. For any event E, let Y = 1{X∈E}, which is Bernoulli with parameter P(E)
or Q(E). By the data processing inequality, D(P‖Q) ≥ d(P(E)‖Q(E)). If Pinsker’s inequality is true
for all Bernoulli distributions, we have (working in nats, where (6.12) reads D ≥ 2TV²)

    √( D(P‖Q)/2 ) ≥ TV(Bern(P(E)), Bern(Q(E))) = |P(E) − Q(E)|.

Taking the supremum over E gives √( D(P‖Q)/2 ) ≥ sup_E |P(E) − Q(E)| = TV(P, Q), in view of
Theorem 6.3.
The binary case follows easily from Taylor’s theorem (with integral remainder form):

    d(p‖q) = ∫_q^p (p − t)/(t(1 − t)) dt ≥ 4 ∫_q^p (p − t) dt = 2(p − q)²

and TV(Bern(p), Bern(q)) = |p − q|.
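A quick numerical check of (6.12) on random pmfs (in bits, i.e., with the factor 2 log₂ e); the sketch
is ours and not part of the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_bits(P, Q):
    mask = P > 0
    return float(np.sum(P[mask] * np.log2(P[mask] / Q[mask])))

def tv(P, Q):
    return 0.5 * float(np.abs(P - Q).sum())

for _ in range(5):
    P = rng.dirichlet(np.ones(4))
    Q = rng.dirichlet(np.ones(4))
    lhs, rhs = kl_bits(P, Q), 2 * np.log2(np.e) * tv(P, Q) ** 2
    print(round(lhs, 4), ">=", round(rhs, 4), lhs >= rhs)
```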

Remark 6.7. Pinsker’s inequality is known to be sharp in the sense that the constant “2” in (1.15)
is not improvable, i.e., there exist {Pn, Qn} along which the two sides of (1.15) are asymptotically
equal. (Why? Think about the local quadratic behavior in Proposition 4.2.) Nevertheless, this does
not mean that (1.15) itself is not improvable, because it might be possible to subtract some
higher-order term from the RHS. This is indeed the case and there are many refinements of Pinsker’s
inequality. But what is the best inequality? Settling this question rests on characterizing the joint
range and its lower boundary. This is the topic of the next lecture.

§ 7. Inequalities between f -divergences via joint range

In the last lecture we proved Pinsker’s inequality, D(P‖Q) ≥ 2TV²(P, Q), in an ad hoc
manner. The downside of ad hoc approaches is that it is hard to tell whether those inequalities can
be improved or not. However, the key step in our proof of Pinsker’s inequality, the reduction to
the case of Bernoulli distributions, is inspiring: is it possible to reduce inequalities between
any two f-divergences to the binary case?

7.1 Inequalities via joint range


A systematic method is to prove such inequalities via the joint range. For example, to prove
a lower bound on D(P‖Q) by a function of TV(P, Q), say D(P‖Q) ≥ F(TV(P, Q)) for some
F : [0, 1] → [0, ∞], the best choice, by definition, is the following:

    F(ε) ≜ inf_{(P,Q): TV(P,Q)=ε} D(P‖Q).

The problem boils down to the characterization of the region {(TV(P, Q), D(P‖Q)) : P, Q} ⊆ R², their
joint range, whose lower boundary is the function F.
[Figure 7.1: Joint range of TV (horizontal axis, 0 to 1) and D (vertical axis).]

Definition 7.1 (Joint range). Consider two f -divergences Df (P kQ) and Dg (P kQ). Their joint
range is a subset of R2 defined by

R , {(Df (P kQ), Dg (P kQ)) : P, Q are probability measures on some measurable space} ,


Rk , {(Df (P kQ), Dg (P kQ)) : P, Q are probability measures on [k]} .

The region R seems difficult to characterize since we need to consider P, Q over all measurable
spaces; on the other hand, the region Rk for small k is easy to obtain. The main theorem we will
prove is the following [HV11]:
Theorem 7.1 (Harremoës-Vajda ’11).

R = co(R2 ).

It is easy to obtain a parametric formula of R2 . By Theorem 7.1, the region R is no more than
the convex hull of R2 .
Theorem 7.1 implies that R is a convex set; however, as a warmup, it is instructive to prove convexity
of R directly, which simply follows from the arbitrariness of the alphabet size of distributions.
Given any two points (Df (P0 kQ0 ), Dg (P0 kQ0 )) and (Df (P1 kQ1 ), Dg (P1 kQ1 )) in the joint range, it
is easy to construct another pair of distributions (P, Q) by alphabet extension that produces any
convex combination of those two points.
Theorem 7.2. R is convex.

Proof. Given any two pairs of distributions (P0 , Q0 ) and (P1 , Q1 ) on some space X and given any
α ∈ [0, 1], we define another pair of distributions (P, Q) on X × {0, 1} by

P = ᾱ(P0 × δ0 ) + α(P1 × δ1 ),
Q = ᾱ(Q0 × δ0 ) + α(Q1 × δ1 ).

In other words, we construct a random variable Z = (X, B) with B ∼ Bern(α), where PX|B=i = Pi
and QX|B=i = Qi . Then
      
    Df(P‖Q) = EQ [ f(P/Q) ] = EB [ E_{QZ|B} [ f(P/Q) ] ] = ᾱ Df(P0‖Q0) + α Df(P1‖Q1),
    Dg(P‖Q) = EQ [ g(P/Q) ] = EB [ E_{QZ|B} [ g(P/Q) ] ] = ᾱ Dg(P0‖Q0) + α Dg(P1‖Q1).

Therefore, ᾱ(Df (P0 kQ0 ), Dg (P0 kQ0 )) + α(Df (P1 kQ1 ), Dg (P1 kQ1 )) ∈ R and thus R is convex.

Theorem 7.1 is proved by the following two lemmas:


Lemma 7.1 (non-constructive/existential). R = R4 .
Lemma 7.2 (constructive/algorithmic).

Rk+1 = co(R2 ∪ Rk ) for any k ≥ 2

and hence
Rk = co(R2 ), for any k ≥ 3.

7.1.1 Representation of f -divergences
To prove Lemma 7.1 and Lemma 7.2, we first express f -divergences by means of expectation over
the likelihood ratio.
Lemma 7.3. Given two f-divergences Df(·‖·) and Dg(·‖·), their joint range is

    R  = { ( E[f(X)] + f̃(0)(1 − E[X]), E[g(X)] + g̃(0)(1 − E[X]) ) : X ≥ 0, E[X] ≤ 1 },
    Rk = { ( E[f(X)] + f̃(0)(1 − E[X]), E[g(X)] + g̃(0)(1 − E[X]) ) :
            X ≥ 0, E[X] ≤ 1, X takes at most k − 1 values, or X ≥ 0, E[X] = 1, X takes at most k values },

where f̃(0) ≜ lim_{x→0} x f(1/x) and g̃(0) ≜ lim_{x→0} x g(1/x).


In the statement of Lemma 7.3, we remark that f˜(0) and g̃(0) are both well-defined (possibly
+∞) by the convexity of x 7→ xf (1/x) and x 7→ xg(1/x) (from the last lecture).
Before proving the above lemma, we look at two examples to understand the correspondence between
a point in the joint range and a random variable. The first example is the simple case that P ≪ Q,
when the likelihood ratio of P and Q (the Radon–Nikodym derivative, defined on the union of the
supports of P and Q) is well-defined. Example: Consider the following two distributions P, Q on [3]:

    x:  1     2     3
    P:  0.34  0.34  0.32
    Q:  0.85  0.10  0.05

Then Df(P‖Q) = 0.85 f(0.4) + 0.1 f(3.4) + 0.05 f(6.4), which is E[f(X)] where X is the likelihood
ratio of P and Q, taking 3 values with the following pmf:

    x:     0.4   3.4   6.4
    µ(x):  0.85  0.10  0.05

In the other direction, given the above pmf of a non-negative, unit-mean random variable X ∼ µ
that takes 3 values, we can construct a pair of distributions by Q(x) = µ(x) and P(x) = x µ(x).
In general, P is not necessarily absolutely continuous w.r.t. Q, and the likelihood ratio X may not
exist everywhere. However, it is still well-defined on the event {Q > 0}. Example: Consider the
following two distributions P, Q on [2]:

    x:  1    2
    P:  0.4  0.6
    Q:  0    1

Then Df(P‖Q) = f(0.6) + 0 f(0.4/0), where 0 f(p/0) is understood as

    0 f(p/0) = lim_{x→0} x f(p/x) = p lim_{x→0} (x/p) f(p/x) = p f̃(0).

Therefore Df(P‖Q) = f(0.6) + 0.4 f̃(0) = E[f(X)] + f̃(0)(1 − E[X]), where X is defined on {Q > 0}:

    x:     0.6
    µ(x):  1

In the other direction, given the above pmf of a non-negative random variable X ∼ µ with E[X] ≤ 1
that takes 1 value, we let Q(x) = µ(x), let P(x) = x µ(x) on {Q > 0}, and let P have an extra point
mass 1 − E[X].

Proof of Lemma 7.3. We first prove it for R. Given any pair of distributions (P, Q) that produces a
point of R, let p, q denote the densities of P, Q under some dominating measure µ, respectively. Let

    X = p/q on {q > 0},  with law µX = Q.    (7.1)

Then X ≥ 0 and E[X] = P[q > 0] ≤ 1. Then

    Df(P‖Q) = ∫_{q>0} f(p/q) dQ + ∫_{q=0} (q/p) f(p/q) dP = ∫_{q>0} f(p/q) dQ + f̃(0) P[q = 0]
            = E[f(X)] + f̃(0)(1 − E[X]).

Analogously,

    Dg(P‖Q) = E[g(X)] + g̃(0)(1 − E[X]).

In the other direction, given any random variable X ≥ 0 with E[X] ≤ 1 and X ∼ µ, let

    dQ = dµ,  dP = X dµ + (1 − E[X]) δ∗,    (7.2)

where ∗ is an arbitrary symbol outside the support of X. Then

    ( Df(P‖Q), Dg(P‖Q) ) = ( E[f(X)] + f̃(0)(1 − E[X]), E[g(X)] + g̃(0)(1 − E[X]) ).

Now we consider Rk. Given two probability measures P, Q on [k], the likelihood ratio defined in
(7.1) takes at most k values. If P ≪ Q then E[X] = 1; if P is not absolutely continuous w.r.t. Q then
X takes at most k − 1 values.
In the other direction, if E[X] = 1 then the construction of P, Q in (7.2) is on the same support
as X; if E[X] < 1 then the support of P is increased by one.

7.1.2 Proof of Theorem 7.1


Aside: Fenchel–Eggleston–Carathéodory theorem: Let S ⊆ R^d and x ∈ co(S). Then there
exists a set of d + 1 points S′ = {x1, x2, . . . , x_{d+1}} ⊆ S such that x ∈ co(S′). If S is connected, then
d points are enough.

Proof of Lemma 7.1. It suffices to prove that

    R ⊆ R4.

Let S ≜ {(x, f(x), g(x)) : x ≥ 0}, which is a connected set. For any pair of distributions (P, Q) that
produces a point of R, we construct a random variable X as in (7.1); then (E[X], E[f(X)], E[g(X)]) ∈
co(S). By the Fenchel–Eggleston–Carathéodory theorem,¹ there exist points (xi, f(xi), g(xi)) and
corresponding weights αi, i = 1, 2, 3, such that

    (E[X], E[f(X)], E[g(X)]) = Σ_{i=1}^3 αi (xi, f(xi), g(xi)).
¹ To prove Theorem 7.1, it suffices to invoke the basic Carathéodory theorem, which proves a weaker version of
Lemma 7.1, namely R = R5.

We construct another random variable X′ that takes value xi with probability αi. Then X′ takes 3
values and

    (E[X], E[f(X)], E[g(X)]) = (E[X′], E[f(X′)], E[g(X′)]).    (7.3)

By Lemma 7.3 and (7.3),

    ( Df(P‖Q), Dg(P‖Q) ) = ( E[f(X)] + f̃(0)(1 − E[X]), E[g(X)] + g̃(0)(1 − E[X]) )
                         = ( E[f(X′)] + f̃(0)(1 − E[X′]), E[g(X′)] + g̃(0)(1 − E[X′]) ) ∈ R4.

We observe from Lemma 7.3 that Df(P‖Q) only depends on the distribution of X, for some
X ≥ 0 with E[X] ≤ 1. To find a pair of distributions that produces a point in Rk it suffices to find a
random variable X ≥ 0 taking k values with E[X] = 1, or taking k − 1 values with E[X] ≤ 1. In
the first example of Section 7.1.1, where (P, Q) produces a point in R3, we want to show that it also
belongs to co(R2). The decomposition of a point in R3 is equivalent to a decomposition of the
likelihood ratio X as

    µX = α µ1 + ᾱ µ2.

One such decomposition is µX = 0.5 µ1 + 0.5 µ2, where µ1, µ2 have the following pmfs:

    x:      0.4  3.4        x:      0.4  6.4
    µ1(x):  0.8  0.2        µ2(x):  0.9  0.1

Then by (7.2) we obtain two pairs of distributions:

    P1:  0.32  0.68         P2:  0.36  0.64
    Q1:  0.80  0.20         Q2:  0.90  0.10

We obtain that

    ( Df(P‖Q), Dg(P‖Q) ) = 0.5 ( Df(P1‖Q1), Dg(P1‖Q1) ) + 0.5 ( Df(P2‖Q2), Dg(P2‖Q2) ).

Proof of Lemma 7.2. It suffices to prove the first statement, namely, Rk+1 = co(R2 ∪ Rk) for any
k ≥ 2. By the same argument as in the proof of Theorem 7.2 we have co(Rk) ⊆ Rk+1, and note
that R2 ∪ Rk = Rk. We only need to prove that

    Rk+1 ⊆ co(R2 ∪ Rk).

Given any pair of distributions (P, Q) that produces a point (Df(P‖Q), Dg(P‖Q)) ∈ Rk+1, we
construct a random variable X as in (7.1) that takes at most k + 1 values. Let µ denote the
distribution of X. We consider the two cases Eµ[X] < 1 and Eµ[X] = 1 separately.

• Eµ[X] < 1. Then X takes at most k values, since otherwise P ≪ Q. Denote the smallest value
  of X by x; then x < 1. Suppose µ(x) = q; then µ can be represented as

      µ = q δx + q̄ µ0,

  where µ0 is supported on at most k − 1 values of X other than x. Let µ2 = δx. We need to
  construct another probability measure µ1 such that

      µ = α µ1 + ᾱ µ2.

  – Eµ0[X] ≤ 1: let µ1 = µ0 and let α = q̄.
  – Eµ0[X] > 1: let µ1 = p δx + p̄ µ0 where p = (Eµ0[X] − 1)/(Eµ0[X] − x), so that Eµ1[X] = 1.
    Let α = (Eµ[X] − x)/(1 − x).

• Eµ[X] = 1.² Denote the smallest value of X by x and the largest value by y; then x ≤ 1, y ≥ 1.
  Suppose µ(x) = r and µ(y) = s; then µ can be represented as

      µ = r δx + (1 − r − s) µ0 + s δy,

  where µ0 is supported on at most k − 1 values of X other than x, y. Let µ2 = β δx + β̄ δy where
  β = (y − 1)/(y − x), so that Eµ2[X] = 1. We need to construct another probability measure µ1 such
  that

      µ = α µ1 + ᾱ µ2.

  – Eµ0[X] ≤ 1: let µ1 = p δy + p̄ µ0 where p = (1 − Eµ0[X])/(y − Eµ0[X]), so that Eµ1[X] = 1. Let ᾱ = r/β.
  – Eµ0[X] > 1: let µ1 = p δx + p̄ µ0 where p = (Eµ0[X] − 1)/(Eµ0[X] − x), so that Eµ1[X] = 1. Let ᾱ = s/β̄.

Applying the construction in (7.2) with µ1 and µ2, we obtain two pairs of distributions (P1, Q1)
supported on k values and (P2, Q2) supported on two values, respectively. Then

    ( Df(P‖Q), Dg(P‖Q) ) = ( Eµ[f(X)] + f̃(0)(1 − Eµ[X]), Eµ[g(X)] + g̃(0)(1 − Eµ[X]) )
      = α ( Eµ1[f(X)] + f̃(0)(1 − Eµ1[X]), Eµ1[g(X)] + g̃(0)(1 − Eµ1[X]) )
        + ᾱ ( Eµ2[f(X)] + f̃(0)(1 − Eµ2[X]), Eµ2[g(X)] + g̃(0)(1 − Eµ2[X]) )
      = α ( Df(P1‖Q1), Dg(P1‖Q1) ) + ᾱ ( Df(P2‖Q2), Dg(P2‖Q2) ).

Remark 7.1. Theorem 7.1 can be viewed as a consequence of Krein-Milman’s theorem. Consider
the space of {PX : X ≥ 0, E[X] ≤ 1}, which has only two types of extreme points:

1. X = x for 0 ≤ x ≤ 1;

2. X takes two values x1 , x2 with probability α1 , α2 , respectively, and E[X] = 1.

For the first case, let P = Bern(x) and Q = δ1 ; for the second case, let P = Bern(α2 x2 ) and
Q = Bern(α2 ).

7.2 Examples
7.2.1 Hellinger distance versus total variation
The upper and lower bounds we mentioned in the last lecture are the following [Tsy09, Sec. 2.4]:

    ½ H²(P, Q) ≤ TV(P, Q) ≤ H(P, Q) √(1 − H²(P, Q)/4).    (7.4)
² Many thanks to Pengkun Yang for correcting the error in the original proof.

Their joint range R2 has a parametric formula

    { ( 2(1 − √(pq) − √(p̄q̄)), |p − q| ) : 0 ≤ p ≤ 1, 0 ≤ q ≤ 1 }

and is the gray region in Fig. 7.2. The joint range R is the convex hull of R2 (the gray region itself
is non-convex) and is exactly described by (7.4); so (7.4) is not improvable. Indeed, as t ranges from
0 to 1,

• the upper boundary is achieved by P = Bern((1 + t)/2), Q = Bern((1 − t)/2),

• the lower boundary is achieved by P = (1 − t, t, 0), Q = (1 − t, 0, t).

[Figure 7.2: Joint range of TV and H². Horizontal axis: H² ∈ [0, 2]; vertical axis: TV ∈ [0, 1].]
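The region in Figure 7.2 can be reproduced by brute force: by Theorem 7.1 it suffices to sweep over
pairs of binary distributions (and, for this particular pair, the extremal ternary examples above).
The sketch below (ours) samples binary pairs and checks that every sampled point obeys (7.4):

```python
import numpy as np

rng = np.random.default_rng(1)

def h2_tv_binary(p, q):
    P, Q = np.array([p, 1 - p]), np.array([q, 1 - q])
    h2 = float(np.sum((np.sqrt(P) - np.sqrt(Q)) ** 2))
    tv = 0.5 * float(np.abs(P - Q).sum())
    return h2, tv

# sample R2 and verify that each (H^2, TV) point satisfies the sandwich bound (7.4)
for _ in range(10000):
    h2, tv = h2_tv_binary(rng.uniform(), rng.uniform())
    assert h2 / 2 <= tv + 1e-12
    assert tv <= np.sqrt(h2) * np.sqrt(1 - h2 / 4) + 1e-12
print("all sampled (H^2, TV) points lie between the two boundaries in (7.4)")
```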

7.2.2 KL divergence versus total variation


Pinsker’s inequality states that

    D(P‖Q) ≥ 2 log e · TV²(P, Q).    (7.5)

There are various kinds of improvements of Pinsker’s inequality. Now we know that the best lower
bound is the lower boundary of Fig. 7.1, which is exactly the boundary of R2. Although there is no
known closed-form expression, a parametric formula of the lower boundary (see Fig. 7.1) is not hard
to write down [FHT03, Theorem 1]:

    { TV_t = ½ t ( 1 − (coth(t) − 1/t)² ),
      D_t  = −t² csch²(t) + t coth(t) + log(t csch(t)) },    t ≥ 0.    (7.6)

Here is a corollary (a weaker bound) that we will use later:

    D(P‖Q) ≥ TV(P, Q) log [ (1 + TV(P, Q)) / (1 − TV(P, Q)) ].

Consequences:

• The original Pinsker’s inequality shows that D → 0 ⇒ TV → 0.

• TV → 1 ⇒ D → ∞. Thus D = O(1) ⇒ TV is bounded away from one. This is not obtainable
from Pinsker’s inequality.

Also from Fig. 7.1 we know that it is impossible to have an upper bound of D(P kQ) using a function
of TV(P, Q) due to the lack of upper boundary.
For more examples see [Tsy09, Sec. 2.4].

7.3 Joint range between various divergences


Joint ranges between various pairs of f-divergences are summarized as follows in terms of inequalities,
each of which is tight.

• KL vs. TV: see (7.6). There is a partial comparison in the other direction (“reverse Pinsker”):

    D(µ‖π) ≤ log( 1 + (2/π∗) TV(µ, π)² ) ≤ (2 log e / π∗) TV(µ, π)²,    π∗ = min_x π(x).

• KL vs. Hellinger:

    D(P‖Q) ≥ 2 log [ 2 / (2 − H²(P, Q)) ].    (7.7)

  There is a partial result in the opposite direction (log-Sobolev inequality for the Bonami–Beckner
  semigroup):

    D(µ‖π) ≤ [ log(1/π∗ − 1) / (1 − 2π∗) ] ( H²(µ, π) − H⁴(µ, π)/4 ),    π∗ = min_x π(x).

• KL vs. χ²:

    0 ≤ D(P‖Q) ≤ log(1 + χ²(P‖Q)).    (7.8)

  (I.e., no lower bound on KL in terms of χ² is possible.)

• TV and Hellinger: see (7.4). Another bound [Gil10]:

    TV(P, Q) ≤ √( −2 ln( 1 − H²(P, Q)/2 ) ).

• Le Cam and Hellinger [LC86, p. 48]:

    ½ H²(P, Q) ≤ LC(P, Q) ≤ H²(P, Q).    (7.9)

• Le Cam and Jensen–Shannon [Top00]:

    LC(P‖Q) log e ≤ JS(P‖Q) ≤ LC(P‖Q) · 2 log 2.    (7.10)

• χ² and TV [HV11, Eq. (12)]:

    χ²(P‖Q) ≥ 4 TV(P, Q)².    (7.11)

Part II

Lossless data compression

§ 8. Variable-length Lossless Compression

The principal engineering goal of data compression is to represent a given sequence a1 , a2 , . . . , an


produced by a source as a sequence of bits of minimal possible length with possible algorithmic
constraints. Of course, reducing the number of bits is generally impossible, unless the source satisfies
certain restrictions, that is, only a small subset of all sequences actually occur in practice. Is this
the case for real world data?
As a simple demonstration, one may take two English novels and compute empirical frequencies
of each letter. It will turn out to be the same for both novels (approximately). Thus, we can see
that there is some underlying structure in English texts restricting possible output sequences. The
structure goes beyond empirical frequencies of course, as further experimentation (involving digrams,
word frequencies etc) may reveal. Thus, the main reason for the possibility of data compression is
the experimental (empirical) law: real-world sources produce very restricted sets of sequences.
How do we model these restrictions? Further experimentation (with language, music, images)
reveals that frequently, the structure may be well described if we assume that sequences are generated
probabilistically [Sha48, Sec. III]. This is one of the main contributions of Shannon: another empirical
law states that real-world sources may be described probabilistically with increasing precision starting
from i.i.d., 1st order Markov, 2nd order Markov etc. Note that sometimes one needs to find an
appropriate basis in which this “law” holds – this is the case of images (i.e. rasterized sequence
of pixels won’t appear to have local probabilistic laws, because of forgetting the 2-D constraints;
wavelets and local Fourier transform provide much better bases).1
So our initial investigation will be about representing one random variable X ∼ PX in terms of
bits efficiently. Types of compression:

• Lossless:
P (X 6= X̂) = 0. variable-length code, uniquely decodable codes, prefix codes, Huffman codes

• Almost lossless:
P (X 6= X̂) ≤ . fixed-length codes

• Lossy:
X → W → X̂ s.t. E[(X − X̂)2 ] ≤ distortion.

8.1 Variable-length, lossless, optimal compressor


Coding paradigm:
Remark 8.1.

• Codeword: f (x) ∈ {0, 1}∗ ; Codebook: {f (x) : x ∈ X } ⊂ {0, 1}∗


¹ Of course, one should not take these “laws” too far. In regards to language modeling, the (finite-state) Markov
assumption is too simplistic to truly generate all proper sentences, cf. Chomsky [Cho56].

[Diagram: X → Compressor f : X → {0, 1}∗ → Decompressor g : {0, 1}∗ → X → X.]

• Since {0, 1}∗ = {∅, 0, 1, 00, 01, . . . } is countable, lossless compression is only possible for
discrete R.V.;

• If we want g ◦ f = 1X (lossless), then f must be injective;

• WLOG, we can relabel X such that X = N = {1, 2, . . . } and order the pmf decreasingly:
PX (i) ≥ PX (i + 1).

• Note that at this point we do not impose any conditions on the codebook (such as prefix
or unique-decodability). This is sometimes called single-shot compression setting. Original
results for this setting can be found in [KV14].

Length function:
l : {0, 1}∗ → N
e.g., l(01001) = 5.
Objectives: Find the best compressor f to minimize

E[l(f (X))]

sup l(f (X))


median l(f (X))
It turns out that there is a compressor f ∗ that minimizes all objectives simultaneously!
Main idea: Assign longer codewords to less likely symbols, and reserve the shorter codewords
for more probable symbols.
Aside: It is useful to introduce the partial order of stochastic dominance: for real-valued RVs X
and Y, we say Y stochastically dominates (or, is stochastically larger than) X, denoted by X ≤st Y,
if P[Y ≤ t] ≤ P[X ≤ t] for all t ∈ R. In other words, X ≤st Y iff the CDF of X is larger than the
CDF of Y pointwise. In particular, if X is dominated by Y stochastically, so are their means,
medians, suprema, etc.
Theorem 8.1 (optimal f∗). Consider the compressor f∗ that assigns to the i-th most probable symbol
the i-th element of {∅, 0, 1, 00, 01, 10, 11, 000, . . .} (the binary strings ordered first by length, then
lexicographically). Then

1. length of codeword:
    l(f∗(i)) = ⌊log₂ i⌋

2. l(f∗(X)) is stochastically the smallest: for any lossless f,

    l(f∗(X)) ≤st l(f(X)),

   i.e., for any k, P[l(f(X)) ≤ k] ≤ P[l(f∗(X)) ≤ k].

Proof. Note that

    |Ak| ≜ |{x : l(f(x)) ≤ k}| ≤ Σ_{i=0}^k 2^i = 2^{k+1} − 1 = |{x : l(f∗(x)) ≤ k}| ≜ |A∗k|,

where the inequality is because f is lossless and hence |Ak| can be at most the total number of
binary strings of length at most k. Then

    P[l(f(X)) ≤ k] = Σ_{x∈Ak} PX(x) ≤ Σ_{x∈A∗k} PX(x) = P[l(f∗(X)) ≤ k],

since |Ak| ≤ |A∗k| and A∗k contains the 2^{k+1} − 1 most likely symbols.

The following lemma (homework) is useful in bounding the expected code length of f∗. It says that
if a random variable is integer-valued, then its entropy can be controlled using its mean.
Lemma 8.1. For any Z ∈ N s.t. E[Z] < ∞, H(Z) ≤ E[Z] h(1/E[Z]), where h(·) is the binary entropy
function.
Theorem 8.2 (Optimal average code length: exact expression). Suppose X ∈ N and PX(1) ≥
PX(2) ≥ . . .. Then

    E[l(f∗(X))] = Σ_{k=1}^∞ P[X ≥ 2^k].

Proof. Recall that the expectation of U ∈ Z+ can be written as E[U] = Σ_{k≥1} P[U ≥ k]. Then by
Theorem 8.1, E[l(f∗(X))] = E[⌊log₂ X⌋] = Σ_{k≥1} P[⌊log₂ X⌋ ≥ k] = Σ_{k≥1} P[log₂ X ≥ k].
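A direct implementation of f∗ is short (sketch ours, not from the notes): sort the pmf, assign the
i-th shortest binary string to the i-th most likely symbol, and compare E[l(f∗(X))] with the bounds
of Theorem 8.3. For a single letter the lower bound is vacuous; apply it to blocks X = S^n for
meaningful numbers.

```python
import numpy as np

def f_star_lengths(pmf):
    """Code lengths of f*: the i-th most likely symbol (i = 1, 2, ...) gets the
    i-th string in {empty, 0, 1, 00, 01, ...}, of length floor(log2 i)."""
    p = np.sort(np.asarray(pmf, float))[::-1]
    lengths = np.floor(np.log2(np.arange(1, len(p) + 1)))
    return p, lengths

def f_star_codeword(i):
    """The codeword assigned to rank i >= 1: drop the leading 1 of the binary expansion of i."""
    return bin(i)[3:]   # i=1 -> '', i=2 -> '0', i=3 -> '1', i=4 -> '00', ...

p, L = f_star_lengths([0.445, 0.445, 0.11])
print(float((p * L).sum()))                 # E[l(f*(X))] = 0.555 bits
H = float(-(p * np.log2(p)).sum())
print(H - np.log2(np.e * (H + 1)), H)       # lower / upper bounds of Theorem 8.3
```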

Theorem 8.3 (Optimal average code length v.s. entropy).

H(X) bits − log2 [e(H(X) + 1)] ≤ E[l(f ∗ (X))] ≤ H(X) bits

Note: Theorem 8.3 is the first example of a coding theorem, which relates the fundamental limit
E[l(f ∗ (X))] (operational quantity) to the entropy H(X) (information measure).

Proof. Define L(X) = l(f∗(X)).

RHS: observe that since the pmf is ordered decreasingly by assumption, PX(m) ≤ 1/m, so
L(m) ≤ log₂ m ≤ log₂(1/PX(m)). Taking expectations yields E[L(X)] ≤ H(X).

LHS:

    H(X) = H(X, L) = H(X|L) + H(L)
         ≤ E[L] + (1 + E[L]) h( 1/(1 + E[L]) )          (Lemma 8.1)
         = E[L] + log₂(1 + E[L]) + E[L] log( 1 + 1/E[L] )
         ≤ E[L] + log₂(1 + E[L]) + log₂ e               (x log(1 + 1/x) ≤ log e, ∀x > 0)
         ≤ E[L] + log₂( e(1 + H(X)) )                   (by the RHS)

where we have used H(X|L = k) ≤ k bits, since given l(f∗(X)) = k, X has at most 2^k choices.
Note: (Memoryless source) If X = S^n is an i.i.d. sequence, then

    nH(S) ≥ E[l(f∗(S^n))] ≥ nH(S) − log n + O(1).

For iid sources, the exact asymptotic behavior is found in [SV11, Theorem 4]:

    E[ℓ(f∗(S^n))] = nH(S) − ½ log n + O(1),

unless the source is uniform (in which case it is nH(S) + O(1)).
Theorem 8.3 relates the mean of l(f∗(X)) to that of log₂(1/PX(X)) (entropy). The next result
relates their CDFs.
Theorem 8.4 (Code length distribution of f∗). ∀τ > 0, k ≥ 0,

    P[ log₂ (1/PX(X)) ≤ k ] ≤ P[ l(f∗(X)) ≤ k ] ≤ P[ log₂ (1/PX(X)) ≤ k + τ ] + 2^{−τ+1}.

Proof. LHS (achievability): Use PX(m) ≤ 1/m. Then, similarly to Theorem 8.3, L(m) =
⌊log₂ m⌋ ≤ log₂ m ≤ log₂(1/PX(m)). Hence L(X) ≤ log₂(1/PX(X)) a.s.
RHS (converse): By truncation,

    P[L ≤ k] = P[ L ≤ k, log₂(1/PX(X)) ≤ k + τ ] + P[ L ≤ k, log₂(1/PX(X)) > k + τ ]
             ≤ P[ log₂(1/PX(X)) ≤ k + τ ] + Σ_{x∈X} PX(x) 1{l(f∗(x)) ≤ k} 1{PX(x) ≤ 2^{−k−τ}}
             ≤ P[ log₂(1/PX(X)) ≤ k + τ ] + (2^{k+1} − 1) · 2^{−k−τ}.
So far our discussion applies to an arbitrary random variable X. Next we consider the source as
a random process (S1, S2, . . .) and introduce the blocklength n. We apply our results to X = S^n, that is,
by treating the first n symbols as a supersymbol. The following corollary states that the limiting
behaviors of l(f∗(S^n)) and log₂(1/P_{S^n}(S^n)) always coincide.
Corollary 8.1. Let (S1, S2, . . .) be some random process and U be some random variable. Then

    (1/n) log₂ (1/P_{S^n}(S^n)) →d U  ⟺  (1/n) l(f∗(S^n)) →d U    (8.1)

and

    (1/√n) ( log₂ (1/P_{S^n}(S^n)) − H(S^n) ) →d V  ⟺  (1/√n) ( l(f∗(S^n)) − H(S^n) ) →d V.    (8.2)

Proof. The proof is simply logic. First recall: convergence in distribution is equivalent to convergence
of the CDF at all continuity points, i.e., Un →d U ⟺ P[Un ≤ u] → P[U ≤ u] for all u at which the
CDF of U is continuous (i.e., u is not an atom of U).

To get (8.1), apply Theorem 8.4 with X = S^n, k = un and τ = √n:

    P[ (1/n) log₂ (1/P_{S^n}(S^n)) ≤ u ] ≤ P[ (1/n) l(f∗(S^n)) ≤ u ]
                                          ≤ P[ (1/n) log₂ (1/P_{S^n}(S^n)) ≤ u + 1/√n ] + 2^{−√n+1}.

To get (8.2), apply Theorem 8.4 with k = H(S^n) + √n u and τ = n^{1/4}:

    P[ (1/√n) ( log₂ (1/P_{S^n}(S^n)) − H(S^n) ) ≤ u ] ≤ P[ (1/√n) ( l(f∗(S^n)) − H(S^n) ) ≤ u ]
        ≤ P[ (1/√n) ( log₂ (1/P_{S^n}(S^n)) − H(S^n) ) ≤ u + n^{−1/4} ] + 2^{−n^{1/4}+1}.

Remark 8.2 (Memoryless source). Now let us consider S^n that is i.i.d. Then the important
observation is that the log-likelihood becomes an i.i.d. sum:

    log (1/P_{S^n}(S^n)) = Σ_{i=1}^n log (1/P_S(Si)),

where the summands are i.i.d.

1. By the Law of Large Numbers (LLN), we know that (1/n) log (1/P_{S^n}(S^n)) → E[log (1/P_S(S))] = H(S)
   in probability. Therefore in (8.1) the limiting distribution U is degenerate, i.e., U = H(S), and we have
   (1/n) l(f∗(S^n)) → H(S) in probability. [Note: convergence in distribution to a constant ⟺ convergence
   in probability to a constant.]

2. By the Central Limit Theorem (CLT), if V(S) ≜ Var[log (1/P_S(S))] < ∞,² then we know that V
   in (8.2) is Gaussian, i.e.,

    (1/√(nV(S))) ( log (1/P_{S^n}(S^n)) − nH(S) ) →d N(0, 1).

   Consequently, we have the following Gaussian approximation for the probability law of the
   optimal code length:

    (1/√(nV(S))) ( l(f∗(S^n)) − nH(S) ) →d N(0, 1),

   or, in shorthand,

    l(f∗(S^n)) ∼ nH(S) + √(nV(S)) N(0, 1) in distribution.

   The Gaussian approximation tells us the speed of convergence of (1/n) l(f∗(S^n)) to the entropy
   and gives us a good approximation at finite n. In the next section we apply our bounds to
   approximate the distribution of l(f∗(S^n)) in a concrete example.

² V(S) is often known as the varentropy of S.

8.1.1 Compressing iid ternary source
Consider the source outputing n ternary letters each independent and distributed as

PX = [.445 .445 .11] .

For an iid source it can be shown that

    E[ℓ(f∗(X^n))] = nH(X) − ½ log(2πeV n) + O(1),

where we denoted the varentropy of X by

    V(X) ≜ Var[ log (1/PX(X)) ].

The Gaussian approximation to ℓ(f∗(X^n)) is defined as

    nH(X) − ½ log(2πeV n) + √(nV) Z,

where Z ∼ N(0, 1).
In Figures 8.1, 8.2 and 8.3 we plot the distribution of the length of the optimal compressor for
different values of n and compare with the Gaussian approximation.
Upper/lower bounds on the expectation:

    H(X^n) − log(H(X^n) + 1) − log e ≤ E[ℓ(f∗(X^n))] ≤ H(X^n).

Here are the numbers for different n


n = 20 21.5 ≤ 24.3 ≤ 27.8
n = 100 130.4 ≤ 134.4 ≤ 139.0
n = 500 684.1 ≤ 689.2 ≤ 695.0

In all cases above E[`(f ∗ (X))] is close to a midpoint between the two.
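The entropy, varentropy and the outer bounds in the table above can be recomputed as follows
(sketch ours, not from the notes); it reproduces the rows up to rounding:

```python
import numpy as np

PX = np.array([0.445, 0.445, 0.11])
H = float(-(PX * np.log2(PX)).sum())                 # entropy, bits
V = float((PX * (np.log2(1 / PX) - H) ** 2).sum())   # varentropy, bits^2

for n in (20, 100, 500):
    Hn = n * H
    lower = Hn - np.log2(Hn + 1) - np.log2(np.e)      # H(X^n) - log(H(X^n)+1) - log e
    print(n, round(lower, 1), "<= E[l] <=", round(Hn, 1))   # e.g. 20: 21.5 <= E[l] <= 27.8
```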

Figure 8.1: CDF and PMF of the optimal compressor, n = 20, PX = [0.445 0.445 0.110]: true CDF/PMF, lower and upper bounds, and the Gaussian approximation, plotted against the rate.

Figure 8.2: CDF and PMF, n = 100, PX = [0.445 0.445 0.110]; the Gaussian approximation is shifted to the true E[ℓ(f∗(X))].

Figure 8.3: CDF and PMF of the optimal compressor, n = 500, PX = [0.445 0.445 0.110], with bounds and Gaussian approximation.

8.2 Uniquely decodable codes, prefix codes and Huffman codes

[Figure: nested code classes — Huffman codes are contained in prefix codes, which are contained in uniquely decodable codes, which are contained in all lossless codes.]

We have studied f∗, which achieves the stochastically smallest code length among all variable-length
compressors. Note that f∗ is obtained by ordering the pmf and assigning shorter codewords to
more likely symbols. In this section we focus on a specific class of compressors with good algorithmic
properties, which lead to low complexity and short delay when decoding from a stream of compressed
bits. This part is more combinatorial in nature.
We start with a few definitions. Let A⁺ = ∪_{n≥1} A^n denote all non-empty finite-length strings
consisting of symbols from the alphabet A. Throughout this lecture A is a countable set.
Definition 8.1 (Extension of a code). The (symbol-by-symbol) extension of f : A → {0, 1}∗ is
f : A+ → {0, 1}∗ where f (a1 , . . . , an ) = (f (a1 ), . . . , f (an )) is defined by concatenating the bits.
Definition 8.2 (Uniquely decodable codes). f : A → {0, 1}∗ is uniquely decodable if its extension
f : A+ → {0, 1}∗ is injective.
Definition 8.3 (Prefix codes). f : A → {0, 1}∗ is a prefix code 3 if no codeword is a prefix of another
(e.g., 010 is a prefix of 0101).
Example: A = {a, b, c}.

• f (a) = 0, f (b) = 1, f (c) = 10 – not uniquely decodable, since f (ba) = f (c) = 10.

• f (a) = 0, f (b) = 10, f (c) = 11 – uniquely decodable and prefix.

• f (a) = 0, f (b) = 01, f (c) = 011, f (d) = 0111 – uniquely decodable but not prefix, since as long
as 0 appears, we know that the previous codeword has terminated.

Remark 8.3.

1. Prefix codes are uniquely decodable.


³ Also known as a prefix-free/comma-free/self-punctuating/instantaneous code.

2. Similar to prefix-free codes, one can define suffix-free codes. Those are also uniquely decodable
(one should start decoding in reverse direction).

3. By definition, any uniquely decodable code does not have the empty string as a codeword.
Hence f : X → {0, 1}+ in both Definition 8.2 and Definition 8.3.

4. Unique decodability means that one can decode from a stream of bits without ambiguity, but
one might need to look ahead in order to decide the termination of a codeword. (Think of the
last example). In contrast, prefix codes allow the decoder to decode instantaneously without
looking ahead.

5. Prefix code ↔ binary tree (codewords are leaves) ↔ strategy to ask “yes/no” questions

Theorem 8.5 (Kraft–McMillan).

1. Let f : A → {0, 1}∗ be uniquely decodable. Set la = l(f(a)). Then f satisfies the Kraft
   inequality
    Σ_{a∈A} 2^{−la} ≤ 1.    (8.3)

2. Conversely, for any set of code lengths {la : a ∈ A} satisfying (8.3), there exists a prefix code
   f such that la = l(f(a)). Moreover, such an f can be computed efficiently.

Note: The consequence of Theorem 8.5 is that as far as compression efficiency is concerned, we can
forget about uniquely decodable codes that are not prefix codes.

Proof. We prove the Kraft inequality for prefix codes and uniquely decodable codes separately.
The purpose of doing a separate proof for prefix codes is to illustrate the powerful technique of the
probabilistic method. The idea is from [AS08, Exercise 1.8, p. 12].
Let f be a prefix code. Let us construct a probability space such that the LHS of (8.3) is the
probability of some event, which cannot exceed one. To this end, consider the following scenario:
generate independent Bern(1/2) bits; stop if a codeword has been written, otherwise continue. This
process terminates with probability Σ_{a∈A} 2^{−la}. The summation makes sense because the events
that a given codeword is written are mutually exclusive, thanks to the prefix condition.
Now let f be a uniquely decodable code. The proof uses a generating function as a device for
counting. (The analogy in coding theory is the weight enumerator function.) First assume A is
finite. Then L = max_{a∈A} la is finite. Let Gf(z) = Σ_{a∈A} z^{la} = Σ_{l=0}^L Al(f) z^l, where Al(f) denotes
the number of codewords of length l in f. For k ≥ 1, define f^k : A^k → {0, 1}⁺ as the symbol-by-symbol
extension of f. Then

    G_{f^k}(z) = Σ_{a^k∈A^k} z^{l(f^k(a^k))} = Σ_{a1} · · · Σ_{ak} z^{l_{a1}+···+l_{ak}} = [Gf(z)]^k = Σ_{l=0}^{kL} Al(f^k) z^l.

By the unique decodability of f, f^k is lossless. Hence Al(f^k) ≤ 2^l. Therefore we have
Gf(1/2)^k = G_{f^k}(1/2) ≤ kL for all k. Then Σ_{a∈A} 2^{−la} = Gf(1/2) ≤ lim_{k→∞}(kL)^{1/k} = 1. If A is
countably infinite, for any finite subset A′ ⊂ A, repeating the same argument gives Σ_{a∈A′} 2^{−la} ≤ 1.
The proof is complete by the arbitrariness of A′.
Conversely, given a set of code lengths {la : a ∈ A} s.t. Σ_{a∈A} 2^{−la} ≤ 1, construct a prefix code
f as follows: first relabel A to N and assume that 1 ≤ l1 ≤ l2 ≤ . . .. For each i, define

    ai ≜ Σ_{k=1}^{i−1} 2^{−lk},

with a1 = 0. Then ai < 1 by the Kraft inequality. Thus we define the codeword f(i) ∈ {0, 1}⁺ as the
first li bits in the binary expansion of ai. Finally, we prove that f is a prefix code by contradiction:
suppose for some j > i, f(i) is a prefix of f(j) (recall lj ≥ li). Then aj − ai < 2^{−li}, since aj and ai agree
on the most significant li bits. But aj − ai = 2^{−li} + 2^{−l_{i+1}} + · · · + 2^{−l_{j−1}} ≥ 2^{−li}, which is a contradiction.
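The converse construction in the proof is easy to code (sketch ours, not from the notes): sort the
lengths, form the cumulative sums ai, and read off the first li bits of their binary expansions. The
float arithmetic below is exact as long as the lengths stay within double precision.

```python
def kraft_prefix_code(lengths):
    """Given lengths with sum 2^{-l} <= 1, return a prefix code with those lengths
    (the construction from the converse part of Theorem 8.5)."""
    assert sum(2.0 ** (-l) for l in lengths) <= 1.0 + 1e-12
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    code = [None] * len(lengths)
    a = 0.0                         # cumulative sum a_i (a dyadic rational, exact in floats)
    for i in order:
        l = lengths[i]
        bits, x = [], a             # first l bits of the binary expansion of a
        for _ in range(l):
            x *= 2
            bits.append('1' if x >= 1 else '0')
            x -= int(x)
        code[i] = ''.join(bits)
        a += 2.0 ** (-l)
    return code

print(kraft_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```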

Open problems:

1. Find a probabilistic proof of the Kraft inequality for uniquely decodable codes.

2. There is a conjecture of Ahlswede that for any set of lengths for which Σ 2^{−la} ≤ 3/4 there exists
   a fix-free code (i.e., one which is simultaneously prefix-free and suffix-free). So far, existence
   has only been shown when the Kraft sum is ≤ 5/8, cf. [Yek04].

In view of Theorem 8.5, the optimal average code length among all prefix (or uniquely decodable)
codes is given by the following optimization problem:

    L∗(X) ≜ min Σ_{a∈A} PX(a) la    (8.4)
    s.t.  Σ_{a∈A} 2^{−la} ≤ 1,  la ∈ N.

This is an integer programming (IP) problem, which, in general, is computationally hard to solve.
It is remarkable that this particular IP can be solved in near-linear time, thanks to the Huffman
algorithm. Before describing the construction of Huffman codes, let us give bounds on L∗(X) in
terms of entropy:
Theorem 8.6.
H(X) ≤ L∗ (X) ≤ H(X) + 1 bit. (8.5)

l m
Proof. “≤” Consider the following length assignment la = log2 PX1(a) ,4 which satisfies Kraft
P −la ≤
P
since a∈A l 2 m a∈A PX (a) = 1. By Theorem 8.5, there exists a prefix code f such that
l(f (a)) = log2 PX (a) and El(f (X)) ≤ H(X) + 1.
1

“≥” We give two proofs for the converse. One of the commonly used ideas to deal with
combinatorial optimization is relaxation. Our first idea is to drop the integer constraints in (8.4)
and relax it into the following optimization problem, which obviously provides a lower bound
X
L∗ (X) , min PX (a)la (8.6)
a∈A
X
s.t. 2−la ≤ 1 (8.7)
a∈A

This is a nice convex programming problem, since the objective function is affine and the feasible set
is convex. Solving (8.6) by Lagrange multipliers (Exercise!) yields the minimum is equal to H(X)
(achieved at la = log2 PX1(a) ).
4
Such a code is called a Shannon code.

96
Another proof is the following: For any f satisfying Kraft inequality, define a probability measure
−la
Q(a) = P 2 2−la . Then
a∈A

X
El(f (X)) − H(X) = D(P kQ) − log 2−la
a∈A
≥0

Next we describe the Huffman code, which achieves the optimum in (8.4). In view of the fact
that prefix codes and binary trees are one-to-one, the main idea of Huffman code is to build the
binary tree bottom-up: Given a pmf {PX (a) : a ∈ A},

1. Choose the two least-probable symbols in the alphabet

2. Delete the two symbols and add a new symbol (with combined probabilities). Add the new
symbol as the parent node of the previous two symbols in the binary tree.

The algorithm terminates in |A| − 1 steps. Given the binary tree, the code assignment can be
obtained by assigning 0/1 to the branches. Therefore the time complexity is O(|A|) (sorted pmf) or
O(|A| log |A|) (unsorted pmf).
Example: A = {a, b, c, d, e}, PX = {0.25, 0.25, 0.2, 0.15, 0.15}.
Huffman tree: Codebook:
0 1
f (a) = 00
0.55 0.45 f (b) = 10
0 1 0 1 f (c) = 11
a 0.3 b c f (d) = 010
0 1 f (e) = 011
d e
Theorem 8.7 (Optimality of Huffman codes). The Huffman code achieves the minimal average
code length (8.4) among all prefix (or uniquely decodable) codes.

Proof. [CT06, Sec. 5.8].

Remark 8.4 (Drawbacks of Huffman codes).

1. Does not exploit memory. Solution: block Huffman coding. Shannon’s original idea from
1948 paper: in compressing English text, instead of dealing with letters and exploiting the
nonequiprobability of the English alphabet, working with pairs of letters to achieve more
compression (more generally, n-grams). Indeed, compressing the block (S1 , . . . , Sn ) using its
Huffman code achieves H(S1 , . . . , Sn ) within one bit, but the complexity is |A|n !

2. Non-universal (constructing the Huffman code needs to know the source distribution). This
brings us the question: Is it possible to design universal compressor which achieves entropy for
a class of source distributions? And what is the price to pay? See Homework and Lecture 11.

There are much more elegant solutions, e.g.,

1. Arithmetic coding: sequential encoding, linear complexity in compressing (S1 , . . . , Sn ) –


Section 11.1.

97
2. Lempel-Ziv algorithm: low-complexity, universal, provably optimal in a very strong sense –
Section 11.7.

To sum up: Comparison of average code length (in bits):

H(X) − log2 [e(H(X) + 1)] ≤ E[l(f ∗ (X))] ≤ H(X) ≤ E[l(fHuffman (X))] ≤ H(X) + 1.

98
§ 9. Fixed-length (almost lossless) compression. Slepian-Wolf.

9.1 Fixed-length code, almost lossless, AEP


Coding paradigm:

X Compressor {0, 1}k Decompressor X ∪ {e}


f : X →{0,1}k g: {0,1}k →X ∪{e}

Note: If we want g ◦ f = 1X , then k ≥ log2 |X |. But, the transmission link is erroneous anyway...
and it turns out that by tolerating a little error probability , we gain a lot in terms of code length!
Indeed, the key idea is to allow errors: Instead of insisting on g(f (x)) = x for all x ∈ X ,
consider only lossless decompression for a subset S ⊂ X :
(
x x∈S
g(f (x)) =
e x 6∈ S

and the probability of error:

P [g(f (X)) 6= X] = P [g(f (X)) = e] = P [X ∈ S] .

Definition 9.1. A compressor-decompressor pair (f, g) is called a (k, )-code if:

f : X → {0, 1}k
g : {0, 1}k → X ∪ {e}

such that g(f (x)) ∈ {x, e} and P [g(f (X)) = e] ≤ .


Fundamental limit:
∗ (X, k) , inf{ : ∃(k, )-code for X}
The following result connects the respective fundamental limits of fixed-length almost lossless
compression and variable-length lossless compression (Lecture 8):
Theorem 9.1 (Fundamental limit of error probabiliy).

∗ (X, k) = P [l(f ∗ (X)) ≥ k] = 1 − sum of 2k − 1 largest masses of X.

Proof. The proof is essentially tautological. Note 1 + 2 + · · · + 2k−1 = 2k − 1. Let S = {2k −


1 most likely realizations of X}. Then

∗ (X, k) = P [X 6∈ S] = P [l(f ∗ (X)) ≥ k] .

Optimal codes:

99
• Variable-length: f ∗ encodes the 2k −1 symbols with the highest probabilities to {φ, 0, 1, 00, . . . , 1k−1 }.

• Fixed-length: The optimal compressor f maps the elements of S into (00 . . . 00), . . . , (11 . . . 10)
and the rest to (11 . . . 11). The decompressor g decodes perfectly except for outputting e upon
receipt of (11 . . . 11).

Note: In Definition 9.1 we require that the errors are always detectable, i.e., g(f (x)) = x or e.
Alternatively, we can drop this requirement and allow undetectable errors, in which case we can of
course do better since we have more freedom in designing codes. It turns out that we do not gain
much by this relaxation. Indeed, if we define

˜∗ (X, k) = inf{P [g(f (X)) 6= X] : f : X → {0, 1}k , g : {0, 1}k → X ∪ {e}},

then ∗ k
P ˜ (X, k) = 1−sum of 2 largest masses of X. This follows immediately from P [g(f (X)) = X] =
P (x) where S , {x : g(f (x)) = x} satisfies |S| ≤ 2 k , because f takes no more than 2k values.
x∈S X
Compared to Theorem 9.1, we see that ˜∗ (X, k) and ˜∗ (X, k) do not differ much. In particular,
∗ (X, k + 1) ≤ ˜∗ (X, k) ≤ ∗ (X, k).
Corollary 9.1 (Shannon). Let S n be i.i.d. Then

∗ n 0 R > H(S)
lim  (S , nR) =
n→∞ 1 R < H(S)
p
lim ∗ (S n , nH(S) + nV (S)γ) = 1 − Φ(γ).
n→∞

where Φ(·) is the CDF of N (0, 1), H(S) = E[log PS1(S) ] is the entropy, V (S) = Var[log PS1(S) ] is the
varentropy which is assumed to be finite.

Proof. Combine Theorem 9.1 with Corollary 8.1.

Theorem 9.2 (Converse).


 
∗ ∗ 1
 (X, k) ≥ ˜ (X, k) ≥ P log2 > k + τ − 2−τ , ∀τ > 0.
PX (X)

8.4. Let S = {x : g(f (x)) k


Proof. Identical to the converse of Theorem
h i   = x}. Then |S| ≤ 2 and
1
P [X ∈ S] ≤ P log2 PX1(X) ≤ k + τ + P X ∈ S, log2 >k+τ
PX (X)
| {z }
≤2−τ

Two achievability bounds


Theorem 9.3.  
∗ 1
 (X, k) ≤ P log2 ≥k . (9.1)
PX (X)

100
Proof. Construction: use those 2k − 1 symbols with the highest probabilities.
The analysis is essentially the same as the lower bound in Theorem 8.4 from Lecture 8. Note
1
that the mth largest mass PX (m) ≤ m . Therefore
X X X
∗ (X, k) = PX (m) = 1{m≥2k } PX (m) ≤ 1n 1
≥2k
o P (m)
X = E1nlog 1
o.
PX (m) 2 P (X) ≥k
X
m≥2k

Theorem 9.4.  
1
∗ (X, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0. (9.2)
PX (X)
Note: In fact, Theorem 9.3 is always stronger than Theorem 9.4. Still, we present the proof of
Theorem 9.4 and the technology behind it – random coding – a powerful technique for proving
existence (achievability) which we heavily rely on in this course. To see that Theorem 9.3 gives
a better bound, note that even the first term in (9.2) exceeds (9.1). Nevertheless, the method of
proof for this weaker bound will be useful for generalizations.

Proof. Construction: random coding (Shannon’s magic). For a given compressor f , the optimal
decompressor which minimizes the error probability is the maximum a posteriori (MAP) decoder,
i.e.,
g ∗ (w) = argmax PX|f (X) (x|w) = argmax PX (x),
x x:f (x)=w

which can be hard to analyze. Instead, let us consider the following (suboptimal) decompressor g:

 x, ∃! x ∈ X s.t. f (x) = w and log2 PX1(x) ≤ k − τ,
g(w) = (exists unique high-probability x that is mapped to w)

e, o.w.

Note that log2 1


PX (x) ≤ k − τ ⇐⇒ PX (x) ≥ 2−(k−τ ) . We call those x “high-probability”.
Denote f (x) = cx and the codebook C = {cx : x ∈ X } ⊂ {0, 1}k . It is instructive to think of C
as a hashing table.
Error probability analysis: There are two ways to make an error ⇒ apply union bound. Before
proceeding, define
 
0 0 1
J(x, C) , x ∈ X : cx0 = cx , x 6= x, log2 ≤k−τ
PX (x0 )

to be the set of high-probability inputs whose hashes collide with that of x. Then we have the
following estimate for probability of error:
  
1
P [g(f (X)) = e] = P log2 > k − τ ∪ {J(X, C) 6= ∅}
PX (X)
 
1
≤ P log2 > k − τ + P [J(X, C) 6= φ]
PX (X)

The first term does not depend on the codebook C, while the second term does. The idea now
is to randomize over C and show that when we average over all possible choices of codebook, the
second term is smaller than 2−τ . Therefore there exists at least one codebook that achieves the
desired bound. Specifically, let us consider C which is uniformly distributed over all codebooks and

101
independently of X. Equivalently, since C can be represented by an |X | × k binary matrix, whose
rows correspond to codewords, we choose each entry to be independent fair coin flips.
Averaging the error probability (over C and over X), we have
 
EC [P [J(X, C) 6= φ]] = EC,X 1 ∃x0 6=X:log
n
1
o
2 P (x0 ) ≤k−τ,cx0 =cX
X
 
X
≤ EC,X  1nlog 1
≤k−τ
o1
{cx0 =cX }
 (union bound)
2 P (x0 )
x0 6=X X
 
X
= 2−k EX  1{PX (x0 )≥2−k+τ } 
x0 6=X
X
≤ 2−k 1{PX (x0 )≥2−k+τ }
x0 ∈X
−k k−τ
≤2 2 = 2−τ .

Remark 9.1 (Why the proof works). Compressor f (x) = cx can be thought as hashing x ∈ X to a
random k-bit string cx ∈ {0, 1}k .

Here: high-probability x ⇔ log2 1


PX (x) ≤ k − τ ⇔ PX (x) ≥ 2−k+τ . Therefore the cardinality of
high-probability x’s is at most 2k−τ  2k = number of strings. Hence the chance of collision is
small.
Remark 9.2. The random coding argument is a canonical example of probabilistic method : To
prove the existence of an object with certain property, we construct a probability distribution
(randomize) and show that on average the property is satisfied. Hence there exists at least one
realization with the desired property. The downside of this argument is that it is not constructive,
i.e., does not give us an algorithm to find the object.
Remark 9.3. This is a subtle point: Notice that in the proof we choose the random codebook
to be uniform over all possible codebooks. In other words, C = {cx : x ∈ X } consists of iid k-bit
strings. In fact, in the proof we only need pairwise independence, i.e., cx ⊥ ⊥ cx0 for any x 6= x0
(Why?). Now, why should we care about this? In fact, having access to external randomness is
also a lot of resources. It is more desirable to use less randomness in the random coding argument.
Indeed, if we use zero randomness, then it is a deterministic construction, which is the best situation!
Using pairwise independent codebook requires significantly less randomness than complete random
coding which needs |X |k bits. To see this intuitively, note that one can use 2 independent random
bits to generate 3 random bits that is pairwise independent but not mutually independent, e.g.,
{b1 , b2 , b1 ⊕ b2 }. This observation is related to linear compression studied in the next section, where
the codeword we generated are not iid, but related through a linear mapping.

102
Remark 9.4 (AEP for memoryless sources). Consider iid S n . By WLLN,
1 1 P
log −
→H(S). (9.3)
n PS n (S n )

For any δ > 0, define the set


 
1 1
Tnδ = s : log
n
− H(S) ≤δ .

n PS n (sn )
i.i.d. n n
For example: S n ∼ Bern(p), since PS n (sn ) = pw(s ) q n−w(s ) , the typical set corresponds to those
sequences whose Hamming is close to the expectation: Tnδ = {sn ∈ {0, 1}n : w(sn ) ∈ [p ± δ 0 ]n},
where δ 0 is a constant depending on δ.
As a consequence of (9.3),
 
1. P S n ∈ Tnδ → 1 as n → ∞.

2. |Tnδ | ≤ 2(H(S)+δ)n  |S|n .

In other words, S n is concentrated on the set Tnδ which is exponentially smaller than the whole
space. In almost loss compression we can simply encode this set losslessly. Although this is different
than the optimal encoding, Corollary 9.1 indicates that in the large-n limit the optimal compressor
is no better.
The property (9.3) is often referred as the Asymptotic Equipartition Property (AEP), in the
sense that the random vector is concentrated on a set wherein each reliazation is roughly equally
likely up to the exponent. Indeed, Note that for any sn ∈ Tnδ , its likelihood is concentrated around
PS n (sn ) ∈ 2−(H(S)±δ)n , called δ-typical sequences.

103
Next we still consider fixed-blocklength code and study the fundamental limit of error probability
∗ (X, k)for the following coding paradigms:

• Linear Compression

• Compression with Side Information

– side info available at both sides


– side info available only at decompressor
– multi-terminal compressor, single decompressor

9.2 Linear Compression


From Shannon’s theorem:

∗ (X, nR) −→ 0 or 1 R ≶ H(S)

Our goal is to find compressor with structures. The simplest one can think of is probably linear
operation, which is also highly desired for its simplicity (low complexity). But of course, we have to
be on a vector space where we can define linear operations. In this part, we assume X = S n , where
each coordinate takes values in a finite field (Galois Field), i.e., Si ∈ Fq , where q is the cardinality
of Fq . This is only possible if q = pn for some prime p and n ∈ N. So Fq = Fpn .
Definition 9.2 (Galois Field). F is a finite set with operations (+, ·) where

• a + b associative and commutative

• a · b associative and commutative

• 0, 1 ∈ F s.t. 0 + a = 1 · a = a.

• ∀a, ∃ − a, s.t. a + (−a) = 0

• ∀a 6= 0, ∃a−1 , s.t. a−1 a = 1

• distributive: a · (b + c) = (a · b) + (a · c)

Example:

• Fp = Z/pZ, where p is prime

• F4 = {0, 1, x, x + 1} with addition and multiplication as polynomials mod (x2 + x + 1) over


F2 [x].

Linear Compression Problem: x ∈ Fnq , w = Hx where H : Fnq → Fkq is linear represented by a


matrix H ∈ Fk×n
q .
    
w1 h11 . . . h1n x1
 ..   .. .  
..   ... 
 . = . 
wk hk1 . . . hkn xn

Compression is achieved if k ≤ n, i.e., H is a fat matrix. Of course, we have to tolerate some error
(almost lossless). Otherwise, lossless compression is only possible with k ≥ n, which not interesting.

104
Theorem 9.5 (Achievability). Let X ∈ Fnq be a random vector. ∀τ > 0, ∃ linear compressor
H : Fnq → Fkq and decompressor g : Fkq → Fnq ∪ {e}, s.t.
 
1
P [g(HX) 6= X] ≤ P logq > k − τ + q −τ
PX (X)
Remark 9.5. Consider the Hamming space q = 2. In comparison with Shannon’s random coding
achievability, which uses k2n bits to construct a completely random codebook, here for linear codes
we need kn bits to randomly generate the matrix H, and the codebook is a k-dimensional linear
subspace of 6the Hamming space.
Proof. Fix τ . As pointed in the proof of Shannon’s random coding theorem (Theorem 9.4), given
the compressor H, the optimal decompressor is the MAP decoder, i.e., g(w) = argmaxx:Hx=w PX (x),
which outputs the most likely symbol that is compatible with the codeword received. Instead, let us
consider the following (suboptimal) decoder for its ease of analysis:
(
x ∃!x ∈ Fnq : w = Hx, x − h.p.
g(w) =
e otherwise
where we used the short-hand:
1
x − h.p. (high probability) ⇔ logq < k − τ ⇔ PX (x) ≥ q −k+τ .
PX (x)
Note that this decoder is the same as in the proof of Theorem 9.4. The proof is also mostly the
same, except now hash collisions occur under the linear map H. By union bound,
 
1  
P [g(f (X)) = e] ≤ P logq > k − τ + P ∃x0 − h.p. : x0 6= X, Hx0 = HX
PX (x)
  X X
1
(union bound) ≤ P logq >k−τ + PX (x) 1{Hx0 = Hx}
PX (x) x 0 0x −h.p.,x 6=x

Now we use random coding to average the second term over all possible choices of H. Specifically,
choose H as a matrix independent of X where each entry is iid and uniform on Fq . For distinct x0
and x1 , the collision probability is
PH [Hx1 = Hx0 ] = PH [Hx2 = 0] (x2 , x1 − x0 6= 0)
= PH [H1 · x2 = 0] k
(iid rows)
where H1 is the first row of the matrix H, and each row of H is independent. This is the probability
that Hi is in the orthogonal complement of x2 . On Fnq , the orthogonal complement of a given
non-zero vector has cardinality q n−1 . So the probability for the first row to lie in this subspace is
q n−1 /q n = 1/q, hence the collision probability 1/q k . Averaging over H gives
X X
EH 1{Hx0 = Hx} = PH [Hx0 = Hx] = |{x0 : x0 − h.p., x0 6= x}|q −k ≤ q k−τ q −k = q −τ
x0 −h.p.,x0 6=x x0 −h.p.,x0 6=x

Thus the bound holds.


Remark 9.6. 1. Compared to Theorem 9.4, which is obtained by randomizing over all possible
compressors, Theorem 9.5 is obtained by randomizing over only linear compressors, and the
bound we obtained is identical. Therefore restricting on linear compression almost does not
lose anything.

105
2. Note that in this case it is not possible to make all errors detectable.

3. Can we loosen the requirement on Fq to instead be a commutative ring? In general, no, since
zero divisors in the commutative ring ruin the key proof ingredient of low collision probability
in the random hashing. E.g. in Z/6Z
       
1 2
  0     0  
       
P H  .  = 0 = 6−k but P H  .  = 0 = 3−k ,
  ..     ..  
0 0

since 0 · 2 = 3 · 2 = 0 in Z/6Z.

9.3 Compression with side information at both compressor and


decompressor

X {0, 1}k X ∪ {e}


Compressor Decompressor

Definition 9.3 (Compression wih Side Information). Given PXY ,

• f : X × Y → {0, 1}k

• g : {0, 1}k × Y → X ∪ {e}

• P[g(f (X, Y ), Y ) 6= X] < 

• Fundamental Limit: ∗ (X|Y, k) = inf{ : ∃(k, ) − S.I. code}

Note: The side information Y need not be discrete. The source X is, of course, discrete.
Note that conditioned on Y = y, the problem reduces to compression without side information
where the source X is distributed according to PX|Y =y . Since Y is known to both the compressor
and decompressor, they can use the best code tailored for this distribution. Recall ∗ (X, k) defined
in Definition 9.1, the optimal probability of error for compressing X using k bits, which can also be
denoted by ∗ (PX , k). Then we have the following relationship

∗ (X|Y, k) = Ey∼PY [∗ (PX|Y =y , k)],

which allows us to apply various bounds developed before.


Theorem 9.6.
   
1 −τ ∗ 1
P log > k + τ − 2 ≤  (X|Y, k) ≤ P log2 > k − τ + 2−τ , ∀τ > 0
PX|Y (X|Y ) PX|Y (X|Y )

106
i.i.d.
Corollary 9.2. (X, Y ) = (S n , T n ) where the pairs (Si , Ti ) ∼ PST
(
0 R > H(S|T )
lim ∗ (S n |T n , nR) =
n→∞ 1 R < H(S|T )

Proof. Using the converse Theorem 9.2 and achievability Theorem 9.4 (or Theorem 9.3) for com-
pression without side information, we have
   
1 −τ ∗ 1
P log > k + τ Y = y − 2 ≤  (PX|Y =y , k) ≤ P log > k Y = y
PX|Y (X|y) PX|Y (X|y)

By taking the average over all y ∼ PY , we get the theorem. For the corollary
n
1 1 1X 1 P
log n n
= log −
→H(S|T )
n PS n |T n (S |T ) n PS|T (Si |Ti )
i=1

as n → ∞, using the WLLN.

9.4 Slepian-Wolf (Compression with side information at


decompressor only)
Consider the compression with side information problem, except now the compressor has no access
to the side information.

X {0, 1}k X ∪ {e}


Compressor Decompressor

Definition 9.4 (S.W. code). Given PXY ,

• f : X → {0, 1}k

• g : {0, 1}k × Y → X ∪ {e}

• P[g(f (X), Y ) 6= X] ≤ 

• Fundamental Limit: ∗SW (X|Y, k) = inf{ : ∃(k, )-S.W. code}

Now the very surprising result: Even without side information at the compressor, we can still
compress down to the conditional entropy!
Theorem 9.7 (Slepian-Wolf, ’73).
 
∗ 1
 (X|Y, k) ≤ ∗SW (X|Y, k) ≤ P log ≥ k − τ + 2−τ
PX|Y (X|Y )

107
Corollary 9.3.
(
0 R > H(S|T )
lim ∗SW (S n |T n , nR) =
n→∞ 1 R < H(S|T )

Remark 9.7. Definition 9.4 does not include the zero-undected-error condition (that is g(f (x), y) =
x or e). In other words, we allow for the possibility of undetected errors. Indeed, if we require this
condition, the side-information savings will be mostly gone. Indeed, assuming PX,Y (x, y) > 0 for
all (x, y) it is clear that under zero-undetected-error condition, if f (x1 ) = f (x2 ) = c then g(c) = e.
Thus except for c all other elements in {0, 1}k must have unique preimages. Similarly, one can
show that Slepian-Wolf theorem does not hold in the setting of variable-length lossless compression
(i.e. average length is H(X) not H(X|Y ).)

Proof. LHS is obvious, since side information at the compressor and decompressor is better than
only at the decompressor.
For the RHS, first generate a random codebook with iid uniform codewords: C = {cx ∈ {0, 1}k :
x ∈ X } independently of (X, Y ), then define the compressor and decoder as

f (x) = cx
(
x ∃!x : cx = w, x − h.p.|y
g(w, y) =
0 o.w.

where we used the shorthand x − h.p.|y ⇔ log2 P 1(x|y) < k − τ . The error probability of this
X|Y
scheme, as a function of the code book C, is
 
1
E(C) = P log ≥ k − τ or J(X, C|Y ) 6= ∅
PX|Y (X|Y )
 
1
≤ P log ≥ k − τ + P [J(X, C|Y ) 6= ∅]
PX|Y (X|Y )
  X
1
= P log ≥k−τ + PXY (x, y)1{J(x,C|y)6=∅} .
PX|Y (X|Y ) x,y

where J(x, C|y) , {x0 6= x : x0 − h.p.|y, cx = cx0 }.


Now averaging over C and applying the union bound: use |{x0 : x0 − h.p.|y}| ≤ 2k−τ and
P[cx0 = cx ] = 2−k for any x 6= x0 ,
 
X
PC [J(x, C|y) 6= ∅] ≤ EC  1{x0 −h.p.|y} 1{cx0 =cx } 
x0 6=x

= 2k−τ P[cx0 = cx ]
= 2−τ

Hence the theorem follows as usual from two terms in the union bound.

9.5 Multi-terminal Slepian Wolf


Distributed compression: Two sources are correlated. Compress individually, decompress jointly.
What are those rate pairs that guarantee successful reconstruction?

108
X {0, 1}k1
Compressor f1

Decompressor g
(X̂, Ŷ )

Y {0, 1}k2
Compressor f2

Definition 9.5. Given PXY ,

• (f1 , f2 , g) is (k1 , k2 , )-code if f1 : X → {0, 1}k1 , f2 : Y → {0, 1}k2 , g : {0, 1}k1 × {0, 1}k2 →
X × Y, s.t. P[(X̂, Ŷ ) 6= (X, Y )] ≤ , where (X̂, Ŷ ) = g(f1 (X), f2 (Y )).

• Fundamental limit: ∗SW (X, Y, k1 , k2 ) = inf{ : ∃(k1 , k2 , )-code}.

Theorem 9.8. (X, Y ) = (S n , T n ) - iid pairs


(
0 (R1 , R2 ) ∈ int(RSW )
lim ∗ (S n , T n , nR1 , nR2 ) =
n→∞ SW 1 (R1 , R2 ) 6∈ RSW

where RSW denotes the Slepian-Wolf rate region




 a ≥ H(S|T )
RSW = (a, b) : b ≥ H(T |S)


a + b ≥ H(S, T )

Note: The rate region RSW typically looks like:


R2

Achievable
H(T )
Region

H(T |S)
R1
H(S|T ) H(S)

Since H(T ) − H(T |S) = H(S) − H(S|T ) = I(S; T ), the slope is −1.

Proof. Converse: Take (R1 , R2 ) 6∈ RSW . Then one of three cases must occur:

1. R1 < H(S|T ). Then even if encoder and decoder had full T n , still can’t achieve this (from
compression with side info result – Corollary 9.2).

2. R2 < H(T |S) (same).

3. R1 + R2 < H(S, T ). Can’t compress below the joint entropy of the pair (S, T ).

109
Achievability: First note that we can achieve the two corner points. The point (H(S), H(T |S))
can be approached by almost lossless compressing S at entropy and compressing T with side infor-
mation S at the decoder. To make this rigorous, let k1 = n(H(S) + δ) and k2 = n(H(T |S) + δ). By
Corollary 9.1, there exist f1 : S n → {0, 1}k1 and g1 : {0, 1}k1 → S n s.t. P [g1 (f1 (S n )) 6= S n ] ≤
n → 0. By Theorem 9.7, there exist f2 : T n → {0, 1}k2 and g2 : {0, 1}k1 × S n → T n
s.t. P [g2 (f2 (T n ), S n ) 6= T n ] ≤ n → 0. Now that S n is not available, feed the S.W. decompressor
with g(f (S n )) and define the joint decompressor by g(w1 , w2 ) = (g1 (w1 ), g2 (w2 , g1 (w1 ))) (see below):

Sn Ŝ n
f1 g1

Tn T̂ n
f2 g2

Apply union bound:

P [g(f1 (S n ), f2 (T n )) 6= (S n , T n )]
= P [g1 (f1 (S n )) 6= S n ] + P [g2 (f2 (T n ), g(f1 (S n ))) 6= T n , g1 (f1 (S n )) = S n ]
≤ P [g1 (f1 (S n )) 6= S n ] + P [g2 (f2 (T n ), S n ) 6= T n ]
≤ 2n → 0.

Similarly, the point (H(S), H(T |S)) can be approached.


To achieve other points in the region, use the idea of time sharing: If you can achieve with
vanishing error probability any two points (R1 , R2 ) and (R10 , R20 ), then you can achieve for λ ∈ [0, 1],
(λR1 + λ̄R10 , λR2 + λ̄R20 ) by dividing the block of length n into two blocks of length λn and λ̄n and
apply the two codes respectively
 
λnR1
(S1λn , T1λn ) → using (R1 , R2 ) code
λnR2
 
λ̄nR10
n n
(Sλn+1 , Tλn+1 ) → using (R10 , R20 ) code
λ̄nR20

(Exercise: Write down the details rigorously yourself!) Therefore, all convex combinations of points
in the achievable regions are also achievable, so the achievable region must be convex.

9.6* Source-coding with a helper (Ahlswede-Körner-Wyner)


Yet another variation of distributed compression problem is compressing X with a helper, see
figure below. Note that the main difference from the previous section is that decompressor is only
required to produce the estimate of X, using rate-limited help from an observer who has access to
Y . Characterization of rate pairs R1 , R2 is harder than in the previous section.
Theorem 9.9 (Ahlswede-Körner-Wyner). Consider i.i.d. source (X n , Y n ) ∼ PX,Y with X discrete.
If rate pair (R1 , R2 ) is achievable with vanishing probability of error P[X̂ n 6= X n ] → 0, then there
exists an auxiliary random variable U taking values on alphabet of cardinality |Y| + 1 such that
PX,Y,U = PX,Y PU |X,Y and
R1 ≥ H(X|U ), R2 ≥ I(Y ; U ) . (9.4)

110
X {0, 1}k1
Compressor f1

Decompressor g

Y {0, 1}k2
Compressor f2

Furthermore, for every such random variable U the rate pair (H(X|U ), I(Y ; U )) is achievable with
vanishing error.

Proof. We only sketch some crucial details.


First, note that iterating over all possible random variables U (without cardinality constraint)
the set of pairs (R1 , R2 ) satisfying (9.4) is convex. Next, consider a compressor W1 = f1 (X n ) and
W2 = f2 (Y n ). Then from Fano’s inequality (5.7) assuming P[X n 6= X̂ n ] = o(1) we have

H(X n |W1 , W2 )) = o(n) .

Thus, from chain rule and conditioning-decreases-entropy, we get

nR1 ≥ I(X n ; W1 |W2 ) ≥ H(X n |W2 ) − o(n) (9.5)


Xn
= H(Xk |W2 , X k−1 ) − o(n) (9.6)
k=1
Xn
≥ H(Xk | W2 , X k−1 , Y k−1 ) − o(n) (9.7)
| {z }
k=1
,Uk

On the other hand, from (5.2) we have


n
X
n
nR2 ≥ I(W2 ; Y ) = I(W2 ; Yk |Y k−1 ) (9.8)
k=1
Xn
= I(W2 , X k−1 ; Yk |Y k−1 ) (9.9)
k=1
Xn
= I(W2 , X k−1 , Y k−1 ; Yk ) (9.10)
k=1

where (9.9) follows from I(W2 , X k−1 ; Yk |Y k−1 ) = I(W2 ; Yk |Y k−1 ) + I(X k−1 ; Yk |W2 , Y k−1 ) and the
⊥ X k−1 |Y k−1 ; and (9.10) from Y k−1 ⊥
fact that (W2 , Yk ) ⊥ ⊥ Yk . Comparing (9.7) and (9.10) we
notice that denoting Uk = (W2 , X k−1 , Y k−1 ) we have
n
1X
(R1 , R2 ) ≥ (H(Xk |Uk ), I(Uk ; Yk ))
n
k=1

and thus (from convexity) the rate pair must belong to the region spanned by all pairs (H(X|U ), I(U ; Y )).

111
To show that without loss of generality the auxiliary random variable U can be taken to be
|Y| + 1 valued, one needs to invoke Caratheodory’s theorem on convex hulls. We omit the details.
Finally, showing that for each U the mentioned rate-pair is achievable, we first notice that if there
were side information at the decompressor in the form of the i.i.d. sequence U n correlated to X n ,
then Slepian-Wolf theorem implies that only rate R1 = H(X|U ) would be sufficient to reconstruct
X n . Thus, the question boils down to creating a correlated sequence U n at the decompressor by
using the minimal rate R2 . This is the content of the so called covering lemma, see Theorem 26.5
below: It is sufficient to use rate I(U ; Y ) to do so. We omit further details.

112
§ 10. Compressing stationary ergodic sources

We have studyig the compression of i.i.d. sequence {Si }, for which


1 ∗ n P
l(f (S ))−
→H(S) (10.1)
n 
0 R > H(S)
lim ∗ (S n , nR) = (10.2)
n→∞ 1 R < H(S)

In this lecture, we shall examine similar results for ergodic processes and we first state the main
theory as follows:
Theorem 10.1 (Shannon-McMillan). Let {S1 , S2 , . . . } be a stationary and ergodic discrete process,
then
1 1 P
log −
→H, also a.s. and in L1 (10.3)
n PS n (S n )

where H = limn→∞ n1 H(S n ) is the entropy rate.


Corollary 10.1. For any stationary and ergodic discrete process {S1 , S2 , . . . }, (10.1) – (10.2) hold
with H(S) replaced by H.

Proof. Shannon-McMillan (we only need convergence in probability) + Theorem 8.4 + Theorem 9.1
which tie together the respective CDF of the random variable l(f ∗ (S n )) and log PSn1(sn ) .

In Lecture 9 we learned the asymptotic equipartition property (AEP) for iid sources. Here we
generalize it to stationary ergodic sources thanks to Shannon-McMillan.
Corollary 10.2 (AEP for stationary ergodic sources). Let {S1 , S2 , . . . } be a stationary and ergodic
discrete process. For any δ > 0, define the set
 

n 1 1
δ
Tn = s : log
− H ≤ δ .
n PS n (sn )

Then
 
1. P S n ∈ Tnδ → 1 as n → ∞.

2. 2n(H−δ) (1 + o(1)) ≤ |Tnδ | ≤ 2(H+δ)n (1 + o(1)).

Note:

• Convergence in probability for stationary ergodic Markov chains [Shannon 1948]

• Convergence in L1 for stationary ergodic processes [McMillan 1953]

113
• Convergence almost surely for stationary ergodic processes [Breiman 1956] (Either of the last
two results implies the convergence Theorem 10.1 in probability.)
• For a Markov chain, existence of typical sequences can be understood by thinking of Markov
process as sequence of independent decisions regarding which transitions to take. It is then
clear that Markov process’s trajectory is simply a transformation of trajectories of an i.i.d.
process, hence must similarly concentrate similarly on some typical set.

10.1 Bits of ergodic theory


Let’s start with a dynamic system view and introduce a few definitions:
Definition 10.1 (Measure preserving transformation). τ : Ω → Ω is measure preserving (more
precisely, probability preserving) if
∀E ∈ F, P (E) = P (τ −1 E).
The set E is called τ -invariant if E = τ −1 E. The set of all τ -invariant sets forms a σ-algrebra
(check!) denoted Finv .
Definition 10.2 (stationary process). A process {Sn , n = 0, . . .} is stationary if there exists a
measure preserving transformation τ : Ω → Ω such that:
Sj = Sj−1 ◦ τ = S0 ◦ τ j
Therefore a stationary process can be described by the tuple (Ω, F, P, τ, S0 ) and Sk = S0 ◦ τ k .
Notes:
1. Alternatively, a random process (S0 , S1 , S2 , . . . ) is stationary if its joint distribution is invariant
with respect to shifts in time, i.e., PSnm = PS m+t , ∀n, m, t. Indeed, given such a process we
n+t
can define a m.p.t. as follows:
τ
(s0 , s1 , . . . ) −
→ (s1 , s2 , . . . ) (10.4)
So τ is a shift to the right.
2. An event E ∈ F is shift-invariant if
(s1 , s2 , . . . ) ∈ E ⇒ (s0 , s1 , s2 , . . . ) ∈ E, s0
or equivalently E = τ −1 E (check!). Thus τ -invariant events are also called shift-invariant,
when τ is interpreted as (10.4).
3. Some examples of shift-invariant events are {∃n : xi = 0∀i ≥ n}, {lim sup xi < 1} etc. A non
shift-invariant event is A = {x0 = x1 = · · · = 0}, since τ (1, 0, 0, . . .) ∈ A but (1, 0, . . .) 6∈ A.
4. Also recall that the tail σ-algebra is defined as
\
Ftail , σ{Sn , Sn+1 , . . .} .
n≥1

It is easy to check that all shift-invariant events belong to Ftail . The inclusion is strict, as for
example the event
{∃n : xi = 0, ∀ odd i ≥ n}
is in Ftail but not shift-invariant.

114
Proposition 10.1 (Poincare recurrence). Let τ be measure-preserving for (Ω, F, P). Then for any
measurable A with P[A] > 0 we have
[
P[ τ −k A|A] = P[τ k (ω) ∈ A occurs infinitely often|A] = 1 .
k≥1
S
Proof. Let B = k≥1 τ
−k A. It is sufficient to show that P[A ∩ B] = P[A] or equivalently
P[A ∪ B] = P[B] . (10.5)
To that end notice that τ −1 A ∪ τ −1 B = B and thus
P[τ −1 (A ∪ B)] = P[B] ,
but the left-hand side equals P[A ∪ B] by the measure-preservation of τ , proving (10.5).
Note: Consider τ mapping initial state of the conservative (Hamiltonian) mechanical system to its
state after passage of a given unit of time. It is known that τ preserves Lebesgue measure in phase
space (Liouville’s theorem). Thus Poincare recurrence leads to rather counter-intuitive conclusions.
For example, opening the barrier separating two gases in a cylinder allows them to mix. Poincare
recurrence says that eventually they will return back to the original separated state (with each gas
occupying roughly its half of the cylinder).
Definition 10.3 (Ergodicity). A transformation τ is ergodic if ∀E ∈ Finv we have P[E] = 0 or 1.
A process {Si } is ergodic if all shift invariant events are deterministic, i.e., for any shift invariant
event E, P [S1∞ ∈ E] = 0 or 1.
Example:
• {Sk = k 2 }: ergodic but not stationary
• {Sk = S0 }: stationary but not ergodic (unless S0 is a constant). Note that the singleton set
E = {(s, s, . . .)} is shift invariant and P [S1∞ ∈ E] = P [S0 = s] ∈ (0, 1) – not deterministic.
• {Sk } i.i.d. is stationary and ergodic (by Kolmogorov’s 0-1 law, tail events have no randomness)
• (Sliding-window construction of ergodic processes)
If {Si } is ergodic, then {Xi = f (Si , Si+1 , . . . )} is also ergodic. It is called a B-process if Si
is i.i.d. P∞ −n−1
Example, Si ∼ Bern( 12 ) i.i.d., Xk = n=0 2 Sk+n = 2Xk−1 mod 1. The marginal
distribution of Xi is uniform on [0, 1]. Note that Xk ’s behavior is completely deterministic:
given X0 , all the future Xk ’s are determined exactly. This example shows that certain
deterministic maps exhibit ergodic/chaotic behavior under iterative application: although
the trajectory is completely deterministic, its time-averages converge to expectations and in
general “look random”.
• There are also stronger conditions than ergodicity. Namely, we say that τ is mixing (or strong
mixing) if
P[A ∩ τ −n B] → P[A]P[B] .
We say that τ is weakly mixing if
n
X 1
P[A ∩ τ −n B] − P[A]P[B] → 0 .
n
k=1

Strong mixing implies weak mixing, which implies ergodicity (check!).

115
• {Si }: finite irreducible Markov chain with recurrent states is ergodic (in fact strong mixing),
regardless of initial distribution.
Toy example: kernel P (0|1) = P (1|0) = 1 with initial dist. P (S0 = 0) = 0.5. This process only
has two sample paths: P [S1∞ = (010101 . . .)] = P [S1∞ = (101010 . . .)] = 12 . It is easy to verify
this process is ergodic (in the sense defined above!). Note however, that in Markov-chain
literature a chain is called ergodic if it is irreducible, aperiodic and recurrent. This example
does not satisfy this definition (this clash of terminology is a frequent source of confusion).

• (optional) {Si }: stationary zero-mean Gaussian process with autocovariance function R(n) =
E[S0 Sn∗ ].
n
1 X
lim R[t] = 0 ⇔ {Si } ergodic ⇔ {Si } weakly mixing
n→∞ n + 1
t=0
lim R[n] = 0 ⇔ {Si } mixing
n→∞

Intuitively speaking, an ergodic process can have infinite memory in general, but the memory
is weak. Indeed, we see that for a stationary Gaussian process ergodicity means the correlation
dies (in the Cesaro-mean sense).
The spectral measure is defined as the (discrete time) Fourier transform of the autocovariance
sequence {R(n)}, in the sense that there exists a unique probability measure µ on [− 12 , 12 ] such
that R(n) = E exp(i2nπX) where X ∼ µ. The spectral criteria can be formulated as follows:

{Si } ergodic ⇔ spectral measure has no atoms (CDF is continuous)


{Si } B-process ⇔ spectral measure has density

Detailed exposition on stationary Gaussian processes can be found in [Doo53, Theorem 9.3.2,
pp. 474, Theorem 9.7.1, pp. 493–494].1

10.2 Proof of Shannon-McMillan


We shall show the convergence in L1 , which implies convergence in probability automatically. In
order to prove Shannon-McMillan, let’s first introduce the Birkhoff-Khintchine’s convergence theorem
for ergodic processes, the proof of which is presented in the next subsection.
Theorem 10.2 (Birkhoff-Khintchine’s Ergodic Theorem). If {Si } stationary and ergodic, ∀ function
f ∈ L1 , i.e., E |f (S1 , . . . )| < ∞,
n
1X
lim f (Sk , . . . ) = E f (S1 , . . . ). a.s. and in L1
n→∞ n
k=1

In the special case where f depends on finitely many coordinates, say, f = f (S1 , . . . , Sm ), we have
n
1X
lim f (Sk , . . . , Sk+m−1 ) = E f (S1 , . . . , Sm ). a.s. and in L1
n→∞ n
k=1

1
Thanks Prof. Bruce Hajek for the pointer.

116
Interpretation: time average converges to ensemble average.
Example: Consider f = f (S1 )
• {Si } is iid. Then Theorem 10.2 is SLLN (strong LLN).
• {Si } is such that Si = S1 for all i – non-ergodic. Then Theorem 10.2 fails unless S1 is a
constant.
Definition 10.4. {Si : i ∈ N} is an mth order Markov chain if PSt+1 |S1t = PSt+1 |St−m+1
t for all t ≥ m.
It is called time homogeneous if PSt+1 |St−m+1
t = PSm+1 |S1m .

Remark 10.1. Showing (10.3) for an mth order time homogeneous Markov chain {Si } is a direct
application of Birkhoff-Khintchine.
n
1 1 1X 1
log n
= log
n PS (S )
n n PSt |S t−1 (St |S t−1 )
t=1
n
1 1 1 X 1
= log m
+ log l−1
n PS (S ) n
m PSt |S t−1 (Sl |Sl−m )
t=m+1 t−m
n
1 1 1 X 1
= log m + log t−1 , (10.6)
n PS1 (S1 ) n PSm+1 |S1m (St |St−m )
| {z } | t=m+1
{z }
→0
→H(Sm+1 |S1m ) by Birkhoff-Khintchine
1
where we applied Theorem 10.2 with f (s1 , s2 , . . .) = log P m (sm+1 |sm
.
Sm+1 |S1 1 )

Now let’s prove (10.3) for a general stationary ergodic process {Si } which might have infinite
memory. The idea is to approximate the distribution of that ergodic process by an m-th order
MC (finite memory) and make use of (10.6); then let m → ∞ to make the approximation accurate
(Markov approximation).
Proof of Theorem 10.1 in L1 . To show that (10.3) converges in L1 , we want to show that
1 1

E log n
− H → 0, n → ∞.
n PS n (S )
To this end, fix an m ∈ N. Define the following auxiliary distribution for the process:

Y
Q (m)
(S1∞ ) =P S1m (S1m ) t−1
PSt |S t−1 (St |St−m )
t−m
t=m+1

Y
t−1
stat.
= PS1m (S1m ) PSm+1 |S1m (St |St−m )
t=m+1

Note that under Q(m) , {Si } is an mth -order time-homogeneous Markov chain.
By triangle inequality,
1 1 1 1 1 1

E log − H ≤E log − log
n PS n (S n ) n PS n (S n ) n (m) n
QS n (S )
| {z }
,A
1 1

+ E log (m) − Hm + |Hm − H|
n n
QS n (S ) | {z }
| {z } ,C
,B

117
where Hm , H(Sm+1 |S1m ).
Now

• C = |Hm − H| → 0 as m → ∞ by Theorem 5.4 (Recall that for stationary processes:


H(Sm+1 |S1m ) → H from above).

• As shown in Remark 10.1, for any fixed m, B → 0 in L1 as n → ∞, as a consequence of


Birkhoff-Khintchine. Hence for any fixed m, EB → 0 as n → ∞.

• For term A,
1 dPS n 1 (m) 2 log e
E[A] = EP log (m)
≤ D(P S n kQ n ) +
S
n dQ n n en
S

where
" #
1 (m) 1 PS n (S n )
D(PS n kQS n ) = E log Q
n n PS m (S m ) nt=m+1 PSm+1 |S 1 (St |S t−1 )
m t−m

stat. 1 (−H(S n ) + H(S m ) + (n − m)H )


= m
n
→ Hm − H as n → ∞

and the next Lemma 10.1.

Combining all three terms and sending n → ∞, we obtain for any m,


1 1

lim sup E log − H ≤ 2(Hm − H).
n→∞ n PS n (S n )
Sending m → ∞ completes the proof of L1 -convergence.

Lemma 10.1.  
dP
EP log ≤ D(P kQ) + 2 log e .
dQ e

Proof. |x log x| − x log x ≤ 2 log e


e , ∀x > 0, since LHS is zero if x ≥ 1, and otherwise upper bounded
1 2 log e
by 2 sup0≤x≤1 x log x = e .

10.3* Proof of Birkhoff-Khintchine


Proof of Theorem 10.2. ∀ function f˜ ∈ L1 , ∀, there exists a decomposition f˜ = f + h such that f
is bounded, and h ∈ L1 , khk1 ≤ .
Let us first focus on the bounded function f . Note that in the bounded domain L1 ⊂ L2 , thus
f ∈ L2 . Furthermore, L2 is a Hilbert space with inner product (f, g) = E[f (S1∞ )g(S1∞ )].
For the measure preserving transformation τ that generates the stationary process {Si }, define
the operator T (f ) = f ◦ τ . Since τ is measure preserving, we know that kT f k22 = kf k22 , thus T is a
unitary and bounded operator.
Define the operator
n
1X
An (f ) = f ◦ τk
n
k=1

118
Intuitively:
n
1X k 1
An = T = (I − T n )(I − T )−1
n n
k=1
Then, if f ⊥ ker(I − T ) we should have An f → 0, since only components in the kernel can blow up.
This intuition is formalized in the proof below.
Let’s further decompose f into two parts f = f1 +f2 , where f1 ∈ ker(I −T ) and f2 ∈ ker(I −T )⊥ .
Observations:
• if g ∈ ker(I − T ), g must be a constant function. This is due to the ergodicity. Consider
indicator function 1A , if 1A = 1A ◦ τ = 1τ −1 A , then P[A] = 0 or 1. For a general case, suppose
g = T g and g is not constant, then at least some set {g ∈ (a, b)} will be shift-invariant and
have non-trivial measure, violating ergodicity.
• ker(I − T ) = ker(I − T ∗ ). This is due to the fact that T is unitary:

g = T g ⇒ kgk2 = (T g, g) = (g, T ∗ g) ⇒ (T ∗ g, g) = kgkkT ∗ gk ⇒ T ∗ g = g

where in the last step we used the fact that Cauchy-Schwarz (f, g) ≤ kf k · kgk only holds with
equality for g = cf for some constant c.
• ker(I − T )⊥ = ker(I − T ∗ )⊥ = [Im(I − T )], where [Im(I − T )] is an L2 closure.
• g ∈ ker(I − T )⊥ ⇐⇒ E[g] = 0. Indeed, only zero-mean functions are orthogonal to constants.
With these observations, we know that f1 = m is a const. Also, f2 ∈ [Im(I − T )] so we further
approximate it by f2 = f0 + h1 , where f0 ∈ Im(I − T ), namely f0 = g − g ◦ τ for some function
g ∈ L2 , and kh1 k1 ≤ kh1 k2 < . Therefore we have

An f1 = f1 = E[f ]
1
An f0 = (g − g ◦ τ n ) → 0 a.s. and L1
n
P n P 1 a.s.
since E[ n≥1 ( g◦τ
n ) ] = E[g ]
2 2
n2
< ∞ =⇒ n1 g ◦ τ n −−→0.
The proof is completed by showing
 
2
P lim sup An (h + h1 ) ≥ δ ≤ . (10.7)
n δ
Indeed, then by taking  → 0 we will have shown
 
P lim sup An (f ) ≥ E[f ] + δ = 0
n

as required.
Proof of (10.7) makes use of the Maximal Ergodic Lemma stated as follows:
Theorem 10.3 (Maximal Ergodic Lemma). Let (P, τ ) be a probability measure and a measure-
preserving transformation. Then for any f ∈ L1 (P) we have
  E[f 1{sup
n≥1 An f >a} ] kf k1
P sup An f > a ≤ ≤
n≥1 a a
1 Pn−1
where An f = n k=0 f ◦ τ k.

119
Note: This is a so-called “weak L1 ” estimate for a sublinear operator supn An (·). In fact, this
theorem is exactly equivalent to the following result:
Lemma 10.2 (Estimate for the maximum of averages). Let {Zn , n = 1, . . .} be a stationary process
with E[|Z|] < ∞ then
 
|Z1 + . . . + Zn | E[|Z|]
P sup >a ≤ ∀a > 0
n≥1 n a

Proof. The argument for this Lemma has originally been quite involved, until a dramatically simple
proof (below) was found by A. Garcia.
Define
n
X
Sn = Zk (10.8)
k=1
Ln = max{0, Z1 , . . . , Z1 + · · · + Zn } (10.9)
Mn = max{0, Z2 , Z2 + Z3 , . . . , Z2 + · · · + Zn } (10.10)
Sn
Z ∗ = sup (10.11)
n≥1 n

It is sufficient to show that


E[Z1 1{Z ∗ >0} ] ≥ 0 . (10.12)
Indeed, applying (10.12) to Z̃1 = Z1 − a and noticing that Z̃ ∗ = Z ∗ − a we obtain

E[Z1 1{Z ∗ >a} ] ≥ aP[Z ∗ > a] ,

from which Lemma follows by upper-bounding the left-hand side with E[|Z1 |].
In order to show (10.12) we first notice that {Ln > 0} % {Z ∗ > 0}. Next we notice that

Z1 + Mn = max{S1 , . . . , Sn }

and furthermore
Z1 + Mn = Ln on {Ln > 0}
Thus, we have
Z1 1{Ln >0} = Ln − Mn 1{Ln >0}
where we do not need indicator in the first term since Ln = 0 on {Ln > 0}c . Taking expectation we
get

E[Z1 1{Ln >0} ] = E[Ln ] − E[Mn 1{Ln >0} ] (10.13)


≥ E[Ln ] − E[Mn ] (10.14)
= E[Ln ] − E[Ln−1 ] = E[Ln − Ln−1 ] ≥ 0 , (10.15)

where we used Mn ≥ 0, the fact that Mn has the same distribution as Ln−1 , and Ln ≥ Ln−1 ,
respectively. Taking limit as n → ∞ in (10.15) we obtain (10.12).

120
10.4* Sinai’s generator theorem
It turns out there is a way to associate to every probability-preserving transformation (p.p.t.) τ
a number, called Kolmogorov-Sinai entropy. This number is invariant to isomorphisms of p.p.t.’s
(appropriately defined).
Definition 10.5. Fix a probability-preserving transformation τ acting on probability space (Ω, F, P).
Kolmogorov-Sinai entropy of τ is defined as
1
H(τ ) , sup lim H(X0 , X0 ◦ τ, . . . , X0 ◦ τ n−1 ) ,
X0 n→∞ n

where supremum is taken over all finitely-valued random variables X0 : Ω → X and measurable
with respect to F.
Note that every random variable X0 generates a stationary process adapted to τ , that is

Xk , X0 ◦ τ k .

In this way, Kolmogorov-Sinai entropy of τ equals the maximal entropy rate among all stationary
processes adapted to τ . This quantity may be extremely hard to evaluate, however. One help comes
in the form of the famous criterion of Y. Sinai. We need to elaborate on some more concepts before:

• σ-algebra G ⊂ F is P-dense in F, or sometimes we also say G = F mod P or even G = F


mod 0, if for every E ∈ F there exists E 0 ∈ G s.t.

P[E∆E 0 ] = 0 .

• Partition A = {Ai , i = 1, 2, . . .} measurable with respect to F is called generating if



_
σ{τ −n A} = F mod P .
n=0

• Random variable Y : Ω → Y with a countable alphabet Y is called a generator of (Ω, F, P, τ ) if

σ{Y, Y ◦ τ, . . . , Y ◦ τ n , . . .} = F mod P

Theorem 10.4 (Sinai’s generator theorem). Let Y be the generator of a p.p.t. (Ω, F, P, τ ). Let
H(Y) be the entropy rate of the process Y = {Yk = Y ◦ τ k , k = 0, . . .}. If H(Y) is finite, then
H(τ ) = H(Y).

Proof. Notice that since H(Y) is finite, we must have H(Y0n ) < ∞ and thus H(Y ) < ∞. First, we
argue that H(τ ) ≥ H(Y). If Y has finite alphabet, then it is simply from the definition. Otherwise
let Y be Z+ -valued. Define a truncated version Ỹm = min(Y, m), then since Ỹm → Y as m → ∞ we
have from lower semicontinuity of mutual information, cf. (3.13), that

lim I(Y ; Ỹm ) ≥ H(Y ) ,


m→∞

and consequently for arbitrarily small  and sufficiently large m

H(Y |Ỹ ) ≤  ,

121
Then, consider the chain

H(Y0n ) = H(Ỹ0n , Y0n ) = H(Ỹ0n ) + H(Y0n |Ỹ0n )


X n
= H(Ỹ0n ) + H(Yi |Ỹ0n , Y0i−1 )
i=0
n
X
≤ H(Ỹ0n ) + H(Yi |Ỹi )
i=0
= H(Ỹ0n ) + nH(Y |Ỹ ) ≤ H(Ỹ0n ) + n

Thus, entropy rate of Ỹ (which has finite-alphabet) can be made arbitrarily close to the entropy
rate of Y, concluding that H(τ ) ≥ H(Y).
The main part is showing that for any stationary process X adapted to τ the entropy rate is
upper bounded by H(Y). To that end, consider X : Ω → X with finite X and define as usual the
process X = {X ◦ τ k , k = 0, 1, . . .}. By generating property of Y we have that X (perhaps after
modification on a set of measure zero) is a function of Y0∞ . So are all Xk . Thus

H(X0 ) = I(X0 ; Y0∞ ) = lim I(X0 ; Y0n ) ,


n→∞

where we used the continuity-in-σ-algebra property of mutual information, cf. (3.14). Rewriting the
latter limit differently, we have
lim H(X0 |Y0n ) = 0 .
n→∞

Fix  > 0 and choose m so that H(X0 |Y0m ) ≤ . Then consider the following chain:

H(X0n ) ≤ H(X0n , Y0n ) = H(Y0n ) + H(X0n |Y0n )


X n
n
≤ H(Y0 ) + H(Xi |Yin )
i=0
n
X
= H(Y0n ) + H(X0 |Y0n−i )
i=0
≤ H(Y0n ) + m log |X | + (n − m) ,

where we used stationarity of (Xk , Yk ) and the fact that H(X0 |Y0n−i ) <  for i ≤ n − m. After
dividing by n and passing to the limit our argument implies

H(X) ≤ H(Y) +  .

Taking here  → 0 completes the proof.


Alternative proof: Suppose X0 is taking values on a finite alphabet X and X0 = f (Y0∞ ). Then
(this is a measure-theoretic fact) for every  > 0 there exists m = m() and a function f : Y m+1 → X
s.t.
P[f (Y0∞ ) 6= f (Y0m )] ≤  .
S
(This is just another way to say that n σ{Y0n } is P-dense in σ(Y0∞ ).) Define a stationary process
X̃ as
X̃j , f (Yjm+j ) .

122
Notice that since X̃0n is a function of Y0n+m we have
H(X̃0n ) ≤ H(Y0n+m ) .
Dividing by m and passing to the limit we obtain that for entropy rates
H(X̃) ≤ H(Y) .
Finally, to relate X̃ to X notice that by construction
P[X̃j 6= Xj ] ≤  .
Since both processes take values on a fixed finite alphabet, from Corollary 5.2 we infer that
|H(X) − H(X̃)| ≤  log |X | + h() .
Altogether, we have shown that
H(X) ≤ H(Y) +  log |X | + h() .
Taking  → 0 we conclude the proof.
Examples:
• Let Ω = [0, 1], F–Borel σ-algebra, P = Leb and
(
2ω, ω < 1/2
τ (ω) = 2ω mod 1 =
2ω − 1, ω ≥ 1/2
It is easy to show that Y (ω) = 1{ω < 1/2} is a generator and that Y is an i.i.d. Bernoulli(1/2)
process. Thus, we get that Kolmogorov-Sinai entropy is H(τ ) = log 2.
• Let Ω be the unit circle S1 , F – Borel σ-algebra, P be the normalized length and
τ (ω) = ω + γ
γ
i.e. τ is a rotation by the angle γ. (When 2π is irrational, this is known to be an ergodic
p.p.t.). Here Y = 1{|ω| < 2π} is a generator for arbitrarily small  and hence
H(τ ) ≤ H(X) ≤ H(Y0 ) = h() → 0 as  → 0 .
This is an example of a zero-entropy p.p.t.
Remark 10.2. Two p.p.t.’s (Ω1 , τ1 , P1 ) and (Ω0 , τ0 , P0 ) are called isomorphic if there exists fi :
Ωi → Ω1−i defined Pi -almost everywhere and such that 1) τ1−i ◦ fi = f1−i ◦ τi ; 2) fi ◦ f1−i is identity
−1
on Ωi (a.e.); 3) Pi [f1−i E] = P1−i [E]. It is easy to see that Kolmogorov-Sinai entropies of isomorphic
p.p.t.s are equal. This observation was made by Kolmogorov in 1958. It was revoluationary, since it
allowed to show that p.p.t.s corresponding shifts of iid Bern(1/2) and iid Bern(1/3) procceses are
not isomorphic. Before, the only invariants known were those obtained from studying the spectrum
of a unitary operator
Uτ : L2 (Ω, P) → L2 (Ω, P) (10.16)
φ(x) 7→ φ(τ (x)) . (10.17)
However, the spectrum of τ corresponding to any non-constant i.i.d. process consists of the entire
unit circle, and thus is unable to distinguish Bern(1/2) from Bern(1/3).2
2
To see thePstatement about the spectrum, let Xi be iid with zero mean and unit variance. Then consider φ(x∞
1 )
defined as m m
√1
k=1 e
iωk
xk . This φ has unit energy and as m → ∞ we have kUτ φ − eiω φkL2 → 0. Hence every eiω
belongs to the spectrum of Uτ .

123
§ 11. Universal compression

In this lecture we will discuss how to produce compression schemes that do not require apriori
knowledge of the distribution. Here, compressor is a map X n → {0, 1}∗ . Now, however, there is no
one fixed probability distribution PX n on X n . The plan for this lecture is as follows:

1. We will start by discussing the earliest example of a universal compression algorithm (of
Fitingof). It does not talk about probability distributions at all. However, it turns out to be
asymptotically optimal simulatenously for all i.i.d. distributions and with small modifications
for all finite-order Markov chains.

2. Next class of universal compressors is based on assuming that a the true distribution PX n
belongs to a given class. These methods proceed by choosing a good model distribution QX n
serving as the minimax approximation to each distribution in the class. The compression
algorithm for a single distribution QX n is then designed as in previous chapters.

3. Finally, an entirely different idea are algorithms of Lempel-Ziv type. These automatically
adapt to the distribution of the source, without any prior assumptions required.

Throughout this section instead of describing each compression algorithm, we will merely specify
some distribution QX n and apply one of the following constructions:

• Sort all xn in the order of decreasing QX n (xn ) and assign values from {0, 1}∗ as in Theorem 8.1,
this compressor has lengths satisfying
1
`(f (xn )) ≤ log .
QX n (xn )

• Set lengths to be
1
`(f (xn )) , dlog e
QX n (xn )
and apply Kraft’s inequality Theorem 8.5 to construct a prefix code.

• Use arithmetic coding (see next section).

The important conclusion is that in all these cases we have


1
`(f (xn )) ≤ log + const ,
QX n (xn )

and in this way we may and will always replace lengths with log QX n1(xn ) . In this way, the only job
of a universal compression algorithm is to specify QX n .

124
Remark 11.1. Furthermore, if we only restrict attention to prefix codes, then any code f : X n →
n
{0, 1}∗ defines a distribution QX n (xn ) = 2−`(f (x )) (we assume the code’s tree is full). In this
way, for prefix-free codes results on redundancy, stated in terms of optimizing the choice of QX n ,
imply tight converses too. For one-shot codes without prefix constraints the optimal answers
are slightly different, however. (For example, the optimal universal code for all i.i.d. sources
satisfies E[`(f (X n ))] ≈ H(X n ) + |X 2|−3 log n in contrast with |X 2|−1 log n for prefix-free codes,
cf. [BF14, KS14].)

11.1 Arithmetic coding


Constructing an encoder table from QX n may require a lot of resources if n is large. Arithmetic
coding provides a convenient workaround by allowing the encoder to output bits sequentially. Notice
that to do so, it requires that not only QX n but also its marginalizations QX 1 , QX 2 , · · · be easily
computable. (This is not the case, for example, for Shtarkov distributions (11.10)-(11.11), which are
not compatible for different n.)
Let us agree upon some ordering on the alphabet of X (e.g. a < b < · · · < z) and extend this
order lexicographically to X n (that is for x = (x1 , . . . , xn ) and y = (y1 , . . . , yn ), we say x < y if
xi < yi for the first i such that xi 6= yi , e.g., baba < babb). Then let
X
Fn (xn ) = QX n (y n ) .
y n <xn

Associate to each xn an interval Ixn = [Fn (xn ), Fn (xn ) + QX n (xn )). These intervals are disjoint
subintervals of [0, 1). Now encode

xn 7→ largest dyadic interval contained in Ixn .

Recall that dyadic intervals are intervals of the type [m2−k , (m + 1)2−k ] where a is an odd integer.
Clearly each dyadic interval can be associated with a binary string in {0, 1}∗ . We set f (xn ) to be
that string. The resulting code is a prefix code satisfying
 
n 1
`(f (x )) ≤ log2 + 1.
QX n (xn )
(This is an exercise.)
Observe that
X
Fn (xn ) = Fn−1 (xn−1 ) + QX n−1 (xn−1 ) QXn |X n−1 (y|xn−1 )
y<xn

and thus Fn (xn ) can be computed sequentially if QX n−1 and QXn |X n−1 are easy to compute.
This method is the method of choice in many modern compression algorithms because it allows
to dynamically incorporate the learned information about the stream, in the form of updating
QXn |X n−1 (e.g. if the algorithm detects that an executable file contains a long chunk of English text,
it may temporarily switch to QXn |X n−1 modeling the English language).

11.2 Combinatorial construction of Fitingof


Fitingof suggested that a sequence xn ∈ X n should be prescribed information Φ0 (xn ) equal to
the logarithm of the number of all possible permutations obtainable from xn (i.e. log-size of the

125
type-class containing xn ). From Stirling’s approximation this can be shown to be

Φ0 (xn ) = nH(xT ) + O(log n) T ∼ Unif[n] (11.1)


= nH(P̂xn ) + O(log n) , (11.2)

where P̂xn is the empirical distribution of the sequence xn :


n
1X
P̂ (a) ,
xn 1{xi = a} . (11.3)
n
i=1

Then Fitingof argues that it should be possible to produce a prefix code with

`(f (xn )) = Φ0 (xn ) + O(log n) . (11.4)

This can be done in many ways. In the spirit of what we will do next, let us define

QX n (xn ) , exp{−Φ0 (xn )}cn , (11.5)

where cn is a normalization constant cn . Counting the number of different possible empirical


distributions (types), we get
cn = O(n−(|X |−1) ) ,
and thus, by Kraft inequality, there must exist a prefix code with lengths satisfying (11.4). Now
i.i.d.
taking expectation over X n ∼ PX we get

E[`(f (X n ))] = nH(PX ) + (|X | − 1) log n + O(1) ,

for every i.i.d. source on X .

11.2.1 Universal compressor for all finite-order Markov chains


Fitingof’s idea can be extended as follows. Define now the 1-st order information content Φ1 (xn )
to be the log of the number of all sequences, obtainable by permuting xn with extra restriction
that the new sequence should have the same statistics on digrams. Asymptotically, Φ1 is just the
conditional entropy

Φ1 (xn ) = nH(xT |xT −1 mod n ) + O(log n), T ∼ Unif[n] .

Again, it can be shown that there exists a code such that lengths

`(f (xn )) = Φ1 (xn ) + O(log n) .

This implies that for every 1-st order stationary Markov chain X1 → X2 → · · · → Xn we have

E[`(f (X n ))] = nH(X2 |X1 ) + O(log n) .

This can be further continued to define Φ2 (xn ) and build a universal code, asymptotically
optimal for all 2-nd order Markov chains etc.

126
11.3 Optimal compressors for a class of sources. Redundancy.
So we have seen that we can construct compressor f : X n → {0, 1}∗ that achieves

E[`(f (X n ))] ≤ H(X n ) + o(n) ,

simultaneously for all i.i.d. sources (or even all r-th order Markov chains). What should we do next?
Krichevsky suggested that the next barrier should be to optimize regret, or redundancy:

E[`(f (X n ))] − H(X n ) → min

simultaneously for a class of sources. We proceed to rigorous definitions.


Given a collection {PX n |θ : θ ∈ Θ} of sources, and a compressor f : X n → {0, 1}∗ we define its
redundancy as
sup E[`(f (X n ))|θ = θ0 ] − H(X n |θ = θ0 ) .
θ0

Replacing here lengths with log QX1 n we define redundancy of the distribution QX n as

sup D(PX n |θ=θ0 kQX n ) .


θ0

Thus, the question of designing the best universal compressor (in the sense of optimizing worst-case
deviation of the average length from the entropy) becomes the question of finding solution of:

Q∗X n = argmin sup D(PX n |θ=θ0 kQX n ) .


QX n θ0

We therefore get to the following definition


Definition 11.1 (Redundancy in universal compression). Given a class of sources {PX n |θ=θ0 , θ0 ∈
Θ, n = 1, . . .} we define its minimax redundancy as

Rn∗ , min sup D(PX n |θ=θ0 kQX n ) . (11.6)


Q X n θ0

Note that under condition of finiteness of Rn∗ , Theorem 4.5 gives the maximin and capacity
representation

Rn∗ = sup min D(PX n |θ kQX n |Pθ ) (11.7)


Pθ QX n

= sup I(θ; X n ) . (11.8)


Thus redundancy is simply the capacity of the channel θ → X n . This result, obvious in hindsight,
was rather surprising in the early days of universal compression.
Finding exact QX n -minimizer in (11.6) is a daunting task even for the simple class of all i.i.d.
Bernoulli sources (i.e. Θ = [0, 1], PX n |θ = Bernn (θ)). It turns out, however, that frequently the
approximate minimizer has a rather nice structure: it matches the Jeffreys prior.
Remark 11.2. (Shtarkov, Fitingof and individual sequence approach) There is a connection between
the combinatorial method of Fitingof and the method of optimality for a class. Indeed, following
(S)
Shtarkov we may want to choose distribution QX n so as to minimize the worst-case redundancy for
each realization xn (not average!):
PX n |θ (xn |θ0 )
min max sup log (11.9)
QX n n x θ0 QX n (xn )

127
This leads to Shtarkov’s distribution (also known as the normalized maximal likehood (NML) code):
(S)
QX n (xn ) = c sup PX n |θ (xn |θ0 ) , (11.10)
θ0

where c is the normalization constant. If class {PX n |θ , θ ∈ Θ} is chosen to be all i.i.d. distributions
on X then
(S)
i.i.d. QX n (xn ) = c exp{−nH(P̂xn )} , (11.11)
(S)
and thus compressing w.r.t. QX n recovers Fitingof’s construction Φ0 up to O(log n) differences
between nH(P̂xn ) and Φ0 (xn ). If we take PX n |θ to be all 1-st order Markov chains, then we get
construction Φ1 etc. Note also, that the problem (11.9) can also be written as minimization of the
regret for each individual sequence (under log-loss, with respect to a parameter class PX n |θ ):
 
1 1
min max log n
− inf log . (11.12)
QX n x n
QX (x )
n θ0 PX n |θ (xn |θ0 )

The gospel is that if there is a reason to believe that real-world data xn is likely to be generated by
one of the models PX n |θ , then using minimizer of (11.12) will resu lt in the compressor that both
learns the right model and compresses with respect to it.

11.4* Approximate minimax solution: Jeffreys prior


In this section we will only consider the simple setting of a class of sources consisting of all
i.i.d. distributions on a given finite alphabet. We will show that the prior, asymptoticall solving
capacity question (11.8), is given by the Dirichlet-distribution with parameters set to 1/2, namely
the pdf
1
Pθ∗ = const qQ .
d
j=0 θj

First, we give the formal setting as follows:

• Fix X – finite alphabet of size |X | = d + 1, which we will enumerate as X = {0, . . . , d}.


P
• Θ = {(θj , j = 1, . . . , d) : dj=1 θj ≤ 1, θj ≥ 0} – is the collection of all probability distributions
on X . Note that Θ is a d-dimensional simplex. We will also define
d
X
θ0 , 1 − θj .
j=1

• The source class is


n
( )
Y X 1
PX n |θ (xn |θ) , θxj = exp −n θa log ,
j=1 a∈X
P̂xn (a)

where as before P̂xn is the empirical distribution of xn , cf. (11.3).

128
In order to derive the caod Q∗X n we first propose a guess that the caid Pθ in (11.8) is some
distribution with smooth density on Θ (this can only be justified by an apriori belief that the caid
in such a natural problem should be something that involves all θ’s). Then, we define
Z
n
QX n (x ) , PX n |θ (xn |θ0 )Pθ (θ0 )dθ0 . (11.13)
Θ

Before proceeding further, we recall the following method of approximating exponential integrals
(called Laplace method). Suppose that f (θ) has a unique minimum at the interior point θ̂ of Θ
and that Hessian Hessf is uniformly lower-bounded by a multiple of identity (in particular, f (θ) is
strongly convex). Then taking Taylor expansion of π and f we get
Z Z
1 T 2
−nf (θ)
π(θ)e dθ = (π(θ̂) + O(ktk))e−n(f (θ̂)− 2 t Hessf (θ̂)t+o(ktk )) dt (11.14)
Θ
Z
T dx
= π(θ̂)e−nf (θ̂) e−x Hessf (θ̂)x √ (1 + O(n−1/2 )) (11.15)
Rd nd
 d
−nf (θ̂) 2π 1
2
= π(θ̂)e q (1 + O(n−1/2 )) (11.16)
n
det Hessf (θ̂)

where in the last step we computed Gaussian integral.


Next, we notice that

PX n |θ (xn |θ0 ) = e−n(D(P̂xn kPX|θ=θ0 )+H(P̂xn )) log e ,

and therefore, denoting


θ̂(xn ) , P̂xn
we get from applying (11.16) to (11.13)

d 2π Pθ (θ̂) 1
log QX n (xn ) = −nH(θ̂) + log + log q + O(n− 2 ) ,
2 n log e
det JF (θ̂)

where we used the fact that Hessθ0 D(P̂ kPX|θ=θ0 ) = 1 0


log e JF (θ ) with JF – Fisher information matrix,
see (4.14). From here, using the fact that under Xn ∼ PX n |θ=θ0 the random variable θ̂ = θ0 +O(n−1/2 )
we get by linearizing JF (·) and Pθ (·)

d Pθ (θ0 ) 1
D(PX n |θ=θ0 kQX n ) = n(E[H(θ̂)] − H(X|θ = θ0 )) + log n − log p + const + O(n− 2 ) ,
2 0
det JF (θ )
(11.17)
where const is some constant (independent of prior Pθ or θ0 ). The first term is handled by the next
Lemma.
i.i.d.
Lemma 11.1. Let X n ∼ P on finite alphabet X and let P̂ be the empirical type of X n then
 
|X | − 1 1
E[D(P̂ kP )] = log e + o .
2n n

Proof. Notice that n(P̂ − P ) converges in distribution to N (0, Σ), where Σ = diag(P ) − P P T ,
where P is an |X |-by-1 column vector. Thus, computing second-order Taylor expansion of D(·kP ),
cf. (4.16), we get the result.

129
Continuing (11.17) we get in the end

d Pθ (θ0 ) 1
D(PX n |θ=θ0 kQX n ) = log n − log p + const + O(n− 2 ) (11.18)
2 0
det JF (θ )

under the assumption of smoothness of prior Pθ and that θ0 is not too close to the boundary.
Consequently, we can see that in order for the prior Pθ be the saddle point solution, we should have
p
Pθ (θ0 ) ∼ det JF (θ0 ) ,

provided that such density is normalizable. Prior proportional to square-root of the determinant of
Fisher information matrix is known as Jeffreys prior. In our case, using the explicit expression for
Fisher information (4.17) we get
1
Pθ∗ = Beta(1/2, 1/2, · · · , 1/2) = cd qQ , (11.19)
d
j=0 θj

where cd is the normalization constant. The corresponding redundancy is then


d n
Rn∗ = log − log cd + o(1) . (11.20)
2 2πe
Remark 11.3. In statistics Jeffreys prior is justified as being invariant to smooth reparametrization,
as evidenced by (4.15). For example, in answering “will the sun rise tomorrow”, Laplace proposed
to estimate the probability by modeling sunrise as i.i.d. Bernoulli process with a uniform prior on
θ ∈ [0, 1]. However,
√ this is clearly not very1 logical, as one may equally well postulate uniformity of
α = θ or β = θ. Jeffreys prior θ ∼ √
10 is invariant to reparametrization in the sense that if
θ(1−θ)
p
one computed det JF (α) under α-parametrization the result would be exactly the pushforward of
the √ 1 along the map θ 7→ θ10 .
θ(1−θ)

Making the arguments in this subsection rigorous is far from trivial, see [CB90, CB94] for details.

11.5 Sequential probability assignment: Krichevsky-Trofimov


From (11.19) it is not hard to derive the (asymptotically) optimal universal probability assignment
QX n . For simplicity we consider Bernoulli case, i.e. d = 1 and θ ∈ [0, 1] is the 1-dimensional
parameter. Then,1
1
Pθ∗ = p (11.21)
π θ(1 − θ)
(KT ) (2t0 − 1)!! · (2t1 − 1)!!
QX n (xn ) = , ta = #{j ≤ n : xj = a} (11.22)
2n n!
This assignment can now be used to create a universal compressor via one of the methods outlined
in the beginning of this lecture. However, what is remarkable
R is that it has a very nice sequential
interpretation (as does any assignment obtained via QX = Pθ PX n |θ with Pθ not depending on n).
n

R1 θ a (1−θ)b
1
This is obtained from the identity 0
√ dθ = π 1·3···(2a−1)·1·3···(2b−1)
2a+b (a+b)!
for integer a, b ≥ 0. This identity can
θ(1−θ)
θ
be derived by change of variable z = 1−θ
and using the standard keyhole contour on the complex plain.

130
(KT ) t1 + 21
QXn |X n−1 (1|xn−1 ) = , t1 = #{j ≤ n − 1 : xj = 1} (11.23)
n
(KT ) t0 + 21
QXn |X n−1 (0|xn−1 ) = , t0 = #{j ≤ n − 1 : xj = 0} (11.24)
n
This is the famous “add 1/2” rule of Krichevsky and Trofimov. Note that this sequential assignment
is very convenient for use in prediction as well as in implementing an arithmetic coder.
Remark 11.4 (Laplace “add 1” rule). A slightly less optimal choice of QX n results from Laplace
prior: just take Pθ to be uniform on [0, 1]. Then, in the Bernoulli (d = 1) case we get

(Lap) 1
QX n = n
 , w = #{j : xj = 1} . (11.25)
w (n + 1)

The corresponding successive probability is given by

(Lap) t1 + 1
QXn |X n−1 (1|xn−1 ) = , t1 = #{j ≤ n − 1 : xj = 1} .
n+1
We notice two things. First, the distribution (11.25) is exactly the same as Fitingof’s (11.5). Second,
this distribution “almost” attains the optimal first-order term in (11.20). Indeed, when X n is iid
Bern(θ) we have for the redundancy:
" #   
1 n
E log (Lap) − H(X ) = log(n + 1) + E log
n
− nh(θ) , W ∼ Bino(n, θ) . (11.26)
Q n (X n ) W
X

From Stirling’s expansion we know that as n → ∞ this redundancy evaluates to 12 log n + O(1),
uniformly in θ over compact subsets of (0, 1). However, for θ = 0 or θ = 1 the Laplace redun-
dancy (11.26) clearly equals log(n + 1). Thus, supremum over θ ∈ [0, 1] is achieved close to endpoints
and results in suboptimal redundancy log n + O(1). Jeffrey’s prior (11.21) fixes the problem at the
endpoints.

11.6 Individual sequence and universal prediction


The problem of selecting one QX n serving as good prior for a whole class of distributions can also
be interpreted in terms of so-called “universal prediction”. We discuss this connection next.
Consider the following problem: a sequence xn is observed sequentially and our goal is to predict
probability distribution of the next letter given the past observations. The experiment proceeds as
follows:

1. A string xn ∈ X n is selected by the nature.

2. Having observed samples x1 , . . . , xt−1 we are requested to output probability distribution


Qt (·|xt−1 ) on X n .

3. After that nature reveals the next sample xt and our loss for t-th prediction is evaluated as
1
log .
Qt (xt |xt−1 )

131
Goal (informal): Come up with the algorithm to minimize the average per-letter loss:
n
1X 1
`({Qt }, xn ) , log .
n Qt (xt |xt−1 )
t=1

Note that to make this goal formal, we need to explain how xn is generated. Consider first a
naive requirement that the worst-case loss is minimized:

min max
n
`({Qt }, xn ) .
{Qt }n
t=1
x

This is clearly trivial. Indeed, at any step t the distribution P̂t must have at least one atom with
weight ≤ |X1 | , and hence for any predictor

max
n
`({Qt }, xn ) ≥ log |X | ,
x

which is clearly achieved iff Qt (·) ≡ |X1 | , i.e. if predictor makes absolutely no prediction. This is of
course very natural: in the absence of whatsoever prior information on xn it is impossible to predict
anything.
The exciting idea, originated by Feder, Merhav and Gutman, cf. [FMG92, MF98], is to replace
loss with regret, i.e. the gap to the best possible static oracle. More exactly, suppose a non-causal
oracle can look at the entire string xn and output a constant Qt = Q1 . From non-negativity of
divergence this non-causal oracle achieves:
n
1X 1
`oracle (xn ) = min log = H(P̂xn ) .
Q n Q(xt )
t=1

Can causal (but time-varying) predictor come close to this performance? In other words, we define
regret as
reg({Qt }, xn ) , `({Qt }, xn ) − H(P̂xn )
and ask to minimize the worst-case regret:

reg∗n , min max


n
reg({Qt }, xn ) . (11.27)
{Qt } x

Excitingly, non-trivial predictors emerge as solutions to the above problem, which furthermore
do not rely on any assumptions on the prior distribution of xn .
We next consider the case of X = {0, 1} for simplicity. To solve (11.27), first notice that designing
a sequence {Qt (·|xt−1Q
} is equivalent to defining one joint distribution QX n and then factorizing the
latter as QX n (xn ) = t Qt (xt |xt−1 ). (At this point, one should recall the worst-case redundancy of
Shtarkov, cf. Remark 11.2.) Then the problem (11.27) becomes simply
1 1
reg∗n = minn max log − H(P̂xn ) .
QX
n
x n QX n (xn )

We may lower-bound the max over xn with the average over the X n ∼ Bern(θ)n and obtain (also
applying Lemma 11.1):
1 |X | − 1 1
reg∗n ≥ Rn∗ + + o( ) ,
n 2n n
where Rn∗ is the universal compression redundancy defined in (11.6), whose asymptotics we derived
in (11.20).

132
(KT )
On the other hand, taking QX n from Krichevsky-Trofimov (11.22) we find after some algebra
and Stirling’s expansions:

(KT ) 1
max − log QX n (xn ) − nH(P̂xn ) = log n + O(1) .
n x 2
In all, we conclude that,
1 ∗ |X | − 1 log n
reg∗n = Rn + O(1) = + O(1/n) ,
n 2 n
and remarkable, the regret converges to zero. i.e. the causal predictor can approach the perforamnce
of a non-causal oracle. Explicit (asymptotically optimal) sequential prediction rules are given by
Krichevsky-Trofimov’s “add 1/2” rules (11.24). We note that the resulting rules are also independent
of n (“horizon-free”). This is a very desirable property not shared by the sequential predictors (also
asymptotically optimal) derived from factorizing the Shtarkov’s distribution (11.10).

11.7 Lempel-Ziv compressor


So given a class of sources {PX n |θ , θ ∈ Θ} we have shown how to produce an asymptotically optimal
compressors by using Jeffreys’ prior. Although we have done so only for i.i.d. class, it can be
extended to handle a class of all r-th order Markov chains with minimal modifications. However,
the resulting sequential probability becomes rather complex. Can we do something easier at the
expense of losing optimal redundancy?
In principle, the problem is rather straightforward: as we observe a stationary process, we may
estimate with better and better precision the conditional probability P̂Xn |X n−1 and then use it as
n−r

the basis for arithmetic coding. As long as P̂ converges to the actual conditional probability, we
n−1
will get to the entropy rate of H(Xn |Xn−r ). Note that Krichevsky-Trofimov assignment (11.24)
is clearly learning the distribution too: as n grows, the estimator QXn |X n−1 converges to the true
PX (provided sequence is i.i.d.). So in some sense the converse is also true: any good universal
compression scheme is inherently learning the true distribution.
The main drawback of the learn-then-compress approach is the following. Once we extend the
class of sources to include those with memory, we invariably are lead to the problem of learning the
joint distribution PX r−1 of r-blocks. However, the number of samples required to obtain a good
0
estimate of PX r−1 is exponential in r. Thus learning may proceed rather slowly. Lempel-Ziv family
0
of algorithms works around this in an ingeniously elegant way:

• First, estimating probabilities of rare substrings takes longest, but it is also the least useful,
as these substrings almost never appear at the input.

• Second, and most crucial, observation is that a great estimate of PX r (xr ) is given by the
reciprocal of the time till the last observation of xr in the incoming stream.

• Third, there is a prefix code2 mapping any integer n to binary string of length roughly log2 n:

fint : Z+ → {0, 1}+ , `(fint (n)) = log2 n + O(log log n) . (11.28)

Thus, by encoding the pointer to the last observation of xr via such a code we get a string of
length roughly log PX r (xr ) automatically.
2
2− log2 k−2 log2 log(k+1) < ∞ and use Kraft’s inequality.
P
For this just notice that k≥1

133
There are a number of variations of these basic ideas, so we will only attempt to give a rough
explanation of why it works, without analyzing any particular algorithm.
We proceed to formal details. First, we need to establish a Kac’s lemma.
Lemma 11.2 (Kac). Consider a finite-alphabet stationary ergodic process . . . , X−1 , X0 , X1 . . .. Let
−1
L = inf{t > 0 : X−t = X0 } be the last appearance of symbol X0 in the sequence X−∞ . Then for any
u such that P[X0 = u] > 0 we have
1
E[L|X0 = u] = .
P[X0 = u]
In particular, mean recurrence time E[L] = |supp(PX )|.
Proof. Note that from stationarity the following probability

P[∃t ≥ k : Xt = u]

does not depend on k ∈ Z. Thus by continuity of probability we can take k = −∞ to get

P[∃t ≥ 0 : Xt = u] = P[∃t ∈ Z : Xt = u] .

However, the last event is shift-invariant and thus must have probability zero or one by ergodic
assumption. But since P[X0 = u] > 0 it cannot be zero. So we conclude

P[∃t ≥ 0 : Xt = u] = 1 . (11.29)

Next, we have
X
E[L|X0 = u] = P[L ≥ t|X0 = u] (11.30)
t≥1
1 X
= P[L ≥ t, X0 = u] (11.31)
P[X0 = u]
t≥1
1 X
= P[X−t+1 6= u, . . . , X−1 6= u, X0 = u] (11.32)
P[X0 = u]
t≥1
1 X
= P[X0 6= u, . . . , Xt − 2 6= u, Xt−1 = u] (11.33)
P[X0 = u]
t≥1
1
= P[∃t ≥ 0 : Xt = u] (11.34)
P[X0 = u]
1
= , (11.35)
P[X0 = u]
where (11.30) is the standard expression for the expectation of a Z+ -valued random variable, (11.33)
is from stationarity, (11.34) is because the events corresponding to different t are disjoint, and (11.35)
is from (11.29).
The following proposition serves to explain the basic principle behind operation of Lempel-Ziv:
Theorem 11.1. Consider a finite-alphabet stationary ergodic process . . . , X−1 , X0 , X1 . . . with
−1
entropy rate H. Suppose that X−∞ is known to the decoder. Then there exists a sequence of
n−1 −1
prefix-codes fn (x0 , x−∞ ) with expected length
1 −1
E[`(fn (X0n−1 , X∞ ))] → H ,
n

134
Proof. Let Ln be the last occurence of the block xn−1
0 in the string x−1
−∞ (recall that the latter is
known to decoder), namely

Ln = inf{t > 0 : x−t+n−1


−t = x0n−1 } .
(n)
Then, by Kac’s lemma applied to the process Yt = Xtt+n−1 we have

1
E[Ln |X0n−1 = xn−1
0 ]= .
P[X0n−1 = x0n−1 ]

We know encode Ln using the code (11.28). Note that there is crucial subtlety: even if Ln < n and
thus [−t, −t + n − 1] and [0, n − 1] overlap, the substring xn−1
0 can be decoded from the knowledge
of Ln .
We have, by applying Jensen’s inequality twice and noticing that n1 H(X0n−1 ) & H and
1 n−1
n log H(X0 ) → 0 that

1 1 1
E[`(fint (Ln ))] ≤ E[log ] + o(1) → H .
n n PX n−1 (X0n−1 )
0

From Kraft’s inequality we know that for any prefix code we must have
1 1 −1
E[`(fint (Ln ))] ≥ H(X0n−1 |X−∞ )=H.
n n

135
Part III

Binary hypothesis testing

136
§ 12. Binary hypothesis testing

12.1 Binary Hypothesis Testing


Two possible distributions on a space X

H0 : X ∼ P
H1 : X ∼ Q

Where under hypothesis H0 (the null hypothesis) X is distributed according to P , and under H1
(the alternative hypothesis) X is distributed according to Q. A test between two distributions
chooses either H0 or H1 based on an observation of X

• Deterministic test: f : X → {0, 1}

• Randomized test: PZ|X : X → {0, 1}, so that PZ|X (0|x) ∈ [0, 1].

Let Z = 0 denote that the test chooses P , and Z = 1 when the test chooses Q.
Remark: This setting is called “testing simple hypothesis against simple hypothesis”. Simple
here refers to the fact that under each hypothesis there is only one distribution that could generate
the data. Composite hypothesis is when X ∼ P and P is only known to belong to some class of
distributions.

12.1.1 Performance Metrics


In order to determine the “effectiveness” of a test, we look at two metrics. Let πi|j denote the
probability of the test choosing i when the correct hypothesis is j. With this

α = π0|0 = P [Z = 0] (Probability of success given H0 true)


β = π0|1 = Q[Z = 0] (Probability of error given H1 true)

Morever, π1|1 (true positive) is called the power of a test.


Remark
P 12.1. P [Z = 0] is a slight abuse of notation; more accurately, it means P [Z = 0] =
x∈X P (x)PZ|X (0|x) = E[1 − f (X)], where f (x) = P [reject|X = x]. Also, the choice of these two
metrics to judge the test is not unique, we can use many other pairs from {π0|0 , π0|1 , π1|0 , π1|1 }.

So for any test PZ|X there is an associated (α, β). There are a few ways to determine the “best
test”

• Bayesian: Assume prior distributions P[H0 ] = π0 and P[H1 ] = π1 , minimize the expected error

Pb∗ = min π0 π1|0 + π1 π0|1


tests

137
• Minimax: Assume there is a prior distribution but it is unknown, so choose the test that
preforms the best for the worst case priors

Pm = min max π0 π1|0 + π1 π0|1
tests π0

• Neyman-Pearson: Minimize error β subject to success probability at least α.

In this course, the Neyman-Pearson formulation will play a vital role.

12.2 Neyman-Pearson formulation


Definition 12.1. Given that we require P [Z = 0] ≥ α,

βα (P, Q) , inf Q[Z = 0]


P [Z=0]≥α

Definition 12.2. Given (P, Q), the region of achievable points for all randomized tests is
[
R(P, Q) = {(P [Z = 0], Q[Z = 0])} ⊂ [0, 1]2 (12.1)
PZ|X

R(P, Q)

βα (P, Q)

Remark 12.2. This region encodes a lot of useful information about the relationship between P
and Q. For example,1

P = Q ⇔ R(P, Q) = P ⊥ Q ⇔ R(P, Q) =

Moreover, TV(P, Q) = maximal length of vertical line intersecting the lower half of R(P, Q) (HW).
Theorem 12.1 (Properties of R(P, Q)).

1. R(P, Q) is a closed, convex subset of [0, 1]2 .


1
Recall that P is mutually singular w.r.t. Q, denoted by P ⊥ Q, if P [E] = 0 and Q[E] = 1 for some E.

138
2. R(P, Q) contains the diagonal.

3. Symmetry: (α, β) ∈ R(P, Q) ⇔ (1 − α, 1 − β) ∈ R(P, Q).

Proof. 1. For convexity, suppose (α0 , β0 ), (α1 , β1 ) ∈ R(P, Q), then each specifies a test PZ0 |X , PZ1 |X
respectively. Randomize between these two test to get the test λPZ0 |X + λ̄PZ1 |X for λ ∈ [0, 1],
which achieves the point (λα0 + λ̄α1 , λβ0 + λ̄β1 ) ∈ R(P, Q).
Closedness will follow from the explicit determination of all boundary points via Neyman-
Pearson Lemma – see Remark 12.3. In more complicated situations (e.g. in testing against
composite hypothesis) simple explicit solutions similar to Neyman-Pearson Lemma are not
available but closedness of the region can frequently be argued still. The basic reason is that
the collection of functions {g : RX → [0,
R 1]} forms a weakly-compact set and hence its image
under a linear functional g 7→ ( gdP, gdQ) is closed.

2. Test by blindly flipping a coin, i.e., let Z ∼ Bern(1 − α) ⊥


⊥ X. This achieves the point (α, α).

3. If (α, β) ∈ R(P, Q), then form the test that chooses P whenever PZ|X choses Q, and chooses
Q whenever PZ|X choses P , which gives (1 − α, 1 − β) ∈ R(P, Q).

The region R(P, Q) consists of the operating points of all randomized tests, which include
deterministic tests as special cases. The achievable region of deterministic tests are denoted by
[
Rdet (P, Q) = {(P (E), Q(E)}. (12.2)
E

One might wonder the relationship between these two regions. It turns out that R(P, Q) is given by
the closed convex hull of Rdet (P, Q).
We first recall a couple of notations:

• Closure: cl(E) , the smallest closed set containing E.


P P
• Convex hull: co(E) , the smallest convex set containing E = { ni=1 αi xi : αi ≥ 0, ni=1 αi =
1, xi ∈ E, n ∈ N}. A useful example: if (f (x), g(x)) ∈ E, ∀x, then (E [f (X)] , E [g(X)]) ∈
cl(co(E)).

Theorem 12.2 (Randomized test v.s. deterministic tests).

R(P, Q) = cl(co(Rdet (P, Q))).

Consequently, if P and Q are on a finite alphabet X , then R(P, Q) is a polygon of at most 2|X |
vertices.

Proof. “⊃”: Comparing (12.1) and (12.2), by definition, R(P, Q) ⊃ Rdet (P, Q)). By Theorem 12.1,
R(P, Q) is closed convex, and we are done with the ⊃ direction.
“⊂”: Given any randomized test PZ|X , put g(x) = PZ=0|X=x . Then g is a measurable function.
Let’s recall the following lemma:
R
Lemma 12.1 (Area rule). For any positive random variable U ≥ 0, E[U ] = R+ P [U ≥ u] du.
RU R R
Proof. By Fubini, E[U ] = E 0 du = E 1{U ≥u} du = E1{U ≥u} du.

139
Thus,
X Z 1
P [Z = 0] = g(x)P (x) = EP [g(X)] = P [g(X) ≥ t]dt
x 0
X Z 1
Q[Z = 0] = g(x)Q(x) = EQ [g(X)] = Q[g(X) ≥ t]dt
x 0
R
where we applied the formula E[U ] = P [U ≥ t] dt for U ≥ 0. Therefore the point (P [Z = 0], Q[Z =
0]) ∈ R is a mixture of points (P [g(X) ≥ t], Q[g(X) ≥ t]) ∈ Rdet , averaged according to t uniformly
distributed on the unit interval. Hence R ⊂ cl(co(Rdet )).
The last claim follows because there are at most 2|X | subsets in (12.2).
1
Example: Testing Bern(p) versus Bern(q), p < 2 < q. Using Theorem 12.2, note that there are
22 = 4 events E = ∅, {0}, {1}, {0, 1}. Then
β

)
q)
(p, q)
n(
er
,B
p)
n(
er

(p̄, q̄)
(B
R

α
0 1

12.3 Likelihood ratio tests


Definition 12.3. The log likelihood ratio (LLR) is T = log dQdP
: X → R ∪ {±∞}. The likelihood
ratio test (LRT) with threshold τ ∈ R is 1{log dQ ≤ τ }. Formally, we assume that dP = p(x)dµ
dP

and dQ = q(x)dµ (one can take µ = P + Q, for example) and set





log p(x)
q(x) , p(x) > 0, q(x) > 0

+∞, p(x) > 0, q(x) = 0
T (x) ,

−∞, p(x) = 0, q(x) > 0


undefined, p(x) = 0, q(x) = 0

Notes:

• LRT is a deterministic test. The intuition is that upon observing x, if Q(x)


P (x) exceeds a certain
threshold, suggesting Q is more likely, one should reject the null hypothesis and declare Q.

140
• The rationale for defining extended values ±∞ of T (x) are the following observations:

∀x, ∀τ ∈ R : (p(x) − exp{τ }q(x))1{T (x) > τ } ≥ 0


(p(x) − exp{τ }q(x))1{T (x) ≥ τ } ≥ 0
(q(x) − exp{−τ }p(x))1{T (x) < τ } ≥ 0
(q(x) − exp{−τ }p(x))1{T (x) ≤ τ } ≥ 0

This leads to the following useful consequence: For any g ≥ 0 and any τ ∈ R (note: τ = ±∞
is excluded) we have

EP [g(X)1{T ≥ τ }] ≥ exp{τ } · EQ [g(X)1{T ≥ τ }] (12.3)


EQ [g(X)1{T ≤ τ }] ≥ exp{−τ } · EP [g(X)1{T ≤ τ }] (12.4)

Below, these and similar inequalities are only checked for the cases of T taking real (not
extended) values, but from this remark it should be clear how to treat the general case.

• Another useful observation:

Q[T = +∞] = P [T = −∞] = 0 . (12.5)

Theorem 12.3.

1. T is a sufficient statistic for testing H0 vs H1 .

2. (Change of measure) For discrete alphabet X and when Q  P we have

Q[T = t] = exp(−t)P [T = t] ∀f ∈ R ∪ {+∞}

More generally, we have for any g : R ∪ {±∞} → R

EQ [g(T )] = g(−∞)Q[T = −∞] + EP [exp{−T }g(T )] (12.6)


EP [g(T )] = g(+∞)P [T = +∞] + EQ [exp{T }g(T )] (12.7)

Proof. (2)
X P (x) X
QT (t) = Q(x)1{log = t} = Q(x)1{et Q(x) = P (x)}
Q(x)
X X
X P (x)
= e−t P (x)1{log = t} = e−t PT (t)
Q(x)
X

To prove the general version (12.6), note that


Z
EQ [g(T )] = dµ q(x)g(T (x)) + g(−∞)Q[T = −∞]
{−∞<T (x)<∞}
Z
= dµ p(x) exp{−T (x)}g(T (x)) + g(−∞)Q[T = −∞]
{−∞<T (x)<∞}
= EP [exp{−T }g(T )] + g(−∞)Q[T = −∞] ,

141
where we used (12.5) to justify restriction to finite values of T .
(1) To show T is a s.s, we need to show PX|T = QX|T . For the discrete case we have:
P (x) P (x)
PX (x)PT |X (t|x) P (x)1{ Q(x) = et } ef Q(x)1{ Q(x) = et }
PX|T (x|t) = = =
PT (t) PT (t) PT (t)
QXT (xt) (2) QXT
= −t = = QX|T (x|t).
e PT (t) QT
The general argument is done similarly to the proof of (12.6).
From Theorem 12.2 we know that to obtain the achievable region R(P, Q), one can iterate over
all subsets and compute the region Rdet (P, Q) first, then take its closed convex hull. But this is a
dP
formidable task if the alphabet is huge or infinite. But we know that the LLR log dQ is a sufficient
dP
statistic. Next we give bounds to the region R(P, Q) in terms of the statistics of log dQ . As usual,
there are two types of statements:
• Converse (outer bounds): any point in R(P, Q) must satisfy ...

• Achievability (inner bounds): the following point belong to R(P, Q) ...

12.4 Converse bounds on R(P, Q)


Theorem 12.4 (Weak Converse). ∀(α, β) ∈ R(P, Q),

d(αkβ) ≤ D(P kQ)


d(βkα) ≤ D(QkP )

where d(·k·) is the binary divergence.


Proof. Use data processing for KL divergence with PZ|X .
 dP

Lemma 12.2 (Deterministic tests). ∀E, ∀γ > 0 : P [E] − γQ[E] ≤ P log dQ > log γ
Proof. (Discrete version)
X X
P [E] − γQ[E] = p(x) − γq(x) ≤ (p(x) − γq(x))1{p(x)>γq(x)}
x∈E x∈E
h dP i h dP i h dP i
= P log > log γ, X ∈ E − γQ log > log γ, X ∈ E ≤ P log > log γ .
dQ dQ dQ
(General version) WLOG, suppose P, Q  µ for some measure µ (since we can always take
µ = P + Q). Then dP = p(x)dµ, dQ = q(x)dµ. Then
Z Z
P [E] − γQ[E] = dµ(p(x) − γq(x)) ≤ dµ(p(x) − γq(x))1{p(x)>γq(x)}
E E
h dP i h dP i h dP i
= P log > log γ, X ∈ E − Q log > log γ, X ∈ E ≤ P log > log γ .
dQ dQ dQ
dP
p dµ dP
where the second line follows from q = dQ = dQ .

[So we see that the only difference between the discrete and the general case is that the counting
measure is replaced by some other measure µ.]

142
Note: In this case, we do not need P  Q, since ±∞ is a reasonable and meaningful value for the
log likelihood ratio.
 dP

Lemma 12.3 (Randomized tests). P [Z = 0] − γQ[Z = 0] ≤ P log dQ > log γ .
Proof. Almost identical to the proof of the previous Lemma 12.2:
X X
P [Z = 0] − γQ[Z = 0] = PZ|X (0|x)(p(x) − γq(x)) ≤ PZ|X (0|x)(p(x) − γq(x))1{p(x)>γq(x)}
x x
h dP i h dP i
= P log > log γ, Z = 0 − Q log > log γ, Z = 0
dQ dQ
h dP i
≤ P log > log γ .
dQ
Theorem 12.5 (Strong Converse). ∀(α, β) ∈ R(P, Q), ∀γ > 0,
h dP i
α − γβ ≤ P log > log γ (12.8)
dQ
1 h dP i
β − α ≤ Q log < log γ (12.9)
γ dQ
Proof. Apply Lemma 12.3 to (P, Q, γ) and (Q, P, 1/γ).
Note: Theorem 12.5 provides an outer bound for the region R(P, Q) in terms of half-spaces. To see
this, suppose one fixes γ > 0 and looks at the line α − γβ = c and slowing increases c from zero,
there is going to be a maximal c, say c∗ , at which point the line touches the lower boundary of the
region. Then (12.8) says that c∗ cannot exceed P [log dQdP
> log γ]. Hence R must lie to the left of
the line. Similarly, (12.9) provides bounds for the upper boundary. Altogether Theorem 12.5 states
that R(P, Q) is contained in the intersection of a collection of half-spaces indexed by γ.
Note: To apply the strong converse Theorem 12.5, we need to know the CDF of the LLR, whereas
to apply the weak converse Theorem 12.4 we need only to know the expectation of the LLR, i.e.,
divergence.

12.5 Achievability bounds on R(P, Q)


Since we know that the set R(P, Q) is convex, it is natural to try to find all of its supporting lines
(hyperplanes), as it is well known that closed convex set equals the intersection of the halfspaces
correposponding to all supporting hyperplanes. So thus, we are naturally lead to solving the problem
max{α − tβ : (α, β) ∈ R(P, Q)} .
This can be done rather simply:
X X
α∗ − tβ ∗ = max (α − tβ) = max (P (x) − tQ(x))PZ|X (0|x) = |P (x) − tQ(x)|+
(α,β)∈R PZ|X
x∈X x∈X

where the last equality follows from the fact that we are free to choose PZ|X (0|x), and the best
choice is obvious:  
P (x)
PZ|X (0|x) = 1 log ≥ log t .
Q(x)
Thus, we have shown that all supporting hyperplanes are parameterized by LLR-tests. This
completely recovers the region R(P, Q) except for the points corresponding to the faces (flat pieces)
of the region. To be precise, we state the following result.

143
Theorem 12.6 (Neyman-Pearson Lemma: “LRT is optimal”). For any α, βα is attained by the
following test:  dP
1 log dQ > τ

dP
PZ|X (0|x) = λ log dQ =τ (12.10)

 dP
0 log dQ <τ

where τ ∈ R and λ ∈ [0, 1] are the unique solutions to α = P [log dQ


dP dP
> τ ] + λP [log dQ = τ ].

Proof of Theorem 12.6. Let t = exp(τ ). Given any test PZ|X , let g(x) = PZ|X (0|x) ∈ [0, 1]. We
want to show that
h dP i h dP i
α = P [Z = 0] = EP [g(X)] = P > t + λP =t (12.11)
dQ dQ
goal h dP i h dP i
⇒ β = Q[Z = 0] = EQ [g(X)] ≥ Q > t + λQ =t (12.12)
dQ dQ

Using the simple fact that EQ [f (X)1n dP ≤to ] ≥ t−1 EP [f (X)1n dP ≤to ] for any f ≥ 0 twice, we have
dQ dQ

β = EQ [g(X)1n dP ≤to ] + EQ [g(X)1n dP >to ]


dQ dQ

1
≥ EP [g(X)1n dP ≤to ] +EQ [g(X)1n dP >to ]
t dQ dQ
| {z }
(12.11) 1
 h dP i
= EP [(1 − g(X))1n dP >to ] + λP = t + EQ [g(X)1n dP >to ]
t dQ dQ dQ
| {z }
h dP i
≥ EQ [(1 − g(X))1n dP >to ] + λQ = t + EQ [g(X)1n dP >to ]
dQ dQ dQ
h dP i h dP i
=Q > t + λQ =t .
dQ dQ
Remark 12.3. As a consequence of the Neyman-Pearson lemma, all the points on the boundary of
the region R(P, Q) are attainable. Therefore

R(P, Q) = {(α, β) : βα ≤ β ≤ 1 − β1−α }.

Since α 7→ βα is convex on [0, 1], hence continuous, the region R(P, Q) is a closed convex set.
Consequently, the infimum in the definition of βα is in fact a minimum.
Furthermore, the lower half of the region R(P, Q) is the convex hull of the union of the following
two sets: (  
dP
α = P log dQ >τ
 dP
 τ ∈ R ∪ {±∞}.
β = Q log dQ >τ
and (  
dP
α = P log dQ ≥τ
 dP
 τ ∈ R ∪ {±∞}.
β = Q log dQ ≥τ
dP
Therefore it does not lose optimality to restrict our attention on tests of the form 1{log dQ ≥ τ } or
dP
1{log dQ > τ }. The convex combination (randomization) of the above two styles of tests lead to the
achievability of the Neyman-Pearson lemma (Theorem 12.6).

144
Remark 12.4. The test (12.10) is related to LRT2 as follows:
dP dP
P [log dQ > t] P [log dQ > t]

1 1
α α

t t
τ τ

dP
1. Left figure: If α = P [log dQ > τ ] for some τ , then λ = 0, and (12.10) becomes the LRT
Z = 1 log dP ≤τ .
n o
dQ

dP
2. Right figure: If α 6= P [log dQ > τ ] for any τ , then we have λ ∈ (0, 1), and (12.10) is equivalent
to randomize over tests: Z = 1nlog dP ≤τ o with probability λ̄ or 1nlog dP <τ o with probability λ.
dQ dQ

Corollary 12.1. ∀τ ∈ R, there exists (α, β) ∈ R(P, Q) s.t.


h dP i
α = P log >τ
dQ
h dP i
β ≤ exp(−τ )P log > τ ≤ exp(−τ )
dQ

Proof.
h dP i X n P (x) o
Q log >τ = Q(x)1 > eτ
dQ Q(x)
X n P (x) o h dP i
≤ P (x)e−τ 1 > eτ = e−τ P log >τ .
Q(x) dQ

12.6 Asymptotics
Now we have many samples from the underlying distribution
i.i.d.
H0 : X1 , . . . , Xn ∼ P
i.i.d.
H1 : X1 , . . . , Xn ∼ Q

We’re interested in the asymptotics of the error probabilities π0|1 and π1|0 . There are two main
types of tests, both which the convergence rate to zero error is exponential.

1. Stein Regime: What is the best exponential rate of convergence for π0|1 when π1|0 has to be
≤ ?
(
π1|0 ≤ 
π0|1 → 0
2
Note that it so happens that in Definition 12.3 the LRT is defined with an ≤ instead of <.

145
2. Chernoff Regime: What is the trade off between exponents of the convergence rates of π1|0
and π0|1 when we want both errors to go to 0?
(
π1|0 → 0
π0|1 → 0

146
§ 13. Hypothesis testing asymptotics I

Setup:

H0 : X n ∼ PX n H1 : X n ∼ QX n
test PZ|X n : X n → {0, 1}
specification 1 − α = π1|0 β = π0|1

13.1 Stein’s regime

false alarm: 1 − α = π1|0 ≤ 


missed detection: β = π0|1 → 0 at the rate 2−nV

Note: Motivation of this objective: usually a “miss”(0|1) is much worse than a “false alarm” (1|0).
Definition 13.1 (-optimal exponent). V is called an -optimal exponent in Stein’s regime if

V = sup{E : ∃n0 , ∀n ≥ n0 , ∃PZ|X n s.t. α > 1 − , β < 2−nE , }


1 1
⇔ V = lim inf log
n→∞ n β1− (PX n , QX n )

where βα (P, Q) = minPZ|X ,P (Z=0)≥α Q(Z = 0).


Exercise: Check the equivalence.
Definition 13.2 (Stein’s exponent).

V = lim V .
→0

Theorem 13.1 (Stein’s lemma). Let PX n = PXn i.i.d. and QX n = QnX i.i.d. Then

V = D(P kQ), ∀ ∈ (0, 1).

Consequently,
V = D(P kQ).

Example: If it is required that α ≥ 1 − 10−3 , and β ≤ 10−40 , what’s the number of samples needed?
10−40
Stein’s lemma provides a rule of thumb: n & − logD(P kQ) .

147
dP dPX n Pn dP
Proof. Denote F = log dQ , and Fn = log dQ X n
= i=1 log dQ (Xi ) – iid sum.
Recall Neyman Pearson’s lemma on optimal tests (likelihood ratio test): ∀τ ,

α = P (F > τ ), β = Q(F > τ ) ≤ e−τ

Also notice that by WLLN, under P , as n → ∞,


n  
1 1X dP (Xi ) P dP
Fn = log →EP log
− = D(P kQ). (13.1)
n n dQ(Xi ) dQ
i=1

Alternatively, under Q, we have


 
1 P dP
→EQ log
Fn − = −D(QkP ) (13.2)
n dQ

1. Show V ≥ D(P kQ) = D.


Pick τ = n(D − δ), for some small δ > 0. Then the optimal test achieves:

α = P (Fn > n(D − δ)) → 1, by (13.1)


β ≤ e−n(D−δ)

then pick n large enough (depends on , δ) such that α ≥ 1−, we have the exponent E = D −δ
achievable, V ≥ E. Further let δ → 0, we have that V ≥ D.

2. Show V ≤ D(P kQ) = D.

a) (weak converse) ∀(α, β) ∈ R(PX n , QX n ), we have


1
−h(α) + α log ≤ d(αkβ) ≤ D(PX n kQX n ) (13.3)
β
where the first inequality is due to
α ᾱ 1 1
d(αkβ) = α log + ᾱ log = −h(α) + α log + ᾱ log
β β̄ β β̄
| {z }
≥ 0 and ≈ 0 for small β

and the second is due to the weak converse Theorem 12.4 proved in the last lecture (data
processing inequality for divergence).
∀ achievable exponent E < V , by definition, there exists a sequence of tests PZ|X n such
that αn ≥ 1 −  and βn ≤ 2−nE . Plugging it in (13.3) and using h ≤ log 2, we have

D(P kQ) log 2


− log 2 + (1 − )nE ≤ nD(P kQ) ⇒ E ≤ + .
1− n(1 − )
| {z }
→0, as n→∞

Therefore
D(P kQ)
V ≤
1−
Notice that this is weaker than what we hoped to prove, and this weak converse result is
tight for  → 0, i.e., for Stein’s exponent we did have the desired result V = lim→0 V ≥
D(P kQ).

148
b) (strong converse) In proving the weak converse, we only made use of the expectation of
Fn in (13.3), we need to make use of the entire distribution (CDF) in order to obtain
stronger results.
Recall the strong converse result which we showed in the last lecture:

∀(α, β) ∈ R(P, Q), ∀γ, α − γβ ≤ P (F > log γ)

Here, suppose there exists a sequence of tests PZ|Xn which achieve αn ≥ 1 −  and
βn ≤ 2−nE . Then

1 −  − γ2−nE ≤ αn − γβn ≤ PX n [Fn > log γ].

Pick log γ = n(D + δ), by (13.1) the RHS goes to 0, and we have

1 −  − 2n(D+δ) 2−nE ≤ o(1)


1
⇒D + δ − E ≥ log(1 −  + o(1)) → 0
n
⇒E ≤ D as δ → 0
⇒V ≤ D

Remark 13.1 (Ergodic). Just like in last section of data compression. Ergodic assumptions on
PX n and QX n allow one to show that
1
V = lim D(PX n kQX n )
n→∞ n

the counterpart of (13.3), which is the key for picking the appropriate τ , for ergodic sequence X n is
the Birkhoff-Khintchine convergence theorem.
Remark 13.2. The theoretical importance of knowing the Stein’s exponents is that:

∀E ⊂ X n , PX n [E] ≥ 1 −  ⇒ QX n [E] ≥ 2−nV +o(n)

Thus knowledge of Stein’s exponent V allows one to prove exponential bounds on probabilities of
arbitrary sets, the technique is known as “change of measure”.

13.2 Chernoff regime


We are still considering i.i.d. sequence X n , and binary hypothesis

H0 : X n ∼ PXn H1 : X n ∼ QnX

But our objective in this section is to have both types of error probability to vanish exponentially
fast simultaneously. We shall look at the following specification:

1 − α = π1|0 → 0 at the rate 2−nE0


β = π0|1 → 0 at the rate 2−nE1

149
Apparently, E0 (resp. E1 ) can be made arbitrarily big at the price of making E1 (resp. E0 ) arbitrarily
small. So the problem boils down to the optimal tradeoff, i.e., what’s the achievable region of
(E0 , E1 )? This problem is solved by [Hoe65, Bla74].

The optimal tests give the explict error probability:


   
1 1
αn = P Fn > τ , βn = Q Fn > τ
n n

and we are interested in the asymptotics when n → ∞, in which scenario we know (13.1) and (13.2)
occur.
Stein’s regime corresponds to the corner points. Indeed, Theorem 13.1 tells us that when fixing
αn = 1 − , namely E0 = 0, picking τ = D(P kQ) − δ (δ → 0) gives the exponential convergence
rate of βn as E1 = D(P kQ). Similarly, exchanging the role of P and Q, we can achieves the point
(E0 , E1 ) = (D(QkP ), 0). More generally, to achieve the optimal tradeoff between the two corner
points, we need to introduce a powerful tool – Large Deviation Theory.
Note: Here is a roadmap of the upcoming 2 lectures:
∗ , tilted distribution P )
1. basics of large deviation (ψX , ψX λ

2. information projection problem

min D(QkP ) = ψ ∗ (γ)


Q:EQ [X]≥γ

3. use information projection to prove tight Chernoff bound


" n #
1X ∗
P Xk ≥ γ = 2−nψ (γ)+o(n)
n
k=1

4. apply the above large deviation theorem to (E0 , E1 ) to get

(E0 (θ) = ψP∗ (θ), E1 (θ) = ψP∗ (θ) − θ) characterize the achievable boundary.

13.3 Basics of Large deviation theory


Let X n be an i.i.d. sequence and Xi ∼ P . Large deviation focuses on the following inequality:
" n #
X
P Xi ≥ nγ = 2−nE(γ)+o(n)
i=1

150
h Pn i
i=1 Xi
what is the rate function E(γ) = − limn→∞ n1 log P n ≥ γ ? (Chernoff’s ineq.)
To motivate, let us recall the usual Chernoff bound: For iid X n , for any λ ≥ 0,
" n # " n
! #
X X
P Xi ≥ nγ = P exp λ Xi ≥ exp(nλγ)
i=1 i=1
" n
!#
Markov X
≤ exp(−nλγ)E exp λ Xi
i=1
= exp {−nλγ + n log E [exp(λX)]} .

Optimizing over λ ≥ 0 gives the non-asymptotic upper bound (concentration inequality) which
holds for any n:
" n #
X n o
P Xi ≥ nγ ≤ exp − n sup(λγ − log E [exp(λX)]) .
λ≥0 | {z }
i=1 log MGF

Of course we still need to show the lower bound.


Let’s first introduce the two key quantities: log MGF (also known as the cumulant generating
function) ψX (λ) and tilted distribution Pλ .

13.3.1 Log MGF


Definition 13.3 (Log MGF).

ψX (λ) = log(E[exp(λX)]), λ ∈ R.

Per the usual convention, we will also denote ψP (λ) = ψX (λ) if X ∼ P .


Assumptions: In this section, we shall restrict to the distribution PX such that

1. MGF exists, i.e., ∀λ ∈ R, ψX (λ) < ∞, which, in particular, implies all moments exist.

2. X 6=const.

Example:
λ2
• Gaussian: X ∼ N (0, 1) ⇒ ψX (λ) = 2 .

• Example of R.V. such that ψX (λ) does not exist: X = Z 3 with Z ∼ Gaussian. Then
ψX (λ) = ∞, ∀λ] 6= 0.

Theorem 13.2 (Properties of ψX ).

1. ψX is convex;

2. ψX is continuous;

3. ψX is infinitely differentiable and

0 E[XeλX ]
ψX (λ) = = e−ψX (λ) E[XeλX ].
E[eλX ]
0 (0) = E [X].
In particular, ψX (0) = 0, ψX

151
0 ≤ b;
4. If a ≤ X ≤ b a.s., then a ≤ ψX

5. Conversely, if
0 0
A = inf ψX (λ), B = sup ψX (λ),
λ∈R λ∈R

then A ≤ X ≤ B a.s.;
0 is strictly increasing.
6. ψX is strictly convex, and consequently, ψX

7. Chernoff bound:
P (X ≥ γ) ≤ exp(−λγ + ψX (λ)), λ ≥ 0.

Remark 13.3. The slope of log MGF encodes the range of X. Indeed, 4) and 5) of Theorem 13.2
together show that the smallest closed interval containing the support of PX equals (closure of) the
0 . In other words, A and B coincide with the essential infimum and supremum (min and
range of ψX
max of RV in the probabilistic sense) of X respectively,

A = essinf X , sup{a : X ≥ a a.s.}


B = esssup X , inf{b : X ≤ b a.s.}

Proof. Note: 1–4 can be proved right now. 7 is the usual Chernoff bound. The proof of 5–6 relies
on Theorem 13.4, which can be skipped for now.

1. Fix θ ∈ (0, 1). Recall Holder’s inequality:


1 1
E[|U V |] ≤ kU kp kV kq , for p, q ≥ 1, + =1
p q

where the Lp -norm of RV is defined by kU kp = (E|U |p )1/p . Applying to E[e(θλ1 +θ̄λ2 )X ] with
p = 1/θ, q = 1/θ̄, we get

E[exp((λ1 /p + λ2 /q)X)] ≤ k exp(λ1 X/p)kp k exp(λ2 X/q)kq = E[exp(λ1 X)]θ E[exp(λ2 X)]θ̄ ,

i.e., eψX (θλ1 +θ̄λ2 ) ≤ eψX (λ1 )θ eψX (λ2 )θ̄ .

2. By our assumptions on X, domain of ψX is R, and by the fact that convex function must be
continuous on the interior of its domain, we have that ψX is continuous on R.

3. Be careful when exchanging the order of differentiation and expectation.


Assume λ > 0 (similar for λ ≤ 0).
First, we show that E[|XeλX |] exists. Since

e|X| ≤ eX + e−X
|XeλX | ≤ e|(λ+1)X| ≤ e(λ+1)X + e−(λ+1)X

by assumption on X, both of the summands are absolutely integrable in X. Therefore by


dominated convergence theorem (DCT), E[|XeλX |] exists and is continuous in λ.

152
Second, by the existence and continuity of E[|XeλX |], u 7→ E[|XeuX |] is integrable on [0, λ],
we can switch order of integration and differentiation as follows:
 Z λ  Z λ
Fubini  
eψX (λ) = E[eλX ] = E 1 + XeuX du = 1 + E XeuX du
0 0
0
⇒ ψX (λ)eψX (λ) = E[Xe λX
]

0 (λ) = e−ψX (λ) E[XeλX ] exists and is continuous in λ on R.


thus ψX
Furthermore, using similar application of DCT we can extend to λ ∈ C and show that
λ 7→ E[eλX ] is a holomorphic function. Thus it is infinitely differentiable.

4.
0 E[XeλX ]
a ≤ X ≤ b ⇒ ψX (λ) = ∈ [a, b].
E[eλX ]

5. Suppose PX [X > B] > 0 (for contradiction), then PX [X > B + 2] > 0 for some small  > 0.
But then Pλ [X ≤ B + ] → 0 for λ → ∞ (see Theorem 13.4.3 below). On the other hand, we
know from Theorem 13.4.2 that EPλ [X] = ψX 0 (λ) ≤ B. This is not yet a contradiction, since

Pλ might still have some very small mass at a very negative value. To show that this cannot
happen, we first assume that B −  > 0 (otherwise just replace X with X − 2B). Next note
that

B ≥ EPλ [X] = EPλ [X1{X<B−} ] + EPλ [X1{B−≤X≤B+} ] + EPλ [X1{X>B+} ]


≥ EPλ [X1{X<B−} ] + EPλ [X1{X>B+} ]
≥ − EPλ [|X|1{X<B−} ] + (B + ) Pλ [X > B + ] (13.4)
| {z }
→1

therefore we will obtain a contradiction if we can show that EPλ [|X|1{X<B−} ] → 0 as λ → ∞.


To that end, notice that convexity of ψX implies that ψX 0 % B. Thus, for all λ ≥ λ we have
0
0 
ψX (λ) ≥ B − 2 . Thus, we have for all λ ≥ λ0
 
ψX (λ) ≥ ψX (λ0 ) + (λ − λ0 )(B − ) = c + λ(B − ) , (13.5)
2 2

153
for some constant c. Then,

EPλ [|X|1{X < B − }] = E[|X|eλX−ψX (λ) 1{X < B − }] (13.6)
λX−c−λ(B− 2 )
≤ E[|X|e 1{X < B − }] (13.7)
λ(B−)−c−λ(B− 2 )
≤ E[|X|e ] (13.8)
−λ 2 −c
= E[|X|]e →0 λ→∞ (13.9)

where the first inequality is from (13.5) and the second from X < B − . Thus, the first term
in (13.4) goes to 0 implying the desired contradiction.

6. Suppose ψX is not strictly convex. Since we know that ψX is convex, then ψX must be
“flat” (affine) near some point, i.e., there exists a small neighborhood of some λ0 such that
ψX (λ0 + u) = ψX (λ0 ) + ur for some r ∈ R. Then ψPλ (u) = ur for all u in small neighborhood
of zero, or equivalently EPλ [eu(X−r) ] = 1 for u small. The following Lemma 13.1 implies
Pλ [X = r] = 1, but then P [X = r] = 1, contradicting the assumption X 6= const.

Lemma 13.1. E[euS ] = 1 for all u ∈ (−, ) then S = 0.

Proof. Expand in Taylor series around u = 0 to obtain E[S] = 0, E[S 2 ] = 0. Alternatively, we can
extend the argument we gave for differentiating ψX (λ) to show that the function z 7→ E[ezS ] is
holomorphic on the entire complex plane1 . Thus by uniqueness, E[euS ] = 1 for all u.
∗ : R → R ∪ {+∞} is given by the Legendre-
Definition 13.4 (Rate function). The rate function ψX
Fenchel transform of the log MGF:

ψX (γ) = sup λγ − ψX (λ) (13.10)
λ∈R

Note: The maximization (13.10) is a nice convex optimization problem since ψX is strictly convex,
so we are maximizing a strictly concave function. So we can find the maximum by taking the
∗ is the dual of ψ in the sense of convex
derivative and finding the stationary point. In fact, ψX X
analysis.

∗ ).
Theorem 13.3 (Properties of ψX
1
More precisely, if we only know that E[eλS ] is finite for |λ| ≤ 1 then the function z 7→ E[ezS ] is holomorphic in
the vertical strip {z : |Rez| < 1}.

154
1. Let A = essinf X and B = esssup X. Then
 0
 λγ − ψX (λ) for some λ s.t. γ = ψX (λ), A<γ<B
∗ 1
ψX (γ) = log P (X=γ) γ = A or B

+∞, γ < A or γ > B

∗ is strictly convex and strictly positive except ψ ∗ (E[X]) = 0.


2. ψX X
∗ is decreasing when γ ∈ (A, E[X]), and increasing when γ ∈ [E[X], B)
3. ψX

Proof. By Theorem 13.2.4, since A ≤ X ≤ B a.s., we have A ≤ ψX 0 ≤ B. When γ ∈ (A, B), the

strictly concave function λ 7→ λγ − ψX (λ) has a single stationary point which achieves the unique
maximum. When γ > B (resp. < A), λ 7→ λγ − ψX (λ) increases (resp. decreases) without bounds.
When γ = B, since X ≤ B a.s., we have

ψX (B) = sup λB − log(E[exp(λX)]) = − log inf E[exp(λ(X − B))]
λ∈R λ∈R

= − log lim E[exp(λ(X − B))] = − log P (X = B),


λ→∞

by monotone convergence theorem.


∗ are inverse to each
By Theorem 13.2.6, since ψX is strictly convex, the derivative of ψX and ψX
∗ ∗
other. Hence ψX is strictly convex. Since ψX (0) = 0, we have ψX (γ) ≥ 0. Moreover, ψX∗ (E[X]) = 0

follows from E[X] = ψX0 (0).

13.3.2 Tilted distribution


As early as in Lecture 3, we have already introduced tilting in the proof of Donsker-Varadhan’s
variational characterization of divergence (Theorem 3.5). Let us formally define it now.
Definition 13.5 (Tilting). Given X ∼ P , the tilted measure Pλ is defined by

eλx
Pλ (dx) = P (dx) = eλx−ψX (λ) P (dx) (13.11)
E[eλX ]

In other words, if P has a pdf p, then the pdf of Pλ is given by pλ (x) = eλx−ψX (λ) p(x).
Note: The set of distributions {Pλ : λ ∈ R} parametrized by λ is called a standard exponential
family, a very useful model in statistics. See [Bro86, p. 13].
Example:

• Gaussian: P = N (0, 1) with density p(x) = √1 exp(−x2 /2). Then Pλ has density

exp(λx)
√1 exp(−x2 /2) = √1 exp(−(x − λ)2 /2). Hence Pλ = N (λ, 1).
exp(λ2 /2) 2π 2π


• Binary: P is uniform on {±1}. Then Pλ (1) = eλ +e−λ
which puts more (resp. less) mass on 1
D
if λ > 0 (resp. < 0). Moreover, Pλ −
→δ1 if λ → ∞ or δ−1 if λ → −∞.

• Uniform: P is uniform on [0, 1]. Then Pλ is also supported on [0, 1] with pdf pλ (x) = λ exp(λx)
eλ −1
.
Therefore as λ increases, Pλ becomes increasingly concentrated near 1, and Pλ → δ1 as λ → ∞.
Similarly, Pλ → δ0 as λ → −∞.

155
So we see that Pλ shifts the mean of P to the right (resp. left) when λ > 0 (resp. < 0). Indeed, this
is a general property of tilting.
Theorem 13.4 (Properties of Pλ ).

1. Log MGF:
ψPλ (u) = ψX (λ + u) − ψX (λ)

2. Tilting trades mean for divergence:


0
EPλ [X] = ψX (λ) ≷ EP [X] if λ ≷ 0. (13.12)
∗ 0 ∗
D(Pλ kP ) = ψX (ψX (λ)) = ψX (EPλ [X]). (13.13)

3.

P (X > b) > 0 ⇒ ∀ > 0, Pλ (X ≤ b − ) → 0 as λ → ∞


P (X < a) > 0 ⇒ ∀ > 0, Pλ (X ≥ a + ) → 0 as λ → −∞

D D
Therefore if Xλ ∼ Pλ , then Xλ −
→ essinf X = A as λ → −∞ and Xλ −
→ esssup X = B as
λ → ∞.

Proof. 1. By definition. (DIY)


E[X exp(λX)]
2. EPλ [X] = E[exp(λX)]
0 (λ), which is strictly increasing in λ, with ψ 0 (0) = E [X].
= ψX X P
exp(λX)
D(Pλ kP ) = EPλ log dPdP = EPλ log E[exp(λX)] = λEPλ [X] − ψX (λ) = λψX (λ) − ψX (λ) =
λ 0
∗ (ψ 0 (λ)), where the last equality follows from Theorem 13.3.1.
ψX X

3.

Pλ (X ≤ b − ) = EP [eλX−ψX (λ) 1{X≤b−} ]


≤ EP [eλ(b−)−ψX (λ) 1{X≤b−} ]
≤ e−λ eλb−ψX (λ)
e−λ
≤ → 0 as λ → ∞
P [X > b]

where the last inequality is due to the usual Chernoff bound (Theorem 13.2.7): P [X > b] ≤
exp(−λb + ψX (λ)).

156
§ 14. Information projection and Large deviation

14.1 Large-deviation exponents


Large deviations problems deal with rare events by making statements about the tail probabilities
of a sequence of distributions.
 We’re interested in the speed of decay for probabilities such as
1 Pn
P n k=1 Xk ≥ γ for iid Xk .
In the last lecture we used Chernoff bound to obtain an upper bound on the exponent via the
log-MGF and tilting. Next we use a different method to give a formula for the exponent as a convex
optimization problem involving the KL divergence (information projection). Later in Section 14.3
we shall revisit the Chernoff bound after we have computed the value of the information projection.
i.i.d.
Theorem 14.1. Let X1 , X2 , . . . ∼ P . Then for any γ ∈ R,
1 1
lim log  1 Pn = inf D(QkP ) (14.1)
n→∞ n P n k=1 Xk > γ Q : EQ [X]>γ
1 1
lim log  1 Pn = inf D(QkP ) (14.2)
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]≥γ

Furthermore, the upper bound on the probabilities hold for every n.


i.i.d.
Example: [Binomial tail] Applying Theorem 14.1 to Xi ∼ Bern(p), we get the following bounds on
the binomial tail:
k
P(Bin(n, p) ≥ k) ≤ exp(−nd(k/nkp)), >p
n
k
P(Bin(n, p) ≤ k) ≤ exp(−nd(k/nkp)), <p
n
where we used the fact that minQ:EQ [X]≥k/n D(QkBern(p)) = minq≥k/n d(qkp) = d( nk kp).

Proof. First note that if the events have zero probability, then both sides coincide with infinity.
1 Pn
Indeed, if P n k=1 Xk > γ = 0, then P [X > γ] = 0. Then EQ [X] > γ ⇒ Q[X > γ] > 0 ⇒ Q 6
P ⇒ D(QkP ) = ∞ and hence (14.1) holds trivially. The case for (14.2) is similar.
In the sequel
 1 Pn we assume  both probabilities are nonzero. We start by proving (14.1). Set
P [En ] = P n k=1 Xk > γ .
Lower Bound on P [En ]: Fix a Q such that EQ [X] > γ. Let X n be iid. Then by WLLN,
" n #
X LLN
Q[En ] = Q Xk > nγ = 1 − o(1).
k=1

Now the data processing inequality gives

d(Q[En ]kP [En ]) ≤ D(QX n kPX n ) = nD(QkP )

157
And a lower bound for the binary divergence is
1
d(Q[En ]kP [En ]) ≥ −h(Q[En ]) + Q[En ] log
P [En ]

Combining the two bounds on d(Q[En ]kP [En ]) gives


 
−nD(QkP ) − log 2
P [En ] ≥ exp (14.3)
Q[En ]

Optimizing over Q to give the best bound:


1 1
lim sup log ≤ inf D(QkP ).
n→∞ n P [En ] Q:EQ [X]>γ

Upper Bound on P [En ]: The key observation is that given any X and any event E, PX (E) > 0
can be expressed via the divergence between the conditional and unconditional
P distribution as:
1
log PX (E) = D(PX|X∈E kPX ). Define P̃X n = PX n | Xi >nγ , under which
P Xi > nγ holds a.s. Then

1
log = D(P̃X n kPX n ) ≥ inf D(QX n kPX n ) (14.4)
P [En ]
P
QX n :EQ [ Xi ]>nγ

We now show that the last problem “single-letterizes”, i.e., reduces n = 1. Consider the following
two steps:
n
X
D(QX n kPX n ) ≥ D(QXj kP ) (14.5)
j=1
n
1X
≥ nD(Q̄kP ) , Q̄ , QXj , (14.6)
n
j=1

where the first step follows from Corollary 2.1 after noticing that PX n = P n , and the second step is
by convexity of divergence Theorem 4.1. From this argument we conclude that

inf
P D(QX n kPX n ) = n · inf D(QkP ) (14.7)
QX n :EQ [ Xi ]>nγ Q:EQ [X]>γ

inf
P D(QX n kPX n ) = n · inf D(QkP ) (14.8)
QX n :EQ [ Xi ]≥nγ Q:EQ [X]≥γ

In particular, (14.4) and (14.7) imply the required lower bound in (14.1).
Next we prove (14.2). First, notice that the lower bound argument (14.4) applies equally well,
so that for each n we have
1 1
log  1 Pn ≥ inf D(QkP ) .
n P n k=1 Xk ≥ γ Q : EQ [X]≥γ

To get a matching upper bound we consider two cases:

• Case I: P
P [X > γ] = 0. If P [X ≥ γ] = 0, then both sides of (14.2) are +∞. If P [X = γ] > 0,
then P [ Xk ≥ nγ] = P [X1 = . . . = Xn = γ] = P [X = γ]n . For the right-hand side, since
D(QkP ) < ∞ =⇒ Q  P =⇒ Q(X ≤ γ) = 1, the only possibility for EQ [X] ≥ γ is that
1
Q(X = γ) = 1, i.e., Q = δγ . Then inf EQ [X]≥γ D(QkP ) = log P (X=γ) .

158
P P
• Case II: P [X > γ] > 0. Since P[ Xk ≥ γ] ≥ P[ Xk > γ] from (14.1) we know that
1 1
lim sup log  1 Pn ≤ inf D(QkP ) .
n→∞ n P n k=1 Xk ≥ γ Q : EQ [X]>γ

We next show that in this case


inf D(QkP ) = inf D(QkP ) (14.9)
Q : EQ [X]>γ Q : EQ [X]≥γ

Indeed, let P̃ = PX|X>γ which is well defined since P [X > γ] > 0. For any Q such
that EQ [X] ≥ γ, set Q̃ = ¯Q + P̃ satisfies EQ̃ [X] > γ. Then by convexity, D(Q̃kP ) ≤
1
¯D(QkP ) + D(P̃ kP ) = ¯D(QkP ) +  log P [X>γ] . Sending  → 0, we conclude the proof
of (14.9).

14.2 Information Projection


The results of Theorem 14.1 motivate us to study the following general information projection
problem: Let E be a convex set of distributions on some abstract space Ω, then for the distribution
P on Ω, we want
inf D(QkP )
Q∈E

Denote the minimizing distribution Q by Q∗ . The next result shows that intuitively the “line”
between P and optimal Q∗ is “orthogonal” to E.

Q∗

Distributions on X
Theorem 14.2. Suppose ∃Q∗ ∈ E such that D(Q∗ kP ) = minQ∈E D(QkP ), then ∀Q ∈ E
D(QkP ) ≥ D(QkQ∗ ) + D(Q∗ kP )
Proof. If D(QkP ) = ∞, then we’re done, so we can assume that D(QkP ) < ∞, which also implies
that D(Q∗ kP ) < ∞. For θ ∈ [0, 1], form the convex combination Q(θ) = θ̄Q∗ + θQ ∈ E. Since Q∗ is
the minimizer of D(QkP ), then1


0≤ D(Q(θ) kP ) = D(QkP ) − D(QkQ∗ ) − D(Q∗ kP )
∂θ θ=0
and we’re done.
1
This can be found by taking the derivative and matching terms (Exercise). Be careful with exchanging derivatives
and integrals. Need to use dominated convergence theorem similar as in the “local behavior of divergence” in
Proposition 4.1.

159
Remark 14.1. If we view the picture above in the Euclidean setting, the “triangle” formed by P ,
Q∗ and Q (for Q∗ , Q in a convex set, P outside the set) is always obtuse, and is a right triangle
only when the convex set has a “flat face”. In this sense, the divergence is similar to the squared
Euclidean distance, and the above theorem is sometimes known as a “Pythagorean” theorem.
The interesting set of Q’s that we will focus next is the “half-space” of distributions E = {Q :
EQ [X] ≥ γ}, where X : Ω → R is some fixed function (random variable). This is justified by
relation (to be established) with the large deviation exponent in Theorem 14.1. First, we solve this
I-projection problem explicitly.
Theorem 14.3. Given a distribution P on Ω and X : Ω → R let
0
A = inf ψX = essinf X = sup{a : X ≥ a P -a.s.} (14.10)
0
B= sup ψX = esssup X = inf{b : X ≤ b P -a.s.} (14.11)

1. The information projection problem over E = {Q : EQ [X] ≥ γ} has solution





0 γ < EP [X]

ψ ∗ (γ)
P EP [X] ≤ γ < B
min D(QkP ) = 1 (14.12)
Q : EQ [X]≥γ 
log P (X=B) γ = B


+∞ γ>B

= ψP∗ (γ)1{γ ≥ EP [X]} (14.13)

2. Whenever the minimum is finite, minimizing distribution is unique and equal to tilting of P
along X, namely2
dPλ = exp{λX − ψ(λ)} · dP (14.14)

3. For all γ ∈ [EP [X], B) we have

min D(QkP ) = inf D(QkP ) = min D(QkP ) .


EQ [X]≥γ EQ [X]>γ EQ [X]=γ

Note: An alternative expression is

min D(QkP ) = sup λγ − ψX (λ) .


Q:EQ [X]≥γ λ≥0

Proof. First case: Take Q = P .


Fourth case: If EQ [X] > B, then Q[X ≥ B + ] > 0 for some  > 0, but P [X ≥ B + ] = 0, since
P (X ≤ B) = 1, by Theorem 13.2.5. Hence Q 6 P =⇒ D(QkP ) = ∞.
Third case: If P (X = B) = 0, then X < B a.s. under P , and Q 6 P for any Q s.t. EQ [X] ≥ B.
Then the minimum is ∞. Now assume P (X = B) > 0. Since D(QkP ) < ∞ =⇒ Q  P =⇒
Q(X ≤ B) = 1. Therefore the only possibility for EQ [X] ≥ B is that Q(X = B) = 1, i.e., Q = δB .
1
Then D(QkP ) = log P (X=B) .
Second case: Fix EP [X] ≤ γ < B, and find the unique λ such that ψX 0 (λ) = γ = E [X] where

dPλ = exp(λX − ψX (λ))dP . This corresponds to tilting P far enough to the right to increase its
2
Note that unlike previous Lecture, here P and Pλ are measures on an abstract space Ω, not necessarily on the
real line.

160
mean from EP X to γ, in particular λ ≥ 0. Moreover, ψX
∗ (γ) = λγ − ψ (λ). Take any Q such that
X
EQ [X] ≥ γ, then
 
dQdPλ
D(QkP ) = EQ log (14.15)
dP dPλ
dPλ
= D(QkPλ ) + EQ [log ]
dP
= D(QkPλ ) + EQ [λX − ψX (λ)]
≥ D(QkPλ ) + λγ − ψX (λ)

= D(QkPλ ) + ψX (γ)

≥ ψX (γ), (14.16)

where the last inequality holds with equality if and only if Q = Pλ . In addition, this shows the
minimizer is unique, proving the second claim. Note that even in the corner case of γ = B (assuming
P (X = B) > 0) the minimizer is a point mass Q = δB , which is also a tilted measure (P∞ ), since
Pλ → δB as λ → ∞, cf. Theorem 13.4.3.
Another version of the solution, given by expression (14.13), follows from Theorem 13.3.
For the third claim, notice that there is nothing to prove for γ < EP [X], while for γ ≥ EP [X]
we have just shown

ψX (γ) = min D(QkP )
Q:EQ [X]≥γ

while from the next corollary we have



inf D(QkP ) = inf
0
ψX (γ 0 ) .
Q:EQ [X]>γ γ >γ

∗ is increasing and continuous by Theorem 13.3, and hence the


The final step is to notice that ψX
right-hand side infimum equalis ψX ∗ (γ). The case of min
Q:EQ [X]=γ is handled similarly.

Corollary 14.1. ∀Q with EQ [X] ∈ (A, B), there exists a unique λ ∈ R such that the tilted
distribution Pλ satisfies

EPλ [X] = EQ [X]


D(Pλ kP ) ≤ D(QkP )

and furthermore the gap in the last inequality equals D(QkPλ ) = D(QkP ) − D(Pλ kP ).

Proof. Proceed as in the proof of Theorem 14.3, and find the unique λ s.t. EPλ [X] = ψX
0 (λ) = E [X].
Q

Then D(Pλ kP ) = ψX (EQ [X]) = λEQ [X] − ψX (λ). Repeat the steps (14.15)-(14.16) obtaining
D(QkP ) = D(QkPλ ) + D(Pλ kP ).

Remark: For any Q, this allows us to find a tilted measure Pλ that has the same mean yet
smaller (or equal) divergence.

14.3 Interpretation of Information Projection


The following picture describes many properties of information projections.

161
Q 6≪ P
One Parameter Family

γ=A

P b
D(Pλ ||P ) EQ [X] = γ
λ=0 = ψ ∗ (λ)
b

λ>0 Q
b
γ=B
Q∗
=Pλ

Q 6≪ P

Space of distributions on R

• Each set {Q : EQ [X] = γ} corresponds to a slice. As γ varies from A to B, the curves fill the
entire space except for the corner regions.

• When γ < A or γ > B, Q 6 P .

• As γ varies, the Pλ ’s trace out a curve via ψ ∗ (γ) = D(Pλ kP ). This set of distributions is
called a one parameter family, or exponential family.

Key Point: The one parameter family curve intersects each γ-slice E = {Q : EQ [X] = γ}
“orthogonally” at the minimizing Q∗ ∈ E, and the distance from P to Q∗ is given by ψ ∗ (λ). To see
this, note that applying Theorem 14.2 to the convex set E gives us D(QkP ) ≥ D(QkQ∗ ) + D(Q∗ kP ).
Now thanks to Corollary 14.1, we in fact have equality D(QkP ) = D(QkQ∗ ) + D(Q∗ kP ) and
Q∗ = Pλ for some tilted measure.
Chernoff bound revisited: The proof of upper bound in Theorem 14.1 is via the definition of
information projection. Theorem 14.3 shows that the value of the information projection coincides
with the rate function (conjugate of log-MGF). This shows the optimality of the Chernoff bound
(recall Theorem 13.2.7). Indeed, we directly verify this for completeness: For all λ ≥ 0,
" n #
X
P Xk ≥ nγ ≤ e−nγλ (EP [eλX ])n = e−n(λγ−ψX (λ))
k=1

where we used iid Xk ’s in the expectation. Optimizing over λ ≥ 0 to get the best upper bound:

sup λγ − ψX (λ) = sup λγ − ψX (λ) = ψX (γ)
λ≥0 λ∈R

where the first equality follows since γ ≥ EP [X], therefore λ 7→ λγ − ψX (λ) is increasing when λ ≤ 0.
Hence " n #
X ∗
P Xk ≥ nγ ≤ e−nψX (γ) . (14.17)
k=1

Remark: The Chernoff bound is tight precisely because, from information projection, the lower
bound showed that the best change of measure is to change to the tilted measure Pλ .

162
14.4 Generalization: Sanov’s theorem
Theorem 14.4 (Sanov’s Theorem). P Consider observing n samples X1 , . . . , Xn ∼ iid P . Let P̂ be
the empirical distribution, i.e., P̂ = n nj=1 δXj . Let E be a convex set of distributions. Then under
1

regularity conditions on E and P we have

P[P̂ ∈ E] = e−n minQ∈E D(QkP )+o(n)

Note: Examples of regularity conditions: space X is finite and E is closed with non-empty interior;
space X is Polish and the set E is weakly closed and has non-empty interior.

Proof sketch. The lower bound is proved as in Theorem 14.1: Just take an arbitrary Q ∈ E and
apply a suitable version of WLLN to conclude Qn [P̂ ∈ E] = 1 + o(1).
For the upper bound we can again adapt the proof from Theorem 14.1. Alternatively, we can
write the convex set E as an intersection of half spaces. Then we’ve already solved the problem
for half-spaces {Q : EQ [X] ≥ γ}. The general case follows by the following consequence of
Theorem 14.2: if Q∗ is projection of P onto E1 and Q∗∗ is projection of Q∗ on E2 , then Q∗∗ is also
projection of P onto E1 ∩ E2 :
(
D(Q∗ kP ) = minQ∈E1 D(QkP )
D(Q∗∗ kP ) = min D(QkP ) ⇐
Q∈E1 ∩E2 D(Q∗∗ kQ∗ ) = minQ∈E2 D(QkQ∗ )

(Repeated projection property)


Indeed, by first tilting from P to Q∗ we find
∗ kP )
P [P̂ ∈ E1 ∩ E2 ] ≤ 2−nD(Q Q∗ [P̂ ∈ E1 ∩ E2 ]
∗ kP )
≤ 2−nD(Q Q∗ [P̂ ∈ E2 ]

and from here proceed by tilting from Q∗ to Q∗∗ and note that D(Q∗ kP ) + D(Q∗∗ kQ∗ ) = D(Q∗∗ kP ).

Remark: Sanov’s theorem tells us the probability that, after observing n iid samples of a
distribution, the empirical distribution is still far away from the true distribution, is exponentially
small.

163
§ 15. Hypothesis testing asymptotics II

Setup:
H0 : X n ∼ P X n H1 : X n ∼ QX n (i.i.d.)
n
test PZ|X n : X → {0, 1}
(n) (n)
specification: 1 − α = π1|0 ≤ 2−nE0 β = π0|1 ≤ 2−nE1
Bounds:
• achievability (Neyman Pearson)
α = 1 − π1|0 = P [Fn > τ ], β = π0|1 = Q[Fn > τ ]

• converse (strong): from Theorem 12.5:


∀(α, β) achievable, α − γβ ≤ P [Fn > log γ] (15.1)
where
dPX n
F = log (X n ),
dQX n

15.1 (E0 , E1 )-Tradeoff


Goal:
π1|0 = 1 − α ≤ 2−nE0 , π0|1 = β ≤ 2−nE1 .
Our goal in the Chernoff regime is to find the best tradeoff, which we formally define as follows
(compare to Stein’s exponent in Lecture 13)
E1∗ (E0 ) , sup{E1 : ∃n0 , ∀n ≥ n0 , ∃PZ|X n s.t. α > 1 − 2−nE0 , β < 2−nE1 }
1 1
= lim inf log
n→∞ n β1−2−nE0 (P n , Qn )
Define
Xn
dQ dQ dQn n
T = log (X), Tk = log (Xk ), thus log (X ) = Tk
dP dP dP n
k=1
Log MGF of T under P (again assumed to be finite and also T 6= const since P 6= Q):
ψP (λ) = log EP [eλT ]
X Z
= log P (x) Q(x) = log (dP )1−λ (dQ)λ
1−λ λ

x
ψP∗ (θ) = sup θλ − ψP (λ)
λ∈R

164
Note that since ψP (0) = ψP (1) = 0 from convexity ψP (λ) is finite on 0 ≤ λ ≤ 1. Furthermore,
assuming P  Q and Q  P we also have that λ 7→ ψP (λ) continuous everywhere on [0, 1] (
on (0, 1) it follows from convexity, but for boundary points we need more detailed arguments).
Consequently, all the results in this section apply under just the conditions of P  Q and Q  P .
However, since in previous lecture we were assuming that log-MGF exists for all λ, we will only
present proofs under this extra assumption.
Theorem 15.1. Let P  Q, Q  P , then

E0 (θ) = ψP∗ (θ), E1 (θ) = ψP∗ (θ) − θ (15.2)

parametrized by −D(P kQ) ≤ θ ≤ D(QkP ) characterizes the best exponents on the boundary of
achievable (E0 , E1 ).
Note: The geometric interpretation of the above theorem is shown in the following picture, which rely
on the properties of ψP (λ) and ψP∗ (θ). Note that ψP (0) = ψP (1) = 0. Moreover, by Theorem 13.3
∗ ), θ 7→ E (θ) is increasing, θ 7→ E (θ) is decreasing.
(Properties of ψX 0 1

Remark 15.1 (Rényi divergence). Rényi defined a family of divergence indexed by λ 6= 1


"  #
1 dP λ
Dλ (P kQ) , log EQ ≥ 0,
λ−1 dQ

λ→1
which generalizes Kullback-Leibler divergence since Dλ (P kQ) −−−→ D(P kQ). Note that ψP (λ) =
(λ − 1)Dλ (QkP ) = −λD1−λ (P kQ). This provides another explanation that ψP is negative between
0 and 1, and the slope at endpoints is: ψP0 (0) = −D(P kQ) and ψP0 (1) = D(QkP ).
Corollary 15.1 (Bayesian criterion). Fix a prior (π0 , π1 ) such that π0 + π1 = 1 and 0 < π0 < 1.
Denote the optimal Bayesian (average) error probability by

Pe∗ (n) , inf π0 π1|0 + π1 π0|1


PZ|X n

with exponent
1 1
E , lim log ∗ .
n→∞ n Pe (n)
Then
E = max min(E0 (θ), E1 (θ)) = ψP∗ (0) = − inf ψP (λ),
θ λ∈[0,1]

regardless of the prior, and ψP∗ (0) , C(P, Q) is called the Chernoff exponent.

165
Remark 15.2 (Bhattacharyya distance). There is an important special case in which Chernoff
exponent simplifies. Instead of i.i.d. observations, consider independent, but not identically
distributed observations. Namely, suppose that two hypotheses correspond to two different strings
xn and x̃n over a finite alphabet X . The hypothesis tester observes Y n = (Yj , j = 1, . . . n) obtained
by applying one of the strings to the input of the memoryless channel PY |X (the alphabet Y does
not need to be finite, but we assume this below). Extending Corollary it can be shown, that in this
case optimal probability Pe∗ (xn , x̃n ) has (Chernoff) exponent1
n
1X X
E = − inf log PY |X (y|xt )λ PY |X (y|x̃t )1−λ .
λ∈[0,1] n
t=1 y∈Y

If |X | = 2 and if compositions of xn and x̃n are equal (!), the expression is invariant under λ ↔ (1−λ)
and thus from convexity in λ we infer that λ = 12 is optimal, yielding E = n1 dB (xn , x̃n ), where
n
X Xq
dB (xn , x̃n ) = − log PY |X (y|xt )PY |X (y|x̃t )
t=1 y∈Y

is known as Bhattacharyya distance between codewords xn and x̃n . Without the two assumptions
stated, dB (·, ·) does not necessarily give the optimal error exponent. We do, however, always have
the bounds
1 −2dB (xn ,x̃n ) n n
2 ≤ Pe∗ (xn , x̃n ) ≤ 2−dB (x ,x̃ )
4
with the upper bound being the more tight the more joint composition of (xn , x̃n ) resembles that of
(x̃n , xn ).
P
Proof of Theorem 15.1. The idea is to apply the large deviation theory to iid sum nk=1 Tk . Specif-
ically, let’s rewrite the bounds in terms of T :

• Achievability (Neyman Pearson)


" n
# " n #
(n)
X (n)
X
let τ = −nθ, π1|0 =P Tk ≥ nθ π0|1 =Q Tk < nθ
k=1 k=1

• Converse (strong): from (15.1)


" n
#
X
−nθ −nθ
let γ = 2 , π1|0 + 2 π0|1 ≥ P Tk ≥ nθ
k=1

Achievability: Using Neyman Pearson test, for fixed τ = −nθ, apply the large deviation
theorem:
" n #
(n)
X ∗
1 − α = π1|0 = P Tk ≥ nθ = 2−nψP (θ)+o(n) , for θ ≥ EP T = −D(P kQ)
k=1
" n #
(n)
X ∗
β= π0|1 =Q Tk < nθ = 2−nψQ (θ)+o(n) , for θ ≤ EQ T = D(QkP )
k=1

1
In short, the tilting parameter λ does not need to change between coordinates t corresponding to different values
of (xt , x̃t ).

166
Notice that by the definition of T we have

ψQ (λ) = log EQ [eλ log(Q/P ) ] = log EP [e(λ+1) log(Q/P ) ] = ψP (λ + 1)



⇒ ψQ (θ) = sup θλ − ψP (λ + 1) = ψP∗ (θ) − θ
λ∈R

thus (E0 , E1 ) in (15.2) is achievable.


Converse: We want to show that any achievable (E0 , E1 ) pair must be below the curve
(E0 (θ), E1 (θ)) in the above Neyman-Pearson test with parameter θ. Apply the strong converse
bound we have:

2−nE0 + 2−nθ 2−nE1 ≥ 2−nψP (θ)+o(n)
⇒ min(E0 , E1 + θ) ≤ ψP∗ (θ), ∀n, θ, −D(P kQ) ≤ θ ≤ D(QkP )
⇒ either E0 ≤ ψP∗ (θ) or E1 ≤ ψP∗ (θ) − θ

15.2 Equivalent forms of Theorem 15.1


Alternatively, the optimal (E0 , E1 )-tradeoff can be stated in the following equivalent forms:
Theorem 15.2. 1. The optimal exponents are given (parametrically) in terms of λ ∈ [0, 1] as

E0 = D(Pλ kP ), E1 = D(Pλ kQ) (15.3)

where the distribution Pλ 2 is tilting of P along T , cf. (14.14), which moves from P0 = P to
P1 = Q as λ ranges from 0 to 1:

dPλ = (dP )1−λ (dQ)λ exp{−ψP (λ)}.

2. Yet another characterization of the boundary is

E1∗ (E0 ) = min D(Q0 kQ) , 0 ≤ E0 ≤ D(QkP ) (15.4)


Q0 :D(Q0 kP )≤E0

Proof. The first part is verified trivially. Indeed, if we fix λ and let θ(λ) , EPλ [T ], then from (13.13)
we have
D(Pλ kP ) = ψP∗ (θ) ,
whereas
   
dPλ dPλ dP
D(Pλ kQ) = EPλ log = EPλ log = D(Pλ kP ) − EPλ [T ] = ψP∗ (θ) − θ .
dQ dP dQ
Also from (13.12) we know that as λ ranges in [0, 1] the mean θ = EPλ [T ] ranges from −D(P kQ) to
D(QkP ).
To prove the second claim (15.4), the key observation is the following: Since Q is itself a tilting
of P along T (with λ = 1), the following two families of distributions

dPλ = exp{λT − ψP (λ)} · dP


dQλ0 = exp{λ0 T − ψQ (λ0 )} · dQ
2
This is called a geometric mixture of P and Q.

167
are in fact the same family with Qλ0 = Pλ0 +1 .
Now, suppose that Q∗ achieves the minimum in (15.4) and that Q∗ 6= Q, Q∗ 6= P (these cases
should be verified separately). Note that we have not shown that this minimum is achieved, but it
will be clear that our argument can be extended to the case of when Q0n is a sequence achieving the
infimum. Then, on one hand, obviously

D(Q∗ kQ) = min D(Q0 kQ) ≤ D(P kQ)


Q0 :D(Q0 kP )≤E0

On the other hand, since E0 ≤ D(QkP ) we also have

D(Q∗ kP ) ≤ D(QkP ) .

Therefore,
 
dQ∗ dQ
EQ∗ [T ] = EQ∗ log = D(Q∗ kP ) − D(Q∗ kQ) ∈ [−D(P kQ), D(QkP )] . (15.5)
dP dQ∗

Next, we have from Corollary 14.1 that there exists a unique Pλ with the following three properties:3

EPλ [T ] = EQ∗ [T ]
D(Pλ kP ) ≤ D(Q∗ kP )
D(Pλ kQ) ≤ D(Q∗ kQ)

Thus, we immediately conclude that minimization in (15.4) can be restricted to Q∗ belonging to the
family of tilted distributions {Pλ , λ ∈ R}. Furthermore, from (15.5) we also conclude that λ ∈ [0, 1].
Hence, characterization of E1∗ (E0 ) given by (15.3) coincides with the one given by (15.4).

3
Small subtlety: In Corollary 14.1 we ask EQ∗ [T ] ∈ (A, B). But A, B – the essential range of T – depend on the
distribution under which the essential range is computed, cf. (14.10). Fortunately, we have Q  P and P  Q, so
essential range is the same under both P and Q. And furthermore (15.5) implies that EQ∗ [T ] ∈ (A, B).

168
Note: Geometric interpretation of (15.4) is as follows: As λ increases from 0 to 1, or equivalently,
θ increases from −D(P kQ) to D(QkP ), the optimal distribution traverses down the curve. This
curve is in essense a geodesic connecting P to Q and exponents E0 ,E1 measure distances to P and
Q. It may initially sound strange that the sum of distances to endpoints actually varies along the
geodesic, but it is a natural phenomenon: just consider the unit circle with metric induced by the
ambient Euclidean metric. Than if p and q are two antipodal points, the distance from intermediate
point to endpoints do not sum up to d(p, q) = 2.

15.3* Sequential Hypothesis Testing

Review: Filtrations, stopping times

• A sequence of nested σ-algebras F0 ⊂ F1 ⊂ F2 · · · ⊂ Fn · · · ⊂ F is called a filtration


of F.

• A random variable τ is called a stopping time of a filtration Fn if a) τ is valued in


Z+ and b) for every n ≥ 0 the event {τ ≤ n} ∈ Fn .

• The σ-algebra Fτ consists of all events E such that E ∩ {τ ≤ n} ∈ Fn for all n ≥ 0.

• When Fn = σ{X1 , . . . , Xn } the interpretation is that τ is a time that can be


determined by causally observing the sequence Xj , and random variables measurable
with respect to Fτ are precisely those whose value can be determined on the basis
of knowing (X1 , . . . , Xτ ).

• Let Mn be a martingale adapted to Fn , i.e. Mn is Fn -measurable and E[Mn |Fk ] =


Mmin(n,k) . Then M̃n = Mmin(n,τ ) is also a martingale. If collection {Mn } is uniformly
integrable then
E[Mτ ] = E[M0 ] .

• For more details, see [Ç11, Chapter V].

169
Different realizations of Xk are informative to different levels, the total “information” we receive
follows a random process. Therefore, instead of fixing the sample size n, we can make n a stopping
time τ , which gives a “better” (E0 , E1 ) tradeoff. Solution is the concept of sequential test:

• Informally: Sequential test Z at each step declares either “H0 ”, “H1 ” or “give me one more
sample”.

• Rigorous definition is as follows: A sequential hypothesis test is a stopping time τ of the


filtration Fn , σ{X1 , . . . , Xn } and a random variable Z ∈ {0, 1} measurable with respect to
Fτ .

• Each sequential test has the following performance metrics:

α = P[Z = 0], β = Q[Z = 0] (15.6)


l0 = EP [τ ], l1 = EQ [τ ] (15.7)

The easiest way to see why sequential tests may be dramatically superior to fixed-sample-size
tests is the following example: Consider P = 12 δ0 + 12 δ1 and Q = 12 δ0 + 12 δ−1 . Since P 6⊥ Q, we
also have P n 6⊥ Qn . Consequently, no finite-sample-size test can achieve zero error under both
hypotheses. However, an obvious sequential test (wait for the first appearance of ±1) achieves zero
error probability with finite average number of samples (2) under both hypotheses. This advantage
is also very clear in the achievable error exponents as the next figure shows.

Theorem 15.3 (Wald). Assume bounded LLR:4


P (x)

log ≤ c0 , ∀x
Q(x)

where c0 is some positive constant. If the error probabilities satisfy:

π1|0 ≤ 2−l0 E0 , π0|1 ≤ 2−l1 E1

for large l0 , l1 , then the following inequality for the exponents holds

E0 E1 ≤ D(P kQ)D(QkP ).
4
This assumption is satisfied for discrete distributions on finite spaces.

170
with optimal boundary achieved by the sequential probability ratio test SPRT(A, B) (A, B are large
positive numbers) defined as follows:

τ = inf{n : Sn ≥ B or Sn ≤ −A}

0, if Sτ ≥ B
Z=
1, if Sτ < −A

where
n
X P (Xk )
Sn = log
Q(Xk )
k=1

is the log likelihood function of the first k observations.


Note: (Intuition on SPRT) Under the usual hypothesis testing setup, we collect n samples, evaluate
the LLR Sn , and compare it to the threshold to give the optimal test. Under the sequential setup
with iid data, {Sn : n ≥ 1} is a random walk, which has positive (resp. negative) drift D(P kQ)
(resp. −D(QkP )) under the null (resp. alternative)! SPRT test simply declare P if the random walk
crosses the upper boundary B, or Q if the random walk crosses the upper boundary −A.

Proof. As preparation we show two useful identities:

• For any stopping time with EP [τ ] < ∞ we have

EP [Sτ ] = EP [τ ]D(P kQ) (15.8)

and similarly, if EQ [τ ] < ∞ then

EQ [Sτ ] = − EQ [τ ]D(QkP ) .

To prove these, notice that


Mn = Sn − nD(P kQ)
is clearly a martingale w.r.t. Fn . Consequently,

M̃n , Mmin(τ,n)

is also a martingale. Thus


E[M̃n ] = E[M̃0 ] = 0 ,
or, equivalently,
E[Smin(τ,n) ] = E[min(τ, n)]D(P kQ) . (15.9)
This holds for every n ≥ 0. From boundedness assumption we have |Sn | ≤ nc and thus
|Smin(n,τ ) | ≤ nτ , implying that collection {Smin(n,τ ) , n ≥ 0} is uniformly integrable. Thus, we
can take n → ∞ in (15.9) and interchange expectation and limit safely to conclude (15.8).

• Let τ be a stopping time. The Radon-Nikodym derivative of P w.r.t. Q on σ-algebra Fτ is


given by
dP|Fτ
= exp{Sτ } .
dQ|Fτ
Indeed, what we need to verify is that for every event E ∈ Fτ we have

EP [1E ] = EQ [exp{Sτ }1E ] (15.10)

171
To that end, consider a decomposition
X
1E = 1E∩{τ =n} .
n≥0

By monotone convergence theorem applied to (15.10) it is sufficient to verify that for every n
EP [1E∩{τ =n} ] = EQ [exp{Sτ }1E∩{τ =n} ] . (15.11)
dP|Fn
This, however, follows from the fact that E ∩ {τ = n} ∈ Fn and dQ|Fn = exp{Sn } by the very
definition of Sn .
We now proceed to the proof. For achievability we apply (15.10) to infer
π1|0 = P[Sτ ≤ −A]
= EQ [exp{Sτ }1{Sτ ≤ −A}]
≤ e−A
Next, we denot τ0 = inf{n : Sn ≥ B} and observe that τ ≤ τ0 , whereas expectation of τ0 we estimate
from (15.8):
EP [τ ] ≤ EP [τ0 ] = EP [Sτ0 ] ≤ B + c0 ,
where in the last step we used the boundedness assumption to infer
Sτ0 ≤ B + c0
Thus
B + c0 B
l0 = EP [τ ] ≤ EP [τ0 ] ≤ ≈ for large B
D(P kQ) D(P kQ)
Similarly we can show π0|1 ≤ e−B and l1 ≤ A
D(QkP ) for large A. Take B = l0 D(P kQ), A =
l1 D(QkP ), this shows the achievability.

Converse: Assume (E0 , E1 ) achievable for large l0 , l1 and apply data processing inequality of
divergence:

d(P(Z = 1)kQ(Z = 1)) ≤ D(PkQ) Fτ
= EP [Sτ ] = EP [τ ]D(P kQ) from (15.8)
= l0 D(P kQ)
notice that for l0 E0 and l1 E1 large, we have d(P(Z = 1)kQ(Z = 1)) ≈ l1 E1 , therefore l1 E1 .
l0 D(P kQ). Similarly we can show that l0 E0 . l1 D(QkP ), finally we have
E0 E1 ≤ D(P kQ)D(QkP ), as l0 , l1 → ∞

172
Part IV

Channel coding

173
§ 16. Channel coding

Objects we have studied so far:


1. PX - Single distribution, data compression
2. PX vs QX - Comparing two distributions, Hypothesis testing
3. Now: PY |X : X → Y (called a channel, dealing with a collection of distributions).

16.1 Channel Coding


Definition 16.1. An M -code for PY |X is an encoder/decoder pair (f, g) of (randomized) functions1
• encoder f : [M ] → X
• decoder g : Y → [M ] ∪ {e}
Here [M ] , {1, . . . , M }.
In most cases f and g are deterministic functions, in which case we think of them (equivalently)
in terms of codewords, codebooks, and decoding regions
• ∀i ∈ [M ] : ci = f (i) are codewords, the collection C = {c1 , . . . , cM } is called a codebook.
• ∀i ∈ [M ], Di = g −1 ({i}) is the decoding region for i.

c1 b
b
b

D1 b

b b

b cM
b b
b

DM

Figure 16.1: When X = Y, the decoding regions can be pictured as a partition of the space, each
containing one codeword.

Note: The underlying probability space for channel coding problems will always be
f PY |X g
W −→ X −→ Y −→ Ŵ
1
For randomized encoder/decoders, we identify f and g as probability transition kernels PX|W and PŴ |Y .

174
When the source alphabet is [M ], the joint distribution is given by:
1
(general) PW XY Ŵ (m, a, b, m̂) = P (a|m)PY |X (b|a)PŴ |Y (m̂|b)
M X|W
1
(deterministic f, g) PW XY Ŵ (m, cm , b, m̂) = P (b|cm )1{b ∈ Dm̂ }
M Y |X
Throughout the notes, these quantities will be called:

• W - Original message

• X - (Induced) Channel input

• Y - Channel output

• Ŵ - Decoded message

16.1.1 Performance Metrics


Three ways to judge the quality of a code in terms of error probability:

1. Average error probability: Pe , P[W 6= Ŵ ].

2. Maximum error probability: Pe,max , maxm∈[M ] P[Ŵ 6= m|W = m].

3. Bit error rate: in the special case when M = 2k , we identify W with a k-bit string S k ∈ Fk2 .
P
Then the bit error rate is Pb , k1 kj=1 P[Sj 6= Ŝj ], which means the average fraction of errors
in a k-bit block. It is also convenient to introduce in this case the Hamming distance

dH (S k , Ŝ k ) , #{i : Si 6= Ŝj } .

Then, the bit-error rate becomes the normalized expected Hamming distance:
1
Pb = E[dH (S k , Ŝ k )] .
k
To distinguish the bit error rate Pb from the previously defined Pe (resp. Pe,max ), we will also
call the latter the average (resp. max) block error rate.

The most typical metric is average probability of error, but the others will be used occasionally in
the course as well. By definition, Pe ≤ Pe,max . Therefore the maximum error probability is a more
stringent criterion which offers uniform protection for all codewords.

16.1.2 Fundamental Limit for a given PY |X


Definition 16.2. A code (f, g) is an (M, )-code for PY |X if f : [M ] → X , g : Y → [M ] ∪ {e}, and
Pe ≤ . Similarly, an (M, )max -code must satisfy Pe,max ≤ .
Then the fundamental limits of channel codes are defined as

M ∗ () = max{M : ∃(M, )-code}



Mmax () = max{M : ∃(M, )max -code}

Remark: log2 M ∗ gives the maximum number of bits that we can pump through a channel PY |X
while still guaranteeing the error probability (in the appropriate sense) is at most .

175
Example: The random transformation BSC(n, δ) (binary symmetric channel) is defined as

X = {0, 1}n
Y = {0, 1}n

where the input X n is contaminated by additive noise Z n ⊥


⊥ X n and the channel outputs

Y n = Xn ⊕ Zn
i.i.d.
where Z n ∼ Bern(δ). Pictorially, the BSC(n, δ) channel takes a binary sequence length n and flips
the bits independently with probability δ:
0 1 0 0 1 1 0 0 1 1

PY n |X n

1 1 0 1 0 1 0 0 0 1

Question: When δ = 0.11, n = 1000, what is the maximum number of bits you can send with
Pe ≤ 10−3 ?
Ideas:

0. Can one send 1000 bits with Pe ≤ 10−3 ? No and apparently the probability that at least one
bit is flipped is Pe = 1 − (1 − δ)n ≈ 1. This implies that uncoded transmission does not meet
our objective and coding is necessary – tradeoff: reduce the number of bits to send, increase
the probability of success.

1. Take each bit and repeat it l times (l-repetition code).

0 0 1 0

0000000 0000000 1111111 0000000

Decoding by majority vote, the probability of error of this scheme is Pe ≈ kP[Binom(l, δ) > l/2]
and kl ≤ n = 1000, which for Pe ≤ 10−3 gives l = 21, k = 47 bits.

0 , . . . , ar−1 ∈ F2 as the polynomial (in this


2. Reed-Muller Codes (1, r). Interpret a message aP r
r−1
case, a degree-1 and (r − 1)-variate polynomial) i=1 ai xi + a0 . Then codewords are formed
by evaluating the polynomial at all possible xr−1 ∈ Fr−1 2 . This code, which maps r bits to
2r−1 bits, has minimum distance 2r−2 . For r = 7, there is a [64, 7, 32] Reed-Muller code and
it can be shown that the MAP decoder of this code passed over the BSC(n = 64, δ = 0.11)
achieves probability of error ≤ 6 · 10−6 . Thus, we can use 16 such blocks (each carrying 7
data bits and occupying 64 bits on the channel) over the BSC(1024, δ), and still have (union
bound) overall Pe . 10−4 . This allows us to send 7 · 16 = 112 bits in 1024 channel uses, more
than double that of the repetition code.

176
3. Shannon’s theorem (to be shown soon) tells us that over memoryless channel of blocklength n
the fundamental limit satisfies
log M ∗ = nC + o(n) (16.1)
as n → ∞ and for arbitrary  ∈ (0, 1). Here C = maxX I(X1 ; Y1 ) is the capacity of the
single-letter channel. In our case we have
1
I(X; Y ) = max I(X; X + Z) = log 2 − h(δ) ≈ bit
PX 2
So Shannon’s expansion (16.1) can be used to predict (non-rigorously, of course) that it should
be possible to send around 500 bits reliably. As it turns out, for this blocklength this is not
quite possible.
4. Even though calculating log M ∗ is not computationally feasible (searching over all codebooks
is doubly exponential in block length n), we can find bounds on log M ∗ that are easy to
compute. We will show later in the course that in fact, for BSC(1000, .11)

414 ≤ log M ∗ ≤ 416

5. The first codes to approach the bounds on log M ∗ are called Turbo codes (after the turbocharger
engine, where the exhaust is fed back in to power the engine). This class of codes is known as
sparse graph codes, of which LDPC codes are particularly well studied. As a rule of thumb,
these codes typically approach 80 . . . 90% of log M ∗ when n ≈ 103 . . . 104 .

16.2 Basic Results


Recall that the object of our study is M ∗ () = max{M : ∃(M, )-code}.

16.2.1 Determinism
1. Given any encoder f : [M ] → X , the decoder that minimizes Pe is the Maximum A Posteriori
(MAP) decoder, or equivalently, the Maximal Likelihood (ML) decoder, since the codewords
are equiprobable (W is uniform):

g ∗ (y) = argmax P [W = m|Y = y]


m∈[M ]

= argmax P [Y = y|W = m]
m∈[M ]

Furthermore, for a fixed f , the MAP decoder g is deterministic


2. For given M , PY |X , the Pe -minimizing encoder is deterministic.

Proof. Let f : [M ] → X be a random transformation. We can always represent randomized


encoder as deterministic encoder with auxiliary randomness. So instead of f (a|m), consider
the deterministic encoder f˜(m, u), that receives external randomness u. Then looking at all
possible values of the randomness,

Pe = P [W 6= Ŵ ] = EU [P[W 6= Ŵ |U ] = EU [Pe (U )].

Each u in the expectation gives a deterministic encoder, hence there is a deterministic encoder
that is at least as good as the average of the collection, i.e., ∃u0 s.t. Pe (u0 ) ≤ P[W 6= Ŵ ]

177
Remark: If instead we use maximal probability of error as our performance criterion, then
these results don’t hold; randomized encoders and decoders may improve performance. Example:
consider M = 2 and we are back to the binary hypotheses testing setup. The optimal decoder (test)
that minimizes the maximal Type-I and II error probability, i.e., max{1 − α, β}, is not deterministic,
if max{1 − α, β} is not achieved at a vertex of the region R(P, Q).

16.2.2 Bit Error Rate vs Block Error Rate


Now we give a bound on the average probability of error in terms of the bit error probability.
Theorem 16.1. For all (f, g), M = 2k =⇒ Pb ≤ Pe ≤ kPb
Note: The most often used direction Pb ≥ k1 Pe is rather loose for large k.
Proof. Recall that M = 2k gives us the interpretation of W = S k sequence of bits.
k k
1X k k
X
1{Si 6= Ŝi } ≤ 1{S 6= Ŝ } ≤ 1{Si 6= Ŝi },
k
i=1 i=1

where the first inequality is obvious and the second follow from the union bound. Taking expectation
of the above expression gives the theorem.
Theorem 16.2 (Assouad). If M = 2k then
Pb ≥ min{P[Ŵ = c0 |W = c] : c, c0 ∈ Fk2 , dH (c, c0 ) = 1} .
Proof. Let ei be a length k vector that is 1 in the i-th position, and zero everywhere else. Then
k
X k
X
1{Si 6= Ŝi } ≥ 1{S k = Ŝ k + ei }
i=1 i=1

Dividing by k and taking expectation gives


k
1X
Pb ≥ P[S k = Ŝ k + ei ]
k
i=1
≥ min{P[Ŵ = c0 |W = c] : c, c0 ∈ Fk2 , dH (c, c0 ) = 1} .

Similarly, we can prove the following generalization:


Theorem 16.3. If A, B ∈ Fk2 (with arbitrary marginals!) then for every r ≥ 1 we have
 
1 k−1
Pb = E[dH (A, B)] ≥ Pr,min (16.2)
k r−1
Pr,min , min{P[B = c0 |A = c] : c, c0 ∈ Fk2 , dH (c, c0 ) = r} (16.3)
Proof. First, observe that
X  
k
P[dH (A, B) = r|A = a] = PB|A (b|a) ≥ Pr,min .
r
b:dH (a,b)=r

Next, notice
dH (x, y) ≥ r1{dH (x, y) = r}
and take the expectation with x ∼ A, y ∼ B.

178
Remark: In statistics, Assouad’s Lemma is a useful tool for obtaining lower bounds on the
minimax risk of an estimator. Say the data X is distributed according to Pθ parameterized by θ ∈ Rk
and let θ̂ = θ̂(X) be an estimator for θ. The goal is to minimize the maximal risk supθ∈Θ Eθ [kθ − θ̂k1 ].
A lower bound (Bayesian) to this worst-case risk is the average risk E[kθ − θ̂k1 ], where θ is distributed
to any prior. Consider θ uniformly distributed on the hypercube {0, }k with side length  embedded
in the space of parameters. Then
k
inf sup E[kθ − θ̂k1 ] ≥ min (1 − TV(Pθ , Pθ0 )) . (16.4)
θ̂ θ∈{0,}k 4 dH (θ,θ0 )=1

This can be proved using similar ideas to Theorem 16.2. WLOG, assume that  = 1.
(a)
1 1
E[kθ − θ̂k1 ] ≥ E[kθ − θ̂dis k1 ] = E[dH (θ, θ̂dis )]
2 2
Xk k
1 (b) 1 X
≥ min P[θi 6= θ̂i ] = (1 − TV(PX|θi =0 , PX|θi =1 ))
2 θ̂i =θ̂i (X) 4
i=1 i=1
(c) k
≥ min (1 − TV(Pθ , Pθ0 )) .
4 dH (θ,θ0 )=1

Here θ̂dis is the discretized version of θ̂, i.e. the closest point on the hypercube to θ̂ and so
(a) follows from |θi − θ̂i | ≥ 12 1{|θi −θ̂i |>1/2} = 12 1{θi 6=θ̂dis,i } , (b) follows from the optimal binary
hypothesis testing for θi given X, (c) follows from the convexity of TV: TV(PX|θi =0 , PX|θi =1 ) =
1 P 1 P 1 P
TV( 2k−1 θ:θi =0 PX|θ , 2k−1 θ:θi =1 PX|θ ) ≤ 2k−1 θ:θi =0 TV(PX|θ , PX|θ⊕ei ) ≤ maxdH (θ,θ0 )=1 TV(Pθ , Pθ0 ).
Alternatively, (c) also follows from by providing the extra information θ\i and allowing θ̂i = θ̂i (X, θ\i )
in the second line.

16.3 General (Weak) Converse Bounds


Theorem 16.4 (Weak converse).
1. Any M -code for PY |X satisfies
supX I(X; Y ) + h(Pe )
log M ≤
1 − Pe

2. When M = 2k
supX I(X; Y )
log M ≤
log 2 − h(Pb )

Proof. 1. Since W → X → Y → Ŵ , we have the following chain of inequalities, cf. Fano’s


inequality (Theorem 5.4):

sup I(X; Y ) ≥ I(X; Y ) ≥ I(W ; Ŵ )


X
dpi 1
≥ d(P[W = Ŵ ]k )
M
6 Ŵ ]) + P[W = Ŵ ] log M
≥ −h(P[W =

Plugging in Pe = P[W 6= Ŵ ] finishes the proof of the first part.

179
P
2. Now S k → X → Y → Ŝ k . Recall from Theorem 5.1 that for iid S n , I(Si ; Ŝi ) ≤ I(S k ; Ŝ k ).
This gives us
k
X
sup I(X; Y ) ≥ I(X; Y ) ≥ I(Si , Ŝi )
X i=1
 1
1X
≥k d P[Si = Ŝi ]
k 2
!
1X
k
1
≥ kd P[Si = Ŝi ]
k 2
i=1
 1 

= kd 1 − Pb = k(log 2 − h(Pb ))
2

where the second line used Fano’s inequality (Theorem 5.4) for binary random variable (or
divergence data processing), and the third line used the convexity of divergence.

16.4 General achievability bounds: Preview


Remark: Regarding differences between information theory and statistics: in statistics, there is
a parametrized set of distributions on a space (determined by the model) from which we try to
estimate the underlying distribution or parameter from samples. In data transmission, the challenge
is to choose the structure on the parameter space (channel coding) such that, upon observing a
sample, we can estimate the correct parameter with high probability. With this in mind, it is natural
to view
PY |X=x
log
PY
as an LLR of a binary hypothesis test, where we compare the hypothesis X = x to the distribution
induced by our codebook: PY = PY |X ◦ PX (so compare ci to “everything else”). To decode, we
ask M different questions of this form. This motivates importance of the random variable (called
information density):

PY |X (Y |X)
i(X; Y ) = log ,
PY (Y )

where PY = PY |X ◦ PX . Note that

I(X; Y ) = E[i(X; Y )],

which is what the weak converse is based on.


Shortly, we will show a result (Shannon’s Random Coding Theorem), that states: ∀PX ,
∀τ , ∃(M, )-code with

 ≤ P[i(X; Y ) ≤ log M + τ ] + e−τ

Details in the next lecture.

180
§ 17. Channel coding: achievability bounds

Notation: in the following proofs, we shall make use of the independent pairs (X, Y ) ⊥
⊥ (X, Y )

X→Y (X : sent codeword)


X→Y (X : unsent codeword)

The joint distribution is given by:

PXY XY (a, b, a, b) = PX (a)PY |X (b|a)PX (a)PY |X (b|a).

17.1 Information density


Definition 17.1 (Information density). Given joint distribution PX,Y we define
PY |X (y|x) dPY |X=x (y)
iPXY (x; y) = log = log (17.1)
PY (y) dPY (y)
and we define iPXY (x; y) = +∞ for all y in the singular set where PY |X=x is not absolutely continuous
w.r.t. PY . We also define iPXY (x; y) = −∞ for all y such that dPY |X=x /dPY equals zero. We will
almost always abuse notation and write i(x; y) dropping the subscript PX,Y , assuming that the joint
distribution defining i(·; ·) is clear from the context.
Notice that i(x; y) depends on the underlying PX and PY |X , which should be understood from the
context.
Remark 17.1 (Intuition). Information density is a useful concept in understanding decoding. In
discriminating between two codewords, one concerns with (as we learned in binary hypothesis
P
testing) the LLR, log PY |X=c1 . In M -ary hypothesis testing, a similar role is played by information
Y |X=c2
density i(c1 ; y), which, loosely speaking, evaluates the likelihood of c1 against the average likelihood,
or “everything else”, which we model by PY .
Remark 17.2 (Joint measurability). There is a measure-theoretic subtlety in (17.1): The so-defined
function i(·; ·) may not be a measurable function on the product space X × Y. For a resolution, see
Section 2.6* and Remark 2.4 in particular.
Remark 17.3 (Alternative definition). Observe that for discrete X , Y, (17.1) is equivalently written
as
PX,Y (x, y) PX|Y (x|y)
i(x; y) = log = log
PX (x)PY (y) PX (x)
For the general case, we often use the alternative definition, which is symmetric in X and Y and is
measurable w.r.t. X × Y:
dPX,Y
i(x; y) = log (x, y) (17.2)
dPX × PY

181
Notice a subtle difference between (17.1) and (17.2) for the continuous case: In (17.2) the Radon-
Nikodym derivative is only defined up to sets of measure zero, therefore whenever PX (x) = 0 the
value of PY (i(x, Y ) > t) is undefined. This problem does not occur with definition (17.1), and that
is why we prefer it. In any case, for discrete X , Y, or under other regularity conditions, all the
definitions are equivalent.
Proposition 17.1 (Properties of information density).

1. E[i(X; Y )] = I(X; Y ). This justifies the name “(mutual) information density”.

2. If there is a bijective transformation (X, Y ) → (A, B), then almost surely iPXY (X; Y ) =
iPAB (A; B) and in particular, distributions of i(X; Y ) and i(A; B) coincide.

3. (Conditioning and unconditioning trick) Suppose that f (y) = 0 and g(x) = 0 whenever
i(x; y) = −∞, then1

E[f (Y )] = E[exp{−i(x; Y )}f (Y )|X = x], ∀x (17.3)


E[g(X)] = E[exp{−i(X; y)}g(X)|Y = y], ∀y (17.4)

4. Suppose that f (x, y) = 0 whenever i(x; y) = −∞, then

E[f (X, Y )] = E[exp{−i(X; Y )}f (X, Y )] (17.5)


E[f (X, Y )] = E[exp{−i(X; Y )}f (X, Y )]. (17.6)

Proof. The proof is simply a change of measure. For example, to see (17.3), note
X X PY (y)
Ef (Y ) = PY (y)f (y) = PY |X (y|x) f (y)
PY |X (y|x)
y∈Y y∈Y

notice that by the assumption on f (·), the summation is valid even if for some y we have that
PY |X (y|x) = 0. Similarly, E[f (x, Y )] = E[exp{−i(x; Y )}f (x, Y )|X = x]. Integrating over x ∼ PX
gives (17.5). The rest are by interchanging X and Y .

Corollary 17.1.

P[i(x; Y ) > t] ≤ exp(−t), ∀x (17.7)


P[i(X; Y ) > t] ≤ exp(−t) (17.8)

Proof. Pick f (Y ) = 1 {i(x; Y ) > t} in (17.3).

Remark 17.4. We have used this trick before: For any probability measure P and any measure Q,
h dP i
Q log ≥ t ≤ exp(−t). (17.9)
dQ

for example, in hypothesis testing (Corollary 12.1). In data compression, we frequently used the fact
that |{x : log PX (x) ≥ t}| ≤ exp(−t), which is also of the form (17.9) with Q = counting measure.
1 dPY |X
Note that (17.3) holds when i(x; y) is defined as i = log PY
, and (17.4) holds when i(x; y) is defined as
dP
i = log PX|Y
X
. (17.5) and (17.6) hold under either of the definitions. Since in the following we shall only make use of
(17.3) and (17.5), this is another reason we adopted definition (17.1).

182
17.2 Shannon’s achievability bound
Theorem 17.1 (Shannon’s achievability bound). For a given PY |X , ∀PX , ∀τ > 0, ∃(M, )-code
with

 ≤ P[i(X; Y ) ≤ log M + τ ] + exp(−τ ). (17.10)

Proof. Recall that for a given codebook {c1 , . . . , cM }, the optimal decoder is MAP, or equivalently,
ML, since the codewords are equiprobable:

g ∗ (y) = argmax PX|Y (cm |y)


m∈[M ]

= argmax PY |X (y|cm )
m∈[M ]

= argmax i(cm ; y).


m∈[M ]

The step of maximizing the likelihood can make analyzing the error probability difficult. Similar to
what we did in almost loss compression (e.g., Theorem 7.4), the magic in showing the following
two achievability bounds is to consider a suboptimal decoder. In Shannon’s bound, we consider a
threshold-based suboptimal decoder g(y) as follows:

m, ∃!cm s.t. i(cm ; y) ≥ log M + τ
g(y) =
e, o.w.

Interpretation
i(cm ; y) ≥ log M + τ ⇔ PX|Y (cm |y) ≥ M exp(τ )PX (cm ),
i.e., the likelihood of cm being the transmitted codeword conditioned on receiving y exceeds some
threshold.
For a given codebook (c1 , . . . , cM ), the error probability is:

Pe (c1 , . . . , cM ) = P[{i(cW ; Y ) ≤ log M + τ } ∪ {∃m 6= W, i(cm ; Y ) > log M + τ }]

where W is uniform on [M ].
We generate the codebook (c1 , . . . , cM ) randomly with cm ∼ PX i.i.d. for m ∈ [M ]. By symmetry,
the error probability averaging over all possible codebooks is given by:

E[Pe (c1 , . . . , cM )]
= E[Pe (c1 , . . . , cM )|W = 1]
= P[{i(c1 ; Y ) ≤ log M + τ } ∪ {∃m 6= 1, i(cm , Y ) > log M + τ }|W = 1]
M
X
≤ P[i(c1 ; Y ) ≤ log M + τ |W = 1] + P[i(cm ; Y ) > log M + τ |W = 1] (union bound)
m=2
 
= P [i(X; Y ) ≤ log M + τ ] + (M − 1)P i(X; Y ) > log M + τ (random codebook)
≤ P [i(X; Y ) ≤ log M + τ ] + (M − 1) exp(−(log M + τ )) (by Corollary 17.1)
≤ P [i(X; Y ) ≤ log M + τ ] + τ ) + exp(−τ )

Finally, since the error probability averaged over the random codebook satisfies the upper bound,
there must exist some code allocation whose error probability is no larger than the bound.

183
Remark 17.5 (Typicality).

• The property of a pair (x, y) satisfying the condition {i(x; y) ≥ γ} can be interpreted as “joint
typicality”. Such version of joint typicality is useful when random coding is done in product
spaces with cj ∼ PXn (i.e. coordinates of the codeword are iid).

• A popular alternative to the definition of typicality is to require that the empirical joint
distribution is close to the true joint distribution, i.e., P̂xn ,yn ≈ PXY , where

1
P̂xn ,yn (a, b) = · #{j : xj = a, yj = b} .
n
This definition is natural for cases when random coding is done with cj ∼ uniform on the set
{xn : P̂xn ≈ PX } (type class).

17.3 Dependence-testing bound


The following result is a refinement of Theorem 17.1:
Theorem 17.2 (DT bound). ∀PX , ∃(M, )-code with
   
M − 1 +
 ≤ E exp − i(X; Y ) − log (17.11)
2

where x+ , max(x, 0).

Proof. For a fixed γ, consider the following suboptimal decoder:



m, for the smallest m s.t. i(cm ; y) ≥ γ
g(y) =
e, o/w

Note that given a codebook {c1 , . . . , cM }, we have by union bound

P[Ŵ 6= j|W = j] = P[i(cj ; Y ) ≤ γ|W = j] + P[i(cj ; Y ) > γ, ∃k ∈ [j − 1], s.t. i(ck ; Y ) > γ]
j−1
X
≤ P[i(cj ; Y ) ≤ γ|W = j] + P[i(ck ; Y ) > γ|W = j].
k=1

Averaging over the randomly generated codebook, the expected error probability is upper bounded

184
by:
M
1 X
E[Pe (c1 , . . . , cM )] = P[Ŵ 6= j|W = j]
M
j=1

1 X 
M j−1
X
≤ P[i(X; Y ) ≤ γ] + P[i(X; Y ) > γ]
M
j=1 k=1
M −1
= P[i(X; Y ) ≤ γ] + P[i(X; Y ) > γ]
2
M −1
= P[i(X; Y ) ≤ γ] + E[exp(−i(X; Y ))1 {i(X; Y ) > γ}] (by (17.3))
2
h M −1 i
= E 1 {i(X; Y ) ≤ γ} + exp(−i(X; Y ))1 {i(X, Y ) > γ}
2
h  M −1 i M −1
= E min 1, exp(−i(X; Y )) (γ = log minimizes the upper bound)
   2  2
M − 1 +
= E exp − i(X; Y ) − log .
2

To optimize over γ, note the simple observation that U 1E + V 1{E c } ≥ min{U, V }, with equality iff
U ≥ V on E. Therefore for any x, y, 1[i(x; y) ≤ γ]+ M2−1 e−i(x;y) 1[i(x; y) > γ] ≥ min(1, M2−1 e−i(x;y) ),
achieved by γ = log M2−1 regardless of x, y.

Note: Dependence-testing: The RHS of (17.11) is equivalent to the minimum error probability of
the following Bayesian hypothesis testing problem:

H0 : X, Y ∼ PX,Y versus H1 : X, Y ∼ PX PY
2 M −1
prior prob.: π0 = , π1 = .
M +1 M +1

Note that X, Y ∼ PX,Y and X, Y ∼ PX PY , where X is the sent codeword and X is the unsent
codeword. As we know from binary hypothesis testing, the best threshold for the LRT to minimize
the weighted probability of error is log ππ10 .
Note: In Theorem 17.2 we have avoided minimizing over τ in Shannon’s bound (17.10) to get the
minimum upper bound in Theorem 17.1. Moreover, DT bound is stronger than the best Shannon’s
bound (with optimized τ ).
Note: Similar to the random coding achievability bound of almost lossless compression (Theorem
7.4), in Theorem 17.1 and Theorem 17.2 we only need the random codewords to be pairwise
independent.

17.4 Feinstein’s Lemma


The previous achievability results are obtained using probabilistic methods (random coding). In
contrast, the following achievability due to Feinstein uses a greedy construction. Moreover,
Feinstein’s construction holds for maximal probability of error.
Theorem 17.3 (Feinstein’s lemma). ∀PX , ∀γ > 0, ∀ ∈ (0, 1), ∃(M, )max -code such that

M ≥ γ( − P[i(X; Y ) < log γ]) (17.12)

185
Remark 17.6 (Comparison with Shannon’s bound). We can also interpret (17.12) as for fixed M ,
there exists an (M, )max -code that achieves the maximal error probability bounded as follows:
M
 ≤ P[i(X; Y ) < log γ] +
γ
Take log γ = log M + τ , this gives the bound of exactly the same form in (17.10). However, the
two are proved in seemingly quite different ways: Shannon’s bound is by random coding, while
Feinstein’s bound is by greedily selecting the codewords. Nevertheless, Feinstein’s bound is stronger
in the sense that it concerns about the maximal error probability instead of the average.
Proof. Recall the goal is to find codewords c1 , . . . , cM ∈ X and disjoint subsets (decoding regions)
D1 , . . . , DM ⊂ Y, s.t.
PY |X (Di |ci ) ≥ 1 − , ∀i ∈ [M ].
The idea is to construct a codebook of size M in a greedy way.
For every x ∈ X , associate it with a preliminary decode region defined as follows:
Ex , {y : i(x; y) ≥ log γ}
Notice that the preliminary decoding regions {Ex } may be overlapping, and we will trim them into
final decoding regions {Dx }, which will be disjoint.
We can assume that P[i(X; Y ) < log γ] ≤ , for otherwise the R.H.S. of (17.12) is negative and
there is nothing to prove. We first claim that there exists some c such that PY [Ec |X = c] ≥ 1 − .
Show by contradiction. Assume that ∀c ∈ X , P[i(c; Y ) ≥ log γ|X = c] < 1 − , then average over
c ∼ PX , we have P[i(X; Y ) ≥ log γ] < 1 − , which is a contradiction.
Then we construct the codebook in the following greedy way:
1. Pick c1 to be any codeword such that PY [Ec1 |X = c1 ] ≥ 1 − , and set D1 = Ec1 ;
2. Pick c2 to be any codeword such that PY [Ec2 \D1 |X = c2 ] ≥ 1 − , and set D2 = Ec2 \D1 ;
...
−1
3. Pick cM to be any codeword such that PY [EcM \ ∪M j=1 Dj |X = cM ] ≥ 1 − , and set DM =
−1
EcM \ ∪M
j=1 Dj . We stop if no more codeword can be found, i.e., M is determined by the
stopping condition:
∀c ∈ X , PY [Ex0 \ ∪M
j=1 Dj |X = c] < 1 − 

Averaging the stopping condition over c ∼ PX , we have


P({i(X; Y ) ≥ log γ}\{Y ∈ ∪M
j=1 Dj }) < 1 − 

by union bound P (A\B) ≥ P (A) − P (B), we have


M
X
P(i(X; Y ) ≥ log γ) − PY (Dj ) < 1 − 
j=1
M
⇒ P(i(X; Y ) ≥ log γ) − <1−
γ
where the last step makes use of the following key observation:
1
PY (Dj ) ≤ PY (Ecj ) = PY (i(cj ; Y ) ≥ log γ) < , (by Corollary 17.1).
γ

186
§ 18. Linear codes. Channel capacity

Recall that last time we showed the following achievability bounds:

Shannon’s: Pe ≤ P [i(X; Y ) ≤ log M + τ ] + exp{−τ }



   
M − 1 +
DT: Pe ≤ E exp − i(X; Y ) − log
2
Feinstein’s: Pe,max ≤ P [i(X; Y ) ≤ log M + τ ] + exp{−τ }

This time we shall use a shortcut to prove the above bounds and in which case Pe = Pe,max .

18.1 Linear coding


Recall the definition of Galois field from Section 9.2.
Definition 18.1 (Linear code). Let X = Y = Fnq , M = q k . Denote the codebook by C , {cu : u ∈
Fkq }. A code f : Fkq → Fnq is a linear code if ∀u ∈ Fkq , cu = uG (row-vector convention), where
G ∈ Fk×n
q is a generator matrix.
Proposition 18.1.

c∈C
⇔ c ∈ row span of G
⇔ c ∈ KerH, for some H ∈ Fq(n−k)×n s.t. HGT = 0.

Note: For linear codes, the codebook is a k-dimensional linear subspace of Fnq (ImG or KerH). The
matrix H is called a parity check matrix.
Example: (Hamming code) The [7, 4, 3]2 Hamming code over F2 is a linear code with G = [I; P ]
and H = [−P T ; I] is a parity check matrix.
 
1 0 0 0 1 1 0  
 0 1 1 0 1 1 0 0
1 0 0 1 0 1 
G=
 0
 H= 1 0 1 1 0 1 0 
0 1 0 0 1 1  x5
0 1 1 1 0 0 1
0 0 0 1 1 1 1
x2 x1
x4
Note:
x6 x7
x3
• Parity check: all four bits in the same circle sum up to zero.

• The minimum distance of this code is 3. Hence it can correct 1


bit of error and detect 2 bits of error.

187
Linear codes are almost always examined with channels of additive noise, a precise definition of
which is given below:
Definition 18.2 (Additive noise). PY |X is additive-noise over Fnq if

PY |X (y|x) = PZ n (y − x) ⇔ Y = X + Z n where Z n ⊥
⊥X

Now: Given a linear code and an additive-noise channel PY |X , what can we say about the
decoder?
Theorem 18.1. Any [k, n]Fq linear code over an additive-noise PY |X has a maximum likelihood
decoder g : Fnq → Fnq such that:

1. g(y) = y − gsynd (Hy T ), i.e., the decoder is a function of the “syndrome” Hy T only

2. Decoding regions are translates: Du = cu + D0 , ∀u

3. Pe,max = Pe ,
where gsynd : Fn−k
q → Fnq , defined by gsynd (s) = argmaxz:HxT =s PZ (z), is called the “syndrome
decoder”, which decodes the most likely realization of the noise.
Proof. 1. The maximum likelihood decoder for linear code is

g(y) = argmax PY |X (y|c) = argmax PZ (y − c) = y − argmax PZ (z),


c∈C c:HcT =0 z:Hz T =Hy T
| {z }
,gsynd (Hy T )

2. For any u, the decoding region

Du = {y : g(y) = cu } = {y : y−gsynd (Hy T ) = cu } = {y : y−cu = gsynd (H(y−cu )T )} = cu +D0 ,

where we used HcTu = 0 and c0 = 0.

3. For any u,

P[Ŵ 6= u|W = u] = P[g(cu +Z) 6= cu ] = P[cu +Z−gsynd (HcTu +HZ T ) 6= cu ] = P[gsynd (HZ T ) 6= Z].

Remark 18.1. The advantages of linear codes include at least


1. Low-complexity encoding

2. Slightly lower complexity ML decoding (syndrome decoding)

3. Under ML decoding, maximum probability of error = average probability of error. This is a


consequence of the symmetry of the codes. Note that this holds as long as the decoder is a
function of the syndrome only. As shown in Theorem 18.1, syndrome is a sufficient statistic
for decoding a linear code.
Theorem 18.2 (DT bounds for linear codes). Let PY |X be additive noise over Fnq . ∀k, ∃ a linear code f :
Fkq → Fnq with the error probability:
 +
h − n−k−log 1
n
i
q P n (Z )
Pe,max = Pe ≤ E q Z (18.1)

188
Remark 18.2. The analogy between Theorem 17.2 and Theorem 18.2 is the same as Theorem 9.4
and Theorem 9.5 (full random coding vs random linear codes).

Proof. Recall that in proving the Shannon’s achievability bound or the DT bound (Theorem 17.1
and 17.2), we select the code words c1 , . . . , cM i.i.d ∼ PX and showed that

M −1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y ) ≤ γ] + P[i(X; Y ) ≥ γ]
2
As noted after the proof of the DT bound, we only need the random codewords to be pairwise
independent. Here we will adopt a similar approach. Note that M = q k .
Let’s first do a quick check of the capacity achieving input distribution for PY |X with additive
noise over Fnq :

max I(X; Y ) = max H(Y ) − H(Y |X) = max H(Y ) − H(Z n ) = n log q − H(Z n ) ⇒ PX∗ uniform on Fnq
PX PX PX

We shall use the uniform distribution PX in the “random coding” trick.


Moreover, the optimal (MAP) decoder with uniform input is the ML decoder, whose decoding
regions are translational invariant by Theorem 18.1, namely Du = cu + D0 , ∀u, and therefore:

Pe,max = Pe = P[Ŵ 6= u|W = u], ∀u.

Step 1: Random linear coding with dithering:

cu = uG + h, ∀u ∈ Fkq

G and h are drawn from the new ensemble, where the k × n entries of G and the 1 × n
entries of h are i.i.d. uniform over Fq . We add the dithering to eliminate the special role
that the all-zero codeword plays (since it is contained in any linear codebook).

6 u0 , (cu , cu0 ) ∼ (X, X),


Step 2: Claim that the codewords are pairwise independent and uniform: ∀u =
2n
where PX,X (x, x) = 1/q . To see this:

cu ∼ uniform on Fnq
cu0 = u0 G + h = uG + h + (u0 − u)G = cu + (u0 − u)G

We claim that cu ⊥ ⊥ G because conditioned on the generator matrix G = G0 , cu ∼


uniform on Fnq due to the dithering h.
⊥ cu0 because conditioned on cu , (u0 − u)G ∼ uniform on Fnq .
We also claim that cu ⊥
Thus random linear coding with dithering indeed gives codewords cu , cu0 pairwise independent
and are uniformly distributed.

Step 3: Repeat the same argument in proving DT bound for the symmetric and pairwise independent
codewords, we have
M −1
E[Pe (c1 , . . . , cM )] ≤ P[i(X; Y ) ≤ γ] + P(i(X, Y ) ≥ γ)
2
+ +
M − 1 + q k −1
⇒Pe ≤ E[exp{− i(X; Y ) − log }] = E[q − i(X;Y )−logq 2 ] ≤ E[q − i(X;Y )−k
]
2
where we used M = q k and picked the base of log to be q.

189
Step 4: compute i(X; Y ):

PZ n (b − a) 1
i(a; b) = logq −n
= n − logq
q PZ n (b − a)

therefore
+
1
− n−k−logq
Pe ≤ E[q PZ n (Z n )
] (18.2)

Step 5: Kill h. We claim that there exists a linear code without dithering such that (18.2) is
satisfied. Indeed shifting a codebook has no impact on its performance. We modify the
coding scheme with G, h which achieves the bound in the following way: modify the
decoder input Y 0 = Y − h, then when cu is sent, the additive noise PY 0 |X becomes then
Y 0 = uG + h + Z n − h = uG + Z n , which is equivalent to that the linear code generated by
G is used. Note that this is possible thanks to the additivity of the noisy channel.

Remark 18.3. • The ensemble cu = uG + h has the pairwise independence property. The joint
entropy H(c1 , . . . , cM ) = H(G) + H(h) = (nk + n) log q is significantly smaller than Shannon’s
“fully random” ensemble we used in the previous lecture. Recall that in that ensemble each cj
was selected independently uniform over Fnq , implying H(c1 , . . . , cM ) = q k n log q. Question:

min H(c1 , . . . , cM ) =??

where minimum is over all distributions with P [ci = a, cj = b] = q −2n when i 6= j (pairwise
independent, uniform codewords). Note that H(c1 , . . . , cM ) ≥ H(c1 , c2 ) = 2n log q. Similarly,
we may ask for (ci , cj ) to be uniform over all pairs of distinct elements. In this case Wozencraft
ensemble for the case of n = 2k achieves H(c1 , . . . , cqk ) ≈ 2n log q.

• There are many different ensembles of random codebooks:


i.i.d.
– Shannon ensemble: C = {c1 , . . . , cM } ∼ PX – fully random
– Elias ensemble [Eli55]: C = {uG : u ∈ Fkq }, with the k × n generator matrix G uniformly
drawn at random.
– Gallager ensemble: C = {c : HcT = 0}, with the (n − k) × n parity-check matrix H
uniformly drawn at random.

• With some non-zero probability G may fail to be full rank [Exercise: Find P [rank(G) < k] as a
function of n, k, q!]. In such a case, there are two identical codewords and hence Pe,max ≥ 1/2.
There are two alternative ensembles of codes which do not contain such degenerate codebooks:

1. G ∼ uniform on all full rank matrices


2. search codeword cu ∈ KerH where H ∼ uniform on all n × (n − k) full row rank matrices.
(random parity check construction)

Analysis of random coding over such ensemble is similar, except that this time (X, X̄) have
distribution
1
PX,X̄ = 2n 1 0
q − q n {X6=X }
uniform on all pairs of distinct codewords and not pairwise independent.

190
18.2 Channels and channel capacity
Basic question of data transmission: How many bits can one transmit reliably if one is allowed to
use the channel n times?

• Rate = # of bits per channel use

• Capacity = highest achievable rate

Next we formalize these concepts.


Definition 18.3 (Channel). A channel is specified by:

• input alphabet A

• output alphabet B

• a sequence of random transformation kernels PY n |X n : An → B n , n = 1, 2, . . . .

• The parameter n is called the blocklength.

Note: we do not insist on PY n |X n to have any relation for different n, but it is common that
the conditional distribution of the first k letters of the n-th transformation is in fact a function of
only the first k letters of the input and this function equals PY k |X k – the k-th transformation. Such
channels, in particular, are non-anticipatory: channel outputs are causal functions of channel inputs.
Channel characteristics:

• A channel is discrete if A and B are finite.

• A channel is additive-noise if A = B are abelian group, and

Pyn |xn = PZ n (y n − xn ) ⇔ Y n = X n + Z n .

• A channel is memoryless if there Qexists a sequence {PXk |Yk , k = 1, . . .} of transformations


acting A → B such that PY n |X n = nk=1 PYk |Xk (in particular, the channels are compatible at
different blocklengths).
Q
• A channel is stationary memoryless if PY n |X n = nk=1 PY1 |X1 .

• DMC (discrete memoryless stationary channel): A DMC can be specified in two ways:

– an |A| × |B|-dimensional (row-stochastic) matrix PY |X where elements specify the transi-


tion probabilities
– a bipartite graph with edge weight specifying the transition probabilities.

Example:

191
Definition 18.4 (Fundamental Limits). For any channel,

• An (n, M, )-code is an (M, )-code for the n-th random transformation PY n |X n .

• An (n, M, )max -code is analogously defined for maximum probability of error.

The non-asymptotic fundamental limits are

M ∗ (n, ) = max{M : ∃ (n, M, )-code} (18.3)



Mmax (n, ) = max{M : ∃ (n, M, )max -code} (18.4)

Definition 18.5 (Channel capacity). The -capacity C and Shannon capacity C are
1
C , lim inf log M ∗ (n, )
n→∞ n
C = lim C .
→0+

Remark 18.4.

• This operational definition of the capacity represents the maximum achievable rate at which
one can communicate through a channel with probability of error less than . In other words,
for any R < C, there exists an (n, exp(nR), n )-code, such that n → 0.

• Typically, the -capacity behaves like the plot below on the left-hand side, where C0 is called
the zero-error capacity, which represents the maximal achievable rate with no error. Often
times C0 = 0, meaning without tolerating any error zero information can be transmitted. If C
is constant for all  (see plot on the right-hand side), then we say that the strong converse
holds (more on this later).

Cǫ Cǫ

strong converse
holds

Zero error b
C0
Capacity
ǫ ǫ
0 1 0 1

192
Proposition 18.2 (Equivalent definitions of C and C).

C = sup{R : ∀δ > 0, ∃n0 (δ), ∀n ≥ n0 (δ), ∃(n, 2n(R−δ) , ) code}


C = sup{R : ∀ > 0, ∀δ > 0, ∃n0 (δ, ), ∀n ≥ n0 (δ, ), ∃(n, 2n(R−δ) , ) code}

Proof. This trivially follows from applying the definitions of M ∗ (n, ) (DIY).

Question: Why do we define capacity C and C with respect to average probability of error, say,
(max)
C and C (max) ? Why not maximal probability of error? It turns out that these two definitions
are equivalent, as the next theorem shows.
Theorem 18.3. ∀τ ∈ (0, 1),

τ M ∗ (n, (1 − τ )) ≤ Mmax



(n, ) ≤ M ∗ (n, )

Proof. The second inequality is obvious, since any code that achieves a maximum error probability
 also achieves an average error probability of .
For the first inequality, take an (n, M, (1 − τ ))-code, and define the error probability for the j th
codeword as

λj , P[Ŵ 6= j|W = j]

Then X X X
M (1 − τ ) ≥ λj = λj 1{λj ≤} + λj 1{λj >} ≥ |{j : λj > }|.
Hence |{j : λj > }| ≤ (1 − τ )M . [Note that this is exactly Markov inequality!] Now by
removing those codewords1 whose λj exceeds , we can extract an (n, τ M, )max -code. Finally, take
M = M ∗ (n, (1 − τ )) to finish the proof.
(max)
Corollary 18.1 (Capacity under maximal probability of error). C = C for all  > 0 such
that C = C− . In particular, C (max) = C.2

Proof. Using the definition of M ∗ and the previous theorem, for any fixed τ > 0
1
C ≥ C(max) ≥ lim inf log τ M ∗ (n, (1 − τ )) ≥ C(1−τ )
n→∞ n
(max)
Sending τ → 0 yields C ≥ C ≥ C− .

18.3 Bounds on C ; Capacity of Stationary Memoryless Channels


Now that we have the basic definitions for C , we define another type of capacity, and show that
for a stationary memoryless channels, the two notions (“operational” and “information” capacity)
coincide.
Definition 18.6. The information capacity of a channel is
1
Ci = lim inf sup I(X n ; Y n )
n→∞ n PX n
1
This operation is usually referred to as expurgation which yields a smaller code by killing part of the codebook to
reach a desired property.
2
Notation: f (x−) , limy%x f (y).

193
Remark: This quantity is not the same as the Shannon capacity, and has no direct operational
interpretation as a quantity related to coding. Rather, it is best to think of this only as taking the
n-th random transformation in the channel, maximizing over input distributions, then normalizing
and looking at the limit of this sequence.
Next we give coding theorems to relate information capacity (information measures) to
Shannon capacity (operational quantity).
Ci
Theorem 18.4 (Upper Bound for C ). For any channel, ∀ ∈ [0, 1), C ≤ 1− and C ≤ Ci .

Proof. Recall the general weak converse bound, Theorem 16.4:


supPX n I(X n ; Y n ) + h()
log M ∗ (n, ) ≤
1−
Normalizing this by n the taking the lim inf gives

1 1 supPX n I(X n ; Y n ) + h() Ci


C = lim inf log M ∗ (n, ) ≤ lim inf =
n→∞ n n→∞ n 1− 1−

Next we give an achievability bound:


Theorem 18.5 (Lower Bound for C ). For a stationary memoryless channel, C ≥ Ci , for any
 ∈ (0, 1].
The following result follows from pairing the upper and lower bounds on C .
Theorem 18.6 (Shannon ’1948). For a stationary memoryless channel,

C = Ci = sup I(X; Y ). (18.5)


PX

Remark 18.5. The above result, known as Shannon’s Noisy Channel Theorem, is perhaps
the most significant result in information theory. For communications engineers, the major surprise
was that C > 0, i.e. communication over a channel is possible with strictly positive rate for any
arbitrarily small probability of error. This result influenced the evolution of communication systems
to block architectures that used bits as a universal currency for data, along with encoding and
decoding procedures.
Before giving the proof of Theorem 18.5, we show the second equality in (18.5). Notice that
Ci for stationary memoryless channels is easy to compute: Rather than solving an optimization
problem for each n and taking the limit of n → ∞, computing Ci boils down to maximizing mutual
information for n = 1. This type of result is known as “single-letterization” in information theory.
Proposition 18.3 (Memoryless input is optimal for memoryless channels).

• For memoryless channels,


n
X
n n
sup I(X ; Y ) = sup I(Xi ; Yi ).
PX n i=1 PXi

• For stationary memoryless channels,

Ci = sup I(X; Y ).
PX

194
Q Pn
Proof. Recall that for product kernels PY n |X n = PYi |Xi , we have I(X n ; Y n ) ≤ k=1 I(Xk ; Yk ),
with equality when Xi ’s are independent. Then
1
Ci = lim inf sup I(X n ; Y n ) = lim inf sup I(X; Y ) = sup I(X; Y ).
n→∞ n PX n n→∞ P
X PX

Proof of Theorem 18.5. ∀PX , and let PX n = PXn (iid product). Recall Shannon’s or Feinstein’s
achievability bound (Theorem 17.1 or 17.3): For any n, M and any τ > 0, there exists (n, M, n )-code,
s.t.
n ≤ P[i(X n ; Y n ) ≤ log M + τ ] + exp(−τ )
Here the information density is defined as
n
X n
X
dPY n |X n n n dPY |X
i(X n ; Y n ) = log (Y |X ) = log (Yk |Xk ) = i(Xk ; Yk ),
dPY n dPY
k=1 k=1

which is a sum of iid r.v.’s with mean I(X; Y ). Set log M = n(I(X; Y ) − 2δ) for δ > 0, and taking
τ = δn in Shannon’s bound, we have
hX n i
n→∞
n ≤ P i(Xk ; Yk ) ≤ nI(X; Y ) − δn + exp(−δn) −−−→ 0,
k=1

where the first term goes to zero by WLLN.


Therefore, ∀PX , ∀δ > 0, there exists a sequence of (n, Mn , n )-codes with n → 0 (where
log Mn = n(I(X; Y ) − 2δ)). Hence, for all n such that n ≤ 
log M ∗ (n, ) ≥ n(I(X; Y ) − 2δ)
And so
1
C = lim inf log M ∗ (n, ) ≥ I(X; Y ) − 2δ ∀PX , ∀δ
n→∞ n
Since this holds for all PX and all δ, we conclude C ≥ supPX I(X; Y ) = Ci .
Remark 18.6. Shannon’s noisy channel theorem (Theorem 18.6) shows that by employing codes
of large blocklength, we can approach the channel capacity arbitrarily close. Given the asymptotic
nature of this result (or any other asymptotic result), two natural questions are in order dealing
with the price to pay for reaching capacity:
1. The complexity of achieving capacity: Is it possible to find low-complexity encoders and
decoders with polynomial number of operations in the blocklength n which achieve the
capacity? This question is resolved by Forney in 1966 who showed that this is possible in
linear time with exponentially small error probability. His main idea is concatenated codes.
We will study the complexity question in detail later.
Note that if we are content with polynomially small probability of error, e.g., Pe = O(n−100 ),
then we can construct polynomial-time decodable codes as follows. First, it can be shown that
with rate strictly below capacity, the error probability of optimal codes decays exponentially
w.r.t. the blocklenth. Now divide the block of length n into shorter block of length C log n
and apply the optimal code for blocklength C log n with error probability n−101 . The by the
union bound, the whole block is has error with probability at most n−100 . The encoding and
exhaustive-search decoding are obviously polynomial time.

195
2. The speed of achieving capacity: Suppose we want to achieve 90% of the capacity, we want
to know how long do we need wait? The blocklength is a good proxy for delay. In other
words, we want to know how fast the gap to capacity vanish as blocklength grows. Shannon’s
theorem shows that the gap C − n1 log M ∗ (n, ) = o(1). Next theorem shows that under proper
conditions, the o(1) term is in fact O( √1n ).

The main tool in the proof of Theorem 18.5 is the WLLN. The lower bound C ≥ Ci in
Theorem 18.5 shows that log M ∗ (n, ) ≥ nC + o(n) (since normalizing by n and taking the liminf
must result in something ≥ C). If instead we do a more refined analysis using the CLT, we find
Theorem 18.7. For any stationary memoryless channel with C = maxPX I(X; Y ) (i.e. ∃PX∗ =
argmaxPX I(X; Y )) such that V = Var[i(X ∗ ; Y ∗ )] < ∞, then
√ √
log M ∗ (n, ) ≥ nC − nV Q−1 () + o( n),

where Q(·) is the complementary Gaussian CDF and Q−1 (·) is its functional inverse.

Proof. Writing the little-o notation in terms of lim inf, our goal is

log M ∗ (n, ) − nC
lim inf √ ≥ −Q−1 () = Φ−1 (),
n→∞ nV
where Φ(t) is the standard normal CDF.
Recall Feinstein’s bound

∃(n, M, )max : M ≥ β ( − P[i(X n ; Y n ) ≤ log β])



Take log β = nC + nV t, then applying the CLT gives
√  hX √ i
log M ≥ nC + nV t + log  − P i(Xk ; Yk ) ≤ nC + nV t

=⇒ log M ≥ nC + nV t + log ( − Φ(t)) ∀Φ(t) < 
log M − nC log( − Φ(t))
=⇒ √ ≥t+ √
nV nV
Where Φ(t) is the standard normal CDF. Taking the liminf of both sides

log M ∗ (n, ) − nC
lim inf √ ≥ t ∀t s.t. Φ(t) < 
n→∞ nV

Taking t % Φ−1 (), and writing the liminf in little o form completes the proof
√ √
log M ∗ (n, ) ≥ nC − nV Q−1 () + o( n)

196
18.4 Examples of DMC
Binary symmetric channels
C
δ̄
0 0 1 bit

δ
δ
δ
δ̄ 0 1 1
1 1 2

Y = X + Z, Z ∼ Bern(δ) ⊥
⊥X

Capacity of BSC:
C = sup I(X; Y ) = 1 − h(δ)
PX

Proof. I(X; X + Z) = H(X + Z) − H(X + Z|X) = H(X + Z) − H(Z) ≤ 1 − h(δ), with equality iff
X ∼ Bern(1/2).

Note: More generally, for all additive-noise channel over a finite abelian group G, C = supPX I(X; X+
Z) = log |G| − H(Z), achieved by uniform X.

Binary erasure channels


δ̄ C
0 0
δ 1 bit

δ
1 1 δ
δ̄ 0 1

BEC is a multiplicative channel: If we think about the in-


put X ∈ {±1}, and output Y ∈ {±1, 0}. Then equivalently
we can write Y = XZ with Z ∼ Bern(δ) ⊥ ⊥ X.

Capacity of BEC:
C = sup I(X; Y ) = 1 − δ bits
PX

Note: Without evaluating Shannon’s formula, it is clear thta C ≤ 1 − δ, because δ-fraction of the
message are lost, i.e., even if the encoder knows a priori where the erasures are going to occur, the
rate still cannot exceed 1 − δ.

Proof. Note that P (X = 0|Y = e) = P (X=0)δ


δ = P (X = 0). Therefore I(X; Y ) = H(X)−H(X|Y ) =
H(X) − H(X|Y = e) ≤ (1 − δ)H(X) ≤ 1 − δ, with equality iff X ∼ Bern(1/2).

197
18.5* Information Stability
We saw that C = Ci for stationary memoryless channels, but what other channels does this hold
for? And what about non-stationary channels? To answer this question, we introduce the notion of
information stability.
Definition 18.7. A channel is called information stable if there exists a sequence of input distribu-
tion {PX n , n = 1, 2, . . .} such that
1 P
i(X n ; Y n )−
→ Ci .
n
For example, we can pick PX n = (PX∗ )n for stationary memoryless channels. Therefore stationary
memoryless channels are information stable.
The purpose for defining information stability is the following theorem.
Theorem 18.8. For an information stable channel, C = Ci .
Proof. Like the stationary, memoryless case, the upper bound comes from the general converse Theo-
rem 16.4, and the lower bound uses a similar strategy as Theorem 18.5, except utilizing the definition
of information stability in place of WLLN.
The next theorem gives conditions to check for information stability in memoryless channels
which are not necessarily stationary.
Theorem 18.9. A memoryless channel is information stable if there exists {Xk∗ : k ≥ 1} such that
both of the following hold:
n
1X
I(Xk∗ ; Yk∗ ) → Ci (18.6)
n
k=1

X 1
V ar[i(Xn∗ ; Yn∗ )] < ∞ . (18.7)
n2
n=1

In particular, this is satisfied if

|A| < ∞ or |B| < ∞ (18.8)

Proof. To show the first part, it is sufficient to prove


" n #
1 X ∗ ∗ ∗


P i(Xk ; Yk ) − I(Xk , Yk ) > δ → 0
n
k=1
1 n n
So that n i(X ; Y ) → Ci in probability. We bound this by Chebyshev’s inequality
" n #
1 Pn
1 X ∗ ∗ ∗ ∗

n2
∗ ∗
k=1 Var[i(Xk ; Yk )]
P i(Xk ; Yk ) − I(Xk , Yk ) > δ ≤ → 0,
n δ2
k=1

where convergence to 0 follows from Kronecker lemma (Lemma 18.1 to follow) applied with
bn = n2 , xn = Var[i(Xn∗ ; Yn∗ )]/n2 .
The second part follows from the first. Indeed, notice that
n
1X
Ci = lim inf sup I(Xk ; Yk ) .
n→∞ n P Xk
k=1

198
Now select PXk∗ such that
I(Xk∗ ; Yk∗ ) ≥ sup I(Xk ; Yk ) − 2−k .
P Xk

(Note that each supPX I(Xk ; Yk ) ≤ log min{|A|, |B|} < ∞.) Then, we have
k

n
X n
X
I(Xk∗ ; Yk∗ ) ≥ sup I(Xk ; Yk ) − 1 ,
k=1 k=1 PXk

and hence normalizing by n we get (18.6). We next show that for any joint distribution PX,Y we
have
Var[i(X; Y )] ≤ 2 log2 (min(|A|, |B|)) . (18.9)
The argument is symmetric in X and Y , so assume for concreteness that |B| < ∞. Then

E[i2 (X; Y )]
Z X  
2 2
, dPX (x) PY |X (y|x) log PY |X (y|x) + log PY (y) − 2 log PY |X (y|x) · log PY (y)
A y∈B
Z X  
≤ dPX (x) PY |X (y|x) log2 PY |X (y|x) + log2 PY (y) (18.10)
A y∈B
   
Z X X
= dPX (x)  PY |X (y|x) log2 PY |X (y|x) +  PY (y) log2 PY (y)
A y∈B y∈B
Z
≤ dPX (x)g(|B|) + g(|B|) (18.11)
A
=2g(|B|) ,

where (18.10) is because 2 log PY |X (y|x) · log PY (y) is always non-negative, and (18.11) follows
because each term in square-brackets can be upper-bounded using the following optimization
problem:
X n
g(n) , sup aj log2 aj . (18.12)
P n
aj ≥0: j=1 aj =1 j=1

Since the x log2 x has unbounded derivative at the origin, the solution of (18.12) is always in the
interior of [0, 1]n . Then it is straightforward to show that for n > e the solution is actually aj = n1 .
For n = 2 it can be found directly that g(2) = 0.5629 log2 2 < log2 2. In any case,

2g(|B|) ≤ 2 log2 |B| .

Finally, because of the symmetry, a similar argument can be made with |B| replaced by |A|.

P∞ (Kronecker Lemma). Let a sequence 0 < bn % ∞ and a non-negative sequence {xn }


Lemma 18.1
such that n=1 xn < ∞, then
n
1 X
bj xj → 0
bn
j=1

199
Proof. Since bn ’s are strictly increasing, we can split up the summation and bound them from above
n
X m
X n
X
bk xk ≤ bm xk + bk xk
k=1 k=1 k=m+1

Now throw in the rest of the xk ’s in the summation


n ∞ n ∞ ∞
1 X bm X X bk bm X X
=⇒ bk xk ≤ xk + xk ≤ xk + xk
bn bn bn bn
k=1 k=1 k=m+1 k=1 k=m+1
n
X ∞
X
1
=⇒ lim bk xk ≤ xk → 0
n→∞ bn
k=1 k=m+1

Since this holds for any m, we can make the last term arbitrarily small.

Important example: For jointly Gaussian (X, Y ) we always have bounded variance:

cov[X, Y ]
Var[i(X; Y )] = ρ2 (X, Y ) log2 e ≤ log2 e , ρ(X, Y ) = p . (18.13)
Var[X] Var[Y ]

Indeed, first notice that we can always represent Y = X̃ + Z with X̃ = aX ⊥⊥ Z. On the other
hand, we have  
log e x̃2 + 2x̃z σ2 2
i(x̃; y) = − 2 2z , z , y − x̃ .
2 σY2 σY σZ
From here by using Var[·] = Var[E[·|X̃]] + Var[·|X̃] we need to compute two terms separately:
 2

X̃ 2 − σX̃
log e  2
σZ 
E[i(X̃; Y )|X̃] =  2 ,
2 σY

and hence
2 log2 e 4
Var[E[i(X̃; Y )|X̃]] = σX̃ .
4σY4
On the other hand,
2 log2 e 2 2 4
Var[i(X̃; Y )|X̃] = [4σX̃ σZ + 2σX̃ ].
4σY4
Putting it all together we get (18.13). Inequality (18.13) justifies information stability of all sorts of
Gaussian channels (memoryless and with memory), as we will see shortly.

200
§ 19. Channels with input constraints. Gaussian channels.

19.1 Channel coding with input constraints


Motivations: Let us look at the additive Gaussian noise. Then the Shannon capacity is infinite,
since supPX I(X; X + Z) = ∞ achieved by X ∼ N (0, P ) and P → ∞. But this is at the price of
infinite second moment. In reality, limitation of transmission power ⇒ constraints on the encoding
operations ⇒ constraints on input distribution.
Definition 19.1. An (n, M, )-code satisfies the input constraint Fn ⊂ An if the encoder is
f : [M ] → Fn . (Without constraint, the encoder maps into An ).

An
b b
b
b Fn b
b b
b b b
b b
b

Codewords all land in a subset of An

Definition 19.2 (Separable cost constraint). A channel with separable cost constraint is specified
as follows:
1. A, B: input/output spaces
2. PY n |X n : An → B n , n = 1, 2, . . .
3. Cost function c : A → R̄
Input constraint: average per-letter cost of a codeword xn (with slight abuse of notation)
n
1X
c(xn ) = c(xk ) ≤ P
n
k=1

Example: A = B = R
• Average power constraint (separable):
n
1X √
|xi |2 ≤ P ⇔ kxn k2 ≤ nP
n
i=1

• Peak power constraint (non-separable):1


max |xi | ≤ A ⇔ kxn k∞ ≤ A
1≤i≤n
1
In fact, this reduces to the case without cost with input space replaced by [−A, A].

201
Definition 19.3. Some basic definitions in parallel with the channel capacity without input
constraint.

• APcode is an (n, M, , P )-code if it is an (n, M, )-code satisfying input constraint Fn , {xn :


1
n c(xk ) ≤ P }

• Finite-n fundamental limits:

M ∗ (n, , P ) = max{M : ∃(n, M, , P )-code}



Mmax (n, , P ) = max{M : ∃(n, M, , P )max -code}

• -capacity and Shannon capacity


1
C (P ) = lim inf log M ∗ (n, , P )
n→∞ n
C(P ) = lim C (P )
↓0

• Information capacity
1
Ci (P ) = lim inf sup I(X n ; Y n )
n→∞ n PX n :E[Pnk=1 c(Xk )]≤nP

• Information stability: Channel is information stable if for all (admissible) P , there exists a
sequence of channel input distributions PX n such that the following two properties hold:
1 P
iPX n ,Y n (X n ; Y n )−
→Ci (P ) (19.1)
n
P[c(X n ) > P + δ] → 0 ∀δ > 0 . (19.2)

Note: These are the usual definitions, except that in Ci (P ), we are Pn permitted to maximize
I(X ; Y ) using input distributions from the constraint set {PX n : E[ k=1 c(Xk )] ≤ nP } instead
n n

of the distributions supported on Fn .


Definition 19.4 (Admissible constraint). We say P is an admissible constraint if ∃x0 ∈ A
s.t. c(x0 ) ≤ P , or equivalently, ∃PX : E[c(X)] ≤ P . The set of admissible P ’s is denoted by
Dc , and can be either in the form (P0 , ∞) or [P0 , ∞), where P0 , inf x∈A c(x).
Clearly, if P ∈
/ Dc , then there is no code (even a useless one, with 1 codeword) satisfying the
input constraint. So in the remaining we always assume P ∈ Dc .
Proposition 19.1. Define f (P ) = supPX :E[c(X)]≤P I(X; Y ). Then

1. f is concave and non-decreasing. The domain of f is dom f , {x : f (x) > −∞} = Dc .

2. One of the following is true: f (P ) is continuous and finite on (P0 , ∞), or f = ∞ on (P0 , ∞).

Furthermore, both properties hold for the function P 7→ Ci (P ).

202
Proof. In the first part all statements are obvious, except for concavity, which follows from the
concavity of PX 7→ I(X; Y ). For any PXi such that E [c(Xi )] ≤ Pi , i = 0, 1, let X ∼ λ̄PX0 + λPX1 .
Then E [c(X)] ≤ λ̄P0 + λP1 and I(X; Y ) ≥ λ̄I(X0 ; Y0 ) + λI(X1 ; Y1 ). Hence f (λ̄P0 + λP1 ) ≥
λ̄f (P0 ) + λf (P1 ). The second claim follows from concavity of f (·).
To extend these results to Ci (P ) observe that for every n
1
P 7→ sup I(X n ; Y n )
n PX n :E[c(X n )]≤P

is concave. Then taking lim inf n→∞ the same holds for Ci (P ).

An immediate consequence is that memoryless input is optimal for memoryless channel with
separable cost, which gives us the single-letter formula of the information capacity:
Corollary 19.1 (Single-letterization). Information capacity of stationary memoryless channel with
separable cost:
Ci (P ) = f (P ) = sup I(X; Y ).
E[c(X)]≤P

Proof. Ci (P ) ≥ f (P ) is obvious by using PX n = (PX )n . For “≤”, use the concavity of f (·), we have
that for any PX n ,
 
Xn n
X X n
1
I(X n ; Y n ) ≤ I(Xj ; Yj ) ≤ f (E[c(Xj )])≤nf  E[c(Xj )] ≤ nf (P ).
n
j=1 j=1 j=1

?
19.2 Capacity under input constraint C(P ) = Ci (P )
Theorem 19.1 (General weak converse).

Ci (P )
C (P ) ≤
1−

Proof. The argument is the same as before: Take any (n, M, , P )-code, W → X n → Y n → Ŵ .
Apply Fano’s inequality, we have

−h() + (1 − ) log M ≤ I(W ; Ŵ ) ≤ I(X n ; Y n ) ≤ sup I(X n ; Y n ) ≤ nf (P )


PX n :E[c(X n )]≤P

Theorem 19.2 (Extended Feinstein’s Lemma). Fix a random transformation PY |X . ∀PX , ∀F ⊂


X , ∀γ > 0, ∀M , there exists an (M, )max -code with:

• Encoder satisfies the input constraint: f : [M ] → F ⊂ X ;

• Probability of error bound:


M
PX (F ) ≤ P[i(X; Y ) < log γ] +
γ

203
Note: when F = X , it reduces to the original Feinstein’s Lemma.

Proof. Similar to the proof of the original Feinstein’s Lemma, define the preliminary decoding
regions Ec = {y : i(c; y) ≥ log γ} for all c ∈ X . Sequentially pick codewords {c1 , . . . , cM } from
the set F and the final decoding region {D1 , . . . , DM } where Dj , Ecj \ ∪j−1
k=1 Dk . The stopping
criterion is that M is maximal, i.e.,

∀x0 ∈ F, PY [Ex0 \ ∪M
j=1 Dj X = x0 ] < 1 − 

⇔ ∀x0 ∈ X , PY [Ex \ ∪M c
j=1 Dj X = x0 ] < (1 − )1[x0 ∈ F ] + 1[x0 ∈ F ]
0

⇒ average over x0 ∼ PX , P[{i(X; Y ) ≥ log γ}\ ∪M c


j=1 Dj ] ≤ (1 − )PX (F ) + PX (F ) = 1 − PX (F )

From here, we can complete the proof by following the same steps as in the proof of Feinstein’s
lemma (Theorem 17.3).

Theorem 19.3 (Achievability). For any information stable channel with input constraints and
P > P0 we have
C(P ) ≥ Ci (P ) (19.3)

Proof. Let us consider a special case of the stationary memoryless channel (the proof for general
information stable channel follows similarly). So we assume PY n |X n = (PY |X )n .
Fix n ≥ 1. Since the channel is stationary memoryless, we have PY n |X n = (PY |X )n . Choose a
PX such that E[c(X)] < P , Pick log M = n(I(X; P Y ) − 2δ) and log γ = n(I(X; Y ) − δ).
With the input constraint set Fn = {xn : n1 c(xk ) ≤ P }, and iid input distribution PX n = PXn ,
we apply the extended Feinstein’s Lemma, there exists an (n, M, n , P )max -code with the encoder
satisfying input constraint F and the error probability

n PX (F ) ≤ P (i(X n ; Y n ) ≤ n(I(X; Y ) − δ)) + exp(−nδ)


| {z } | {z } | {z }
→1 →0 as n→∞ by WLLN and stationary memoryless assumption →0

P
Also, since E[c(X)] < P , by WLLN, we have PX n (Fn ) = P ( n1 c(xk ) ≤ P ) → 1.

n (1 + o(1)) ≤ o(1)
⇒ n → 0 as n → ∞
⇒ ∀, ∃n0 , s.t. ∀n ≥ n0 , ∃(n, M, n , P )max -code, with n ≤ 

Therefore
1
C (P ) ≥ log M = I(X; Y ) − 2δ, ∀δ > 0, ∀PX s.t. E[c(X)] < P
n
⇒ C (P ) ≥ sup lim (I(X; Y ) − 2δ)
PX :E[c(X)]<P δ→0

⇒ C (P ) ≥ sup I(X; Y ) = Ci (P −) = Ci (P )
PX :E[c(X)]<P

where the last equality is from the continuity of Ci on (P0 , ∞) by Proposition 19.1. Notice
that for general information stable channel, we just need to use the definition to show that
P (i(X n ; Y n ) ≤ n(Ci − δ)) → 0, and all the rest follows.

204
Theorem 19.4 (Shannon capacity). For an information stable channel with cost constraint and
for any admissible constraint P we have

C(P ) = Ci (P ).

Proof. The case of P = P0 is treated in the homework. So assume P > P0 . Theorem 19.1
shows C (P ) ≤ C1−
i (P )
, thus C(P ) ≤ Ci (P ). On the other hand, from Theorem 19.3 we have
C(P ) ≥ Ci (P ).

Note: In homework, you will show that C(P0 ) = Ci (P0 ) also holds, even though Ci (P ) may be
discontinuous at P0 .

19.3 Applications
19.3.1 Stationary AWGN channel

Z ∼ N (0, σ 2 )

X Y
+

Definition 19.5 (AWGN). The additive Gaussian noise (AWGN) channel is a stationary memoryless
additive-noise channel with separable cost constraint: A = B = R, c(x) = x2 , PY |X is given by
⊥ X, and average power constraint EX 2 ≤ P .
Y = X + Z, where Z ∼ N (0, σ 2 ) ⊥
In other words, Y = X + Z , where Z n ∼ N (0, In ).
n n n

Gaussian
Note: Here “white” = uncorrelated = independent.
Note: Complex AWGN channel is similarly defined: A = B = C, c(x) = |x|2 , and Z n ∼ CN (0, In )
Theorem 19.5. For stationary (C)-AWGN channel, the channel capacity is equal to information
capacity, and is given by:
 
1 P
C(P ) = Ci (P ) = log 1 + 2 for AWGN
2 σ
 
P
C(P ) = Ci (P ) = log 1 + 2 for C-AWGN
σ

Proof. By Corollary 19.1,


Ci = sup I(X; X + Z)
PX :EX 2 ≤P

Then use Theorem 4.6 (Gaussian saddle point) to conclude X ∼ N (0, P ) (or CN (0, P )) is the
unique caid.

205
Note: Since Z n ∼ N (0, σ 2√ ), then with high probability,
kZ n k2 concentrates around nσ 2 . Similarly, duethe power 
constraint and the fact that Z n ⊥
⊥ X n , we have E kY n k2 = c3
   
E kY n k2 + E kZ n k2 ≤ n(P + σ 2 ) and the received vector c4
p

p n
√ c2
Y n lies in an `2 -ball of radius approximately n(P + 2
√ σ ). nσ 2

(P
c1

+
Since the noise can at most perturb the codeword by √nσ 2

σ
2
)
in Euclidean distance, if we 2
p can pack M balls of radius nσ c5
c8
2
into the `2 -ball of radius n(P + σ ) centered at the origin,
···
then this gives a good codebook and decision regions. The
packing number is related to the volume ratio. Note that c6
c7
the volume of an `2 -ball of radius r in Rn is given by cn rn
+σ 2 ))n/2 n/2 cM
for some constant cn . Then cn (n(P cn (nσ 2 )n/2
= 1 + σP2 .

Take the log and divide by n, we get 21 log 1 + σP2 .
Why the above is not a proof for either achievability or converse?

• Packing number is not necessarily given by the volume ratio.

• Codewords need not correspond to centers of disjoint `2 -balls.

Theorem 19.5 applies to Gaussian noise. What if the noise is non-Gaussian and how sensitive is
the capacity formula 12 log(1 + SNR) to the Gaussian assumption? Recall the Gaussian saddlepoint
result we have studied in Lecture 4 where we showed that for the same variance, Gaussian noise
is the worst which shows that the capacity of any non-Gaussian noise is at least 12 log(1 + SNR).
Conversely, it turns out the increase of the capacity can be controlled by how non-Gaussian the
noise is (in terms of KL divergence). The following result is due to Ihara.
Theorem 19.6 (Additive Non-Gaussian noise). Let Z be a real-valued random variable independent
of X and EZ 2 < ∞. Let σ 2 = Var Z. Then
   
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (EZ, σ 2 )).
2 σ PX :EX 2 ≤P 2 σ

Proof. Homework.

Note: The quantity D(PZ kN (EZ, σ 2 )) is sometimes called the non-Gaussianness of Z, where
N (EZ, σ 2 ) is a Gaussian with the same mean and variance as Z. So if Z has a non-Gaussian density,
say, Z is uniform on [0, 1], then the capacity can only differ by a constant compared to AWGN,
which still scales as 12 log SNR in the high-SNR regime. On the other hand, if Z is discrete, then
D(PZ kN (EZ, σ 2 )) = ∞ and indeed in this case one can show that the capacity is infinite because
the noise is “too weak”.

19.3.2 Parallel AWGN channel


Definition 19.6 (Parallel
P AWGN). A parallel AWGN channel with L branches is defined as follows:
A = B = RL ; c(x) = L k=1 |x k |2; P 2
Y L |X L : Yk = Xk + Zk , for k = 1, . . . , L, and Zk ∼ N (0, σk ) are
independent for each branch.

206
Theorem 19.7 (Waterfilling). The capacity of L-parallel AWGN channel is given by
L
1X + T
C = log
2
j=1
σj2

where log+ (x) , max(log x, 0), and T ≥ 0 is determined by


L
X
P = |T − σj2 |+
j=1

Proof.

Ci (P ) = sup I(X L ; Y L )
E[Xi2 ]≤P
P
PX L :
L
X
≤ sup sup I(Xk ; Yk )
Pk ≤P,Pk ≥0 k=1 E[Xk2 ]≤Pk
P

L
X 1 Pk
= sup log(1 + )
P
Pk ≤P,Pk ≥0 k=1 2 σk2

with equality if Xk ∼ N (0, Pk ) are independent. So the question boils down to the last
Pmaximization
problem – power allocation: Denote the Lagragian multipliers P1 for the constraint Pk ≤ PPby λ
Pk
and for the constraint Pk ≥ 0 by µk . We want to solve max 2 log(1 + σ2 ) − µk Pk + λ(P − Pk ).
k
First-order condition on Pk gives that
1 1
2 = λ − µk , µ k P k = 0
2 σ k + Pk

therefore the optimal solution is


L
X
Pk = |T − σk2 |+ , T is chosen such that P = |T − σk2 |+
k=1

Note: The figure illustrates the power allocation via water-filling. In this particular case, the second
branch is too noisy (σ2 too big) such that it is better be discarded, i.e., the assigned power is zero.

Note: [Significance of the waterfilling theorem] In the high SNR regime, the capacity for 1 AWGN
channel is approximately 12 log P , while the capacity for L parallel AWGN channel is approximately

207
L
2 log( PL ) ≈ L2 log P for large P . This L-fold increase in capacity at high SNR regime leads to the
powerful technique of spatial multiplexing in MIMO.
Also notice that this gain does not come from multipath diversity. Consider the scheme that a
single stream of data is sent through every parallel channel simultaneously, with multipath diversity,
the effective noise level is reduced to L1 , and the capacity is approximately log(LP ), which is much
smaller than L2 log( PL ) for P large.

19.4* Non-stationary AWGN


Definition 19.7 (Non-stationary AWGN). A non-stationary AWGN channel is defined as follows:
A = B = R, c(x) = x2 , PYj |Xj : Yj = Xj + Zj , where Zj ∼ N (0, σj2 ).
Theorem 19.8. Assume that for every T the following limits exist:
n
1X1 T
C̃i (T ) = lim log+ 2
n→∞ n 2 σj
j=1
n
1X
P̃ (T ) = lim |T − σj2 |+
n→∞ n
j=1

then the capacity of the non-stationary AWGN channel is given by the parameterized form: C(T ) =
C̃i (T ) with input power constraint P̃ (T ).
Proof. Fix T > 0. Then it is clear from the waterfilling solution that
n
X 1 T
sup I(X n ; Y n ) = log+ , (19.4)
j=1
2 σj2

where supremum is over all PX n such that


n
1X
E[c(X )] ≤
n
|T − σj2 |+ . (19.5)
n
j=1

Now, by assumption, the LHS of (19.5) converges to P̃ (T ). Thus, we have that for every δ > 0
Ci (P̃ (T ) − δ) ≤ C̃i (T ) (19.6)
Ci (P̃ (T ) + δ) ≥ C̃i (T ) (19.7)
Taking δ → 0 and invoking continuity of P 7→ Ci (P ), we get that the information capacity satisfies
Ci (P̃ (T )) = C̃i (T ) .
The channel is information stable. Indeed, from (18.13)
log2 e Pj log2 e
Var(i(Xj ; Yj )) = ≤
2 Pj + σj2 2
and thus
n
X 1
Var(i(Xj ; Yj )) < ∞ .
n2
j=1
From here information stability follows via Theorem 18.9.
Note: Non-stationary AWGN is primarily interesting due to its relationship to the stationary
Additive Colored Gaussian noise channel in the following discussion.

208
19.5* Stationary Additive Colored Gaussian noise channel
Definition 19.8 (Additive colored Gaussian noise channel ). An Additive Colored Gaussian noise
channel is defined as follows: A = B = R, c(x) = x2 , PYj |Xj : Yj = Xj + Zj , where Zj is a stationary
Gaussian process with spectral density fZ (ω) > 0, ω ∈ [−π, π].
Theorem 19.9. The capacity of stationary ACGN channel is given by the parameterized form:
Z 2π
1 1 T
C(T ) = log+ dω
2π 0 2 fZ (ω)
Z 2π +
1
P (T ) = T − fZ (ω) dω
2π 0

Proof. Take n ≥ 1, consider the diagonalization of the covariance matrix of Z n :


e
Cov(Z n ) = Σ = U ∗ ΣU, e = diag(σ1 , . . . , σn )
such that Σ
e n = U X n and Ye n = U Y n ,
Since Cov(Z n ) is positive semi-definite, U is a unitary matrix. Define X
e n e n
the channel between X and Y is thus
Ye n = X
e n + U Z n,
e
Cov(U Z n ) = U Cov(Z n )U ∗ = Σ
Therefore we have the equivalent channel as follows:
Ye n = X
en + Z
en , Z
en ∼ N (0, σ 2 ) indep across j
j j

By Theorem 19.8, we have that


X n Z 2π
e = lim 1
C
T
log+ 2 =
1 1
log+
T
dω. ( by Szegö, Theorem 5.6)
n→∞ n σj 2π 0 2 fZ (ω)
j=1
n
1X
lim |T − σj2 |+ = P (T )
n→∞ n
j=1

e
Finally since U is unitary, C = C.

Note: Noise is born white, the colored noise is essentially due to some filtering.

209
19.6* Additive White Gaussian Noise channel with Intersymbol
Interference
Definition 19.9 (AWGN with ISI). An AWGN channel with ISI is defined as follows: A = B = R,
c(x) = x2 , and the channel law PY n |X n is given by
n
X
Yk = hk−j Xj + Zk , k = 1, . . . , n
j=1

where Zk ∼ N (0, 1) is white Gaussian noise, {hk , k = −∞, . . . , ∞} are coefficients of a discrete-time
channel filter.
Theorem 19.10. Suppose that the sequence {hk } is an inverse Fourier transform of a frequency
response H(ω):
Z 2π
1
hk = eiωk H(ω)dω .
2π 0
Assume also that H(ω) is a continuous function on [0, 2π]. Then the capacity of the AWGN channel
with ISI is given by
Z 2π
1 1
C(T ) = log+ (T |H(ω)|2 )dω
2π 0 2
Z 2π +
1 1
P (T ) = T− dω
2π 0 |H(ω)|
2

1
Proof. (Sketch) At the decoder apply the inverse filter with frequency response ω 7→ H(ω) . The
equivalent channel then becomes a stationary colored-noise Gaussian channel:
Ỹj = Xj + Z̃j ,
where Z̃j is a stationary Gaussian process with spectral density
1
fZ̃ (ω) = .
|H(ω)|2
Then apply Theorem 19.9 to the resulting channel.
Remark: to make the above argument rigorous one must carefully analyze the non-zero error
introduced by truncating the deconvolution filter to finite n.

19.7* Gaussian channels with amplitude constraints


We have examined some classical results of additive Gaussian noise channels. In the following, we
will list some more recent results without proof.
Theorem 19.11 (Amplitude-constrained capacity of AWGN channel). PFor an AWGN channel
Yi = Xi + Zi with amplitude constraint |Xi | ≤ A and energy constraint ni=1 Xi2 ≤ nP , we denote
the capacity by:
C(A, P ) = max I(X; X + Z).
PX :|X|≤A,E|X|2 ≤P

Capacity achieving input distribution PX∗ is discrete, with finitely many atoms on [−A, A]. Moreover,
2
the convergence speed of limA→∞ C(A, P ) = 12 log(1 + P ) is of the order e−O(A ) .
For details, see [Smi71] and [PW14, Section III].

210
19.8* Gaussian channels with fading
Fading channels are often used to model the urban signal propagation with multipath or shadowing.
The received signal Yi is modeled to be affected by multiplicative fading coefficient Hi and additive
noise Zi :
Yi = Hi Xi + Zi , Zi ∼ N (0, 1)
In the coherent case (also known as CSIR – for channel state information at the receiver), the
receiver has access to the channel state information of Hi , i.e. the channel output is effectively
(Yi , Hi ). Whenever Hj is a stationary ergodic process, we have the channel capacity given by:
 
1
C(P ) = E 2
log(1 + P |H| )
2

and the capacity achieving input distribution is the usual PX = N (0, P ). Note that the capacity
C(P ) is in the order of log P and we call the channel “energy efficient”.
In the non-coherent case where the receiver does not have the information of Hi , no simple
expression for the channel capacity is known. It is known, however, that the capacity achieving
input distribution is discrete [AFTS01], and the capacity scales as [TE97, LM03]

C(P ) = O(log log P ), P →∞ (19.8)

This channel is said to be “energy inefficient”.

With introduction of multiple antenna channels, there are endless variations, theoretical open
problems and practically unresolved issues in the topic of fading channels. We recommend consulting
the textbook [TV05] for details.

211
§ 20. Lattice codes (by O. Ordentlich)

Consider the n-dimensional additive white Gaussian noise (AWGN) channel


Y =X+Z
where Z ∼ N (0, In×n ) is statistically independent of the input X. Our goal is to communicate
reliably over this channel, under the power constraint
1
kXk2 ≤ SNR
n
where SNR is the signal-to-noise-ratio. The capacity of the AWGN channel is
1
C= 2 log(1 + SNR) bits/channel use,
and is achieved with high probability by a codebook drawn at random from the Gaussian i.i.d. ensem-
ble. However, a typical codebook from this ensemble has very little structure, and is therefore not
applicable for practical systems. A similar problem occurs in discrete additive memoryless stationary
channels, e.g., BSC, where most members of the capacity achieving i.i.d. uniform codebook ensemble
have no structure. In the discrete case, engineers resort to linear codes to circumvent the lack of
structure. Lattice codes are the Euclidean space counterpart of linear codes, and as we shall see,
enable to achieve the capacity of the AWGN channel with much more structure than random codes.
In fact, we will construct a lattice code with rate that approaches 12 log(1 + SNR) that is guaranteed
to achieve small error probability for essentially all additive noise channels with the same noise
second moment. More precisely, our scheme will work if the noise vector Z is semi norm-ergodic.
Definition 20.1. We say that a sequence in n of random noise vectors Z(n) of length n with (finite)
2 , 1 EkZ(n) k2 , is semi norm-ergodic if for any , δ > 0 and n large enough
effective variance σZ n
 q 
(n) 2
Pr Z ∈ / B( (1 + δ)nσZ ≤ , (20.1)

where B(r) is an n-dimensional ball of radius r.

20.1 Lattice Definitions


A lattice Λ is a discrete subgroup of Rn which is closed under reflection and real addition. Any
lattice Λ in Rn is spanned by some n × n matrix G such that
Λ = {t = Ga : a ∈ Zn }.
We will assume G is full-rank. Denote the nearest neighbor quantizer associated with the lattice Λ
by
QΛ (x) , argmin kx − tk, (20.2)
t∈Λ

212
where ties are broken in a systematic manner. We define the modulo operation w.r.t. a lattice Λ as

[x] mod Λ , x − QΛ (x),

and note that it satisfies the distributive law,


 
[x] mod Λ + y mod Λ = [x + y] mod Λ.

The basic Voronoi region of Λ, denoted by V, is the set of all points in Rn which are quantized
to the zero vector. The systematic tie-breaking in (20.2) ensures that
[
· (V + t) = Rn ,
t∈Λ
S
where · denotes disjoint union. Thus, V is a fundamental cell of Λ.
Definition 20.2. A measurable set S ∈ Rn is called a fundamental cell of Λ if
[
· (S + t) = Rn .
t∈Λ

We denote the volume of a set S ∈ Rn by Vol(S).


Proposition 20.1. If S is a fundamental cell of Λ, then Vol(S) = Vol(V). Furthermore

S mod Λ = {[s] mod Λ : s ∈ S} = V.

Proof ([Zam14]). For any t ∈ Λ define

At , S ∩ (t + V); Dt , V ∩ (t + S).

Note that

Dt = [(−t + V) ∩ S] + t
= A−t + t.

Thus
X X X
Vol(S) = Vol(At ) = Vol(A−t + t) = Vol(Dt ) = Vol(V).
t∈Λ t∈Λ t∈Λ

Moreover
[ [ [
S = · At = · A−t = · Dt − t,
t∈Λ t∈Λ t∈Λ

and therefore
[
[S] mod Λ = · Dt = V.
t∈Λ

213
Corollary 20.1. If S is a fundamental cell of a lattice Λ with generating matrix G, then Vol(S) =
| det(G)|. In Particular, Vol(V) = | det(G)|.

Proof. Let P = G · [0, 1)n and note that it is a fundamental cell of Λ as Rn = Zn + [0, 1)n . The
claim now follows from Proposition 20.1 since Vol(P) = | det(G)| · Vol([0, 1)n ) = | det(G)|.

Definition 20.3 (Lattice decoder). A lattice decoder w.r.t. the lattice Λ returns for every y ∈ Rn
the point QΛ (y).
Remark 20.1. Recall that for linear codes, the ML decoder merely consisted of mapping syndromes
to shifts. Similarly, it can be shown that a lattice decoder can be expressed as

QΛ (y) = y − gsynd [G−1 y] mod 1 , (20.3)

for some gsynd : [0, 1)n 7→ Rn , where the mod 1 operation above is to be understood as componentwise
modulo reduction. Thus, a lattice decoder is indeed much more “structured” than ML decoder for a
random code.
Note that for an additive channel Y = X + Z, if X ∈ Λ we have that

Pe = Pr (QΛ (Y) 6= X) = Pr(Z ∈


/ V). (20.4)

We therefore see that the resilience of a lattice to additive


p noise is dictated by its Voronoi region.
Since we know that Z will be inside a ball of radius n(1 + δ) with high probability, we would like
the Voronoi region to be as close as possible to a ball. We define the effective radius of a lattice,
denoted reff (Λ) as the radius of a ball with the same volume as V, namely Vol (B (reff (Λ))) = Vol(V).
Definition 20.4 (Goodness for coding). A sequence of lattices Λ(n) with growing dimension,
satisfying
2 (Λ(n) )
reff
lim =Φ
n→∞ n
for some Φ > 0, is called good for channel coding if for any additive semi norm-ergodic noise sequence
Z(n) with effective variance σZ2 = 1 EkZk2 < Φ
n
 
(n) (n)
lim Pr Z ∈ /V = 0.
n→∞

An alternative interpretation of this property, is that for a sequence Λ(n) that is good for coding,
for any 0 < δ < 1 holds
 
Vol B (1 − δ)reff (Λ(n) ) ∩ V (n)
lim  = 1.
n→∞ Vol B (1 − δ)reff (Λ(n) )

Roughly speaking, the Voronoi region of a lattice that is good for coding is as resilient to semi
norm-ergodic noise as a ball with the same volume.

214
reff

(b)
(a)

Figure 20.1: (a) shows a lattice in R2 , and (b) shows its Voronoi region and the corresponding
effective ball.

20.2 First Attempt at AWGN Capacity


p
Assume we have a lattice Λ ⊂ Rn with reff (Λ) = n(1 + δ) that is good for coding, and we would
like to use it for communicating over an additive noise channel. In order to meet the power constraint,
we must first intersect Λ, or a shifted version of Λ, with some compact set S√that enforces the power
constraint. The most obvious choice is taking S to be a ball with radius nSNR, and take some
shift v ∈ Rn , such that the codebook
\ √
C = (v + Λ) B( nSNR) (20.5)

satisfies the power constraint. Moreover [Loe97], there exist a shift v such that

Vol (S)
|C| ≥
Vol(V)
√ !n
nSNR
=
reff (Λ)
n
= 2 2 (log(SNR)−log(1+δ)) .

To see this, let V ∼ Uniform(V), and write the expected size of |C| as
X
E|C| = E 1((t + V) ∈ S)
t∈Λ
Z X
1
= 1((t + v) ∈ S)dv
Vol(V) v∈V t∈Λ
Z
1
= 1(x ∈ S)dx
Vol(V) x∈Rn
Vol(S)
= . (20.6)
Vol(V)

215
For decoding, we will simply apply the lattice decoder QΛ (Y − v) on the shifted output. Since
Y − v = t + Z for some t ∈ Λ, the error probability is

Pe = Pr (QΛ (Y − v) 6= t) = Pr(Z ∈
/ V).

r2 (Λ)
Since Λ is good for coding and effn = (1 + δ) > n1 EkZk2 , the error probability of this scheme over
an additive semi norm-ergodic noise channel will vanish with n. Taking δ → 0 we see that any rate
R < 21 log(SNR) can be achieved reliably. Note that for this coding scheme (encoder+decoder) the
average error probability and the maximal error probability are the same.
The construction above gets us close to the AWGN channel capacity. We note that a possible
reason for the loss of +1 in the achievable rate is the suboptimality of the lattice decoder for the
codebook C. The lattice decoder assumes all points of Λ were equally likely to be transmitted.
However, in C only lattice points inside the ball can be transmitted. Indeed, it was shown [UR98]
that if one replaces the lattice decoder with a decoder that takes
T the√ shaping region into account,
there exist lattices and shifts for which the codebook (v + Λ) B( nSNR) is capacity achieving.
The main drawback of this approach is that the decoder no longer exploits the full structure of
the lattice, so the advantages of using a lattice code w.r.t. some typical member of the Gaussian
i.i.d. ensemble are not so clear anymore.

20.3 Nested Lattice Codes/Voronoi Constellations


A lattice Λc is said to be nested in Λf if Λc ⊂ Λf . The lattice Λc is referred to as the coarse lattice
and Λf as the fine lattice. The nesting ratio is defined as
 1/n
Vol(Vc )
Γ(Λf , Λc ) , (20.7)
Vol(Vf )

A nested lattice code (sometimes also called “Voronoi constellation”) based on the nested lattice
pair Λc ⊂ Λf is defined as [CS83, FJ89, EZ04]

L , Λf ∩ V c . (20.8)

Proposition 20.2.

Vol(Vc )
|L| = .
Vol(Vf )
1
Thus, the codebook L has rate R = n log |L| = log Γ(Λf , Λc ).

Proof. First note that


[
Λf , · (t + Λc ).
t∈L

Let
[
S , · (t + Vf ),
t∈L

216
and note that
[
Rn = · (b + Vf )
b∈Λf
[ [
= · · (a + t + Vf )
a∈Λc t∈L
!!
[ [
= · a+ · (t + Vf )
a∈Λc t∈L
[
= · (a + S) .
a∈Λc

Thus, S is a fundamental cell of Λc , and we have

Vol(Vc ) = Vol(S) = |L| · Vol(Vf ).

We will use the codebook L with a standard lattice decoder, ignoring the fact that only points
in Vc were transmitted. Therefore, the resilience to noise will be dictated mainly by Λf . The role of
the coarse lattice Λc is to perform shaping. In order to maximize the rate of the codebook L without
violating the power constraint, we would like Vc to have the maximal possible volume, under the
constraint that the average power of a transmitted codeword is no more than nSNR.
The average transmission power of the codebook L is related to a quantity called the second
moment of a lattice. Let U ∼ Uniform(V). The second moment of Λ is defined as σ 2 (Λ) , n1 EkUk2 .
Let W ∼ Uniform(B(reff (Λ)). By the isoperimetric inequality [Zam14]

1 r2 (Λ)
σ 2 (Λ) ≥ EkWk2 = eff .
n n+2
A lattice Λ exhibits a good tradeoff between average power and volume if its second moment is close
to that of B(reff (Λ).
Definition 20.5 (Goodness for MSE quantization). A sequence of lattices Λ(n) with growing
dimension, is called good for MSE quantization if

nσ 2 Λ(n)
lim  = 1.
n→∞ r 2 Λ(n)
eff

Remark 20.2. Note that both “goodness for coding” and “goodness for quantization” are scale
invariant properties: if Λ satisfy them, so does αΛ for any α ∈ R.

Theorem 20.1 ([OE15]). If Λ is good for MSE quantization and U ∼ Uniform(V), then U is semi
norm-ergodic. Furthermore, if Z is semi norm-ergodic and statistically independent of U, then for
any α, β ∈ R the random vector αU + βZ is semi norm-ergodic.
Theorem 20.2 ([ELZ05, OE15]). For any finite nesting ratio Γ(Λf , Λc ), there exist a nested lattice
pair Λc ⊂ Λf where the coarse lattice Λc is good for MSE quantization and the fine lattice Λf is
good for coding.

217
Figure 20.2: An example of a nested lattice code. The points and Voronoi region of Λc are plotted
in blue, and the points of the fine lattice in black.

Z
L X LY α L Yeff t̂
t mod-Λ QΛf (·) mod-Λ

Figure 20.3: Schematic illustration of the Mod-Λ scheme.

We now describe the Mod-Λ coding scheme introduced by Erez and Zamir [EZ04]. Let Λc ⊂ Λf
be a nested lattice pair, where the coarse lattice is good for MSE quantization and has σ 2 (Λc ) =
SNR(1 − ), whereas the fine lattice is good for coding and has reff2 (Λ ) = n SNR (1 + ). The rate
f 1+SNR
is therefore
 
1 Vol(Vc )
R = log
n Vol(Vf )
 2 
1 reff (Λc )
= log 2 (Λ )
2 reff f
!
1 SNR(1 − )
→ log SNR
(20.9)
2 1+SNR (1 + )
1
→ log (1 + SNR) ,
2
r2 (Λ )
where in (20.9) we have used the goodness of Λc for MSE quantization, that implies effn c → σ 2 (Λc ).
The scheme also uses common randomness, namely a dither vector U ∼ Uniform(Vc ) statistically
independent of everything, known to both the transmitter and the receiver. In order to transmit a
message w ∈ [1, . . . , 2nR ] the encoder maps it to the corresponding point t = t(w) ∈ L and transmits

X = [t + U] mod Λ. (20.10)

Lemma 20.1 (Crypto Lemma). Let Λ be a lattice in Rn , let U ∼ Uniform(V) and let V be a
random vector in Rn , statistically independent of U. The random vector X = [V + U] mod Λ is
uniformly distributed over V and statistically independent of V.

218
Proof. For any v ∈ Rn the set v + V is a fundamental cell of Λ. Thus, by Proposition 20.1 we have
that [v + V] mod Λ = V and Vol(v + V) = Vol(V). Thus, for any v ∈ Rn
X|V = v ∼ [v + U] mod Λ ∼ Uniform(V).

The Crypto Lemma ensures that n1 EkXk2 = (1 − )SNR, but our power constraint was kXk2 ≤
nSNR. Since X is uniformly distributed over Vc and Λc is good for MSE quantization, Theorem 20.1
implies that kXk2 ≤ nSNR with high probability. Thus, whenever the power constraint is violated
we can just transmit 0 instead of X, and this will have a negligible effect on the error probability of
the scheme.
The receiver scales its observation by a factor α > 0 to be specified later, subtracts the dither U
and reduces the result modulo the coarse lattice
Yeff = [αY − U] mod Λc
= [X − U + (α − 1)X + αZ] mod Λc
= [t + (α − 1)X + αZ] mod Λc (20.11)
= [t + Zeff ] mod Λc , (20.12)
where we have used the modulo distributive law in (20.11), and
Zeff = (α − 1)X + αZ (20.13)
is effective noise, that is statistically independent of t, with effective variance
2 1
σeff (α) , EkZeff k2 < α2 + (1 − α)2 SNR. (20.14)
n
Since Z is semi norm-ergodic, and X is uniformly distributed over the Voronoi region of a lattice that
is good for MSE quantization, Theorem 20.1 implies that Zeff is semi norm-ergodic with effective
2 (α). Setting α = SNR/(1 + SNR), such as to minimize the upper bound on σ 2 (α)
variance σeff eff
2 < SNR/(1 + SNR).
results in effective variance σeff
The receiver next computes
t̂ = [QΛf (Yeff )] mod Λc
 
= QΛf (t + Zeff ) mod Λc , (20.15)
and outputs the message corresponding to t̂ as its estimate. Since Λf is good for coding, Zeff is
semi norm-ergodic, and
2 (Λ )
reff f SNR 2
= (1 + ) > σeff ,
n 1 + SNR
we have that Pr(t̂ 6= t) → 0 as the lattice dimension tends to infinity. Thus, we have proved the
following.
Theorem 20.3. There exist a coding scheme based on a nested lattice pair, that reliably achieves
any rate below 12 log(1 + SNR) with lattice decoding for all additive semi norm-ergodic channels. In
particular, if the additive noise is AWGN, this scheme is capacity achieving.
Remark 20.3. In the Mod-Λ scheme the error probability does not depend on the chosen message,
such that Pe,max = Pe,avg . However, this required common randomness in the form of the dither U.
By a standard averaging argument it follows that there exist some fixed shift u that achieves the
same, or better, Pe,avg . However, for a fixed shift the error probability is no longer independent of
the chosen message.

219
20.4 Dirty Paper Coding
Assume now that the channel is

Y = X + S + Z,

where Z is a unit variance semi norm-ergodic noise, X is subject to the same power constraint
kXk2 ≤ nSNR as before, and S is some arbitrary interference vector, known to the transmitter but
not to the receiver.
Naively, one can think that the encoder can handle the interference S just by subtracting it
from the transmitted codeword. However, if the codebook is designed to exactly meet the power
constraint, after subtracting S the power constraint will be violated. Moreover, if kSk2 > nSNR,
this approach is just not feasible.
Using the Mod-Λ scheme, S can be cancelled out with no cost in performance. Specifically,
instead of transmitting X = [t + U] mod Λc , the transmitted signal in the presence of known
interference will be

X = [t + U − αS] mod Λc .

Clearly, the power constraint is not violated as X ∼ Uniform(Vc ) due to the Crypto Lemma (now,
U should also be independent of S). The decoder is exactly the same as in the Mod-Λ scheme with
no interference. It is easy to verify that the interference is completely cancelled out, and any rate
below 21 log(1 + SNR) can still be achieved.
Remark 20.4. When Z is Gaussian and S is Gaussian there is a scheme based on random codes
that can reliably achieve 12 log(1 + SNR). For arbitrary S, to date, only lattice based coding schemes
are known to achieve the interference free capacity. There are many more scenarios where lattice
codes can reliably achieve better rates than the best known random coding schemes.

20.5 Construction of Good Nested Lattice Pairs


We now briefly describe a method for constructing nested lattice pairs. Our construction is based
on starting with a linear code over a prime finite field, and embedding it periodically in Rn to form
a lattice.
Definition 20.6 (p-ary Construction A). Let p be a prime number, and let F ∈ Zk×n p be a k × n
matrix whose entries are all members of the finite field Zp . The matrix F generates a linear p-ary
code
n o
C(F) , x ∈ Znp : x = [wT F] mod p w ∈ Zkp .

The p-ary Construction A lattice induced by the matrix F is defined as

Λ(F) , p−1 C(F) + Zn .

Note that any point in Λ(F) can be decomposed as x = p−1 c + a for some c ∈ C(F) (where we
identify the elements of Zp with the integers [0, 1, . . . , p−1]) and a ∈ Zn . Thus, for any x1 , x2 ∈ Λ(F)

220
we have

x1 + x2 = p−1 (c1 + c2 ) + a1 + a2
= p−1 ([c1 + c2 ] mod p + pa) + a1 + a2
= p−1 c̃ + ã
∈ Λ(F)

where c̃ = [c1 + c2 ] mod p ∈ C(F) due to the linearity of C(F), and a and ã are some vectors in
Zn . It can be verified similarly that for any x ∈ Λ(F) it holds that −x ∈ Λ(F), and that if all
codewords in C(F) are distinct, then Λ(F) has a finite minimum distance. Thus, Λ(F) is indeed
a lattice. Moreover, if F is full-rank over Zp , then the number of distinct codewords in C(F) is
pk . Consequently, the number of lattice points in every integer shift of the unit cube is pk , so the
corresponding Voronoi region must satisfy Vol(V) = p−k .
Similarly, we can construct a nested lattice pair from a linear code. Let 0 ≤ k 0 < k and let F0 be
the sub-matrix obtained by taking only the first k 0 rows of F. The matrix F0 generates a linear
code C 0 (F0 ) that is nested in C(F), i.e., C 0 (F0 ) ⊂ C(F). Consequently we have that Λ(F0 ) ⊂ Λ(F),
and the nesting ratio is
k−k0
Γ(Λ(F), Λ(F0 )) = p n .

An advantage of this nested lattice construction for Voronoi constellations is that there is a very
simple mapping between messages and codewords in L = Λf ∩ Vc . Namely, we can index our set
0 0 0
of 2nR = pk−k messages by all vectors in Zk−k
p . Then, for each message vector w ∈ Zk−k p , the
corresponding codeword in L = Λ(F) ∩ V(Λ(F0 )) is obtained by constructing the vector

w̃T = [0 · · 0} wT ] ∈ Zkp ,
| ·{z (20.16)
k0 zeros
 
and taking t = t(w) = [w̃T F] mod p mod Λ(F0 ). Also, in order to specify the codebook L, only
the (finite field) generating matrix F is needed.
If we take the elements of F to be i.i.d. and uniform over Zp , we get a random ensemble of
nested lattice codes. It can be shown that if p grows fast enough with the dimension n (taking
p = O(n(1+)/2 ) suffices) almost all pairs in the ensemble have the property that both the fine and
coarse lattice are good for both coding and for MSE quantization [OE15].

Disclaimer: This text is a very brief and non-exhaustive survey of the applications of lattices
in information theory. For a comprehensive treatment, see [Zam14].

221
§ 21. Channel coding: energy-per-bit, continuous-time channels

21.1 Energy per bit


Consider the additive Gaussian noise channel:

Yi = Xi + Zi , Zi ∼ N (0, N0 /2). (21.1)

In the last lecture, we analyzed the maximum number of information bits (M ∗ (n, , P )) that can be
pumped through for given n time use of the channel under the energy constraint P . Today we shall
study the counterpart of it: without any time constraint, in order to send k information bits, what
is the minimum energy needed? (E ∗ (k, ))
Definition 21.1 ( (E, 2k , ) code). For a channel W → X ∞ → Y ∞ → Ŵ , where Y ∞ = X ∞ + Z ∞ ,
a (E, 2k , ) code is a pair of encoder-decoder:

f : [2k ] → R∞ , g : R∞ → [2k ],
such that 1). ∀m, kf (m)k22 ≤ E,
2). P [g(f (W ) + Z ∞ ) 6= W ] ≤ .

Definition 21.2 (Fundamental limit).

E ∗ (k, ) = min{E : ∃(E, 2k , ) code}

Note: Operational meaning of lim→0 E ∗ (k, ): it suggests the smallest battery one needs in order
to send k bits without any time constraints, below that level reliable communication is impossible.
Theorem 21.1 ((Eb /N0 )min = −1.6dB).

E ∗ (k, ) N0 1
lim lim sup = , = −1.6dB (21.2)
→0 k→∞ k log2 e log2 e

Proof.

222
1. (“≥” converse part)
1
−h() + k ≤ d((1 − )k ) (Fano)
M
≤ I(W ; Ŵ ) (data processing for divergence)
≤ I(X ∞ ; Y ∞ ) (data processing for M.I.)
X∞
≤ I(Xi ; Yi ) ( lim I(X n ; U ) = I(X ∞ ; U ))
n→∞
i=1
X∞
1 EXi2
≤ log(1 + ) (Gaussian)
2 N0 /2
i=1

log e X EXi2
≤ (linearization)
2 N0 /2
i=1
E
≤ log e
N0
E ∗ (k, ) N0 h()
⇒ ≥ ( − ).
k log e k

2. (“≤” achievability part)


Notice that a (n, 2k , , P ) code for AWGN channel is also a (nP, 2k , ) code for the energy
problem without time constraint. Therefore,

log2 Mmax (n, , P ) ≥ k ⇒ E ∗ (k, ) ≤ nP.
∗ (n, , P )c, we have E ∗ (kn ,) nP
∀P , take kn = blog Mmax kn ≤ kn , ∀n, and take the limit:

E ∗ (kn , ) nP
lim sup ≤ lim sup ∗
n→∞ kn n→∞ log Mmax (n, , P )
P
= 1 ∗ (n, , P )
lim inf n→∞ n log Mmax
P
= 1 P
2 log(1 + N0 /2 )

Choose P for the lowest upper bound:


E ∗ (kn , ) P
lim sup ≤ inf 1 P
n→∞ kn P ≥0
2 log(1 + N0 /2 )
P
= lim
P →0 1 log(1 + P
2 N0 /2 )
N0
=
log2 e

Note: [Remark] In order to send information reliably at Eb /N0 = −1.6dB, infinitely many time
slots are needed, and the information rate (spectral efficiency) is thus 0. In order to have non-zero
spectral efficiency, one necessarily has to step back from −1.6 dB.

223
Note: [PPM code] The following code, pulse-position modulation (PPM), is very efficient in terms
of Eb /N0 .

PPM encoder: ∀m, f (m) = (0, 0, . . . , E
|{z} ,...) (21.3)
m-th location

It is not hard to derive an upper bound on the probability of error that this code achieves [PPV11,
Theorem 2]: " ( ! )#
r
2E
 ≤ E min M Q + Z ,1 , Z ∼ N (0, 1) .
N0
In fact,
√ the code√can be further slightly optimized by subtracting the common center of gravity
(2−k E, . . . , 2−k E . . .) and rescaling each codeword to satisfy the power constraint. The resulting
constellation (simplex code) is conjectured to be non-asymptotic optimum in terms of Eb /N0 for
small  (“simplex conjecture”).

21.2 What is N0 ?
In the above discussion, we have assumed Zi ∼ N (0, N0 /2), but how do we determine N0 ?
In reality the signals are continuous time (CT) process, the continuous time AWGN channel for
the RF signals is modeled as:

Y (t) = X(t) + N (t) (21.4)

where noise N (t) (added at the receiver antenna) is a real stationary ergodic process and is assumed
to be “white Gaussian noise” with single-sided PSD N0 . Figure 21.1 at the end illustrates the
communication architecture. In the following discussion, we shall find the equivalent discrete
time (DT) AWGN model for the continuous time (CT) AWGN model in (21.4), and identify the
relationship between N0 in the DT model and N (t) in the CT model.

• Goal: communication in fc ± B/2 band.


(the (possibly complex) baseband signal lies in [−W, +W ], where W = B/2)

• observations:

1. Any signal band limited to fc ± B/2 can be produced by this architecture


2. At the step of C/D conversion, the LPF followed by sampling at B samples/sec is
sufficient statistics for estimating X(t), XB (t), as well as {Xi }.

First of all, what is N (t) in (21.4)?

Engineers’ definition of N (t)

224
Estimate the average power dissipation at the resistor:
Z T
1 ergodic (*)
lim Ft2 dt = E[F 2 ] = N0 B
T →∞ T t=0

If for some constant N0 , (*) holds for any narrow band with center frequency fc and bandwidth B,
then N (t) is called a “white noise” with one-sided PSD N0 .
Typically, white noise comes from thermal noise at the receiver antenna. Thus:

N0 ≈ kT (21.5)

where k = 1.38 × 10−23 is the Boltzmann constant, and T is the absolute temperature. The unit of
N0 is (W att/Hz = J).
An intuitive explanation to (21.5) is as follows: the thermal energy carried by each microscopic
degree of freedom (dof) is approximately kT 2 ; for bandwidth B and duration T , there are in total
2BT dof; by “white noise” definition we have the total energy of the noise to be:
kT
N0 BT = 2BT ⇒ N0 = kT.
2

Mathematicians’ definition of N (t)

Denote the set of all real finite energy signals f (t) by L2 (R), it is a vector space with the inner
product of two signals f (t), g(t) defined by
Z ∞
< f, g >= f (t)g(t)dt.
t=−∞

Definition 21.3 (WhiteR noise). N (t) isR a white noise with two-sided PSD being constant N0 /2 if
∞ ∞
∀f, g ∈ L2 (R) such that −∞ f 2 (t)dt = −∞ g 2 (t)dt = 1, we have that

1.
Z ∞
N0
< f, N >, f (t)N (t)dt ∼ N (0, ). (21.6)
−∞ 2

2. The joint distribution of (< f, N >, < g, N >) is jointly Gaussian with covariance equal to
inner product < f, g >.

Note: By this definition, N (t) is not a stochastic process, rather it is a collection of linear mappings
that map any f ∈ L2 (R) to a Gaussian random variable.

225
Note: Informally, we write:
N0
N (t) is white noise with one-sided PSD N0 (or two-sided PSD N0 /2) ⇐⇒ E[N (t)N (s)] = δ(t − s)
2
(21.7)

Note: The concept of one-sided PSD arises when N (t) is necessarily real, since in that case power
spectrum density is symmetric around 0, and thus to get the noise power in band [a, b] one can get
Z b Z b Z −a
noise power = Fone-sided (f )df = + Ftwo-sided (f )df ,
a a −b

where Fone-sided (f ) = 2Ftwo-sided (f ). In theory of stochastic processes it is uncommon to talk about


one-sided PSD, but in engineering it is.

Verify the equivalence between CT /DT models

First, consider the relation between RF signals and baseband signals.



X(t) = Re(XB (t) 2ejωc t ),

YB (t) = 2LP F2 (Y (t)ejωc t ),

where ωc = 2πfc . The LP F2 with high cutoff frequency ∼ 34 fc serves to kill the high frequency

component after demodulation, and the amplifier of magnitude 2 serves to preserve the total
energy of the signal, so that in the absence of noise we have that YB (t) = XB (t). Therefore,
e (t) ∼ C
YB (t) = XB (t) + N
e (t) is a complex Gaussian white noise and
where N
e (t)N
E[N e (s)∗ ] = N0 δ(t − s).

Notice that after demodulation, the PSD


√ of the noise is N0 /2 with N0 /4 in the real part and N0 /4
in the imaginary part, and after the 2 amplifier the PSD of the noise is restored to N0 /2 in both
real and imaginary part.
Next, consider the equivalent discrete time signals.

X i
XB (t) = Xi sincB (t − )
B
i=−∞
Z ∞
i
Yi = YB (t)sincB (t − )dt
t=−∞ B
Yi = Xi + Zi

226
where the additive noise Zi is given by:
Z ∞
Zi = Ne (t)sincB (t − i )dt ∼ i.i.d CN (0, N0 ). (by (21.6))
t=−∞ B

if we focus on the real part of all signals, it is consistent with the real AWGN channel model in
(21.1).
Finally, the energy of the signal is preserved:

X
|Xi |2 = kXB (t)k22 = kX(t)k22 .
i=−∞

Note: [Punchline]

CT AWGN (band limited) ⇐⇒ DT C-AWGN


N0
two-sided PSD ⇐⇒ Zi ∼ CN (0, N0 )
Z 2
X
energy= X(t)2 dt ⇐⇒ energy= |Xi |2

21.3 Capacity of the continuous-time band-limited AWGN


channel
∗ (T, , P ) the maximum number of waveforms that can be sent through the
Theorem 21.2. Let MCT
channel
N0
Y (t) = X(t) + N (t) , E N (t)N (s) = δ(t − s)
2
such that:

1. in the duration [0, T ];


B
2. band limited to [fc − + B2 ] for some large carrier frequency
2 , fc
RT
3. input energy constrained to t=0 x2 (t) ≤ T P ;

4. error probability P [Ŵ 6= W ] ≤ .

Then
1 ∗ P
lim lim inf log MCT (T, , P ) = B log(1 + ), (21.8)
→0 n→∞ T N0 B
Proof. Consider the DT equivalent C-AWGN channel of this CT model, we have that
1 ∗ 1 ∗
log MCT (T, , P ) = log MC−AWGN (BT, , P/B)
T T
This is because:

• in time T we get to choose BT complex samples

227
• The power constraint in the DT model changed because for blocklength BT we have
BT
X
|Xi |2 = kX(t)k22 ≤ P T ,
i=1

P
thus per-letter power constraint is B.

Calculate the rate of the equivalent DT AWGN channel and we are done.
Note the above “theorem” is not rigorous, since conditions 1 and 2 are mutually exclusive:
any time limited non-trivial signal cannot be band limited. Rigorously, one should relax 2 by
constraining the signal to have a vanishing out-of-band energy as T → ∞. Rigorous approach to
this question lead to the theory of prolate spheroidal functions.

21.4 Capacity of the continuous-time band-unlimited AWGN


channel
In the limit of large bandwidth B the capacity formula (21.8) yields
P P
CB=∞ (P ) = lim B log(1 + )= log e .
B→∞ N0 B N0
It turns out that this result is easy to prove rigorously.
Theorem 21.3. Let M ∗ (T, , P ) the maximum number of waveforms that can be sent through the
channel
N0
Y (t) = X(t) + N (t) , E N (t)N (s) = δ(t − s)
2
such that each waveform x(t)
1. is non-zero only on [0, T ];
RT 2 (t)
2. input energy constrained to t=0 x ≤ TP;

3. error probability P [Ŵ 6= W ] ≤ .


Then
1 P
lim lim inf log M ∗ (T, , P ) = log e (21.9)
→0 T →∞ T N0
Proof. Note that the space of all square-integrable functions on [0, T ], denoted L2 [0, T ] has countable
basis (e.g. sinusoids). Thus, by changing to that basis we may assume that equivalent channel
model
N0
Ỹj = X̃j + Z̃j , Z̃j ∼ N (0, ),
2
and energy constraint (dependent upon duration T ):

X
X̃j2 ≤ P T .
j=1

But then the problem is equivalent to energy-per-bit one and hence

log2 M ∗ (T, , P ) = k ⇐⇒ E ∗ (k, ) = P T .

228
Thus,
1 P P
lim lim inf log2 M ∗ (T, , P ) = E ∗ (k,)
= log2 e ,
→0 n→∞ T lim→0 lim supk→∞ N0
k
where the last step is by Theorem 21.1.

229
230

Figure 21.1: DT / CT AWGN model


21.5 Capacity per unit cost
Generalizing the energy-per-bit setting of Theorem 21.1 we get the problem of capacity per unit
cost:

1. Given a random transformation PY ∞ |X ∞ and cost function c : X → R+ , we let

M ∗ (E, ) = max{M : (E, M, )-code} ,

where (E, M, )-code is defined as a map [M ] → X ∞ with every codeword x∞ satisfying



X
c(xt ) ≤ E . (21.10)
t=1

2. Capacity per unit cost is defined as


1
Cpuc , lim lim inf log M ∗ (E, ) .
→0 E→∞ E

3. Let C(P ) be the capacity-cost function of the channel (in the usual sense of capacity, as
defined in (19.1). Assuming P0 = 0 and C(0) = 0 it is not hard to show that:

C(P ) C(P ) d
Cpuc = sup = lim = C(P ) .
P P P →0 P dP P =0

4. The surprising discovery of Verdú is that one can avoid computing C(P ) and derive the
Cpuc directly. This is a significant help, as for many practical channels C(P ) is unknown.
Additionally, this gives a yet another fundamental meaning to KL-divergence.
Q
Theorem 21.4. For a stationary memoryless channel PY ∞ |X ∞ = PY |X with P0 = c(x0 ) = 0
(i.e. there is a symbol of zero cost), we have

D(PY |X=x kPY |X=x0 )


Cpuc = sup .
x6=x0 c(x)

In particular, Cpuc = ∞ if there exists x1 6= x0 with c(x1 ) = 0.

Proof. Let
D(PY |X=x kPY |X=x0 )
CV = sup .
x6=x0 c(x)

Converse: Consider a (E, M, ) code W → X ∞ → Y ∞ → Ŵ . Introduce an auxiliary distribution


QW,X ∞ ,Y ∞ ,Ŵ , where a channel is a useless one

QY ∞ |X ∞ = QY ∞ , PY∞|X=x0 .

That is, the overall factorization is

QW,X ∞ ,Y ∞ ,Ŵ = PW PX ∞ |W QY ∞ PŴ |Y ∞ .

231
Then, as usual we have from the data-processing for divergence
1
(1 − ) log M + h() ≤ d(1 − k ) (21.11)
M
≤ D(PW,X ∞ ,Y ∞ ,Ŵ kQW,X ∞ ,Y ∞ ,Ŵ ) (21.12)
= D(PY ∞ |X ∞ kQY ∞ |PX ∞ ) (21.13)
"∞ #
X
=E d(Xt ) , (21.14)
t=1

where we denoted for convenience

d(x) , D(PY |X=x kPY |X=x0 ) .

By the definition of CV we have


d(x) ≤ c(x)CV .
Thus, continuing (21.14) we obtain
"∞ #
X
(1 − ) log M + h() ≤ CV E c(Xt ) ≤ CV · E ,
t=1

where the last step is by the cost constraint (21.10). Thus, dividing by E and taking limits we get

Cpuc ≤ CV .

Achievability: We generalize the PPM code (21.3). For each x1 ∈ X and n ∈ Z+ we define the
encoder f as follows:

f (1) = (x1 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.15)


| {z } | {z }
n-times n(M −1)-times

f (2) = (x0 , x0 , . . . , x0 , x1 , . . . , x1 , x0 , . . . , x0 ) (21.16)


| {z } | {z } | {z }
n-times n-times n(M −2)-times

··· (21.17)
f (M ) = ( x0 , . . . , x0 , x1 , x1 , . . . , x1 ) (21.18)
| {z } | {z }
n(M −1)-times n-times

Now, by Stein’s lemma there exists a subset S ⊂ Y n with the property that

P[Y n ∈ S|X n = (x1 , . . . , x1 )] ≥ 1 − 1 (21.19)


P[Y n n
∈ S|X = (x0 , . . . , x0 )] ≤ exp{−nD(PY |X=x1 kPY |X=x0 ) + o(n)} . (21.20)

Therefore, we propose the following (suboptimal!) decoder:

Yn ∈S =⇒ Ŵ = 1 (21.21)
2n
Yn+1 ∈S =⇒ Ŵ = 2 (21.22)
··· (21.23)

From the union bound we find that the overall probability of error is bounded by

 ≤ 1 + M exp{−nD(PY |X=x1 kPY |X=x0 ) + o(n)} .

232
At the same time the total cost of each codeword is given by nc(x1 ). Thus, taking n → ∞ and after
straightforward manipulations, we conclude that
D(PY |X=x1 kPY |X=x0 )
Cpuc ≥ .
c(x1 )
This holds for any symbol x1 ∈ X , and so we are free to take supremum over x1 to obtain Cpuc ≥ CV ,
as required.

21.5.1 Energy-per-bit for AWGN channel subject to fading


Consider a stationary memoryless Gaussian channel with fading Hj (unknown at the receiver).
Namely,
N0
Yj = Hj Xj + Zj , Hj ∼ N (0, 1) ⊥
⊥ Zj ∼ N (0, ).
2
The cost function is the usual quadratic one c(x) = x2 . As we discussed previously, cf. (19.8), the
capacity-cost function C(P ) is unknown in closed form, but is known to behave drastically different
from the case of non-fading AWGN (i.e. when Hj = 1). So here previous theorem comes handy, as
we cannot just compute C 0 (0). Let us perform a simple computation required, cf. (1.17):
N0 N0
D(N (0, x2 + 2 )kN (0, 2 ))
Cpuc = sup (21.24)
x6=0 x2
2x2
!
1 log(1 + N0 )
= sup log e − 2x2
(21.25)
N0 x6=0
N0
log e
= (21.26)
N0
Comparing with Theorem 21.1 we discover that surprisingly, the capacity-per-unit-cost is unaffected
by the presence of fading. In other words, the random multiplicative noise which is so detrimental
at high SNR, appears to be much more benign at low SNR (recall that Cpuc = C 0 (0)). There is one
important difference, however. It should be noted that the supremization over x in (21.25) is solved
at x = ∞. Following the proof of the converse bound, we conclude that any code hoping to achieve
optimal Cpuc must satisfy a strange constraint:
X X
x2t 1{|xt | ≥ A} ≈ x2t ∀A > 0
t t

i.e. the total energy expended by each codeword must be almost entirely concentrated in very large
spikes. Such a coding method is called “flash signalling”. Thus, we can see that unlike non-fading
AWGN (for which due to rotational symmetry all codewords can be made “mellow”), the only hope
of achieving full Cpuc in the presence of fading is by signalling in huge bursts of energy.
This effect manifests itself in the speed of convergence to Cpuc with increasing constellation sizes.

Namely, the energy-per-bit E (k,) k behaves as
r
E ∗ (k, ) const −1
= (−1.59 dB) + Q () (AWGN) (21.27)
k k
r
E ∗ (k, ) 3 log k
= (−1.59 dB) + (Q−1 ())2 (fading) (21.28)
k k
Fig. 21.2 shows numerical details.

233
14

12

10
Achievability

8
Converse
dB

6 Rayleigh fading, noCSI


N0 ,
Eb

2
fading+CSIR, non-fading AWGN
0
−1.59 dB
−2
100 101 102 103 104 105 106 107 108
Information bits, k

Figure 21.2: Comparing the energy-per-bit required to send a packet of k-bits for different channel
E ∗ (k,)
models (curves represent upper and lower bounds on the unknown optimal value k ). As a
comparison: to get to −1.5 dB one has to code over 6 · 104 data bits when the channel is non-fading
AWGN or fading AWGN with Hj known perfectly at the receiver. For fading AWGN without
knowledge of Hj (noCSI), one has to code over at least 7 · 107 data bits to get to the same −1.5 dB.
Plot generated via [Spe15].

234
§ 22. Advanced channel coding. Source-Channel separation.

Topics: Strong Converse, Channel Dispersion, Joint Source Channel Coding (JSCC)

22.1 Strong Converse


We begin by stating the main theorem.
Theorem 22.1. For any stationary memoryless channel with either |A| < ∞ or |B| < ∞ we have
C = C for 0 <  < 1.
C
Remark: In Theorem 18.4, we showed that C ≤ C ≤ 1− . Now we are asserting that equality
holds for every . Our previous converse arguments (Theorem 16.4 based on Fano’s inequality)
showed that communication with an arbitrarily small error probability is possible only when using
rate R < C; the strong converse shows that when you try to communicate with any rate above
capacity R > C, then the probability of error will go to 1 (typically with exponential speed in n).
In other words, (
0 R<C
∗ (n, exp(nR)) →
1 R>C
where ∗ (n, M ) is the inverse of M ∗ (n, ) defined in (18.3).
In practice, engineers observe this effect in the form of waterfall plots, which depict the dependence
of a given communication system (code+modulation) on the SNR.
Pe
1
10−1
10−2
10−3
10−4
10−5
SNR
Below a certain SNR, the probability of error shoots up to 1, so that the receiver will only see
garbage.

Proof. We will give a sketch of the proof. Take an (n, M, )-code for channel PY |X . The main trick
is to consider an auxiliary channel QY |X which is easier to analyze.
PY n |X n
W Xn Yn Ŵ

QY n |X n

235
Sketch 1: Here, we take QY n |X n = (PY∗ )n , where PY∗ is the capacity-achieving output distri-
bution (caod) of the channel PY |X .1 Note that for communication purposes, QY n |X n is a useless
channel; it ignores the input and randomly picks a member of the output space according to (PY∗ )n ,
so that X n and Y n are decoupled (independent). Consider the probability of error under each
channel:
1
Q[Ŵ = W ] = (Blindly guessing the sent codeword)
M
P[Ŵ = W ] = 1 − 

Since the random variable 1{Ŵ =W } has a huge mass under P and small mass under Q, this looks
like a great binary hypothesis test to distinguish the two distributions, PW X n Y n Ŵ and QW X n Y n Ŵ .
Since any hypothesis test can’t beat the optimal Neyman-Pearson test, we get the upper bound
1
β1− (PW X n Y n Ŵ , QW X n Y n Ŵ ) ≤ (22.1)
M
(Recall that βα (P, Q) = inf P [E]≥α Q[E]). Since the likelihood ratio is a sufficient statistic for this
hypothesis test, we can test only between

PW X n Y n Ŵ PW PX n |W PY n |Xn PŴ |Y n PW |X n PX n Y n PŴ |Y n PX n Y n


= ∗ = ∗ =
QW X n Y n Ŵ n
PW PX n |W (PY ) PŴ |Y n n
PW |X n PX (PY ) PŴ |Y n
n PX n (PY∗ )n

Therefore, inequality above becomes


1
β1− (PX n Y n , PX n (PY∗ )n ) ≤ (22.2)
M
Computing the LHS of this bound need not be easy, since generally we know PY |X and PY∗ , but
can’t assume anything about PX n which depends on the code. (Note that X n is the output of the
encoder and uniformly distributed on the codebook for deterministic encoders). Certain tricks are
needed to remove the dependency on codebook. However, in case the channel is “symmetric” the
dependence on the codebook disappears: this is shown in the following example for the BSC. To
treat the general case one simply decomposes the channel into symmetric subchannels (for example,
by considering constant composition subcodes).
Example. For a BSC(δ)n , recall that

PY n |X n (y n |xn ) = PZn (y n − xn ), Z n ∼ Bern(δ)n


(PY∗ )n (y n ) = 2−n

From the Neyman Pearson test, the optimal HT takes the form
   
PX n Y n PX n Y n
βα (PX n Y n , PX n (PY∗ )n ) = Q log ≥ γ where α = P log ≥ γ
| {z } | {z } PX n (PY∗ )n PX n (PY∗ )n
P Q

For the BSC, this becomes


PX n Y n PZ n (y n − xn )
log = log
PX n (PY∗ )n 2−n
1
Recall from Theorem 4.5 that the caod of a random transformation always exists and is unique, whereas a caid
may not exist.

236
So under each hypothesis P and Q, the difference Y n − X n takes the form
1
Q : Y n − X n ∼ Bern( )n
2
P : Y − X ∼ Bern(δ)n
n n

Now all the relevant distributions are known, so we can compute βα


1
βα (PX n Y n , PX n (PY∗ )n ) = βα (Bern(δ)n , Bern( )n )
2
−nD(Bern(δ)kBern( 21 ))+o(n)
= 2 (Stein’s Lemma Theorem 13.1)
−nd(δk 21 )+o(n)
= 2

Putting this all together, we see that any (n, M, ) code for the BSC satisfies
1 1 1
2−nd(δk 2 )+o(n) ≤ =⇒ log M ≤ nd(δk ) + o(n)
M 2
Since this is satisfied for all codes, it is also satisfied for the optimal code, so we get the converse
bound
1 1
lim inf log M ∗ (n, ) ≤ d(δk ) = log 2 − h(δ)
n→∞ n 2
For a general channel, this computation can be much more difficult. The expression for β in this
case is
∗ 1
β1− (PX n PY n |X n , PX n (PY∗ )n ) = 2−nD(PY |X kPY |P̄X )+o(n) ≤ (22.3)
M
where P̄X is unknown (depending on the code).
Explanation of (22.3): A statistician observes sequences of (X n , Y n ):

Xn = [ 0 1 2 0 0 1 2 2 ]
Yn =[a b b a c c a b]

On the marked three blocks, test between iid samples of PY |X=0 vs PY∗ , which has exponent
D(PY |X=0 kPY∗ ). Thus, intuitively averaging over the composition of the codeword we get that the
exponent of β is given by (22.3).
Recall that from the saddle point characterization of capacity (Theorem 4.4) for any distribution
P̄X we have
D(PY |X kPY∗ |P̄X ) ≤ C . (22.4)
Thus from (22.3) and (22.1):

log M ≤ nD(PY |X kPY∗ |P̄X ) + o(n) ≤ nC + o(n)

Sketch 2: (More formal) Again, we will choose a dummy auxiliary channel QY n |X n = (QY )n .
However, choice of QY will depend on one of the two cases:

237
1. If |B| < ∞ we take QY = PY∗ (the caod) and note that from (18.12) we have
X
PY |X (y|x0 ) log2 PY |X (y|x0 ) ≤ log2 |B| ∀x0 ∈ A
y

and since miny PY∗ (y) > 0 (without loss of generality), we conclude that for any distribution
of X on A we have
 
PY |X (Y |X)
Var log |X ≤ K < ∞ ∀PX . (22.5)
QY (Y )

Furthermore, we also have from (22.4) that


 
PY |X (Y |X)
E log |X ≤ C ∀PX . (22.6)
QY (Y )

2. If |A| < ∞, then for each codeword c ∈ An we define its composition as


n
1X
P̂c (x) , 1{cj = x} .
n
j=1

By simple counting it is clear that from any (n, M, ) code, it is possible to select an (n, M 0 , )
subcode, such that a) all codeword have the same composition P0 ; and b) M 0 > nM |A| . Note
that, log M = log M 0 + O(log n) and thus we may replace M with M 0 and focus on the analysis
of the chosen subcode. Then we set QY = PY |X ◦ P0 . In this case, from (18.9) we have
 
PY |X (Y |X)
Var log |X ≤ K < ∞ X ∼ P0 . (22.7)
QY (Y )
Furthermore, we also have
 
PY |X (Y |X)
E log |X = D(PY |X kQY |P0 ) = I(X; Y ) ≤ C X ∼ P0 . (22.8)
QY (Y )

Now, proceed as in (22.2) to get


1
β1− (PX n Y n , PX n (QY )n ) ≤ . (22.9)
M
We next apply the lower bound on β from Theorem 12.5:
h dPY n |X n (Y n |X n ) i
γβ1− (PX n Y n , PX n (QY )n ) ≥ P log Q ≤ log γ − 
d QY (Yi )

Set log γ = nC + K 0 n with K 0 to be chosen shortly and denote for convenience
n
dPY n |X n (Y n |X n ) X dPY |X (Yj |Xj )
Sn , log Q = log
d QY (Yi ) dQY (Yj )
j=1

Conditioning on X n and using (22.6) and (22.8) we get


 √   √ 
P Sn ≤ nC + K 0 n|X n ≥ P Sn ≤ n E[Sn |X n ] + K 0 n|X n

238
From here, we apply Chebyshev inequality and (22.5) or (22.7) to get

 √  K 02
P Sn ≤ n E[Sn |X n ] + K 0 n|X n ≥ 1 − .
K
K 02
If we set K 0 so large that 1 − K > 2 then overall we get that

log β1− (PX n Y n , PX n (QY )n ) ≥ −nC − K 0 n − log  .

Consequently, from (22.9) we conclude that



log M ∗ (n, ) ≤ nC + O( n) ,

implying the strong converse.

In summary, the take-away points for the strong converse are

1. Strong converse can be proven by using binary hypothesis testing.

2. The capacity saddle point (22.4) is key.

In the homework, we will explore in detail proofs of the strong converse for the BSC and the AWGN
channel.

22.2 Stationary memoryless channel without strong converse


It may seem that the strong converse should hold for an arbitrary stationary memoryless channel (it
was only showed for the discrete ones above). However, it turns out that there exist counterexamples.
We construct one next.
Let the output alphabet be B = [0, 1]. The input A is going to be countably infinite. It will be
convenient to define it as
A = {(j, m) : j, m ∈ Z+ , 0 ≤ j ≤ m} .
The single-letter channel PY |X is defined in terms of probability density function as
(
j
am , m ≤ y ≤ j+1
m ,,
pY |X (y|(j, m)) =
bm , otherwise ,

where am , bm are chosen to satisfy


1 1
am + (1 − )bm = 1 (22.10)
m m
1 1
am log am + (1 − )bm log bm = C , (22.11)
m m
where C > 0 is an arbitary fixed constant. Note that for large m we have
mC 1
am = (1 + O( )) , (22.12)
log m log m
C 1
bm =1− + O( 2 ) (22.13)
log m log m

239
It is easy to see that PY∗ = Unif[0, 1] is the capacity-achieving output distribution and

sup I(X; Y ) = C .
PX

Thus by Theorem 18.6 the capacity of the corresponding stationary memoryless channel is C. We
next show that nevertheless the -capacity can be strictly greater than C.
Indeed, fix blocklength n and consider a single letter distribution PX assigning equal weights
to all atoms (j, m) with m = exp{2nC}. It can be shown that in this case, the distribution of a
single-letter information density is given by
(
1
2nC, w.p. 2n
i(X; Y ) ≈ 1
0, w.p.1 − 2n

Thus, for blocklength-n density we have


1
i(X n ; Y n ) → 2CPoisson(1/2) .
n

Therefore, from Theorem 17.1 we get that for  > 1 − e−1/2 there exist (n, M, )-codes with

log M ≥ 2nC .

In particular,
C ≥ 2C ∀ > 1 − e−1/2

22.3 Channel Dispersion


The strong converse tells us that log M ∗ (n, ) = nC + o(n) ∀ ∈ (0, 1). An engineer sees this, and
estimates log M ∗ ≈ nC. However, this doesn’t give any information about the dependence of log M ∗
on the error probability , which is hidden in the o(n) term. We unravel this in the following
theorem.
Theorem 22.2. Consider one of the following channels:

1. DMC

2. DMC with cost constraint

3. AWGN or parallel AWGN

The following expansion holds for a fixed 0 <  < 1/2 and n → ∞

log M ∗ (n, ) = nC − nV Q−1 () + O(log n)

where Q−1 is the inverse of the complementary standard normal CDF, the channel capacity is
C = I(X ∗ ; Y ∗ ) = E[i(X ∗ ; Y ∗ )], and the channel dispersion2 is V = Var[i(X ∗ ; Y ∗ )|X ∗ ].
2
There could be multiple capacity-achieving input distributions, in which case PX ∗ should be chosen as the one
that minimizes Var[i(X ∗ ; Y ∗ )|X ∗ ]. See [PPV10a] for more details.

240

Proof. For achievability, we have shown (Theorem 18.7) that log M ∗ (n, ) ≥ nC − nV Q−1 () by
refining the proof of the noisy channel coding theorem using the CLT.
The converse statement is log M ∗ ≤ − log β1− (PX n Y n , PX n (PY∗ )n ). For the BSC, we showed
that the RHS of the previous expression is
1 1 √ √
− log β1− (Bern(δ)n , Bern( )n ) = nd(δk ) + nV Q−1 () + o( n)
2 2
(see homework) where the dispersion is
" #
Bern(δ)
V = VarZ∼Bern(δ) log (Z) .
Bern( 12 )

The general proof is omitted.

Remark: This expansion only applies for certain channels (as described in the theorem). If,
for example, Var[i(X; Y )] = ∞, then the theorem need not hold and there are other stable (non-
Gaussian) distributions that we might converge to instead. Also notice that for DMC without cost
constraint
Var[i(X ∗ ; Y ∗ )|X ∗ ] = Var[i(X ∗ ; Y ∗ )]
since (capacity saddle point!) E[i(X ∗ ; Y ∗ )|X ∗ = x] = C for PX ∗ -almost all x.

22.3.1 Applications
As stated earlier, direct computation of M ∗ (n, ) by exhaustive search doubly exponential in
complexity, and thus is infeasible in most cases. However, we can get an easily computable
approximation using the channel dispersion via

log M ∗ (n, ) ≈ nC − nV Q−1 ()

Consider a BEC (n = 500, δ = 1/2) as an example of using this approximation. For this channel,
the capacity and dispersion are

C =1−δ
V = δ δ̄

Where δ̄ = 1 − δ. Using these values, our approximation for this BEC becomes
√ p
log M ∗ (500, 10−3 ) ≈ nC − nV Q−1 () = nδ̄ − nδ δ̄Q−1 (10−3 ) ≈ 215.5 bits

In the homework, for the BEC(500, 1/2) we obtained bounds 213 ≤ log M ∗ (500, 10−3 ) ≤ 217, so
this approximation falls in the middle of these bounds.
Examples of Channel Dispersion

241
For a few common channels, the dispersions are

BEC: V (δ) = δ δ̄ log2 2


δ̄
BSC: V (δ) = δ δ̄ log2
δ
P (P + 2) P (P + 2)
AWGN: V (P ) = log2 e (Real) log2 e (Complex)
2(P + 1)2 (P + 1)2
!2 +
XL 2 X L 2
Pj log e σj
Parallel AWGN: V (P, σ 2 ) = VAW GN ( 2 ) = 1 −
σ 2 T
j=1 j j=1
L
X
where |T − σj2 |+ = P is the water-filling solution of the parallel AWGN
j=1

Punchline: Although the only machinery needed for this approximation is the CLT, the results
produced are incredibly useful. Even though log M ∗ is nearly impossible to compute on its own, by
only finding C and V we are able to get a good approximation that is easily computable.

22.4 Normalized Rate


Suppose you’re given two codes k1 → n1 and k2 → n2 , how do you fairly compare them? Perhaps
they have the following waterfall plots
Pe k1 → n1 Pe k2 → n2

10−4 10−4

P∗ SNR P∗ SNR

After inspecting these plots, one may believe that the k1 → n1 code is better, since it requires
a smaller SNR to achieve the same error probability. However, there are many factors, such as
blocklength, rate, etc. that don’t appear on these plots. To get a fair comparison, we can use the
notion of normalized rate. To each (n, 2k , )-code, define

k k
Rnorm = ∗ ≈ p
log2 MAW GN (n, , P ) nC(P ) − nV (P )Q−1 ()

Take  = 10−4 , and P (SNR) according to the water fall plot corresponding to Pe = 10−4 , and we
can compare codes directly (see Fig. 22.1). This normalized rate gives another motivation for the
expansion given in Theorem 22.2.

22.5 Joint Source Channel Coding


Now we will examine a slightly different information transmission scenario called Joint Source
Channel Coding

242
Normalized rates of code families over AWGN, Pe=0.0001
1

0.95

0.9

0.85 Turbo R=1/3


Turbo R=1/6
Turbo R=1/4
0.8 Voyager
Normalized rate

Galileo HGA
Turbo R=1/2
0.75 Cassini/Pathfinder
Galileo LGA
Hermitian curve [64,32] (SDD)
0.7 Reed−Solomon (SDD)
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (List dec.)
0.65 ME LDPC R=1/2 (BP)

0.6

0.55

0.5 2 3 4 5
10 10 10 10
Blocklength, n
Normalized rates of code families over BIAWGN, Pe=0.0001
1

0.95

0.9
Turbo R=1/3
Turbo R=1/6
Turbo R=1/4
0.85
Voyager
Normalized rate

Galileo HGA
Turbo R=1/2
Cassini/Pathfinder
0.8
Galileo LGA
BCH (Koetter−Vardy)
Polar+CRC R=1/2 (L=32)
Polar+CRC R=1/2 (L=256)
0.75
Huawei NB−LDPC
Huawei hybrid−Cyclic
ME LDPC R=1/2 (BP)
0.7

0.65

0.6 2 3 4 5
10 10 10 10
Blocklength, n

Figure 22.1: Normalized rates for various codes. Plots generated via [Spe15].

243
Sk Encoder Xn Yn Decoder Ŝ k
Source (JSCC) Channel (JSCC)

Definition 22.1. For a Joint Source Channel Code

• Goal: P[S k 6= Ŝ k ] ≤ 

• Encoder: f : Ak → X n

• Decoder: g : Y n → Ak

• Fundamental Limit (Optimal probability of error): ∗JSCC (k, n) = inf f,g P[S k 6= Ŝ k ]
k
where the rate is R = n (symbol per channel use).
Note: In channel coding we are interested in transmitting M messages and all messages are born
equal. Here we want to convey the source realizations which might not be equiprobable (has
redundancy). Indeed, if S k is uniformly distributed on, say, {0, 1}k , then we are back to the channel
coding setup with M = 2k under average probability of error, and ∗JSCC (k, n) coincides with
∗ (n, 2k ) defined in Section 22.1.
Note: Here, we look for a clever scheme to directly encode k symbols from A into a length n channel
input such that we achieve a small probability of error over the channel. This feels like a mix of two
problems we’ve seen: compressing a source and coding over a channel. The following theorem shows
that compressing and channel coding separately is optimal. This is a relief, since it implies that we
do not need to develop any new theory or architectures to solve the Joint Source Channel Coding
problem. As far as the leading term in the asymptotics is concerned, the following two-stage scheme
is optimal: First use the optimal compressor to eliminate all the redundancy in the source, then use
the optimal channel code to add redundancy to combat the noise in the data transmission.
Theorem 22.3. Let the source {Sk } be stationary memoryless on a finite alphabet with entropy H.
Let the channel be stationary memoryless with finite capacity C. Then
(
→ 0 R < C/H
∗JSCC (nR, n) n → ∞.
6→ 0 R > C/H

Note: Interpretation: Each source symbol has information content (entropy) H bits. Each channel
use can convey C bits. Therefore to reliably transmit k symbols over n channel uses, we need
kH ≤ nC.

Proof. Achievability. The idea is to separately compress our source and code it for transmission.
Since this is a feasible way to solve the JSCC problem, it gives an achievability bound. This
separated architecture is

f1 f2 PY n |X n g2 g1
S k −→ W −→ X n −→ Y n −→ Ŵ −→ Ŝ k

Where we use the optimal compressor (f1 , g1 ) and optimal channel code (maximum probability of
error) (f2 , g2 ). Let W denote the output of the compressor which takes at most Mk values. Then
1
(From optimal compressor) log Mk > H + δ =⇒ P[Ŝ k 6= S k (W )] ≤  ∀k ≥ k0
k
1
(From optimal channel code) log Mk < C − δ =⇒ P[Ŵ 6= m|W = m] ≤  ∀m, ∀k ≥ k0
n

244
Using both of these,

P[S k 6= Ŝ k (Ŵ )] ≤ P[S k 6= Ŝ k , W = Ŵ ] + P[W 6= Ŵ ]


≤ P[S k 6= Ŝ k (W )] + P[W 6= Ŵ ] ≤  + 

And therefore if R(H + δ) < C − δ, then ∗ → 0. By the arbitrariness of δ > 0, we conclude the
weak converse for any R > C/H.
Converse: channel-substitution proof. Let QS k Ŝ k = US k PŜ k where US k is the uniform
distribution. Using data processing
1
D(PS k Ŝ k kQS k Ŝ k ) = D(PS k kUS k ) + D(PŜ|S k kPŜ |PS k ) ≥ d(1 − k )
|A|k
Rearranging this gives
1
I(S k ; Ŝ k ) ≥ d(1 − k ) − D(PS k kUS k )
|A|k
 log |A| + H(S k ) − k log |A|
≥ − log 2 + k¯
= H(S k ) − log 2 − k log |A|

Which follows from expanding out the terms. Now, normalizing and taking the sup of both sides
gives
1 1 k
sup I(X n ; Y n ) ≥ H(S k ) −  log |A| + o(1)
n Xn n n
letting R = k/n, this shows
RH − C
C ≥ RH − R log |A| =⇒  ≥ >0
R log |A|
where the last expression is positive when R > C/H.
Converse: usual proof. Any JSCC encoder/decoder induces a Markov chain

S k → X n → Y n → Ŝ k .

Applying data processing for mutual information

I(S k ; Ŝ k ) ≤ I(X n ; Y n ) ≤ sup I(X n ; Y n ) = nC.


PX n

On the other hand, since P[S k 6= Ŝ k ] ≤ n , Fano’s inequality (Theorem 5.3) yields

I(S k ; Ŝ k ) = H(S k ) − H(S k |Ŝ k ) ≥ kH − n log |A|k − log 2.

Combining the two gives


nC ≥ kH − n log |A|k − log 2.
Since R = nk , dividing both sides by n and sending n → ∞ yields
RH − C
lim inf n ≥ .
n→∞ R log |A|
Therefore n does not vanish if R > C/H.

245
§ 23. Channel coding with feedback

Criticism: Channels without feedback don’t exist (except for storage).


Motivation: Consider the communication channel of the downlink transmission from a satellite
to earth. Downlink transmission is very expensive (power constraint at the satellite), but the uplink
from earth to the satellite is cheap which makes virtually noiseless feedback readily available at
the transmitter (satellite). In general, channel with noiseless feedback is interesting when such
asymmetry exists between uplink and downlink.
In the first half of our discussion, we shall follow Shannon to show that feedback gains “nothing”
in the conventional setup, while in the second half, we look at situations where feedback gains a lot.

23.1 Feedback does not increase capacity for stationary


memoryless channels
Definition 23.1 (Code with feedback). An (n, M, )-code with feedback is specified by the encoder-
decoder pair (f, g) as follows:
• Encoder: (time varying)

f1 : [M ] → A
f2 : [M ] × B → A
..
.
fn : [M ] × B n−1 → A

• Decoder:

g : B n → [M ]

such that P[W 6= Ŵ ] ≤ .

246
Here the symbol transmitted at time t depends on both the message and the history of received
symbols:
Xt = ft (W, Y1t−1 )
Hence the probability space is as follows:

W ∼ uniform on [M ]
PY |X

X1 = f1 (W ) −→ Y1 


.. −→ Ŵ = g(Y n )
. 
PY |X 

Xn = fn (W, Y1n−1 ) −→ Yn

Definition 23.2 (Fundamental limits).

Mf∗b (n, ) = max{M : ∃(n, M, ) code with feedback.}


1
Cf b, = lim inf log Mf∗b (n, )
n→∞ n
Cf b = lim Cf b, (Shannon capacity with feedback)
→0

Theorem 23.1 (Shannon 1956). For a stationary memoryless channel,

Cf b = C = Ci = sup I(X; Y )
PX

Proof. Achievability: Although it is obvious that Cf b ≥ C, we wanted to demonstrate that in fact


constructing codes achieving capacity with full feedback can be done directly, without appealing to a
(much harder) problem of non-feedback codes. Let πt (·) , PW |Y t (·|Y t ) with the (random) posterior
distribution after t steps. It is clear that due to the knowledge of Y t on both ends, transmitter and
receiver have perfectly synchronized knowledge of πt . Now consider how the transmission progresses:
1
1. Initialize π0 (·) = M

2. At (t+1)-th step, having knowledge of πt all messages are partitioned into classes Pa , according
to the values ft+1 (·, Y t ):

Pa , {j ∈ [M ] : ft+1 (j, Y t ) = a} a ∈ A.

Then transmitter, possessing the knowledge of the true message W , selects a letter Xt+1 =
ft+1 (W, Y t ).

3. Channel perturbs Xt+1 into Yt+1 and both parties compute the updated posterior:

PY |X (Yt+1 |ft+1 (j, Y t ))


πt+1 (j) , πt (j)Bt+1 (j) , Bt+1 (j) , P .
a∈A πt (Pa )

Notice that (this is the crucial part!) the random multiplier satisfies:
XX PY |X (y|a)
E[log Bt+1 (W )|Y t ] = πt (Pa ) log P = I(π̃t , PY |X ) (23.1)
a∈A y∈B a∈A πt (Pa )a

where π̃t (a) , πt (Pa ) is a (random) distribution on A.

247
The goal of the code designer is to come up with such a partitioning {Pa , a ∈ A} that the speed
of growth of πt (W ) is maximal. Now, analyzing the speed of growth of a random-multiplicative
process is best done by taking logs:
t
X
log πt (j) = log Bs + log π0 (j) .
s=1

Intutively, we expect that the process log πt (W ) resembles a random walk starting from − log M and
having a positive drift. Thus to estimate the time it takes for this process to reach value 0 we need
to estimate the upward drift. Appealing to intuition and the law of large numbers we approximate
t
X
log πt (W ) − log π0 (W ) ≈ E[log Bs ] .
s=1

Finally, from (23.1) we conclude that the best idea is to select partitioning at each step in such a
way that π̃t ≈ PX∗ (caid) and this obtains

log πt (W ) ≈ tC − log M ,

implying that the transmission terminates in time ≈ logCM . The important lesson here is the
following: The optimal transmission scheme should map messages to channel inputs in such a way
that the induced input distribution PXt+1 |Y t is approximately equal to the one maximizing I(X; Y ).
This idea is called posterior matching and explored in detail in [SF11].1
Converse: we are left to show that Cf b ≤ Ci .
Recall the key in proving weak converse for channel coding without feedback: Fano’s inequality
plus the graphical model
W → X n → Y n → Ŵ . (23.2)
Then
−h() + ¯ log M ≤ I(W ; Ŵ ) ≤ I(X n ; Y n ) ≤ nCi .
With feedback the probabilistic picture becomes more complicated as the following figure shows
for n = 3 (dependence introduced by the extra squiggly arrows):

X1 Y1 X1 Y1

W X2 Y2 Ŵ W X2 Y2 Ŵ

X1 Y1 X1 Y1
without feedback with feedback
1
Note that the magic of Shannon’s theorem is that this optimal partitioning can also be done blindly, i.e. it is

possible to preselect partitions Pa in a way that is independent of πt (but dependent on t) and so that πt (Pa ) ≈ PX (a)
with overwhelming probability and for all t ∈ [n].

248
So, while the Markov chain realtion in (23.2) is still true, the input-output relation is no longer
memoryless2
n
Y
PY n |X n (y n |xn ) 6= PY |X (yj |xj ) (!)
j=1
There is still a large degree of independence in the channel, though. Namely, we have
(Y t−1 , W ) →Xt → Yt , t = 1, . . . , n (23.3)
W → Y n → Ŵ (23.4)
Then
−h() + ¯ log M ≤ I(W ; Ŵ ) (Fano)
n
≤ I(W ; Y ) (Data processing applied to (23.4))
Xn
= I(W ; Yt |Y t−1 ) (Chain rule)
t=1
Xn
≤ I(W, Y t−1 ; Yt ) (I(W ; Yt |Y t−1 ) = I(W, Y t−1 ; Yt ) − I(Y t−1 ; Yt ))
t=1
Xn
≤ I(Xt ; Yt ) (Data processing applied to (23.3))
t=1
≤ nCt
The following result (without proof) suggests that feedback does not even improve the speed of
approaching capacity either (under fixed-length block coding) and can at most improve smallish
log n terms:
Theorem 23.2 (Dispersion with feedback). For weakly input-symmetric DMC (e.g. additive noise,
BSC, BEC) we have: √
log Mf∗b (n, ) = nC − nV Q−1 () + O(log n)
(The meaning of this is that for such channels feedback can at most improve smallish log n
terms.)

23.2* Alternative proof of Theorem 23.1 and Massey’s directed


information
The following alternative proof emphasizes on data processing inequality and the comparison idea
(auxiliary channel) as in Theorem 19.1.
Proof. It is obvious that Cf b ≥ C, we are left to show that Cf b ≤ Ci .
1. Recap of the steps of showing the strong converse of C ≤ Ci in the last lecture: take any
(n, M, ) code, compare the two distributions:
P : W → X n → Y n → Ŵ (23.5)
Q : W → Xn Y n → Ŵ (23.6)
two key observations:
2
This is easy to see from the example where X2 = Y1 and thus PY1 |X 2 has no randomness.

249
⊥ W , so that Q[W = Ŵ ] =
a) Under Q, W ⊥ 1
M while P[W = Ŵ ] ≥ 1 − .
b) The two graphical models give the factorization:

PW,X n ,Y n ,Ŵ = PW,X n PY n |X n PŴ |Y n , QW,X n ,Y n ,Ŵ = PW,X n PY n PŴ |Y n

thus D(P kQ) = I(X n ; Y n ) measures the information flow through the links X n → Y n .

n
1 dpi mem−less,stat X
−h() + ¯ log M = d(1 − k ) ≤ D(P kQ) = I(X n ; Y n ) = I(X; Y ) ≤ nCi
M
i=1
(23.7)

2. Notice that when feedback is present, X n → Y n is not memoryless due to the transmission
protocol, let’s unfold the probability space over time to see the dependence. As an example,
the graphical model for n = 3 is given below:

If we define Q similarly as in the case without feedback, we will encounter a problem at the
second
Pn last inequality in (23.7), as with feedback I(X n ; Y n ) can be significantly larger than
n n
i=1 I(X; Y ). Consider the example where X2 = Y1 , we have I(X ; Y ) = +∞ independent
of I(X; Y ).
We also make the observation that if Q is defined in (23.6), D(P kQ) = I(X n ; Y n ) measures
the information flow through all the 6→ and links. This motivates us to find a proper Q
such that D(P kQ) only captures the information flow through all the 6→ links {Xi → Yi : i =
1, . . . , n}, so that D(P kQ) closely relates to nCi , while still guarantees that W ⊥
⊥ W , so that
Q[W 6= Ŵ ] = M . 1

3. Formally, we shall restrict QW,X n ,Y n ,Ŵ ∈ Q, where Q is the set of distributions that can be
factorized as follows:

QW,X n ,Y n ,Ŵ = QW QX1 |W QY1 QX2 |W,Y1 QY2 |Y1 · · · QXn |W,Y n−1 QYn |Y n−1 QŴ |Y n (23.8)
PW,X n ,Y n ,Ŵ = PW PX1 |W PY1 |X1 PX2 |W,Y1 PY2 |X2 · · · PXn |W,Y n−1 PYn |Xn PŴ |Y n (23.9)

250
⊥ W under Q: W and Ŵ are d-separated by X n .
Verify that W ⊥
Notice that in the graphical models, when removing 9 we also added the directional links
between the Yi ’s, these links serve to maximally preserve the dependence relationships between
variables when 9 are removed, so that Q is the “closest” to P while W ⊥ ⊥ W is satisfied.
1
Now we have that for Q ∈ Q, d(1 − k M ) ≤ D(P kQ), in order to obtain the least upper
bound, in Lemma 23.1 we shall show that:
n
X
inf D(PW,X n ,Y n ,Ŵ kQW,X n ,Y n ,Ŵ ) = I(Xt ; Yt |Y t−1 )
Q∈Q
t=1
n
X
= EY t−1 [I(PXt |Y t−1 , PY |X )]
t=1
Xn
≤ I(EY t−1 [PXt |Y t−1 ], PY |X ) (concavity of I(·, PY |X ))
t=1
Xn
= I(PXt , PY |X )
t=1
≤nCi .

Following the same procedure as in (a) we have

nC + h() C
−h() + ¯ log M ≤ nCi ⇒ log M ≤ ⇒ Cf b, ≤ ⇒ Cf b ≤ C.
1− 1−

4. Notice that the above proof is also valid even when cost constraint is present.

Lemma 23.1.
n
X
inf D(PW,X n ,Y n ,Ŵ kQW,X n ,Y n ,Ŵ ) = I(Xt ; Yt |Y t−1 ) (23.10)
Q∈Q
t=1
~ n ; Y n ), directed information)
(, I(X

Proof. By chain rule, we can show that the minimizer Q ∈ Q must satisfy the following equalities:

QX,W = PX,W ,
QXt |W,Y t−1 = PXt |W,Y t−1 , (check!)
QŴ |Y n = PW |Y n .

and therefore

inf D(PW,X n ,Y n ,Ŵ kQW,X n ,Y n ,Ŵ )


Q∈Q

= D(PY1 |X1 kQY1 |X1 ) + D(PY2 |X2 ,Y1 kQY2 |Y1 |X2 , Y1 ) + · · · + D(PYn |Xn ,Y n−1 kQYn |Y n−1 |Xn , Y n−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Y n−1 )

251
23.3 When is feedback really useful?
Theorems 23.1 and 23.2 state that feedback does not improve communication rate neither asymptot-
ically nor for moderate blocklengths. In this section, we shall examine three cases where feedback
turns out to be very useful.

23.3.1 Code with very small (e.g. zero) error probability


Theorem 23.3 (Shannon ’56). For any DMC PY |X ,

1
Cf b,0 = max min log (23.11)
PX y∈B PX (Sy )
where

Sy = {x ∈ A : PY |X (y|x) > 0}

denotes the set of input symbols that can lead to the output symbol y.
Note: For stationary memoryless channel,
def. def. Thm 23.1 Shannon
C0 ≤ Cf b,0 ≤ Cf b = lim Cf b, = C = lim C = Ci = sup I(X; Y )
→0 →0 PX

All capacity quantities above are defined with (fixed-length) block codes.

Note:

1. In DMC for both zero-error capacities (C0 and Cf b,0 ) only the support of the transition
matrix PY |X , i.e., whether PY |X (b|a) > 0 or not, matters. The value of PY |X (b|a) > 0
is irrelevant. That is, C0 and Cf b,0 are functions of a bipartite graph between input and
output alphabets. Furthermore, the C0 (but not Cf b,0 !) is a function of the confusability
graph – a simple undirected graph on A with a 6= a0 connected by an edge iff ∃b ∈ B
s.t. PY |X (b|a)PY |X (b|a0 ) > 0.

2. That Cf b,0 is not a function of the confusability graph alone is easily seen from comparing the
polygon channel (next remark) with L = 3 (for which Cf b,0 = log 23 ) and the useless channel
with A = {1, 2, 3} and B = {1} (for which Cf b,0 = 0). Clearly in both cases confusability
graph is the same – a triangle.

3. Usually C0 is very hard to compute, but Cf b,0 can be obtained in closed form as in (23.11).
Example: (Polygon channel)

1
5

4
3

Bipartite graph Confusability graph

252
• Zero-error capacity C0 :
– L = 3: C0 = 0
– L = 5: C0 = 12 log 5 (Shannon ’56-Lovasz ’79).
Achievability:
a) blocklength one: {1, 3}, rate = 1 bit.
b) blocklength two: {(1, 1), (2, 3), (3, 5), (4, 2), (5, 4)}, rate = 12 log 5 bit – optimal!
– L = 7: 3/5 log 7 ≤ C0 ≤ log 3.32 (Exact value unknown to this day)
– Even L = 2k: C0 = log L2 for all k (Why? Homework.).
– Odd L = 2k + 1: C0 = log L2 + o(1) as k → ∞ (Bohman ’03)
• Zero-error capacity with feedback (proof: exercise!)
L
Cf b,0 = log , ∀L,
2
which can be strictly bigger than C0 .

4. Notice that Cf b,0 is not necessarily equal to Cf b = lim→0 Cf b, = C. Here is an example when

C0 < Cf b,0 < Cf b = C

Example:

Then

C0 = log 2
2
Cf b,0 = max − log max( δ, 1 − δ) (PX∗ = (δ/3, δ/3, δ/3, δ̄))
δ 3
5 3
= log > C0 (δ ∗ = )
2 5
On the other hand, Shannon capacity C = Cf b can be made arbitrarily close to log 4 by
picking the cross-over probability arbitrarily close to zero, while the confusability graph stays
the same.

Proof of Theorem 23.3. 1. Fix any (n, M, 0)-code. Denote the confusability set of all possible
messages that could have produced the received signal y t = (y1 , . . . , yt ) for all t = 0, 1, . . . , n
by:

Et (y t ) , {m ∈ [M ] : f1 (m) ∈ Sy1 , f2 (m, y1 ) ∈ Sy2 , . . . , fn (m, y t−1 ) ∈ Syt }

Notice that zero-error means no ambiguity:

 = 0 ⇔ ∀y n ∈ B n , |En (y n )| = 1 or 0. (23.12)

253
2. The key quantities in the proof are defined as follows:
θf b = min max PX (Sy ),
PX y∈B

PX∗ = argmin max PX (Sy )


PX y∈B

The goal is to show


1
Cf b,0 = log .
θf b
By definition, we have
∀PX , ∃y ∈ B, such that PX (Sy ) ≥ θf b (23.13)
Notice the minimizer distribution PX∗ is usually not the caid in the usual sense. This definition
also sheds light on how the encoding and decoding should be proceeded and serves to lower
bound the uncertainty reduction at each stage of the decoding scheme.
3. “≤” (converse): Let PX n be he joint distribution of the codewords. Denote E0 = [M ] – original
message set.
t = 1: For PX1 , by (23.13), ∃y1∗ such that:
|{m : f1 (m) ∈ Sy1∗ }| |E1 (y1∗ )|
PX1 (Sy1∗ ) = = ≥ θf b .
|{m ∈ [M ]}| |E0 |
t = 2: For PX2 |X1 ∈Sy∗ , by (23.13), ∃y2∗ such that:
1

|{m : f1 (m) ∈ Sy1∗ , f2 (m, y1∗ ) ∈ Sy2∗ }| |E2 (y1∗ , y2∗ )|


PX2 (Sy2∗ |X1 ∈ Sy1∗ ) = = ≥ θf b ,
|{m : f1 (m) ∈ Sy1∗ }| |E1 (y1∗ )|
t = n: Continue the selection process up to yn∗ which satisfies that:
|En (y1∗ , . . . , yn∗ )|
PXn (Syn∗ |Xt ∈ Syt∗ for t = 1, . . . , n − 1) = ∗ )| ≥ θf b .
|En−1 (y1∗ , . . . , yn−1
Finally, by (23.12) and the above selection procedure, we have
1 |En (y1∗ , . . . , yn∗ )|
≥ ≥ θfnb
M |E0 |
⇒ M ≤ −n log θf b
⇒ Cf b,0 ≤ − log θf b

4. “≥” (achievability)
Let’s construct a code that achieves (M, n, 0).

254
The above example with |A| = 3 illustrates that the encoder f1 partitions the space of all
messages to 3 groups. The encoder f1 at the first stage encodes the groups of messages into
a1 , a2 , a3 correspondingly. When channel outputs y1 and assume that Sy1 = {a1 , a2 }, then the
decoder can eliminate a total number of M PX∗ (a3 ) candidate messages in this round. The
“confusability set” only contains the remaining M PX∗ (Sy1 ) messages. By definition of PX∗ we
know that M PX∗ (Sy1 ) ≤ M θf b . In the second round, f2 partitions the remaining messages
into three groups, send the group index and repeat.
By similar arguments, each interaction reduces the uncertainty by a factor of at least θf b .
After n iterations, the size of “confusability set” is upper bounded by M θfnb , if M θfnb ≤ 1,3
then zero error probability is achieved. This is guaranteed by choosing log M = −n log θf b .
Therefore we have shown that −n log θf b bits can be reliably delivered with n + O(1) channel
uses with feedback, thus

Cf b,0 ≥ − log θf b

23.3.2 Code with variable length


Consider the example of BEC(δ) with feedback, send k bits in the following way: repeat sending
each bit until it gets through the channel correctly. The expected number of channel uses for sending
k bits is given by
k
l = E[n] =
1−δ
We state the result for variable-length feedback (VLF) code without proof:

log MV∗ LF (l, 0) ≥ lC



Notice that compared to the scheme without feedback, there is the improvement of nV Q−1 () in

the order of O( n), which is stronger than the result in Theorem 23.2.
This is also true in general [PPV10b]:

lC
log MV∗ LF (l, ) = + O(log l)
1−
Example: For BSC(0.11), without feedback, n = 3000 is needed to achieve 90% of capacity C,
while with VLF code l = En = 200 is enough to achieve that.

23.3.3 Code with variable power


Elias’ scheme To send a number A drawn from a Gaussian distribution N (0, Var A), Elias’
scheme uses linear processing.
AWGN setup:

Yk = Xk + Zk , Zk ∼ N (0, σ 2 ) i.i.d.
E[Xk2 ] ≤ P, power constraint in expectation
3 ∗
Some rounding-off errors need to be corrected in a few final steps (because PX may not be closely approximable
when very few messages are remaining). This does not change the asymptotics though.

255
Note:
Pn If we insist the codeword satisfies power constraint almost surely instead on average, i.e.,
2
k=1 Xk ≤ nP a.s., then the scheme below does not work!

According to the orthogonality principle of the mininum mean-square estimation (MMSE) of A


at receiver side in every step:
A = Ân + Nn , Nn ⊥ Y n .
Morever, since all operations are lienar and everything is jointly Gaussian, Nn ⊥ ⊥ Y n . Since
Xn ∝ Nn−1 ⊥ ⊥Y n−1 , the codeword we are sending at each time slot is independent of the history of
the channel output (”innovation”), in order to maximize information transfer.
Note that Y n → Ân → A, and the optimal estimator Ân (a linear combination of Y n ) is a
sufficient statistic of Y n for A under Gaussianity. Then
I(A; Y n ) =I(A; Ân , Y n )
= I(A; Ân ) + I(A; Y n |Ân )
= I(A; Ân )
1 Var(A)
= log .
2 Var(Nn )
where the last equality uses the fact that N follows a normal distribution. Var(Nn ) can be computed
directly using standard linear MMSE results. Instead, we determine it information theoretically:
Notice that we also have
I(A; Y n ) = I(A; Y1 ) + I(A; Y2 |Y1 ) + · · · + I(A; Yn |Y n−1 )
= I(X1 ; Y1 ) + I(X2 ; Y2 |Y1 ) + · · · + I(Xn ; Yn |Y n−1 )
key
= I(X1 ; Y1 ) + I(X2 ; Y2 ) + · · · + I(Xn ; Yn )
1
= n log(1 + P ) = nC
2

256
Therefore, with Elias’ scheme of sending A ∼ N (0, Var A), after the n-th use of the AWGN(P )
channel with feedback,
 P n
Var Nn = Var(Ân − A) = 2−2nC Var A = Var A,
P + σ2
which says that the reduction of uncertainty in the estimation is exponential fast in n.

Schalkwijk-Kailath Elias’ scheme can also be used to send digital data. Let W ∼ uniform on
2 2k
M -PAM constellation in ∈ [−1, 1], i.e., {−1, −1 + M , · · · , −1 + M , · · · , 1}. In the very first step W
is sent (after scaling to satisfy the power constraint):

X0 = P W, Y0 = X0 + Z0

Since Y0 and X0 are both known at the encoder, it can compute Z0 . Hence, to describe W it is
sufficient for the encoder to describe the noise realization Z0 . This is done by employing the Elias’
scheme (n − 1 times). After n − 1 channel uses, and the MSE estimation, the equivalent channel
output:

Ye0 = X0 + Ze0 , e0 ) = 2−2(n−1)C


Var(Z

Finally, the decoder quantizes Ye0 to the nearest PAM point. Notice that
  " √ # √ !
1 P 2 (n−1)C P
 ≤ P |Ze0 | > =P 2 −(n−1)C
|Z| > = 2Q
2M 2M 2M

P 
⇒ log M ≥ (n − 1)C + log − log Q−1 ( )
2 2
= nC + O(1).

Hence if the rate is strictly less than capacity, the error probability decays doubly exponentially

fast as n increases. More importantly, we gained an n term in terms of log M , since for the case
without feedback we have

log M ∗ (n, ) = nC − nV Q−1 () + O(log n) .

Example: = 1⇒ channel capacity C = 0.5 bit per channel use. To achieve error probability
 P(n−1)C (n−1)C
−3
10 , 2Q 2
2M ≈ 10−3 , so e 2M ≈ 3, and lognM ≈ n−1 log 8
n C − n . Notice that the capacity is
achieved to within 99% in as few as n = 50 channel uses, whereas the best possible block codes
without feedback require n ≈ 2800 to achieve 90% of capacity.

Take-away message:
Feedback is best harnessed with adaptive strategies. Although it does not increase capacity
under block coding, feedback greatly boosts reliability as well as reduces coding complexity.

257
§ 24. Capacity-achieving codes via Forney concatenation

Shannon’s Noisy Channel Theorem assures us the existence of capacity-achieving codes. However,
exhaustive search for the code has double-exponential complexity: Search over all codebook of size
2nR over all possible |X |n codewords.
Plan for today: Constructive version of Shannon’s Noisy Channel Theorem. The goal is to show
that for BSC, it is possible to achieve capacity in polynomial time. Note that we need to consider
three aspects of complexity

• Encoding

• Decoding

• Construction of the codes

24.1 Error exponents


Recall we have defined the fundamental limit

M ∗ (n, ) = max{M : ∃(n, M, )-code}

For notational convenience, let us define its functional inverse

∗ (n, M ) = inf{ : ∃(n, M, )-code}

Shannon’s theorem shows that for stationary memoryless channels, n , ∗ (n, exp(nR)) → 0 for any
R < C = supX I(X; Y ). Now we want to know how fast it goes to zero as n → ∞. It turns out
the speed is exponential, i.e., n ≈ exp(−nE(R)) for some error exponent E(R) as a function R,
which is also known as the reliability function of the channel. Determining E(R) is one of the most
long-standing open problems in information theory. What we know are

• Lower bound on E(R) (achievability): Gallager’s random coding bound (which analyzes the
ML decoder, instead of the suboptimal decoder as in Shannon’s random coding bound or DT
bound).

• Upper bound on E(R) (converse): Sphere-packing bound (Shannon-Gallager-Berlekamp), etc.

It turns out there exists a number Rcrit ∈ (0, C), called the critical rate, such that the lower and
upper bounds meet for all R ∈ (Rcrit , C), where we obtain the value of E(R). For R ∈ (0, Rcrit ), we
do not even know the existence of the exponent!
Deriving these bounds is outside the scope of this lecture. Instead, we only need the positivity
of error exponent, i.e., for any R < C, E(R) > 0. On the other hand, it is easy to see that
E(C−) = 0 as a consequence of weak converse. Since as the rate approaches capacity from below,
the communication becomes less reliable. The next theorem is a simple application of large deviation.

258
Theorem 24.1. For any DMC, for any R < C = supX I(X; Y ),

∗ (n, exp(nR)) ≤ exp(−nE(R)), for some E(R) > 0.

Proof. Fix R < C so that C − R > 0. Let PX∗ be the capacity-achieving input distribution, i.e.,
C = I(X ∗ ; Y ∗ ). Recall Shannon’s random coding bound (DT/Feinstein work as well):

 ≤ P (i(X; Y ) ≤ log M + τ ) + exp(−τ ).

As usual, we apply this bound with iid PX n = (PX∗ )n , log M = nR and τ = n(C−R)2 , to conclude the
achievability of    
1 n n C +R n(C − R)
n ≤ P i(X ; Y ) ≤ + exp − .
n 2 2
P
Since i(X n ; Y n ) = i(Xk ; Yk ) is an iid sum, and Ei(X; Y ) = C > (C + R)/2, the first term is
upper bounded by exp(−nψT∗ ( R+C 2 )) where T = i(X; Y ). The proof is complete since n is smaller
than the sum of two exponentially small terms.

Note: Better bound can be obtained using DT bound. But to get the best lower bound on E(R)
we know (Gallager’s random coding bound), we have to analyze the ML decoder.

24.2 Achieving polynomially small error probability


In the sequel we focus on BSC channel with cross-over probability δ, which is an additive-noise DMC.
Fix R < C = 1 − h(δ) bits. Let the block length be n. Our goal is to achieve error probability
n ≤ n−α for arbitrarily large α > 0 in polynomial time.
To this end, fix some b > 1 to be specified later and pick m = b log n and divide the block
n
into m sub-blocks of m bits. Applying Theorem 24.1, we can find [later on how to find] an
(m, exp(Rm), m )-code such that

m ≤ exp(−mE(R)) = n−bE(R)

where E(R) > 0. Apply this code to each m-bit sub-block and apply ML decoding to each block.
n
The encoding/decoding complexity is at most m exp(O(m)) = nO(1) . To analyze the probability of
error, use union bound:
n
Pe ≤ m ≤ n−bE(R)+1 ≤ n−α ,
m
α+1
if we choose b ≥ E(R) .
Remark 24.1. The final question boils down to how to find the shorter code of blocklength m in
poly(n)-time. This will be done if we can show that we can find good code (satisfying the Shannon
random coding bound) for BSC of blocklenth m in exponential time. To this end, let us go through
the following strategies:

1. Exhaustive search: A codebook is a subset of cardinality 2Rm out of 2m possible codewords.
m
Total number of codebooks: 22Rm = exp(Ω(m2Rm )) = exp(Ω(nc log n)). The search space is
too big.

2. Linear codes: In Lecture 18 we have shown that for additive-noise channels on finite fields we
can focus on linear codes. For BSC, each linear code is parameterized by a generator matrix,
2
with Rm2 entries. Then there are a total of 2Rm = nΩ(log n) – still superpolynomial and we
cannot afford the search over all linear codes.

259
3. Toeplitz generator matrices: In homework we see that it does not lose generality to focus on
linear codes with Toeplitz generator matrices, i.e., G such that Gij = Gi−1,j−1 for all i, j > 1.
Toeplitz matrices are determined by diagonals. So there are at most 22m = nO(1) and we can
find the optimal one by exhaustive search in poly(n)-time.

Since the channel is additive-noise, linear codes + syndrome decoder leads to the same maximal
probability of error as average (Lecture 18).

24.3 Concatenated codes


Forney introduced the idea of concatenated codes in 1965 to build longer codes from shorter codes
with manageable complexity. It consists of an inner code and an outer code:
k
1. Cin : {0, 1}k → {0, 1}n , with rate n
K
2. Cout : B K → B N for some alphabet B of cardinality 2k , with rate N.

The concatenated code C : {0, 1}kK → {0, 1}nN works as follows (Fig. 24.1):

1. Collect the kK message bits into K symbols in the alphabet B, apply Cout componentwise to
get a vector in B N

2. Map each symbol in B into k bits and apply Cin componentwise to get an nN -bit codeword.
kK
The rate of the concatenated code is the product of the rates of the inner and outer codes: R = nN.

Cin Din

Cin Din
kK k n k kK
Cout Cin Din Dout

Cin Din

Cin Din
Figure 24.1: Concatenated code, where there are N inner encoder-decoder pairs.

24.4 Achieving exponentially small error probability


Forney proposed the following idea:

• Use an optimal code as the inner code

• Use a Reed-Solomon code as the outer code which can correct a constant fraction of errors.

260
Reed-Solomon (RS) codes are linear codes from FK q → Fq where the block length N = q − 1
N

and the message length is K. Similar to P the Reed-Muller code, the RS code treats the input
(a0 , a1 , . . . , aK−1 ) as a polynomial p(x) = K−1
i=0 ai z i over F of degree at most K − 1, and encodes it
q
by its values at all non-zero elements. Therefore the RS codeword is a vector (p(α) : α ∈ Fq \{0}) ∈
FNq . Therefore the generator matrix of RS code is a Vandermonde matrix.
The RS code has the following advantages:

1. The minimum distance of RS code N − K + 1. So if we choose K = (1 − )N , then RS code


can correct N
2 errors.

2. The encoding and decoding (e.g., Berlekamp-Massey decoding algorithm) can be implemented
in poly(N ) time.

In fact, as we will see later, any efficient code which can correct a constant fraction of errors will
suffice as the outer code for our purpose.
Now we show that we can achieve any rate below capacity and exponentially small probability
of error in polynomial time: Fix η,  > 0 arbitrary.

• Inner code: Let k = (1 − h(δ) − η)n. By Theorem 24.1, there exists a Cin : {0, 1}k → {0, 1}n ,
which is a linear (n, 2k , n )-code and maximal error probability n ≤ 2−nE(η) . By Remark 24.1,
Cin can be chosen to be a linear code with Toeplitz generator matrix, which can be found in
2n time. The inner decoder is ML, which we can afford since n is small.

• Outer code: We pick the RS code with field size q = 2k with blocklength N = 2k − 1. Pick
the number of message bits to be K = (1 − )N . Then we have Cout : FK
2k
→ FN
2k
.

Then we obtain a concatenated code C : {0, 1}kK → {0, 1}nN with blocklength L = nN = n2Cn for
some constant C and rate R = (1 − )(1 − h(δ) − η). It is clear that the code can be constructed in
2n = poly(L) time and all encoding/decoding operations are poly(L) time.
Now we analyze the probability of error: Let us conditioned on the message bits (input to Cout ).
Since the outer code can correct N
2 errors, an error happens only if the number of erroneous inner
N
encoder-decoder pairs exceeds 2 . Since the channel is memoryless, each of the N pairs makes an
error independently1 with probability at most n . Therefore the number of errors is stochastically
smaller than Bin(N, n ), and we can upper bound the total probability of error using Chernoff
bound:
 
N
Pe ≤ P Bin(N, n ) ≥ ≤ exp (−N d(/2kn )) = exp (−Ω(N log N )) = exp(−Ω(L)).
2

where we have used  = Ω(1) and hence n ≤ exp(−Ω(n)) and d(/2kn ) ≥ 2 log 2n = Ω(n) =
Ω(log N ).

Note: For more details see the excellent exposition by Spielman [Spi97]. For modern constructions
using sparse graph codes which achieve the same goal in linear time, see, e.g., [Spi96].

1
Here controlling the maximal error probability of inner code is the key. If we only have average error probability,
then given a uniform distributed input to the RS code, the output symbols (which are the inputs to the inner encoders)
need not be independent, and Chernoff bound is not necessarily applicable.

261
Part V

Lossy data compression

262
§ 25. Rate-distortion theory

Big picture so far:

1. Lossless data compression: Given a discrete ergodic source S k , we know how to encode to
pure bits W ∈ [2k ].

2. Binary HT: Given two distribution P and Q, we know how to distinguish them optimally.

3. Channel coding: How to send bits over a channel [2k ] 3 W → X → Y .

4. JSCC: how to send discrete data optimally over a noisy channel.

Next topic, lossy data compression: Given X, find a k-bit representation W , X → W → X̂,
such that X̂ is a good reconstruction of X.
Real-world examples: codecs consist of a compressor and a decompressor

• Image: JPEG...

• Audio: MP3, CD...

• Video: MPEG...

25.1 Scalar quantization


Problem: Data isn’t discrete! Often, a signal (function) comes from voltage levels or other
continuous quantities. The question of how to map (naturally occurring) continuous time/analog
signals into (electronics friendly) discrete/digital signals is known as quantization, or in information
theory, as rate distortion theory.

Domain Range
Continuous Analog
Sampling time Quantization
Signal
Discrete Digital
time

We will look at several ways to do quantization in the next few sections.

263
25.1.1 Scalar Uniform Quantization
The idea of qunatizing an inherently continuous-valued signal was most explicitly expounded in the
patenting of Pulse-Coded Modulation (PCM) by A. Reeves, cf. [Ree65] for some interesting historical
notes. His argument was that unlike AM and FM modulation, quantized (digital) signals could be
sent over long routes without the detrimental accumulation of noise. Some initial theoretical analysis
of the PCM was undertaken in 1947 by Oliver, Pierce, and Shannon (same Shannon), cf. [OPS48].
For a random variable X ∈ [−A/2, A/2] ⊂ R, the scalar uniform quantizer qU (X) with N
quantization points partitions the interval [−A/2, A/2] uniformly
N equally spaced points

−A A
2 2

where the points are in { −A kA


2 + N , k = 0, . . . , N − 1}.
What is the quality (or fidelity) of this quantization? Most of the time, mean squared error is
used as the quality criterion:

D(N ) = E|X − qU (X)|2

where D denotes the average distortion. Often R = log2 N is used instead of N , so that we think
about the number of bits we can use for quantization instead of the number of points. To analyze
this scalar uniform quantizer, we’ll look at the high-rate regime (R  1). The key idea in the high
rate regime is that (assuming a smooth density PX ), each quantization interval ∆j looks nearly flat,
so conditioned on ∆j , the distribution is accurately approximately by a uniform distribution.
Nearly flat for
large partition

∆j

Let cj be the j-th quantization point, and ∆j be the j-th quantization interval. Here we have
N
X
DU (R) = E|X − qU (X)| = 2
E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ] (25.1)
j=1
N
X |∆j |2
(high rate approximation) ≈ P[X ∈ ∆j ] (25.2)
12
j=1
A 2
(N ) A2 −2R
= = 2 , (25.3)
12 12
where we used the fact that the variance of Uniform[−a, a] = a2 /3.
How much do we gain per bit?
Var(X)
10 log10 SN R = 10 log10
E|X − qU (X)|2
12V ar(X)
= 10 log10 + (20 log10 2)R
A2
= constant + (6.02dB)R

264
For example, when X is uniform on [− A2 , A2 ], the constant is 0. Every engineer knows the rule
of thumb “6dB per bit”; adding one more quantization bit gets you 6 dB improvement in SNR.
However, here we can see that this rule of thumb is valid only in the high rate regime. (Consequently,
widely articulated claims such as “16-bit PCM (CD-quality) provides 96 dB of SNR” should be
taken with a grain of salt.)
Note: The above deals with X with a bounded support. When X is unbounded, a wise thing to do
is to allocate the quantization points to the range of values that are more likely and saturate the
large values at the dynamic range of the quantizer. Then there are two contributions, known as
the granular distortion and overload distortion. This leads us to the question: Perhaps uniform
quantization is not optimal?

25.1.2 Scalar Non-uniform Quantization


Since our source has density pX , a good idea might be to use more quantization points where pX is
larger, and less where pX is smaller.

Often the way such quantizers are implemented is to take a monotone transformation of the source
f (X), perform uniform quantization, then take the inverse function:

f
X U
q qU (25.4)
X̂ qU (U )
f −1

i.e., q(X) = f −1 (qU (f (X))). The function f is usually called the compander (compressor+expander).
One of the choice of f is the CDF of X, which maps X into uniform on [0, 1]. In fact, this compander
architecture is optimal in the high-rate regime (fine quantization) but the optimal f is not the CDF
(!). We defer this discussion till Section 25.1.4.
In terms of practical considerations, for example, the human ear can detect sounds with volume
as small as 0 dB, and a painful, ear-damaging sound occurs around 140 dB. Achieveing this is
possible because the human ear inherently uses logarithmic companding function. Furthermore,
many natural signals (such as differences of consecutive samples in speech or music (but not samples
themselves!)) have an approximately Laplace distribution. Due to these two factors, a very popular
and sensible choice for f is the µ-companding function

f (X) = sign(X) ln(1+µ|X|)


ln(1+µ)

265
which compresses the dynamic range, uses more bits for smaller |X|’s, e.g. |X|’s in the range of
human hearing, and less quantization bits outside this region. This results in the so-called µ-law
which is used in the digital telecommunication systems in the US, while in Europe a slightly different
compander called the A-law is used.

25.1.3 Optimal Scalar Quantizers


Now we look for the optimal scalar quantizer given R bits for reconstruction. Formally, this is

Dscalar (R) = min E|X − q(X)|2 (25.5)


q:|Im q|≤2R

Intuitively, we would think that the optimal quantization regions should be contiguous; otherwise,
given a point cj , our reconstruction error will be larger. Therefore in one dimension quantizers are
piecewise constant:

q(x) = cj 1Tj ≤x≤Tj+1

for some cj ∈ [Tj , Tj+1 ].


One-bit quantization of X ∼ 2
Simple example:q qN (0, σ ). Then optimal quantization points are
c1 = E[X|X ≥ 0] = π2 σ, c2 = E[X|X ≤ 0] = − π2 σ.
With ideas like this, in 1982 Stuart Lloyd developed an algorithm (called Lloyd’s algorithm)
for iteratively finding optimal quantization regions and points. This works for both the scalar and
vector cases, and goes as follows:

1. Pick any N = 2k points

2. Draw the Voronoi regions around the chosen quantization points (aka minimum distance
tessellation, or set of points closest to cj ), which forms a partition of the space.

3. Update the quantization points by the centroids (E[X|X ∈ D]) of each Voronoi region.

4. Repeat.

b b
b b

b b

b b
b b

Steps of Lloyd’s algorithm

Lloyd’s clever observation is that the centroid of each Voronoi region is (in general) different than
the original quantization points. Therefore, iterating through this procedure gives the Centroidal
Voronoi Tessellation (CVT - which are very beautiful objects in their own right), which can be
viewed as the fixed point of this iterative mapping. The following theorem gives the results about
Lloyd’s algorithm
Theorem 25.1 (Lloyd).

1. Lloyd’s algorithm always converges to a Centroidal Voronoi Tessellation.

266
2. The optimal quantization strategy is always a CVT.
3. CVT’s need not be unique, and the algorithm may converge to non-global optima.
Remark 25.1. The third point tells us that Lloyd’s algorithm isn’t always guaranteed to give the
optimal quantization strategy.1 One sufficient condition for uniqueness of a CVT is the log-concavity
of the density of X [Fleischer ’64]. Thus, for Gaussian PX , Lloyd’s algorithm outputs the optimal
quantizer, but even for Gaussian, if N > 3, optimal quantization points are not known in closed
form! So it’s hard to say too much about optimal quantizers. Because of this, we next look for an
approximation in the regime of huge number of points.
Remark 25.2 (k-means). A popular clustering method called k-means is the following: Given n
data points x1 , . . . , xn ∈ Rd , the goal is to find k centers µ1 , . . . , µk ∈ Rd to minimize the objective
function
Xn
min kxi − µj k2 .
j∈[k]
i=1
This is equivalent to solving the optimal vector quantization problem analogous to (25.5):
min EkX − q(X)k2
q:|Im(q)|≤k
P
where X is distributed according to the empirical distribution over the dataset, namely, n1 ni=1 δxi .
Solving the k-means problem is NP-hard in the worst case, and Lloyd’s algorithm is commonly used
heuristic.

25.1.4 Fine quantization


[Panter-Dite ’51] Now we look at the high SNR approximation. For this, introduce a probability
density function λ(x), which represents the density of our quantization points and allows us to
approximateRsummations by integrals2 . Then the number of quantization points in any interval
b
[a, b] is ≈ N a λ(x)dx. For any point x, denote its distance to the closest quantization point by
∆(x). Then
Z x+∆(x)
1
N λ(t)dt ≈ N λ(x)∆(x) ≈ 1 =⇒ ∆(x) ≈ .
x N λ(x)
With this approximation, the quality of reconstruction is
N
X
E|X − q(X)|2 = E[|X − cj |2 |X ∈ ∆j ]P[X ∈ ∆j ]
j=1
N
X Z
|∆j |2 ∆2 (x)
≈ P[X ∈ ∆j ] ≈ p(x) dx
12 12
j=1
Z
1
= p(x)λ−2 (x)dx
12N 2
1
As a simple example one may consider PX = 13 φ(x − 1) + 13 φ(x) + 13 φ(x + 1) where φ(·) is a very narrow pdf,
symmetric around 0. Here the CVT with centers ± 23 is not optimal among binary quantizers (just compare to any
quantizer that quantizes two adjacent spikes to same value).
2
This argument is easy to make rigorous. We only need to define reconstruction points cj as solutions of
Z cj
j
λ(x) dx = .
−∞ N

267
To find the optimal density λ that gives the best
R 1/3 R reconstruction
R 2/3 (minimum MSE)
R −2whenR X1/3has
−2
density p, we use Hölder’s inequality: p ≤ ( pλ ) ( λ) . Therefore pλ ≥ ( p )3 ,
1/3
1/3
with equality iff pλ−2 ∝ λ. Hence the optimizer is λ∗ (x) = Rp (x) . Therefore when N = 2R ,3
p1/3 dx
Z 3
1
Dscalar (R) ≈ 2−2R p 1/3
(x)dx
12
So our optimal quantizer density in the high rate regime is proportional to the cubic root of the
density of our source. This approximation is called the Panter-Dite approximation. For example,
• When X ∈ [− A2 , A2 ], using Hölder’s inequality again h1, p1/3 i ≤ k1k 3 kp1/3 k3 = A2/3 , we have
2

1
Dscalar (R) ≤ 2−2R A2 = DU (R)
12
where the RHS is the uniform quantization error given in (25.1). Therefore as long as the
source distribution is not uniform, there is strict improvement. For uniform distribution,
uniform quantization is, unsurprisingly, optimal.
• When X ∼ N (0, σ 2 ), this gives

2 −2R π 3
Dscalar (R) ≈ σ 2 (25.6)
2
Note: In fact, in scalar case the optimal non-uniform quantizer can be realized using the compander
architecture (25.4) that we discussed in Section 25.1.2: As an exercise, use Taylor expansion to
analyze the quantization error of (25.4) when N → ∞. The optimal compander f : R → [0, 1] turns
t
p1/3 (t)dt
R
out to be f (x) = R−∞
∞ [Bennett ’48, Smith ’57].
1/3 (t)dt
−∞ p

25.1.5 Fine quantization and variable rate


So far we have been focusing on quantization with restriction on the cardinality of the image of
q(·). If one, however, intends to further compress the values q(X) via noiseless compressor, a more
natural constraint is to bound H(q(X)).
Koshelev [Kos63] discovered in 1963 that in the high rate regime uniform quantization is
asymptotically optimal under the entropy constraint. Indeed, if q∆ is a uniform quantizer with cell
size ∆, then it is easy to see that
H(q∆ (X)) = h(X) − log ∆ + o(1) , (25.7)
R
where h(X) = − pX (x) log pX (x) dx is the differential entropy of X. So a uniform quantizer with
H(q(X)) = R achieves
∆2 22h(X)
D= ≈ 2−2R .
12 12
On the other hand,
R cj any quantizer with unnormalized point density function Λ(x) (i.e. smooth
function such that −∞ Λ(x)dx = j) can be shown to achieve (assuming Λ → ∞ pointwise)
Z
1 1
D≈ pX (x) 2 dx (25.8)
12 Λ (x)
Z
Λ(x)
H(q(X)) ≈ pX (x) log dx (25.9)
pX (x)
3
In fact when R → ∞, “≈” can be replaced by “= 1 + o(1)” [Zador ’56].

268
Now, from Jensen’s inequality we have
Z Z
1 1 1 22h(X)
pX (x) 2 dx ≥ exp{−2 pX (x) log Λ(x) dx} ≈ 2−2H(q(X)) ,
12 Λ (x) 12 12
concluding that uniform quantizer is asymptotically optimal.
Furthermore, it turns out that for any source, even the optimal vector quantizers (to be considered
2h(X)
next) can not achieve distortion better that 2−2R 2 2πe – i.e. the maximal improvement they can
gain (on any iid source!) is 1.53 dB (or 0.255 bit/sample). This is one reason why scalar uniform
quantizers followed by lossless compression is an overwhelmingly popular solution in practice.

25.2 Information-theoretic vector quantization


Consider Gaussian distribution N (0, σ 2 ). By doing optimal vector quantization in high dimensions
(namely, compressing (X1 , . . . , Xn ) to 2nR points), rate-distortion theory will tell us that when n is
large, we can achieve the per-coordinate MSE:

Dvec (R) = σ 2 2−2R

which, compared to (25.6), saves 4.35 dB (or 0.72 bit/sample). This should be rather surprising, so
let’s reiterate: even when X1 , . . . , Xn are iid, we can get better performance by quantizing the Xi ’s
jointly. One instance of this surprising effect is the following:

Hamming Game Given 100 unbiased bits, we want to look at them and scribble something
down on a piece of paper that can store 50 bits at most. Later we will be asked to guess the
original 100 bits, with the goal of maximizing the number of correctly guessed bits. What is the best
strategy? Intuitively, it seems the optimal strategy would be to store half of the bits then randomly
guess on the rest, which gives 25% BER. However, as we will show in the next few lectures, the
optimal strategy amazingly achieves a BER of 11%. How is this possible? After all we are guessing
independent bits and the utility function (BER) treats all bits equally. Some intuitive explanation:

1. Applying scalar quantization componentwise results in quantization region that are hypercubes,
which might not be efficient for covering.

2. Concentration of measures removes many source realizations that are highly unlikely. For
example, if we think about quantizing a single Gaussian X, then we need to cover large portion
of R in order to cover the cases of significant deviations of X from 0. However, when we are
quantizing many (X1 , . . . , Xn ) together, the law of large numbers makes sure that many Xj ’s
cannot conspire together and all produce large values. Indeed, (X1 , . . . , Xn ) concentrates near
a sphere. Thus, we may exclude large portions of the space Rn from consideration.

Math Formalism A lossy compressor is an encoder/decoder pair (f, g) where


f g
X −→ W −→ X̂

• X ∈ X : continuous source

• W : discrete data

• X̂ ∈ X̂ : reconstruction

269
A distortion metric is a function d : X × X̂ → R ∪ {+∞} (loss function). There are various
formulations of the lossy compression problem:

1. Fixed length (fixed rate), average distortion: W ∈ [M ], minimize E[d(X, X̂)].

2. Fixed length, excess distortion: W ∈ [M ], minimize P[d(X, X̂) > D].

3. Variable length, max distortion: W ∈ {0, 1}∗ , d(X, X̂) ≤ D a.s., minimize E[length(W )] or
H(X̂) = H(W ).

Note: In this course we focus on lossy compression with fixed length and average distortion. The
difference between average distortion and excess distortion is analogous to average risk bound and
high-probability bound in statistics/machine learning.
Definition 25.1. Rate-distortion problem is characterized by a pair of alphabets A, Â, a single-
letter distortion function d(·, ·) : A × Â → R ∪ {+∞} and a source – a sequence of A-valued r.v.’s
(S1 , S2 , . . .). A separable distortion metric is defined for n-letter vectors by averaging the single-letter
distortions:
1X
d(an , ân ) , d(ai , âi )
n
An (n, M, D)-code is

• Encoder f : An → [M ]

• Decoder g : [M ] → Ân

• Average distortion: E[d(S n , g(f (S n )))] ≤ D

Fundamental limit:

M ∗ (n, D) = min{M : ∃(n, M, D)-code}


1
R(D) = lim sup log M ∗ (n, D)
n→∞ n

Now that we have the definition, we give the (surprisingly simple) general converse
Theorem 25.2 (General Converse). For all lossy codes X → W → X̂ such that E[d(X, X̂)] ≤ D,
we have

log M ≥ ϕX (D) , inf I(X; Y )


PY |X :E[d(X,Y )]≤D

where W ∈ [M ].

Proof.

log M ≥ H(W ) ≥ I(X; W ) ≥ I(X; X̂) ≥ ϕX (D)

where the last inequality follows from the fact that PX̂|X is a feasible solution (by assumption).

Theorem 25.3 (Properties of ϕX ).

1. ϕX is convex, non-increasing.

270
2. ϕX continuous on (D0 , ∞), where D0 = inf{D : ϕX (D) < ∞}.

3. If
(
D0 x=y
d(x, y) =
> D0 x 6= y

Then ϕX (D0 ) = I(X; X).

4. Let
Dmax = inf Ed(X, x̂).
x̂∈X̂

Then ϕX (D) = 0 for all D > Dmax . If D0 > Dmax then also ϕX (Dmax ) = 0.

Note: If Dmax = Ed(X, x̂) for some x̂, then x̂ is the “default” reconstruction of X, i.e., the best
estimate when we have no information about X. Therefore D ≥ Dmax can be achieved for free.
This is the reason for the notation Dmax despite that it is defined as an infimum.
Example: (Gaussian with MSE distortion) For X ∼ N (0, σ 2 ) and d(x, y) = (x − y)2 , we have
2
ϕX (D) = 12 log+ σD . In this case D0 = 0 which is not attained; Dmax = σ 2 and if D ≥ σ 2 , we can
simply output X̂ = 0 as the reconstruction which requires zero bits.

Proof.

1. Convexity follows from the convexity of PY |X 7→ I(PX , PY |X ).

2. Continuity on interior of the domain follows from convexity.

3. The only way to satisfy the constraint is to take X = Y .

4. For any D > Dmax we can set X̂ = x̂ deterministically. Thus I(X; x̂) = 0. The second claim
follows from continuity.

In channel coding, we looked at the capacity and the information capacity. We define the
Information Rate-Distortion function in an analogous way here, which by itself is not an operational
quantity.
Definition 25.2. The Information Rate-Distortion function for a source is
1
Ri (D) = lim sup ϕS n (D) where ϕS n (D) = inf I(S n ; Ŝ n )
n→∞ n PŜ n |S n :E[d(S n ,Ŝ n )]≤D

And D0 = inf{D : Ri (D) < ∞}.


The reason for defining Ri (D) is because from Theorem 25.2 we immediately get:
Corollary 25.1. ∀D, R(D) ≥ Ri (D).
Naturally, the information rate-distortion function inherit the properties of ϕ:
Theorem 25.4 (Properties of Ri ).

1. Ri (D) is convex, non-increasing

2. Ri (D) is continuous on (D0 , ∞), where D0 , inf{D : Ri (D) < ∞}.

271
3. If
(
D0 x=y
d(x, y) =
> D0 x 6= y

Then for stationary ergodic {S n }, Ri (D) = H (entropy rate) or +∞ if Sk is not discrete.

4. Ri (D) = 0 for all D > Dmax , where

Dmax , lim sup inf Ed(X n , xˆn ) .


n→∞ xˆn ∈X̂

If D0 < Dmax , then Ri (Dmax ) = 0 too.

5. (Single letterization) If the source {Si } is i.i.d., then

Ri (D) = ϕS1 (D) = inf I(S; Ŝ)


PŜ|S :E[d(S,Ŝ)]≤D

Proof. Properties 1-4 follow directly from corresponding properties of φS n in Theorem 25.3 and
property 5 will be established in the next section.

25.3* Converting excess distortion to average


Finally, we discuss how to build a compressor for average distortion if we have a compressor for
excess distortion, which we will not discuss in details in class.
Assumption Dp . Assume that for (S, d), there exists p > 1 such that Dp < ∞, where

Dp , sup inf (E|d(S n , x̂)|p )1/p < +∞


n x̂

i.e. that our separable distortion metric d doesn’t grow too fast. Note that (by Minkowski’s
inequality) for stationary memoryless sources we have a single-letter bound:

Dp ≤ inf (E|d(S, x̂)|p )1/p (25.10)


Theorem 25.5 (Excess-to-Average). Suppose there exists X → W → X̂ such that W ∈ [M ] and


P[d(X, X̂) > D] ≤ . Suppose for some p ≥ 1 and x̂0 ∈ X̂ , (E[d(X, x̂0 )]p )1/p = Dp < ∞. Then there
exists X → W 0 → X̂ 0 code such that W 0 ∈ [M + 1] and

E[d(X, X̂ 0 )] ≤ D(1 − ) + Dp 1−1/p (25.11)

Remark 25.3. Theorem is only useful for p > 1, since for p = 1 the right-hand side of (25.11) does
not converge to 0 as  → 0.

Proof. We transform the first code into the second by adding one codeword:
(
f (x) d(x, g(f (x))) ≤ D
f 0 (x) =
M + 1 o/w
(
g(j) j ≤ M
g 0 (j) =
x̂0 j =M +1

272
Then

E[d(X, g 0 ◦ f 0 (X)) ≤ E[d(X, X̂)|Ŵ 6= M + 1](1 − ) + E[d(X, x0 )1{Ŵ = M + 1}]


(Hölders Inequality) ≤ D(1 − ) + Dp 1−1/p

273
§ 26. Rate distortion: achievability bounds

26.1 Recap
Recall from the last lecture:
1
R(D) = lim sup log M ∗ (n, D), (rate distortion function)
n→∞ n
1
Ri (D) = lim sup ϕS n (D), (information rate distortion function)
n→∞ n
and

ϕS (D) , inf I(S; Ŝ)


PŜ|S :E[d(S,Ŝ)]≤D

ϕS n (D) = inf I(S n ; Ŝ n )


PŜ n |S n :E[d(S n ,Ŝ n )]≤D

Also, we showed the general converse: For any (M, D)-code X → W → X̂ we have

log M ≥ ϕX (D)
=⇒ log M ∗ (n, D) ≥ ϕS n (D)
=⇒ R(D) ≥ Ri (D)

In this lecture, we will prove the achievability bound and establish the identity R(D) = Ri (D)
for stationary memoryless sources.
First we show that Ri (D) can be easily calculated for memoryless source without going through
the multi-letter optimization problem. This is the counterpart of Corollary 19.1 for channel capacity
(with separable cost function).
Theorem 26.1 (Single-letterization). For stationary memoryless source S n and separable distortion
d,

Ri (D) = ϕS (D)

Proof. By definition we have that ϕS n (D) ≤ nϕS (D) by choosing a product channel: PŜ n |S n =
(PŜ|S )n . Thus Ri (D) ≤ ϕS (D).

274
For the converse, take any PŜ n |S n such that the constraint E[d(S n , Ŝ n )] ≤ D is satisfied, we have
n
X
I(S n ; Ŝ n ) ≥ I(Sj , Ŝj ) (S n independent)
j=1
n
X
≥ ϕS (E[d(Sj , Ŝj )])
j=1
 
n
X
1
≥ nϕS  E[d(Sj , Ŝj )] (convexity of ϕS )
n
j=1

≥ nϕS (D) (ϕS non-increasing)

26.2 Shannon’s rate-distortion theorem


i.i.d.
Theorem 26.2. Let the source S n be stationary and memoryless, S n ∼ PS , and suppose that
distortion metric d and the target distortion D satisfy:
1. d(sn , ŝn ) is non-negative and separable
2. D > D0
3. Dmax is finite, i.e.
Dmax , inf E[d(S, ŝ)] < ∞.

Then
R(D) = Ri (D) = inf I(S; Ŝ). (26.1)
PŜ|S :E[d(S,Ŝ)]≤D

Remark 26.1. • Note that Dmax < ∞ does not imply that d(·, ·) only takes values in R,
i.e. theorem permits d(a, â) = ∞.
• It should be remarked that when Dmax = ∞ (e.g. S ∼ Cauchy) typically R(D) = ∞. Indeed,
suppose that d(·, ·) is a metric (i.e. finite valued and satisfies triangle inequality). Then, for
any x0 ∈ An we have
d(X, X̂) ≥ d(X, x0 ) − d(x0 , X̂) .
Thus, for any finite codebook {c1 , . . . , cM } we have maxj d(x0 , cj ) < ∞ and therefore
E[d(X, X̂)] ≥ E[d(X, x0 )] − max d(x0 , cj ) = ∞ .
j

So that R(D) = ∞ for any finite D. This observation, however, should not be interpreted as
absolute impossibility of compression for such sources. It is just not possible with fixed-rate
codes. As an example, for quadratic distortion and Cauchy-distributed S, Dmax = ∞ since S
has infinite second-order moments. But it is easy to see that Ri (D) < ∞ for any D ∈ (0, ∞).
In fact, in this case Ri (D) is a hyperbola-like curve that never touches either axis. A non-
trivial compression can be attained with compressors S n → W of bounded entropy H(W )
(but unbounded alphabet of W ). Indeed if we take W to be a ∆-quantized version of S and
notice that differential entropy of S is finite, we get from (25.7) that Ri (∆) ≤ H(W ) < ∞.
Interesting question: Is H(W ) = nRi (D) + o(n) attainable?

275
• Techniques in proving (26.1) for memoryless sources can be applied to prove it for “stationary
ergodic” sources with changes similar to those we have discussed in lossless compression
(Lecture 10).
Before giving a formal proof, we illustrate the intuition non-rigorously.

26.2.1 Intuition
Try to throw in M points C = {c1 , . . . , cM } ∈ Ân which are drawn i.i.d. according to a product
distribution QnŜ where QŜ is some distribution on Â. Examine the simple encoder-decoder pair:

encoder : f (sn ) = argmin d(sn , cj ) (26.2)


j∈[M ]

decoder : g(j) = cj (26.3)

The basic idea is the following: Since the codewords are generated independently of the source,
the probability that a given codeword offers good reconstruction is (exponentially) small, say, .
However, since we have many codewords, the chance that there exists a good one can be of high
probability. More precisely, the probability that no good codeword exist is (1 − )M , which can be
very close to zero as long as M  1 .
To explain the intuition further, let us consider the excess distortion of this code: P[d(S n , Ŝ n ) >
D]. Define
Psuccess , P[∃c ∈ C, s.t. d(S n , c) ≤ D]
Then

Pfailure ,P[∀ci ∈ C, d(S n , c) > D] (26.4)


n n
≈P[∀ci ∈ C, d(S , c) > D|S ∈ Tn ] (26.5)
( Tn is the set of typical strings with empirical distribution P̂S n ≈ PS )
=P[d(S n , Ŝ n ) > D|S n ∈ Tn ]M (PS n ,Ŝ n = PSn QnŜ ) (26.6)
=(1 − P[d(S n , Ŝ n ) ≤ D|S n ∈ Tn ] )M (26.7)
| {z }
since S n ⊥
⊥ Ŝ n , this should be small

≈(1 − 2−nE(QŜ ) )M (large deviation!) (26.8)

where it can be shown (similar to information projection in Lecture 14) that

E(QŜ ) = min D(PŜ|S kQŜ |PS ) (26.9)


PŜ|S :E[d(S,Ŝ)]≤D

Thus we conclude that ∀QŜ , ∀δ > 0 we can pick M = 2n(E(QŜ )+δ) and the above code will have
arbitrarily small excess distortion:

Pfailure = P[∀c ∈ C, d(S n , c) > D] → 0 as n → ∞.

276
We optimize QŜ to get the smallest possible M :

min E(QŜ ) = min min D(PŜ|S kQŜ |PS ) (26.10)


QŜ QŜ P
Ŝ|S :E[d(S,Ŝ)]≤D

= min min D(PŜ|S kQŜ |PS ) (26.11)


PŜ|S :E[d(S,Ŝ)]≤D QŜ

= min I(S; Ŝ)


PŜ|S :E[d(S,Ŝ)]≤D

= ϕS (D)

26.2.2 Proof of Theorem 26.2


Theorem 26.3 (Performance bound of average-distortion codes). Fix PX and suppose d(x, x̂) ≥ 0
for all x, x̂. ∀PY |X , ∀γ > 0, ∀y0 ∈ Â, there exists a code X → W → X̂, where W ∈ [M + 1] and

E[d(X, X̂)] ≤ E[d(X, Y )] + E[d(X, y0 )]e−M/γ + E[d(X, y0 )1{i(X;Y )>log γ} ]


d(X, X̂) ≤ d(X, y0 ) a.s.

Note:

• This theorem says that from an arbitrary PY |X such that Ed(X, Y ) ≤ D, we can extract a
good code with average distortion D plus some extra terms which will vanish in the asymptotic
regime.

• The proof uses the random coding argument. The role of the deterministic y0 is a “fail-safe”
codeword (think of y0 as the default reconstruction with Dmax = E[d(X, y0 )]). We add y0 to
the random codebook for “damage control”, to hedge against the (highly unlikely and unlucky)
event that we end up with a terrible codebook.

Proof. Similar to the previous intuitive argument, we apply random coding and generate the
codewords randomly and independently of the source:
i.i.d.
C = {c1 , . . . , cM } ∼ PY ⊥
⊥X

and add the “fail-safe” codeword cM +1 = y0 . We adopt the same encoder-decoder pair (26.2) –
(26.3) and let X̂ = g(f (X)). Then by definition,

d(X, X̂) = min d(X, cj ) ≤ d(X, y0 ).


j∈[M +1]

To simplify notation, let Y be an independent copy of Y (similar to the idea of introducing unsent
codeword X in channel coding – see Lecture 17):

PX,Y,Y = PX,Y PY

277
where PRY = PY . Recall the formula for computing the expectation of a random variable U ∈ [0, a]:
a
E[U ] = 0 P[U ≥ u]du. Then the average distortion is

Ed(X, X̂) = E min d(X, cj ) (26.12)


j∈[M +1]
h i

= EX E min d(X, cj ) X (26.13)
j∈[M +1]
Z d(X,y0 ) h i

= EX P min d(X, cj ) > u X du (26.14)
0 j∈[M +1]
Z d(X,y0 ) h i

≤ EX P min d(X, cj ) > u X du (26.15)
0 j∈[M ]
Z d(X,y0 )
= EX P[d(X, Y ) > u|X]M du (26.16)
0
Z d(X,y0 )
= EX (1 − P[d(X, Y ) ≤ u|X])M du (26.17)
0 | {z }
,δ(X,u)

Next we upper bound (1 − δ(X, u))M as follows:

(1 − δ(X, u))M ≤ e−M/γ + |1 − γδ(X, u)|+ (26.18)


+
= e−M/γ + 1 − γE[exp{−i(X; Y )}1{d(X,Y )≤u} |X] (26.19)
−M/γ
≤e + P[i(X; Y ) > log γ|X] + P[d(X, Y ) > u|X] (26.20)

where

• (26.18) uses the following trick in dealing with (1 − δ)M for δ  1 and M  1. First, recall
the standard rule of thumb: (
0, n n  1
(1 − n )n ≈
1, n n  1
In order to obtain firm bounds of similar flavor, consider
union bound
1 − δM ≤ (1 − δ)M ≤ e−δM (log(1 − δ) ≤ −δ)
≤ e−M/γ (γδ ∧ 1) + |1 − γδ|+ (∀γ > 0)
−M/γ +
≤e + |1 − γδ|

• (26.19) is simply change of measure using i(x; y) = log P PY (y) (i.e., conditioning-unconditioning
Y |X (y|x)
trick for information density, cf. Proposition 17.1.

278
• (26.20):

1 − γ E[exp{−i(X; Y )}1{d(X,Y )≤u} |X] ≤ 1 − γ E[exp{−i(X; Y )}1{d(X,Y )≤u,i(X;Y )≤log γ} |X]


≤ 1 − E[1{d(X,Y )≤u,i(X;Y )≤log γ} |X]
= P[d(X, Y ) > u or i(X; Y ) > log γ|X]
≤ P[d(X, Y ) > u|X] + P[i(X; Y ) > log γ|X]

Plugging (26.20) into (26.17), we have


Z d(X,y0 )
E[d(X, X̂)] ≤ EX (e−M/γ + P[i(X; Y ) > log γ|X] + P[d(X, Y ) > u|X])du
0
Z ∞
−M/γ
≤ E[d(X, y0 )]e + E[d(X, y0 )P[i(X; Y ) > log γ|X]] + EX P[d(X, Y ) > u|X])du
0
−M/γ
= E[d(X, y0 )]e + E[d(X, y0 )1{i(X;Y )>log γ} ] + E[d(X, Y )]

As a side product, we have the following achievability for excess distortion.


Theorem 26.4 (Performance bound of excess-distortion codes). ∀PY |X , ∀γ > 0, there exists a code
X → W → X̂, where W ∈ [M ] and

P[d(X, X̂) > D] ≤ e−M/γ + P[{d(X, Y ) > D} ∪ {i(X; Y ) > log γ}]

Proof. Proceed exactly as in the proof of Theorem 26.3, replace (26.12) by P[d(X, X̂) > D] = P[∀j ∈
[M ], d(X, cj ) > D] = EX [(1 − P[d(X, Y ) ≤ D|X])M ], and continue similarly.

Finally, we are able to prove Theorem 26.2 rigorously by applying Theorem 26.3 to iid sources
X = S n and n → ∞:

Proof of Theorem 26.2. Our goal is the achievability: R(D) ≤ Ri (D) = ϕS (D).
WLOG we can assume that Dmax = E[d(S, ŝ0 )] achieved at some fixed ŝ0 – this is our default
reconstruction; otherwise just take any other fixed sequence so that the expectation is finite. The
default reconstruction for S n is ŝn0 = (ŝ0 , . . . , ŝ0 ) and E[d(S n , ŝn0 )] = Dmax < ∞ since the distortion
is separable.
Fix some small δ > 0. Take any PŜ|S such that E[d(S, Ŝ)] ≤ D − δ. Apply Theorem 26.3 to
(X, Y ) = (S n , Ŝ n ) with

PX = PS n
PY |X = PŜ n |S n = (PŜ|S )n
log M = n(I(S; Ŝ) + 2δ)
log γ = n(I(S; Ŝ) + δ)
n
1X
d(X, Y ) = d(Sj , Ŝj )
n
j=1

y0 = ŝn0

279
we conclude that there exists a compressor f : An → [M + 1] and g : [M + 1] → Ân , such that

E[d(S n , g(f (S n )))] ≤ E[d(S n , Ŝ n )] + E[d(S n , ŝn0 )]e−M/γ + E[d(S n , ŝn0 )1{i(S n ;Ŝ n )>log γ } ]
≤ D − δ + Dmax e− exp(nδ) + E[d(S n , ŝn0 )1En ], (26.21)
| {z } | {z }
→0 →0 (later)

where
 
1 X
n 
WLLN
En = {i(S n ; Ŝ n ) > log γ} = i(Sj ; Ŝj ) > I(S; Ŝ) + δ ====⇒ P[En ] → 0
n 
j=1

If we can show the expectation in (26.21) vanishes, then there exists an (n, M, D)-code with:

M = 2n(I(S;Ŝ)+2δ) , D = D − δ + o(1) ≤ D.

To summarize, ∀PŜ|S such that E[d(S, Ŝ)] ≤ D − δ we have that:

R(D) ≤ I(S; Ŝ)


δ↓0
==⇒ R(D) ≤ ϕS (D−) = ϕS (D). (continuity, since D > D0 )

It remains to show the expectation in (26.21) vanishes. This is a simple consequence of the
L
uniform integrability of the sequence {d(S n , ŝn0 )}. (Indeed, any sequence Vn →1 V is uniformly
integrable.) If you do not know what uniform integrability is, here is a self-contained proof.
Lemma 26.1. For any positive random variable U , define g(δ) = supH:P[H]≤δ E[U 1H ]. Then1
δ→0
EU < ∞ ⇒ g(δ) −−−→ 0.
b→∞
Proof. For any b > 0, E[U 1H ] ≤ E[U 1{U >b} ] + bδ, where E[U 1{U >b} ] −−−→ 0 by dominated

convergence theorem. Then the proof is completed by setting b = 1/ δ.
P
Now d(S n , ŝn0 ) = n1 P of U . Since E[U ] = Dmax < ∞ by assumption,
Uj , where Uj are iid copies
applying Lemma 26.1 yields E[d(S n , ŝn0 )1En ] = n1 E[Uj 1En ] ≤ g(P[En ]) → 0, since P[En ] → 0. We
are done proving the theorem.

Note: It seems that in Section 26.2.1 and in Theorem 26.2 we applied different relaxations in
showing the lower bound, how come they turn out to yield the same tight asymptotic result?
This is because the key to both proofs is to estimate the exponent (large deviations) of the
underlined probabilities in (26.7) and (26.17), respectively. To get the right exponent, as we know,
the key is to apply tilting (change of measure) to the distribution solving the information projection
problem (26.9). In the case, when PY = (QŜ )n = (PŜ )n is chosen as the solution to rate-distortion
optimization inf I(S; Ŝ), the resulting tilting is precisely given by 2−i(X;Y ) .
1
In fact, ⇒ is ⇔.

280
26.3* Covering lemma
Goal:

In other words:

Approximate P with Q such that for any function f , ∀x, we have:

P[f (An , B n ) ≤ x] ≈ Q[f (An , B n ) ≤ x], |W | ≤ 2nR .

what is the minimum rate R to achieve this?


Some remarks:

1. The minimal rate will depend (although it is not obvious) on whether the encoder An → W
knows about the test that the tester is running (or equivalently whether he knows the function
f (·, ·)).
P
2. If the function is known to be of the form f (An , B n ) = nj=1 f1 (Aj , Bj ), then evidently the
job of the encoder is the following: For any realization of the sequence An , we need to generate
a sequence B n such that joint composition (empirical distribution) is very close to PA,B .

3. If R = H(A), we can compress An and send it to “B side”, who can reconstruct An perfectly
and use that information to produce B n through PB n |An .

4. If R = H(B), “A side” can generate B n according to PA,B


n and send that B n sequence to the
“B side”.

⊥ B, we know that R = 0, as “B side” can generate B n independently.


5. If A ⊥

281
Our previous argument turns out to give a sharp answer for the case when encoder is aware of
the tester’s algorithm. Here is a precise result:
Theorem 26.5 (Covering Lemma). ∀PA,B and R > I(A; B), let C = {c1 , . . . , cM } where each
codeword cj is i.i.d. drawn from distribution PBn . ∀ > 0, for M ≥ 2n(I(A;B)+) we have that:

P[∃c ∈ C such that P̂An ,c ≈ PA,B ] → 1

Stronger form: ∀F

P[∃c : (An , c) ∈ F ] ≥ P[(An , B n ) ∈ F ] + o(1)


|{z}
uniform in F

Proof. Following similar arguments of the proof for Theorem 26.3, we have

P[∀c ∈ C : (An , c) 6∈ F ] ≤ e−γ + P[{(An , B n ) 6∈ F } ∪ {i(An ; B n ) > log γ}]


= P[(An , B n ) 6∈ F ] + o(1)
⇒ P[∀c ∈ C : (An , c) ∈ F ] ≥ P[(An , B n ) ∈ F ] + o(1)

Note: [Intuition] To generate B n , there are around 2nH(B) high probability sequences; for each An
sequence, there are around 2nH(B|A) B n sequences that have the same joint distribution, therefore, it
nH(B)
is sufficient to describe the class of B n for each An sequence, and there are around 22nH(B|A) = 2nI(A;B)
classes.
Although Covering Lemma is a powerful tool, it does not imply that the constructed joint
distribution QAn B n can fool any permutation invariant tester. In other words, it is not guaranteed
that
n
sup |QAn ,B n (F ) − PA,B (F )| → 0 .
F ⊂A ×B ,permut.invar.
n n

Indeed, a sufficient statistic for a permutation invariant tester is a joint type P̂An ,c . Our code
satisfies P̂An ,c ≈ PA,B , but it might happen that P̂An ,c although close to PA,B still takes highly
unlikely values (for example, if we restrict all c to have the same composition P0 , the tester can

easily detect the problem since PBn -measure of all strings of composition P0 cannot exceed O(1/ n)).
Formally, to fool permutation invariant tester we need to have small total variation between the
distribution on the joint types under P and Q. (It is natural to conjecture that rate R = I(A; B)
should be sufficient to achieve this requirement, though).
A related question is about the minimal possible rate (i.e. cardinality of W ∈ [2nR ]) required to
have small total variation:
n
TV(QAn ,B n , PAB )≤ (26.22)
Note that condition (26.22) guarantees that any tester (permutation invariant or not) is fooled to
believe he sees the truly iid (An , B n ). The minimal required rate turns out to be (Cuff’2012):

R= min I(A, B; U )
A→U →B

a quantity known as Wyner’s common information C(A; B). Showing that Wyner’s common
n (in TV) we have
information is a lower-bound is not hard. Indeed, since QAn ,B n ≈ PAB

I(QAt−1 ,B t−1 , QAt Bt |At−1 ,B t−1 ) ≈ I(PAt−1 ,B t−1 , PAt Bt |At−1 ,B t−1 ) = 0

282
(Here one needs to use finiteness of the alphabet of A and B and the bounds relating H(P ) − H(Q)
with TV(P, Q)). We have (under Q!)

nR = H(W ) ≥ I(An , B n ; W ) (26.23)


T
X
≥ I(At , Bt ; W ) − I(At , Bt ; At−1 B t−1 ) (26.24)
t=1
XT
≈ I(At , Bt ; W ) (26.25)
t=1
& nC(A; B) (26.26)

where in the last step we used the crucial observation that under Q there is a Markov chain

At → W → B t

and that Wyner’s common information PA,B 7→ C(A; B) should be continuous in the total variation
distance on PA,B . Showing achievability is a little more involved.

283
§ 27. Evaluating R(D). Lossy Source-Channel separation.

Last time: For stationary memoryless (iid) sources and separable distortion, under the assumption
that Dmax < ∞.
R(D) = Ri (D) = inf I(S; Ŝ).
PŜ|S :Ed(S,Ŝ)≤D

27.1 Evaluation of R(D)


So far we’ve proved some properties about the rate distortion function, now we’ll compute its value
for a few simple statistical sources. We’ll do this in a somewhat unsatisfying way: guess the answer,
then verify its correctness. At the end, we’ll show that there is a pattern behind this method.

27.1.1 Bernoulli Source


Let S ∼ Ber(p), p ≤ 1/2, with Hamming distortion d(S, Ŝ) = 1{S 6= Ŝ} and alphabets A = Â =
{0, 1}. Then d(sn , ŝn ) = n1 ksn − ŝn kHamming is the bit-error rate.
Claim:
R(D) = |h(p) − h(D)|+ (27.1)

Proof. Since Dmax = p, in the sequel we can assume D < p for otherwise there is nothing to show.
(Achievability) We’re free to choose any PŜ|S , so choose S = Ŝ + Z, where Ŝ ∼ Ber(p0 ) ⊥
⊥Z∼
0
Ber(D), and p is such that

p0 ∗ D = p0 (1 − D) + (1 − p0 )D = p,
p−D
i.e., p0 = 1−2D . In other words, the backward channel PS|Ŝ is BSC(D). This induces some forward
channel PŜ|S . Then,

I(S; Ŝ) = H(S) − H(S|Ŝ) = h(p) − h(D)

Since one such PŜ|S exists, we have the upper bound R(D) ≤ h(p) − h(D).
(Converse) First proof: For any PŜ|S such that P [S 6= Ŝ] ≤ D ≤ p ≤ 12 ,

I(S; Ŝ) = H(S) − H(S|Ŝ)


= H(S) − H(S + Ŝ|Ŝ)
≥ H(S) − H(S + Ŝ)
= h(p) − h(P [S 6= Ŝ])
≥ h(p) − h(D)

284
Second proof: Here is a more general strategy. Denote the random transformation from the
Ŝ|S with EQ [d(S, Ŝ)] ≤ D
∗ . Now we need to show that there is no better Q
achievability proof by PŜ|S
and a smaller mutual information. Then consider the chain:

R(D) ≤ I(PS , QŜ|S ) = D(QS|Ŝ kPS |QŜ )


" #
PS|Ŝ
= D(QS|Ŝ kPS|Ŝ |QŜ ) + EQ log
PS
(Marginal QS Ŝ = PS QŜ|S ) = D(QS|Ŝ kPS|Ŝ |QŜ ) + H(S) + EQ [log D1{S 6= Ŝ} + log D̄1{S = Ŝ}]

And we can minimize this expression by taking QS|Ŝ = PS|Ŝ , giving

≥ 0 + H(S) + P [S = Ŝ] log(1 − D) + P [S 6= Ŝ] log D ≥ h(p) − h(D) (D ≤ 1/2) (27.2)

Since the upper and lower bound agree, we have R(D) = |h(p) − h(D)|+ .

For example, when p = 1/2, D = .11, then R(D) = 1/2 bit. In the Hamming game where we
compressed 100 bits down to 50, we indeed can do this while achieving 11% average distortion,
compared to the naive scheme of storing half the string and guessing on the other half, which
achieves 25% average distortion.
Interpretation: By WLLN, the distribution PSn = Ber(p)n concentrates near the Hamming
sphere of radius np as n grows large. The above result about Hamming sources tells us that the
optimal reconstruction points are from PŜn = Ber(p0 )n where p0 < p if p < 1/2 and p0 = 1/2 if
p = 1/2, which concentrates on a smaller sphere of radius np0 (note the reconstruction points are
some exponentially small subset of this sphere).

S(0, np)

S(0, np′ )

Hamming Spheres

It is interesting to note that none of the reconstruction points are the same as any of the typical
source realizations.

27.1.2 Gaussian Source


The Gaussian source is defines as A = Â = R, S ∼ N (0, σ 2 ), d(a, â) = |a − â|2 (MSE distortion).
Claim:
1 σ2
R(D) = log+ (27.3)
2 D
Proof. Since Dmax = σ 2 , in the sequel we can assume D < σ 2 for otherwise there is nothing to show.

285
(Achievability) Choose S = Ŝ + Z , where Ŝ ∼ N (0, σ 2 − D) ⊥
⊥ Z ∼ N (0, D). In other words,
the backward channel PS|Ŝ is AWGN with noise power D. Since everything is jointly Gaussian, the
2 2
forward channel can be easily found to be PŜ|S = N ( σ σ−D
2 S,
σ −D
σ2
D). Then

1 σ2 1 σ2
I(S; Ŝ) = log =⇒ R(D) ≤ log
2 D 2 D
(Converse) Let PŜ|S be any conditional distribution such that EP |S − Ŝ|2 ≤ D. Denote the
∗ . We use the same trick as before
forward channel in the achievability by PŜ|S
" ∗ #
PS| Ŝ

I(PS , PŜ|S ) = D(PS|Ŝ kPS|Ŝ |PŜ ) + EP log
PS
" ∗ #
PS| Ŝ
≥ EP log
PS
 
(S−Ŝ)2
1 −

√ e 2D 
= EP log 2πD S2

√ 1 e− 2σ2
2πσ 2
" #
1 σ 2 log e S 2 |S − Ŝ|2
= log + EP −
2 D 2 σ2 D
1 σ2

log .
2 D
Again, the upper and lower bounds agree.
The interpretation in the Gaussian case is very similar√to the case of the Hamming source. As n
grows large, our source distribution concentrates on S(0, pnσ 2 ) (n-sphere in Euclidean space rather
than Hamming), and our reconstruction points on S(0, n(σ 2 − D)). So again the picture is two
nested sphere.
How sensitive is the rate-distortion formula to the Gaussianity assumption of the source? The
following is a counterpart of Theorem 19.6 for channel capacity:
Theorem 27.1. Assume that ES = 0 and Var S = σ 2 . Let the distortion metric be quadratic:
d(s, ŝ) = (s − ŝ)2 . Then
1 σ2 1 σ2
log+ − D(PS kN (0, σ 2 )) ≤ R(D) = inf I(S; Ŝ) ≤ log+ .
2 D PŜ|S :E(Ŝ−S)2 ≤D 2 D

Note: This result is in exact parallel to what we proved in Theorem 19.6 for additive-noise channel
capacity:
   
1 P 1 P
log 1 + 2 ≤ sup I(X; X + Z) ≤ log 1 + 2 + D(PZ kN (0, σ 2 )).
2 σ PX :EX 2 ≤P 2 σ

where EZ = 0 and Var Z = σ 2 .


Note: A simple consequence of Theorem 27.1 is that for source distributions with a density, the rate-
distortion function grows according to 12 log D 1
in the low-distortion regime as long as D(PS kN (0, σ 2 ))
is finite. In fact, the first inequality, known as the Shannon lower bound, is asymptotically tight, i.e.,
2
R(D) = 12 log σD − D(PS kN (0, σ 2 )) + o(1) as D → 0. Therefore in this regime performing uniform
scalar quantization with accuracy √1D is in fact asymptotically optimal within an o(1) term.

286
Proof. Again, assume D < Dmax = σ 2 . Let SG ∼ N (0, σ 2 ).
2 σ 2 −D

“≤”: Use the same PŜ|S = N ( σ σ−D
2 S, σ2
D) in the achievability proof of Gaussian rate-
distortion function:

R(D) ≤ I(PS , PŜ|S )
σ2 − D σ2 − D
= I(S; S + W) W ∼ N (0, D)
σ2 σ2
σ2 − D
≤ I(SG ; SG + W ) by Gaussian saddle point (Theorem 4.6)
σ2
1 σ2
= log .
2 D

“≥”: For any PŜ|S such that E(Ŝ − S)2 ≤ D. Let PS|
∗ = N (Ŝ, D) denote AWGN with noise

power D. Then

I(S; Ŝ) = D(PS|Ŝ kPS |PŜ )


" ∗ #
PS| Ŝ

= D(PS|Ŝ kPS| |P )
Ŝ Ŝ
+ EP log − D(PS kPSG )
PSG
 
(S−Ŝ)2
√ 1 e− 2D
 2πD 
≥ EP log S2
 − D(PS kPSG )
√ 1 e −
2σ 2
2πσ 2
1 σ2
≥ log − D(PS kPSG ).
2 D

Remark: The theory of quantization and the rate distortion theory at large have played a
significant role in pure mathematics. For instance, Hilbert’s thirteenth problem was partially solved
by Arnold and Kolmogorov after they realized that they could classify spaces of functions looking
at the optimal quantizer for such functions.

27.2* Analog of saddle-point property in rate-distortion


In the computation of R(D) for the Hamming and Gaussian source, we guessed the correct form
of the rate distortion function. In both of their converse arguments, we used the same trick to
establish that any other PŜ|S gave a larger value for R(D). In this section, we formalize this trick,
in an analogous manner to the saddle point property of the channel capacity. Note that typically
we don’t need any tricks to compute R(D), since we can obtain a solution in a parametric form to
the unconstrained convex optimization

min I(S; Ŝ) + λ E[d(S, Ŝ)]


PŜ|S

In fact there are also iterative algorithms (Blahut-Arimoto) that computes R(D). However, for
the peace of mind it is good to know there are some general reasons why tricks like we used in
Hamming/Gaussian actually are guaranteed to work.

287
Theorem 27.2. 1. Suppose PY ∗ and PX|Y ∗  PX are found with the property that E[d(X, Y ∗ )] ≤ D
and for any PXY with E[d(X, Y )] ≤ D we have
 
dPX|Y ∗
E log (X|Y ) ≥ I(X; Y ∗ ) . (27.4)
dPX

Then R(D) = I(X; Y ∗ ).


2. Suppose that I(X; Y ∗ ) = R(D). Then for any regular branch of conditional probability PX|Y ∗
and for any PXY satisfying

• E[d(X, Y )] ≤ D and

• PY  PY ∗ and

• I(X; Y ) < ∞

the inequality (27.4) holds.


Remarks:

1. The first part is a sufficient condition for optimality of a given PXY ∗ . The second part gives a
necessary condition that is convenient to narrow down the search. Indeed, typically the set of
PXY satisfying those conditions is rich enough to infer from (27.4):

dPX|Y ∗
log (x|y) = R(D) − θ[d(x, y) − D] ,
dPX
for a positive θ > 0.

2. Note that the second part is not valid without assuming PY  PY ∗ . A counterexample to
this and various other erroneous (but frequently encountered) generalizations is the following:
A = {0, 1}, PX = Bern(1/2), Â = {0, 1, 00 , 10 } and

d(0, 0) = d(0, 00 ) = 1 − d(0, 1) = 1 − d(0, 10 ) = 0 .

The R(D) = |1 − h(D)|+ , but there are a bunch of non-equivalent optimal PY |X , PX|Y and
PY ’s.

Proof. First part is just a repetition of the proofs above, so we focus on part 2. Suppose there exists
a counterexample PXY achieving
 
dPX|Y ∗
I1 = E log (X|Y ) < I ∗ = R(D) .
dPX

Notice that whenever I(X; Y ) < ∞ we have

I1 = I(X; Y ) − D(PX|Y kPX|Y ∗ |PY ) ,

and thus
D(PX|Y kPX|Y ∗ |PY ) < ∞ . (27.5)
Before going to the actual proof, we describe the principal idea. For every λ we can define a joint
distribution
PX,Yλ = λPX,Y + (1 − λ)PX,Y ∗ .

288
Then, we can compute
   
PX|Yλ PX|Yλ PX|Y ∗
I(X; Yλ ) = E log (X|Yλ ) = E log (27.6)
PX PX|Y ∗ PX
 
PX|Y ∗ (X|Yλ )
= D(PX|Yλ kPX|Y ∗ |PYλ ) + E (27.7)
PX
= D(PX|Yλ kPX|Y ∗ |PYλ ) + λI1 + (1 − λ)I∗ . (27.8)

From here we will conclude, similar to Prop. 4.1, that the first term is o(λ) and thus for sufficiently
small λ we should have I(X; Yλ ) < R(D), contradicting optimality of coupling PX,Y ∗ .
We proceed to details. For every λ ∈ [0, 1] define
dPY
ρ1 (y) , (y) (27.9)
dPY ∗
λρ1 (y)
λ(y) , (27.10)
λρ1 (y) + λ̄
(λ)
PX|Y =y = λ(y)PX|Y =y + λ̄(y)PX|Y ∗ =y (27.11)
dPYλ = λdPY + λ̄dPY ∗ = (λρ1 (y) + λ̄)dPY ∗ (27.12)
D(y) = D(PX|Y =y kPX|Y ∗ =y ) (27.13)
(λ)
Dλ (y) = D(PX|Y =y kPX|Y ∗ =y ) . (27.14)

Notice:
On {ρ1 = 0} : λ(y) = D(y) = Dλ (y) = 0
and otherwise λ(y) > 0. By convexity of divergence

Dλ (y) ≤ λ(y)D(y)

and therefore
1
Dλ (y)1{ρ1 (y) > 0} ≤ D(y)1{ρ1 (y) > 0} .
λ(y)
Notice that by (27.5) the function ρ1 (y)D(y) is non-negative and PY ∗ -integrable. Then, applying
dominated convergence theorem we get
Z Z
1 1
lim dPY ∗ Dλ (y)ρ1 (y) = dPY ∗ ρ1 (y) lim Dλ (y) = 0 (27.15)
λ→0 {ρ1 >0} λ(y) {ρ1 >0} λ→0 λ(y)

where in the last step we applied the result from Lecture 4

D(P kQ) < ∞ =⇒ D(λP + λ̄QkQ) = o(λ)

since for each y on the set {ρ1 > 0} we have λ(y) → 0 as λ → 0.


On the other hand, notice that
Z Z
1 1
dPY ∗ Dλ (y)ρ1 (y)1{ρ1 (y) > 0} = dPY ∗ (λρ1 (y) + λ̄)Dλ (y) (27.16)
{ρ1 >0} λ(y) λ {ρ1 >0}
Z
1
= dPYλ Dλ (y) (27.17)
λ {ρ1 >0}
Z
1 1 (λ)
= dPYλ Dλ (y) = D(PX|Y kPX|Y ∗ |PYλ ) , (27.18)
λ Y λ

289
where in the penultimate step we used Dλ (y) = 0 on {ρ1 = 0}. Hence, (27.15) shows
(λ)
D(PX|Y kPX|Y ∗ |PYλ ) = o(λ) , λ → 0.

Finally, since
(λ)
PX|Y ◦ PYλ = PX ,
we have
   
(λ) dPX|Y ∗ dPX|Y ∗ ∗
I(X; Yλ ) = D(PX|Y kPX|Y ∗ |PYλ ) + λ E log (X|Y ) + λ̄ E log (X|Y ) (27.19)
dPX dPX
= I ∗ + λ(I1 − I ∗ ) + o(λ) , (27.20)

contradicting the assumption


I(X; Yλ ) ≥ I ∗ = R(D) .

27.3 Lossy joint source-channel coding


The lossy joint source channel coding problem refers to the fundamental limits of lossy compression
followed by transmission over a channel.
Problem Setup: For an A-valued ({S1 , S2 , . . . } and distortion metric d : Ak × Âk → R, a
lossy JSCC is a pair (f, g) such that

f ch. g
S k −→ X n −→ Y n −→ Ŝ k

Definition 27.1. (f, g) is a (k, n, D)-JSCC if E[d(S k , Ŝ k )] ≤ D.

Sk Xn Yn Ŝ k
Source JSCC enc Channel JSCC dec
k
R= n

where ρ is the bandwidth expansion factor :


n
ρ= channel uses/symbol.
k
Our goal is to minimize ρ subject to a fidelity guarantee by designing the encoder/decoder pairs
smartly. The asymptotic fundamental limit for a lossy JSCC is
n
ρ∗ (D) = lim sup min{ : ∃(k, n, D) − code}
n→∞ k
For simplicity in this lecture we will focus on JSCC for stationary memoryless sources with
separable distortion + stationary memoryless channels.

290
27.3.1 Converse
The converse for the JSCC is quite simple. Note that since there is no  under consideration, the
strong converse is the same as the weak converse. The proof architecture is identical to the weak
converse of lossless JSCC which uses Fano’s inequality.
Theorem 27.3 (Converse). For any source such that
1
Ri (D) = lim inf I(S k ; Ŝ k )
k→∞ k PŜ k |S k :E[d(S k ,Ŝ k )]≤D

we have
Ri (D)
ρ∗ (D) ≥
Ci
Remark: The requirement of this theorem on the source isn’t too stringent; the limit expression
for Ri (D) typically exists for stationary sources (like for the entropy rate)
Proof. Take a (k, n, D)-code S k → X n → Y n → Ŝ k . Then

inf I(S k ; Ŝ k ) ≤ I(S k ; Ŝ k ) ≤ I(X k ; Y k ) ≤ sup I(X n ; Y n )


PŜ k |S k PX n

Which follows from data processing and taking inf/sup. Normalizing by 1/k and taking the liminf
as n → ∞
1
(LHS) lim inf sup I(X n ; Y n ) = Ci
n→∞ n PX n
1
(RHS) lim inf inf I(S kn ; Ŝ kn ) = Ri (D)
n→∞ kn PŜ kn |S kn

And therefore, any sequence of (kn , n, D)-codes satisfies


n Ri (D)
lim sup ≥
n→∞ kn Ci

Note: Clearly the assumptions in Theorem 27.3 are satisfied for memoryless sources. If the source
S is iid Bern(1/2) with Hamming distortion, then Theorem 27.3 coincides with the weak converse
for channel coding under bit error rate in Theorem 16.4:
nC
k≤
1 − h(pb )
which we proved using ad hoc techniques. In the case of channel with cost constraints, e.g., the
AWGN channel with C(SNR) = 21 log(1 + SNR), we have
 
C(SNR)
pb ≥ h−1 1 −
R
This is often referred to as the Shannon limit in plots comparing the bit-error rate of practical codes.
See, e.g., Fig. 2 from [RSU01] for BIAWGN (binary-input) channel. This is erroneous, since the
pb above refers to the bit-error of data bits (or systematic bits), not all of the codeword bits. The
latter quantity is what typically called BER in the coding-theoretic literature.

291
27.3.2 Achievability via separation
The proof strategy is similar to the lossless JSCC: We construct a separated lossy compression
and channel coding scheme using our tools from those areas, i.e., let the JSCC encoder to be
the concatenation of a loss compressor and a channel encoder, and the JSCC decoder to be the
concatenation of a channel decoder followed by a loss compressor, then show that this separated
construction is optimal.
Theorem 27.4. For any stationary memoryless source (PS , A, Â, d) satisfying assumption A1
(below), and for any stationary memoryless channel PY |X ,

R(D)
ρ∗ (D) =
C
Note: The assumption on the source is to control the distortion incurred by the channel decoder
making an error. Although we know that this is a low-probability event, without any assumption
on the distortion metric, we cannot say much about its contribution to the end-to-end average
distortion. This will not be a problem if the distortion metric is bounded (for which Assumption A1
is satisfied of course). Note that we do not have this nuisance in the lossless JSCC because we at
most suffer the channel error probability (union bound).
The assumption is rather technical which can be skipped in the first reading. Note that it is
trivially satisfied by bounded distortion (e.g., Hamming), and can be shown to hold for Gaussian
source and MSE distortion.

Proof. The converse direction follows from the previous theorem. For the other direction, we
constructed a separated compression / channel coding scheme. Take

S k → W → Ŝ k compressor to W ∈ [2kR(D)+o(k) ] with E[d(S k , Ŝ k )] ≤ D


W → X n → Y n → Ŵ maximal probability of error channel code (assuming kR(D) ≤ nC + o(n))
with P[W 6= Ŵ ] ≤  ∀PW

So that the overall system is

S k −→ W −→ X n −→ Y n −→ Ŵ −→ Ŝ k

Note that here we need a maximum probability of error code since when we concatenate these
two schemes, W at the input of the channel is the output of the source compressor, which is not
guaranteed to be uniform. Now that we have a scheme, we must analyze the average distortion to
show that it meets the end-to-end distortion constraint. We start by splitting the expression into
two cases

E[d(S k , Ŝ k )] = E[d(S k , Ŝ k (W ))1{W = Ŵ }] + E[d(S k , Ŝ k (Ŵ ))1{W 6= Ŵ }]

By assumption on our lossy code, we know that the first term is ≤ D. In the second term, we know
that the probability of the event {W =6 Ŵ } is small by assumption on our channel code, but we
cannot say anything about E[d(S , Ŝ (Ŵ ))] unless, for example, d is bounded. But by Lemma 27.1
k k

(below), ∃ code S k → W → Ŝ k such that

(1) E[d(S k , Ŝ k )] ≤ D holds

(2) d(ak0 , Ŝ k ) ≤ L for all quantization outputs Ŝ k , where ak0 = (a0 , . . . , a0 ) is some fixed string of
length k from the Assumption A1 below.

292
The second bullet says that all points in the reconstruction space are “close” to some fixed string.
Now we can deal with the troublesome term
E[d(S k , Ŝ k (Ŵ ))1{W 6= Ŵ }] ≤ E[1{W 6= Ŵ }λ(d(S k , âk0 ) + d(ak0 , Ŝ k ))]
(by point (2) above) ≤ λE[1{W 6= Ŵ }d(S k , âk0 )] + λE[1{W 6= Ŵ }L]
≤ λo(1) + λL → 0 as  → 0
where in the last step we applied the same uniform integrability argument that showed vanishing of
the expectation in (26.21) before.
In all, our scheme meets the average distortion constraint. Hence we conclude that for ∀ρ >
R(D)
C , ∃ sequence of (k, n, D + o(1))-codes.

The following assumption is critical to the previous theorem:


Assumption A1: For a source (PS , A, Â, d), ∃λ ≥ 0, a0 ∈ A, â0 ∈ Â such that
1. d(a, â) ≤ λ(d(a, âo ) + d(a0 , â)) ∀a, â (generalized triangle inequality)
2. E[d(S, â0 )] < ∞ (so that Dmax < ∞ too).
3. E[d(a0 , Ŝ)] < ∞ for any output distribution PŜ achieving the rate-distortion function R(D)
at some D.
4. d(a0 , â0 ) < ∞.
This assumption says that the spaces A and  have “nice centers”, in the sense that the distance
between any two points is upper bounded by a constant times the distance from the centers to each
point (see figure below).
b
b

a â

b b

a0 â0

A Â

But the assumption isn’t easy to verify, or clear which sources satisfy the assumptions. Because of
this, we now give a few sufficient conditions for Assumption A1 to hold.
Trivial Condition: If the distortion function is bounded, then the assumption A1 holds
automatically. In other words, if we have a discrete source with finite alphabet |A|, |Â| < ∞ and
a finite distortion function d(a, â) < ∞, then A1 holds. More generally, we have the following
criterion.
Theorem 27.5 (Criterion for satisfying A1). If A = Â and d(a, â) = ρq (a, â) for some metric ρ
with q ≥ 1, and Dmax , inf â0 E[d(S, â0 )] < ∞, then A1 holds.
Proof. Take a0 = â0 that achieves finite Dp (in fact, any points can serve as centers in a metric
space). Then
 q
1 1 1
( ρ(a, â))q ≤ ρ(a, a0 ) + ρ(a0 , â)
2 2 2
1 1
(Jensen’s) ≤ ρq (a, a0 ) + ρq (a0 , â)
2 2

293
And thus d(a, â) ≤ 2q−1 (d(a, a0 ) + d(a0 , â)). Taking λ = 2q−1 verifies (1) and (2) in A1. To verify
(3), we can use this generalized triangle inequality for our source

d(a0 , Ŝ) ≤ 2q−1 (d(a0 , S) + d(S, Ŝ))

Then taking the expectation of both sides gives

E[d(a0 , Ŝ)] ≤ 2q−1 (E[d(a0 , S)] + E[d(S, Ŝ)])


≤ 2q−1 (Dmax + D) < ∞

So that condition (3) in A1 holds.

So we see that metrics raised to powers (e.g. squared Euclidean norm) satisfy the condition A1.
The lemma used in Theorem 27.4 is now given.
Lemma 27.1. Fix a source satisfying A1 and an arbitrary PŜ|S . Let R > I(S; Ŝ), L > max{E[d(a0 , Ŝ)], d(a0 , â0 )}
and D > E[d(S, Ŝ)]. Then, there exists a (k, 2kR , D)-code such that for every reconstruction point
x̂ ∈ Âk we have d(ak0 , x̂) ≤ L.

Proof. Let X = Ak , X̂ = Âk and PX = PSk , PY |X = PŜ|S


k . Then apply the achievability bound for

excess distortion from Theorem 26.4 with


(
d(x, x̂) d(ak0 , x̂) ≤ L
d1 (x, x̂) =
+∞ o/w

Note that this is NOT a separable distortion metric. Also note that without any change in d1 -
distortion we can remove all (if any) reconstruction points x̂ with d(ak0 , x̂) > L. Furthermore, from
the WLLN we have for any D > D0 > E[d(S, Ŝ 0 )]

P[d1 (X, Y ) > D0 ] ≤ P[d(S k , Ŝ k ) > D0 ] + P[d(ak0 , Ŝ k ) > L] → 0

as k → ∞ (since E[d(S, Ŝ)] < D0 and E[a0 , Ŝ] < L). Thus, overall we get M = 2kR reconstruction
points (c1 , . . . , cM ) such that
P[ min d(S k , cj ) > D0 ] → 0
j∈[M ]

and d(ak0 , cj ) ≤ L. By adding cM +1 = (â0 , . . . , â0 ) we get

E[ min d(S k , cj )] ≤ D0 + E[d(S k , cM +1 )1{ min d(S k , cj ) > D0 }] = D0 + o(1) ,


j∈[M +1] j∈[M ]

where the last estimate follows from uniform integrability as in the vanishing of expectation in (26.21).
Thus, for sufficiently large n the expected distortion is ≤ D, as required.

To summarize the results in this section, under stationarity and memorylessness assumptions on
the source and the channel, we have shown that the following separately-designed scheme achieves
the optimal rate for lossy JSCC: First compress the data, then encode it using your favorite channel
code, then decompress at the receiver.

R(D) bits/sec ρC bits/sec


Source JSCC enc Channel

294
27.4 What is lacking in classical lossy compression?
Examples of some issues with the classical compression theory:

• compression: we can apply the standard results in compression of a text file, but it is extremely
difficult for image files due to the strong spatial correlation. For example, the first sentence
and the last in Tolstoy’s novel are pretty uncorrelated. But the regions in the upper-left
and bottom-right corners of one image can be strongly correlated. Thus for practicing the
lossy compression of videos and images the key problem is that of coming up with a good
“whitening” basis.

• JSCC: Asymptotically the separation principle sounds good, but the separated systems can
be very unstable - no graceful degradation. Consider the following example of JSCC.
Example: Source = Bern( 12 ), channel = BSC(δ).
R(D)
1. separate compressor and channel encoder designed for C(δ) =1
2. a simple JSCC:
ρ = 1, Xi = Si

295
Part VI

Advanced topics

296
§ 28. Applications to statistical decision theory

In this lecture we discuss applications of information theory to statistical decision theory. Although
this lecture only focuses on statistical lower bound (converse result), let us remark in passing that
the impact of information theory on statistics is far from being only on proving impossibility results.
Many procedures are based on or inspired by information-theoretic ideas, e.g., those based on
metric entropy, pairwise comparison, maximum likelihood estimator and analysis, minimum distance
estimator (Wolfowitz), maximum entropy estimators, EM algorithm, minimum description length
(MDL) principle, etc.
We discuss two methods: LeCam-Fano (hypothesis testing) method and the rate-distortion
(mutual information) method.
We begin with the decision-theoretic setup of statistical estimation. The general paradigm is
the following:
θ
|{z} → |{z}
X → |{z} θ̂
parameter data estimator

The main ingredients are

• Parameter space: Θ 3 θ

• Statistical model: {PX|θ : θ ∈ Θ}, which is a collection of distributions indexed by the


parameter

• Estimator: θ̂ = θ̂(X)

• Loss function: `(θ, θ̂) measures the inaccuracy.

The goal is make random variable `(θ, θ̂) small either in probability or in expectation, uniformly
over the unknown parameter θ. To this end, we define the minimax risk

R∗ = inf sup Eθ [`(θ, θ̂)].


θ̂ θ∈Θ

Here Eθ denotes averaging with respect to the randomness of X ∼ Pθ .


Ideally we want to compute R∗ and find the minimax optimal estimator that achieves the
minimax risk. This tasks can be very difficult especially in high dimensions, in which case we will
be content with characterizing the minimax rate, which approximates R∗ within multiplicative
universal constant factors, and the estimator that achieves a constant factor of R∗ will be called
rate-optimal.
As opposed to the worst-case analysis of the minimax risk, the Bayes approach is an average-case
analysis by considering the average risk of an estimator over all θ ∈ Θ. Let the prior π be a
probability distribution on Θ, from which the parameter θ is drawn. Then, the average risk (w.r.t
π) is defined as
Rπ (θ̂) = Eθ∼π Rθ (θ̂) = Eθ,X `(θ, θ̂).

297
The Bayes risk for a prior π is the minimum that the average risk can achieve, i.e.

Rπ∗ = inf Rπ (θ̂).


θ̂

By the simple logic of “maximum ≥ average”, we have

R∗ ≥ Rπ∗ (28.1)

and in fact R∗ = supπ∈M(Θ) Rπ∗ whenever the minimax theorem holds, where M(Θ) denotes the
collection of all probability distributions on Θ. In other words, solving the minimax problem can be
done by finding the least-favorable (Bayesian) prior. Almost all of the minimax lower bounds boil
down to bounding from below the Bayes risk for some prior. When this prior is uniform on just two
points, the method is known under a special name of (two-point) LeCam or LeCam-Fano method.
Note also that when `(θ, θ̂) = kθ − θ̂k22 is the quadratic `2 risk, the optimal estimator achieving Rπ∗
is easy to describe: θ̂∗ = E[θ|X]. This fact, however, is of limited value, since typically conditional
expectation is very hard to analyze.

28.1 Fano, LeCam and minimax risks


We demonstrate the LeCam-Fano method on the following example:
• Parameter space θ ∈ [0, 1]
• Observation model Xi – i.i.d. Bern(θ)
• Quadratic loss function:
`(θ̂, θ) = (θ̂ − θ)2

• Fundamental limit:
R∗ (n) , sup inf E[(θ̂(X n ) − θ)2 |θ = θ0 ]
θ0 ∈[0,1] θ̂

A natural estimator to consider is the empirical mean:


1X
θ̂emp (X n ) = Xi
n
i

It achieves the loss


θ0 (1 − θ0 ) 1
sup E[(θ̂emp − θ)2 |θ = θ0 ] = sup = . (28.2)
θ0 θ0 n 4n
The question is how close this is to the optimal.
First, recall the Cramer-Rao lower bound : Consider an arbitrary statistical estimation problem
θ → X → θ̂ with θ ∈ R and PX|θ (dx|θ0 ) = f (x|θ)µ(dx) with f (x|θ) is differentiable in θ. Then for
any θ̂(x) with E[θ̂(X)|θ] = θ + b(θ) and smooth b(θ) we have
(1 + b0 (θ0 ))2
E[(θ̂ − θ)2 |θ = θ0 ] ≥ b(θ0 )2 + , (28.3)
JF (θ0 )

where JF (θ0 ) = Var[ ∂ ln f∂θ(X|θ) |θ = θ0 ] is the Fisher information (4.7). In our case, for any unbiased
estimator (i.e. b(θ) = 0) we have
θ0 (1 − θ0 )
E[(θ̂ − θ)2 |θ = θ0 ] ≥ ,
n

298
and we can see from (28.2) that θ̂emp is optimal in the class of unbiased estimators.
Can biased estimators do better? The answer is yes. Consider
1 − n X 1 1
θ̂bias = (Xi − ) + ,
n 2 2
i

where choice of n > 0 “shrinks” the estimator towards 12 and regulates the bias-variance tradeoff.
1
In particular, setting n = √n+1 achieves the minimax risk

1
sup E[(θ̂bias − θ)2 |θ = θ0 ] = √ , (28.4)
θ0 4( n + 1)2

which is better than the empirical mean (28.2), but only slightly.
How do we show that arbitrary biased estimators can not do significantly better? This is where
LeCam-Fano method comes handy. Suppose some estimator θ̂ achieves

E[(θ̂ − θ)2 |θ = θ0 ] ≤ ∆2n (28.5)

for all θ0 . Then, setup the following probability space:

W → θ → X n → θ̂ → Ŵ

• W ∼ Bern(1/2)

• θ = 1/2 + κ(−1)W ∆n where κ > 0 is to be specified later

• X n is i.i.d. Bern(θ)

• θ̂ is the given estimator

• Ŵ = 0 if θ̂ > 1/2 and Ŵ = 1 otherwise

The idea here is that we use our high-quality estimator to distinguish between two hypotheses
θ = 1/2 ± κ∆n . Notice that for probability of error we have:

E[(θ̂ − θ)2 ] 1
P[W 6= Ŵ ] = P[θ̂ > 1/2|θ = 1/2 − κ∆n ] ≤ 2 2
≤ 2
κ ∆n κ

where the last steps are by Chebyshev and (28.5), respectively. Thus, from Fano’s inequality
Theorem 5.3 we have  
1
I(W ; Ŵ ) ≥ 1 − 2 log 2 − h(κ−2 ) .
κ
On the other hand, from data-processing and golden formula we have

I(W ; Ŵ ) ≤ I(θ; X n ) ≤ D(PX n |θ kBern(1/2)n |Pθ )

Computing the last divergence we get

D(PX n |θ kBern(1/2)n |Pθ ) = nd(1/2 − κ∆n k1/2) = n(log 2 − h(1/2 − κ∆n ))

As ∆n → 0 we have
h(1/2 − κ∆n ) = log 2 − 2 log e · (κ∆n )2 + o(∆2n ) .

299
So altogether, we get that for every fixed κ we have
 
1
1 − 2 log 2 − h(κ−2 ) ≤ 2n log e · (κ∆n )2 + o(n∆2n ) .
κ
In particular, by optimizing over κ we get that for some constant c ≈ 0.015 > 0 we have
c
∆2n ≥ + o(1/n) .
n
Together with (28.2), we have
0.015 1
+ o(1/n) ≤ R∗ (n) ≤ ,
n 4n
and thus the empirical-mean estimator is rate-optimal.
We mention that for this particular problem (estimating mean of Bernoulli samples) the minimax
risk is known exactly:
1
R∗ (n) = √ (28.6)
4(1 + n)2
but obtaining this requires different methods.1 In fact, even showing R∗ (n) = 4n 1
+ o(1/n) requires
careful priors on θ (unlike the simple two-point prior we used above).2
We demonstrated here the essense of the Fano method of proving lower (impossibility) bounds
in statistical decision theory. Namely, given an estimation task we select a prior, uniform on finitely
many θ’s, which on one hand yields a rather small information I(θ; X) and on the other hand has
sufficiently separated points which thus should be distinguishable by a good estimator. For more
see [Yu97].
A natural (and very useful) generalization is to consider non-discrete prior Pθ , and use the
following natural chain of inequalities

f (Pθ , R) ≤ I(θ; θ̂) ≤ I(θ; X n ) ≤ sup I(θ; X n ) ,


where
f (Pθ , R) , inf{I(θ; θ̂) : Pθ̂|θ s.t. E[`(θ, θ̂)] ≤ R}
is the rate-distortion function. This method we discuss next.
1
The easiest way to get this is to apply (28.1). . Fortunately, in this case if π is the β-distribution, computation of
conditional expectation can be performed in closed form, and optimizing parameters of the β-distribution one recovers
a lower bound that together with (28.4) establishes (28.6). Note that the resulting worst-case π is not uniform, and in
fact β → ∞ (i.e. π concentrates in a small region around θ = 1/2).
2
It follows from the following Bayesian Cramer-Rao lower bound [GL95] : For any estimator θ̂ and for any prior
π(θ)dθ with smooth density π we have
(log e)2
Eθ∼π [(θ̂(X) − θ)2 ] ≥ ,
E[JF (θ)] + JF (π)
R 0 (θ))2
where JF (θ) is as in (28.3), JF (π) , (log e)2 (ππ(θ) dθ. Then taking π supported on a n−1/4 -neighborhood
surrounding a given point θ0 we get that E[JF (θ)] = θ0 (1−θ0 ) + o(n) and JF (π) = o(n), yielding
n

θ0 (1 − θ0 )
R∗ (n) ≥ + o(1/n) .
n
This is a rather general phenomenon: Under regularity assumptions in any iid estimation problem θ → X n → θ̂ with
quadratic loss we have
1
R∗ (n) = + o(1/n) .
inf θ JF (θ)

300
28.2 Mutual information method
The main workhorse will be

1. Data processing inequality

2. Rate-distortion theory

3. Capacity and mutual information bound

To illustrate the mutual information method and its execution in various problems, we will discuss
three vignettes:

• Denoise a vector;

• Denoise a sparse vector;

• Community detection.

Here’s the main idea of the mutual information method. Fix some prior π and we turn to lower
bound Rπ∗ . The unknown θ is distributed according to π. Let θ̂ be a Bayes optimal estimator that
achieves the Bayes risk Rπ∗ .
The mutual information method consists of applying the data processing inequality to the
Markov chain θ → X → θ̂:
dpi
inf I(θ; θ̂) ≤ I(θ, θ̂) ≤ I(θ; X). (28.7)

Pθ̃|θ :E`(θ,θ̃)≤Rπ

Note that

• The leftmost quantity can be interpreted as the minimum amount of information required for
an estimation task, which is reminiscent of rate-distortion function.

• The rightmost quantity can be interpreted as the amount of information provided by the
data about the parameter. Sometimes it suffices to further upper-bound it by capacity of the
channel θ 7→ X:
I(θ; X) ≤ sup I(θ; X). (28.8)
π∈M(Θ)

• This chain of inequalities is reminiscent of how we prove the converse in joint-source channel
coding (Section 27.3), with the capacity-like upper bound and rate-distortion-like lower bound.

• Only the lower bound is related to the loss function.

• Sometimes we need a smart choice of the prior.

301
28.2.1 Denoising (Gaussian location model)
The setting is the following: given n noisy observations of a high-dimensional vector θ ∈ Rp ,
i.i.d.
Xi ∼ N (θ, Ip ), i = 1, . . . , n (28.9)

The loss is simply the quadratic error: `(θ, θ̂) = kθ − θ̂k22 . Next we show that
p
R∗ = , ∀p, n. (28.10)
n
P
Upper bound. Consider the estimator X̄ = n1 ni=1 Xi . Then X̄ ∼ N (θ, n1 Ip ) and clearly
EkX̄ − θk22 = p/n.
Lower bound. Consider a Gaussian prior θ ∼ N (0, σ 2 Ip ). Instead of evaluating the exact Bayes
risk (MMSE) which is a simple exercise, let’s implement the mutual information method (28.7).
Given any estimator θ̂. Let D = Ekθ̂ − θk22 . Then
 
p σ2 suff stat p σ2
log = inf I(θ; θ̂) ≤ I(θ, θ̂) ≤ I(θ; X) = I(θ; X̄) = log 1 + .
2 D/p Pθ̃|θ :Ekθ−θ̃k22 ≤D 2 1/n

where the left inequality follows from the Gaussian rate-distortion function (27.3) and the single-
letterization result (Theorem 26.1) that reduces p dimensions to one dimension. Putting everything
together we have
pσ 2
R∗ ≥ Rπ∗ ≥ .
1 + nσ 2
Optimizing over σ 2 (by sending it to ∞), we have R∗ ≥ p/n.

28.2.2 Denoising sparse vectors


Here the setting is identical to (28.9), expect that we have the prior knowledge that θ is sparse, i.e.,

θ ∈ Θ , {all p-dim k-sparse vectors} = {θ ∈ Rp : kθk0 ≤ k}


P
where kθk0 = i∈[p] 1{θi 6=0} is the sparisty (number of nonzeros) of θ.
The minimax rate of denoising k-sparse vectors is given by the following
k ep
R∗  log , ∀k, p, n. (28.11)
n k
Before proceeding to the proof, a quick observation is that we have the oracle lower bound R∗ ≥ nk
follows from (28.10), since if the support of the θ is known which reduces the problem to k dimensions.
Thus, the meaning of statement (28.11) is that the lack of knowledge of the support contributes
(merely) a log factor.
To show this, again, by passing to sufficient statistics, it suffices to consider the observation
X ∼ N (θ, n1 Ip ). For simplicity we only consider n = 1 below.
Upper bound. (Sketch) The rate is achieved by thresholding the observation X that only keep
the large entries. The intuition is that since
√ the ground truth θ has many zeros, we should kill the
small entries in X. Since kZk∞ ≤ (2 + ) log p with √ high probability, hard thresholding estimator
that sets all entries of X with magnitude ≤ (2 + ) log p achieves a mean-square error of O(k log p),
which is rate optimal unless k = Ω(p), in which case we can simply apply the original X as the
estimator.

302
Lower bound. In view of the oracle lower bound, it suffices to consider k = O(p). Next we
assume k ≤ p/16. Consider a p-dimensional Hamming sphere of radius k, i.e.

B = {b ∈ {0, 1}p : wH (b) = k},

where wH (b)
qis the Hamming weights of b. Let b be drawn uniformly from the set B and θ = τ b,
where τ = k
100 log kp . Thus, we have the following Markov chain which represents our problem
model,
b → θ → X → θ̂ → b̂.
τ 2k
Note that the channel θ → X is just p uses of the AWGN channel, with power p , and thus
by Theorem 4.6 and single-letterization (Theorem 5.1) we have
 
p τ 2k log e
I(θ; θ̂) ≤ I(θ; X) ≤ log 1 + ≤ sup kθk22 = ckτ 2 ,
2 p θ∈G 2

for some c > 0. We note that related techniques have been used in proving lower bound for stable
recovery in noiseless compressed sensing [PW12].
To give a lower bound for I(θ; θ̂), consider

b̂ = argmin kθ̂ − τ bk22 .


b∈B

Since b̂ is the minimizer of kθ̂ − τ bk22 , we have,

kτ b̂ − θk2 ≤ kτ b̂ − θ̂k2 + kθ − θ̂k2 ≤ 2kθ − θ̂k2 .

Thus,
τ 2 dH (b, b̂) = kτ b̂ − θk22 ≤ 4kθ − θ̂k22 ,
where dH denotes the Hamming distance between b and b̂. Suppose that Ekθ̂ − θk22 = τ 2 k. Then
we have EdH (b, b̂) ≤ 4k. Our goal is to show that  is at least a small constant by the mutual
information method. First,

I(b̂; b) ≥ min I(b̂; b).


EdH (b,b̂)≤4k

Before we bound the RHS, let’s first guess its behavior. Note that it is the rate-distortion function
of the random vector b, which is uniform over B, the Hamming sphere of radius k, and each entry is
Bern(k/p). Had the entries been iid, then rate-distortion theory ((27.1) and Theorem 26.1) would
yield that the RHS is simply p(h(k/p) − h(4k/p)). Next, following the proof of (27.1), we show
that this behavior is indeed correct:

min I(b̂; b) = H(b) − max H(b|b̂)


EdH (b,b̂)≤4k EdH (b,b̂)≤4k
 
p
= log − max H(b ⊕ b̂|b̂)
k EdH (b,b̂)≤4k
 
p
≥ log − max H(W ).
k EwH (W )≤4k

The maximum-entropy problem is easy to solve:


 
m
max H(W ) = ph . (28.12)
EwH (W )=m,W ∈{0,1}p p

303
The solution is W = Bern(m/p)⊗p . One way to get this is to write H(W ) = p log 2−D(PW kBern(1/2)⊗p )
and apply Theorem 14.3 with X = wH (W ), to get that optimal PW (w) ∼ exp{cwH (w)}. In the
end we get Combine this with the previous bound, we get
 
p 4k
I(b̂; b) ≥ log − ph( ).
k p
On the other hand, we have
p
I(b̂; b) ≤ I(θ; Y ) ≤ cτ 2 = c0 k log .
k
p
Note that h(α)  −α log α for α < 41 . WLOG, since k ≤ 16 , we have  ≥ c0 for some universal
constant c0 . Therefore
p
R∗ ≥ τ 2 k & k log .
k
Combining with the result in the oracle lower bound, we have the desired.
p
R∗ & k + k log
k
or for general n ≥ 1
k ep
R∗ & log .
n k

Remark 28.1. Let Rk,p = R∗ . For the case k = o(p), the sharp asymptotics is

∗ p
Rk,p ≥ (2 + op (1))k log .
k
To prove this result, we need to first show that for the case k = 1,

R1,p ≥ (2 + op (1)) log p.

Next, show that for any k, the minimax risk is lower bounded by the Bayesian risk with the block
prior. The block prior is that we divide the p-coordinate into k blocks, and pick one coordinate
from each p/k-coordinate uniformly. With this prior, one can show
∗ ∗ p
Rk,p ≥ kR1,p/k = (2 + op (1))k log .
k

28.2.3 Community detection


We only consider the problem of a single hidden community. Given a graph of n vertices, a community
is a subset of vertices where the edges tend to be denser than everywhere else. Specifically, we
consider the planted dense subgraph model (i.e., the stochastic block model with a single community).
Let the community C be uniformly drawn from all subsets of [n] of cardinality k. The graph is
generated by independently connecting each pair of vertices, with probability p if both belong to the
community C ∗ , and with probability q otherwise. Equivalently, in terms of the adjacency matrix A,
Aij ∼ Bern(p) if i, j ∈ C and Bern(q) otherwise. Assume p > q. Thus the subgraph induced by C ∗
is likely to be denser than the rest of the graph. We are interested in the large-graph asymptotics,
where both the network size n and the community size k grow to infinity.
Given the adjacency matrix A, the goal is to recover the hidden community C almost perfectly,
i.e., achieving
E[|Ĉ4C|] = o(k) (28.13)

304
Given the network size n and the community size k, there exists a sharp condition on the edge
density (p, q) that says the community needs to be sufficient denser than the outside. It turns
out this is precisely described by the binary divergence d(pkq). Under the assumption that p/q is
bounded, e.g., p = 2q, the information-theoretic necessary condition is

kd(pkq)
k · d(pkq) → ∞ and lim inf ≥ 2. (28.14)
n→∞ log nk

This condition is tight in the sense that if in the above “≥” is replaced by “>”, then there exists an
estimator (e.g., the maximal likelihood estimator) that achieves (28.13).
Next we only prove the necessity of the second condition in (28.14), again using the mutual
information method. Let ξ and ξˆ be the indicator vector of the community C and the estimator
Ĉ, respectively. Thus ξ = (ξ1 , . . . , ξn ) is uniformly drawn from the set {x ∈ {0, 1}n : wH (x) = k}.
Therefore ξi ’s are individually Bern(k/n). Let E[dH (ξ, ξ)] ˆ = n k, where n → 0 by assumption.
Consider the following chain of inequalities, which lower bounds the amount of information required
for a distortion level n :
dpi
ˆ ξ) ≥
I(A; ξ) ≥ I(ξ; min ˜ ξ) ≥ H(ξ) −
I(ξ; max H(ξ˜ ⊕ ξ)
E[d(ξ̃,ξ)]≤n k E[d(ξ̃,ξ)]≤n k
   
(28.12) n n k n
= log − nh ≥ k log (1 + o(1)),
k n k
 k
where the last step follows from the bound nk ≥ nk , the assumption k/n is bounded away from
one, and the bound h(p) ≤ −p log p + p for p ∈ [0, 1].
On the other hand, to bound the mutual information, we use the golden formula Corollary 3.1
and choose a simple reference Q:

I(A; ξ) = min D(PA|ξ kQ|Pξ )


Q
n
≤ D(PA|ξ kBern(q)⊗( 2 ) |Pξ )
 
k
= d(pkq).
2
(k−1)D(P kQ)
Combining the last two displays yields lim inf n→∞ log(n/k) ≥ 2.

305
§ 29. Multiple-access channel

29.1 Problem motivation and main results

Note: In network community, people are mostly interested in channel access control mechanisms
that help to detect or avoid data packet collisions so that the channel is shared among multiple
users.

The famous ALOHA protocal achieves


X
Ri ≈ 0.37C
i

where C is the (single-user) capacity of the channel.1


In information theory community, the goal is to achieve
X
Ri > C
i

The key to achieve this is to use coding so that collisions are resolvable.
In the following discussion we shall focus on the case with two users. This is without loss of
much generality, as all the results can easily be extended to N users.
Definition 29.1.
• Multiple-access channel: {PY n |An ,B n : An × B n → Y n , n = 1, 2, . . . }.
• a (n, M1 , M2 , ) code is specified by
f1 : [M1 ] → An , f2 : [M2 ] → B n
g : Y n → [M1 ] × [M2 ]
1
Note that there is a lot of research on how to achieve just these 37%. Indeed, ALOHA in a nutshell simply
postulates that every time a user has a packet to transmit, he should attempt transmission in each time slot with
probability p, independently. The optimal setting of p is the inverse of the number of actively trying users. Thus, it is
non-trivial how to learn the dynamically changing number of active users without requiring a central authority. This
is how ideas such as exponential backoff etc arise.

306
W1 , W2 ∼ uniform, and the codes achieves
[
P[{W1 6= Ŵ1 } {W2 6= Ŵ2 }] ≤ 

• Fundamental limit of capacity region

R∗ (n, ) = {(R1 , R2 ) : ∃ a (n, 2nR1 , 2nR2 , )-code}

• Asymptotics:  
C = cl lim inf R∗ (n, )
n→∞

where cl denotes the closure of a set.


Note: lim inf and lim sup of a sequence of sets {An }:

lim inf An = {ω : ω ∈ An , ∀n ≥ n0 }
n
lim sup An = {ω : ω infinitely occur}
n

• \
C = lim C = C
>0

Theorem 29.1 (Capacity region).


[
C =co P enta(PA , PB ) (29.1)
PA ,PB
h [ i
= P enta(PA|U , PB|U |PU ) (29.2)
PU,A,B =PU PA|U PB|U

where co is the set operator of constructing the convex hull followed by taking the closure, and
P enta(·, ·) is defined as follows:
 
 0 ≤ R1 ≤ I(A; Y |B) 
P enta(PA , PB ) = (R1 , R2 ) : 0 ≤ R2 ≤ I(B; Y |A)
 
R1 + R2 ≤ I(A, B; Y )
 
 0 ≤ R1 ≤ I(A; Y |B, U ) 
P enta(PA|U , PB|U |PU ) = (R1 , R2 ) : 0 ≤ R2 ≤ I(B; Y |A, U )
 
R1 + R2 ≤ I(A, B; Y |U )

307
Note: the two forms in (29.1) and (29.2) are equivalent without cost constraints. In the case when
constraints such as Ec1 (A) ≤ P1 and Ec2 (B) ≤ P2 are present, only the second expression yields
the true capacity region.

29.2 MAC achievability bound


First, we introduce a lemma which will be used in the proof of Theorem 29.1.
Lemma 29.1. ∀PA , PB , PY |A,B such that PA,B,Y = PA PB PY |A,B , and ∀γ1 , γ2 , γ12 > 0, ∀M1 , M2 ,
there exists a (M1 , M2 , ) MAC code such that:
h [ [ i
 ≤P {i12 (A, B; Y ) ≤ log γ12 } {i1 (A; Y |B) ≤ log γ1 } {i2 (B; Y |A) ≤ log γ2 }
+ (M1 − 1)(M2 − 1)e−γ12 + (M1 − 1)e−γ1 + (M2 − 1)e−γ2 (29.3)

Proof. We again use the idea of random coding.


Generate the codebooks

c1 , . . . , cM1 ∈ A, d1 , . . . , dM2 ∈ B

where the codes are drawn i.i.d from distributions: c1 , . . . , cM1 ∼ i.i.d. PA , d1 , . . . , dM2 ∼ i.i.d. PB .
The decoder operates in the following way: report (m, m0 ) if it is the unique pair that satisfies:

(P12 ) i12 (cm , dm0 ; y) > log γ12


(P1 ) i1 (cm ; y|dm0 ) > log γ1
(P2 ) i2 (dm0 ; y|cm ) > log γ2

Evaluate the expected error probability:


h
EPe (cM
1
1
, dM2
1 ) = P {(W1 , W2 ) violate (P12 ) or (P1 ) or (P2 )}
[ i
{∃ impostor (W10 , W20 ) that satisfy (P12 ) and (P1 ) and (P2 )}

by symmetry of random codes, we have


h
Pe = E[Pe |W1 = m, W2 = m0 ] = P {(m, m0 ) violate (P12 ) or (P1 ) or (P2 )}
[ i
{∃ impostor (i 6= m, j 6= m0 ) that satisfy (P12 ) and (P1 ) and (P2 )}

308
h [ [ i
⇒ Pe ≤ P {i12 (A, B; Y ) ≤ log γ12 } {i1 (A; Y |B) ≤ log γ1 } {i2 (B; Y |A) ≤ log γ2 } + P[E12 ] + P[E1 ] + P[E2 ]

where

P[E12 ] = P[{∃(i 6= m, j 6= m0 ) s.t. i12 (cm , dm0 ; y) > log γ12 }]


≤ (M1 − 1)(M2 − 1)P[i12 (A, B; Y ) > log γ12 ]
= E[e−i12 (A,B;Y ) 1{i12 (A, B; Y ) > log γ12 }]
≤ e−γ12
P[E2 ] = P[{∃(j 6= m0 ) s.t. i2 (dj ; y|ci ) > log γ2 }]
≤ (M2 − 1)P[i2 (B; Y |A) > log γ2 ]
= EA [e−i2 (B;Y |A) 1{i2 (B; Y |A) > log γ2 }|A]
≤ EA [e−γ2 |A] = e−γ2
similarly P[E1 ] ≤ e−γ1

Note: [Intuition] Consider the decoding step when a random codebook is used. We observe Y
and need to solve an M -ary hypothesis testing problem: Which of {PY |A=cm ,B=dm0 }m,m0 ∈[M1 ]×[M2 ]
produced the sample Y ?
Recall that in P2P channel coding, we had a similar problem and the M-ary hypothesis testing
problem was converted to M binary testing problems:
X 1
PY |X=cj vs PY−j , P ≈ PY
M − 1 Y |X=ci
i6=j

i.e. distinguish cj (hypothesis H0 ) against the average distribution induced by all other codewords
(hypothesis H1 ), which for a random coding ensemble cj ∼ PX is very well approximated by
PY = PY |X ◦ PX . The optimal test for this problem is roughly

PY |X=c
& log(M − 1) =⇒ decide PY |X=cj (29.4)
PY
1
since the prior for H0 is M , while the prior for H1 is MM−1 .
The proof above followed the same idea except that this time because of the two-dimensional
grid structure:

309
there are in fact binary HT of three kinds
1 X X
(P 12) ∼ test PY |A=cm ,B=dm0 vs PY |A=ci ,B=dj ≈ PY
(M1 − 1)(M2 − 1) 0i6=m j6=m
1 X
(P 1) ∼ test PY |A=cm ,B=dm0 vs PY |A=ci ,B=dm0 ≈ PY |B=dm0
M1 − 1
i6=m
1 X
(P 2) ∼ test PY |A=cm ,B=dm0 vs PY |A=cm ,B=dj ≈ PY |A=cm
M2 − 1 0 j6=m

And analogously to (29.4) the optimal tests are given by comparing the respective information
densities with log M1 M2 , log M1 and log M2 .
Another observation following from the proof is that the following decoder would also achieve
exactly the same performance:

• Step 1: rule out all cells (i, j) with i12 (ci , dj ; Y ) . log M1 M2 .

• Step 2: If the messages remaining are NOT all in one row or one column, then FAIL.

• Step 3a: If the messages remaining are all in one column m0 then declare Ŵ2 = m0 . Rule out
all entries in that column with i1 (ci ; Y |dm0 ) . log M1 . If more than one entry remains, FAIL.
Otherwise declare the unique remaining entry m as Ŵ1 = m.

• Step 3b: Similarly with column replaced by row, i1 with i2 and log M1 with log M2 .

The importance of this observation is that in the regime when RHS of (29.3) is small, the
decoder always finds it possible to basically decode one message, “subtract” its influence and then
decode the other message. Which of the possibilities 3a/3b appears more often depends on the
operating point (R1 , R2 ) inside C.

29.3 MAC capacity region proof


Proof. 1. Show C is convex.
Take (R1 , R2 ) ∈ C/2 , and take (R10 , R20 ) ∈ C/2 .
We merge the (n, 2nR1 , 2nR2 , /2) code and the (n, 2nR1 , 2nR2 , /2) code in the following time-
sharing way: in the first n channels, use the first set of codes, and in the last n channels, use
the second set of codes.
0 0
Thus we formed a new (2n, 2R1 +R1 , 2R2 +R2 , ) code, we know that
1 1
C/2 + C/2 ⊂ C
2 2
take limit at both sides
1 1
C+ C⊂C
2 2
also we know that C ⊂ 12 C + 21 C, therefore C = 12 C + 12 C is convex.
Note: the set addition is defined in the following way:

A + B , {(a + b) : a ∈ A, b ∈ B}

310
2. Achievability
STP: ∀PA , PB , ∀(R1 , R2 ) ∈ P enta(PA , PB ), ∃(n, 2nR1 , 2nR2 , )code.
Apply Lemma 29.1 with:

PA → PAn , PB → PBn , PY |A,B → PYn|A,B


M1 = 2nR1 , M2 = 2nR2 ,
log γ12 = n(I(A, B; Y ) − δ), log γ1 = n(I(A; Y |B) − δ), log γ2 = n(I(B; Y |A) − δ).

we have that there exists a (M1 , M2 , ) code with


h 1X n [ 1X n
 ≤P { i12 (Ak , Bk ; Yk ) ≤ log γ12 − δ} { i1 (Ak ; Yk |Bk ) ≤ log γ1 − δ}
n n
k=1 k=1
[ 1X n i
{ i2 (Bk ; Yk |Ak ) ≤ log γ2 − δ}
n
k=1
| {z }
1
+ (2nR1 − 1)(2nR2 − 1)e−γ12 + (2nR1 − 1)e−γ1 + (2nR2 − 1)e−γ2
| {z }
2

by WLLN, the first part goes to zero, and for any (R1 , R2 ) such that R1 < I(A; Y |B) − δ
and R2 < I(B; Y |A) − δ and R1 + R2 < I(A, B; Y ) − δ, the second part goes to zero as well.
Therefore, if (R1 , R2 ) ∈ interior of the pentagon, there exists a (M1 , M2 ,  = o(1)) code.

3. Weak converse

1
Q[W1 = Ŵ1 , W2 = Ŵ2 ] = , P[W1 = Ŵ1 , W2 = Ŵ2 ] ≥ 1 − 
M1 M2
d-proc:
1
d(1 − k ) ≤ inf D(P kQ) = I(An , B n ; Y n )
M1 M2 Q∈(∗)
1
⇒R1 + R2 ≤ I(An , B n ; Y n ) + o(1)
n

311
To get separate bounds, we apply the same trick to evaluate the information flow from the
link between A → Y and B → Y separately:
1
Q1 [W2 = Ŵ2 ] = , P[W2 = Ŵ2 ] ≥ 1 − 
M2
d-proc:
1
d(1 − k ) ≤ inf D(P kQ1 ) = I(B n ; Y n |An )
M2 Q1 ∈(∗1)
1
⇒R2 ≤ I(B n ; Y n |An ) + o(1)
n
similarly we can show that
1
R2 ≤ I(An ; Y n |B n ) + o(1)
n
P
For memoryless channels, we know that n1 I(An , B n ; Y n ) ≤ n1 k I(Ak , Bk ; Yk ). Similarly,
since given B n the channel An → Y n is still memoryless we have
n
X n
X
I(An ; Y n |B n ) ≤ I(Ak ; Yk |B n ) = I(Ak ; Yk |Bk )
k=1 k=1

Notice that each (Ai , Bi ) pair corresponds to (PAk , PBk ), ∀k define


 
 0 ≤ R1,k ≤ I(Ak ; Yk |Bk ) 
P entak (PAk , PBk ) = (R1,k , R2,k ) : 0 ≤ R2,k ≤ I(Bk ; Yk |Ak )
 
R1,k + R2,k ≤ I(Ak , Bk ; Yk )

therefore
h1 X i
(R1 , R2 ) ∈ P entak
n
k
[
⇒C ∈ co P enta
PA ,PB

312
§ 30. Examples of MACs. Maximal Pe and zero-error capacity.

30.1 Recap
Last time we defined the multiple access channel as the sequence of random transformations

{PY n |An B n : An × B n → Y n , n = 1, 2, . . .}

Furthermore, we showed that its capacity region is


[
C = {(R1 , R2 ) : ∃(n, 2nR1 , 2nR2 , ) − M AC code} = co Penta(PA , PB )
PA PB

were co denotes the convex hull of the sets Penta, and Penta is


R1 ≤ I(A; Y |B)
Penta(PA , PB ) = R2 ≤ I(B; Y |A)


R1 + R2 ≤ I(A, B; Y )

So a general MAC and one Penta region looks like


R2
I(A, B; Y )
An I(B; Y |A)
PY n |An B n Yn
Bn R1
I(A; Y |B)

Note that the union of Pentas need not look like a Penta region itself, as we will see in a later
example.

30.2 Orthogonal MAC


The trivial MAC is when each input sees its own independent channel: PY |AB = PY |A PY |B where
the receiver sees (YA , YB ). In this situation, we expect that each transmitter can achieve it’s own
capacity, and no more than that. Indeed, our theorem above shows exactly this:


R1 ≤ I(A; Y |B) = I(A; Y )
Penta(PA , PB ) = R2 ≤ I(B; Y |A) = I(B; Y )


R1 + R2 ≤ I(A, B; Y )

Where in this case the last constraint is not applicable; it does not restrict the capacity region.

313
R2

A PY |A YA C2

B PY |B YB R1
C1

Hence our capacity region is a rectangle bounded by the individual capacities of each channel.

30.3 BSC MAC


Before introducing this channel, we need a definition and a theorem:
Definition 30.1 (Sum Capacity). Csum , max{R1 + R2 : (R1 , R2 ) ∈ C}
Theorem 30.1. Csum = maxA⊥
⊥B I(A, B; Y )

Proof. Since the max above is achieved by an extreme point on one of the Penta regions, we can
drop the convex closure operation to get
[ [
max{R1 + R2 : (R1 , R2 ) ∈ co Penta(PA , PB )} = max{R1 + R2 : (R1 , R2 ) ∈ Penta(PA , PB )}
max {R1 + R2 : (R1 , R2 ) ∈ Penta(PA , PB )} ≤ max I(A, B; Y )
PA ,PB PA ,PB

Where the last step follows from the definition of Penta. Now we need to show that the constraint
on R1 + R2 in Penta is active at at least one point, so we need to show that I(A, B; Y ) ≤
I(A; Y |B) + I(B; Y |A) when A ⊥⊥ B, which follows from applying Kolmogorov identities

I(A; Y, B) = 0 + I(A; Y |B) = I(A; Y ) + I(A; B|Y ) =⇒ I(A; Y ) ≤ I(A; Y |B)


=⇒ I(A, B; Y ) = I(A; Y ) + I(B; Y |A) ≤ I(A; Y |B) + I(B; Y |A)

Hence maxPA ,PB {R1 + R2 : (R1 , R2 ) ∈ Penta(PA , PB )} = maxPA PB I(A, B; Y )

We now look at the BSC MAC, defined by

Y =A+B+Z mod 2
Z ∼ Ber(δ)
A, B ∈ {0, 1}

Since the output Y can only be 0 or 1, the capacity of this channel can be no larger than 1 bit. If
B doesn’t transmit at all, then A can achieve capacity 1 − h(δ) (and B can achieve capacity when
A doesn’t transmit), so that R1 , R2 ≤ 1 − h(δ). By time sharing we can obtain any point between
these two. This gives an inner bound on the capacity region. For an outer bound, we use Theorem
30.1, which gives

Csum = max I(A, B; Y ) = max I(A, B; A + B + Z)


PA PB PA PB
= max H(A + B + Z) − H(Z) = 1 − h(δ)
PA PB

Hence R1 + R2 ≤ 1 − h(δ), so by this outer bound, we can do no better than time sharing between
the two individual channel capacity points.

314
R2

A 1 − h(δ)
Y
B
R1
1 − h(δ)

Remark: Even though this channel seems so simple, there are still hidden things about it, which
we’ll see later.

30.4 Adder MAC


Now we analyze the Adder MAC, which is a noiseless channel defined by:
Y = A + B (over Z)
A, B ∈ {0, 1}
Intuitively, the game here is that when both A and B send either 0 or 1, we receiver 0 or 2 and can
decode perfectly. However, when A sends 0 and B send 1, the situation is ambiguous. To analyze
this channel, we start with an interesting fact
Interesting Fact 1: Any deterministic MAC (Y = f (A, B)) has Csum = max H(Y ). To see
this, just expand I(A, B; Y ).
Therefore, the sum capacity of this MAC is
 
1 1 1 3
Csum = max H(A + B) = H , , = bits
⊥B
A⊥ 4 2 4 2
Which is achieved when both A and B are Ber(1/2). With this, our capacity region is


R1 ≤ I(A; Y |B) = H(A) = 1
Penta(Ber(1/2), Ber(1/2)) = R2 ≤ I(B; Y |A) = H(B) = 1


R1 + R2 ≤ I(A, B; Y ) = 3/2
So the channel can be described by
R2
R1 + R2 ≤ 3/2
A Y
1

B
R1
1
Now we can ask: how do we achieve the corner points of the region, e.g. R1 = 1/2 and R2 = 1?
The answer gives insights into how to code for this channel. Take the greedy codebook B = {0, 1}n
(the entire space), then the channel A → Y is a DMC:
1
2 0
0 1
2
1
1
1 2
1 2
2

315
Which we recognize as a BEC(1/2) (no preference to either −1 or 1), which has capacity 1/2. How
do we decode? The idea is successive cancellation, where first we decode A, then remove  from Y ,
then decode B.
Yn Dec Ân
An BEC(1/2)

Bn B̂ n

Using this strategy, we can use a single user code for the BEC (an object we understand well) to
attain capacity.

30.5 Multiplier MAC


The Multiplier MAC is defined as
Y = AB
A ∈ {0, 1}, B ∈ {−1, 1}
Note that A = |Y | can always be resolved, and B can be resolved whenever A = 1. To find the
capacity region of this channel, we’ll use another interesting fact:
Interesting Fact 2: If A = g(Y ), then each Penta(PA , PB ) is a rectangle with
(
R1 ≤ H(A)
Penta(PA , PB ) =
R2 ≤ I(A, B; Y ) − H(A)

Proof. Using the assumption that A = g(Y ) and expanding the mutual information
I(A; Y |B) + I(B; Y |A) = H(A) − H(Y |A) − H(Y |A, B) = H(A, Y ) − H(Y |A, B)
= H(Y ) − H(Y |A, B) = I(A, B; Y )
Therefore the R1 + R2 constraint is not active, so our region is a rectangle.
By symmetry, we take PB = Ber(1/2). When PA = Ber(p), the output has H(Y ) = p + h(p).
Using the above fact, the capacity region for the Multiplier MAC is
(
[ R1 ≤ H(A) = h(p)
C = co
R2 ≤ H(Y ) − H(A) = p
We can view this as the graph of the binary entropy function on its side, parametrized by p:
R1
1

1/2

R2
1
To achieve the extreme point (1, 1/2) of this region, we can use the same scheme as for the Adder
MAC: take the codebook of A to be {0, 1}n , then B sees a BEC(1/2). Again, successive cancellation
decoding can be used.
For future reference we note:

316
Lemma 30.1. The full capacity region of multiplier MAC is achieved with zero error.
Proof. For a given codebook D of user B the number of messages that user A can send equals the
total number of erasure patters that codebook D can tolerate with vanishing probability of error.
Fix rate R2 < 1 and let D be a row-span of a random linear nR2 × n binary matrix. Then randomly
erase each column with probability 1 − R2 − . Since on average there will be n(R2 + ) columns
left, the resulting matrix is still full-rank and the decoding is possible. In other words,
P[D is decodable, # of erasures ≈ n(1 − R2 − )] → 1 .
Hence, by counting the total number of erasures, for a random linear code we have
E[# of decodable erasure patterns for D] ≈ 2nh(1−R2 −)+o(n) .
And result follows by selecting a random element of the D-ensemble and then taking the codebook
of user A to be the set of decodable erasure patterns for a selected D.

30.6 Contraction MAC


The Contraction MAC is defined as
B= + B= −
0 b

e1 0 b b
1
{0, 1, 2, 3} ∋ A Erasure Y b

1 b
1 b b
2
2 b b
2 2 b
b e2
{−, +} ∋ B 3 b b
3 3 b

Here, B is received perfectly, We can use the fact above to see that the capacity region is
(
R1 ≤ 32
C=
R2 ≤ 1
For future reference we note the following:
Lemma 30.2. The zero-error capacity of the contraction MAC satisfies
R1 ≤ h(1/3) + (2/3 − p) log 2 , (30.1)
R2 ≤ h(p) (30.2)
for some p ∈ [0, 1/2]. In particular, the point R1 = 23 , R2 = 1 is not achievable with zero error.
Proof. Let C and D denote the zero-error codebooks of two users. Then for each string bn ∈ {+, −}n
denote
Ubn = {an : aj ∈ {0, 1} if bj = +, aj ∈ {2, 3} if bj = −} .
Then clearly for each bn we have
n ,D)
|Ubn | ≤ 2d(b ,
where d(bn , D) denotes the minimum Hamming distance from string bn to the set D. Then,
X n
|C| ≤ 2d(b ,D) (30.3)
bn
n
X
= 2j |{bn : d(bn , D) = j}| (30.4)
j=0

317
For a given cardinality |D| the set that maximizes the above sum is the Hamming ball. Hence,
R2 = h(p) + O(1) implies

R2 ≤ max h(q) + (q − p) log 2 = h(1/3) + (2/3 − p) log 2 .


q∈[p,1]

30.7 Gaussian MAC


Perhaps the most important MAC is the Gaussian MAC. This is defined as

Y =A+B+Z
Z ∼ N (0, 1)
E[A2 ] ≤ P1 , E[B 2 ] ≤ P2

Evaluating the mutual information, we see that the capacity region is


1
I(A; Y |B) = I(A; A + Z) ≤ log(1 + P1 )
2
1
I(B; Y |A) = I(B; B + Z) ≤ ≤ (1 + P2 )
2
1
I(A, B; Y ) = h(Y ) − h(Z) ≤ log(1 + P1 + P2 )
2
1
R2 2 log(1 + P1 + P2 )
Z 1
2 log(1 + P2 )

A Y

B R1
1
2 log(1 + P1 )

Where the region is Penta(N (0, P1 ), N (0, P2 )). How do we achieve the rates in this region? We’ll
look at a few schemes.

1. TDMA: A and B switch off between transmitting at full rate and not transmitting at all. This
achieves any rate pair in the form
1 1
R1 = λ log(1 + P1 ), R2 = λ̄ log(1 + P2 )
2 2
Which is the dotted line on the plot above. Clearly, there are much better rates to be gained
by smarter schemes.

2. FDMA (OFDM): Dividing users into different frequency bands rather than time windows
gives an enormous advantage. Using frequency division, we can attain rates
   
1 P1 1 P2
R1 = λ log 1 + , R2 = λ̄ log 1 +
2 λ 2 λ̄

318
In fact, these rates touch the boundary of the capacity region at its intersection with the
R1 = R2 line. The optimal rate occurs when the power at each transmitter makes the noise
look white:
P1 P2 P1
= =⇒ λ∗ =
λ λ̄ P1 + P2
While this touches the capacity region at one point, it doesn’t quite reach the corner points.
Note, however, that practical systems (e.g. cellular networks) typically employ power control
that ensures received powers Pi of all users are roughly equal. In this case (i.e. when P1 = P2 )
the point where FDMA touches the capacity boundary is at a very desirable location of
symmetric rate R1 = R2 . This is one of the reasons why modern standards (e.g. LTE 4G) do
not employ any specialized MAC-codes and use OFDM together with good single-user codes.

3. Rate Splitting/Successive Cancellation: To reach the corner points, we can use successive
cancellation, similar to the decoding schemes in the Adder and Multiplier MACs. We can use
rates:
1
R2 = log(1 + P2 )
2  
1 1 P1
R1 = (log(1 + P1 + P2 ) − log(1 + P2 )) = log 1 +
2 2 1 + P2

The second expression suggests that A transmits at a rate for an AWGN channel that has
power constraint P1 and noise 1 + P2 , i.e. the power used by B looks like noise to A.

Z
Y
A Dec A Â

B Dec B B̂

Theorem 30.2. There exists a successive-cancellation code (i.e. (E1 , E2 , D1 , D2 )) that achieves
the corner points of the Gaussian MAC capacity region.

Proof. Random coding: B n ∼ N (0, P2 )n . Since An now sees noise 1 + P2 , there exists a code
for A with rate R1 = 12 log(1 + P1 /(1 + P2 )).

This scheme (unlike the above two) can tolerate frame un-synchronization between the two
transmitters. This is because any chunk of length n has distribution N (0, P2 )n . It has
generalizations to non-corner points and to arbitrary number of users. See [RU96] for details.

30.8 MAC Peculiarities


Now that we’ve seen some nice properties and examples of MACs, we’ll look at cases where MACs
differ from the point to point channels we’ve seen so far.

1. Max probability of error 6= average probability of error.

Theorem 30.3. C (max) 6= C

319
Proof. The key observation for deterministic MAC is that C (max) = C0 (zero error capacity)
when  ≤ 1/2. This is because when any two strings can be confused, the maximum probability
of error

max0 P[Ŵ1 6= m ∪ Ŵ2 6= m0 |W1 = m, W2 = m0 ]


m,m

Must be larger than 1/2.

For some of the channels we’ve seen

• Contraction MAC: C0 6= C
• Multiplier MAC: C0 = C
• Adder MAC: C0 6= C. For this channel, no one yet can show that C0,sum < 3/2. The
idea is combinatorial in nature: produce two sets (Sidon sets) such that all pairwise sums
between the two do not overlap.

2. Separation does not hold: In the point to point channel, through joint source channel coding
we saw that an optimal architecture is to do source coding then channel coding separately.
This doesn’t hold for the MAC. Take as a simple example the Adder MAC with a correlated
source and bandwidth expansion factor ρ = 1. Let the source (S, T ) have joint distribution
 
1/3 1/3
PST =
0 1/3

We encode S n to channel input An and T n to channel input B n . The simplest possible scheme
is to not encoder at all; simply take Sj = Aj and Tj = Bj . Take the decoder

Ŝ T̂
Yj = 0 =⇒ 0 0
Yj = 1 =⇒ 0 1
Yj = 2 =⇒ 1 1

Which gives P[Ŝ n = S n , T̂ n = T n ] = 1, since we are able to take advantage of the zero entry
in joint distribution of our correlated source.
Can we achieve this with a separated source? Amazingly, even though the above scheme is so
simple, we can’t! The compressors in the separated architecture operate in the Slepian-Wolf
region (see Theorem 9.7)


R1 ≥ H(S|T )
R2 ≥ H(T |S)


R1 + R2 ≥ H(S, T ) = log 3

Hence the sum rate for compression must be ≥ log 3, while the sum rate for the Adder MAC
must be ≤ 3/2, so these two regions do not overlap, hence we can not operate at a bandwidth
expansion factor of 1 for this source and channel.

320
R2
Slepian Wolf

1 log 3

1/2
3/2
Adder MAC
R1
1/2 1

3. Linear codes beat generic ones: Consider a BSC-MAC and suppose that two users A and
B have independet k-bit messages W1 , W2 ∈ Fk2 . Suppose the receiver is only interested in
estimating W1 + W2 . What is the largest ratio k/n? Clearly, separation can achieve
1
k/n ≈ (log 2 − h(δ))
2
by simply creating a scheme in which both W1 and W2 are estimated and then their sum is
computed.
A more clever solution is however to encode

An = G · W 1 ,
B n = G · W2 ,
Y n = An + B n + Z n = G(W1 + W2 ) + Z n .

where G is a generating matrix of a good k-to-n linear code. Then, provided that

k < n(log 2 − h(δ)) + o(n)

the sum W1 + W2 is decodable (see Theorem 18.2). Hence even for a simple BSC-MAC there
exist clever ways to exceed MAC capacity for certain scenarios. Note that this “distributed
computation” can also be viewed as lossy source coding with a distortion metric that is only
sensitive to discrepancy between W1 + W2 and Ŵ1 + Ŵ2 .

4. Dispersion is unknown: We have seen that for the point-to-point channel, not only we know
the capacity, but the next-order terms (see Theorem 22.2). For the MAC-channel only the
capacity is known. In fact, let us define

Rsum (n, ) , sup{R1 + R2 : (R1 , R2 ) ∈ R∗ (n, )} .

Now, take Adder-MAC as an example. A simple exercise in random-coding with PA = PB =


Ber(1/2) shows r
∗ 3 1 −1 log n
Rsum (n, ) ≥ log 2 − Q () log 2 + O( ).
2 4n n
In the converse direction the situation is rather sad. In fact the best bound we have is only
slightly better than the Fano’s inequality [Ahl82]. Namely for each  > 0 there is a constant
K > 0 such that
∗ 3 log n
Rsum (n, ) ≤ log 2 + K √ .
2 n

321
So it is not even known if sum-rate approaches sum-capacity from above or from below as
n → ∞! What is even more surprising, is that the dependence of the residual term on  is not
clear at all. In fact, despite the decades of attempts, even for  = 0 the best known bound to
date is just the Fano’s inequality(!)

∗ 3
Rsum (n, 0) ≤ .
2

322
§ 31. Random number generators

Let’s play the following game: Given a stream of Bern(p) bits, with unknown p, we want to turn
them into pure random bits, i.e., independent fair coin flips Bern(1/2). Our goal is to find a universal
way to extract the most number of bits.
In 1951 von Neumann [vN51] proposed the following scheme: Divide the stream into pairs of
bits, output 0 if 10, output 1 if 01, otherwise do nothing and move to the next pair. Since both 01
and 10 occur with probability pq (where q = 1 − p throughout this lecture), regardless of the value
of p, we obtain fair coin flips at the output. To measure the efficiency of von Neumann’s scheme,
note that, on average, we have 2n bits in and 2pqn bits out. So the efficiency (rate) is pq. The
question is: Can we do better?
Several variations:

1. Universal v.s. non-universal: know the source distribution or not.

2. Exact v.s. approximately fair coin flips: in terms of total variation or Kullback-Leibler
divergence

We only focus on the universal generation of exactly fair coins.

31.1 Setup
Recall from Lecture 8 that {0, 1}∗ = ∪k≥0 {0, 1}k = {∅, 0, 1, 00, 01, . . . } denotes the set of all finite-
length binary strings, where ∅ denotes the empty string. For any x ∈ {0, 1}∗ , let l(x) denote the
length of x.
Let’s first introduce the definition of random number generator formally. If the input vector is
X, denote the output (variable-length) vector by Y ∈ {0, 1}∗ . Then the desired property of Y is
the following: Conditioned on the length of Y being k, Y is uniformly distributed on {0, 1}k .
Definition 31.1 (Extractor). We say Ψ : {0, 1}∗ → {0, 1}∗ is an extractor if

1. Ψ(x) is a prefix of Ψ(y) if x is a prefix of y.


i.i.d.
2. For any n and any p ∈ (0, 1), if X n ∼ Bern(p), then Ψ(X n ) ∼ Bern(1/2)k conditioned on
l(Ψ(X n )) = k.

The rate of Ψ is
E[l(Ψ(X n ))] i.i.d.
rΨ (p) = lim sup , X n ∼ Bern(p).
n→∞ n

Note that the von Neumann scheme above defines a valid extractor ΨvN (with ΨvN (x2n+1 ) =
ΨvN (x2n )), whose rate is rvN (p) = pq. Clearly this is wasteful, because even if the input bits are
already fair, we only get 25% in return.

323
31.2 Converse
No extractor has a rate higher than the binary entropy function. The proof is simply data processing
inequality for entropy and the converse holds even if the extractor is allowed to be non-universal
(depending on p).
Theorem 31.1. For any extractor Ψ and any p ∈ (0, 1),
1 1
rΨ (p) ≥ h(p) = p log2 + q log2 .
p q
Proof. Let L = Ψ(X n ). Then
nh(p) = H(X n ) ≥ H(Ψ(X n )) = H(Ψ(X n )|L) + H(L) ≥ H(Ψ(X n )|L) = E [L] bits,
where the step follows from the assumption on Ψ that Ψ(X n ) is uniform over {0, 1}k conditioned
on L = k.
The rate of von Neumann extractor and the entropy bound are plotted below. Next we present
two extractors, due to Elias [Eli72] and Peres [Per92] respectively, that attain the binary entropy
function. (More precisely, both construct a sequence of extractors whose rate approaches the entropy
bound).
rate

1 bit

rvN

p
0 1 1
2

31.3 Elias’ construction of RNG from lossless compressors


Main idea The intuition behind Elias’ scheme is the following:
1. For iid X n , the probability of each string only depends on its type, i.e., the number of 1’s.
Therefore conditioned on the number of 1’s, X n is uniformly distributed (over the type class).
This observation holds universally for any p.
2. Given a uniformly distributed random variable on some finite set, we can easily turn it into
variable-length fair coin flips. For example:
• If U is uniform over {1, 2, 3}, we can map 1 7→ ∅, 2 7→ 0 and 3 7→ 1.
• If U is uniform over {1, 2, . . . , 11}, we can map 1 7→ ∅, 2 7→ 0, 3 7→ 1, and 4, . . . , 11 7→
3-bit strings.
Lemma 31.1. Given U uniformly distributed on [M ], there exists f : [M ] → {0, 1}∗ such that
conditioned on l(f (U )) = k, f (U ) is uniformly over {0, 1}k . Moreover,
log2 M − 4 ≤ E[l(f (U ))] ≤ log2 M bits.

324
Proof. We defined f by partitioning [M ] into subsets whose cardinalities are powers of two, and
assign elements P in each subset to binary strings of that length. Formally, denote the binary expansion
of M by M = ni=0 mi 2i , where the most significant bit mn = 1 and n = blog2 M c + 1. Those
non-zero mi ’s defines a partition [M ] = ∪tj=0 Mj , where |Mi | = 2ij . Map the elements of Mj to
{0, 1}ij . Finally, notice that uniform distribution conditioned on any subset is still uniform.
To prove the bound on the expected length, the upper bound follows from the same entropy
argument log2 M = H(U ) ≥ H(f (U )) ≥ H(f (U )|l(f (U ))) = E[l(f (U ))], and the lower bound
follows from
n n n
1 X 1 X 2n X i−n 2n+1
E[l(f (U ))] = i
mi 2 · i = n − i
mi 2 (n − i) ≥ n − 2 (n − i) ≥ n − ≥ n − 4,
M M M M
i=0 i=0 i=0

where the last step follows from n ≤ log2 M + 1.

Elias’ extractor Let w(xn ) define the Hamming weight (number of ones) of a binary string. Let
Tk = {xn ∈ {0, 1}n : w(xn ) = k} define the Hamming sphere of radius k. For each 0 ≤ k ≤ n, we
apply the function f from Lemma 31.1 to each Tk . This defines a mapping ΨE : {0, 1}∗ → {0, 1}∗
and then we extend it to ΨE : {0, 1}n → {0, 1}∗ by applying the mapping per n-bit block and
discard the last incomplete block. Then it is clear that the rate is given by n1 E[l(ΨE (X n ))]. By
Lemma 31.1, we have
   
n n
E log − 4 ≤ E[l(ΨE (X ))] ≤ E log
n
w(X n ) w(X n )
Using Stirling’s approximation (see, e.g., [Ash65, Lemma 4.7.1]), we have
 
2nh(p) n 2nh(p)
√ ≤ ≤√ (31.1)
8npq k 2πnpq
where p = 1 − q = k/n ∈ (0, 1). Since w(X n ) ∼ Bin(n, p), we have
E[l(ΨE (X n ))] = nh(p) + O(log n).
Therefore the extraction rate approaches the optimum h(p) as n → ∞.

31.4 Peres’ iterated von Neumann’s scheme


Main idea Recycle the bits thrown away in von Neumann’s scheme and iterate. What did von
Neumann’s extractor discard: (a) bits from equal pairs; (b) location of the distinct pairs. To achieve
the entropy bound, we need to extract the randomness out of these two parts as well.
First some notations: Given x2n , let k = l(ΨvN (x2n )) denote the number of consecutive distinct
bit-pairs.
• Let 1 ≤ m1 < . . . < mk ≤ n denote the locations such that x2mj 6= x2mj −1 .
• Let 1 ≤ i1 < . . . < in−k ≤ n denote the locations such that x2ij = x2ij −1 .
• yj = x2mj , vj = x2ij , uj = x2j ⊕ x2j+1 .
Here y k are the bits that von Neumann’s scheme outputs and both v n−k and un are discarded. Note
that un is important because it encodes the location of the y k and contains a lot of information.
Therefore von Neumann’s scheme can be improved if we can extract the randomness out of both
v n−k and un .

325
Peres’ extractor For each t ∈ N, recursively define an extractor Ψt as follows:
• Set Ψ1 to be von Neumann’s extractor ΨvN , i.e., Ψ1 (x2n+1 ) = Ψ1 (x2n ) = y k .

• Define Ψt by Ψt (x2n ) = Ψt (x2n+1 ) = (Ψ1 (x2n ), Ψt−1 (un ), Ψt−1 (v n−k )).
Example: Input x = 100111010011 of length 2n = 12. Output recursively:
y u v
z }| { z }| { z }| {
(011) (110100) (101)
(1)(010)(10)(0)
(1)(0)

Next we (a) verify Ψt is a valid extractor; (b) evaluate its efficiency (rate). Note that the bits
that enter into the iteration are no longer i.i.d. To compute the rate of Ψt , it is convenient to
introduce the notion of exchangeability. We say X n are exchangeable if the joint distribution is
invariant under permutation, that is, PX1 ,...,Xn = PXπ(1) ,...,Xπ(n) for any permutation π on [n]. In
particular, if Xi ’s are binary, then X n are exchangeable if and only if the joint distribution only
depends on the Hamming weight, i.e., PX n =xn = p(w(xn )). Examples: X n is iid Bern(p); X n is
uniform over the Hamming sphere Tk .
As an example, if X 2n are i.i.d. Bern(p), then conditioned on L = k, V n−k is iid Bern(p2 /(p2 +q 2 )),
since L ∼ Binom(n, 2pq) and

pk+2m q n−k−2m
P[Y k = y, U n = u, V n−k = v|L = k] = n

2 2 n−k (2pq)k
k (p + q )
 −1 
n p2 m  q 2 n−k−m
= 2−k · · 2
k p + q2 p2 + q 2
= P[Y k = y|L = k]P[U n = u|L = k]P[V n−k = v|L = k],

where m = w(v). In general, when X 2n are only exchangeable, we have the following:
Lemma 31.2 (Ψt preserves exchangebility). Let X 2n be exchangeable and L = Ψ1 (X 2n ). Then con-
i.i.d.
ditioned on L = k, Y k , U n and V n−k are independent and exchangeable. Furthermore, Y k ∼ Bern( 12 )
and U n is uniform over Tk .
Proof. If suffices to show that ∀y, y 0 ∈ {0, 1}k , u, u0 ∈ Tk and v, v 0 ∈ {0, 1}n−k such that w(v) = w(v 0 ),
we have

P[Y k = y, U n = u, V n−k = v|L = k] = P[Y k = y 0 , U n = u0 , V n−k = v 0 |L = k].

Note that the string X 2n and the triple (Y k , U n , V n−k ) are in one-to-one correspondence of each
other. Indeed, to reconstruct X 2n , simply read the k distinct pairs from Y and fill them according
to the locations of ones in U and fill the remaining equal pairs from V . Finally, note that u, y, v and
u0 , y 0 , v 0 correspond to two input strings x and x0 of identical Hamming weight (w(x) = k + 2w(v))
and hence of identical probability due to the exchangeability of X 2n . [Examples: (y, u, v) =
(01, 1100, 01) ⇒ x = (10010011), (y, u, v) = (11, 1010, 10) ⇒ x0 = (01110100).]
Computing the marginals, we conclude that both Y k and U n are uniform over their respective
support set.
i.i.d.
Lemma 31.3 (Ψt is an extractor). Let X 2n be exchangeable. Then Ψt (X 2n ) ∼ Bern(1/2) condi-
tioned on l(Ψt (X 2n )) = m.

326
Proof. Note that Ψt (X 2n ) ∈ {0, 1}∗ . It is equivalent to show that for all sm ∈ {0, 1}m ,

P[Ψt (X 2n ) = sm ] = 2−m P[l(Ψt (X 2n )) = m].

Proceed by induction on t. The base case of t = 1 follows from Lemma 31.2 (the distribution of the
Y part). Assume Ψt−1 is an extractor. Recall that Ψt (X 2n ) = (Ψ1 (X 2n ), Ψt−1 (U n ), Ψt−1 (V n−k ))
and write the length as L = L1 + L2 + L3 , where L2 ⊥ ⊥ L3 |L1 by Lemma 31.2. Then

P[Ψt (X 2n ) = sm ]
Xm
= P[Ψt (X 2n ) = sm |L1 = k]P[L1 = k]
k=0
Xm m−k
X
Lemma 31.2
= P[L1 = k]P[Y k = sk |L1 = k]P[Ψt−1 (U n ) = sk+r
k+1 |L1 = k]P[Ψt−1 (V
n−k
) = sm
k+r+1 |L1 = k]
k=0 r=0
Xm m−k
X
induction
= P[L1 = k]2−k 2−r P[L2 = r|L1 = k]2−(m−k−r) P[L3 = m − k − r|L1 = k]
k=0 r=0
= 2−m P[L = m].

i.i.d.
Next we compute the rate of Ψt . Let X 2n ∼ Bern(p). Then by SLLN, 2n 1
l(Ψ1 (X 2n )) , L2nn
1 a.s.
converges a.s. to pq. Assume, again by induction, that 2n l(Ψt−1 (X 2n ))−−→rt−1 (p), with r1 (p) = pq.
Then
1 Ln 1 1
l(Ψt (X 2n )) = + l(Ψt−1 (U n )) + l(Ψt−1 (V n−Ln )).
2n 2n 2n 2n
i.i.d. i.i.d. a.s.
Note that U n ∼ Bern(2pq), V n−Ln |Ln ∼ Bern(p2 /(p2 + q 2 )) and Ln −−→∞. Then the induction
a.s. a.s.
hypothesis implies that n1 l(Ψt−1 (U n ))−−→rt−1 (2pq) and 2(n−L 1
n)
l(Ψt−1 (V n−Ln ))−−→rt−1 (p2 /(p2 +
q 2 )). We obtain the recursion:
 
1 p2 + q 2 p2
rt (p) = pq + rt−1 (2pq) + rt−1 , (T rt−1 )(p), (31.2)
2 2 p2 + q 2

where the operator T maps a continuous function on [0, 1] to another. Furthermore, T is monotone
in the senes that f ≤ g pointwise then T f ≤ T g. Then it can be shown that rt converges
monotonically from below to the fixed-point of T , which turns out to be exactly the binary
entropy function h. Instead of directly verifying T h = h, next we give a simple proof: Consider
i.i.d.
X1 , X2 ∼ Bern(p). Then 2h(p) = H(X1 , X2 ) = H(X1 ⊕ X2 , X1 ) = H(X1 ⊕ X2 ) + H(X1 |X1 ⊕ X2 ) =
2
h(p2 + q 2 ) + 2pqh( 21 ) + (p2 + q 2 )h( p2p+q2 ).
The convergence of rt to h are shown in Fig. 31.1.

31.5 Bernoulli factory


Given a stream of Bern(p) bits with unknown p, for what kind of function f : [0, 1] → [0, 1] can we
simulate iid bits from Bern(f (p)). Our discussion above deals with f (p) ≡ 21 . The most famous
example is whether we can simulate Bern(2p) from Bern(p), i.e., f (p) = 2p ∧ 1. Keane and O’Brien
[KO94] showed that all f that can be simulated are either constants or “polynomially bounded away
from 0 or 1”: for all 0 < p < 1, min{f (p), 1 − f (p)} ≥ min{p, 1 − p}n for some n ∈ N. In particular,
doubling the bias is impossible.

327
1.0

0.8

0.6

0.4

0.2

0.2 0.4 0.6 0.8 1.0

Figure 31.1: Rate function rt for t = 1, 4, 10 versus the binary entropy function.

The above result deals with what f (p) can be simulated in principle. What type of computational
devices are needed for such as task? Note that since r1 (p) is quadratic in p, all rate functions rt
that arise from the iteration (31.2) are rational functions (ratios of polynomials), converging to
the binary entropy function as Fig. 31.1 shows. It turns out that for any rational function f that
satisfies 0 < f < 1 on (0, 1), we can generate independent Bern(f (p)) from Bern(p) using either of
the following schemes with finite memory [MP05]:

1. Finite-state machine (FSM): initial state (red), intermediate states (white) and final states
(blue, output 0 or 1 then reset to initial state).

2. Block simulation: let A0 , A1 be disjoint subsets of {0, 1}k . For each k-bit segment, output 0 if
falling in A0 or 1 if falling in A1 . If neither, discard and move to the next segment. The block
size is at most the degree of the denominator polynomial of f .

The next table gives examples of these two realizations:

328
Goal Block simulation FSM
1
1
0

f (p) = 1/2 A0 = 10; A1 = 01 0


1

1 0
0

0 0
1
f (p) = 2pq A0 = 00, 11; A1 = 01, 10 0 1
0
1 1

0 0
0
0
1 1
p3
f (p) = p3 +q 3
A0 = 000; A1 = 111

0 0
1
1
1 1
Exercise: How to generate f (p) = 1/3?
It turns out that the only type of f that can be simulated using either FSM or block simulation

is rational function. For f (p) = p, which satisfies Keane-O’Brien’s characterization, it cannot be
simulated by FSM or block simulation, but it can be simulated by pushdown automata (PDA),
which are FSM operating with a stack (infinite memory) [MP05].
What is the optimal Bernoulli factory with the best rate is unclear. Clearly, a converse is the
h(p)
entropy bound h(f (p)) , which can be trivial (bigger than one).

31.6 Related problems


31.6.1 Generate samples from a given distribution
The problem of how to turn pure bits into samples of a given distribution P is in a way the opposite
direction of what we have been considering so far. This can be done via Knuth-Yao’s tree algorithm:
Starting at the root, flip a fair coin for each edge and move down the tree until reaching a leaf node
and outputting the symbol. Let L denote the number of flips, which is a random variable. Then
H(P ) ≤ E[L] ≤ H(P ) + 2 bits.
Examples:

• To generate P = [1/2, 1/4, 1/4] on {a, b, c}, use the finite tree: E[L] = 1.5.

329
0 1
a
1 1

b c

• To generate P = [1/3, 2/3] on {a, b} (note that 2/3 = 0.1010 . . . , 1/3 = 0.0101 . . .), use the
infinite tree: E[L] = 2 (geometric distribution)

0 1
a
0 1
b
0 1

a ..
.

31.6.2 Approximate random number generator


The goal is to design f : X n → {0, 1}k s.t. f (X n ) is close to fair coin flips in distribution in certain
performance metric (TV or KL or min-entropy). One formulation is that D(Pf (X n ) kUniform) = o(k).
Intuitions: The connection to lossless data compression is as follows: A good compressor
squeezes out all the redundancy of the source. Therefore its output should be close to pure bits,
otherwise we can compress it furthermore. So good lossless compressors should act like good
approximate random number generators.

330
§ 32. Entropy method in combinatorics and geometry

A commonly used method in combinatorics for bounding the number of certain objects from above
involves a smart application of Shannon entropy. Usually the method proceeds as follows: in order
to count the cardinality of a given set C, we draw an element uniformly at random from C, whose
entropy is given by log |C|. To bound |C| from above, we describe this random object by a random
vector X = (X1 , . . . , Xn ), e.g., an indicator vector, then proceed to compute or upper-bound the
joint entropy H(X1 , . . . , Xn ) using methods we learned in Part I.
Notably, three methods of increasing precision are as follows:
• Marginal bound:

    H(X_1, …, X_n) ≤ ∑_{i=1}^n H(X_i)

• Pairwise bound (Shearer's lemma and its generalization, Theorem 1.4):

    H(X_1, …, X_n) ≤ (1/(n−1)) ∑_{i<j} H(X_i, X_j)

• Chain rule (exact calculation):

    H(X_1, …, X_n) = ∑_{i=1}^n H(X_i | X_1, …, X_{i−1})

Next, we give three applications using the above three methods, respectively, in increasing order of difficulty:
• Enumerating binary vectors of a given average weight
• Counting triangles and other subgraphs
• Brégman’s theorem
Finally, to demonstrate how the entropy method can also be used for questions in Euclidean spaces,
we prove the Loomis-Whitney and Bollobás-Thomason theorems based on analogous properties of
differential entropy.

32.1 Binary vectors of average weights


Lemma 32.1 (Massey [Mas74]). Let C ⊂ {0,1}^n and let p be the average fraction of 1's in C, i.e.,

    p = (1/|C|) ∑_{x∈C} w(x)/n,

where w(x) is the Hamming weight (number of 1's) of x ∈ {0,1}^n. Then |C| ≤ 2^{nh(p)}.

Remark 32.1. This result holds even if p > 1/2.

Proof. Let X = (X_1, …, X_n) be drawn uniformly at random from C. Then

    log |C| = H(X) = H(X_1, …, X_n) ≤ ∑_{i=1}^n H(X_i) = ∑_{i=1}^n h(p_i),

where p_i = P[X_i = 1] is the fraction of vectors in C whose i-th bit is 1. Note that

    p = (1/n) ∑_{i=1}^n p_i,

since we can either first average over vectors in C or first average across different bits. By Jensen's
inequality and the fact that x ↦ h(x) is concave,

    ∑_{i=1}^n h(p_i) ≤ n h((1/n) ∑_{i=1}^n p_i) = nh(p).

Hence we have shown that log |C| ≤ nh(p).

Theorem 32.1.

    ∑_{j=0}^k \binom{n}{j} ≤ 2^{nh(k/n)},    k ≤ n/2.

Proof. We take C = {x ∈ {0,1}^n : w(x) ≤ k} and invoke the previous lemma, which says that

    ∑_{j=0}^k \binom{n}{j} = |C| ≤ 2^{nh(p)} ≤ 2^{nh(k/n)},

where the last inequality follows from the fact that x ↦ h(x) is increasing for x ≤ 1/2.

Remark 32.2. Alternatively, we can prove the theorem using the large-deviation bound in Part III.
By the Chernoff bound on the binomial tail (see the example after Theorem 14.1),

    LHS/2^n = P(Bin(n, 1/2) ≤ k) ≤ 2^{−n d(k/n ‖ 1/2)} = 2^{−n(1−h(k/n))} = RHS/2^n.
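As a quick numerical sanity check of Theorem 32.1 (not part of the proof), one can compare the two sides directly:

from math import comb, log2

def h(x):
    # binary entropy in bits
    return 0.0 if x in (0, 1) else -x * log2(x) - (1 - x) * log2(1 - x)

n = 30
for k in range(n // 2 + 1):
    lhs = sum(comb(n, j) for j in range(k + 1))
    rhs = 2 ** (n * h(k / n))
    assert lhs <= rhs
print("Theorem 32.1 verified numerically for n =", n)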

32.2 Shearer’s lemma & counting subgraphs


Recall that a special case of Shearer's lemma Theorem 1.4 (or Han's inequality Theorem 1.3) says:

    H(X_1, X_2, X_3) ≤ (1/2)[H(X_1, X_2) + H(X_2, X_3) + H(X_1, X_3)].
A classical application of this result (see Remark 1.1) is to bound the cardinality of a set in R^3 given
the cardinalities of its projections.
For graphs H and G, define N(H, G) to be the number of copies of H in G.¹ (Two small pictorial
examples, each with N(H, G) = 4, appeared here in the original figure.)

¹To be precise, here N(H, G) is the injective homomorphism number, that is, the number of injective mappings
ϕ : V(H) → V(G) which preserve edges, i.e., (ϕ(v)ϕ(w)) ∈ E(G) whenever (vw) ∈ E(H). Note that if H is a clique,
every homomorphism is automatically injective.

If we know G has m edges, what is the maximal number of copies of H contained in G? To study
this quantity, let's define

    N(H, m) = max_{G : |E(G)| ≤ m} N(H, G).

We will show that the maximal number of triangles satisfies

    N(K_3, m) ≍ m^{3/2}.                                                              (32.1)

To show that N(K_3, m) ≳ m^{3/2}, consider G = K_n, which has m = |E(G)| = \binom{n}{2} ≍ n^2 and
N(K_3, K_n) = \binom{n}{3} ≍ n^3 ≍ m^{3/2}.
To show the upper bound, fix a graph G = (V, E) with m edges. Draw a labeled triangle
uniformly at random and denote its vertices by (X_1, X_2, X_3). Then by Shearer's lemma,

    log(3! N(K_3, G)) = H(X_1, X_2, X_3) ≤ (1/2)[H(X_1, X_2) + H(X_2, X_3) + H(X_1, X_3)] ≤ (3/2) log(2m).

Hence

    N(K_3, G) ≤ (√2/3) m^{3/2}.                                                       (32.2)
Remark 32.3. Interestingly, a linear algebra argument yields exactly the same upper bound as
(32.2): let A be the adjacency matrix of G with eigenvalues {λ_i}. Then

    2|E(G)| = tr(A^2) = ∑_i λ_i^2,
    6 N(K_3, G) = tr(A^3) = ∑_i λ_i^3.

Since ∑_i λ_i^3 ≤ ‖λ‖_3^3 and ‖λ‖_3 ≤ ‖λ‖_2 (monotonicity of ℓ_p norms), we get
(6N(K_3, G))^{1/3} ≤ (2|E(G)|)^{1/2}, which yields N(K_3, G) ≤ (√2/3) m^{3/2}.
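The bound (32.2) is also easy to check numerically. In the sketch below (our own code; N(K_3, G) is taken as the number of unlabeled triangles, as in the derivation of (32.2)), we count triangles in a random graph and compare against (√2/3) m^{3/2}:

import itertools, math, random

def count_triangles(edges):
    # Count unlabeled triangles in a simple graph given as a set of frozenset edges.
    vertices = sorted(set().union(*edges))
    return sum(1 for a, b, c in itertools.combinations(vertices, 3)
               if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= edges)

n, p = 40, 0.2
edges = {frozenset((i, j)) for i, j in itertools.combinations(range(n), 2)
         if random.random() < p}
m = len(edges)
print(count_triangles(edges), (math.sqrt(2) / 3) * m ** 1.5)   # first number <= second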
Using Shearer's Theorem 1.4, Friedgut and Kahn [FK98] obtained the counterpart of (32.1) for
arbitrary H, which was first proved by Alon [Alo81]. We first introduce the fractional covering
number of a graph. For a graph H = (V, E), define the fractional covering number as the value of
the following linear program:²

    ρ*(H) = min_w { ∑_{e∈E} w(e) : ∑_{e∈E : v∈e} w(e) ≥ 1 ∀v ∈ V, w(e) ∈ [0, 1] }      (32.3)

²If the "∈ [0, 1]" constraints in (32.3) and (32.5) are replaced by "∈ {0, 1}", we obtain the covering number ρ(H)
and the independence number α(H) of H, respectively.

Theorem 32.2.

    c_0(H) m^{ρ*(H)} ≤ N(H, m) ≤ c_1(H) m^{ρ*(H)}.                                    (32.4)

For example, for triangles we have ρ*(K_3) = 3/2, and Theorem 32.2 is consistent with (32.1).
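Since ρ*(H) is just a small linear program, it is easy to compute numerically before turning to the proof. The sketch below assumes scipy is available; the function name and the encoding of the LP (32.3) are ours.

import itertools
import numpy as np
from scipy.optimize import linprog

def fractional_cover_number(vertices, edges):
    # Solve the LP (32.3): minimize sum_e w(e) subject to, for every vertex v,
    # the total weight of edges containing v being at least 1, with 0 <= w(e) <= 1.
    A = np.zeros((len(vertices), len(edges)))
    for j, (u, w) in enumerate(edges):
        A[vertices.index(u), j] = 1
        A[vertices.index(w), j] = 1
    res = linprog(c=np.ones(len(edges)),
                  A_ub=-A, b_ub=-np.ones(len(vertices)),
                  bounds=[(0, 1)] * len(edges))
    return res.fun

# rho*(K3) = 3/2 (as noted above) and rho*(K4) = 2
print(fractional_cover_number([0, 1, 2], list(itertools.combinations(range(3), 2))))
print(fractional_cover_number([0, 1, 2, 3], list(itertools.combinations(range(4), 2))))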
Proof. Upper bound: Let |V(H)| = n and let w*(e) be the optimal solution for ρ*(H). For any G with m
edges, draw a subgraph of G uniformly at random from all those that are isomorphic to H, and
label its vertices by X = (X_1, …, X_n). Now define a random 2-subset S of [n] by sampling an
edge e from E(H) with probability w*(e)/ρ*(H). By the definition of ρ*(H), for any i ∈ [n] we have
P[i ∈ S] ≥ 1/ρ*(H). We are now ready to apply Shearer's Theorem 1.4:

    (1/ρ*(H)) log(N(H, G) n!) = (1/ρ*(H)) H(X) ≤ H(X_S | S) ≤ log(2m),

where the last bound is as before: if S = {v, w} then X_S = (X_v, X_w) takes one of at most 2m values. Overall,
we get N(H, G) ≤ (1/n!) (2m)^{ρ*(H)}.
Lower bound: It amounts to constructing a graph G with m edges for which N(H, G) ≥
c(H) |E(G)|^{ρ*(H)}. Consider the dual LP of (32.3),

    α*(H) = max_ψ { ∑_{v∈V(H)} ψ(v) : ψ(v) + ψ(w) ≤ 1 ∀(vw) ∈ E, ψ(v) ∈ [0, 1] },     (32.5)

i.e., the fractional packing number. By the duality theorem of LP, we have α*(H) = ρ*(H). The
graph G is constructed as follows: for each vertex v of H, replicate it m(v) times; for each edge
e = (vw) of H, replace it by a complete bipartite graph K_{m(v),m(w)}. Then the total number of edges
of G is

    |E(G)| = ∑_{(vw)∈E(H)} m(v) m(w).

Furthermore, N(H, G) ≥ ∏_{v∈V(H)} m(v). To make the exponent log N(H, G) / log |E(G)| as large as
possible, fix a large number M and let m(v) = ⌈M^{ψ(v)}⌉, where ψ is the maximizer in (32.5). Then

    |E(G)| ≤ ∑_{(vw)∈E(H)} 4 M^{ψ(v)+ψ(w)} ≤ 4M |E(H)|,
    N(H, G) ≥ ∏_{v∈V(H)} M^{ψ(v)} = M^{α*(H)},

and we are done.
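The blow-up construction used in the lower bound is easy to carry out explicitly. Below is a small Python sketch (helper names ours) for H = K_3 with ψ ≡ 1/2, i.e. m(v) = ⌈M^{1/2}⌉, illustrating that the number of triangles grows like m^{3/2}:

import itertools, math

def blow_up(H_edges, mult):
    # Replace each vertex v of H by mult[v] copies and each edge (v, w) by a
    # complete bipartite graph between the copies of v and the copies of w.
    G_edges = set()
    for v, w in H_edges:
        for a, b in itertools.product(range(mult[v]), range(mult[w])):
            G_edges.add(frozenset(((v, a), (w, b))))
    return G_edges

M = 10_000
mult = {v: math.ceil(M ** 0.5) for v in range(3)}
G = blow_up([(0, 1), (1, 2), (0, 2)], mult)
m = len(G)
copies = mult[0] * mult[1] * mult[2]     # at least this many triangles in the blow-up
print(m, copies, copies / m ** 1.5)      # the last ratio stays bounded away from 0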

32.3 Brégman’s Theorem


Next, we present Brégman’s Theorem [Bre73] and an elegant proof given by Radhakrishnan [Rad97].
Definition 32.1. A perfect matching is a 1-regular spanning subgraph.
The permanent of an n × n matrix A is defined as

    perm(A) = ∑_{π∈S_n} ∏_{i=1}^n a_{iπ(i)},

where S_n denotes the group of all permutations of n letters. For a bipartite graph G with n vertices
on the left and right respectively, the number of perfect matchings in G is given by perm(A), where
A is the adjacency matrix.
Example: (two small 0/1 matrices were displayed here pictorially, with permanents 1 and 2, respectively).

 

Theorem 32.3 (Brégman's Theorem). For any bipartite graph with n vertices on each side and adjacency matrix A,

    perm(A) ≤ ∏_{i=1}^n (d_i!)^{1/d_i},

where d_i is the degree of left vertex i (i.e., the sum of the i-th row of A).

Example: Consider G = K_{n,n}. Then perm(A) = n!, and the RHS is [(n!)^{1/n}]^n = n! as well. More
generally, if G consists of n/d disjoint copies of K_{d,d}, then Brégman's bound is tight and perm(A) = (d!)^{n/d}.
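A brute-force numerical check of Brégman's bound on a tiny example (our own code; only feasible for small n):

import itertools, math

def permanent(A):
    # Brute force over all permutations -- exponential, but fine for small n.
    n = len(A)
    return sum(math.prod(A[i][pi[i]] for i in range(n))
               for pi in itertools.permutations(range(n)))

def bregman_bound(A):
    return math.prod(math.factorial(sum(row)) ** (1 / sum(row))
                     for row in A if sum(row) > 0)

# Two disjoint copies of K_{2,2}: perm = (2!)^2 = 4 and the bound is tight.
A = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
print(permanent(A), bregman_bound(A))    # 4 4.0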
First attempt: We select a perfect matching uniformly at random, which matches the i-th left
vertex to the X_i-th right one. Let X = (X_1, …, X_n). Then

    log perm(A) = H(X) = H(X_1, …, X_n) ≤ ∑_{i=1}^n H(X_i) ≤ ∑_{i=1}^n log d_i.

Hence perm(A) ≤ ∏_i d_i. This is much worse than Brégman's bound: by Stirling's formula,
(d_i!)^{1/d_i} ∼ d_i/e, so

    ∏_{i=1}^n (d_i!)^{1/d_i} ∼ e^{−n} ∏_{i=1}^n d_i,

i.e., Brégman's bound improves on ∏_i d_i by a factor exponential in the number of vertices.
Second attempt: The hope is to use the chain rule to expand the joint entropy and bound the
conditional entropies more carefully. Let's write

    H(X_1, …, X_n) = ∑_{i=1}^n H(X_i | X_1, …, X_{i−1}) ≤ ∑_{i=1}^n E[log N_i],

where N_i, as a random variable, denotes the number of possible values X_i can take conditioned
on X_1, …, X_{i−1}, i.e., how many possible matchings there are for left vertex i given where vertices
1, …, i−1 are matched. However, it is hard to proceed from this point, as we only know the
degree information, not the graph itself. In fact, since we do not know the relative positions of the
vertices, why should we order them like this? The key idea is to label the vertices randomly, apply the
chain rule in this random order, and average.
To this end, pick π uniformly at random from S_n, independent of X. Then

    log perm(A) = H(X) = H(X | π)
                = H(X_{π(1)}, …, X_{π(n)} | π)
                = ∑_{k=1}^n H(X_{π(k)} | X_{π(1)}, …, X_{π(k−1)}, π)
                = ∑_{k=1}^n H(X_k | {X_j : π^{−1}(j) < π^{−1}(k)}, π)
                ≤ ∑_{k=1}^n E[log N_k],

where N_k denotes the number of possible matchings for left vertex k given where the vertices
{j : π^{−1}(j) < π^{−1}(k)} are matched, and the expectation is with respect to (X, π). The key observation is:
Lemma 32.2. Nk is uniformly distributed on [dk ].
Example: Consider the bipartite graph G shown in the original figure (left vertices 1, 2, 3 and right
vertices 1, 2, 3), and take k = 1, so that d_k = 2. Depending on the random ordering: if π = 1∗∗, then
N_k = 2 (w.p. 1/3); if π = ∗∗1, then N_k = 1 (w.p. 1/3); if π = 213, then N_k = 2 (w.p. 1/6); if π = 312,
then N_k = 1 (w.p. 1/6). Combining everything, N_k is indeed equally likely to be 1 or 2.

Thus,

    E_{(X,π)}[log N_k] = (1/d_k) ∑_{i=1}^{d_k} log i = log (d_k!)^{1/d_k},

and hence

    log perm(A) ≤ ∑_{k=1}^n log (d_k!)^{1/d_k} = log ∏_{i=1}^n (d_i!)^{1/d_i}.
Finally, we prove Lemma 32.2:
Proof. Note that X_i = σ(i) for some random permutation σ. Let T = ∂(k) be the set of neighbors of k.
Then

    N_k = |T \ {σ(j) : π^{−1}(j) < π^{−1}(k)}|,

which is a function of (σ, π). In fact, conditioned on any realization of σ, N_k is uniform over [d_k].
To see this, note that σ^{−1}(T) is a fixed subset of [n] of cardinality d_k containing k, while
S ≜ {j : π^{−1}(j) < π^{−1}(k)} is the random set of left vertices preceding k in the uniform ordering π. Then

    N_k = |σ^{−1}(T) \ S| = 1 + |(σ^{−1}(T) \ {k}) \ S|.

Since the position of k among the d_k elements of σ^{−1}(T) under the random ordering π is uniform,
|(σ^{−1}(T) \ {k}) \ S| is uniform on {0, …, d_k − 1}, and hence N_k is uniform over [d_k].
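Lemma 32.2 is also easy to confirm by Monte Carlo on a small example (the graph below is our own choice, not the one in the figure above):

import itertools, random
from collections import Counter

# Bipartite graph with adjacency matrix A (rows = left vertices, columns = right).
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
n = len(A)
matchings = [sigma for sigma in itertools.permutations(range(n))
             if all(A[i][sigma[i]] for i in range(n))]
k = 0
T = {j for j in range(n) if A[k][j]}                 # neighbors of left vertex k, d_k = 2
counts = Counter()
for _ in range(100_000):
    sigma = random.choice(matchings)                 # X: uniform perfect matching
    pi = list(range(n)); random.shuffle(pi)          # uniform ordering of the left vertices
    before = {j for j in range(n) if pi.index(j) < pi.index(k)}
    counts[len(T - {sigma[j] for j in before})] += 1
print(counts)    # N_k should be roughly equally often 1 and 2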

32.4 Euclidean geometry: Bollobás-Thomason and Loomis-Whitney

The following famous result shows that n-dimensional rectangles simultaneously minimize volumes
of all coordinate projections:3
Theorem 32.4 (Bollobás-Thomason Box Theorem). Let K ⊂ Rn be a compact set. For S ⊂ [n],
denote by KS ⊂ RS the projection of K onto those coordinates indexed by S. Then there exists a
rectangle A s.t. Leb{A} = Leb{K} and for all S ⊂ [n]:
Leb{AS } ≤ Leb{KS }
Proof. Let X^n be uniformly distributed on K. Then h(X^n) = log Leb{K}. Let A be a rectangle of
size a_1 × ⋯ × a_n, where

    log a_i = h(X_i | X^{i−1}).

Note that Leb{A} = ∏_i a_i = exp{∑_i h(X_i | X^{i−1})} = exp{h(X^n)} = Leb{K} by the chain rule.
Next, we have by 1. in Theorem 1.6

    h(X_S) ≤ log Leb{K_S}.

On the other hand, by the chain rule and since conditioning reduces (differential) entropy,

    h(X_S) = ∑_{i=1}^n 1{i ∈ S} h(X_i | X_{[i−1]∩S})
           ≥ ∑_{i∈S} h(X_i | X^{i−1})
           = log ∏_{i∈S} a_i
           = log Leb{A_S}.

Combining the two displays gives Leb{A_S} ≤ Leb{K_S}.
³Note that since K is compact, its projections and slices are all compact and hence measurable.

Corollary 32.1 (Loomis-Whitney). Let K be a compact subset of R^n and let K_{j^c} denote the
projection of K onto the coordinates in [n] \ {j}. Then

    Leb{K} ≤ ∏_{j=1}^n Leb{K_{j^c}}^{1/(n−1)}.                                        (32.6)

Proof. Apply the previous theorem to construct a rectangle A and note that

    Leb{K} = Leb{A} = ∏_{j=1}^n Leb{A_{j^c}}^{1/(n−1)}.

By the previous theorem, Leb{A_{j^c}} ≤ Leb{K_{j^c}}.

The meaning of the Loomis-Whitney inequality is best understood by introducing the average width
of K in direction j: w_j ≜ Leb{K}/Leb{K_{j^c}}. Then (32.6) is equivalent to

    Leb{K} ≥ ∏_{j=1}^n w_j,

i.e., the volume of K is at least the volume of the rectangle of average widths.
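The same entropy argument, with H in place of h (cf. Remark 1.1), gives a discrete Loomis-Whitney inequality for finite subsets of Z^n, which is easy to check numerically (our own sketch):

import random

random.seed(0)
# Random set of voxels K in Z^3
K = {tuple(random.randrange(6) for _ in range(3)) for _ in range(80)}
# Projections dropping coordinate j, for j = 0, 1, 2
proj_sizes = [len({tuple(x[i] for i in range(3) if i != j) for x in K}) for j in range(3)]
bound = 1.0
for s in proj_sizes:
    bound *= s ** (1 / (3 - 1))     # exponent 1/(n-1) with n = 3
print(len(K), bound)                # |K| <= bound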

Bibliography

[AFTS01] I.C. Abou-Faycal, M.D. Trott, and S. Shamai. The capacity of discrete-time memoryless
Rayleigh-fading channels. IEEE Transactions on Information Theory, 47(4):1290–1301, 2001.

[Ahl82] Rudolf Ahlswede. An elementary proof of the strong converse theorem for the multiple-
access channel. J. Combinatorics, Information and System Sciences, 7(3), 1982.

[Alo81] Noga Alon. On the number of subgraphs of prescribed type of graphs with a given
number of edges. Israel J. Math., 38(1-2):116–130, 1981.

[AN07] Shun-ichi Amari and Hiroshi Nagaoka. Methods of information geometry, volume 191.
American Mathematical Soc., 2007.

[AS08] Noga Alon and Joel H. Spencer. The Probabilistic Method. John Wiley & Sons, 3rd
edition, 2008.

[Ash65] Robert B. Ash. Information Theory. Dover Publications Inc., New York, NY, 1965.

[BF14] Ahmad Beirami and Faramarz Fekri. Fundamental limits of universal lossless one-to-one
compression of parametric sources. In Information Theory Workshop (ITW), 2014 IEEE,
pages 212–216. IEEE, 2014.

[Bla74] R. E. Blahut. Hypothesis testing and information theory. IEEE Trans. Inf. Theory,
20(4):405–417, 1974.

[BNO03] Dimitri P Bertsekas, Angelia Nedić, and Asuman E. Ozdaglar. Convex analysis and
optimization. Athena Scientific, Belmont, MA, USA, 2003.

[Boh38] H. F. Bohnenblust. Convex regions and projections in Minkowski spaces. Ann. Math.,
39(2):301–308, 1938.

[Bre73] Lev M Bregman. Some properties of nonnegative matrices and their permanents. Soviet
Math. Dokl., 14(4):945–949, 1973.

[Bro86] L. D. Brown. Fundamentals of statistical exponential families with applications in


statistical decision theory. In S. S. Gupta, editor, Lecture Notes-Monograph Series,
volume 9. Institute of Mathematical Statistics, Hayward, CA, 1986.

[CB90] B. S. Clarke and A. R. Barron. Information-theoretic asymptotics of Bayes methods.


IEEE Trans. Inf. Theory, 36(3):453–471, 1990.

[CB94] Bertrand S Clarke and Andrew R Barron. Jeffreys' prior is asymptotically least favorable
under entropy risk. Journal of Statistical Planning and Inference, 41(1):37–60, 1994.

[Ç11] Erhan Çinlar. Probability and Stochastics. Springer, New York, 2011.

[Cho56] Noam Chomsky. Three models for the description of language. IRE Trans. Inform. Th.,
2(3):113–124, 1956.

[CK81a] I. Csiszár and J. Körner. Graph decomposition: a new key to coding theorems. IEEE
Trans. Inf. Theory, 27(1):5–12, 1981.

[CK81b] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless
Systems. Academic, New York, 1981.

[CS83] J. Conway and N. Sloane. A fast encoding method for lattice codes and quantizers. IEEE
Transactions on Information Theory, 29(6):820–824, Nov 1983.

[Csi67] I. Csiszár. Information-type measures of difference of probability distributions and


indirect observation. Studia Sci. Math. Hungar., 2:229–318, 1967.

[CT06] Thomas M. Cover and Joy A. Thomas. Elements of information theory, 2nd Ed. Wiley-
Interscience, New York, NY, USA, 2006.

[Doo53] Joseph L. Doob. Stochastic Processes. New York Wiley, 1953.

[Eli55] Peter Elias. Coding for noisy channels. IRE Convention Record, 3:37–46, 1955.

[Eli72] P. Elias. The efficient construction of an unbiased random sequence. Annals of Mathe-
matical Statistics, 43(3):865–870, 1972.

[ELZ05] Uri Erez, Simon Litsyn, and Ram Zamir. Lattices which are good for (almost) everything.
IEEE Transactions on Information Theory, 51(10):3401–3416, Oct. 2005.

[EZ04] U. Erez and R. Zamir. Achieving 1/2 log(1 + SNR) on the AWGN channel with lattice
encoding and decoding. IEEE Trans. Inf. Theory, IT-50:2293–2314, Oct. 2004.

[FHT03] A. A. Fedotov, P. Harremoës, and F. Topsøe. Refinements of Pinsker’s inequality.


Information Theory, IEEE Transactions on, 49(6):1491–1498, Jun. 2003.

[FJ89] G.D. Forney Jr. Multidimensional constellations. II. Voronoi constellations. IEEE
Journal on Selected Areas in Communications, 7(6):941–958, Aug 1989.

[FK98] Ehud Friedgut and Jeff Kahn. On the number of copies of one hypergraph in another.
Israel J. Math., 105:251–256, 1998.

[FMG92] Meir Feder, Neri Merhav, and Michael Gutman. Universal prediction of individual
sequences. IEEE Trans. Inf. Theory, 38(4):1258–1270, 1992.

[Gil10] Gustavo L Gilardoni. On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences.
Information Theory, IEEE Transactions on, 56(11):5377–5386, 2010.

[GKY56] I. M. Gel’fand, A. N. Kolmogorov, and A. M. Yaglom. On the general definition of the


amount of information. Dokl. Akad. Nauk. SSSR, 11:745–748, 1956.

[GL95] Richard D Gill and Boris Y Levit. Applications of the van Trees inequality: a Bayesian
Cramér-Rao bound. Bernoulli, pages 59–79, 1995.

[Har] Sergiu Hart. Overweight puzzle. http://www.ma.huji.ac.il/~hart/puzzle/overweight.html.

[Hoe65] Wassily Hoeffding. Asymptotically optimal tests for multinomial distributions. The
Annals of Mathematical Statistics, pages 369–401, 1965.

[HV11] P. Harremoës and I. Vajda. On pairs of f -divergences and their joint range. IEEE Trans.
Inf. Theory, 57(6):3230–3235, Jun. 2011.

[KO94] M.S. Keane and G.L. O’Brien. A Bernoulli factory. ACM Transactions on Modeling and
Computer Simulation, 4(2):213–219, 1994.

[Kos63] VN Koshelev. Quantization with minimal entropy. Probl. Pered. Inform, 14:151–156,
1963.

[KS14] Oliver Kosut and Lalitha Sankar. Asymptotics and non-asymptotics for universal fixed-
to-variable source coding. arXiv preprint arXiv:1412.4444, 2014.

[KV14] Ioannis Kontoyiannis and Sergio Verdú. Optimal lossless data compression: Non-
asymptotics and asymptotics. IEEE Trans. Inf. Theory, 60(2):777–795, 2014.

[LC86] Lucien Le Cam. Asymptotic methods in statistical decision theory. Springer-Verlag, New
York, NY, 1986.

[LM03] Amos Lapidoth and Stefan M Moser. Capacity bounds via duality with applications to
multiple-antenna systems on flat-fading channels. IEEE Transactions on Information
Theory, 49(10):2426–2467, 2003.

[Loe97] Hans-Andrea Loeliger. Averaging bounds for lattices and linear codes. IEEE Transactions
on Information Theory, 43(6):1767–1773, Nov. 1997.

[Mas74] James Massey. On the fractional weight of distinct binary n-tuples (corresp.). IEEE
Transactions on Information Theory, 20(1):131–131, 1974.

[MF98] Neri Merhav and Meir Feder. Universal prediction. IEEE Trans. Inf. Theory, 44(6):2124–
2147, 1998.

[MP05] Elchanan Mossel and Yuval Peres. New coins from old: computing with unknown bias.
Combinatorica, 25(6):707–724, 2005.

[MT10] Mokshay Madiman and Prasad Tetali. Information inequalities for joint distributions,
with interpretations and applications. IEEE Trans. Inf. Theory, 56(6):2699–2713, 2010.

[OE15] O. Ordentlich and U. Erez. A simple proof for the existence of “good” pairs of nested
lattices. IEEE Transactions on Information Theory, Submitted Aug. 2015.

[OPS48] BM Oliver, JR Pierce, and CE Shannon. The philosophy of PCM. Proceedings of the IRE,
36(11):1324–1331, 1948.

[Per92] Yuval Peres. Iterating von Neumann’s procedure for extracting random bits. Annals of
Statistics, 20(1):590–597, 1992.

[PPV10a] Y. Polyanskiy, H. V. Poor, and S. Verdú. Channel coding rate in the finite blocklength
regime. IEEE Trans. Inf. Theory, 56(5):2307–2359, May 2010.

[PPV10b] Y. Polyanskiy, H. V. Poor, and S. Verdú. Feedback in the non-asymptotic regime. IEEE
Trans. Inf. Theory, April 2010. submitted for publication.

[PPV11] Y. Polyanskiy, H. V. Poor, and S. Verdú. Minimum energy to send k bits with and
without feedback. IEEE Trans. Inf. Theory, 57(8):4880–4902, August 2011.

[PW12] E. Price and D. P. Woodruff. Applications of the Shannon-Hartley theorem to data
streams and sparse recovery. In Proceedings of the 2012 IEEE International Symposium
on Information Theory, pages 1821–1825, Boston, MA, Jul. 2012.

[PW14] Y. Polyanskiy and Y. Wu. Peak-to-average power ratio of good codes for Gaussian
channel. IEEE Trans. Inf. Theory, 60(12):7655–7660, December 2014.

[PW17] Yury Polyanskiy and Yihong Wu. Strong data-processing inequalities for channels and
Bayesian networks. In Eric Carlen, Mokshay Madiman, and Elisabeth M. Werner, editors,
Convexity and Concentration. The IMA Volumes in Mathematics and its Applications,
vol 161, pages 211–249. Springer, New York, NY, 2017.

[Rad97] Jaikumar Radhakrishnan. An entropy proof of Bregman’s theorem. J. Combin. Theory


Ser. A, 77(1):161–164, 1997.

[Ree65] Alec H Reeves. The past present and future of PCM. IEEE Spectrum, 2(5):58–62, 1965.

[RSU01] Thomas J. Richardson, Mohammad Amin Shokrollahi, and Rüdiger L. Urbanke. Design
of capacity-approaching irregular low-density parity-check codes. IEEE Transactions on
Information Theory, 47(2):619–637, 2001.

[RU96] Bixio Rimoldi and Rüdiger Urbanke. A rate-splitting approach to the Gaussian multiple-
access channel. Information Theory, IEEE Transactions on, 42(2):364–375, 1996.

[RZ86] B. Ryabko and Zh. Reznikova. Analysis of the language of ants by information-theoretical
methods. Problemy Peredachi Informatsii, 22(3):103–108, 1986. English translation:
http://reznikova.net/R-R-entropy-09.pdf.

[SF11] Ofer Shayevitz and Meir Feder. Optimal feedback communication via posterior matching.
IEEE Trans. Inf. Theory, 57(3):1186–1222, 2011.

[Sha48] C. E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27:379–423
and 623–656, July/October 1948.

[Sio58] Maurice Sion. On general minimax theorems. Pacific J. Math, 8(1):171–176, 1958.

[Smi71] J. G. Smith. The information capacity of amplitude and variance-constrained scalar


Gaussian channels. Information and Control, 18:203 – 219, 1971.

[Spe15] Spectre. SPECTRE: Short packet communication toolbox. https://github.com/yp-mit/spectre,
2015. GitHub repository.

[Spi96] Daniel A. Spielman. Linear-time encodable and decodable error-correcting codes. IEEE
Transactions on Information Theory, 42(6):1723–1731, 1996.

[Spi97] Daniel A. Spielman. The complexity of error-correcting codes. In Fundamentals of


Computation Theory, pages 67–84. Springer, 1997.

[SV11] Wojciech Szpankowski and Sergio Verdú. Minimum expected length of fixed-to-variable
lossless compression without prefix constraints. IEEE Trans. Inf. Theory, 57(7):4017–4025,
2011.

[TE97] Giorgio Taricco and Michele Elia. Capacity of fading channel with no side information.
Electronics Letters, 33(16):1368–1370, 1997.

[Top00] F. Topsøe. Some inequalities for information divergence and related measures of discrim-
ination. IEEE Transactions on Information Theory, 46(4):1602–1609, 2000.

[Tsy09] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer Verlag, New York,


NY, 2009.

[TV05] David Tse and Pramod Viswanath. Fundamentals of wireless communication. Cambridge
University Press, 2005.

[UR98] R. Urbanke and B. Rimoldi. Lattice codes can achieve capacity on the AWGN channel.
IEEE Transactions on Information Theory, 44(1):273–278, 1998.

[Ver07] S. Verdú. EE528–Information Theory, Lecture Notes. Princeton Univ., Princeton, NJ,
2007.

[vN51] J. von Neumann. Various techniques used in connection with random digits. Monte
Carlo Method, National Bureau of Standards, Applied Math Series, (12):36–38, 1951.

[Yek04] Sergey Yekhanin. Improved upper bound for the redundancy of fix-free codes. IEEE
Trans. Inf. Theory, 50(11):2815–2818, 2004.

[Yos03] Nobuyuki Yoshigahara. Puzzles 101: A Puzzlemaster’s Challenge. A K Peters, Natick,


MA, USA, 2003.

[Yu97] Bin Yu. Assouad, Fano, and Le Cam. Festschrift for Lucien Le Cam: Research Papers
in Probability and Statistics, pages 423–435, 1997.

[Zam14] Ram Zamir. Lattice Coding for Signals and Networks. Cambridge University Press,
Cambridge, 2014.

[ZY97] Zhen Zhang and Raymond W Yeung. A non-Shannon-type conditional inequality of


information quantities. IEEE Trans. Inf. Theory, 43(6):1982–1986, 1997.

[ZY98] Zhen Zhang and Raymond W Yeung. On characterization of entropy function via
information inequalities. IEEE Trans. Inf. Theory, 44(4):1440–1452, 1998.
