Information Theory Notes
Topics: Jensen’s inequality, entropy & perplexity; proving Gibbs’ inequality; Huffman code worst case.
Idea: use Jensen’s inequality

Set u(x) = 1/p(x), with p(u(x)) = p(x).
E[u] = E[1/p(x)] = |A|   (Tutorial 1 question)
H(X) = E[log u(x)] ≤ log E[u]
⇒ H(X) ≤ log |A|
Equality (maximum entropy) for constant u, i.e. uniform p(x).

Proving Gibbs’ inequality

For the idea to work, the proof must look like this:
    D_KL(p || q) = Σ_i p_i log (p_i / q_i) = E[f(u)] ≥ f(E[u])
Define u_i = q_i / p_i, with p(u_i) = p_i, giving E[u] = Σ_i p_i (q_i / p_i) = 1.
Identify f(x) ≡ log 1/x = −log x, a convex function.
Substituting gives: D_KL(p || q) ≥ 0.
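As a quick numeric sanity check (an added illustration, not part of the original notes; the two distributions below are made up), a few lines of Python confirm both consequences of Jensen’s inequality above: H(X) ≤ log |A| and D_KL(p || q) ≥ 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up distributions p and q over an alphabet of size |A| = 5.
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

H = -(p * np.log2(p)).sum()          # entropy H(X) in bits
dkl = (p * np.log2(p / q)).sum()     # relative entropy D_KL(p || q)

print(f"H(X) = {H:.3f} bits  <=  log2 |A| = {np.log2(5):.3f} bits")
print(f"D_KL(p || q) = {dkl:.3f}  >=  0")
```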
Entropy & Perplexity

2^{H(X)} = “Perplexity” = “Effective number of choices”
H(X) is log(effective number of choices).
Maximum effective number of choices is |A|.

Huffman code worst case

Previously saw: the simple code with ℓ_i = ⌈log 1/p_i⌉ always compresses
with E[length] < H(X) + 1. With many typical symbols the “+1” looks small.
Huffman codes can be this bad too:
For P_X = {1−ε, ε}, H(X) → 0 as ε → 0.
Encoding symbols independently means E[length] = 1.
Relative encoding length: E[length]/H(X) → ∞ (!)
Question: can we fix the problem by encoding blocks?
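To make this worst case concrete before moving on, here is a small added illustration (not from the original notes) that evaluates H(X), the perplexity 2^{H(X)}, and the relative encoding length for P_X = {1−ε, ε} as ε shrinks:

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

for eps in [0.5, 0.1, 0.01, 0.001]:
    H = entropy_bits([1 - eps, eps])
    # Coding the two symbols independently costs at least one bit each,
    # so E[length] = 1 here, and the ratio E[length]/H(X) blows up.
    print(f"eps={eps:<6} H(X)={H:.4f} bits  perplexity={2**H:.3f}  "
          f"E[length]/H(X)={1.0 / H:.1f}")
```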
Reminder on Relative Entropy and symbol codes

The Relative Entropy (AKA Kullback–Leibler or KL divergence) gives
the expected extra number of bits per symbol needed to encode a
source when a complete symbol code uses implicit probabilities
q_i = 2^{−ℓ_i} instead of the true probabilities p_i.

We have been assuming symbols are generated i.i.d. with known
probabilities p_i. Where would we get the probabilities p_i from if,
say, we were compressing text? A simple idea is to read in a large
text file and record the empirical fraction of times each character
is used. Using these probabilities, the table below (from MacKay’s
book) gives a Huffman code for English text.

  a_i     p_i     log2(1/p_i)   ℓ_i   c(a_i)
   a     0.0575      4.1         4    0000
   b     0.0128      6.3         6    001000
   c     0.0263      5.2         5    00101
   d     0.0285      5.1         5    10000
   e     0.0913      3.5         4    1100
   f     0.0173      5.9         6    111000
   g     0.0133      6.2         6    001001
   h     0.0313      5.0         5    10001
   i     0.0599      4.1         4    1001
   j     0.0006     10.7        10    1101000000
   k     0.0084      6.9         7    1010000
   l     0.0335      4.9         5    11101
   m     0.0235      5.4         6    110101
   n     0.0596      4.1         4    0001
   o     0.0689      3.9         4    1011
   p     0.0192      5.7         6    111001
   q     0.0008     10.3         9    110100001
   r     0.0508      4.3         5    11011
   s     0.0567      4.1         4    0011
   t     0.0706      3.8         4    1111
   u     0.0334      4.9         5    10101
   v     0.0069      7.2         8    11010001
   w     0.0119      6.4         7    1101001
   x     0.0073      7.1         7    1010001
   y     0.0164      5.9         6    101001
   z     0.0007     10.4        10    1101000001
   ␣     0.1928      2.4         2    01

(MacKay, p100. The original figure also shows the corresponding Huffman tree.)

The Huffman code uses 4.15 bits/symbol, whereas H(X) = 4.11 bits.
Encoding blocks might close the narrow gap. More importantly, English
characters are not drawn independently, so encoding blocks could give
a better model.

Bigram statistics

From the table above: A_X = {a–z, ␣}, H(X) = 4.11 bits.

Question: I decide to encode bigrams of English text:
    A_X' = {aa, ab, . . . , az, a␣, . . . , ␣␣}
What is H(X') for this new ensemble?

A ∼ 2 bits
B ∼ 4 bits
C ∼ 7 bits
D ∼ 8 bits
E ∼ 16 bits
Z ?
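Before answering the bigram question, the monogram numbers above (4.15 vs 4.11 bits) can be checked with a short Huffman-code construction. This is an added illustration using a standard heap-based sketch, not the code behind MacKay’s table: ties may be broken differently, so individual codeword lengths can differ from the table, but the expected length of any Huffman code for these probabilities is the same.

```python
import heapq
from itertools import count
from math import log2

# Monogram probabilities from the table above ("_" stands for space).
p = {'a': .0575, 'b': .0128, 'c': .0263, 'd': .0285, 'e': .0913, 'f': .0173,
     'g': .0133, 'h': .0313, 'i': .0599, 'j': .0006, 'k': .0084, 'l': .0335,
     'm': .0235, 'n': .0596, 'o': .0689, 'p': .0192, 'q': .0008, 'r': .0508,
     's': .0567, 't': .0706, 'u': .0334, 'v': .0069, 'w': .0119, 'x': .0073,
     'y': .0164, 'z': .0007, '_': .1928}

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code (ties broken arbitrarily)."""
    tie = count()                     # tie-breaker so tuples always compare
    heap = [(q, next(tie), [s]) for s, q in probs.items()]
    heapq.heapify(heap)
    length = {s: 0 for s in probs}
    while len(heap) > 1:
        q1, _, syms1 = heapq.heappop(heap)
        q2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            length[s] += 1            # merged symbols move one level deeper
        heapq.heappush(heap, (q1 + q2, next(tie), syms1 + syms2))
    return length

lengths = huffman_lengths(p)
H = -sum(q * log2(q) for q in p.values())
E_len = sum(p[s] * lengths[s] for s in p)
print(f"H(X) = {H:.2f} bits, Huffman E[length] = {E_len:.2f} bits/symbol")
```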
Answering the previous vague question

We didn’t completely define the ensemble: what are the probabilities?

We could draw characters independently using the p_i’s found before.
Then a bigram is just two draws from X, often written X^2.
H(X^2) = 2H(X) = 8.22 bits

We could draw pairs of adjacent characters from English text.
When predicting such a pair, how many effective choices do we have?
More than when we had A_X = {a–z, ␣}: we have to pick the first
character and another character. But the second choice is easier.
We expect H(X) < H(X') < 2H(X). Maybe 7 bits?

Looking at a large text file, the actual answer is about 7.6 bits.
This is ≈ 3.8 bits/character: better compression than before.

Shannon (1948) estimated about 2 bits/character for English text.
Shannon (1951) estimated about 1 bit/character for English text.

Compression performance results from the quality of a probabilistic
model and the compressor that uses it.

Human predictions

Ask people to guess letters in a newspaper headline:

k·i·d·s· ·m·a·k·e· ·n·u·t·r·i·t·i·o·u·s· ·s·n·a·c·k·s
11·4·2·1·1·4·2·4·1·1·15·5·1·2·1·1·1·1·2·1·1·16·7·1·1·1·1

Numbers show # guesses required by the 2010 class.

⇒ “effective number of choices” or entropy varies hugely

We need to be able to use a different probability distribution for
every context. Sometimes many letters in a row can be predicted at
minimal cost: need to be able to use < 1 bit/character.

(MacKay Chapter 6 describes how numbers like those above could be
used to encode strings.)
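Both the ≈7.6-bit figure above and the “different distribution for every context” point can be checked empirically. The sketch below is an added illustration, assuming a hypothetical plain-text file corpus.txt; the exact numbers depend on the corpus and on how the text is normalised.

```python
import re
from collections import Counter
from math import log2

# Hypothetical corpus file: any large plain-text file, reduced to a-z and space.
text = open('corpus.txt').read().lower()
text = re.sub(r'[^a-z ]+', ' ', text)   # keep only a-z and space
text = re.sub(r' +', ' ', text)         # collapse runs of spaces

def entropy_bits(counts):
    """Entropy (in bits) of the empirical distribution behind a Counter."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

monograms = Counter(text)
bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))

H1 = entropy_bits(monograms)            # around 4.1 bits for English-like text
H2 = entropy_bits(bigrams)              # the "about 7.6 bits" claim
print(f"H(X)  ≈ {H1:.2f} bits,  H(X') ≈ {H2:.2f} bits "
      f"({H2 / 2:.2f} bits/character)")

# Entropy of a character given the previous one: the average entropy of the
# "different distribution for every context", with a one-character context.
print(f"H(second | first) ≈ {H2 - H1:.2f} bits")
```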
Cliché Predictions; a more boring prediction game

“I have a binary string with bits that were drawn i.i.d. Predict away!”
What fraction of people, f, guess the next bit is ‘1’?

⇒ Encode bacabab with 0111...01

Product rule / Chain rule

P(A, B | H) = P(A | H) P(B | A, H) = P(B | H) P(A | B, H)
            = P(A | H) P(B | H)   iff independent
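A tiny added check (not from the notes) that the two factorisations of the product rule agree, using a made-up joint distribution over two binary variables and dropping the background conditioning H for brevity:

```python
import numpy as np

# Made-up joint distribution P(A, B); rows index A, columns index B.
# This particular table was constructed to be independent.
P = np.array([[0.10, 0.30],
              [0.15, 0.45]])

P_A = P.sum(axis=1)             # P(A)
P_B = P.sum(axis=0)             # P(B)
P_B_given_A = P / P_A[:, None]  # P(B | A)
P_A_given_B = P / P_B[None, :]  # P(A | B)

# Both factorisations recover the joint: P(A)P(B|A) = P(B)P(A|B) = P(A,B)
assert np.allclose(P_A[:, None] * P_B_given_A, P)
assert np.allclose(P_B[None, :] * P_A_given_B, P)

# Because A and B are independent here, P(A,B) = P(A)P(B) as well.
print(np.allclose(P, np.outer(P_A, P_B)))
```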
Arithmetic Coding

(The original slides show figures zooming out and zooming in on the
unit-interval picture of the code.)

1000010101 uniquely decodes to ‘bac’.
1000010110 would also work: slight inefficiency.
2^{−ℓ} > (1/4) P(x = bac)  ⇒  ℓ < −log P(x = bac) + 2 bits

Tutorial homework: prove encoding length < log 1/P(x) + 2 bits.
An excess of 2 bits on the whole file (millions or more bits?)

Arithmetic coding compresses very close to the information content
given by the probabilistic model used by both the sender and receiver.

The conditional probabilities P(x_i | x_{j<i}) can change for each symbol.
Arbitrary adaptive models can be used (if you have one).

Large blocks of symbols are compressed together: possibly your whole
file. The inefficiencies of symbol codes have been removed.

Huffman coding blocks of symbols requires an exponential number of
codewords. In arithmetic coding, each character is predicted one at a
time, as in a guessing game. The model and arithmetic coder just
consider those |A_X| options at a time. None of the code needs to
enumerate huge numbers of potential strings. (De)coding costs should
be linear in the message length.

Model probabilities P(x) might need to be rounded to values Q(x)
that can be represented consistently by the encoder and decoder. This
approximation introduces the usual average overhead: D_KL(P || Q).
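The interval picture and the ℓ < log 1/P(x) + 2 bound can be illustrated with a minimal encoder. This is an added sketch, using exact rational arithmetic instead of the streaming fixed-precision tricks a real coder needs, and a made-up i.i.d. model over {a, b, c}; the ‘1000010101’ codeword for ‘bac’ above came from a different model, so the output here will differ.

```python
from fractions import Fraction
from math import log2

# Made-up i.i.d. model over {a, b, c}; a real coder could instead use
# conditional probabilities P(x_i | x_{j<i}) that change at every position.
probs = {'a': Fraction(2, 10), 'b': Fraction(5, 10), 'c': Fraction(3, 10)}
alphabet = sorted(probs)

def encode(message):
    """Return a binary codeword whose dyadic interval lies inside [low, low+width)."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        cum = sum((probs[a] for a in alphabet if a < sym), Fraction(0))
        low += width * cum          # shrink the interval to this symbol's slice
        width *= probs[sym]
    # Smallest ell with 2^-ell <= width/2: then some dyadic interval of length
    # 2^-ell fits wholly inside [low, low+width), and
    # ell = ceil(log2(1/width)) + 1 < log2(1/P(message)) + 2.
    ell = 1
    while Fraction(1, 2**ell) > width / 2:
        ell += 1
    c = -((-low.numerator * 2**ell) // low.denominator)   # ceil(low * 2^ell)
    return format(c, f'0{ell}b')

msg = 'bac'
code = encode(msg)
P = probs['b'] * probs['a'] * probs['c']
print(f"P({msg}) = {P},  -log2 P = {-log2(P):.2f}")
print(f"codeword = {code}  (length {len(code)} < {-log2(P):.2f} + 2)")
```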
AC and sparse files

Finally we have a practical coding algorithm for sparse files.

Non-binary encoding

Can overlay the string on any other indexing of the [0, 1] line.

Dasher

Dasher is an information-efficient text-entry interface.
Use the same string tree; gestures specify which one we want.