Unit 1: Information Theory and Coding
Example: Morse Code
[Diagram: letters A, …, Z pass through Encoding by a keyer into dots, dashes, and spaces (∙ ─ _), travel from transmitter to receiver over a telegraph wire or shortwave radio subject to noise, and are Decoded by a recognizer back to A, …, Z.]
0111……111 is uniquely decodable, but the first symbol cannot be decoded without
reading all the way to the end.
4.3
Constructing Instantaneous Codes
[Decoding tree: start at the root and branch on 0/1 at each internal node; each leaf is a code word, e.g. s1 = 00, s2 = 01, s3 = 10, with remaining branches available for further code words.]
4.4
Kraft Inequality
Theorem: There exists an instantaneous code for S in which each symbol s ∈ S is encoded in radix r with length |s| if and only if
K = Σ_{s∈S} r^{−|s|} ≤ 1.
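As a quick aside (not from the notes), here is a minimal Python check of the Kraft sum for a proposed list of code word lengths; the lengths and radix below are illustrative values only.

```python
def kraft_sum(lengths, radix=2):
    """Return K = sum of r^(-l) over the proposed code word lengths."""
    return sum(radix ** (-l) for l in lengths)

# Lengths 1, 2, 3, 3 in binary: K = 1/2 + 1/4 + 1/8 + 1/8 = 1.0, so an
# instantaneous code exists; lengths 1, 2, 2, 3 give K = 1.125 > 1, so none does.
print(kraft_sum([1, 2, 3, 3]))        # 1.0
print(kraft_sum([1, 2, 2, 3]) <= 1)   # False
```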
Proof (⇒), by induction on the height n of the decoding tree.
Basis: n = 1. The tree has at most r leaves, each a code word of length 1, so Σ_{s} r^{−|s|} ≤ r · r^{−1} = 1.
Induction step: the root has at most r subtrees T_1, …, T_r, each a decoding tree of height at most n − 1. A symbol s lying in subtree T_i has length |s| − 1 measured inside T_i, so
K = Σ_{i=1}^{r} Σ_{s∈T_i} r^{−|s|} = Σ_{i=1}^{r} (1/r) Σ_{s∈T_i} r^{−(|s|−1)} ≤ Σ_{i=1}^{r} (1/r) · 1 = 1,
where the last step applies the induction hypothesis (IH) to each subtree.
Strict inequality (K < 1) in the binary case implies that not all internal nodes have
degree 2; but if a node has degree 1, then clearly that edge can be
removed by contraction, shortening the code.
4.5
Kraft Inequality (⇐)
Construct a code via decoding trees. Number the symbols s1, …, sq
so that l1 ≤ … ≤ lq and assume K ≤ 1.
Greedy method: proceed left-to-right, systematically assigning
leaves to code words, so that you never pass through or land on a
previous one. The only way this method could fail is if it runs out of
nodes (tree is over-full), but that would mean K > 1.
[Example trees: lengths giving ½+¼+¼+⅛ > 1 (over-full, impossible); ½+⅛+⅛+⅛ < 1 (a leaf is not used); ½+¼+⅛+⅛ = 1 (exactly full).]
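The greedy leaf-assignment described above can be sketched in code. This is a hypothetical helper (my own, not from the notes): it walks the sorted lengths left to right, keeping a counter for the next free node at the current depth, which is the usual canonical prefix-code construction.

```python
def greedy_prefix_code(lengths):
    """Binary case: given lengths with Kraft sum K <= 1, assign code words
    left to right so that no word is a prefix of another."""
    lengths = sorted(lengths)
    assert sum(2 ** (-l) for l in lengths) <= 1, "tree would be over-full (K > 1)"
    code, next_node, prev_len = [], 0, 0
    for l in lengths:
        next_node <<= (l - prev_len)               # descend to depth l in the tree
        code.append(format(next_node, "0{}b".format(l)))
        next_node += 1                             # never pass through or land on this leaf again
        prev_len = l
    return code

print(greedy_prefix_code([1, 2, 3, 3]))   # ['0', '10', '110', '111']
```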
4.5
Shortened Block Codes
With exactly 2^m symbols, we can form a set of code words each of
length m: b_1 … b_m, b_i ∈ {0,1}. This is a complete binary decoding
tree of depth m. With fewer than 2^m symbols, we can chop off branches to get
modified (shortened) block codes.
[Ex 1 and Ex 2: two shortened block codes for five symbols s1, …, s5, obtained by pruning different branches of a depth-3 binary tree.]
4.6
McMillan Inequality
Idea: Uniquely decodable codes satisfy the same bounds as
instantaneous codes.
L_avg = Σ_{i=1}^{q} p_i l_i, where p_i is the probability of symbol s_i and l_i = |s_i| is the length of s_i.
4.8
Start with the source alphabet S = {s1, …, sq} and consider B = {0, 1} as our code
alphabet (binary). First, observe that we may take l_{q−1} = l_q: since the code is instantaneous, s_{q−1}
cannot be a prefix of s_q, so dropping the last symbol from s_q (if l_q > l_{q−1}) won’t hurt.
Huffman algorithm: So, we can combine s_{q−1} and s_q into a “combo-symbol” (s_{q−1} + s_q)
with probability (p_{q−1} + p_q) and get a code for the reduced alphabet.
[Base case: two symbols with probabilities 0.6 and 0.4 receive code words 0 and 1; a single symbol with probability 1.0 would receive the empty string ε.]
N. B. the case for q = 1 does not produce a valid code.
4.8
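A minimal sketch of the binary Huffman construction just described, repeatedly combining the two least probable “combo-symbols”; the use of heapq and the tie-breaking counter are my own implementation choices, not part of the notes.

```python
import heapq

def huffman_code(probs):
    """probs: dict symbol -> probability. Returns dict symbol -> binary code word."""
    # Heap entries: (probability, tie-breaker, {symbol: partial code word}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)    # the two least probable combo-symbols
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"s1": 0.4, "s2": 0.3, "s3": 0.2, "s4": 0.1}
code = huffman_code(probs)
print(code)                                             # e.g. s1 -> '0', s2 -> '10', ...
print(sum(p * len(code[s]) for s, p in probs.items()))  # L_avg = 1.9
```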
Huffman is always of shortest average length
Assume the symbols are ordered so that p1 ≥ … ≥ pq; we know the Huffman code then has l1 ≤ … ≤ lq. Let L be the average length of any alternative instantaneous code. We are trying to show L ≥ L_avg, the Huffman average length.
4.8
Claim: l_{q−1} = l_q = l_{q−1,q} + 1, where l_{q−1,q} is the length of the combined symbol (s_{q−1} + s_q) in the reduced code (the two deepest leaves s_{q−1} and s_q are siblings at total height l_q).
So the alternative code's reduced version will always satisfy
L′ = Σ_{i=1}^{q−2} p_i l_i + (p_{q−1} + p_q)(l_q − 1) = L − (p_{q−1} + p_q).
By the IH, L′_avg ≤ L′. More importantly, the reduced Huffman code shares the same properties, so it also satisfies the same equation:
L′_avg + (p_{q−1} + p_q) = L_avg, hence L_avg ≤ L.
4.8
Code Extensions
Take p1 = ⅔ and p2 = ⅓. The Huffman code gives
s1 = 0, s2 = 1, L_avg = 1.
Square the symbol alphabet to get:
S2 : s1,1 = s1s1; s1,2 = s1s2; s2,1 = s2s1; s2,2 = s2s2;
p1,1 = 4⁄9; p1,2 = 2⁄9; p2,1 = 2⁄9; p2,2 = 1⁄9
Apply Huffman to S2:
s1,1 = 1; s1,2 = 01; s2,1 = 000; s2,2 = 001
L_avg = (4⁄9)∙1 + (2⁄9)∙2 + (2⁄9)∙3 + (1⁄9)∙3 = 17⁄9 < 2
But we are sending two symbols at a time!
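A quick numeric check of the figures above (my own verification, not part of the notes):

```python
from fractions import Fraction as F

# Probabilities of S^2 and the code word lengths quoted above.
probs   = {"s11": F(4, 9), "s12": F(2, 9), "s21": F(2, 9), "s22": F(1, 9)}
lengths = {"s11": 1,       "s12": 2,       "s21": 3,       "s22": 3}

L_avg = sum(probs[t] * lengths[t] for t in probs)
print(L_avg)       # 17/9 per pair of source symbols
print(L_avg / 2)   # 17/18 < 1 bit per original symbol
```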
4.10
Huffman Codes in radix r
At each stage down, we merge the last (least probable) r states
into 1, reducing the number of states by r − 1. Since we end with one
state, we must begin with a number of states ≡ 1 (mod r − 1). We pad
out the states with probability-0 symbols to get this. Example: r = 4; k = 3
[Example: probabilities 0.22, 0.2, 0.18, 0.15, 0.1, 0.08, 0.05, 0.02, padded with 0.0, 0.0; branches at each merge labeled 0, 1, 2, 3.]
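A small sketch (my own) of the padding rule just stated: the number of probability-0 symbols needed so that the total count is ≡ 1 (mod r − 1).

```python
def pad_count(q, r):
    """Zero-probability symbols to add so that q + pads is congruent to 1 mod (r - 1)."""
    return (1 - q) % (r - 1)

print(pad_count(8, 4))    # 2 -> the two 0.0 pads in the example above
print(pad_count(10, 4))   # 0 -> 10 is already congruent to 1 mod 3
```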
4.11
Information
A quantitative measure of the amount of information any
event represents. I(p) = the amount of information in the
occurrence of an event of probability p.
Axioms (for a single source symbol):
A. I(p) ≥ 0 for any event of probability p
B. I(p1∙p2) = I(p1) + I(p2) when p1 & p2 are independent events
C. I(p) is a continuous function of p
Axiom B is the Cauchy functional equation.
Existence: I(p) = log(1/p).
Units of information: in base 2, a bit; in base e, a nat; in base 10, a Hartley.
6.2
Uniqueness:
Suppose I′(p) satisfies the axioms. Since I′(p) ≥ 0, take any
0 < p0 < 1 and the base k = (1/p0)^{1/I′(p0)}. So k^{I′(p0)} = 1/p0, and
hence log_k(1/p0) = I′(p0). Now, any z ∈ (0,1) can be
written as p0^r, with r ∈ R+ a real number (r = log_{p0} z). The
Cauchy functional equation implies that I′(p0^n) = n∙I′(p0)
and, for all m ∈ Z+, I′(p0^{1/m}) = (1/m)∙I′(p0), which gives I′(p0^{n/m}) =
(n/m)∙I′(p0), and hence by continuity I′(p0^r) = r∙I′(p0).
Hence I′(z) = r∙log_k(1/p0) = log_k(1/p0^r) = log_k(1/z).
6.2
Entropy
The average amount of information received on a per-symbol
basis from a source S = {s1, …, sq} of symbols, where s_i
has probability p_i. It measures the information rate.
In radix r, when all the probabilities are independent:
H_r(S) = Σ_{i=1}^{q} p_i log_r(1/p_i) = Σ_{i=1}^{q} log_r(1/p_i)^{p_i} = log_r Π_{i=1}^{q} (1/p_i)^{p_i},
i.e. the weighted arithmetic mean of the information equals the information (log) of the weighted geometric mean of the 1/p_i.
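A minimal computation matching the definition above; the distributions are illustrative values only.

```python
from math import log

def entropy(probs, r=2):
    """H_r(S) = sum of p_i * log_r(1/p_i), skipping zero-probability symbols."""
    return sum(p * log(1 / p, r) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits/symbol
print(entropy([1/3, 1/3, 1/3], r=3))        # 1.0, the maximum for 3 symbols in radix 3
```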
6.3
Consider f(p) = p ln(1/p) (the analysis works for any base, not just e):
f′(p) = (−p ln p)′ = −p(1/p) − ln p = −1 + ln(1/p)
f″(p) = −1/p < 0 for p ∈ (0,1), so f is concave down.
f′(1/e) = 0 and f(1/e) = 1/e, the maximum.
[Plot of f(p) on (0,1): maximum value 1/e at p = 1/e; f(1) = 0, f′(1) = −1, f′(0⁺) = +∞.]
lim_{p→0⁺} f(p) = lim_{p→0⁺} (ln(1/p))/(1/p) = lim_{p→0⁺} (−ln p)/(p^{−1}) = lim_{p→0⁺} (−1/p)/(−p^{−2}) = lim_{p→0⁺} p = 0.
6.3
Basic information about logarithm function
Tangent line to y = ln x at x = 1: (y − ln 1) = (ln x)′|_{x=1} ∙ (x − 1), i.e. y = x − 1.
(ln x)″ = (1/x)′ = −1/x² < 0, so ln x is concave down.
Conclusion: ln x ≤ x − 1, with equality only at x = 1.
[Plot: y = ln x lies below the line y = x − 1, touching it at x = 1; the curve passes through (1, 0), the line through (0, −1).]
6.4
Fundamental Gibbs inequality
Let (x_i) and (y_i), with Σ_{i=1}^{q} x_i = 1 and Σ_{i=1}^{q} y_i = 1, be two probability distributions, and consider
Σ_{i=1}^{q} x_i log(y_i/x_i) ≤ Σ_{i=1}^{q} x_i (y_i/x_i − 1) = Σ_{i=1}^{q} (y_i − x_i) = Σ y_i − Σ x_i = 1 − 1 = 0,
using ln z ≤ z − 1, with equality only when x_i = y_i for all i.
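A small numeric sanity check of the Gibbs inequality on randomly generated distributions (my own illustration, not from the notes).

```python
import random
from math import log

def gibbs_lhs(x, y):
    """sum of x_i * log(y_i / x_i); always <= 0, and 0 only when x == y."""
    return sum(xi * log(yi / xi) for xi, yi in zip(x, y))

def random_dist(q):
    w = [random.random() for _ in range(q)]
    total = sum(w)
    return [wi / total for wi in w]

for _ in range(1000):
    x, y = random_dist(4), random_dist(4)
    assert gibbs_lhs(x, y) <= 1e-12

print(gibbs_lhs([0.5, 0.5], [0.5, 0.5]))   # 0.0: equality when x_i = y_i
```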
6.5
Shannon-Fano Coding
Simplest variable-length method. Less efficient than Huffman, but it
allows one to code symbol s_i with length l_i directly from the
probability p_i:
l_i = ⌈log_r(1/p_i)⌉
From log_r(1/p_i) ≤ l_i < log_r(1/p_i) + 1 we get 1/p_i ≤ r^{l_i} < r/p_i, i.e. p_i ≥ r^{−l_i} > p_i/r.
Summing this inequality over i: 1 = Σ_{i=1}^{q} p_i ≥ Σ_{i=1}^{q} r^{−l_i} = K.
The Kraft inequality is satisfied; therefore there is an instantaneous
code with these lengths.
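A sketch of the Shannon-Fano length assignment just described, with a check of the Kraft sum and the average length (the probabilities are illustrative; function names are mine).

```python
from math import ceil, log

def shannon_fano_lengths(probs, r=2):
    """l_i = ceil(log_r(1/p_i)) for each probability p_i."""
    return [ceil(log(1 / p, r)) for p in probs]

probs = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_fano_lengths(probs)
print(lengths)                                # [2, 2, 3, 4]
print(sum(2.0 ** -l for l in lengths) <= 1)   # True: Kraft inequality holds
L = sum(p * l for p, l in zip(probs, lengths))
H = sum(p * log(1 / p, 2) for p in probs)
print(H <= L < H + 1)                         # True: H(S) <= L < H(S) + 1
```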
6.6
Also, multiplying log_r(1/p_i) ≤ l_i < log_r(1/p_i) + 1 by p_i and summing:
H_r(S) = Σ_{i=1}^{q} p_i log_r(1/p_i) ≤ Σ_{i=1}^{q} p_i l_i = L < H_r(S) + 1.
[Example decoding tree for a source with H2(S) = 2.5 and L = 5/2.]
6.6
The Entropy of Code Extensions
Recall: The nth extension of a source S = {s1, …, sq} with probabilities p1, …, pq is the set of symbols
T = S^n = {s_{i_1} ∙∙∙ s_{i_n} : s_{i_j} ∈ S, 1 ≤ j ≤ n},
where t_i = s_{i_1} ∙∙∙ s_{i_n} (concatenation) has probability Q_i = p_{i_1} ∙∙∙ p_{i_n} (multiplication), assuming the source symbols are independent.
6.8
H(S^n) = Σ_{i=1}^{q^n} Q_i log(1/Q_i) = Σ_{i=1}^{q^n} Q_i Σ_{k=1}^{n} log(1/p_{i_k}). Consider the kth term:
Σ_{i=1}^{q^n} Q_i log(1/p_{i_k}) = Σ_{i_1=1}^{q} ∙∙∙ Σ_{i_n=1}^{q} p_{i_1} ∙∙∙ p_{i_n} log(1/p_{i_k})
= [Σ_{i_k=1}^{q} p_{i_k} log(1/p_{i_k})] ∙ [Σ over all i_j with j ≠ k of p_{i_1} ∙∙∙ p̂_{i_k} ∙∙∙ p_{i_n}] = H(S) ∙ 1 = H(S),
since p_{i_1} ∙∙∙ p̂_{i_k} ∙∙∙ p_{i_n} (with the kth factor omitted) is just a probability in the (n − 1)st
extension, and adding them all up gives 1. Summing over k = 1, …, n:
H(S^n) = n∙H(S)
Hence the average S-F code length L_n for T satisfies:
H(T) ≤ L_n < H(T) + 1, i.e. n∙H(S) ≤ L_n < n∙H(S) + 1.
Σ_{i=0}^{n} C(n,i) 2^{n−i} = (2 + 1)^n = 3^n.
Differentiating (2 + x)^n = Σ_{i=0}^{n} C(n,i) 2^i x^{n−i} with respect to x and setting x = 1:
n(2 + x)^{n−1} = Σ_{i=0}^{n} C(n,i) 2^i (n − i) x^{n−i−1}, so Σ_{i=0}^{n} C(n,i) 2^i (n − i) = n∙3^{n−1}.
Equivalently, Σ_{i=0}^{n} C(n,i) 2^i (n − i) = n Σ_{i=0}^{n} C(n,i) 2^i − Σ_{i=0}^{n} C(n,i) i 2^i = n∙3^n − 2n∙3^{n−1} = n∙3^{n−1}.
6.9
Markov Process Entropy
p(s_i | s_{i_1} ∙∙∙ s_{i_m}) = the conditional probability that s_i follows s_{i_1} ∙∙∙ s_{i_m}.
For an mth-order process, think of letting the state be s = (s_{i_1}, …, s_{i_m}).
Hence I(s_i | s) = log(1/p(s_i | s)), and so
H(S | s) = Σ_{s_i ∈ S} p(s_i | s) I(s_i | s).
For the example process: H = 2∙(4/14)∙log₂(1/0.8) + 2∙(1/14)∙log₂(1/0.2) + 4∙(1/14)∙log₂(1/0.5) ≈ 0.801377
6.11
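A quick check of the number quoted above (my own verification of the arithmetic):

```python
from math import log2

# Weighted sum of the conditional-information terms as written above.
H = (2 * (4/14) * log2(1/0.8)
     + 2 * (1/14) * log2(1/0.2)
     + 4 * (1/14) * log2(1/0.5))
print(H)   # approximately 0.801377
```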
The Fibonacci numbers
• Let f0 = 1, f1 = 2, f2 = 3, f3 = 5, f4 = 8, … be defined by f_{n+1} = f_n + f_{n−1}.
The golden ratio φ is a root of the equation x² = x + 1. Use these as
the weights for a system of number representation with digits 0 and 1,
without adjacent 1’s (because (100)_φ = (11)_φ).
Base Fibonacci
Representation Theorem: every number from 0 to f_n − 1 can be uniquely written as an n-bit number with no adjacent ones.
Uniqueness: Let i be the smallest number ≥ 0 with two distinct representations (no
leading zeros), i = (b_{n−1} … b_0)_φ = (b′_{n−1} … b′_0)_φ. By minimality of i, b_{n−1} ≠ b′_{n−1}, so
without loss of generality let b_{n−1} = 1 and b′_{n−1} = 0; this implies (b′_{n−2} … b′_0)_φ ≥ f_{n−1}, which can’t be
true.
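A sketch of the representation in code (my own addition): the greedy encoder below uses the weights f0 = 1, f1 = 2, f2 = 3, f3 = 5, … defined above and, as the theorem promises, never produces adjacent 1's.

```python
def fib_weights(n):
    """First n Fibonacci weights: 1, 2, 3, 5, 8, ..."""
    f = [1, 2]
    while len(f) < n:
        f.append(f[-1] + f[-2])
    return f[:n]

def to_fibonacci(x, n):
    """Greedy n-bit base-Fibonacci representation of 0 <= x < f_n."""
    bits = []
    for w in reversed(fib_weights(n)):
        if w <= x:
            bits.append("1")
            x -= w
        else:
            bits.append("0")
    return "".join(bits)

# Every number 0 .. f_4 - 1 = 7 gets a unique 4-bit form with no adjacent 1's.
for x in range(8):
    print(x, to_fibonacci(x, 4))
```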
Base Fibonacci
The golden ratio φ = (1+√5)/2 is a solution to x² − x − 1 = 0 and is equal to the limit of the ratio of adjacent Fibonacci numbers; the entropy of such a source is H₂ = log₂ φ.
[Diagram: a 1st-order Markov process on symbols 0 and 1 in which a 1 is always followed by a 0, with branch probabilities 1/φ and 1/φ² (note 1/φ + 1/φ² = 1). Equivalently, think of the source as emitting the variable-length symbols 0 and 10.]
In the above example, p_e = (3/11, 4/11, 4/11), which is the overall average weather.
5.2
Predictive Coding
Assume a prediction algorithm for a binary source which given all prior bits
predicts the next.
What is transmitted is the error, ei. By knowing just the error, the predictor also
knows the original symbols.
[Block diagram: the source emits s_n; the transmitter's predictor produces p_n and the error e_n is sent over the channel; the destination's predictor produces the same p_n and recovers s_n from e_n.]
We must assume that both predictors are identical and start in the same state.
5.7
Accuracy: The probability of the predictor being correct is p = 1 − q,
constant over time and independent of other prediction errors.
Let the probability of a run of exactly n 0’s followed by a 1, (0^n 1), be p(n) = p^n ∙ q.
Summing over all run lengths n = 0, 1, 2, …:
Σ_{n≥0} p^n q = q Σ_{n≥0} p^n = q ∙ 1/(1 − p) = q/q = 1.
Expected length of a run = Σ_{n≥0} (n + 1) p^n q = q Σ_{n≥0} (n + 1) p^n = q∙f(p), where f(p) = Σ_{n≥0} (n + 1) p^n.
Since ∫ f(p) dp = Σ_{n≥0} p^{n+1} + c = p/(1 − p) + c,
f(p) = (p/(1 − p))′ = [(1 − p) + p]/(1 − p)² = 1/(1 − p)². So q∙f(p) = q/(1 − p)² = q/q² = 1/q.
Note: an alternate method for calculating f(p) is to look at (Σ_{n≥0} p^n)² = 1/(1 − p)², whose p^n coefficient is n + 1.
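A short numeric check of the two series above (my own, with an illustrative q):

```python
q = 0.1
p = 1 - q

total = sum(p**n * q for n in range(10_000))            # probabilities of all runs
mean  = sum((n + 1) * p**n * q for n in range(10_000))  # expected run length
print(total)        # approximately 1.0
print(mean, 1 / q)  # both approximately 10.0
```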
5.8
Coding of Run Lengths
Send a k-digit binary number to represent a run of zeroes whose
length is between 0 and 2^k − 2 (small runs are sent in binary).
For run lengths larger than 2^k − 2, send 2^k − 1 (k ones) followed by
another k-digit binary number, etc. (large runs are counted in unary blocks).
n = i∙m + j, 0 ≤ j < m = 2^k − 1
code for n: i blocks of k ones (11…1 repeated i times), followed by B_j, the k-digit binary representation of j; length l_n = (i + 1)∙k.
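A sketch of the encoder just described (the function name is mine): a run of n zeros becomes i blocks of k ones followed by the k-digit binary form of j, where n = i·m + j and m = 2^k − 1.

```python
def encode_run(n, k):
    """Encode a run of n zeros: i blocks of k ones, then j as a k-digit number."""
    m = 2 ** k - 1
    i, j = divmod(n, m)
    return "1" * (i * k) + format(j, "0{}b".format(k))

print(encode_run(5, 3))        # '101'    -> a small run, sent directly in binary
print(encode_run(9, 3))        # '111010' -> 9 = 1*7 + 2: one block of ones, then 010
print(len(encode_run(9, 3)))   # 6 = (i + 1) * k
```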
5.9
Expected length of run length code
Expected code length = Σ_{n≥0} p^n q l_n. But every n can be written uniquely as
n = i∙m + j where i ≥ 0, 0 ≤ j < m = 2^k − 1, and l_n = (i + 1)k, so
Σ_{i≥0} Σ_{j=0}^{m−1} p^{im+j} q (i + 1) k = q k Σ_{i≥0} (i + 1) p^{im} Σ_{j=0}^{m−1} p^j = q k ∙ (1 − p^m)/q ∙ Σ_{i≥0} (i + 1) p^{im}
= k(1 − p^m) Σ_{i≥0} (i + 1) (p^m)^i = k(1 − p^m)/(1 − p^m)² = k/(1 − p^m).
5.9
Gray Code
Consider an analog-to-digital “flash” converter consisting of a
rotating wheel:
[Figure: a wheel with three concentric circles of 0/1 sectors; imagine “brushes” contacting the wheel in each of the three circles as it rotates. With a Gray-coded wheel the maximum error in the scheme is ± ⅛ of a rotation, because adjacent sectors differ in only one bit.]
5.15-17
Binary ↔ Gray
Inductively: G(1) = {0, 1}; G(n + 1) is formed by prefixing each word of G(n) with 0 and then appending the words of G(n) in reverse order prefixed with 1: 0G_0, 0G_1, …, 0G_{2^n−1}, 1G_{2^n−1}, …, 1G_0.
Computationally: B = b_{n−1} ⋯ b_0 with b_i ∈ {0,1}, and G = g_{n−1} ⋯ g_0 with g_i ∈ {0,1}.
Encoding: g_{n−1} = b_{n−1}; g_i = b_i ⊕ b_{i+1} for 0 ≤ i < n − 1.
Decoding: b_i = ⊕_{j≥i} g_j (keep a running total).
[Diagram: encoding maps b3 b2 b1 b0 down and diagonally to g3 g2 g1 g0; decoding accumulates b3 → b2 → b1 → b0.]
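A sketch of both rules operating on integers (my own compact form; it matches g_i = b_i ⊕ b_(i+1) and the running-total decode).

```python
def binary_to_gray(b):
    """g_i = b_i XOR b_(i+1): a shift and an xor."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """b_i = XOR of g_j for j >= i: keep a running total."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# Successive Gray code words differ in exactly one bit, and decoding inverts encoding.
for x in range(8):
    g = binary_to_gray(x)
    print(format(x, "03b"), "->", format(g, "03b"))
    assert gray_to_binary(g) == x
```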
5.15-17
Shannon’s Theorems
First theorem: H(S) ≤ L_n(S^n)/n < H(S) + 1/n,
where L_n is the average length of a certain code for the nth extension S^n.
Second theorem (setup): decode a received word b_j by looking at the sphere S_{n(Q+ε)}(·) of radius n(Q + ε) about the sent code word. There are two sources of decoding error: too much noise (b_j falls outside the sphere about the sent symbol a_i), or another code word a′ ∈ A \ {a_i} is also inside the sphere. Thus
P_E ≤ P(b_j ∉ S_{n(Q+ε)}(a_i)) + Σ_{a′ ≠ a_i} P(a′ ∈ S_{n(Q+ε)}(b_j)).
By the law of large numbers, lim_{n→∞} P(b_j ∉ S_{n(Q+ε)}(a_i)) = 0, so the first term can be made ≪ δ.
N. b. P(b_j ∈ S(a_i)) = P(a_i ∈ S(b_j)).
[Figure: the sent symbol a_i, the received word b_j, and another code word a′, with the distances nQ and nε′ marked.]
10.4
Idea
Pick the number of code words M to be 2^{n(C−ε)}, where C is the channel capacity (the
block size n is as yet undetermined and depends on how closely, ε, we
wish to approach the channel capacity). The number of possible
random codes is (2^n)^M = 2^{nM}, each equally likely. Let P_E = the probability
of error averaged over all random codes. The idea is to show that P_E
→ 0; i.e. a code chosen at random will, most of the time, probably work!
Proof
Suppose a is what’s sent, and b what’s received.
P_E ≤ P(d(a, b) > n(Q + ε′)) + Σ_{a′ ≠ a} P(d(a′, b) ≤ n(Q + ε′))
(the first term: too many errors; the second: another codeword is too close).
Let X ∈ {0, 1} be a random variable representing errors in the channel,
taking the values 0 and 1 with probabilities P and Q respectively. So if the error vector is a ⊕ b = (X1, …, Xn), then
d(a, b) = X1 + … + Xn.
P(d(a, b) > n(Q + ε′)) = P(X1 + ⋯ + Xn > n(Q + ε′)) = P((X1 + ⋯ + Xn)/n − Q > ε′) ≤ V{X}/(n ε′²) → 0 as n → ∞ (by the law of large numbers).
N. B. Q = E{X} and Q < ½, so pick ε′ with Q + ε′ < ½.
10.5
Since the a′ are randomly (uniformly) distributed throughout the space of 2^n words,
P(d(a′, b) ≤ n(Q + ε′)) ≤ 2^{nH₂(Q+ε′)} / 2^n
by the binomial bound.
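A small numeric check of the binomial bound used here (my own illustration): the number of binary words within Hamming distance λn of a fixed word is at most 2^(nH₂(λ)) for λ ≤ ½.

```python
from math import comb, log2

def H2(x):
    """Binary entropy function."""
    return -x * log2(x) - (1 - x) * log2(1 - x)

n, lam = 100, 0.3
ball = sum(comb(n, k) for k in range(int(lam * n) + 1))  # words within distance 0.3n
print(ball <= 2 ** (n * H2(lam)))                        # True: the binomial bound holds
print(log2(ball), n * H2(lam))                           # compare the exponents
```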