
Information Theory
http://www.inf.ed.ac.uk/teaching/courses/it/

Week 4: Compressing streams

Iain Murray, 2014
School of Informatics, University of Edinburgh


Jensen's inequality

For convex functions: E[f(x)] ≥ f(E[x]). The inequality is reversed for
concave functions. For strictly convex functions, equality holds only if
P(x) puts all of its mass on one value.


Remembering Jensen's

Which way around is the inequality? I draw a picture in the margin: the
centre of gravity at (E[x], E[f(x)]) lies above the point (E[x], f(E[x]))
on the curve.

Alternatively, try 'proof by example': f(x) = x² is a convex function, and

    var[X] = E[x²] − E[x]² ≥ 0,

so Jensen's must be: E[f(x)] ≥ f(E[x]) for convex f.
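A quick numerical check of the 'proof by example' above (a minimal sketch;
the particular distribution is an arbitrary choice, not one from the notes):

# Jensen's inequality E[f(x)] >= f(E[x]) for the convex f(x) = x^2.
xs = [1.0, 2.0, 5.0]          # arbitrary outcomes
ps = [0.2, 0.5, 0.3]          # arbitrary probabilities (sum to 1)

E_x  = sum(p * x for p, x in zip(ps, xs))
E_fx = sum(p * x**2 for p, x in zip(ps, xs))

print(E_fx, ">=", E_x**2)     # E[x^2] >= E[x]^2
print("gap = var[X] =", E_fx - E_x**2)
assert E_fx >= E_x**2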

Jensen's: Entropy & Perplexity

Set u(x) = 1/p(x), with p(u(x)) = p(x). Then

    E[u] = E[1/p(x)] = |A|    (Tutorial 1 question)
    H(X) = E[log u(x)] ≤ log E[u]
    H(X) ≤ log |A|

Equality (maximum entropy) holds for constant u, i.e. uniform p(x).

2^H(X) = "perplexity" = "effective number of choices".
H(X) is log(effective number of choices); the maximum effective number of
choices is |A|.


Proving Gibbs' inequality

Idea: use Jensen's inequality. For the idea to work, the proof must look
like this:

    D_KL(p || q) = Σ_i p_i log(p_i/q_i) = E[f(u)] ≥ f(E[u])

Define u_i = q_i/p_i, with p(u_i) = p_i, giving E[u] = 1. Identify
f(x) ≡ log(1/x) = −log x, a convex function. Substituting gives:
D_KL(p || q) ≥ 0.


Huffman code worst case

Previously saw: the simple code ℓ_i = ⌈log 1/p_i⌉ always compresses with
E[length] < H(X) + 1. With many typical symbols the "+1" looks small.

The Huffman code can be this bad too: for P_X = {1−ε, ε}, H(X) → 0 as
ε → 0, but encoding symbols independently means E[length] = 1. The
relative encoding length E[length]/H(X) → ∞ (!)

Question: can we fix the problem by encoding blocks?
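These quantities are easy to compute directly. A minimal sketch (the
example distribution and ε values are arbitrary choices) checking
H(X) ≤ log2 |A| and the {1−ε, ε} worst case:

import math

def entropy(ps):
    # H(X) = sum_i p_i log2(1/p_i), in bits
    return sum(p * math.log2(1.0 / p) for p in ps if p > 0)

# Entropy, perplexity and the bound H(X) <= log2 |A|
ps = [0.5, 0.25, 0.125, 0.125]
H = entropy(ps)
print("H(X) =", H, "bits; perplexity =", 2**H, "<= |A| =", len(ps))

# Huffman worst case: P_X = {1-eps, eps}. Any symbol code spends at least
# 1 bit per symbol, but H(X) -> 0, so E[length]/H(X) blows up.
for eps in [0.1, 0.01, 0.001]:
    H = entropy([1 - eps, eps])
    print(f"eps={eps}: H(X)={H:.4f} bits, E[length]/H(X) >= {1.0/H:.1f}")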

Reminder on Relative Entropy and symbol codes

The Relative Entropy (AKA Kullback–Leibler or KL divergence) gives the
expected extra number of bits per symbol needed to encode a source when a
complete symbol code uses implicit probabilities q_i = 2^(−ℓ_i) instead of
the true probabilities p_i.

We have been assuming symbols are generated i.i.d. with known
probabilities p_i. Where would we get the probabilities p_i from if, say,
we were compressing text? A simple idea is to read in a large text file
and record the empirical fraction of times each character is used. Using
these probabilities, the table below (from MacKay's book) gives a Huffman
code for English text.

    a_i   p_i     log2(1/p_i)   l_i   c(a_i)
    a     0.0575      4.1        4    0000
    b     0.0128      6.3        6    001000
    c     0.0263      5.2        5    00101
    d     0.0285      5.1        5    10000
    e     0.0913      3.5        4    1100
    f     0.0173      5.9        6    111000
    g     0.0133      6.2        6    001001
    h     0.0313      5.0        5    10001
    i     0.0599      4.1        4    1001
    j     0.0006     10.7       10    1101000000
    k     0.0084      6.9        7    1010000
    l     0.0335      4.9        5    11101
    m     0.0235      5.4        6    110101
    n     0.0596      4.1        4    0001
    o     0.0689      3.9        4    1011
    p     0.0192      5.7        6    111001
    q     0.0008     10.3        9    110100001
    r     0.0508      4.3        5    11011
    s     0.0567      4.1        4    0011
    t     0.0706      3.8        4    1111
    u     0.0334      4.9        5    10101
    v     0.0069      7.2        8    11010001
    w     0.0119      6.4        7    1101001
    x     0.0073      7.1        7    1010001
    y     0.0164      5.9        6    101001
    z     0.0007     10.4       10    1101000001
    ␣     0.1928      2.4        2    01

(MacKay, p100. '␣' denotes the space character; the slide also shows the
corresponding Huffman tree.)

The Huffman code uses 4.15 bits/symbol, whereas H(X) = 4.11 bits (these
figures are checked in the sketch after this section). Encoding blocks
might close the narrow gap. More importantly, English characters are not
drawn independently: encoding blocks could be a better model.


Bigram statistics

Previous slide: A_X = {a–z, ␣}, H(X) = 4.11 bits.

Question: I decide to encode bigrams of English text:

    A_X' = {aa, ab, . . . , az, a␣, . . . , ␣␣}

What is H(X') for this new ensemble?

    A: ∼ 2 bits    B: ∼ 4 bits    C: ∼ 7 bits
    D: ∼ 8 bits    E: ∼ 16 bits   Z: ?
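Checking the 4.15 and 4.11 bits/symbol figures directly from the table
(a small sketch; the probabilities and code lengths are copied from the
table above, with '_' standing for the space character):

import math

# (p_i, l_i) pairs from the table above; '_' is the space character.
table = {
    'a': (0.0575, 4), 'b': (0.0128, 6), 'c': (0.0263, 5), 'd': (0.0285, 5),
    'e': (0.0913, 4), 'f': (0.0173, 6), 'g': (0.0133, 6), 'h': (0.0313, 5),
    'i': (0.0599, 4), 'j': (0.0006, 10), 'k': (0.0084, 7), 'l': (0.0335, 5),
    'm': (0.0235, 6), 'n': (0.0596, 4), 'o': (0.0689, 4), 'p': (0.0192, 6),
    'q': (0.0008, 9), 'r': (0.0508, 5), 's': (0.0567, 4), 't': (0.0706, 4),
    'u': (0.0334, 5), 'v': (0.0069, 8), 'w': (0.0119, 7), 'x': (0.0073, 7),
    'y': (0.0164, 6), 'z': (0.0007, 10), '_': (0.1928, 2),
}

H     = sum(p * math.log2(1.0 / p) for p, l in table.values())
E_len = sum(p * l for p, l in table.values())
print(f"H(X) = {H:.2f} bits, Huffman E[length] = {E_len:.2f} bits/symbol")
# Expect roughly 4.11 and 4.15 (up to rounding of the published p_i).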
Answering the previous vague question

We didn't completely define the ensemble: what are the probabilities?

We could draw characters independently using the p_i's found before. Then
a bigram is just two draws from X, often written X², and
H(X²) = 2H(X) = 8.22 bits.

We could instead draw pairs of adjacent characters from English text. When
predicting such a pair, how many effective choices do we have? More than
when we had A_X = {a–z, ␣}: we have to pick the first character and
another character. But the second choice is easier. We expect
H(X) < H(X') < 2H(X). Maybe 7 bits?

Looking at a large text file, the actual answer is about 7.6 bits, i.e.
≈ 3.8 bits/character: better compression than before (a counting sketch
follows at the end of this section). Shannon (1948) estimated about
2 bits/character for English text; Shannon (1951) estimated about
1 bit/character.

Compression performance results from the quality of a probabilistic model
and the compressor that uses it.


Human predictions

Ask people to guess letters in a newspaper headline:

    k·i·d·s· ·m·a·k·e· ·n·u·t·r·i·t·i·o·u·s· ·s·n·a·c·k·s
    11·4·2·1·1·4·2·4·1·1·15·5·1·2·1·1·1·1·2·1·1·16·7·1·1·1·1

The numbers show the number of guesses required by the 2010 class.


Predictions

⇒ The "effective number of choices", or entropy, varies hugely. We need to
be able to use a different probability distribution for every context.
Sometimes many letters in a row can be predicted at minimal cost: we need
to be able to use < 1 bit/character.

(MacKay Chapter 6 describes how numbers like those above could be used to
encode strings.)
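A sketch of how numbers like 4.1, 7.6 and 3.8 bits could be estimated by
simple counting. The file name 'corpus.txt' is a placeholder for any large
English text file, and this uses a plain plug-in estimate of the entropy:

import math
from collections import Counter

def entropy_bits(counts):
    # Plug-in entropy estimate, in bits, from a Counter of outcomes.
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# 'corpus.txt' is a placeholder for a large English text file.
with open('corpus.txt') as f:
    text = f.read().lower()
# Keep only a-z and space, as in the ensemble above.
text = ''.join(ch for ch in text if ch == ' ' or 'a' <= ch <= 'z')

unigrams = Counter(text)
bigrams  = Counter(text[i:i + 2] for i in range(len(text) - 1))

H1 = entropy_bits(unigrams)
H2 = entropy_bits(bigrams)
print(f"H(X)  ~ {H1:.2f} bits/character")
print(f"H(X') ~ {H2:.2f} bits/bigram = {H2 / 2:.2f} bits/character")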

Cliché Predictions / A more boring prediction game

"I have a binary string with bits that were drawn i.i.d. Predict away!"

What fraction of people, f, guess the next bit is '1'?

    Bit:  1      1      1      1      1    1    1    1
    f:    ≈ 1/2  ≈ 1/2  ≈ 1/2  ≈ 2/3  ...  ...  ...  ≈ 1

The source was genuinely i.i.d.: each bit was independent of past bits.
We, not knowing the underlying flip probability, learn from experience.
Our predictions depend on the past. So should our compression systems.


Product rule / Chain rule

    P(A, B | H) = P(A | H) P(B | A, H) = P(B | H) P(A | B, H)
                = P(A | H) P(B | H)    iff independent

Repeatedly applying the product rule:

    P(A, B, C, D, E) = P(A) P(B, C, D, E | A)
                     = P(A) P(B | A) P(C, D, E | A, B)
                     = P(A) P(B | A) P(C | A, B) P(D, E | A, B, C)
                     = P(A) P(B | A) P(C | A, B) P(D | A, B, C) P(E | A, B, C, D)

In general, for x = [x1, x2, . . . , xD]:

    P(x) = P(x1) Π_{d=2}^{D} P(x_d | x_{<d})
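The chain rule is what lets a compressor score a whole string one symbol
at a time. A minimal sketch with a made-up first-order (bigram) model,
i.e. one that drops all but the most recent symbol from each x_{<d}:

import math

# Hypothetical first-order model: P(x_d | x_{<d}) is approximated by
# P(x_d | x_{d-1}), i.e. only the previous symbol matters.
P1 = {'a': 0.5, 'b': 0.3, 'c': 0.2}              # P(x_1)
Pcond = {                                         # P(x_d | x_{d-1})
    'a': {'a': 0.1, 'b': 0.6, 'c': 0.3},
    'b': {'a': 0.7, 'b': 0.1, 'c': 0.2},
    'c': {'a': 0.3, 'b': 0.3, 'c': 0.4},
}

def log2_prob(x):
    # log2 P(x) = log2 P(x_1) + sum_d log2 P(x_d | x_{d-1})
    lp = math.log2(P1[x[0]])
    for prev, cur in zip(x, x[1:]):
        lp += math.log2(Pcond[prev][cur])
    return lp

x = "bacab"
print("log2 P(x) =", log2_prob(x))
print("information content =", -log2_prob(x), "bits")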


Revision of the product rule

An identity like P(A, B) = P(A) P(B | A) is true for any variables or
collections of variables A and B. You can be explicit about the current
situation by conditioning every term on any other collection of variables:
P(A, B | H) = P(A | H) P(B | A, H). You can also swap A and B throughout,
as these are arbitrary labels.

The second block of equations above shows repeated application of the
product rule. Each time, different groups of variables are chosen to be
the 'A', 'B' and 'H' in the identity. The multivariate distribution is
factored into a product of one-dimensional probability distributions. The
final line shows the same idea applied to a D-dimensional vector
x = [x1, x2, . . . , xD]ᵀ; that equation is true for the distribution of
any vector. In some probabilistic models we choose to drop some of the
high-dimensional dependencies x_{<d} in each term, for example if x
contains a time series and we believe only recent history affects what
will happen next.


Arithmetic Coding

For better diagrams and more detail, see MacKay Ch. 6.

We give all the strings a binary codeword. Huffman merged leaves, but we
have too many leaves to do that here. Instead, consider all possible
strings in alphabetical order and create a tree of strings 'top-down'.
(If infinities scare you: all strings up to some maximum length.)

Example: A_X = {a, b, c, □}, where '□' is a special End-of-File marker.

    a,  aa · · · ,  ab · · · ,  ac · · ·
    b,  ba · · · ,  bb · · · ,  bc · · ·
    c,  ca · · · ,  cb · · · ,  cc · · · ,  cccccc. . . cc

We could keep splitting into really short blocks of height P(string).
Arithmetic Coding

Both the string tree and the binary codewords index intervals ⊆ [0, 1].

Goal: encode 'bac□'. Navigate the string tree to find the string's
interval on the real line, then use the 'binary symbol code budget'
diagram (week 3 notes, or MacKay Fig 5.1, p96) to find a binary codeword
that lies entirely within this interval.

Overlay the string tree on the binary symbol code tree: a string such as
'bacabab. . .' is encoded with a bit string such as '0111...01'.

Walking through the example (in the diagrams we zoom in; in code we
rescale to avoid underflow):

    From the P(x1) distribution we can't begin to encode 'b' yet.
    Looking at P(x2 | x1 = b) we can't start encoding 'ba' either.
    Looking at P(x3 | ba), the message for 'bac' begins 1000.
    From P(x4 | bac), the encoding of 'bac□' starts 10000101...

1000010101 uniquely decodes to 'bac□'. (1000010110 would also work: a
slight inefficiency.)
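The other half of the encoder, sketched under the same made-up
distribution as above: find the shortest binary codeword whose dyadic
interval [c·2^(−l), (c+1)·2^(−l)) sits entirely inside the string's
interval. Taking l = ⌈log2 1/P(string)⌉ + 1 always suffices, which is
where the '+2 bits' bound below comes from.

import math
from fractions import Fraction as F

def codeword_in_interval(low, width):
    # Smallest l (and codeword c) with [c/2^l, (c+1)/2^l) inside [low, low+width).
    l = 0
    while True:
        l += 1
        step = F(1, 2 ** l)
        c = -(-low.numerator * 2 ** l // low.denominator)   # ceil(low / step)
        if (c + 1) * step <= low + width:                    # whole cell fits
            return format(c, 'b').zfill(l), l

# Interval computed for 'bac#' in the previous sketch: [337/1250, 340/1250)
low, width = F(337, 1250), F(3, 1250)
code, l = codeword_in_interval(low, width)
print("codeword:", code, "  length:", l)
print("bound: l <", math.log2(1 / float(width)) + 2)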

Arithmetic Coding

Tutorial homework: prove the encoding length is < log2(1/P(x)) + 2 bits.
For the example:

    2^(−l) > P(x = bac□)/4   ⇒   l < log2(1/P(x = bac□)) + 2 bits

An excess of 2 bits on the whole file (of millions or more bits?) is
negligible: arithmetic coding compresses very close to the information
content given by the probabilistic model used by both the sender and
receiver.

The conditional probabilities P(x_i | x_{<i}) can change for each symbol.
Arbitrary adaptive models can be used (if you have one); a sketch of one
such model follows at the end of this section.

Large blocks of symbols are compressed together: possibly your whole file.
The inefficiencies of symbol codes have been removed.

Huffman coding blocks of symbols requires an exponential number of
codewords. In arithmetic coding, each character is predicted one at a
time, as in a guessing game. The model and arithmetic coder just consider
those |A_X| options at a time; none of the code needs to enumerate huge
numbers of potential strings. (De)coding costs should be linear in the
message length.

Model probabilities P(x) might need to be rounded to values Q(x) that can
be represented consistently by the encoder and decoder. This approximation
introduces the usual average overhead: D_KL(P || Q).
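One concrete way to get a different distribution for every context, as an
adaptive model must, is to keep counts and update them identically in the
encoder and decoder. The sketch below uses add-one (Laplace) counts over
the previous symbol; this is an illustration of the idea, not the specific
model used in the course:

from collections import defaultdict

class AdaptiveBigramModel:
    # Predicts P(next symbol | previous symbol) from add-one counts.
    # Encoder and decoder update counts identically as symbols are
    # processed, so they always agree on the next distribution.
    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        self.counts = defaultdict(lambda: {a: 1 for a in self.alphabet})

    def predict(self, prev):
        c = self.counts[prev]
        total = sum(c.values())
        return {a: c[a] / total for a in self.alphabet}

    def update(self, prev, sym):
        self.counts[prev][sym] += 1

model = AdaptiveBigramModel('abc#')      # '#' again stands in for EOF
prev = None                              # start-of-message context
for sym in 'bacab#':
    p = model.predict(prev)              # distribution fed to the coder
    print(f"P({sym!r} | {prev!r}) = {p[sym]:.3f}")
    model.update(prev, sym)
    prev = sym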
AC and sparse files

Finally we have a practical coding algorithm for sparse files. The initial
code bit 0 encodes many initial message 0s: notice how the first binary
code bits will locate the first 1. Something like run-length encoding has
dropped out. (A numerical sketch follows after the Dasher notes below.)


Non-binary encoding

We can overlay the string tree on any other indexing of the [0, 1] line.
Using the same string tree, we now know how to compress into an output
alphabet such as {α, β, γ}. (You could make a better picture!)


Dasher

Dasher is an information-efficient text-entry interface: gestures specify
which string we want.
http://www.inference.phy.cam.ac.uk/dasher/
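A standalone sketch of the sparse-file point above: for an i.i.d.
Bernoulli(0.01) file the information content is about 0.08 bits per input
bit, so an arithmetic coder, which stays within about 2 bits of the file's
information content, needs far less than the 1 bit/symbol any symbol code
must spend. The file size and probability are arbitrary choices:

import math

def h2(p):
    # Binary entropy in bits
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

N, p = 10_000, 0.01                 # a sparse file: mostly 0s, a few 1s
expected_info = N * h2(p)           # E[log2 1/P(x)] for the whole file
print(f"H2({p}) = {h2(p):.4f} bits per input bit")
print(f"expected arithmetic-code length < {expected_info + 2:.0f} bits")
print(f"any symbol code on single bits needs >= {N} bits")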
