
18-753: Information Theory and Coding

Lecture 1: Introduction and Probability Review

Prof. Pulkit Grover

CMU

January 16, 2018


Bits and Information Theory

• More often than not, today’s information is measured in bits.

• Why?

• Is it ok to represent signals as different as audio and video in the same currency, bits?

• Also, why do I communicate over long distances with the exact same currency?
Video

• Video signals are made up of colors that vary in space and time.

• Even if we are happy with pixels on a screen, how do we know that all these colors are optimally described by bits?
Audio

• Audio signals can be thought of as an amplitude that varies continuously with time.

• How can we optimally represent a continuous signal in something discrete like bits?

[Figure: a continuous audio waveform, amplitude varying over time.]
Just a Single Bit

• These (and many other) signals can all be optimally represented by bits.

• Information theory explains why and how.

• For now, let’s say we have a single bit that we really want to communicate to our friend. Let’s say it represents whether or not it is going to snow tonight.
Communication over Noisy Channels

• Let’s say we have to transmit our bit over the air using wireless communication.

• We have a transmit antenna and our friend has a receive antenna.

• We’ll send out a negative amplitude (say −1) on our antenna when the bit is 0 and a positive amplitude (say 1) when the bit is 1.

• Unfortunately, there are other signals in the air (natural and man-made) so the receiver sees a noisy version of our transmission.

• If there are lots of these little effects, then the central limit theorem tells us we can just model them as a single Gaussian random variable.
Received Signal

• Here is the probability density function of the received signal. No longer a clean −1 or 1.

• You can prove that the best thing to do now is just decide that it’s −1 if the signal is below 0 and 1 otherwise.

• But that leaves us with a probability of error.

[Figure: probability density of the received signal, plotted from −4 to 4.]
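A minimal simulation sketch of this setup, assuming Python with NumPy; the ±1 signaling and the decide-at-zero rule come from the slides, while the noise level and sample size are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    n_bits = 100_000
    sigma = 1.0                              # noise standard deviation (illustrative choice)

    bits = rng.integers(0, 2, n_bits)        # the bits we want to send
    x = 2 * bits - 1                         # map 0 -> -1, 1 -> +1
    y = x + sigma * rng.normal(size=n_bits)  # received signal = transmitted + Gaussian noise

    bits_hat = (y > 0).astype(int)           # decide 1 if above 0, else 0
    p_error = np.mean(bits_hat != bits)
    print(f"empirical probability of error: {p_error:.4f}")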
Received Signal

• One thing we can do is boost our transmit power.

• The received signal will look less and less noisy.

[Figure: probability density of the received signal as transmit power increases.]
Repetition Coding

• What if we can’t arbitrarily increase our transmit power?

• We can just repeat our bit many times! For example, if we have a 0, just send −1, −1, −1, . . . , −1 and take a majority vote.

• Now we can get the probability of error to fall with the number of repetitions.

• But the rate of incoming bits quickly goes to zero. Can we do better?
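A rough sketch of repetition coding with a majority vote, under the same assumed Python/NumPy setup; the channel model and noise level are illustrative, not prescribed by the slides:

    import numpy as np

    rng = np.random.default_rng(1)

    def error_rate(n_reps, n_bits=50_000, sigma=1.0):
        """Send each bit n_reps times over an AWGN channel and majority-vote."""
        bits = rng.integers(0, 2, n_bits)
        x = np.repeat(2 * bits - 1, n_reps).reshape(n_bits, n_reps)
        y = x + sigma * rng.normal(size=x.shape)
        votes = (y > 0).sum(axis=1)               # count how many copies look like a 1
        bits_hat = (votes > n_reps / 2).astype(int)
        return np.mean(bits_hat != bits)

    for n in (1, 3, 5, 11):
        print(n, error_rate(n))   # error falls with n, but the rate is only 1/n bits per use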
Point-to-Point Communication

[Block diagram: message w → ENC → Xⁿ → noisy channel (additive noise Zⁿ) → Yⁿ → DEC → estimate ŵ]

• We know the capacity of a Gaussian channel:

      C = (1/2) log(1 + SNR)  bits per channel use

• Proved by Claude Shannon in 1948.

• What does this mean?
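The capacity formula above is easy to evaluate numerically; a small helper is sketched below (Python assumed; the SNR values are arbitrary):

    import math

    def awgn_capacity(snr):
        """C = 0.5 * log2(1 + SNR), in bits per channel use."""
        return 0.5 * math.log2(1 + snr)

    for snr_db in (0, 10, 20, 30):
        snr = 10 ** (snr_db / 10)          # convert dB to a linear power ratio
        print(f"SNR = {snr_db:2d} dB  ->  C = {awgn_capacity(snr):.2f} bits/use")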
The Benefits of Blocklength

• We can’t predict what the noise is going to be in a single channel use.

• But we do know that in the long run the noise is going to behave a certain way.

• For example, if a given channel flips bits with probability 0.1, then in the long run approximately 1/10 of the bits will be flipped.

[Figure from Cover and Thomas, Elements of Information Theory.]
Capacity

• So if we are willing to allow for some delay, we can communicate reliably with positive rate!

• Capacity is actually very simple to calculate using the mutual information:

      C = max_{p(x)} I(X; Y)
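As a preview of that optimization, the sketch below numerically maximizes I(X; Y) over input distributions p(x) for a binary symmetric channel with crossover probability 0.1; the channel and the grid search are illustrative choices, not part of the lecture:

    import math

    def mutual_information(p1, eps):
        """I(X;Y) for X ~ Bernoulli(p1) sent through a binary symmetric channel
        with crossover probability eps."""
        px = [1 - p1, p1]
        pygx = [[1 - eps, eps], [eps, 1 - eps]]          # channel p(y|x): flip with prob. eps
        py = [sum(px[x] * pygx[x][y] for x in range(2)) for y in range(2)]
        I = 0.0
        for x in range(2):
            for y in range(2):
                pxy = px[x] * pygx[x][y]
                if pxy > 0:
                    I += pxy * math.log2(pxy / (px[x] * py[y]))
        return I

    eps = 0.1
    best = max((mutual_information(p1 / 1000, eps), p1 / 1000) for p1 in range(1, 1000))
    print(best)   # maximized near p(x=1) = 0.5, giving C = 1 - H(0.1) ≈ 0.531 bits/use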
What is Information Theory?

• A powerful mathematical framework that allows us to determine the fundamental limits of information compression, storage, processing, communication, and use.

• Provides the theoretical underpinnings as to why today’s networks are completely digital.

• Unlike many other classes, we will strive to understand “why” through full proofs.

• As initially formulated, information theory ignores the semantics of the message. We will touch upon how it is being extended to incorporate semantics.
Why is it taught in an engineering department?

• Imagine if the proof that the speed of light c is the maximum attainable speed also gave us a deep understanding of how to build a starship that can approach this speed. And that we were able to build such a starship within 50 years from the theory’s first appearance!

• That is exactly what information theory has given us for compression and communication.

• Example: Verizon spent 9.4 billion dollars on just 22MHz of spectrum. Very important to them (and you) to get the highest possible data rates.
Organizational Details

• This is 18-753: Information Theory and Coding.
• Taught by Prof. Pulkit Grover.
• Pre-requisites: Fluency in probability and mathematical maturity.
• Course Ingredients: 8 Homeworks, 2 Midterms, and a course project.
• Homework 1 out tonight, due next Thursday in class. (Just requires probability.)
• Required textbook: Cover & Thomas, Elements of Information Theory, 2nd edition. Available online at http://onlinelibrary.wiley.com/book/10.1002/047174882X (log in through the CMU library).
• Office Hours: TBD.
Information Theory

• Consider the following (hypothetical) interactions between two students at CMU.

• A: Have you ever been to the Museum of Natural History?
  B: Yes.

• A: Have you ever been to the moon?
  B: No.

• Both questions had two possible answers. Which interaction conveyed more information?

• The “amount of information” in an event is related to how likely the event is.
Compression and Communication

• Information Theory is the science of measuring information.

• This science has a profound impact on sensing, compression, storage, extraction, processing, and communication of information.
  • Compressing data such as audio, images, movies, text, sensor measurements, etc. (Example: We will see the principle behind the ‘zip’ compression algorithm in this course.)
  • Communicating data over noisy channels such as wires, wireless links, memory (e.g. hard disks), etc.

• Specifically, we will be interested in determining the fundamental limits of compression and communication. This will shed light on how to engineer near-optimal systems.

• We will use probability as a “language” to describe and derive these limits.

• Information Theory has strong connections to Statistics, Physics, Computer Science, and many other disciplines.

• This course is, to a degree, not about the meaning of information (i.e. semantics), although that is a philosophical undercurrent that we will examine, question, and understand.
General Setting

• Information Source: Data we want to send (e.g. a movie).
• Noisy Channel: Communication medium (e.g. a wire).
• Encoder: Maps source into a channel codeword (signal).
• Decoder: Reconstructs source from channel output (signal).
• Fidelity Criterion: Measures quality of the source reconstruction.

[Block diagram: Source → Encoder → Noisy Channel → Decoder → Source Reconstruction]

• Goal: Transmit at the highest rate possible while meeting the fidelity criterion.
• Example: Maximize frames/second while keeping mean-squared error below 1%.

• Look Ahead: We will see a theoretical justification for the layered protocol architecture of communication networks (combine optimal compression with optimal communication).
Bits: The currency of information

• As we will see, bits are a universal currency of information (for a single sender and a single receiver).

• Specifically, when we talk about sources, we often describe their size in bits. Example: A small JPEG is around 100kB.

• Also, when we talk about channels, we often mention what rate they can support. Example: A dial-up modem can send 14.4kb/sec.
High Dimensions

• Much of information theory relies on high-dimensional thinking. That is, to compress and communicate data close to the fundamental limits, we will need to operate over long blocklengths.

• This is on its face an extremely complex problem: it is nearly impossible to “guess and check” solutions.

• Using probability, we will be able to reason about the existence (or non-existence) of good schemes. This will give us insight into how to actually construct near-optimal schemes.

• Along the way, you will develop a lot of intuition for how high-dimensional random vectors behave.

• We will now review some basic elements of probability that we will need for the course.
Probability Review: Events

Elements of a Probability Space (Ω, F, P):

1. Sample space Ω = {ω1, ω2, . . .} whose elements correspond to the possible outcomes.
2. Family of events F = {E1, E2, . . .}, which is a collection of subsets of Ω. We say that the event Ei occurs if the outcome ωj is an element of Ei.
3. Probability measure P : F → R+, a function that satisfies
   (i) P(∅) = 0.
   (ii) P(Ω) = 1.
   (iii) If Ei ∩ Ej = ∅ (i.e. Ei and Ej are disjoint) for all i ≠ j, then

         P(∪_{i=1}^∞ Ei) = Σ_{i=1}^∞ P(Ei).
Probability Review: Events

Union of Events:

• P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2). (Venn Diagram)

• More generally, we have the inclusion-exclusion principle:

      P(∪_{i=1}^n Ei) = Σ_{i=1}^n P(Ei) − Σ_{i<j} P(Ei ∩ Ej) + Σ_{i<j<k} P(Ei ∩ Ej ∩ Ek) − · · ·
                        + (−1)^{ℓ+1} Σ_{i1<i2<···<iℓ} P(∩_{m=1}^ℓ E_{iₘ}) + · · ·

• Very difficult to calculate, so we often rely on the union bound (checked numerically in the sketch below):

      P(∪_{i=1}^n Ei) ≤ Σ_{i=1}^n P(Ei).
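A tiny numerical check of the union bound (Python assumed; the three overlapping events on a fair die are a made-up example):

    from fractions import Fraction

    omega = range(1, 7)                       # a fair six-sided die
    events = [{1, 2}, {2, 3, 4}, {4, 5}]      # three overlapping events

    def prob(E):
        return Fraction(len(E), len(omega))

    exact = prob(set().union(*events))        # P(E1 ∪ E2 ∪ E3)
    bound = sum(prob(E) for E in events)      # union bound
    print(exact, "<=", bound)                 # 5/6 <= 7/6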
Probability Review: Events

Independence:

• Two events E1 and E2 are independent if

      P(E1 ∩ E2) = P(E1)P(E2).

• The events E1, . . . , En are mutually independent (or just independent) if, for all subsets I ⊆ {1, . . . , n},

      P(∩_{i∈I} Ei) = Π_{i∈I} P(Ei).
Probability Review: Events

Conditional Probability:

• The conditional probability that event E1 occurs given that event E2 occurs is

      P(E1 | E2) = P(E1 ∩ E2) / P(E2).

  Note that this is well-defined only if P(E2) > 0.

• Notice that if E1 and E2 are independent and P(E2) > 0,

      P(E1 | E2) = P(E1 ∩ E2) / P(E2) = P(E1)P(E2) / P(E2) = P(E1).
Probability Review: Events

Law of Total Probability:

• If E1, E2, . . . are disjoint events such that Ω = ∪_{i=1}^∞ Ei, then for any event A

      P(A) = Σ_{i=1}^∞ P(A ∩ Ei) = Σ_{i=1}^∞ P(A | Ei)P(Ei).

Bayes’ Law:

• If E1, E2, . . . are disjoint events such that Ω = ∪_{i=1}^∞ Ei, then for any event A

      P(Ej | A) = P(A | Ej)P(Ej) / Σ_{i=1}^∞ P(A | Ei)P(Ei).
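A small worked example of the total-probability and Bayes' law formulas (Python; all of the probabilities below are invented for illustration, loosely echoing the snow-bit example from earlier):

    # Hypothetical numbers: it snows with probability 0.3; the forecast says "snow"
    # with probability 0.8 when it will snow and 0.1 when it will not.
    p_snow = 0.3
    p_forecast_given_snow = 0.8
    p_forecast_given_no_snow = 0.1

    # Law of total probability: P(forecast says snow)
    p_forecast = (p_forecast_given_snow * p_snow
                  + p_forecast_given_no_snow * (1 - p_snow))

    # Bayes' law: P(snow | forecast says snow)
    p_snow_given_forecast = p_forecast_given_snow * p_snow / p_forecast
    print(p_snow_given_forecast)   # 0.24 / 0.31 ≈ 0.774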
Probability Review: Random Variables

• A random variable X on a sample space Ω is a real-valued function, X : Ω → R.
• Cumulative Distribution Function (cdf): FX(x) = P(X ≤ x).

Discrete Random Variables:

• X is discrete if it only takes values on a countable subset X of R.
• Probability Mass Function (pmf): For discrete random variables, we can define the pmf pX(x) = P(X = x).
• Example 1: Bernoulli with parameter q.

      pX(x) = 1 − q  if x = 0,   q  if x = 1.

• Example 2: Binomial with parameters n and q.

      pX(k) = (n choose k) q^k (1 − q)^(n−k),   k = 0, 1, . . . , n
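A direct transcription of these two pmfs as code (Python assumed; the parameter values in the checks are arbitrary):

    from math import comb

    def bernoulli_pmf(x, q):
        """pX(x) for a Bernoulli(q) random variable."""
        return {0: 1 - q, 1: q}.get(x, 0.0)

    def binomial_pmf(k, n, q):
        """pX(k) = C(n, k) q^k (1 - q)^(n - k) for k = 0, 1, ..., n."""
        if not 0 <= k <= n:
            return 0.0
        return comb(n, k) * q**k * (1 - q)**(n - k)

    print(bernoulli_pmf(1, 0.3))                              # 0.3
    print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))   # sums to 1.0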
Probability Review: Random Variables

Continuous Random Variables:

• A random variable X is called continuous if there exists a nonnegative function fX(x) such that

      P(a < X ≤ b) = ∫_a^b fX(x) dx   for all −∞ < a < b < ∞.

  This function fX(x) is called the probability density function (pdf) of X.
Probability Review: Random Variables

• Example 1: Uniform with parameters a and b.

      fX(x) = 1/(b − a)  if a < x ≤ b,   0  otherwise.

• Example 2: Gaussian with parameters µ and σ².

      fX(x) = (1/√(2πσ²)) e^(−(x−µ)²/(2σ²))

• Example 3: Exponential with parameter λ.

      fX(x) = λe^(−λx)  if x ≥ 0,   0  if x < 0.
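The same three densities transcribed as code (Python assumed; the spot-check value is arbitrary):

    import math

    def uniform_pdf(x, a, b):
        return 1 / (b - a) if a < x <= b else 0.0

    def gaussian_pdf(x, mu, sigma2):
        return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

    def exponential_pdf(x, lam):
        return lam * math.exp(-lam * x) if x >= 0 else 0.0

    print(gaussian_pdf(0.0, 0.0, 1.0))   # 1/sqrt(2*pi) ≈ 0.3989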
Probability Review: Random Variables

Expectation:
• Discrete rvs: E[g(X)] = Σ_{x∈X} g(x) pX(x)
• Continuous rvs: E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx

Special Cases

Mean:
• Discrete rvs: E[X] = Σ_{x∈X} x pX(x)
• Continuous rvs: E[X] = ∫_{−∞}^{∞} x fX(x) dx

      Distribution:  Bernoulli   Binomial   Uniform    Gaussian   Exponential
      Mean:          p           np         (a+b)/2    µ          1/λ
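A quick Monte Carlo check of this table of means (Python with NumPy assumed; all parameter values are arbitrary, and note that NumPy's normal() takes the standard deviation rather than the variance):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 1_000_000

    print(rng.binomial(1, 0.3, n).mean())        # Bernoulli:   ≈ p = 0.3
    print(rng.binomial(10, 0.3, n).mean())       # Binomial:    ≈ np = 3.0
    print(rng.uniform(2, 6, n).mean())           # Uniform:     ≈ (a+b)/2 = 4.0
    print(rng.normal(1.5, 2.0, n).mean())        # Gaussian:    ≈ mu = 1.5 (scale = std dev)
    print(rng.exponential(1 / 0.5, n).mean())    # Exponential: ≈ 1/lambda = 2.0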
Probability Review: Random Variables

Special Cases Continued

mth Moment:
• Discrete rvs: E[X^m] = Σ_{x∈X} x^m pX(x)
• Continuous rvs: E[X^m] = ∫_{−∞}^{∞} x^m fX(x) dx

Variance:
• Var(X) = E[(X − E[X])²] = E[X²] − (E[X])²

      Distribution:  Bernoulli   Binomial    Uniform      Gaussian   Exponential
      Variance:      p(1 − p)    np(1 − p)   (b−a)²/12    σ²         1/λ²
Probability Review: Collections of Random Variables

Pairs of Random Variables (X, Y):

• Joint cdf: FXY(x, y) = P(X ≤ x, Y ≤ y)
• Joint pmf: pXY(x, y) = P(X = x, Y = y) (for discrete rvs)
• Joint pdf: If fXY satisfies

      P(a < X ≤ b, c < Y ≤ d) = ∫_a^b ∫_c^d fXY(x, y) dy dx

  for all −∞ < a < b < ∞ and −∞ < c < d < ∞, then fXY is called the joint pdf of (X, Y). (for continuous rvs)
• Marginalization:

      pY(y) = Σ_{x∈X} pXY(x, y)
      fY(y) = ∫_{−∞}^{∞} fXY(x, y) dx
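A tiny illustration of marginalizing a joint pmf (Python; the joint table is made up):

    # A made-up joint pmf over X in {0, 1} and Y in {0, 1, 2}.
    p_xy = {
        (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.05, (1, 1): 0.25, (1, 2): 0.30,
    }

    # Marginal of Y: sum the joint pmf over x.
    p_y = {}
    for (x, y), p in p_xy.items():
        p_y[y] = p_y.get(y, 0.0) + p

    print(p_y)   # ≈ {0: 0.15, 1: 0.45, 2: 0.40}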
Probability Review: Collections of Random Variables

n-tuples of Random Variables (X1, . . . , Xn):

• Joint cdf: FX1···Xn(x1, . . . , xn) = P(X1 ≤ x1, . . . , Xn ≤ xn)
• Joint pmf: pX1···Xn(x1, . . . , xn) = P(X1 = x1, . . . , Xn = xn)
• Joint pdf: If fX1···Xn satisfies

      P(a1 < X1 ≤ b1, . . . , an < Xn ≤ bn) = ∫_{a1}^{b1} · · · ∫_{an}^{bn} fX1···Xn(x1, . . . , xn) dxn · · · dx1

  for all −∞ < ai < bi < ∞, then fX1···Xn is called the joint pdf of (X1, . . . , Xn).
Probability Review: Collections of Random Variables

Independence of Random Variables:

• X1, . . . , Xn are independent if

      FX1···Xn(x1, . . . , xn) = FX1(x1) · · · FXn(xn)   for all x1, x2, . . . , xn

• Equivalently, we can just check if

      pX1···Xn(x1, . . . , xn) = pX1(x1) · · · pXn(xn)   (discrete rvs)
      fX1···Xn(x1, . . . , xn) = fX1(x1) · · · fXn(xn)   (continuous rvs)
Probability Review: Collections of Random Variables

Conditional Probability Densities:

• Given discrete rvs X and Y with joint pmf pXY(x, y), the conditional pmf of X given Y = y is defined to be

      pX|Y(x|y) = pXY(x, y) / pY(y)  if pY(y) > 0,   0  otherwise.

• Given continuous rvs X and Y with joint pdf fXY(x, y), the conditional pdf of X given Y = y is defined to be

      fX|Y(x|y) = fXY(x, y) / fY(y)  if fY(y) > 0,   0  otherwise.

• Note that if X and Y are independent, then pX|Y(x|y) = pX(x) or fX|Y(x|y) = fX(x).
Probability Review: Collections of Random Variables

Linearity of Expectation:
• E[a1 X1 + · · · + an Xn ] = a1 E[X1 ] + · · · + an E[Xn ] even if the Xi
are dependent.

Expectation of Products:
• If X1 , . . . , Xn are independent, then
E[g1 (X1 ) · · · gn (Xn )] = E[g1 (X1 )] · · · E[gn (Xn )] for any
deterministic functions gi .
Probability Review: Collections of Random Variables

Conditional Expectation:
• Discrete rvs: E[g(X) | Y = y] = Σ_{x∈X} g(x) pX|Y(x|y)
• Continuous rvs: E[g(X) | Y = y] = ∫_{−∞}^{∞} g(x) fX|Y(x|y) dx

• E[Y | X = x] is a number. This number can be interpreted as a function of x.
• E[Y | X] is a random variable. It is in fact a function of the random variable X. (Note: A function of a random variable is a random variable.)
• Lemma: E_X[ E[Y | X] ] = E[Y].
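A numerical check of the lemma E_X[E[Y|X]] = E[Y] (Python with NumPy assumed; the joint model below is invented for illustration):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000

    # Invented model: X ~ Bernoulli(0.3) and Y | X = x ~ Gaussian with mean 2x and variance 1.
    x = rng.binomial(1, 0.3, n)
    y = 2 * x + rng.normal(size=n)

    # E[Y | X] takes one value per value of X; estimate it from the samples.
    e_y_given_x = np.array([y[x == v].mean() for v in (0, 1)])
    p_x = np.array([np.mean(x == 0), np.mean(x == 1)])

    print(y.mean(), (e_y_given_x * p_x).sum())   # both ≈ E[Y] = 2 * 0.3 = 0.6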
Probability Review: Collections of Random Variables

Conditional Independence:
• X and Y are conditionally independent given Z if

pXY |Z (x, y|z) = pX|Z (x|z)pY |Z (y|z) (discrete rvs)


fXY |Z (x, y|z) = fX|Z (x|z)fY |Z (y|z) (continuous rvs)
Probability Review: Collections of Random Variables

Markov Chains:

• Random variables X, Y, and Z are said to form a Markov chain X → Y → Z if the conditional distribution of Z depends only on Y and is conditionally independent of X.
• Specifically, the joint pmf (or pdf) can be factored as

      pXYZ(x, y, z) = pX(x) pY|X(y|x) pZ|Y(z|y)   (discrete rvs)
      fXYZ(x, y, z) = fX(x) fY|X(y|x) fZ|Y(z|y)   (continuous rvs)

• X → Y → Z if and only if X and Z are conditionally independent given Y.
• X → Y → Z implies Z → Y → X (and vice versa).
• If Z is a deterministic function of Y, i.e. Z = g(Y), then X → Y → Z automatically.
Probability Review: Inequalities

Convexity:

• A set X ⊆ Rⁿ is convex if, for every x1, x2 ∈ X and λ ∈ [0, 1], we have that λx1 + (1 − λ)x2 ∈ X.
• A function g on a convex set X is convex if, for every x1, x2 ∈ X and λ ∈ [0, 1], we have that

      g(λx1 + (1 − λ)x2) ≤ λg(x1) + (1 − λ)g(x2).

• A function g is concave if −g is convex.

Jensen’s Inequality:

• If g is a convex function and X is a random variable, then

      g(E[X]) ≤ E[g(X)]
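A quick numerical check of Jensen's inequality with the convex function g(x) = x² (Python with NumPy assumed; the distribution of X is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(2.0, 1_000_000)    # an arbitrary non-degenerate distribution, mean 2

    g = np.square                          # g(x) = x^2 is convex
    print(g(x.mean()), "<=", g(x).mean())  # g(E[X]) ≈ 4  <=  E[g(X)] ≈ 8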
Probability Review: Inequalities

Markov’s Inequality:

• Let X be a non-negative random variable. For any t > 0,

      P(X ≥ t) ≤ E[X] / t.

Chebyshev’s Inequality:

• Let X be a random variable. For any ϵ > 0,

      P(|X − E[X]| > ϵ) ≤ Var(X) / ϵ².

(You will prove these in Homework 1.)
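A numerical sanity check of both bounds (Python with NumPy assumed; the exponential distribution and the thresholds are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(1.0, 1_000_000)    # non-negative, mean 1, variance 1

    t, eps = 3.0, 2.0
    print(np.mean(x >= t), "<=", x.mean() / t)                           # Markov:    ≈ 0.050 <= 0.333
    print(np.mean(np.abs(x - x.mean()) > eps), "<=", x.var() / eps**2)   # Chebyshev: ≈ 0.050 <= 0.25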


Probability Review: Inequalities

Weak Law of Large Numbers (WLLN):

• Let Xi be a sequence of independent and identically distributed (i.i.d.) random variables with finite mean, µ = E[Xi] < ∞.

• Define the sample mean X̄n = (1/n) Σ_{i=1}^n Xi.

• For any ϵ > 0, the WLLN implies that

      lim_{n→∞} P(|X̄n − µ| > ϵ) = 0.

• That is, the sample mean converges (in probability) to the true mean.
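A simulation sketch of the sample mean concentrating around µ (Python with NumPy assumed; the Bernoulli(0.1) choice echoes the bit-flipping channel from earlier and is otherwise arbitrary):

    import numpy as np

    rng = np.random.default_rng(6)
    mu, eps = 0.1, 0.02

    for n in (10, 100, 1_000, 10_000):
        # 5000 independent sample means, each over n i.i.d. Bernoulli(0.1) variables
        sample_means = rng.binomial(n, mu, 5_000) / n
        print(n, np.mean(np.abs(sample_means - mu) > eps))   # -> 0 as n grows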
Probability Review: Inequalities

Strong Law of Large Numbers (SLLN):

• Let Xi be a sequence of independent and identically distributed (i.i.d.) random variables with finite mean, µ = E[Xi] < ∞.

• Define the sample mean X̄n = (1/n) Σ_{i=1}^n Xi.

• The SLLN implies that

      P( lim_{n→∞} X̄n = µ ) = 1.

• That is, the sample mean converges (almost surely) to the true mean.
