8 Conditional Expectation
∙ Conditioning on an event
∙ Law of total expectation
∙ Conditional expectation as a r.v.
∙ Iterated expectation
∙ Conditional variance
∙ MSE estimation
∙ Quantization
∙ Summary
Conditioning on an event
∙ Example: Let X ∼ Exp(λ) and A = (a, ∞) for some constant a > 0.
  Find the conditional pdf of X given {X ∈ A}
∙ The conditional pdf is
      f_{X|A}(x) = λ e^{−λx} / P{X > a}   for x > a,  and 0 otherwise
  Since P{X > a} = e^{−λa}, this simplifies to
      f_{X|A}(x) = λ e^{−λ(x−a)}   for x > a,  and 0 otherwise
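∙ A quick Monte Carlo check of this conditional pdf (a sketch; the values of λ and a below are
  arbitrary illustrative choices, not from the slides):

    # Given {X > a}, the slide says X - a ~ Exp(lam); check via the conditional mean.
    import numpy as np

    rng = np.random.default_rng(0)
    lam, a = 2.0, 1.5
    x = rng.exponential(scale=1 / lam, size=1_000_000)
    cond = x[x > a]                    # samples from the conditional distribution given {X > a}
    print(np.mean(cond - a), 1 / lam)  # both should be close to 0.5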
Conditional expectation
∙ Since fX|A (x) is a pdf on X, we can define the conditional expectation
of a function g(X) given A as
      E(g(X) | A) = ∫_{−∞}^{∞} g(x) f_{X|A}(x) dx
  (Recall: f_Y(y) = dF_Y(y)/dy, and p_Z(z) is given by the steps of F_Z(z))
Conditional expectation as a r.v.
∙ Let (X, Y) ∼ fX,Y (x, y). We defined the conditional pdf of X given {Y = y} as
      f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y),   f_Y(y) ≠ 0
∙ We know that fX|Y (x|y) is a pdf for X (for any given y), so we can define
the expectation of any function g(X, Y) with respect to fX|Y (x|y) as
      E(g(X, Y) | Y = y) = ∫_{−∞}^{∞} g(x, y) f_{X|Y}(x|y) dx
∙ We can similarly define this conditional expectation when both r.v.s are discrete,
  or when one is discrete and the other continuous
Example
∙ Let
      f_{X,Y}(x, y) = 2   for x, y ≥ 0, x + y ≤ 1,  and 0 otherwise
Find E(X | Y = y) and E(XY | Y = y)
∙ We already know (from an earlier slide set) that
      f_Y(y) = 2(1 − y)   for 0 ≤ y ≤ 1,  and 0 otherwise
  Hence,
      f_{X|Y}(x|y) = f_{X,Y}(x, y) / f_Y(y) = 1/(1 − y)   for x, y ≥ 0, x + y < 1,  and 0 otherwise
  Thus
      E(X | Y = y) = ∫_0^{1−y} x ⋅ 1/(1 − y) dx = (1 − y)/2,   0 ≤ y < 1
  Now to find E(XY | Y = y), note that
      E(XY | Y = y) = E(X ⋅ y | Y = y) = y E(X | Y = y) = y(1 − y)/2,   0 ≤ y < 1
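∙ A Monte Carlo sanity check of E(X | Y = y) = (1 − y)/2 (a sketch: it conditions on a thin band
  |Y − y| < ε rather than on the exact event {Y = y}; the values y = 0.4 and ε = 0.01 are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=(2_000_000, 2))
    pts = u[u.sum(axis=1) <= 1]        # rejection sampling: uniform on {x, y >= 0, x + y <= 1}
    x, y = pts[:, 0], pts[:, 1]

    y0, eps = 0.4, 0.01
    band = np.abs(y - y0) < eps
    print(x[band].mean())              # should be close to (1 - 0.4)/2 = 0.3
    print((x[band] * y[band]).mean())  # E(XY | Y ~ y0), close to 0.4*(1 - 0.4)/2 = 0.12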
Conditional expectation as a r.v.
∙ Since the conditional expectation E(g(X, Y) | Y = y) is a function of y,
we can define the random variable E(g(X, Y) | Y) as a function of Y
∙ In particular, the r.v. E(X | Y) is the conditional expectation of X given Y
∙ For the previous example, find the pdf of E(X | Y) = (1 − Y)/2
∙ The pdf of Y is f_Y(y) = 2(1 − y) for 0 ≤ y ≤ 1, and 0 otherwise
  We want to find the pdf of Z = E(X | Y) = (1 − Y)/2
  We use the formula for the pdf of a linear function Z = aY + b,
      f_Z(z) = (1/|a|) f_Y((z − b)/a)   with a = −1/2, b = 1/2
             = 2 × 2(1 − (1 − 2z))
             = 8z   for 0 < z ≤ 1/2,  and 0 otherwise
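∙ A quick check that Z = (1 − Y)/2 indeed has density 8z on (0, 1/2] (a sketch comparing an
  empirical histogram against the formula):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=(2_000_000, 2))
    y = u[u.sum(axis=1) <= 1][:, 1]    # Y from the uniform-on-the-triangle example
    z = (1 - y) / 2

    hist, edges = np.histogram(z, bins=20, range=(0, 0.5), density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    print(np.max(np.abs(hist - 8 * centers)))   # small (well below 0.1)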
Mean of conditional expectation
∙ Example: Consider our running example with E(X | Y) = (1 − Y)/2
  We know that f_Y(y) = 2(1 − y), 0 ≤ y ≤ 1, so we can find the expectation as
      E_Y[E(X | Y)] = E_Y[(1 − Y)/2] = ∫_0^1 ((1 − y)/2) ⋅ 2(1 − y) dy = ∫_0^1 (1 − y)² dy = 1/3
∙ We also know that f_X(x) = 2(1 − x) for 0 ≤ x ≤ 1, hence
      E(X) = ∫_0^1 x ⋅ 2(1 − x) dx = 1 − 2/3 = 1/3
∙ Hence for this example, E_Y[E(X | Y)] = E(X)
∙ It turns out that this equality holds for every pair of r.v.s (X, Y)!
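∙ Numerical check of the two integrals above (a sketch using a simple midpoint rule):

    import numpy as np

    n = 200_000
    t = np.linspace(0, 1, n, endpoint=False) + 0.5 / n      # midpoints of [0, 1]
    lhs = np.mean(((1 - t) / 2) * 2 * (1 - t))              # E_Y[E(X|Y)] = ∫ (1-y)/2 · 2(1-y) dy
    rhs = np.mean(t * 2 * (1 - t))                          # E(X) = ∫ x · 2(1-x) dx
    print(lhs, rhs)                                         # both ~ 1/3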
Iterated expectation
Examples
∙ Let (X, Y) ∼ fX,Y (x, y) and fX|Y (x|y) be the conditional pdf of X given Y
∙ Using fX|Y (x|y), we can define the conditional expectation of g(X) given {Y = y}
as E(g(X) | Y = y)
∙ The conditional expectation E(g(X) | Y) is a r.v. that takes values E(g(X) | Y = y)
∙ Iterated expectation: EY [E(g(X) | Y)] = E(g(X))
∙ In particular, for g(X) = X, the conditional expectation of X given Y is E(X | Y)
and EY [E(X | Y)] = E(X)
∙ Up next:
Conditional variance as a r.v. and its expectation
Application: Mean squared error estimation
Conditional variance
∙ For our running example, E(X | Y = y) = (1 − y)/2 and
      E(X² | Y = y) = ∫_0^{1−y} x² ⋅ 1/(1 − y) dx = (1 − y)²/3
  Hence,
      Var(X | Y) = E(X² | Y) − (E(X | Y))² = (1 − Y)²/3 − (1 − Y)²/4 = (1 − Y)²/12
Conditional variance
∙ The expected value of Var(X | Y) can be computed as
      E_Y[Var(X | Y)] = E_Y[E(X² | Y) − (E(X | Y))²] = E(X²) − E[(E(X | Y))²]    (1)
∙ For our example, f_Y(y) = 2(1 − y), 0 ≤ y ≤ 1, and f_X(x) = 2(1 − x), 0 ≤ x ≤ 1, hence
      E[Var(X | Y)] = E(X²) − E[(E(X | Y))²]
                    = ∫_0^1 x² ⋅ 2(1 − x) dx − ∫_0^1 ((1 − y)²/4) ⋅ 2(1 − y) dy
                    = 1/6 − 1/8 = 1/24
∙ Compare to Var(X) = E(X²) − [E(X)]² = 1/6 − 1/9 = 1/18 ≥ E[Var(X | Y)]
∙ Since E(X | Y) is a r.v., it has a variance
      Var(E(X | Y)) = E_Y[(E(X | Y) − E[E(X | Y)])²] = E[(E(X | Y))²] − (E(X))²    (2)
∙ Law of conditional variances: Adding (1) and (2), we have
      E[Var(X | Y)] + Var(E(X | Y)) = Var(X)
  Thus in general, Var(X) ≥ E[Var(X | Y)]
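∙ Monte Carlo check of the law of conditional variances for the running example
  (a sketch; E[Var(X|Y)] = 1/24 and Var(E(X|Y)) = 1/72 should add up to Var(X) = 1/18):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=(2_000_000, 2))
    pts = u[u.sum(axis=1) <= 1]                  # uniform on the triangle
    x, y = pts[:, 0], pts[:, 1]

    cond_mean = (1 - y) / 2                      # E(X | Y) for this example
    cond_var = (1 - y) ** 2 / 12                 # Var(X | Y) for this example
    print(cond_var.mean(), cond_mean.var())      # ~ 1/24 and ~ 1/72
    print(cond_var.mean() + cond_mean.var(), x.var())   # both ~ 1/18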
MSE Estimation
∙ Given an observation Y of the signal X, we wish to find the estimate X̂ = X̂(Y)
  that minimizes the mean squared error
      MSE = E[(X − X̂)²]
∙ The X̂ that achieves the minimum MSE is called the MMSE estimate of X given Y
∙ Note that in general, the MMSE estimate is nonlinear, and
its MSE is ≤ the MSE of the linear MMSE estimate, i.e., it’s a better estimate
MMSE Estimate
∙ Theorem: The MMSE estimate of X given the observation Y is X̂ = E(X | Y)
and its MSE is
      MMSE = E[(X − E(X | Y))²]
           = E[ E[(X − E(X | Y))² | Y] ] = E_Y[Var(X | Y)]
∙ Computing the above requires knowledge of the distribution of (X, Y)
∙ In contrast, the linear MMSE estimate requires knowledge only of the means,
variances, and covariance, which are far easier to estimate from data
∙ Properties of the MMSE estimate:
Since by iterated expectation, EY [E(X | Y)] = E(X), the MMSE estimate is unbiased
If X and Y are independent, then the MMSE estimate is E(X)
  The conditional expectation of the estimation error for any Y = y is
      E(X − X̂ | Y = y) = E(X | Y = y) − E(X̂ | Y = y)
                        = E(X | Y = y) − E(X | Y = y) = 0,
  i.e., the error is unbiased for every Y = y
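∙ MSE comparison for the running triangle example (a sketch; here E(X | Y) = (1 − Y)/2 happens
  to be linear in Y, so the MMSE and linear MMSE estimates coincide, and both beat the constant
  estimate E(X)):

    import numpy as np

    rng = np.random.default_rng(0)
    u = rng.uniform(size=(2_000_000, 2))
    pts = u[u.sum(axis=1) <= 1]
    x, y = pts[:, 0], pts[:, 1]

    mmse_est = (1 - y) / 2                               # E(X | Y)
    a = np.cov(x, y)[0, 1] / y.var()                     # linear MMSE coefficient
    lmmse_est = a * (y - y.mean()) + x.mean()
    print(np.mean((x - mmse_est) ** 2))                  # ~ E[Var(X|Y)] = 1/24
    print(np.mean((x - lmmse_est) ** 2))                 # ~ 1/24 as well
    print(np.mean((x - x.mean()) ** 2))                  # ~ Var(X) = 1/18 (constant estimate)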
Proof of theorem
∙ Recall that min_b E[(X − b)²] = Var(X) and is achieved by b = E(X)    (3)
∙ We will use this result to show that E(X | Y) is the MMSE estimate of X given Y
∙ First we use iterated expectation to write
      min_{X̂(Y)} E[(X − X̂(Y))²] = min_{X̂(Y)} E_Y[ E_X[(X − X̂(Y))² | Y] ]
                                 = min_{X̂(Y)} Σ_y E_X[(X − X̂(y))² | Y = y] p_Y(y)
                                 = Σ_y min_{X̂(y)} E_X[(X − X̂(y))² | Y = y] p_Y(y),
  since each term of the sum can be minimized separately over the value X̂(y)
∙ By (3) applied to the conditional distribution of X given {Y = y}, each term is minimized
  by X̂(y) = E(X | Y = y), so the MMSE estimate is X̂ = E(X | Y) and its MSE is
  Σ_y Var(X | Y = y) p_Y(y) = E_Y[Var(X | Y)]
Example
Hence,
      E(Λ | X = x) = [ x² / ((x + 1 − (2x + 1)e^{−x}) e^{−x}) ] ⋅ ∫_1^2 λ² e^{−λx} dλ
                   = [ x² / ((x + 1 − (2x + 1)e^{−x}) e^{−x}) ] ⋅ [ (x² + 2x + 2 − (4x² + 4x + 2)e^{−x}) e^{−x} / x³ ]
                   = (x² + 2x + 2 − (4x² + 4x + 2)e^{−x}) / (x (x + 1 − (2x + 1)e^{−x})),   x ≥ 0
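∙ A numerical check of the closed form above (a sketch; it assumes, consistently with the algebra
  but not restated on this slide, that Λ ∼ U[1, 2] and X | Λ = λ ∼ Exp(λ)):

    import numpy as np

    def post_mean_numeric(x, n=200_000):
        lam = np.linspace(1, 2, n, endpoint=False) + 0.5 / n   # midpoints of [1, 2]
        w = lam * np.exp(-lam * x)                             # ∝ f_{Λ|X}(λ | x) under the assumed model
        return np.sum(lam * w) / np.sum(w)

    def post_mean_formula(x):
        num = x**2 + 2*x + 2 - (4*x**2 + 4*x + 2) * np.exp(-x)
        den = x * (x + 1 - (2*x + 1) * np.exp(-x))
        return num / den

    for x in (0.5, 1.0, 3.0):
        print(post_mean_numeric(x), post_mean_formula(x))      # the two columns should agree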
Example: MMSE versus linear MMSE estimates
[Figure: the MMSE estimate E(Λ | X = x) and the linear MMSE (LMMSE) estimate plotted against the observation x; vertical axis from 0.00 to 1.50]
Constant versus Linear versus MSE estimation
                       Constant           Linear                                    General (MMSE)
  Signal               X                  X                                         X
  Observation          none               Y                                         Y
  Optimal estimate     E(X)               ρ_{X,Y} σ_X (Y − E(Y))/σ_Y + E(X)         E(X | Y)
  MMSE                 Var(X)             (1 − ρ²_{X,Y}) σ²_X                       E_Y[Var(X | Y)]
  Information needed   E(X), Var(X)       E(X), E(Y), σ_X, σ_Y, ρ_{X,Y}             joint distribution of (X, Y)
Quantization
[Figure: a k-level quantizer of the real line; the interval index I takes values 1, 2, ..., i, ..., k − 1, k, and interval i is represented by the point x̂_i]
Quantization
∙ We choose the {a_i} and {x̂_i} that minimize MSE = E[(X − X̂(I))²]
∙ From the MSE estimation result, we know that X̂ = E(X | I) minimizes the MSE
∙ For I = i, from our earlier discussion on conditioning on an event,
      x̂_i = E(X | I = i) = E[X | X ∈ (a_i, a_{i+1}]]
          = ∫_{a_i}^{a_{i+1}} x f_X(x) / P{X ∈ (a_i, a_{i+1}]} dx
          = ∫_{a_i}^{a_{i+1}} x f_X(x) / (F_X(a_{i+1}) − F_X(a_i)) dx
  So, if you know the {a_i}, you can find the optimal {x̂_i} using the above formula
[Figure: the quantizer intervals (a_i, a_{i+1}], indexed by I = 1, ..., k, with their representation points x̂_i]
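∙ A sketch of this computation for X ∼ N(0, 1) and a fixed set of breakpoints {a_i}, using
  ∫ x φ(x) dx = φ(lo) − φ(hi) for the numerator (the partitions below are illustrative choices):

    import math

    def phi(x):   # standard normal pdf (0 in the infinite tails)
        return 0.0 if math.isinf(x) else math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    def Phi(x):   # standard normal cdf
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def rep_points(a):
        """Optimal representation points for breakpoints a[0] < a[1] < ... < a[k]."""
        return [(phi(lo) - phi(hi)) / (Phi(hi) - Phi(lo)) for lo, hi in zip(a[:-1], a[1:])]

    print(rep_points([-math.inf, 0.0, math.inf]))               # [-0.798, 0.798] = ∓sqrt(2/pi)
    print(rep_points([-math.inf, -1.0, 0.0, 1.0, math.inf]))    # a 4-level (2-bit) partition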
1-bit quantizer of a Gaussian sample
∙ Let X ∼ N(0, σ²), k = 2, and divide the real line into the two intervals (−∞, 0] and (0, ∞)
  [Figure: the N(0, σ²) pdf split at x = 0; the two intervals are mapped to the points x̂_1 and x̂_2]
∙ Hence, I = 1 if X ≤ 0 and I = 2 if X > 0
∙ The estimate for I = 1 is
      x̂_1 = ∫_{−∞}^{0} x f_X(x) / (F_X(0) − F_X(−∞)) dx = ∫_{−∞}^{0} x f_X(x) / 0.5 dx
          = (2/√(2πσ²)) ∫_{−∞}^{0} x e^{−x²/(2σ²)} dx = −√(2/π) σ
∙ Similarly, for I = 2, x̂_2 = √(2/π) σ
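∙ Monte Carlo check of the representation points ±√(2/π) σ (a sketch; σ = 2 is an arbitrary choice):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 2.0
    x = rng.normal(0, sigma, size=2_000_000)
    print(x[x <= 0].mean(), -np.sqrt(2 / np.pi) * sigma)   # both ~ -1.596
    print(x[x > 0].mean(), np.sqrt(2 / np.pi) * sigma)     # both ~ +1.596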
Problem of HW
∙ Finding the optimal {ai }, {̂xi } for a given k is computationally difficult in general
∙ In HW , you will explore an iterative procedure called the Lloyd algorithm
  Fix an initial set {a_i} and find the set {x̂_i} using the formula
      x̂_i = ∫_{a_i}^{a_{i+1}} x f_X(x) / (F_X(a_{i+1}) − F_X(a_i)) dx
  Then fix the resulting {x̂_i} and find the {a_i} that minimize the MSE, specifically
      a_i = (x̂_{i−1} + x̂_i) / 2
  Repeat the above procedure until your MSE doesn't change much (see the sketch at the end of this slide)
∙ This procedure is not guaranteed to minimize the MSE
∙ Quantization is closely related to clustering, which you’ll also explore in HW
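∙ A minimal Lloyd-algorithm sketch for X ∼ N(0, 1) (illustrative only; the HW's exact setup and
  interface are not specified here). It alternates the two steps described above:

    import math

    def phi(x):   # standard normal pdf (0 in the infinite tails)
        return 0.0 if math.isinf(x) else math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

    def Phi(x):   # standard normal cdf
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))

    def lloyd(k, iters=100):
        a = [-math.inf] + [i - k / 2 for i in range(1, k)] + [math.inf]   # initial breakpoints
        for _ in range(iters):
            # Step 1: representation points = conditional means of the current cells
            xhat = [(phi(lo) - phi(hi)) / (Phi(hi) - Phi(lo)) for lo, hi in zip(a[:-1], a[1:])]
            # Step 2: breakpoints = midpoints of adjacent representation points
            a = [-math.inf] + [(xhat[i - 1] + xhat[i]) / 2 for i in range(1, k)] + [math.inf]
        return a, xhat

    a, xhat = lloyd(4)
    print(xhat)   # 2-bit quantizer of N(0, 1): approximately [-1.51, -0.45, 0.45, 1.51]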
Summary