
Conditional Expectations: Review and Lots of Examples
David S. Rosenberg
Abstract
The goal of this document is to get the reader to some level of proficiency in calculating
and manipulating conditional expectations and variances. To avoid making this into a
class on probability theory, we only define conditional expectation for the simplest case
of random variables taking values in a finite set. However, all the properties of conditional
expectations we give will hold in full generality (as specified in the text), so the practice
in manipulating these expressions will generalize to arbitrary settings with conditional
expectations. We do provide proofs for various identities, but not for the sake of rigor –
the proofs themselves give the opportunity to practice exactly the types of manipulations
and calculations that are the point of this document. For a small additional challenge, you
can consider each theorem statement an exercise to complete for additional practice.

1 Basic Expectation
Let Y ∈ Y ⊂ R be a random variable – informally, Y is a random number.
In this document, we’ll discuss taking the expectation of Y with respect to many
different distributions. For simplicity, let’s suppose 𝒴 is a finite set, and let the random
variable Y have a distribution described by the probability mass function (PMF)
p(y). Then we’ll define the expectation of Y as

EY = Σ_{y∈𝒴} y p(y).

You can think of EY as a weighted average of the different values that Y can take,
where the weights are the probabilities of each value.
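If you like to check such definitions numerically, here is a minimal Python sketch; the fair six-sided die PMF below is an illustrative choice, not an example from the text.

# PMF of a fair six-sided die: p(y) = 1/6 for y = 1, ..., 6
pmf = {y: 1 / 6 for y in range(1, 7)}

# EY = sum over y of y * p(y)
EY = sum(y * p for y, p in pmf.items())
print(EY)  # 3.5, the weighted average of 1..6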
Remark 1. Although we usually write expectations in terms of random variables,
it’s best to think of expectations as properties of distributions. Notice that the


expression on the right hand side (RHS) above makes no reference to any partic-
ular random variable. In fact, all random variables with the same PMF p(y) have
the same expectation. So whenever we see an expectation operator, we should be
thinking about the distribution it’s acting on.
Remark 2. Although it’s obvious once you think about it, it’s worth noting explic-
itly that expectations do not make sense for everything that’s random. There is no
“expectation” for a random vegetable, because you cannot take a weighted average
of vegetables. The generic term for something that’s random is random element.
The specific term for a real-valued random element is random variable. We’ll
only be talking about expectations of random variables. Expectations of random
vectors, i.e. vectors of random variables, are a straightforward extension.
Exercise 1. Let X ∈ 𝒳 be a random element with PMF p(x), and suppose we have a
function f : 𝒳 → ℝ. Show that Ef(X) = Σ_{x∈𝒳} f(x) p(x).
Solution 1. Let Y = f(X), and let 𝒴 = {f(x) : x ∈ 𝒳}. The key step is that the
probability of any particular value y that Y may take is the sum of the probabilities
of the x’s that lead to that value of y. That is, p(y) = Σ_{x : f(x)=y} p(x). Putting it
together, we get

Ef(X) = EY = Σ_{y∈𝒴} y p(y)
           = Σ_{y∈𝒴} y Σ_{x : f(x)=y} p(x)
           = Σ_{y∈𝒴} Σ_{x : f(x)=y} f(x) p(x)
           = Σ_{x∈𝒳} f(x) p(x).
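The same identity is easy to verify numerically. The sketch below uses an illustrative PMF for X and the function f(x) = x² (both arbitrary choices), and checks that summing f(x)p(x) directly agrees with first forming the PMF of Y = f(X) and then computing EY.

from collections import defaultdict

p_x = {-1: 0.25, 0: 0.25, 1: 0.5}   # illustrative PMF for X
f = lambda x: x ** 2                 # any function f: X -> R

# Direct formula: E f(X) = sum_x f(x) p(x)
lhs = sum(f(x) * px for x, px in p_x.items())

# Push-forward: build the PMF of Y = f(X), then take EY
p_y = defaultdict(float)
for x, px in p_x.items():
    p_y[f(x)] += px                  # p(y) = sum over x with f(x) = y of p(x)
rhs = sum(y * py for y, py in p_y.items())

print(lhs, rhs)                      # both 0.75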

2 Conditional Expectation
Let’s now introduce another random element X ∈ 𝒳 into the mix. For simplicity,
we assume that 𝒳 is a finite set, and let p(x, y) be the joint PMF for X and Y.
Recall that the conditional distribution of Y given X = x is represented by the
conditional PMF

p(y | x) = p(x, y) / p(x).

For each fixed x, p(y | x) represents a distribution on 𝒴. You can verify this claim
by checking that p(y | x) ≥ 0 for all y ∈ 𝒴 and Σ_{y∈𝒴} p(y | x) = 1.
Definition 1. The conditional expectation of Y given X = x, denoted by E [Y | X = x]
and occasionally by E [Y | x], is the expectation of the distribution represented by
p(y | x). That is,

E[Y | X = x] = Σ_{y∈𝒴} y p(y | x).

As x changes, the conditional distribution of Y given X = x typically changes
as well, and so might the conditional expectation of Y given X = x. So we can
view E [Y | X = x] as a function of x. To emphasize this, let’s define the function
f : X → R such that f (x) = E [Y | X = x]. Note that there is nothing random
about this function: the same x always gives us the same f (x) as output. We can
now define E [Y | X]:
Definition 2. We define the conditional expectation of Y given X, denoted
E [Y | X], as the random variable f (X), where f (x) = E [Y | X = x].
In other words, E [Y | X] is what we get when we plug in the random vari-
able X to the deterministic function f (x). Since X is random, f (X) and thus
E [Y | X] are themselves random variables.
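To make the “random variable f(X)” view concrete, here is a small Python sketch; the joint PMF p(x, y) below is illustrative, not taken from the text. It computes f(x) = E[Y | X = x] for each x and then tabulates the distribution of the random variable E[Y | X].

# Illustrative joint PMF p(x, y); x in {0, 1}, y in {0, 1, 2}
p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

xs = {x for x, _ in p_xy}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}

# f(x) = E[Y | X = x] = sum_y y * p(y | x), with p(y | x) = p(x, y) / p(x)
f = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in xs}
print(f)  # {0: 1.0, 1: 0.666...}

# E[Y | X] is the random variable f(X): it takes value f(x) with probability p(x)
dist_of_cond_exp = {f[x]: p_x[x] for x in xs}
print(dist_of_cond_exp)  # {1.0: 0.4, 0.666...: 0.6}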
Remark 3. There’s often a temptation to write f (X) = E [Y | X = X]. Avoid
this. One of the issues is that it’s ambiguous: you might interpret it as conditioning
on the event that X = X, which always occurs. It’s an unfortunate notational
awkwardness that one learns to work around.
Remark 4. We can generalize conditional expectation to condition on multiple
random elements in the obvious way. For example, if f (x, z) = E [Y | X = x, Z = z]
then E [Y | X, Z] = f (X, Z).
Exercise 2. Show that if X ∈ 𝒳 has PMF p(x), then E[h(X) E[Y | X]] = Σ_{x∈𝒳} p(x) h(x) E[Y | X = x].

Proof. First, let f(x) = E[Y | X = x]. Then

E[h(X) E[Y | X]] = E[h(X) f(X)]                     definition of E[Y | X]
                 = Σ_{x∈𝒳} p(x) h(x) f(x)           Exercise 1
                 = Σ_{x∈𝒳} p(x) h(x) E[Y | X = x].

3 Identities for conditional expectations


There are a lot of “rules” for manipulating conditional expectations, and the hope
of this document is to get you comfortable with all the main ones. Here we list the
rules, and in the next section we’ll give some derivations and discussion. We’ll
give a short-hand expression for each identity, mostly borrowed from [KBH19,
Ch 9] and [Wik20], so we can refer to them easily in derivations.

• Adam’s Law / Law of Iterated Expectation:

– Simple: E [E [Y | X]] = EY
– More general: E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))] for any
f and g with compatible domains and ranges.

• Independence: E [Y | X] = E [Y ] if X and Y are independent.

• Taking out what is known¹: E [h(X)Z | X] = h(X)E [Z | X].

• Linearity: E [aX + bY | Z] = aE [X | Z] + bE [Y | Z], for any a, b ∈ R.

• Projection interpretation: E [(Y − E [Y | X])h(X)] = 0 for any function h : 𝒳 → ℝ.

• Keeping just what is needed: E [XY ] = E [XE [Y | X]] for X, Y ∈ R.

3.1 Law of Iterated Expectations


Since E [Y | X] is a random variable, it has a distribution. What is the expectation
of this distribution? In math, the expectation of E [Y | X] is E [E [Y | X]], of
course. The inner expectation is over Y , and the outer expectation is over X.
To clarify, this could be written as E_X[E_Y[Y | X]], though this is rarely done in
practice unless we need to specify the distributions that the variables are referring
to, as in E_{X∼p1(x)} E_{Y∼p2(y|x)}[Y | X].
¹This is the conditional version of E[cX] = cE[X], for any constant c ∈ ℝ. But that is an
equation of two numbers, while the conditional version is an equality of random variables. The
idea is that inside the conditional expectation, we think of X as being constant, and thus h(X) is
also constant. As such, we can pull h(X) out of the expectation. Once it’s on the outside of the
expectation, h(X) is random again.

Just like all other [unconditional] expectations, E[E[Y | X]] is just a number:
it’s not random. Let’s expand out the definitions a bit. Let f(x) = E[Y | X = x].
Then

E[E[Y | X]] = E[f(X)]
            = Σ_{x∈𝒳} p(x) f(x)
            = Σ_{x∈𝒳} p(x) E[Y | X = x].

In this last expression, we see that we’re taking the conditional expectation of
Y for each possible setting of X, and then taking the weighted average of them
where the weights are the probabilities that X takes each value. The next theorem
tells us that this is just another way to calculate EY .
Theorem 1 (Law of Iterated Expectations, “Adam’s Law”). For any random ele-
ment X ∈ X and random variable Y ∈ Y ⊂ R,

E [E [Y | X]] = EY.

Proof. We’ll prove this for the case of finite X and Y, but the result holds for
arbitrary random variables. As above, let f (x) = E [Y | X = x]. Then
E[E[Y | X]] = E[f(X)] = Σ_{x∈𝒳} p(x) f(x)
            = Σ_{x∈𝒳} p(x) E[Y | X = x]
            = Σ_{x∈𝒳} p(x) Σ_{y∈𝒴} p(y | x) y
            = Σ_{y∈𝒴} y Σ_{x∈𝒳} p(x, y)
            = Σ_{y∈𝒴} y p(y)
            = EY.
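The proof is easy to mirror numerically. A minimal sketch with an illustrative joint PMF (an arbitrary choice, not from the text): it computes E[E[Y | X]] by weighting E[Y | X = x] by p(x), and checks that this equals EY computed from the marginal of Y.

p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in ys}

cond_exp = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in xs}

lhs = sum(p_x[x] * cond_exp[x] for x in xs)   # E[E[Y | X]]
rhs = sum(y * p_y[y] for y in ys)             # EY
print(lhs, rhs)                               # both 0.8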

Exercise 3. Let 1 [W = 1] denote the random variable that takes the value 1 if
W = 1 and 0 otherwise. Show that E [1 [W = 1] Y ] = P (W = 1) E [Y | W = 1].

Proof. Let Z = 1 [W = 1]. Then

E[1[W = 1] Y] = E[ZY] = E[E[ZY | Z]]                                        by Adam’s Law
              = E[Z E[Y | Z]]                                               taking out what is known
              = P(Z = 1) · 1 · E[Y | Z = 1] + P(Z = 0) · 0 · E[Y | Z = 0]   def of expectation
              = P(W = 1) E[Y | W = 1]                                       def of Z.
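A quick Monte Carlo sanity check of this identity, with an arbitrary illustrative joint distribution for (W, Y), where W ∈ {0, 1} and the distribution of Y depends on W:

import random

random.seed(0)
n = 200_000

def sample_wy():
    # Illustrative joint distribution: W ~ Bernoulli(0.3); Y | W = w ~ Uniform{1, ..., 3 + 2w}
    w = 1 if random.random() < 0.3 else 0
    y = random.randint(1, 3 + 2 * w)
    return w, y

samples = [sample_wy() for _ in range(n)]

lhs = sum((w == 1) * y for w, y in samples) / n                 # E[1[W = 1] Y]
p_w1 = sum(w == 1 for w, _ in samples) / n                      # P(W = 1)
e_y_given_w1 = (sum(y for w, y in samples if w == 1)
                / sum(w == 1 for w, _ in samples))              # E[Y | W = 1]
print(lhs, p_w1 * e_y_given_w1)                                 # both close to 0.3 * 3 = 0.9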

3.1.1 Information processing


We’ll show later that E[Y | X] is the best prediction we can make for Y given X
(in terms of mean squared error). What if we have some function f : 𝒳 → 𝒳′ and
we consider E[Y | f(X)]? Does f(X) have more, less, or the same information
about Y as X does? Well, it could have much less, such as if f(x) ≡ 0 for
all x. If f is injective (i.e. if x ≠ x′ then f(x) ≠ f(x′)), then f(X) has
the same information as X, since we can always recover X from f(X) by X =
f⁻¹(f(X)). So in some sense, f(X) has at most as much information² as X.
So generally speaking, E[Y | f(X)] will not be as good a prediction of Y as
E[Y | X].
We’ll now discuss the more general form of Adam’s Law presented above.
Suppose we have an information processing chain: x ↦ g(x) ↦ f(g(x)). We
can think of g(X) as a “processed” or “coarsened” version of X. So E[Y | g(X)]
is our best approximation for Y given g(X). Suppose we have f(g(X)), which
is an even more processed version of X, and we want the best prediction for
E[Y | g(X)] given only f(g(X)). It turns out that this prediction is also the best
prediction for Y given only f(g(X)). This claim is formalized in the following
theorem:
Theorem 2 (Generalized Adam’s Law). We have

E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))]

for any f and g with compatible domains and ranges.


See Theorem 9 in the Appendix for a proof.
²These notions are formalized in information theory by the data processing inequality (see
e.g. [CT06, Chapter 2]), but we’re just looking for intuition here, so we don’t need to be formal.

Exercise 4. Show that the following identity is a special case of the Generalized
Adam’s Law:
E [E [Y | X, Z] | Z] = E [Y | Z] .
Proof. If we take g(x, z) = (x, z) and f (g(x, z)) = z in the generalized Adam’s
Law, we get the result. This is a form of Adam’s Law that’s often useful in prac-
tice.

3.2 Projection interpretation


As exercises in using our other identities, in this section we’ll prove the “projec-
tion interpretation” and the fact that E [Y | X] gives the best possible prediction
for Y based only on X. We’ll also discuss why this allows us to characterize
E [Y | X] as a projection of the random variable Y onto the space of random vari-
ables that depend only on X.

3.2.1 What we can say about residuals


If we think of E [Y | X] as a prediction for Y given X, then Y − E [Y | X] is the
residual of that prediction. The next theorem shows that the residual for E [Y | X]
is “orthogonal” to every random variable of the form h(X). The connection be-
tween the theorem and the notion of orthogonality is explained in Section 3.2.4. In
the corollary that follows, we’ll relate orthogonality to covariance and correlation.
Theorem 3 (Projection interpretation). For any h : 𝒳 → ℝ, E[(Y − E[Y | X]) h(X)] = 0.
Proof. We have
E [(Y − E [Y | X])h(X)]
= E [Y h(X)] − E [E [Y | X] h(X)] by linearity
= E [Y h(X)] − E [E [Y h(X) | X]] taking out what is known (in reverse)
= E [Y h(X)] − E [Y h(X)] Adam’s Law
= 0
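Here is a small exact check of this orthogonality, with an illustrative joint PMF and an arbitrary test function h (both assumptions for the sketch, not examples from the text):

p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in {0, 1}}
cond_exp = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in {0, 1}}

h = lambda x: 3 * x - 7        # an arbitrary function of x

# E[(Y - E[Y | X]) h(X)] computed by summing over the joint PMF
val = sum((y - cond_exp[x]) * h(x) * p for (x, y), p in p_xy.items())
print(val)                      # 0 (up to floating point error)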

Definition 3. The covariance of random variables X and Y is defined by Cov(X, Y) =
E[(X − EX)(Y − EY)] = E[XY] − EX EY. If Cov(X, Y) = 0, then we say X
and Y are uncorrelated.

Corollary 1. The residual Y − E[Y | X] and h(X) are uncorrelated (i.e. have
covariance 0) for every function h : 𝒳 → ℝ.

Proof. Note that E[Y − E[Y | X]] = EY − E[E[Y | X]] = 0 by linearity and
Adam’s Law. So

Cov(Y − E[Y | X], h(X)) = E[(Y − E[Y | X]) h(X)] − E[Y − E[Y | X]] E[h(X)]
                        = E[(Y − E[Y | X]) h(X)]
                        = 0,

where the second equality uses E[Y − E[Y | X]] = 0 and the last equality is by Theorem 3.

Remark 5. Note that Corollary 1 speaks about correlation, but not independence!
For example, the residual Y − E [Y | X] may have more variance for some values
of X than others. Thus Y − E [Y | X] is generally not independent of X, even
though it is uncorrelated with every random variable of the form h(X).

Exercise 5. Following the remark above, can we also say that Y − E[Y | X] is
uncorrelated with X? Why or why not?

Solution. X is not necessarily real-valued, and covariance and correlation are
defined specifically for random variables, which are real-valued by definition. If
X is a real-valued random element, then Y − E [Y | X] and X are uncorrelated.
This would be a special case of the original comment, taking h(x) = x. If X is
not real-valued, then the covariance and correlation with X are not defined. Note
that independence is defined for any types of random elements. So it’s reasonable
to ask whether Y −E [Y | X] and X are independent. As noted above, the general
answer is no.

3.2.2 Conditional expectation gives the best prediction


We now use Theorem 3 to prove that conditional expectation gives the best possi-
ble prediction of Y based on X.

Theorem 4 (Conditional expectation minimizes MSE). Suppose we have a random
element X ∈ 𝒳 and random variable Y ∈ ℝ. Let g(x) = E[Y | X = x]. Then

g = argmin_f E[(Y − f(X))²],

where the minimum is over all functions f : 𝒳 → ℝ.

Proof. We have

E[(f(X) − Y)²]
= E[(f(X) − E[Y | X] + E[Y | X] − Y)²]
= E[(f(X) − E[Y | X])²] + E[(E[Y | X] − Y)²]
  + 2 E[(f(X) − E[Y | X])(E[Y | X] − Y)]
= E[(f(X) − E[Y | X])²] + E[(E[Y | X] − Y)²],

where the cross term is 0 by the projection interpretation (Theorem 3), since
f(X) − E[Y | X] is a function of X and E[Y | X] − Y is the negative of the residual.

The second term in the last expression is independent of f, and the first term
in the last expression is clearly minimized by taking f(x) = E[Y | X = x].
As we’ll explain below, this theorem is what justifies calling E [Y | X] a pro-
jection.
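To see the theorem in action, here is a small Monte Carlo sketch; the distribution and the competing predictors are illustrative assumptions. It estimates E[(Y − f(X))²] for the conditional-mean predictor and for a couple of other functions of X, and the conditional mean should come out smallest.

import math
import random

random.seed(0)
n = 100_000

# Illustrative model: X ~ Uniform(0, 2*pi), Y | X = x ~ Normal(sin(x), 0.5^2),
# so E[Y | X = x] = sin(x).
xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]

predictors = {
    "E[Y|X] = sin(x)": math.sin,
    "zero":            lambda x: 0.0,
    "linear":          lambda x: 1 - x / math.pi,
}

for name, f in predictors.items():
    mse = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n
    print(name, round(mse, 3))
# The conditional mean achieves MSE near 0.25 = Var(Y | X); the others are larger.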

3.2.3 A variance decomposition


Sometimes it’s helpful to think of decomposing Y as

Y = E[Y | X] + (Y − E[Y | X]),

where the first term is the best prediction for Y given X and the second term is the residual.

Note that the two terms on the RHS are uncorrelated, by the projection interpre-
tation (Corollary 1). Since variance is additive for uncorrelated random variables
(i.e. if X and Y are uncorrelated, then Var(X + Y ) = Var(X) + Var(Y )), we get
the following theorem:
Theorem 5 (Variance decomposition with projection). For any random X ∈ X
and random variable Y ∈ R, we have

Var (Y ) = Var (Y − E [Y | X]) + Var (E [Y | X]) .


Remark 6. Theorem 4 tells us that E [Y | X] is the best approximation of Y we can
get from X. We can also think of E [Y | X] as a “less random” version of Y , since
Var (E [Y | X]) ≤ Var(Y ) [this follows immediately from the previous Theorem
since variance is always ≥ 0]. We can say that E[Y | X] only keeps the randomness
in Y that is predictable from X. Why do we say this? E[Y | X] is a deterministic
function of X, so there’s no other source of randomness in E[Y | X].

3.2.4 [Optional] Why do we call this the “projection interpretation”?


One can consider the space of all random variables with finite variance as an inner
product space with inner product given by ⟨X, Y⟩ = E[XY] and norm given by
‖Y‖² = ⟨Y, Y⟩ = EY². A random variable S′ is called a projection (or L²-projection)
of Y onto a set 𝒮 if S′ ∈ 𝒮 and

E[(Y − S′)²] ≤ E[(Y − S)²]  for all S ∈ 𝒮.

In words, S′ is the projection of Y onto 𝒮 if it is the best approximation of Y in
𝒮 in terms of mean squared error (MSE). In Theorem 4 above we exactly proved
that E [Y | X] is the function of X that has the smallest possible MSE for pre-
dicting Y . Thus E [Y | X] is the projection of Y onto the set of random variables
{h(X) | h is any real-valued function}.
Remark 7. The projection interpretation gives another way to think about the gen-
eralized Adam’s Law: E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))] for any f
and g with compatible domains and ranges. We can think of the LHS as a se-
quence of two projections, while the RHS is a single projection. Adam’s Law
says they’re equivalent. In more detail, E [Y | g(X)] is the projection of Y onto
{h(g(X)) | ∀h}, the set of all functions of g(X), and E [E [Y | g(X)] | f (g(X))]
is the projection of E [Y | g(X)] onto {h(f (g(X))) | ∀h}, the set of all functions
of f (g(X)). Note that the second set of functions is a subset of the first, i.e.
{h(f (g(X))) | ∀h} ⊆ {h(g(X)) | ∀h}, since f (·) may discard information from
g(X). So Adam’s Law is saying that if we project onto a set and then project onto
a subset of the original set, then we get the same thing as if we had projected Y
directly onto the subset to begin with. Perhaps you can visualize this by pictur-
ing projecting a vector in R3 onto a 2-dimensional subspace, and then projecting
the projection onto a 1-dimensional subspace contained in the 2-dimensional sub-
space.

3.2.5 Empirical example of the variance decomposition


To illustrate some of the concepts of the variance decomposition, let’s consider
the following joint distribution of (X, Y ):
X ∼ Unif[0, 6]
Y | X = x ∼ N(6 + 1.3 sin(x), (0.3 + |3 − x|/4)²)

Fig. 1: This plot shows the sampled (x, y) pairs, along with the con-
ditional expectation and residual for each: (x, E [Y | X = x]) and
(x, y − E [Y | X = x]).

So given X = x, the best predictor for Y in MSE is E[Y | X = x] = 6 + 1.3 sin(x).
Figure 1 shows a sample of size n = 300 from this distribution. For
each sampled point (x, y), we also plot (x, E [Y | X = x]), which is the best pre-
diction of Y given just X = x, along with the residual of that prediction. Note
that the residuals hover around 0. Indeed, we should expect that since
E [Y − E [Y | X] | X = x]
= E [Y | X = x] − E [E [Y | X] | X = x] by linearity
= E [Y | X = x] − E [Y | X = x] E [1 | X = x] taking out what is known
= 0.
By the variance decomposition in terms of projection (Theorem 5), we know
Var (Y ) = Var (Y − E [Y | X]) + Var (E [Y | X]) . Using standard variance esti-
mators with our observed sample, we find estimates Var(Y − E[Y | X]) ≈ 0.53,
Var(E[Y | X]) ≈ 0.91, and Var(Y) ≈ 1.39, while the sum of the first two estimates is 1.43.
The gap between 1.43 and 1.39 is attributable to sampling error and vanishes as
we take the sample size n → ∞. In Figure 2 we show kernel density estimates of
each of the distributions in the variance decomposition.
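A short simulation in this spirit, as a sketch of how such variance estimates could be produced (the exact numbers will differ from run to run and from those quoted above):

import math
import random

random.seed(0)
n = 300

xs = [random.uniform(0, 6) for _ in range(n)]
cond_mean = [6 + 1.3 * math.sin(x) for x in xs]                      # E[Y | X = x]
ys = [m + random.gauss(0, 0.3 + abs(3 - x) / 4) for x, m in zip(xs, cond_mean)]
residuals = [y - m for y, m in zip(ys, cond_mean)]

def var(v):
    # plain sample variance estimator
    mean = sum(v) / len(v)
    return sum((t - mean) ** 2 for t in v) / (len(v) - 1)

print(var(ys))                          # estimate of Var(Y)
print(var(residuals), var(cond_mean))   # estimates of Var(Y - E[Y|X]) and Var(E[Y|X])
print(var(residuals) + var(cond_mean))  # close to Var(Y), up to sampling error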

Fig. 2: This plot shows estimates of the densities of Y, E[Y | X], and Y − E[Y | X].

3.3 Keeping just what is needed

We can get all the results in this section with trivial two line derivations using
Adam’s Law and taking out what is known. Nevertheless, we highlight these
identities as some additional intuition builders.

Theorem 6 (Keeping just what is needed). For any random variables X, Y ∈ ℝ,
E[XY] = E[X E[Y | X]].

One way to think about this is that for the purposes of computing E [XY ], we
only care about the randomness in Y that is predictable from X.

Proof. We can show this using the projection interpretation:


E [XY ] = E X E [Y | X] + Y − E [Y | X] 
| {z }
residual uncorrelated with X
= E [XE [Y | X]]
:0


+E [X (Y −E [Y | X])] Projection


 interpretation

= E [XE [Y | X]]

Exercise 6. Give an alternative proof of E[XY] = E[X E[Y | X]] using Adam’s
Law and Taking out what is known.

Let’s put Theorem 6 in a slightly more general context and consider E [g(X)h(Y )].
Theorem 6 tells us that we get the same result if we replace h(Y ) by an approx-
imation to h(Y ), namely E [h(Y ) | g(X)]. By Theorem 4, this is actually the
best approximation for h(Y ) given g(X). Can we also get the same answer if
we replace h(Y ) by another approximation E [h(Y ) | X]? This approximation is
potentially better than E [h(Y ) | g(X)], since there may be more information in
X than in g(X). In the following Exercise, show that we get the same result even
if we plug in the better approximation:

Exercise 7. E[g(X)h(Y)] = E[g(X) E[h(Y) | X]]. (Hint: You can either use
the projection interpretation approach we used for the proof of Theorem 6, or it’s
basically a two-liner with the application of Adam’s Law and Taking out what is
known.)

Exercise 8. Show that E[X E[Y | Z]] = E[E[X | Z] E[Y | Z]] = E[E[X | Z] Y].
(This property is sometimes referred to as “self-adjointness”.)

Proof. We have

E [XE [Y | Z]] = E [E (XE [Y | Z] | Z)] Adam’s Law


= E [E (X | Z) E [Y | Z]] Taking out what is known.

Exercise 9. Give a new proof of the “projection interpretation” (Theorem 3) using
“keeping just what is needed” (Theorem 6).

3.4 Intuition Builders and Extra Exercises


Suppose E [Y | X] = c is a constant. This means that whatever information we
learn from X, our best prediction for Y never changes. Does this mean that X and
Y are independent? No way! For example, the variance of Y can change dramat-
ically as a function of X, even if the expected value of Y is constant. However, if
X is a real-valued random variable, we can say something about the correlation
of X and Y .

Exercise 10. [KBH19, Ch. 9 Exercise 29] If X and Y are random variables and
E [Y | X] = c, then show that X and Y are uncorrelated. (Hint: It’s sufficient to
show that Cov(X, Y ) = E [XY ] − E [X] E [Y ] = 0.)

Proof. We have:

E[XY] = E[E[XY | X]]
      = E[X E[Y | X]]
      = c E[X]
E[Y] = E[E[Y | X]] = c
Cov(X, Y) = E[XY] − E[X] E[Y] = c E[X] − c E[X] = 0.

Exercise 11. [KBH19, Ch. 9 Exercise 30] If X and Y are independent random
variables, then we know that E[Y | X] = E[Y], which is a constant. However, if
we only know that X and Y are uncorrelated, then E[Y | X] is not necessarily a
constant. Give an example of this. (Hint: Your job here is to come up with a joint
distribution of X and Y and show it has the required properties. There are many
ways to do this, so try to keep things simple. For example, you can define Y to be
a deterministic function of X and keep the set of values of X small.)

Solution 2. Take (X, Y ) ∈ {(−1, 1), (0, 0), (1, 1)} with equal probability. Then
the covariance of X and Y is 0 and E [Y | X = x] = 1 [x ∈ {−1, 1}].

Exercise 12. We know that if X and Y are independent random variables, then
E [Y | X] = E [Y ]. But if there’s another random variable W in the picture, can
we also say that E [Y | X, W ] = E [Y | W ]? Is there a rule that we might call
“Drop what is independent from the conditioning”?

Solution 3. Nope! Consider X, W i.i.d. with uniform distributions on {0, 1}.
Suppose Y = 1[X ≠ W]. (We can also write that as Y = X ⊕ W, using the
exclusive-or operator.) Then X alone gives no information about Y. Similarly,
W alone gives no information about Y. Thus Y is independent of X and Y is
independent of W. But Y is not independent of (X, W). In any case E[Y | W] =
E[Y] = 0.5, while E[Y | X, W] = 1[X ≠ W].
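This example is small enough to check by brute-force enumeration over the four equally likely outcomes of (X, W); a minimal Python sketch:

from itertools import product
from fractions import Fraction

outcomes = list(product([0, 1], repeat=2))            # (x, w), each with probability 1/4
p = Fraction(1, 4)
y = lambda x, w: int(x != w)                          # Y = 1[X != W]

# E[Y], E[Y | X = x], E[Y | W = w]
ey = sum(p * y(x, w) for x, w in outcomes)
ey_given_x = {x0: sum(y(x, w) for x, w in outcomes if x == x0) * Fraction(1, 2)
              for x0 in [0, 1]}
ey_given_w = {w0: sum(y(x, w) for x, w in outcomes if w == w0) * Fraction(1, 2)
              for w0 in [0, 1]}
print(ey, ey_given_x, ey_given_w)   # all equal 1/2: X alone (or W alone) tells us nothing
print({(x, w): y(x, w) for x, w in outcomes})   # but E[Y | X, W] = 1[X != W] is not constant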
Exercise 13. [KBH19, Ch 9 Exercise 40] Let X1, X2, Y be random variables and
let A = E[Y | X1] and B = E[Y | X1, X2]. Show that Var(A) ≤ Var(B).
At first glance, this result may seem counter to intuition. Usually we think
that getting more information (e.g. X1 and X2 rather than just X1 ) should reduce
uncertainty, rather than increase it. Why would variance be increasing when we
add more information? The devil’s in the details. Here we’re not talking about the
uncertainty in our estimate for Y (that would be something like Var (Y | X1 , X2 )),
but rather how much our estimates for Y change as we get different random X’s.
The more information we can use to estimate Y , the more potential there is for
variation in those estimates.
Proof. We first note that A = E [B | X1 ], by the generalized Adam’s Law. By the
projection interpretation, B − E [B | X1 ] and A = E [B | X1 ] are uncorrelated.
Thus from
B = B − E [B | X1 ] + A
we get
Var (B) = Var (B − E [B | X1 ]) + Var (A) .
Since we always have variance ≥ 0, we must have Var(B) ≥ Var(A).

Exercise 14. [KBH19, Ch 9, Exercise 41] Show that for any X and Y,

E[Y | E[Y | X]] = E[Y | X].
Proof. Let f (x) = E [Y | X = x]. So f (X) is our best approximation to Y given
X. So
E [Y | E [Y | X]] = E [Y | f (X)]
= E [E [Y | X] | f (X)] generalized Adam’s
= E [f (X) | f (X)]
= f (X) Taking out what is known

4 Conditional variance
We could define Var (Y |X) using the same approach that we used to define E[Y |
X]. Let g(x) = Var(Y | X = x), where Var (Y | X = x) is the variance of the
conditional distribution Y | X = x, which is just a number. And then define
Var(Y | X) = g(X). We can also just define conditional variance directly in
terms of conditional expectations:

Definition 4. The conditional variance of Y given X is

Var(Y | X) = E[(Y − E[Y | X])² | X]
           = E[Y² | X] − (E[Y | X])².

4.1 Law of Total Variance


According to Wikipedia, the law of total variance goes by many names, including
the variance decomposition formula, conditional variance formula, law of iterated
variances, and Eve’s law.

Theorem (Eve’s Law). If X and Y are random variables on the same probability
space, then
Var(Y ) = E [Var(Y | X)] + Var(E [Y | X]).

On the RHS, if we write E for expectation and V for variance, the sequence of
operations is EVVE. That’s why this is sometimes called “Eve’s law”. Not a bad
way to remember this important decomposition.
Let’s interpret this theorem in the case that X takes values in a finite set X =
{x1 , . . . , xN }. We can call Var (Y | X = x) the within group variance for the
group X = x, and so E [Var (Y | X)] is the [weighted] average of the within
group variances. This is clear just from writing out the expectation:
E[Var(Y | X)] = Σ_{x∈𝒳} p(x) Var(Y | X = x).

We can call Var (E [Y | X]) the between group variance, where each group
x is represented by the single number E [Y | X = x]. If the groups have equal
probabilities p(x1 ) = · · · = p(xN ), then Var (E [Y | X]) is just the variance of the
numbers E [Y | X = x1 ] , . . . , E [Y | X = xN ]. More generally, Var (E [Y | X])
is the variance of the distribution described by the following table:

Probability    Value
p(x1)          E[Y | X = x1]
⋮              ⋮
p(xN)          E[Y | X = xN]
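A tiny worked example of this within/between reading of Eve’s Law, with illustrative numbers (two groups and their conditional distributions are arbitrary choices for the sketch):

# Illustrative two-group example: p(x), and the conditional distribution of Y in each group
p_x = {"x1": 0.4, "x2": 0.6}
p_y_given_x = {"x1": {0: 0.5, 2: 0.5},      # E[Y | x1] = 1, Var(Y | x1) = 1
               "x2": {4: 0.5, 6: 0.5}}      # E[Y | x2] = 5, Var(Y | x2) = 1

cond_mean = {x: sum(y * p for y, p in d.items()) for x, d in p_y_given_x.items()}
cond_var = {x: sum((y - cond_mean[x]) ** 2 * p for y, p in d.items())
            for x, d in p_y_given_x.items()}

within = sum(p_x[x] * cond_var[x] for x in p_x)                           # E[Var(Y | X)]
mean_of_means = sum(p_x[x] * cond_mean[x] for x in p_x)                   # EY
between = sum(p_x[x] * (cond_mean[x] - mean_of_means) ** 2 for x in p_x)  # Var(E[Y | X])

# Total variance of Y, computed directly from the marginal of Y
p_y = {}
for x, d in p_y_given_x.items():
    for y, p in d.items():
        p_y[y] = p_y.get(y, 0.0) + p_x[x] * p
ey = sum(y * p for y, p in p_y.items())
total = sum((y - ey) ** 2 * p for y, p in p_y.items())

print(total, within + between)   # both 4.84: 1.0 within-group + 3.84 between-group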

Proof. Expanding the definitions:

E[Var(Y | X)] = E[E[Y² | X] − (E[Y | X])²]
              = E[E[Y² | X]] − E[(E[Y | X])²]       by linearity
              = EY² − E[(E[Y | X])²]                by Adam’s Law     (1)

and

Var(E[Y | X]) = E[(E[Y | X])²] − (E[E[Y | X]])²     def of Var
              = E[(E[Y | X])²] − (EY)²              by Adam’s Law.

Adding these expressions together, we get the result.

Remark 8. It’s tempting to say that getting new information about Y from ob-
serving X = x would decrease the variance. That is, it seems reasonable that
Var(Y | X = x) ≤ Var(Y ) for all x. But this is not the case. For example, we
could have Var (Y | X = x) very large for a particular x, but if X = x is very
rare, the overall variance of Y could still be much smaller. On the other hand, it is
true that Var(Y | X = x) ≤ Var(Y ) on average over X. More precisely:

E [Var (Y | X)] ≤ Var(Y ).

This follows immediately from Eve’s Law (Theorem 4.1) and the fact that Var(E[Y | X]) ≥ 0.
We can equate Eve’s Law with our variance decomposition in terms of projection
(Theorem 5) to get the following theorem:

Theorem 7. If X and Y are random variables on the same probability space, then

E[Var(Y | X)] = Var(Y − E[Y | X]) = E[(Y − E[Y | X])²].

Proof. As an exercise in conditional expectations, we’ll prove this without using
Eve’s Law:

Since Y − E[Y | X] has mean 0,

Var(Y − E[Y | X]) = E[(Y − E[Y | X])²]
                  = EY² + E[(E[Y | X])²] − 2 E[Y E[Y | X]].

Since E[Y | X] is a function of X, we can use the generalized form of “keeping
just what is needed” (Exercise 7). We have E[Y E[Y | X]] = E[E[Y | X] E[Y | X]],
where we’ve replaced Y in the first expectation by E[Y | X]. Putting it together,
we get

Var(Y − E[Y | X]) = EY² + E[(E[Y | X])²] − 2 E[(E[Y | X])²]
                  = EY² − E[(E[Y | X])²]
                  = E[Var(Y | X)],

where the last equality is from Equation 1 in the proof of Eve’s Law above.
Exercise 15. Suppose A ∈ 𝒜 has probability mass function π(a), for a ∈ 𝒜 =
{1, . . . , k}, and R ∈ ℛ is an independent random element. Show that

E[f(R, A) g(A)] = Σ_{a=1}^{k} E[f(R, a)] π(a) g(a).

Proof. In the context in which this exercise arises, we start with the RHS and need
to “discover” the LHS, so starting with the LHS would be a “guess and check”
approach. We’ll start with the RHS:

Σ_{a=1}^{k} E[f(R, a)] π(a) g(a)
= Σ_{a=1}^{k} π(a) E[f(R, a) g(a)]                    since g(a) is constant
= Σ_{a=1}^{k} π(a) E[f(R, A) g(A) | A = a]            since R and A are independent
= E[h(A)]                                             with h(a) = E[f(R, A) g(A) | A = a]
= E[E[f(R, A) g(A) | A]]
= E[f(R, A) g(A)].

As we get more comfortable with conditional expectations, we can skip the step
involving h(a).

5 Law of total covariance / Conditional covariance


First, recall the definition of the covariance of X and Y : Cov (X, Y ) = E [XY ] −
EXEY = E (X − EX) (Y − EY ).

Exercise 16. Show that covariance is not affected by changing the means of the
random variables. To be precise, if X 0 = X + c1 and Y 0 = Y + c2 for constants
c1 , c2 ∈ R, then Cov (X 0 , Y 0 ) = Cov(X, Y ).

Definition 5. The conditional covariance of X and Y given Z is

Cov(X, Y | Z) = E[(X − E[X | Z])(Y − E[Y | Z]) | Z]
              = E[XY | Z] − E[X | Z] E[Y | Z].

Exercise 17. Use the rules we developed above to show that the two expressions
for Cov (X, Y | Z) are equivalent.

Cov(X, Y | Z) = E[(X − E[X | Z])(Y − E[Y | Z]) | Z]                       definition
              = E[XY | Z] + E[E[X | Z] E[Y | Z] | Z]
                − E[X E[Y | Z] | Z] − E[Y E[X | Z] | Z]                   linearity
              = E[XY | Z] + E[X | Z] E[Y | Z] E[1 | Z]
                − E[Y | Z] E[X | Z] − E[X | Z] E[Y | Z]                   taking out what is known
              = E[XY | Z] − E[X | Z] E[Y | Z],
since E[1 | Z] = 1.

Theorem 8 (Law of Total Covariance (ECCE)). Suppose X and Y are random
variables and Z is a random element on the same probability space. Then

Cov (X, Y ) = E [Cov(X, Y | Z)] + Cov (E [X | Z] , E [Y | Z]) .

Note: Following [KBH19, Ch 9, Exercise 43], we’ll use ECCE as a shorthand
for the law of total covariance, based on the sequence of expectations and
covariances in the formula. (Again, also a good mnemonic.)
Proof. We have

E[Cov(X, Y | Z)] = E[E[XY | Z] − E[X | Z] E[Y | Z]]        def
                 = E[XY] − E[E[X | Z] E[Y | Z]]            linearity and Adam’s

and

Cov(E[X | Z], E[Y | Z]) = E[E[X | Z] E[Y | Z]] − E[E[X | Z]] E[E[Y | Z]]   def
                        = E[E[X | Z] E[Y | Z]] − EX EY                     Adam’s.

Adding these expressions together, we get

E[Cov(X, Y | Z)] + Cov(E[X | Z], E[Y | Z]) = E[XY] − EX EY = Cov(X, Y).
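As with Eve’s Law, this decomposition is easy to verify on a small example. A sketch with an illustrative joint PMF over (X, Y, Z), an arbitrary choice for the check:

# Illustrative joint PMF p(x, y, z) over a few points (weights sum to 1)
p = {(0, 0, 0): 0.15, (0, 1, 0): 0.10, (1, 1, 0): 0.25,
     (0, 1, 1): 0.20, (1, 0, 1): 0.10, (1, 2, 1): 0.20}

def E(f):
    # expectation of f(x, y, z) under the joint PMF
    return sum(f(x, y, z) * w for (x, y, z), w in p.items())

p_z = {z0: sum(w for (_, _, z), w in p.items() if z == z0) for z0 in {0, 1}}
Ex_z = {z0: sum(x * w for (x, _, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}
Ey_z = {z0: sum(y * w for (_, y, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}
Exy_z = {z0: sum(x * y * w for (x, y, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}

cov_xy = E(lambda x, y, z: x * y) - E(lambda x, y, z: x) * E(lambda x, y, z: y)

e_cond_cov = sum(p_z[z0] * (Exy_z[z0] - Ex_z[z0] * Ey_z[z0]) for z0 in {0, 1})
mean_ex, mean_ey = E(lambda x, y, z: Ex_z[z]), E(lambda x, y, z: Ey_z[z])
cov_of_cond_means = sum(p_z[z0] * (Ex_z[z0] - mean_ex) * (Ey_z[z0] - mean_ey)
                        for z0 in {0, 1})

print(cov_xy, e_cond_cov + cov_of_cond_means)   # the two sides of ECCE agree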

References
[CT06] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), Wiley-Interscience, USA, 2006.

[KBH19] Joseph K. Blitzstein and Jessica Hwang, Introduction to Probability, 2nd ed., Chapman and Hall/CRC, 2019.

[Wik20] Wikipedia contributors, Conditional expectation — Wikipedia, The Free Encyclopedia, 2020. [Online; accessed 31-December-2020].

A Generalized Adam’s Law


Theorem 9 (Generalized Adam’s Law). We have
E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))]
for any f and g with compatible domains and ranges. We also have that
E [E [Y | g(X)] | f (g(X)) = z] = E [Y | f (g(X)) = z]
for any z.
Proof. Let 𝒢 = {g(x) | x ∈ 𝒳}. The key step in the proof is the following:

P(Y = y | f(g(X)) = z)
= Σ_{w∈𝒢} P(Y = y | f(g(X)) = z, g(X) = w) P(g(X) = w | f(g(X)) = z)
= Σ_{w∈𝒢} P(Y = y | g(X) = w) P(g(X) = w | f(g(X)) = z),

where the first equality is the law of total probability, and the second follows
because once we know g(X) = w, we know that f(g(X)) = f(w), and so conditioning
on the value of f(g(X)) gives no additional information.
Let h(w) = E [Y | g(X) = w], to ease some calculations below. Then

E[Y | f(g(X)) = z] = Σ_{y∈𝒴} y P(Y = y | f(g(X)) = z)
= Σ_{y∈𝒴} y Σ_{w∈𝒢} P(Y = y | g(X) = w) P(g(X) = w | f(g(X)) = z)
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) Σ_{y∈𝒴} y P(Y = y | g(X) = w)
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) E[Y | g(X) = w]
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) h(w)
= E[h(g(X)) | f(g(X)) = z]
= E[E[Y | g(X)] | f(g(X)) = z].

I claim we’re done at this point. To make it clear, suppose we let k(z) = E [Y | f (g(X)) = z]
and r(z) = E [E [Y | g(X)] | f (g(X)) = z]. Then the calculations above have
shown that k(z) = r(z) for all possible z. So the equality certainly holds when
we plug in the random variable f(g(X)) for z: k(f(g(X))) = r(f(g(X))). And
then by the definition of conditional expectation, we conclude that

E [Y | f (g(X))] = E [E [Y | g(X)] | f (g(X))]

Note: It’s not true that E [E [Y | g(X)] | h(X)] = E [Y | h(X)]. Can you find
the spot in the proof where we can’t just replace f (g(X)) by h(X)?
Remark 9. If we take g(x, z) = (x, z) and f (g(x, z)) = z in the generalized
Adam’s Law, we get

E [E [Y | X, Z] | Z] = E [Y | Z] ,

which is often useful in practice.
