Conditional Expectation Notes
Lots of Examples
David S. Rosenberg
Abstract
The goal of this document is to get the reader to some level of proficiency in calculating
and manipulating conditional expectations and variances. To avoid making this into a
class on probability theory, we only define conditional expectation for the simplest case
of random variables taking values in a finite set. However, all the properties of conditional
expectations we give will hold in full generality (as specified in the text), so the practice
in manipulating these expressions will generalize to arbitrary settings with conditional
expectations. We do provide proofs for various identities, but not for the sake of rigor –
the proofs themselves give the opportunity to practice exactly the types of manipulations
and calculations that are the point of this document. For a small additional challenge, you
can consider each theorem statement an exercise to complete for additional practice.
1 Basic Expectation
Let Y ∈ Y ⊂ R be a random variable – informally, Y is a random number.
In this document, we’ll discuss taking the expectation of Y with respect to many
different distributions. For simplicity, let’s suppose Y is a finite set, and let random
variable Y have a distribution described by the probability mass function (PMF)
p(y). Then we’ll define the expectation of Y as
EY = \sum_{y \in \mathcal{Y}} y p(y).
You can think of EY as a weighted average of the different values that Y can take,
where the weights are the probabilities of each value.
Remark 1. Although we usually write expectations in terms of random variables,
it’s best to think of expectations as properties of distributions. Notice that the
expression on the right hand side (RHS) above makes no reference to any partic-
ular random variable. In fact, all random variables with the same PMF p(y) have
the same expectation. So whenever we see an expectation operator, we should be
thinking about the distribution it’s acting on.
Remark 2. Although it’s obvious once you think about it, it’s worth noting explic-
itly that expectations do not make sense for everything that’s random. There is no
“expectation” for a random vegetable, because you cannot take a weighted average
of vegetables. The generic term for something that’s random is random element.
The specific term for a real-valued random element is random variable. We’ll
only be talking about expectations of random variables. Expectations of random
vectors, i.e. vectors of random variables, are a straightforward extension.
Exercise 1. Let X ∈ X be a random element with PMF p(x), and suppose we have a
function f : X → R. Show that Ef(X) = \sum_{x \in \mathcal{X}} f(x) p(x).
Solution 1. Let Y = f(X), and let Y = {f(x) : x ∈ X}. The key step is that the
probability of any particular value y that Y may take is the sum of the probabilities
of the x’s that lead to that value of y. That is, p(y) = \sum_{x : f(x) = y} p(x). Putting it
together, we get
Ef(X) = EY = \sum_{y \in \mathcal{Y}} y p(y)
           = \sum_{y \in \mathcal{Y}} y \sum_{x : f(x) = y} p(x)
           = \sum_{y \in \mathcal{Y}} \sum_{x : f(x) = y} f(x) p(x)
           = \sum_{x \in \mathcal{X}} f(x) p(x).
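To make this concrete, here is a small Python sketch (an illustration added here, not part of the original exercise; the PMF and the function f are made up) that computes Ef(X) both from the induced PMF of Y = f(X) and directly via \sum_x f(x) p(x):

# Compare E[f(X)] computed from the PMF of Y = f(X) with sum_x f(x) p(x).
import numpy as np

xs = np.array([-2, -1, 0, 1, 2])
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])   # PMF of X (made up)
f = lambda x: x ** 2                      # f : X -> R (made up)

# Right-hand side: sum over x of f(x) p(x)
rhs = np.sum(f(xs) * p)

# Left-hand side: build the PMF of Y = f(X), summing p(x) over {x : f(x) = y}
ys = np.unique(f(xs))
p_y = np.array([p[f(xs) == y].sum() for y in ys])
lhs = np.sum(ys * p_y)

print(np.isclose(lhs, rhs))   # True; both sides equal 1.2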
2 Conditional Expectation
Let’s now introduce another random element X ∈ X into the mix. For simplicity,
we assume that X is a finite set, and let p(x, y) be the joint PMF for X and Y .
Recall that the conditional distribution of Y given X = x is represented by the
conditional PMF
p(y | x) = \frac{p(x, y)}{p(x)}.
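As a quick illustration (my own sketch with a made-up joint PMF, not from the original notes), the conditional PMF is obtained by dividing each row of the joint PMF table by the corresponding marginal probability of X:

# Compute the conditional PMF p(y | x) = p(x, y) / p(x) from a joint PMF table.
import numpy as np

# Rows index values of X, columns index values of Y; entries are p(x, y).
p_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal PMF of X
p_y_given_x = p_xy / p_x                # each row now sums to 1

print(p_y_given_x)
# first row ≈ [0.333, 0.667], second row ≈ [0.429, 0.571]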
– Simple: E [E [Y | X]] = EY
– More general: E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))] for any
f and g with compatible domains and ranges.
E[E[Y | X]] = E[f(X)]
            = \sum_{x \in \mathcal{X}} p(x) f(x)
            = \sum_{x \in \mathcal{X}} p(x) E[Y | X = x].
In this last expression, we see that we’re taking the conditional expectation of
Y for each possible setting of X, and then taking the weighted average of them
where the weights are the probabilities that X takes each value. The next theorem
tells us that this is just another way to calculate EY .
Theorem 1 (Law of Iterated Expectations, “Adam’s Law”). For any random ele-
ment X ∈ X and random variable Y ∈ Y ⊂ R,
E [E [Y | X]] = EY.
Proof. We’ll prove this for the case of finite X and Y, but the result holds for
arbitrary random variables. As above, let f (x) = E [Y | X = x]. Then
E[E[Y | X]] = E[f(X)] = \sum_{x \in \mathcal{X}} p(x) f(x)
            = \sum_{x \in \mathcal{X}} p(x) E[Y | X = x]
            = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y | x) y
            = \sum_{y \in \mathcal{Y}} y \sum_{x \in \mathcal{X}} p(x, y)
            = \sum_{y \in \mathcal{Y}} y p(y)
            = EY.
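Here is a short numerical check of Adam’s Law (an added sketch; the joint PMF is randomly generated rather than taken from the notes):

# Check E[E[Y | X]] == E[Y] for a random joint PMF on a small finite grid.
import numpy as np

rng = np.random.default_rng(0)
ys = np.array([-1.0, 0.0, 2.0, 5.0])   # values of Y; X just indexes the 4 rows

p_xy = rng.random((4, 4))
p_xy /= p_xy.sum()                     # random joint PMF p(x, y)

p_x = p_xy.sum(axis=1)                 # marginal of X
p_y = p_xy.sum(axis=0)                 # marginal of Y

f = (p_xy / p_x[:, None]) @ ys         # f(x) = E[Y | X = x] = sum_y y p(y | x)

lhs = np.sum(p_x * f)                  # E[E[Y | X]]
rhs = np.sum(p_y * ys)                 # E[Y]
print(np.isclose(lhs, rhs))            # True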
Exercise 3. Let 1 [W = 1] denote the random variable that takes the value 1 if
W = 1 and 0 otherwise. Show that E [1 [W = 1] Y ] = P (W = 1) E [Y | W = 1].
3 Identities for conditional expectations
Exercise 4. Show that the following identity is a special case of the Generalized
Adam’s Law:
E [E [Y | X, Z] | Z] = E [Y | Z] .
Proof. If we take g(x, z) = (x, z) and f (g(x, z)) = z in the generalized Adam’s
Law, we get the result. This is a form of Adam’s Law that’s often useful in prac-
tice.
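A numerical check of this special case (a sketch added for illustration; the joint PMF over (x, z, y) is random, not from the text):

# Check E[E[Y | X, Z] | Z] == E[Y | Z] on a random joint PMF p(x, z, y).
import numpy as np

rng = np.random.default_rng(1)
ys = np.array([0.0, 1.0, 3.0])

p = rng.random((2, 3, 3))      # p[x, z, y]: X takes 2 values, Z takes 3, Y takes 3
p /= p.sum()

p_xz = p.sum(axis=2)           # p(x, z)
p_z = p.sum(axis=(0, 1))       # p(z)

e_y_given_xz = (p / p_xz[:, :, None]) @ ys           # E[Y | X = x, Z = z], shape (2, 3)
e_y_given_z = (p.sum(axis=0) / p_z[:, None]) @ ys    # E[Y | Z = z], shape (3,)

p_x_given_z = p_xz / p_z[None, :]                    # p(x | z), shape (2, 3)
lhs = (p_x_given_z * e_y_given_xz).sum(axis=0)       # E[E[Y | X, Z] | Z = z]

print(np.allclose(lhs, e_y_given_z))                 # True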
Remark 5. Note that Corollary 1 speaks about correlation, but not independence!
For example, the residual Y − E [Y | X] may have more variance for some values
of X than others. Thus Y − E [Y | X] is generally not independent of X, even
though it is uncorrelated with every random variable of the form h(X).
Proof. We have
E[(f(X) − Y)^2] = E[(f(X) − E[Y | X] + E[Y | X] − Y)^2]
                = E[(f(X) − E[Y | X])^2] + E[(E[Y | X] − Y)^2],
where the cross term drops out because the residual E[Y | X] − Y is uncorrelated with
f(X) − E[Y | X], a random variable of the form h(X) (Corollary 1).
The second term in the last expression is independent of f , and the first term
in the last expression is clearly minimized by taking f (x) = E [Y | X = x].
As we’ll explain below, this theorem is what justifies calling E [Y | X] a pro-
jection.
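To see the “best approximation” property concretely, here is an illustrative snippet (a made-up joint PMF, not from the notes) comparing the mean squared error of f(x) = E[Y | X = x] against a couple of other candidate functions of X:

# Compare E[(f(X) - Y)^2] for the conditional expectation vs. other functions of x.
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ys = np.array([-1.0, 0.0, 4.0])
p_xy = np.array([[0.10, 0.05, 0.05],
                 [0.05, 0.20, 0.15],
                 [0.05, 0.05, 0.30]])   # joint PMF p(x, y), made up

p_x = p_xy.sum(axis=1)
e_y_given_x = (p_xy / p_x[:, None]) @ ys     # the projection f*(x) = E[Y | X = x]

def mse(f_vals):
    # E[(f(X) - Y)^2] = sum_{x,y} (f(x) - y)^2 p(x, y)
    return np.sum((f_vals[:, None] - ys[None, :]) ** 2 * p_xy)

candidates = {
    "E[Y | X = x]": e_y_given_x,
    "constant E[Y]": np.full(3, p_xy.sum(axis=0) @ ys),
    "identity f(x) = x": xs,
}
for name, f_vals in candidates.items():
    print(name, mse(f_vals))
# The first candidate gives the smallest mean squared error.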
Note that the two terms on the RHS are uncorrelated, by the projection interpre-
tation (Corollary 1). Since variance is additive for uncorrelated random variables
(i.e. if X and Y are uncorrelated, then Var(X + Y ) = Var(X) + Var(Y )), we get
the following theorem:
Theorem 5 (Variance decomposition with projection). For any random X ∈ X
and random variable Y ∈ R, we have
Var(Y) = Var(E[Y | X]) + Var(Y − E[Y | X]).
Fig. 1: This plot shows the sampled (x, y) pairs, along with the conditional expectation
and residual for each: (x, E[Y | X = x]) and (x, y − E[Y | X = x]).
We can get all the results in this section with trivial two-line derivations using
Adam’s Law and taking out what is known. Nevertheless, we highlight these
identities as some additional intuition builders.
One way to think about this is that for the purposes of computing E [XY ], we
only care about the randomness in Y that is predictable from X.
E[XY] = E[X(E[Y | X] + (Y − E[Y | X]))]
      = E[X E[Y | X]] + E[X(Y − E[Y | X])]
      = E[X E[Y | X]],
where the second term vanishes because the residual Y − E[Y | X] is uncorrelated with X
(the projection interpretation).
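A quick numerical sanity check of E[XY] = E[X E[Y | X]] (again an added sketch, using a random joint PMF):

# Check E[XY] == E[X E[Y | X]] on a random joint PMF.
import numpy as np

rng = np.random.default_rng(2)
xs = np.array([-1.0, 0.5, 2.0])
ys = np.array([0.0, 1.0, 3.0])

p_xy = rng.random((3, 3))
p_xy /= p_xy.sum()

p_x = p_xy.sum(axis=1)
e_y_given_x = (p_xy / p_x[:, None]) @ ys      # E[Y | X = x]

e_xy = np.sum(np.outer(xs, ys) * p_xy)        # E[XY]
e_x_proj = np.sum(xs * e_y_given_x * p_x)     # E[X E[Y | X]]
print(np.isclose(e_xy, e_x_proj))             # True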
Let’s put Theorem 6 in a slightly more general context and consider E [g(X)h(Y )].
Theorem 6 tells us that we get the same result if we replace h(Y ) by an approx-
imation to h(Y ), namely E [h(Y ) | g(X)]. By Theorem 4, this is actually the
best approximation for h(Y ) given g(X). Can we also get the same answer if
we replace h(Y ) by another approximation E [h(Y ) | X]? This approximation is
potentially better than E [h(Y ) | g(X)], since there may be more information in
X than in g(X). In the following Exercise, show that we get the same result even
if we plug in the better approximation:
Exercise 7. E[g(X)h(Y)] = E[g(X) E[h(Y) | X]]. (Hint: You can either use
the projection interpretation approach we used for the proof of Theorem 6, or it’s
basically a two-liner with the application of Adam’s Law and taking out what is
known.)
Proof. We have
E[g(X)h(Y)] = E[E[g(X)h(Y) | X]]       by Adam’s Law
            = E[g(X) E[h(Y) | X]]       by taking out what is known.
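For concreteness, the identity can also be checked numerically; in the sketch below (not from the original text) the functions g and h and the joint PMF are arbitrary choices:

# Check E[g(X) h(Y)] == E[g(X) E[h(Y) | X]] on a random joint PMF.
import numpy as np

rng = np.random.default_rng(3)
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([-1.0, 1.0, 4.0])
g = lambda x: np.sin(x)
h = lambda y: y ** 3

p_xy = rng.random((3, 3))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)

e_hY_given_x = (p_xy / p_x[:, None]) @ h(ys)            # E[h(Y) | X = x]
lhs = np.sum(np.outer(g(xs), h(ys)) * p_xy)             # E[g(X) h(Y)]
rhs = np.sum(g(xs) * e_hY_given_x * p_x)                # E[g(X) E[h(Y) | X]]
print(np.isclose(lhs, rhs))                             # True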
Exercise 10. [KBH19, Ch. 9 Exercise 29] If X and Y are random variables and
E [Y | X] = c, then show that X and Y are uncorrelated. (Hint: It’s sufficient to
show that Cov(X, Y ) = E [XY ] − E [X] E [Y ] = 0.)
Proof. We have:
E[XY] = E[E[XY | X]]         by Adam’s Law
      = E[X E[Y | X]]         by taking out what is known
      = c E[X],
E[Y] = E[E[Y | X]] = c,
and therefore Cov(X, Y) = E[XY] − E[X] E[Y] = c E[X] − c E[X] = 0.
Exercise 11. [KBH19, Ch. 9 Exercise 30] If X and Y are independent random
variables, then we know that E[Y | X] = E[Y], which is a constant. However, if
we only know that X and Y are uncorrelated, then E[Y | X] is not necessarily a
constant. Give an example of this. (Hint: Your job here is to come up with a joint
distribution of X and Y and show it has the required properties. There are many
ways to do this, so try to keep things simple. For example, you can define Y to be
a deterministic function of X and restrict X to a small set.)
Solution 2. Take (X, Y ) ∈ {(−1, 1), (0, 0), (1, 1)} with equal probability. Then
the covariance of X and Y is 0 and E [Y | X = x] = 1 [x ∈ {−1, 1}].
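Checking this example numerically (an added snippet; the three-point distribution is exactly the one in Solution 2):

# Verify Cov(X, Y) = 0 for (X, Y) uniform on {(-1, 1), (0, 0), (1, 1)}.
import numpy as np

pairs = np.array([[-1.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
p = np.array([1 / 3, 1 / 3, 1 / 3])
x, y = pairs[:, 0], pairs[:, 1]

ex, ey = np.sum(p * x), np.sum(p * y)
cov = np.sum(p * x * y) - ex * ey
print(cov)   # 0.0, so X and Y are uncorrelated

# E[Y | X = x] equals 1 for x = ±1 and 0 for x = 0 (Y is a deterministic
# function of X here), so it is not constant and X, Y are not independent.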
Exercise 12. We know that if X and Y are independent random variables, then
E [Y | X] = E [Y ]. But if there’s another random variable W in the picture, can
we also say that E [Y | X, W ] = E [Y | W ]? Is there a rule that we might call
“Drop what is independent from the conditioning”?
4 Conditional variance
We could define Var (Y |X) using the same approach that we used to define E[Y |
X]. Let g(x) = Var(Y | X = x), where Var (Y | X = x) is the variance of the
conditional distribution Y | X = x, which is just a number. And then define
Var(Y | X) = g(X). We can also just define conditional variance directly in
terms of conditional expectations:
Var(Y | X) = E[(Y − E[Y | X])^2 | X]
           = E[Y^2 | X] − (E[Y | X])^2.
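A small sketch (added here, with a made-up joint PMF) computing Var(Y | X = x) both ways and confirming they agree:

# Compute Var(Y | X = x) as E[(Y - E[Y|X=x])^2 | X=x] and as E[Y^2|X=x] - (E[Y|X=x])^2.
import numpy as np

ys = np.array([0.0, 1.0, 5.0])
p_xy = np.array([[0.15, 0.10, 0.05],
                 [0.10, 0.20, 0.40]])   # rows are values of X (made up)
p_x = p_xy.sum(axis=1)
p_y_given_x = p_xy / p_x[:, None]

e_y = p_y_given_x @ ys                           # E[Y | X = x]
e_y2 = p_y_given_x @ ys ** 2                     # E[Y^2 | X = x]

var1 = (p_y_given_x * (ys[None, :] - e_y[:, None]) ** 2).sum(axis=1)
var2 = e_y2 - e_y ** 2
print(np.allclose(var1, var2))                   # True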
Theorem (Eve’s Law). If X and Y are random variables on the same probability
space, then
Var(Y ) = E [Var(Y | X)] + Var(E [Y | X]).
On the RHS, if we write E for expectation and V for variance, the sequence of
operations is EVVE. That’s why this is sometimes called “Eve’s law”. Not a bad
way to remember this important decomposition.
Let’s interpret this theorem in the case that X takes values in a finite set X =
{x1 , . . . , xN }. We can call Var (Y | X = x) the within group variance for the
group X = x, and so E [Var (Y | X)] is the [weighted] average of the within
group variances. This is clear just from writing out the expectation:
E[Var(Y | X)] = \sum_{x \in \mathcal{X}} p(x) Var(Y | X = x).
We can call Var (E [Y | X]) the between group variance, where each group
x is represented by the single number E [Y | X = x]. If the groups have equal
probabilities p(x1 ) = · · · = p(xN ), then Var (E [Y | X]) is just the variance of the
numbers E [Y | X = x1 ] , . . . , E [Y | X = xN ]. More generally, Var (E [Y | X])
is the variance of the distribution described by the following table:
Probability    Value
p(x_1)         E[Y | X = x_1]
...            ...
p(x_N)         E[Y | X = x_N]
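As a numerical illustration of Eve’s Law (a sketch I’m adding, with a random joint PMF), the weighted within-group variances plus the variance of the group means recover Var(Y):

# Check Var(Y) == E[Var(Y | X)] + Var(E[Y | X]) for a random joint PMF.
import numpy as np

rng = np.random.default_rng(4)
ys = np.array([-2.0, 0.0, 1.0, 3.0])

p_xy = rng.random((3, 4))
p_xy /= p_xy.sum()
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]

e_y_given_x = p_y_given_x @ ys
var_y_given_x = p_y_given_x @ ys ** 2 - e_y_given_x ** 2

within = np.sum(p_x * var_y_given_x)                         # E[Var(Y | X)]
between = np.sum(p_x * e_y_given_x ** 2) - np.sum(p_x * e_y_given_x) ** 2
var_y = np.sum(p_y * ys ** 2) - np.sum(p_y * ys) ** 2        # Var(Y)
print(np.isclose(within + between, var_y))                   # True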
In the proof of Eve’s Law, the first term on the RHS is computed as
E[Var(Y | X)] = E[E[Y^2 | X] − (E[Y | X])^2]
              = E[E[Y^2 | X]] − E[(E[Y | X])^2]      by linearity
              = EY^2 − E[(E[Y | X])^2]               by Adam’s Law      (1)
Remark 8. It’s tempting to say that getting new information about Y from ob-
serving X = x would decrease the variance. That is, it seems reasonable that
Var(Y | X = x) ≤ Var(Y ) for all x. But this is not the case. For example, we
could have Var (Y | X = x) very large for a particular x, but if X = x is very
rare, the overall variance of Y could still be much smaller. On the other hand, it is
true that Var(Y | X = x) ≤ Var(Y) on average over X. More precisely, E[Var(Y | X)] ≤ Var(Y).
This follows immediately from Eve’s Law (Theorem 4.1) and the fact that Var(E[Y | X]) ≥ 0.
We can equate Eve’s Law with our variance decomposition in terms of projection
(Theorem 5) to get the following theorem:
Theorem 7. If X and Y are random variables on the same probability space, then
Var(Y − E[Y | X]) = EY^2 − E[(E[Y | X])^2]
                  = E[Var(Y | X)],
where the last equality is from Equation (1) in the proof of Eve’s Law above.
Exercise 15. Suppose A ∈ A has probability mass function π(a), for a ∈ A =
{1, . . . , k}, and R ∈ R is an independent random element. Show that
E[f(R, A) g(A)] = \frac{1}{k} \sum_{a=1}^{k} E[f(R, a)] π(a) g(a).
Proof. In the context in which this exercise arises, we start with the RHS and need
to “discover” the LHS. So starting with the LHS would be a “guess and check”
approach. We’ll start with the RHS:
\frac{1}{k} \sum_{a=1}^{k} E[f(R, a)] π(a) g(a)
    = \frac{1}{k} \sum_{a=1}^{k} π(a) E[f(R, a) g(a)]                                    since g(a) is constant
    = \frac{1}{k} \sum_{a=1}^{k} π(a) \underbrace{E[f(R, A) g(A) | A = a]}_{= h(a)}      since R and A are independent
    = E[h(A)]
    = E[E[f(R, A) g(A) | A]]
    = E[f(R, A) g(A)].
As we get more comfortable with conditional expectations, we can skip the step
involving h(a).
Exercise 16. Show that covariance is not affected by changing the means of the
random variables. To be precise, if X′ = X + c_1 and Y′ = Y + c_2 for constants
c_1, c_2 ∈ R, then Cov(X′, Y′) = Cov(X, Y).
Exercise 17. Use the rules we developed above to show that the two expressions
for Cov (X, Y | Z) are equivalent.
and
Cov(E[X | Z], E[Y | Z]) = E[E[X | Z] E[Y | Z]] − E[E[X | Z]] E[E[Y | Z]]      by definition of covariance
                        = E[E[X | Z] E[Y | Z]] − EX EY                         by Adam’s Law.
Adding these expressions together, we get
E [XY ] − EXEY = Cov (X, Y ) .
References
[CT06]   Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (Wiley
         Series in Telecommunications and Signal Processing), Wiley-Interscience, USA, 2006.
[KBH19]  Joseph K. Blitzstein and Jessica Hwang, Introduction to Probability, 2nd ed.,
         Chapman and Hall/CRC, 2019.
[Wik20]  Wikipedia contributors, Conditional expectation — Wikipedia, The Free Encyclopedia,
         2020, [Online; accessed 31-December-2020].
where the second step is the law of total probability, and the third step follows
because once we know g(X) = w, we know that f (g(X)) = f (w), and so condi-
tioning on the value of f (g(X)) gives no additional information.
Let h(w) = E [Y | g(X) = w], to ease some calculations below. Then
E[Y | f(g(X)) = z] = \sum_{y \in \mathcal{Y}} y P(Y = y | f(g(X)) = z)
    = \sum_{y \in \mathcal{Y}} y \sum_{w \in \mathcal{G}} P(Y = y | g(X) = w) P(g(X) = w | f(g(X)) = z)
    = \sum_{w \in \mathcal{G}} P(g(X) = w | f(g(X)) = z) \sum_{y \in \mathcal{Y}} y P(Y = y | g(X) = w)
    = \sum_{w \in \mathcal{G}} P(g(X) = w | f(g(X)) = z) E[Y | g(X) = w]
    = \sum_{w \in \mathcal{G}} P(g(X) = w | f(g(X)) = z) h(w)
    = E[h(g(X)) | f(g(X)) = z]
    = E[E[Y | g(X)] | f(g(X)) = z].
I claim we’re done at this point. To make it clear, suppose we let k(z) = E[Y | f(g(X)) = z]
and r(z) = E[E[Y | g(X)] | f(g(X)) = z]. Then the calculations above have
shown that k(z) = r(z) for all possible z. So the equality certainly holds when
we plug in the random variable f(g(X)) for z: k(f(g(X))) = r(f(g(X))). And
then by the definition of conditional expectation, we conclude that
E[Y | f(g(X))] = E[E[Y | g(X)] | f(g(X))].
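As a final sanity check (my own finite example; the functions g and f below are arbitrary coarsenings, not from the notes), the Generalized Adam’s Law can be verified numerically by grouping the values of X:

# Check E[E[Y | g(X)] | f(g(X))] == E[Y | f(g(X))] on a random joint PMF.
import numpy as np

rng = np.random.default_rng(5)
xs = np.arange(6)
ys = np.array([0.0, 1.0, 4.0])
g = lambda x: x % 3          # coarsens X
f = lambda w: w % 2          # coarsens g(X) further

p_xy = rng.random((6, 3))
p_xy /= p_xy.sum()           # joint PMF p(x, y)
p_x = p_xy.sum(axis=1)

def cond_exp(values, labels):
    # If values[x] = E[Q | X = x] for some quantity Q, this returns, for each label
    # value lab, E[Q | labels(X) = lab] = sum_{x : labels(x) = lab} p(x | lab) values[x].
    return {lab: np.sum(p_x[labels == lab] * values[labels == lab]) / p_x[labels == lab].sum()
            for lab in np.unique(labels)}

e_y_given_x = (p_xy / p_x[:, None]) @ ys      # E[Y | X = x]
e_y_given_g = cond_exp(e_y_given_x, g(xs))    # w -> E[Y | g(X) = w]
h = np.array([e_y_given_g[w] for w in g(xs)]) # E[Y | g(X)] evaluated at each x

lhs = cond_exp(h, f(g(xs)))                   # z -> E[E[Y | g(X)] | f(g(X)) = z]
rhs = cond_exp(e_y_given_x, f(g(xs)))         # z -> E[Y | f(g(X)) = z]
print(all(np.isclose(lhs[z], rhs[z]) for z in lhs))   # True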
Note: It’s not true that E [E [Y | g(X)] | h(X)] = E [Y | h(X)]. Can you find
the spot in the proof where we can’t just replace f (g(X)) by h(X)?
Remark 9. If we take g(x, z) = (x, z) and f(g(x, z)) = z in the generalized
Adam’s Law, we get that
E[E[Y | X, Z] | Z] = E[Y | Z],
which is the identity from Exercise 4.