
Conditional Expectations: Review and Lots of Examples
David S. Rosenberg
Abstract
The goal of this document is to get the reader to some level of proficiency in calculating
and manipulating conditional expectations and variances. To avoid making this into a
class on probability theory, we only define conditional expectation for the simplest case
of random variables taking values in a finite set. However, all the properties of conditional
expectations we give will hold in full generality (as specified in the text), so the practice
in manipulating these expressions will generalize to arbitrary settings with conditional
expectations. We do provide proofs for various identities, but not for the sake of rigor –
the proofs themselves give the opportunity to practice exactly the types of manipulations
and calculations that are the point of this document. For a small additional challenge, you
can consider each theorem statement an exercise to complete for additional practice.

1 Basic Expectation
Let Y ∈ Y ⊂ R be a random variable – informally, Y is a random number.
In this document, we’ll discuss taking the expectation of Y with respect to many
different distributions. For simplicity, let’s suppose 𝒴 is a finite set, and let the random
variable Y have a distribution described by the probability mass function (PMF)
p(y). Then we’ll define the expectation of Y as

EY = Σ_{y∈𝒴} y p(y).

You can think of EY as a weighted average of the different values that Y can take,
where the weights are the probabilities of each value.
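If you like to check such definitions numerically, here is a minimal Python sketch; the fair six-sided die PMF below is an illustrative choice, not an example from the text.

# PMF of a fair six-sided die: p(y) = 1/6 for y = 1, ..., 6
pmf = {y: 1 / 6 for y in range(1, 7)}

# EY = sum over y of y * p(y)
EY = sum(y * p for y, p in pmf.items())
print(EY)  # 3.5, the weighted average of 1..6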
Remark 1. Although we usually write expectations in terms of random variables,
it’s best to think of expectations as properties of distributions. Notice that the


expression on the right hand side (RHS) above makes no reference to any partic-
ular random variable. In fact, all random variables with the same PMF p(y) have
the same expectation. So whenever we see an expectation operator, we should be
thinking about the distribution it’s acting on.
Remark 2. Although it’s obvious once you think about it, it’s worth noting explic-
itly that expectations do not make sense for everything that’s random. There is no
“expectation” for a random vegetable, because you cannot take a weighted average
of vegetables. The generic term for something that’s random is random element.
The specific term for a real-valued random element is random variable. We’ll
only be talking about expectations of random variables. Expectations of random
vectors, i.e. vectors of random variables, are a straightforward extension.
Exercise 1. Let X ∈ 𝒳 be a random element with PMF p(x), and suppose we have a
function f : 𝒳 → ℝ. Show that Ef(X) = Σ_{x∈𝒳} f(x) p(x).
Solution 1. Let Y = f(X), and let 𝒴 = {f(x) : x ∈ 𝒳}. The key step is that the
probability of any particular value y that Y may take is the sum of the probabilities
of the x’s that lead to that value of y. That is, p(y) = Σ_{x : f(x)=y} p(x). Putting it
together, we get

Ef(X) = EY = Σ_{y∈𝒴} y p(y)
           = Σ_{y∈𝒴} y Σ_{x : f(x)=y} p(x)
           = Σ_{y∈𝒴} Σ_{x : f(x)=y} f(x) p(x)
           = Σ_{x∈𝒳} f(x) p(x).
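The same identity is easy to verify numerically. The sketch below uses an illustrative PMF for X and the function f(x) = x² (both arbitrary choices), and checks that summing f(x)p(x) directly agrees with first forming the PMF of Y = f(X) and then computing EY.

from collections import defaultdict

p_x = {-1: 0.25, 0: 0.25, 1: 0.5}   # illustrative PMF for X
f = lambda x: x ** 2                 # any function f: X -> R

# Direct formula: E f(X) = sum_x f(x) p(x)
lhs = sum(f(x) * px for x, px in p_x.items())

# Push-forward: build the PMF of Y = f(X), then take EY
p_y = defaultdict(float)
for x, px in p_x.items():
    p_y[f(x)] += px                  # p(y) = sum over x with f(x) = y of p(x)
rhs = sum(y * py for y, py in p_y.items())

print(lhs, rhs)                      # both 0.75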

2 Conditional Expectation
Let’s now introduce another random element X ∈ 𝒳 into the mix. For simplicity,
we assume that 𝒳 is a finite set, and let p(x, y) be the joint PMF for X and Y.
Recall that the conditional distribution of Y given X = x is represented by the
conditional PMF

p(y | x) = p(x, y) / p(x).

For each fixed x, p(y | x) represents a distribution on 𝒴. You can verify this claim
by checking that p(y | x) ≥ 0 for all y ∈ 𝒴 and Σ_{y∈𝒴} p(y | x) = 1.
Definition 1. The conditional expectation of Y given X = x, denoted by E [Y | X = x]
and occasionally by E [Y | x], is the expectation of the distribution represented by
p(y | x). That is,

E[Y | X = x] = Σ_{y∈𝒴} y p(y | x).

As x changes, the conditional distribution of Y given X = x typically changes
as well, and so might the conditional expectation of Y given X = x. So we can
view E [Y | X = x] as a function of x. To emphasize this, let’s define the function
f : X → R such that f (x) = E [Y | X = x]. Note that there is nothing random
about this function: the same x always gives us the same f (x) as output. We can
now define E [Y | X]:
Definition 2. We define the conditional expectation of Y given X, denoted
E [Y | X], as the random variable f (X), where f (x) = E [Y | X = x].
In other words, E [Y | X] is what we get when we plug in the random vari-
able X to the deterministic function f (x). Since X is random, f (X) and thus
E [Y | X] are themselves random variables.
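To make the “random variable f(X)” view concrete, here is a small Python sketch; the joint PMF p(x, y) below is illustrative, not taken from the text. It computes f(x) = E[Y | X = x] for each x and then tabulates the distribution of the random variable E[Y | X].

# Illustrative joint PMF p(x, y); x in {0, 1}, y in {0, 1, 2}
p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

xs = {x for x, _ in p_xy}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}

# f(x) = E[Y | X = x] = sum_y y * p(y | x), with p(y | x) = p(x, y) / p(x)
f = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in xs}
print(f)  # {0: 1.0, 1: 0.666...}

# E[Y | X] is the random variable f(X): it takes value f(x) with probability p(x)
dist_of_cond_exp = {f[x]: p_x[x] for x in xs}
print(dist_of_cond_exp)  # {1.0: 0.4, 0.666...: 0.6}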
Remark 3. There’s often a temptation to write f (X) = E [Y | X = X]. Avoid
this. One of the issues is that it’s ambiguous: you might interpret it as conditioning
on the event that X = X, which always occurs. It’s an unfortunate notational
awkwardness that one learns to work around.
Remark 4. We can generalize conditional expectation to condition on multiple
random elements in the obvious way. For example, if f (x, z) = E [Y | X = x, Z = z]
then E [Y | X, Z] = f (X, Z).
Exercise 2. Show that if X ∈ 𝒳 has PMF p(x), then E[h(X) E[Y | X]] = Σ_{x∈𝒳} p(x) h(x) E[Y | X = x].

Proof. First, let f(x) = E[Y | X = x]. Then

E[h(X) E[Y | X]] = E[h(X) f(X)]                     definition of E[Y | X]
                 = Σ_{x∈𝒳} p(x) h(x) f(x)           Exercise 1
                 = Σ_{x∈𝒳} p(x) h(x) E[Y | X = x].

3 Identities for conditional expectations


There are a lot of “rules” for manipulating conditional expectations, and the hope
of this document is to get you comfortable with all the main ones. Here we list the
rules, and in the next section we’ll give some derivations and discussion. We’ll
give a short-hand expression for each identity, mostly borrowed from [KBH19,
Ch 9] and [Wik20], so we can refer to them easily in derivations.

• Adam’s Law / Law of Iterated Expectation:

– Simple: E [E [Y | X]] = EY
– More general: E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))] for any
f and g with compatible domains and ranges.

• Independence: E [Y | X] = E [Y ] if X and Y are independent.

• Taking out what is known¹: E [h(X)Z | X] = h(X)E [Z | X].

• Linearity: E [aX + bY | Z] = aE [X | Z] + bE [Y | Z], for any a, b ∈ R.

• Projection interpretation: E [(Y − E [Y | X])h(X)] = 0 for any function h : 𝒳 → ℝ.

• Keeping just what is needed: E [XY ] = E [XE [Y | X]] for X, Y ∈ R.

3.1 Law of Iterated Expectations


Since E [Y | X] is a random variable, it has a distribution. What is the expectation
of this distribution? In math, the expectation of E [Y | X] is E [E [Y | X]], of
course. The inner expectation is over Y , and the outer expectation is over X.
To clarify, this could be written as E_X[E_Y[Y | X]], though this is rarely done in
practice unless we need to specify the distributions that the variables are referring
to, as in E_{X∼p1(x)} E_{Y∼p2(y|x)}[Y | X].
¹This is the conditional version of E[cX] = cE[X], for any constant c ∈ ℝ. But that is an
equation of two numbers, while the conditional version is an equality of random variables. The
idea is that inside the conditional expectation, we think of X as being constant, and thus h(X) is
also constant. As such, we can pull h(X) out of the expectation. Once it’s on the outside of the
expectation, h(X) is random again.

Just like all other [unconditional] expectations, E[E[Y | X]] is just a number:
it’s not random. Let’s expand out the definitions a bit. Let f(x) = E[Y | X = x].
Then

E[E[Y | X]] = E[f(X)]
            = Σ_{x∈𝒳} p(x) f(x)
            = Σ_{x∈𝒳} p(x) E[Y | X = x].

In this last expression, we see that we’re taking the conditional expectation of
Y for each possible setting of X, and then taking the weighted average of them
where the weights are the probabilities that X takes each value. The next theorem
tells us that this is just another way to calculate EY .
Theorem 1 (Law of Iterated Expectations, “Adam’s Law”). For any random ele-
ment X ∈ X and random variable Y ∈ Y ⊂ R,

E [E [Y | X]] = EY.

Proof. We’ll prove this for the case of finite X and Y, but the result holds for
arbitrary random variables. As above, let f (x) = E [Y | X = x]. Then
E[E[Y | X]] = E[f(X)] = Σ_{x∈𝒳} p(x) f(x)
            = Σ_{x∈𝒳} p(x) E[Y | X = x]
            = Σ_{x∈𝒳} p(x) Σ_{y∈𝒴} p(y | x) y
            = Σ_{y∈𝒴} y Σ_{x∈𝒳} p(x, y)
            = Σ_{y∈𝒴} y p(y)
            = EY.
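The proof is easy to mirror numerically. A minimal sketch with an illustrative joint PMF (an arbitrary choice, not from the text): it computes E[E[Y | X]] by weighting E[Y | X = x] by p(x), and checks that this equals EY computed from the marginal of Y.

p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

xs = {x for x, _ in p_xy}
ys = {y for _, y in p_xy}
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in xs}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in ys}

cond_exp = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in xs}

lhs = sum(p_x[x] * cond_exp[x] for x in xs)   # E[E[Y | X]]
rhs = sum(y * p_y[y] for y in ys)             # EY
print(lhs, rhs)                               # both 0.8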

Exercise 3. Let 1 [W = 1] denote the random variable that takes the value 1 if
W = 1 and 0 otherwise. Show that E [1 [W = 1] Y ] = P (W = 1) E [Y | W = 1].

Proof. Let Z = 1 [W = 1]. Then

E[1[W = 1] Y] = E[ZY] = E[E[ZY | Z]]                                        by Adam’s Law
              = E[Z E[Y | Z]]                                               taking out what is known
              = P(Z = 1) · 1 · E[Y | Z = 1] + P(Z = 0) · 0 · E[Y | Z = 0]   def of expectation
              = P(W = 1) E[Y | W = 1]                                       def of Z.
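A quick Monte Carlo sanity check of this identity, with an arbitrary illustrative joint distribution for (W, Y), where W ∈ {0, 1} and the distribution of Y depends on W:

import random

random.seed(0)
n = 200_000

def sample_wy():
    # Illustrative joint distribution: W ~ Bernoulli(0.3); Y | W = w ~ Uniform{1, ..., 3 + 2w}
    w = 1 if random.random() < 0.3 else 0
    y = random.randint(1, 3 + 2 * w)
    return w, y

samples = [sample_wy() for _ in range(n)]

lhs = sum((w == 1) * y for w, y in samples) / n                 # E[1[W = 1] Y]
p_w1 = sum(w == 1 for w, _ in samples) / n                      # P(W = 1)
e_y_given_w1 = (sum(y for w, y in samples if w == 1)
                / sum(w == 1 for w, _ in samples))              # E[Y | W = 1]
print(lhs, p_w1 * e_y_given_w1)                                 # both close to 0.3 * 3 = 0.9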

3.1.1 Information processing


We’ll show later that E[Y | X] is the best prediction we can make for Y given X
(in terms of mean squared error). What if we have some function f : 𝒳 → 𝒳′ and
we consider E[Y | f(X)]? Does f(X) have more, less, or the same information
about Y as X does? Well, it could have much less, such as if f(x) ≡ 0 for
all x. If f is injective (i.e. if x ≠ x′ then f(x) ≠ f(x′)), then f(X) has
the same information as X, since we can always recover X from f(X) by X =
f⁻¹(f(X)). So in some sense, f(X) has at most as much information² as X.
So generally speaking, E[Y | f(X)] will not be as good a prediction of Y as
E[Y | X].
We’ll now discuss the more general form of Adam’s Law presented above.
Suppose we have an information processing chain: x ↦ g(x) ↦ f(g(x)). We
can think of g(X) as a “processed” or “coarsened” version of X. So E[Y | g(X)]
is our best approximation for Y given g(X). Suppose we have f(g(X)), which
is an even more processed version of X, and we want the best prediction for
E[Y | g(X)] given only f(g(X)). It turns out that this prediction is also the best
prediction for Y given only f(g(X)). This claim is formalized in the following
theorem:
Theorem 2 (Generalized Adam’s Law). We have

E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))]

for any f and g with compatible domains and ranges.


See Theorem 9 in the Appendix for a proof.
²These notions are formalized in information theory by the data processing inequality (see
e.g. [CT06, Chapter 2]), but we’re just looking for intuition here, so we don’t need to be formal.

Exercise 4. Show that the following identity is a special case of the Generalized
Adam’s Law:
E [E [Y | X, Z] | Z] = E [Y | Z] .
Proof. If we take g(x, z) = (x, z) and f (g(x, z)) = z in the generalized Adam’s
Law, we get the result. This is a form of Adam’s Law that’s often useful in prac-
tice.

3.2 Projection interpretation


As exercises in using our other identities, in this section we’ll prove the “projec-
tion interpretation” and the fact that E [Y | X] gives the best possible prediction
for Y based only on X. We’ll also discuss why this allows us to characterize
E [Y | X] as a projection of the random variable Y onto the space of random vari-
ables that depend only on X.

3.2.1 What we can say about residuals


If we think of E [Y | X] as a prediction for Y given X, then Y − E [Y | X] is the
residual of that prediction. The next theorem shows that the residual for E [Y | X]
is “orthogonal” to every random variable of the form h(X). The connection be-
tween the theorem and the notion of orthogonality is explained in Section 3.2.4. In
the corollary that follows, we’ll relate orthogonality to covariance and correlation.
Theorem 3 (Projection interpretation). For any h : 𝒳 → ℝ, E[(Y − E[Y | X]) h(X)] = 0.
Proof. We have
E [(Y − E [Y | X])h(X)]
= E [Y h(X)] − E [E [Y | X] h(X)] by linearity
= E [Y h(X)] − E [E [Y h(X) | X]] taking out what is known (in reverse)
= E [Y h(X)] − E [Y h(X)] Adam’s Law
= 0
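Here is a small exact check of this orthogonality, with an illustrative joint PMF and an arbitrary test function h (both assumptions for the sketch, not examples from the text):

p_xy = {(0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
        (1, 0): 0.30, (1, 1): 0.20, (1, 2): 0.10}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in {0, 1}}
cond_exp = {x: sum(y * p for (xx, y), p in p_xy.items() if xx == x) / p_x[x] for x in {0, 1}}

h = lambda x: 3 * x - 7        # an arbitrary function of x

# E[(Y - E[Y | X]) h(X)] computed by summing over the joint PMF
val = sum((y - cond_exp[x]) * h(x) * p for (x, y), p in p_xy.items())
print(val)                      # 0 (up to floating point error)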

Definition 3. The covariance of random variables X and Y is defined by Cov(X, Y) =
E[(X − EX)(Y − EY)] = E[XY] − EX EY. If Cov(X, Y) = 0, then we say X
and Y are uncorrelated.

Corollary 1. The residual Y − E[Y | X] and h(X) are uncorrelated (i.e. have
covariance 0) for every function h : 𝒳 → ℝ.

Proof. Note that E[Y − E[Y | X]] = EY − E[E[Y | X]] = 0 by linearity and
Adam’s Law. So

Cov(Y − E[Y | X], h(X)) = E[(Y − E[Y | X]) h(X)] − E[Y − E[Y | X]] E[h(X)]
                        = E[(Y − E[Y | X]) h(X)]
                        = 0,

where the second equality uses E[Y − E[Y | X]] = 0 and the last equality is by Theorem 3.

Remark 5. Note that Corollary 1 speaks about correlation, but not independence!
For example, the residual Y − E [Y | X] may have more variance for some values
of X than others. Thus Y − E [Y | X] is generally not independent of X, even
though it is uncorrelated with every random variable of the form h(X).

Exercise 5. Following the remark above, can we also say that Y − E[Y | X] is
uncorrelated with X? Why or why not?

Solution. X is not necessarily real-valued, and covariance and correlation are
defined specifically for random variables, which are real-valued by definition. If
X is a real-valued random element, then Y − E [Y | X] and X are uncorrelated.
This would be a special case of the original comment, taking h(x) = x. If X is
not real-valued, then the covariance and correlation with X are not defined. Note
that independence is defined for any types of random elements. So it’s reasonable
to ask whether Y −E [Y | X] and X are independent. As noted above, the general
answer is no.

3.2.2 Conditional expectation gives the best prediction


We now use Theorem 3 to prove that conditional expectation gives the best possi-
ble prediction of Y based on X.

Theorem 4 (Conditional expectation minimizes MSE). Suppose we have a random
element X ∈ 𝒳 and random variable Y ∈ ℝ. Let g(x) = E[Y | X = x]. Then

g = argmin_f E[(Y − f(X))²],

where the minimum is over all functions f : 𝒳 → ℝ.

Proof. We have

E[(f(X) − Y)²]
= E[(f(X) − E[Y | X] + E[Y | X] − Y)²]
= E[(f(X) − E[Y | X])²] + E[(E[Y | X] − Y)²]
  + 2 E[(f(X) − E[Y | X])(E[Y | X] − Y)]
= E[(f(X) − E[Y | X])²] + E[(E[Y | X] − Y)²],

where the cross term is 0 by the projection interpretation (Theorem 3), since
f(X) − E[Y | X] is a function of X and E[Y | X] − Y is the negative of the residual.

The second term in the last expression is independent of f, and the first term
in the last expression is clearly minimized by taking f(x) = E[Y | X = x].
As we’ll explain below, this theorem is what justifies calling E [Y | X] a pro-
jection.
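To see the theorem in action, here is a small Monte Carlo sketch; the distribution and the competing predictors are illustrative assumptions. It estimates E[(Y − f(X))²] for the conditional-mean predictor and for a couple of other functions of X, and the conditional mean should come out smallest.

import math
import random

random.seed(0)
n = 100_000

# Illustrative model: X ~ Uniform(0, 2*pi), Y | X = x ~ Normal(sin(x), 0.5^2),
# so E[Y | X = x] = sin(x).
xs = [random.uniform(0, 2 * math.pi) for _ in range(n)]
ys = [math.sin(x) + random.gauss(0, 0.5) for x in xs]

predictors = {
    "E[Y|X] = sin(x)": math.sin,
    "zero":            lambda x: 0.0,
    "linear":          lambda x: 1 - x / math.pi,
}

for name, f in predictors.items():
    mse = sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / n
    print(name, round(mse, 3))
# The conditional mean achieves MSE near 0.25 = Var(Y | X); the others are larger.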

3.2.3 A variance decomposition


Sometimes it’s helpful to think of decomposing Y as

Y = E[Y | X] + (Y − E[Y | X]),

where the first term is the best prediction for Y given X and the second term is the residual.

Note that the two terms on the RHS are uncorrelated, by the projection interpre-
tation (Corollary 1). Since variance is additive for uncorrelated random variables
(i.e. if X and Y are uncorrelated, then Var(X + Y ) = Var(X) + Var(Y )), we get
the following theorem:
Theorem 5 (Variance decomposition with projection). For any random X ∈ X
and random variable Y ∈ R, we have

Var (Y ) = Var (Y − E [Y | X]) + Var (E [Y | X]) .


Remark 6. Theorem 4 tells us that E [Y | X] is the best approximation of Y we can
get from X. We can also think of E [Y | X] as a “less random” version of Y , since
Var (E [Y | X]) ≤ Var(Y ) [this follows immediately from the previous Theorem
since variance is always ≥ 0]. We can say that E[Y | X] only keeps the randomness
in Y that is predictable from X. Why do we say this? E[Y | X] is a deterministic
function of X, so there’s no other source of randomness in E[Y | X].

3.2.4 [Optional] Why do we call this the “projection interpretation”?


One can consider the space of all random variables with finite variance as an inner
product space with inner product given by ⟨X, Y⟩ = E[XY] and norm given by
‖Y‖² = ⟨Y, Y⟩ = EY². A random variable S′ is called a projection (or L²-projection)
of Y onto a set 𝒮 if S′ ∈ 𝒮 and

E[(Y − S′)²] ≤ E[(Y − S)²]  for all S ∈ 𝒮.

In words, S′ is the projection of Y onto 𝒮 if it is the best approximation of Y in
𝒮 in terms of mean squared error (MSE). In Theorem 4 above we exactly proved
that E [Y | X] is the function of X that has the smallest possible MSE for pre-
dicting Y . Thus E [Y | X] is the projection of Y onto the set of random variables
{h(X) | h is any real-valued function}.
Remark 7. The projection interpretation gives another way to think about the gen-
eralized Adam’s Law: E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))] for any f
and g with compatible domains and ranges. We can think of the LHS as a se-
quence of two projections, while the RHS is a single projection. Adam’s Law
says they’re equivalent. In more detail, E [Y | g(X)] is the projection of Y onto
{h(g(X)) | ∀h}, the set of all functions of g(X), and E [E [Y | g(X)] | f (g(X))]
is the projection of E [Y | g(X)] onto {h(f (g(X))) | ∀h}, the set of all functions
of f (g(X)). Note that the second set of functions is a subset of the first, i.e.
{h(f (g(X))) | ∀h} ⊆ {h(g(X)) | ∀h}, since f (·) may discard information from
g(X). So Adam’s Law is saying that if we project onto a set and then project onto
a subset of the original set, then we get the same thing as if we had projected Y
directly onto the subset to begin with. Perhaps you can visualize this by pictur-
ing projecting a vector in R3 onto a 2-dimensional subspace, and then projecting
the projection onto a 1-dimensional subspace contained in the 2-dimensional sub-
space.

3.2.5 Empirical example of the variance decomposition


To illustrate some of the concepts of the variance decomposition, let’s consider
the following joint distribution of (X, Y ):
X ∼ Unif[0, 6]
Y | X = x ∼ N(6 + 1.3 sin(x), (0.3 + |3 − x|/4)²)

Fig. 1: This plot shows the sampled (x, y) pairs, along with the con-
ditional expectation and residual for each: (x, E [Y | X = x]) and
(x, y − E [Y | X = x]).

So given X = x, the best predictor for Y in MSE is E[Y | X = x] = 6 + 1.3 sin(x).
Figure 1 shows a sample of size n = 300 from this distribution. For
each sampled point (x, y), we also plot (x, E [Y | X = x]), which is the best pre-
diction of Y given just X = x, along with the residual of that prediction. Note
that the residuals hover around 0. Indeed, we should expect that since
E [Y − E [Y | X] | X = x]
= E [Y | X = x] − E [E [Y | X] | X = x] by linearity
= E [Y | X = x] − E [Y | X = x] E [1 | X = x] taking out what is known
= 0.
By the variance decomposition in terms of projection (Theorem 5), we know
Var (Y ) = Var (Y − E [Y | X]) + Var (E [Y | X]) . Using standard variance esti-
mators with our observed sample, we find estimates Var(Y − E[Y | X]) ≈ 0.53,
Var(E[Y | X]) ≈ 0.91, and Var(Y) ≈ 1.39, while the sum of the first two estimates is 1.43.
The gap between 1.43 and 1.39 is attributable to sampling error and vanishes as
we take the sample size n → ∞. In Figure 2 we show kernel density estimates of
each of the distributions in the variance decomposition.
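A short simulation in this spirit, as a sketch of how such variance estimates could be produced (the exact numbers will differ from run to run and from those quoted above):

import math
import random

random.seed(0)
n = 300

xs = [random.uniform(0, 6) for _ in range(n)]
cond_mean = [6 + 1.3 * math.sin(x) for x in xs]                      # E[Y | X = x]
ys = [m + random.gauss(0, 0.3 + abs(3 - x) / 4) for x, m in zip(xs, cond_mean)]
residuals = [y - m for y, m in zip(ys, cond_mean)]

def var(v):
    # plain sample variance estimator
    mean = sum(v) / len(v)
    return sum((t - mean) ** 2 for t in v) / (len(v) - 1)

print(var(ys))                          # estimate of Var(Y)
print(var(residuals), var(cond_mean))   # estimates of Var(Y - E[Y|X]) and Var(E[Y|X])
print(var(residuals) + var(cond_mean))  # close to Var(Y), up to sampling error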

Fig. 2: This plot shows estimates of the densities of Y, E[Y | X], and Y − E[Y | X].

3.3 Keeping just what is needed

We can get all the results in this section with trivial two line derivations using
Adam’s Law and taking out what is known. Nevertheless, we highlight these
identities as some additional intuition builders.

Theorem 6 (Keeping just what is needed). For any random variables X, Y ∈ ℝ,
E[XY] = E[X E[Y | X]].

One way to think about this is that for the purposes of computing E [XY ], we
only care about the randomness in Y that is predictable from X.

Proof. We can show this using the projection interpretation:


E [XY ] = E X E [Y | X] + Y − E [Y | X] 
| {z }
residual uncorrelated with X
= E [XE [Y | X]]
:0


+E [X (Y −E [Y | X])] Projection


 interpretation

= E [XE [Y | X]]

Exercise 6. Give an alternative proof of E[XY] = E[X E[Y | X]] using Adam’s
Law and Taking out what is known.

Let’s put Theorem 6 in a slightly more general context and consider E [g(X)h(Y )].
Theorem 6 tells us that we get the same result if we replace h(Y ) by an approx-
imation to h(Y ), namely E [h(Y ) | g(X)]. By Theorem 4, this is actually the
best approximation for h(Y ) given g(X). Can we also get the same answer if
we replace h(Y ) by another approximation E [h(Y ) | X]? This approximation is
potentially better than E [h(Y ) | g(X)], since there may be more information in
X than in g(X). In the following Exercise, show that we get the same result even
if we plug in the better approximation:

Exercise 7. E[g(X)h(Y)] = E[g(X) E[h(Y) | X]]. (Hint: You can either use
the projection interpretation approach we used for the proof of Theorem 6, or it’s
basically a two-liner with the application of Adam’s Law and Taking out what is
known.)

Exercise 8. Show that E[X E[Y | Z]] = E[E[X | Z] E[Y | Z]] = E[E[X | Z] Y].
(This property is sometimes referred to as “self-adjointness”.)

Proof. We have

E [XE [Y | Z]] = E [E (XE [Y | Z] | Z)] Adam’s Law


= E [E (X | Z) E [Y | Z]] Taking out what is known.

Exercise 9. Give a new proof of the “projection interpretation” (Theorem 3) using
“keeping just what is needed” (Theorem 6).

3.4 Intuition Builders and Extra Exercises


Suppose E [Y | X] = c is a constant. This means that whatever information we
learn from X, our best prediction for Y never changes. Does this mean that X and
Y are independent? No way! For example, the variance of Y can change dramat-
ically as a function of X, even if the expected value of Y is constant. However, if
X is a real-valued random variable, we can say something about the correlation
of X and Y .

Exercise 10. [KBH19, Ch. 9 Exercise 29] If X and Y are random variables and
E [Y | X] = c, then show that X and Y are uncorrelated. (Hint: It’s sufficient to
show that Cov(X, Y ) = E [XY ] − E [X] E [Y ] = 0.)

Proof. We have:

E[XY] = E[E[XY | X]]
      = E[X E[Y | X]]
      = c E[X]
E[Y] = E[E[Y | X]] = c
Cov(X, Y) = E[XY] − E[X] E[Y] = c E[X] − c E[X] = 0.

Exercise 11. [KBH19, Ch. 9 Exercise 30] If X and Y are independent random
variables, then we know that E[Y | X] = E[Y], which is a constant. However, if
we only know that X and Y are uncorrelated, then E[Y | X] is not necessarily a
constant. Give an example of this. (Hint: Your job here is to come up with a joint
distribution of X and Y and show it has the required properties. There are many
ways to do this, so try to keep things simple. For example, you can define Y to be
a deterministic function of X and keep the set of values of X small.)

Solution 2. Take (X, Y ) ∈ {(−1, 1), (0, 0), (1, 1)} with equal probability. Then
the covariance of X and Y is 0 and E [Y | X = x] = 1 [x ∈ {−1, 1}].

Exercise 12. We know that if X and Y are independent random variables, then
E [Y | X] = E [Y ]. But if there’s another random variable W in the picture, can
we also say that E [Y | X, W ] = E [Y | W ]? Is there a rule that we might call
“Drop what is independent from the conditioning”?

Solution 3. Nope! Consider X, W i.i.d. with uniform distributions on {0, 1}.
Suppose Y = 1[X ≠ W]. (We can also write that as Y = X ⊕ W, using the
exclusive-or operator.) Then X alone gives no information about Y. Similarly,
W alone gives no information about Y. Thus Y is independent of X and Y is
independent of W. But Y is not independent of (X, W). In any case E[Y | W] =
E[Y] = 0.5, while E[Y | X, W] = 1[X ≠ W].
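This example is small enough to check by brute-force enumeration over the four equally likely outcomes of (X, W); a minimal Python sketch:

from itertools import product
from fractions import Fraction

outcomes = list(product([0, 1], repeat=2))            # (x, w), each with probability 1/4
p = Fraction(1, 4)
y = lambda x, w: int(x != w)                          # Y = 1[X != W]

# E[Y], E[Y | X = x], E[Y | W = w]
ey = sum(p * y(x, w) for x, w in outcomes)
ey_given_x = {x0: sum(y(x, w) for x, w in outcomes if x == x0) * Fraction(1, 2)
              for x0 in [0, 1]}
ey_given_w = {w0: sum(y(x, w) for x, w in outcomes if w == w0) * Fraction(1, 2)
              for w0 in [0, 1]}
print(ey, ey_given_x, ey_given_w)   # all equal 1/2: X alone (or W alone) tells us nothing
print({(x, w): y(x, w) for x, w in outcomes})   # but E[Y | X, W] = 1[X != W] is not constant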
Exercise 13. [KBH19, Ch 9 Exercise 40] Let X1, X2, Y be random variables and
let A = E[Y | X1] and B = E[Y | X1, X2]. Show that Var(A) ≤ Var(B).
At first glance, this result may seem counter to intuition. Usually we think
that getting more information (e.g. X1 and X2 rather than just X1 ) should reduce
uncertainty, rather than increase it. Why would variance be increasing when we
add more information? The devil’s in the details. Here we’re not talking about the
uncertainty in our estimate for Y (that would be something like Var (Y | X1 , X2 )),
but rather how much our estimates for Y change as we get different random X’s.
The more information we can use to estimate Y , the more potential there is for
variation in those estimates.
Proof. We first note that A = E [B | X1 ], by the generalized Adam’s Law. By the
projection interpretation, B − E [B | X1 ] and A = E [B | X1 ] are uncorrelated.
Thus from
B = B − E [B | X1 ] + A
we get
Var (B) = Var (B − E [B | X1 ]) + Var (A) .
Since we always have variance ≥ 0, we must have Var(B) ≥ Var(A).

Exercise 14. [KBH19, Ch 9, Exercise 41] Show that for any X and Y,

E[Y | E[Y | X]] = E[Y | X].
Proof. Let f (x) = E [Y | X = x]. So f (X) is our best approximation to Y given
X. So
E [Y | E [Y | X]] = E [Y | f (X)]
= E [E [Y | X] | f (X)] generalized Adam’s
= E [f (X) | f (X)]
= f (X) Taking out what is known

4 Conditional variance
We could define Var (Y |X) using the same approach that we used to define E[Y |
X]. Let g(x) = Var(Y | X = x), where Var (Y | X = x) is the variance of the
conditional distribution Y | X = x, which is just a number. And then define
Var(Y | X) = g(X). We can also just define conditional variance directly in
terms of conditional expectations:

Definition 4. The conditional variance of Y given X is

Var(Y | X) = E[(Y − E[Y | X])² | X]
           = E[Y² | X] − (E[Y | X])².

4.1 Law of Total Variance


According to Wikipedia, the law of total variance goes by many names, including
the variance decomposition formula, conditional variance formula, law of iterated
variances, and Eve’s law.

Theorem (Eve’s Law). If X and Y are random variables on the same probability
space, then
Var(Y ) = E [Var(Y | X)] + Var(E [Y | X]).

On the RHS, if we write E for expectation and V for variance, the sequence of
operations is EVVE. That’s why this is sometimes called “Eve’s law”. Not a bad
way to remember this important decomposition.
Let’s interpret this theorem in the case that X takes values in a finite set X =
{x1 , . . . , xN }. We can call Var (Y | X = x) the within group variance for the
group X = x, and so E [Var (Y | X)] is the [weighted] average of the within
group variances. This is clear just from writing out the expectation:
E[Var(Y | X)] = Σ_{x∈𝒳} p(x) Var(Y | X = x).

We can call Var (E [Y | X]) the between group variance, where each group
x is represented by the single number E [Y | X = x]. If the groups have equal
probabilities p(x1 ) = · · · = p(xN ), then Var (E [Y | X]) is just the variance of the
numbers E [Y | X = x1 ] , . . . , E [Y | X = xN ]. More generally, Var (E [Y | X])
is the variance of the distribution described by the following table:

Probability    Value
p(x1)          E[Y | X = x1]
⋮              ⋮
p(xN)          E[Y | X = xN]
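A tiny worked example of this within/between reading of Eve’s Law, with illustrative numbers (two groups and their conditional distributions are arbitrary choices for the sketch):

# Illustrative two-group example: p(x), and the conditional distribution of Y in each group
p_x = {"x1": 0.4, "x2": 0.6}
p_y_given_x = {"x1": {0: 0.5, 2: 0.5},      # E[Y | x1] = 1, Var(Y | x1) = 1
               "x2": {4: 0.5, 6: 0.5}}      # E[Y | x2] = 5, Var(Y | x2) = 1

cond_mean = {x: sum(y * p for y, p in d.items()) for x, d in p_y_given_x.items()}
cond_var = {x: sum((y - cond_mean[x]) ** 2 * p for y, p in d.items())
            for x, d in p_y_given_x.items()}

within = sum(p_x[x] * cond_var[x] for x in p_x)                           # E[Var(Y | X)]
mean_of_means = sum(p_x[x] * cond_mean[x] for x in p_x)                   # EY
between = sum(p_x[x] * (cond_mean[x] - mean_of_means) ** 2 for x in p_x)  # Var(E[Y | X])

# Total variance of Y, computed directly from the marginal of Y
p_y = {}
for x, d in p_y_given_x.items():
    for y, p in d.items():
        p_y[y] = p_y.get(y, 0.0) + p_x[x] * p
ey = sum(y * p for y, p in p_y.items())
total = sum((y - ey) ** 2 * p for y, p in p_y.items())

print(total, within + between)   # both 4.84: 1.0 within-group + 3.84 between-group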

Proof. Expanding the definitions:

E[Var(Y | X)] = E[E[Y² | X] − (E[Y | X])²]
              = E[E[Y² | X]] − E[(E[Y | X])²]       by linearity
              = EY² − E[(E[Y | X])²]                by Adam’s Law     (1)

and

Var(E[Y | X]) = E[(E[Y | X])²] − (E[E[Y | X]])²     def of Var
              = E[(E[Y | X])²] − (EY)²              by Adam’s Law.

Adding these expressions together, we get the result.

Remark 8. It’s tempting to say that getting new information about Y from ob-
serving X = x would decrease the variance. That is, it seems reasonable that
Var(Y | X = x) ≤ Var(Y ) for all x. But this is not the case. For example, we
could have Var (Y | X = x) very large for a particular x, but if X = x is very
rare, the overall variance of Y could still be much smaller. On the other hand, it is
true that Var(Y | X = x) ≤ Var(Y ) on average over X. More precisely:

E [Var (Y | X)] ≤ Var(Y ).

This follows immediately from Eve’s Law (Theorem 4.1) and the fact that Var(E[Y | X]) ≥ 0.
We can equate Eve’s Law with our variance decomposition in terms of projection
(Theorem 5) to get the following theorem:

Theorem 7. If X and Y are random variables on the same probability space, then

E[Var(Y | X)] = Var(Y − E[Y | X]) = E[(Y − E[Y | X])²].

Proof. As an exercise in conditional expectations, we’ll prove this without using
Eve’s Law:

Since Y − E[Y | X] has mean 0,

Var(Y − E[Y | X]) = E[(Y − E[Y | X])²]
                  = EY² + E[(E[Y | X])²] − 2 E[Y E[Y | X]].

Since E[Y | X] is a function of X, we can use the generalized form of “keeping
just what is needed” (Exercise 7). We have E[Y E[Y | X]] = E[E[Y | X] E[Y | X]],
where we’ve replaced Y in the first expectation by E[Y | X]. Putting it together,
we get

Var(Y − E[Y | X]) = EY² + E[(E[Y | X])²] − 2 E[(E[Y | X])²]
                  = EY² − E[(E[Y | X])²]
                  = E[Var(Y | X)],

where the last equality is from Equation 1 in the proof of Eve’s Law above.
Exercise 15. Suppose A ∈ 𝒜 has probability mass function π(a), for a ∈ 𝒜 =
{1, . . . , k}, and R ∈ ℛ is an independent random element. Show that

E[f(R, A) g(A)] = Σ_{a=1}^{k} E[f(R, a)] π(a) g(a).

Proof. In the context in which this exercise arises, we start with the RHS and need
to “discover” the LHS, so starting with the LHS would be a “guess and check”
approach. We’ll start with the RHS:

Σ_{a=1}^{k} E[f(R, a)] π(a) g(a)
= Σ_{a=1}^{k} π(a) E[f(R, a) g(a)]                    since g(a) is constant
= Σ_{a=1}^{k} π(a) E[f(R, A) g(A) | A = a]            since R and A are independent
= E[h(A)]                                             with h(a) = E[f(R, A) g(A) | A = a]
= E[E[f(R, A) g(A) | A]]
= E[f(R, A) g(A)].

As we get more comfortable with conditional expectations, we can skip the step
involving h(a).

5 Law of total covariance / Conditional covariance


First, recall the definition of the covariance of X and Y : Cov (X, Y ) = E [XY ] −
EXEY = E (X − EX) (Y − EY ).

Exercise 16. Show that covariance is not affected by changing the means of the
random variables. To be precise, if X 0 = X + c1 and Y 0 = Y + c2 for constants
c1 , c2 ∈ R, then Cov (X 0 , Y 0 ) = Cov(X, Y ).

Definition 5. The conditional covariance of X and Y given Z is

Cov(X, Y | Z) = E[(X − E[X | Z])(Y − E[Y | Z]) | Z]
              = E[XY | Z] − E[X | Z] E[Y | Z].

Exercise 17. Use the rules we developed above to show that the two expressions
for Cov (X, Y | Z) are equivalent.

Cov(X, Y | Z) = E[(X − E[X | Z])(Y − E[Y | Z]) | Z]                       definition
              = E[XY | Z] + E[E[X | Z] E[Y | Z] | Z]
                − E[X E[Y | Z] | Z] − E[Y E[X | Z] | Z]                   linearity
              = E[XY | Z] + E[X | Z] E[Y | Z] E[1 | Z]
                − E[Y | Z] E[X | Z] − E[X | Z] E[Y | Z]                   taking out what is known
              = E[XY | Z] − E[X | Z] E[Y | Z],
since E[1 | Z] = 1.

Theorem 8 (Law of Total Covariance (ECCE)). Suppose X and Y are random
variables and Z is a random element on the same probability space. Then

Cov (X, Y ) = E [Cov(X, Y | Z)] + Cov (E [X | Z] , E [Y | Z]) .

Note: Following [KBH19, Ch 9, Exercise 43], we’ll use ECCE as a shorthand
for the law of total covariance, based on the sequence of expectations and
covariances in the formula. (Again, also a good mnemonic.)
Proof. We have

E[Cov(X, Y | Z)] = E[E[XY | Z] − E[X | Z] E[Y | Z]]        def
                 = E[XY] − E[E[X | Z] E[Y | Z]]            linearity and Adam’s

and

Cov(E[X | Z], E[Y | Z]) = E[E[X | Z] E[Y | Z]] − E[E[X | Z]] E[E[Y | Z]]   def
                        = E[E[X | Z] E[Y | Z]] − EX EY                     Adam’s.

Adding these expressions together, we get

E[Cov(X, Y | Z)] + Cov(E[X | Z], E[Y | Z]) = E[XY] − EX EY = Cov(X, Y).
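As with Eve’s Law, this decomposition is easy to verify on a small example. A sketch with an illustrative joint PMF over (X, Y, Z), an arbitrary choice for the check:

# Illustrative joint PMF p(x, y, z) over a few points (weights sum to 1)
p = {(0, 0, 0): 0.15, (0, 1, 0): 0.10, (1, 1, 0): 0.25,
     (0, 1, 1): 0.20, (1, 0, 1): 0.10, (1, 2, 1): 0.20}

def E(f):
    # expectation of f(x, y, z) under the joint PMF
    return sum(f(x, y, z) * w for (x, y, z), w in p.items())

p_z = {z0: sum(w for (_, _, z), w in p.items() if z == z0) for z0 in {0, 1}}
Ex_z = {z0: sum(x * w for (x, _, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}
Ey_z = {z0: sum(y * w for (_, y, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}
Exy_z = {z0: sum(x * y * w for (x, y, z), w in p.items() if z == z0) / p_z[z0] for z0 in {0, 1}}

cov_xy = E(lambda x, y, z: x * y) - E(lambda x, y, z: x) * E(lambda x, y, z: y)

e_cond_cov = sum(p_z[z0] * (Exy_z[z0] - Ex_z[z0] * Ey_z[z0]) for z0 in {0, 1})
mean_ex, mean_ey = E(lambda x, y, z: Ex_z[z]), E(lambda x, y, z: Ey_z[z])
cov_of_cond_means = sum(p_z[z0] * (Ex_z[z0] - mean_ex) * (Ey_z[z0] - mean_ey)
                        for z0 in {0, 1})

print(cov_xy, e_cond_cov + cov_of_cond_means)   # the two sides of ECCE agree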

References
[CT06] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing), Wiley-Interscience, USA, 2006.

[KBH19] Joseph K. Blitzstein and Jessica Hwang, Introduction to Probability, 2nd ed., Chapman and Hall/CRC, 2019.

[Wik20] Wikipedia contributors, Conditional expectation — Wikipedia, The Free Encyclopedia, 2020. [Online; accessed 31-December-2020].

A Generalized Adam’s Law


Theorem 9 (Generalized Adam’s Law). We have
E [E [Y | g(X)] | f (g(X))] = E [Y | f (g(X))]
for any f and g with compatible domains and ranges. We also have that
E [E [Y | g(X)] | f (g(X)) = z] = E [Y | f (g(X)) = z]
for any z.
Proof. Let 𝒢 = {g(x) | x ∈ 𝒳}. The key step in the proof is the following:

P(Y = y | f(g(X)) = z)
= Σ_{w∈𝒢} P(Y = y | f(g(X)) = z, g(X) = w) P(g(X) = w | f(g(X)) = z)
= Σ_{w∈𝒢} P(Y = y | g(X) = w) P(g(X) = w | f(g(X)) = z),

where the first equality is the law of total probability, and the second follows
because once we know g(X) = w, we know that f(g(X)) = f(w), and so conditioning
on the value of f(g(X)) gives no additional information.
Let h(w) = E [Y | g(X) = w], to ease some calculations below. Then

E[Y | f(g(X)) = z] = Σ_{y∈𝒴} y P(Y = y | f(g(X)) = z)
= Σ_{y∈𝒴} y Σ_{w∈𝒢} P(Y = y | g(X) = w) P(g(X) = w | f(g(X)) = z)
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) Σ_{y∈𝒴} y P(Y = y | g(X) = w)
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) E[Y | g(X) = w]
= Σ_{w∈𝒢} P(g(X) = w | f(g(X)) = z) h(w)
= E[h(g(X)) | f(g(X)) = z]
= E[E[Y | g(X)] | f(g(X)) = z].

I claim we’re done at this point. To make it clear, suppose we let k(z) = E [Y | f (g(X)) = z]
and r(z) = E [E [Y | g(X)] | f (g(X)) = z]. Then the calculations above have
shown that k(z) = r(z) for all possible z. So the equality certainly holds when
we plug in the random variable f(g(X)) for z: k(f(g(X))) = r(f(g(X))). And
then by the definition of conditional expectation, we conclude that

E [Y | f (g(X))] = E [E [Y | g(X)] | f (g(X))]

Note: It’s not true that E [E [Y | g(X)] | h(X)] = E [Y | h(X)]. Can you find
the spot in the proof where we can’t just replace f (g(X)) by h(X)?
Remark 9. If we take g(x, z) = (x, z) and f (g(x, z)) = z in the generalized
Adam’s Law, we get

E [E [Y | X, Z] | Z] = E [Y | Z] ,

which is often useful in practice.
