Introduction
Class 1a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
In this introduction we will preview what we will be studying in 18.05. Don’t worry if many of the terms are unfamiliar; they will be explained as the course proceeds.
Probability and statistics are deeply connected because all statistical statements are at bottom statements about probability. Despite this the two sometimes feel like very different subjects. Probability is logically self-contained; there are a few rules, and all answers follow logically from them, though the computations can be tricky. In statistics we apply probability to draw conclusions from data. This can be messy and usually involves as much art as science.
Probability example
You have a fair coin (equal probability of heads or tails). You will toss it 100 times. What
is the probability of 60 or more heads? There is only one answer (about 0.028444) and we
will learn how to compute it.
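This tail probability can be checked directly from the binomial distribution. Here is a minimal sketch in Python (the course itself uses R):

```python
from math import comb

# P(X >= 60) for X ~ Binomial(100, 0.5): sum the exact binomial
# probabilities comb(100, k) / 2^100 over k = 60, ..., 100.
p = sum(comb(100, k) for k in range(60, 101)) / 2**100
print(round(p, 6))  # 0.028444
```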
Statistics example
You have a coin of unknown provenance. To investigate whether it is fair you toss it 100
times and count the number of heads. Let’s say you count 60 heads. Your job as a statistician is to draw a conclusion (inference) from this data. There are many ways to proceed,
both in terms of the form the conclusion takes and the probability computations used to
justify the conclusion. In fact, different statisticians might draw different conclusions.
Note that in the first example the random process is fully known (probability of heads =
0.5). The objective is to find the probability of a certain outcome (at least 60 heads) arising
from the random process. In the second example, the outcome is known (60 heads) and the
objective is to illuminate the unknown random process (the probability of heads).
There are two prominent and sometimes conflicting schools of statistics: Bayesian and
frequentist. Their approaches are rooted in differing interpretations of the meaning of
probability.
Frequentists say that probability measures the frequency of various outcomes of an experiment. For example, saying a fair coin has a 50% probability of heads means that if we toss it many times then we expect about half the tosses to land heads.
Bayesians say that probability is an abstract concept that measures a state of knowledge
or a degree of belief in a given proposition. In practice Bayesians do not assign a single
value for the probability of a coin coming up heads. Rather they consider a range of values
each with its own probability of being true.
In 18.05 we will study and compare these approaches. The frequentist approach has long
been dominant in fields like biology, medicine, public health and social sciences. The
Bayesian approach has enjoyed a resurgence in the era of powerful computers and big
data. It is especially useful when incorporating new data into an existing statistical model,
for example, when training a speech or face recognition system. Today, statisticians are
creating powerful tools by using both approaches in complementary ways.
Probability and statistics are used widely in the physical sciences, engineering, medicine, the
social sciences, the life sciences, economics and computer science. The list of applications is
essentially endless: tests of one medical treatment against another (or a placebo), measures
of genetic linkage, the search for elementary particles, machine learning for vision or speech,
gambling probabilities and strategies, climate modeling, economic forecasting, epidemiology,
marketing, googling. . . We will draw on examples from many of these fields during this
course.
Given so many exciting applications, you may wonder why we will spend so much time
thinking about toy models like coins and dice. By understanding these thoroughly we will
develop a good feel for the simple essence inside many complex real-world problems. In
fact, the modest coin is a realistic model for any situations with two possible outcomes:
success or failure of a treatment, an airplane engine, a bet, or even a class.
Sometimes a problem is so complicated that the best way to understand it is through
computer simulation. Here we use software to run virtual experiments many times in order
to estimate probabilities. In this class we will use R for simulation as well as computation
and visualization. Don’t worry if you’re new to R; we will teach you all you need to know.
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who appear in the title, and all credit for this work belongs to them. The only modifications made to fit it to “Statistics” (DIT, Faculty AI, study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public, but also for giving him access to the original LaTeX sources to tailor the material better to DIT’s needs. Without their help, this course would be far from its current quality.
Counting and Sets
Class 1b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Know the definitions and notation for sets, intersection, union, complement.
4. Be able to use the rule of product, inclusion-exclusion principle, permutations and combinations to count the elements in a set.
2 Counting
Example 1. A coin is fair if it comes up heads or tails with equal probability. You flip a
fair coin three times. What is the probability that exactly one of the flips results in a head?
Solution: With three flips, we can easily list the eight possible outcomes:
{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
Exactly three of these outcomes have exactly one head:
{TTH, THT, HTT}
Since all eight outcomes are equally likely, the probability is 3/8.
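The same count can be produced by brute-force enumeration; a small sketch in Python (rather than R):

```python
from itertools import product

# Enumerate all 2^3 = 8 equally likely outcomes of three coin flips.
outcomes = list(product("HT", repeat=3))
favorable = [o for o in outcomes if o.count("H") == 1]
print(len(outcomes), len(favorable))   # 8 3
print(len(favorable) / len(outcomes))  # 0.375, i.e. 3/8
```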
A deck of 52 cards has 13 ranks (2, 3, . . . , 9, 10, J, Q, K, A) and 4 suits (♡, ♠, ♢, ♣). A poker hand consists of 5 cards. A one-pair hand consists of two cards having one rank and three cards having three other ranks, e.g., {2♡, 2♠, 5♡, 8♣, K♢}. What is the probability of a one-pair hand?
At this point we can only guess the probability. One of our goals is to learn how to compute
it exactly. To start, we note that since every set of five cards is equally probable, we can
compute the probability of a one-pair hand as
P(one-pair) = (number of one-pair hands) / (total number of hands)
So, to find the exact probability, we need to count the number of elements in each of these
sets. And we have to be clever about it, because there are too many elements to simply
list them all. We will come back to this problem after we have learned some counting
techniques.
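Jumping ahead, the counting formulas developed later in this class give both counts. A sketch in Python (rather than R), previewing the combinations notation introduced below:

```python
from math import comb

# One-pair hands: choose the pair's rank (13 ways), its two suits C(4,2),
# then 3 distinct other ranks C(12,3), each with a free choice of suit (4^3).
one_pair = 13 * comb(4, 2) * comb(12, 3) * 4**3
total = comb(52, 5)        # all 5-card hands
print(one_pair, total)     # 1098240 2598960
print(one_pair / total)    # about 0.4226
```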
Several times already we have noted that all the possible outcomes were equally probable
and used this to find a probability by counting. Let’s state this carefully in the following
principle.
Principle: Suppose there are n possible outcomes for an experiment and each is equally
probable. If there are k desirable outcomes then the probability of a desirable outcome is
k/n. Of course we could replace the word desirable by any other descriptor: undesirable,
funny, interesting, remunerative, . . .
Concept question: Can you think of a scenario where the possible outcomes are not
equally probable?
Here’s one scenario: on an exam you can get any score from 0 to 100. That’s 101 different
possible outcomes. Is the probability you get less than 50 equal to 50/101?
Our goal is to learn techniques for counting the number of elements of a set, so we start
with a brief review of sets. (If this is new to you, please come to office hours).
2.2.1 Definitions
S = {Antelope, Bee, Cat, Dog, Elephant, Frog, Gnat, Hyena, Iguana, Jaguar}.
The relationship between union, intersection, and complement is given by DeMorgan’s laws:
(A ∪ B)^c = A^c ∩ B^c
(A ∩ B)^c = A^c ∪ B^c
In words the first law says everything not in (A or B) is the same set as everything that’s
(not in A) and (not in B). The second law is similar.
[Venn diagrams in the sample space S: shading (L ∪ R)^c = L^c ∩ R^c and (L ∩ R)^c = L^c ∪ R^c]
Example 3. Verify DeMorgan’s laws for the subsets A = {1, 2, 3} and B = {3, 4} of the
set S = {1, 2, 3, 4, 5}.
Solution: For each law we just work through both sides of the equation and show they are
the same.
1. (A ∪ B)^c = A^c ∩ B^c:
Left-hand side: A ∪ B = {1, 2, 3, 4} ⇒ (A ∪ B)^c = {5}.
Right-hand side: A^c = {4, 5}, B^c = {1, 2, 5} ⇒ A^c ∩ B^c = {5}.
The two sides are equal. QED
2. (A ∩ B)^c = A^c ∪ B^c:
Left-hand side: A ∩ B = {3} ⇒ (A ∩ B)^c = {1, 2, 4, 5}.
Right-hand side: A^c = {4, 5}, B^c = {1, 2, 5} ⇒ A^c ∪ B^c = {1, 2, 4, 5}.
The two sides are equal. QED
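Python’s built-in sets make the same verification mechanical; a short sketch (the course itself uses R):

```python
# Verify De Morgan's laws for A = {1,2,3}, B = {3,4} inside S = {1,2,3,4,5}.
# Set difference S - X plays the role of the complement X^c.
S = {1, 2, 3, 4, 5}
A, B = {1, 2, 3}, {3, 4}

assert S - (A | B) == (S - A) & (S - B)  # (A ∪ B)^c = A^c ∩ B^c
assert S - (A & B) == (S - A) | (S - B)  # (A ∩ B)^c = A^c ∪ B^c
print(S - (A | B), S - (A & B))          # {5} {1, 2, 4, 5}
```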
Think: Draw and label a Venn diagram with A the set of Brain and Cognitive Science
majors and B the set of sophomores. Shade the region illustrating the first law. Can you
express the first law in this case as a non-technical English sentence?
S × T = {(s, t) | s ∈ S, t ∈ T }.
In words the right-hand side reads ‘the set of ordered pairs (s, t) such that s is in S and t is in T ’.
The following diagrams show two examples of the set product.
×    1      2      3      4
1  (1,1)  (1,2)  (1,3)  (1,4)
2  (2,1)  (2,2)  (2,3)  (2,4)
3  (3,1)  (3,2)  (3,3)  (3,4)
The discrete product {1, 2, 3} × {1, 2, 3, 4}
[Figure: the rectangle [1, 4] × [1, 3] inside the larger rectangle [0, 5] × [0, 4]]
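The discrete set product is easy to build in code; a minimal sketch in Python (rather than R):

```python
from itertools import product

# The set product {1, 2, 3} x {1, 2, 3, 4}: all ordered pairs (s, t).
pairs = list(product({1, 2, 3}, {1, 2, 3, 4}))
print(len(pairs))  # 12 = 3 * 4, as the rule of product predicts
```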
2.3 Counting
Inclusion-exclusion principle: For any two sets A and B,
|A ∪ B| = |A| + |B| − |A ∩ B|.
We can illustrate this with a Venn diagram. S is all the dots, A is the dots in the blue circle, and B is the dots in the red circle.
[Venn diagram: two overlapping circles A and B; the overlap is A ∩ B]
|A| is the number of dots in A and likewise for the other sets. The figure shows that |A|+|B|
double-counts |A ∩ B|, which is why |A ∩ B| is subtracted off in the inclusion-exclusion
formula.
Example 4. In a band of singers and guitarists, seven people sing, four play the guitar,
and two do both. How big is the band?
Solution: Let S be the set of singers and G be the set of guitar players. The inclusion-exclusion principle says
size of band = |S ∪ G| = |S| + |G| − |S ∩ G| = 7 + 4 − 2 = 9.
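The same bookkeeping can be done with Python sets; a sketch (the member names below are invented purely for illustration):

```python
# Inclusion-exclusion with sets; these names are made up for the example.
singers = {"Al", "Bo", "Cy", "Di", "Ed", "Fay", "Gil"}  # |S| = 7
guitarists = {"Fay", "Gil", "Hal", "Ira"}               # |G| = 4; 2 do both

band = singers | guitarists
# |S ∪ G| = |S| + |G| - |S ∩ G|
assert len(band) == len(singers) + len(guitarists) - len(singers & guitarists)
print(len(band))  # 9
```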
Rule of Product: If there are n ways to perform action 1 and m ways to perform action 2, then there are n · m ways to perform action 1 followed by action 2.
Example 5. If you have 3 shirts and 4 pants then you can make 3 · 4 = 12 outfits.
Think: An extremely important point is that the rule of product holds even if the ways to
perform action 2 depend on action 1, as long as the number of ways to perform action 2 is
independent of action 1. To illustrate this:
Example 6. There are 5 competitors in the 100m final at the Olympics. In how many
ways can the gold, silver, and bronze medals be awarded?
Solution: There are 5 ways to award the gold. Once that is awarded there are 4 ways to
award the silver and then 3 ways to award the bronze: answer 5 · 4 · 3 = 60 ways.
Note that the choice of gold medalist affects who can win the silver, but the number of
possible silver medalists is always four.
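The medal count can be checked by enumeration; a small sketch in Python (rather than R):

```python
from itertools import permutations
from math import perm

# Ordered choices of 3 medalists out of 5 competitors (order matters:
# gold, silver, bronze are distinct).
podiums = list(permutations(range(5), 3))
print(len(podiums), perm(5, 3))  # 60 60
```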
2.4.1 Permutations
A permutation of a set is a particular ordering of its elements. For example, the set {a, b, c}
has six permutations: abc, acb, bac, bca, cab, cba. We found the number of permutations by
listing them all. We could also have found the number of permutations by using the rule
of product. That is, there are 3 ways to pick the first element, then 2 ways for the second,
and 1 for the third. This gives a total of 3 · 2 · 1 = 6 permutations.
In general, the rule of product tells us that the number of permutations of a set of k elements
is
k! = k · (k − 1) · · · 3 · 2 · 1.
We also talk about the permutations of k things out of a set of n things. We show what
this means with an example.
Example 7. List all the permutations of 3 elements out of the set {a, b, c, d}.
Solution: This is a longer list,
abc acb bac bca cab cba
abd adb bad bda dab dba
acd adc cad cda dac dca
bcd bdc cbd cdb dbc dcb
Note that abc and acb count as distinct permutations. That is, for permutations the order
matters.
There are 24 permutations. Note that the rule of product would have told us there are
4 · 3 · 2 = 24 permutations without bothering to list them all.
2.4.2 Combinations
In contrast to permutations, in combinations order does not matter: permutations are lists and combinations are sets. We show what we mean with an example.
Example 8. List all the combinations of 3 elements out of the set {a, b, c, d}.
Solution: Such a combination is a collection of 3 elements without regard to order. So, abc
and cab both represent the same combination. We can list all the combinations by listing
all the subsets of exactly 3 elements.
{a, b, c} {a, b, d} {a, c, d} {b, c, d}
There are only 4 combinations. Contrast this with the 24 permutations in the previous
example. The factor of 6 comes because every combination of 3 things can be written in 6
different orders.
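The factor-of-6 relationship between the two lists can be checked in code; a sketch in Python (rather than R):

```python
from itertools import combinations, permutations

# 3-element combinations vs. permutations from the set {a, b, c, d}.
combs = list(combinations("abcd", 3))
perms = list(permutations("abcd", 3))
print(len(combs), len(perms))    # 4 24
print(len(perms) // len(combs))  # 6 orderings per combination (3! = 6)
```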
2.4.3 Formulas
The number of permutations of k things chosen from n things is
n · (n − 1) · · · (n − k + 1) = n!/(n − k)!.
Since each set of k things can be ordered in k! ways, the number of combinations of k things chosen from n things is
(n choose k) = n!/(k! (n − k)!).
2.4.4 Examples
Example 10. (i) Count the number of ways to get 3 heads in a sequence of 10 flips of a
coin.
(ii) If the coin is fair, what is the probability of exactly 3 heads in 10 flips?
Solution: (i) This asks for the number of sequences of 10 flips (heads or tails) with exactly 3 heads. That is, we have to choose exactly 3 out of 10 flips to be heads. This is the same question as in the previous example.
(10 choose 3) = 10!/(3! 7!) = (10 · 9 · 8)/(3 · 2 · 1) = 120.
(ii) Each flip has 2 possible outcomes (heads or tails). So the rule of product says there are
210 = 1024 sequences of 10 flips. Since the coin is fair each sequence is equally probable.
So the probability of 3 heads is 120/1024 ≈ 0.117.
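Both parts of the example fit in a couple of lines of Python (rather than R):

```python
from math import comb

# Exactly 3 heads in 10 flips of a fair coin.
ways = comb(10, 3)   # 120 sequences with exactly 3 heads
total = 2**10        # 1024 equally likely sequences
print(ways / total)  # 0.1171875, about 0.117
```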
Probability: Terminology and Examples
Class 2, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2. Be able to organize a scenario with randomness into an experiment and sample space.
2 Terminology
• Sample space: the set of all possible outcomes. We usually denote the sample space by
Ω, sometimes by S.
P(k) = e^(−λ) λ^k / k!,
where λ is the average number of taxis. We can put this in a table:
Outcome      0        1         2            3            . . .  k             . . .
Probability  e^(−λ)   e^(−λ)λ   e^(−λ)λ²/2   e^(−λ)λ³/3!  . . .  e^(−λ)λ^k/k!  . . .
Question: Accepting that this is a valid probability function, what is Σ_{k=0}^{∞} e^(−λ) λ^k / k! ?
Solution: This is the total probability of all possible outcomes, so the sum equals 1.
(Note, this also follows from the Taylor series e^λ = Σ_{n=0}^{∞} λ^n / n!.)
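The sum can be checked numerically; a sketch in Python (rather than R), with λ = 3 as an arbitrary choice:

```python
from math import exp, factorial

# Partial sum of the Poisson probabilities e^(-lam) * lam^k / k!.
# The tail beyond k = 99 is negligible, so the sum is essentially 1.
lam = 3.0
total = sum(exp(-lam) * lam**k / factorial(k) for k in range(100))
print(total)  # very close to 1.0
```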
In a given setup there can be more than one reasonable choice of sample space. Here is a
simple example.
Example 5. Two dice (Choice of sample space)
Suppose you roll one die. Then the sample space and probability function are
Outcome 1 2 3 4 5 6
Probability: 1/6 1/6 1/6 1/6 1/6 1/6
Now suppose you roll two dice. What should be the sample space? Here are two options.
1. Record the pair of numbers showing on the dice (first die, second die).
2. Record the sum of the numbers on the dice. In this case there are 11 outcomes
{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. These outcomes are not all equally likely.
As above, we can put this information in tables. For the first case, the sample space is the
product of the sample spaces for each die
Each of the 36 outcomes is equally likely. (Why 36 outcomes?) For the probability function
we will make a two dimensional table with the rows corresponding to the number on the
first die, the columns the number on the second die and the entries the probability.
Die 2
1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
Die 1 3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Two dice in a two dimensional table
In the second case we can present outcomes and probabilities in our usual table.
outcome 2 3 4 5 6 7 8 9 10 11 12
probability 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
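The second table can be derived from the first by counting pairs; a sketch in Python (rather than R):

```python
from collections import Counter
from itertools import product

# Distribution of the sum of two dice, built from the 36 equally likely pairs.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
for s in range(2, 13):
    print(s, f"{counts[s]}/36")
```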
Events.
An event is a collection of outcomes, i.e. an event is a subset of the sample space Ω. This
sounds odd, but it actually corresponds to the common meaning of the word.
Example 6. Using the setup of tossing a coin 3 times, we would describe the event that you get exactly two heads in words by E = ‘exactly 2 heads’. Written as a subset this becomes
E = {HHT, HT H, T HH}.
You should get comfortable moving between describing events in words and as subsets of
the sample space.
The probability of an event E is computed by adding up the probabilities of all of the
outcomes in E. In this example each outcome has probability 1/8, so we have P (E) = 3/8.
Definition. A discrete sample space is one that is listable; it can be either finite or infinite.
Examples. {H, T}, {1, 2, 3}, {1, 2, 3, 4, . . . }, {2, 3, 5, 7, 11, 13, 17, . . . } are all
discrete sets. The first two are finite and the last two are infinite.
Example. The interval 0 ≤ x ≤ 1 is not discrete, rather it is continuous. We will deal
with continuous sample spaces in a few days.
So far we’ve been using a casual definition of the probability function. Let’s give a more
precise one.
Careful definition of the probability function.
For a discrete sample space S a probability function P assigns to each outcome ω a number P(ω) called the probability of ω. P must satisfy two rules:
Rule 1. 0 ≤ P(ω) ≤ 1 (probabilities are between 0 and 1).
Rule 2. The sum of the probabilities of all possible outcomes is 1 (something must occur).
For any two events L and R,
P(L ∪ R) = P(L) + P(R) − P(L ∩ R).
[Venn diagrams: an event A with its complement A^c; two events L and R with overlap L ∩ R]
Conditional Probability, Independence and Bayes’ Theorem
Class 3, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
3. Be able to use the multiplication rule to compute the total probability of an event.
6. Be able to organize the computation of conditional probabilities using trees and tables.
2 Conditional Probability
Conditional probability answers the question ‘how does the probability of an event change
if we have extra information’. We’ll illustrate with an example.
Example 1. Toss a fair coin 3 times.
(a) What is the probability of 3 heads?
Solution: Sample space Ω = {HHH, HHT, HT H, HT T, T HH, T HT, T T H, T T T }.
All outcomes are equally probable, so P (3 heads) = 1/8.
(b) Suppose we are told that the first toss was heads. Given this information how should
we compute the probability of 3 heads?
Solution: We have a new (reduced) sample space: Ω′ = {HHH, HHT, HT H, HT T }.
All outcomes are equally probable, so
P (3 heads given that the first toss is heads) = 1/4.
This is called conditional probability, since it takes into account additional conditions. To
develop the notation, we rephrase (b) in terms of events.
Rephrased (b) Let A be the event ‘all three tosses are heads’ = {HHH}.
Let B be the event ‘the first toss is heads’ = {HHH, HHT, HT H, HT T }.
The conditional probability of A knowing that B occurred is written
P (A|B)
This is read as
‘the conditional probability of A given B’
or
‘the probability of A conditioned on B’
or simply
‘the probability of A given B’.
We can visualize conditional probability as follows. Think of P (A) as the proportion of the
area of the whole sample space taken up by A. For P (A|B) we restrict our attention to B.
That is, P (A|B) is the proportion of area of B taken up by A, i.e. P (A ∩ B)/P (B).
[Figure: on the left, a Venn diagram with A inside B, so A = A ∩ B; on the right, the 8 outcomes HHH, HHT, HTH, HTT, THH, THT, TTH, TTT, with B = {HHH, HHT, HTH, HTT} and A = {HHH}]
P(A|B) = P(A ∩ B) / P(B), provided P(B) ≠ 0.   (1)
Let’s redo the coin tossing example using the definition in Equation (1). Recall A = ‘3 heads’
and B = ‘first toss is heads’. We have P (A) = 1/8 and P (B) = 1/2. Since A ∩ B = A, we
also have P (A ∩ B) = 1/8. Now according to (1),
P(A|B) = P(A ∩ B) / P(B) = (1/8) / (1/2) = 1/4,
in agreement with the answer we found directly in part (b).
the second card. Since 13 of the 52 cards are spades we get P(S2) = 13/52 = 1/4. Another way to say this is: if we are not given the value of the first card then we have to consider all possibilities for the second card.
Continuing, we compute P (S1 ∩ S2 ) by counting:
Number of ways to draw a spade followed by a second spade: 13 · 12.
Number of ways to draw any card followed by any other card: 52 · 51.
Thus,
P(S1 ∩ S2) = (13 · 12) / (52 · 51) = 3/51.
Now, using (1) we get
P(S2|S1) = P(S2 ∩ S1) / P(S1) = (3/51) / (1/4) = 12/51.
This case is simple enough that we can check our answer by computing the conditional
probability directly: if the first card is a spade then of the 51 cards remaining, 12 are
spades. So, the probability the second card is also a spade is
P (S2 |S1 ) = 12/51.
Warning: In more complicated problems it will be much harder to compute conditional probability by counting. Usually we have to use Equation (1).
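For this small example the counting check can also be done by machine; a sketch in Python (rather than R):

```python
from fractions import Fraction
from itertools import permutations

# Check P(S2 | S1) = 12/51 by enumerating all ordered pairs of cards.
# In this encoding, cards 0..12 are the spades (an arbitrary choice).
deck = range(52)
pairs = list(permutations(deck, 2))           # 52 * 51 ordered draws
s1 = [p for p in pairs if p[0] < 13]          # first card a spade
s1_and_s2 = [p for p in s1 if p[1] < 13]      # both cards spades
print(Fraction(len(s1_and_s2), len(s1)))      # 12/51, printed reduced as 4/17
```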
Think: For S1 and S2 in the previous example, what is P(S2|S1^c)?
3 Multiplication Rule
The law of total probability will allow us to use the multiplication rule to find probabilities
in more interesting examples. It involves a lot of notation, but the idea is fairly simple. We
state the law when the sample space is divided into 3 pieces. It is a simple matter to extend
the rule when there are more than 3 pieces.
Law of Total Probability
Suppose the sample space Ω is divided into 3 disjoint events B1 , B2 , B3 (see the figure
below). Then for any event A:
P (A) = P (A ∩ B1 ) + P (A ∩ B2 ) + P (A ∩ B3 )
P (A) = P (A|B1 ) P (B1 ) + P (A|B2 ) P (B2 ) + P (A|B3 ) P (B3 ) (3)
The top equation says ‘if A is divided into 3 pieces then P (A) is the sum of the probabilities
of the pieces’. The bottom equation (3) is called the law of total probability. It is just a
rewriting of the top equation using the multiplication rule.
The sample space Ω and the event A are each divided into 3 disjoint pieces.
The law holds if we divide Ω into any number of events, so long as they are disjoint and
cover all of Ω. Such a division is often called a partition of Ω.
Our first example will be one where we already know the answer and can verify the law.
Example 4. An urn contains 5 red balls and 2 green balls. Two balls are drawn one after
the other. What is the probability that the second ball is red?
Solution: The sample space is Ω = {rr, rg, gr, gg}.
Let R1 be the event ‘the first ball is red’, G1 = ‘first ball is green’, R2 = ‘second ball is
red’, G2 = ‘second ball is green’. We are asked to find P (R2 ).
Let’s compute this same value using the law of total probability (3). First, we’ll find the conditional probabilities. This is a simple counting exercise: if the first ball is red, 4 of the remaining 6 balls are red, so P(R2|R1) = 4/6; if the first ball is green, 5 of the remaining 6 balls are red, so P(R2|G1) = 5/6. Also P(R1) = 5/7 and P(G1) = 2/7. Then
P(R2) = P(R2|R1)P(R1) + P(R2|G1)P(G1) = (4/6)(5/7) + (5/6)(2/7) = 30/42 = 5/7.   (4)
Of course, this example is simple enough that we could have computed P (R2 ) directly the
same way we found P (S2 ) directly in the card example. But, we will see that in more
complicated examples the law of total probability is truly necessary.
Probability urns
The example above used probability urns. Their use goes back to the beginning of the subject and we would be remiss not to introduce them. This toy model is very useful; see the Wikipedia article https://en.wikipedia.org/wiki/Urn_problem for more.
It doesn’t take much to make an example where (3) is really the best way to compute the
probability. Here is a game with slightly more complicated rules.
Example 5. An urn contains 5 red balls and 2 green balls. A ball is drawn. If it’s green
a red ball is added to the urn and if it’s red a green ball is added to the urn. (The original
ball is not returned to the urn.) Then a second ball is drawn. What is the probability the
second ball is red?
Solution: The law of total probability says that P (R2 ) can be computed using the expres-
sion in Equation (4). Only the values for the probabilities will change. We have
P (R2 |R1 ) = 4/7, P (R2 |G1 ) = 6/7.
Therefore,
P(R2) = P(R2|R1)P(R1) + P(R2|G1)P(G1) = (4/7) · (5/7) + (6/7) · (2/7) = 32/49.
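The arithmetic can be kept exact with Python’s fractions module; a short sketch (the course itself uses R):

```python
from fractions import Fraction as F

# Law of total probability for the urn of Example 5:
# P(R2) = P(R2|R1) P(R1) + P(R2|G1) P(G1).
p_r2 = F(4, 7) * F(5, 7) + F(6, 7) * F(2, 7)
print(p_r2)  # 32/49
```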
Trees are a great way to organize computations with conditional probability and the law of
total probability. The figures and examples will make clear what we mean by a tree. As
with the rule of product, the key is to organize the underlying process into a sequence of
actions.
We start by redoing Example 5. The sequence of actions are: first draw ball 1 (and add the
appropriate ball to the urn) and then draw ball 2.
[Tree: the root branches to R1 (probability 5/7) and G1 (probability 2/7); R1 branches to R2 (4/7) and G2 (3/7); G1 branches to R2 (6/7) and G2 (1/7); the two R2 nodes are circled]
You interpret this tree as follows. Each dot is called a node. The tree is organized by levels.
The top node (root node) is at level 0. The next layer down is level 1 and so on. Each level
shows the outcomes at one stage of the game. Level 1 shows the possible outcomes of the
first draw. Level 2 shows the possible outcomes of the second draw starting from each node
in level 1.
Probabilities are written along the branches. The probability of R1 (red on the first draw)
is 5/7. It is written along the branch from the root node to the one labeled R1 . At the
next level we put in conditional probabilities. The probability along the branch from R1 to
R2 is P (R2 |R1 ) = 4/7. It represents the probability of going to node R2 given that you are
already at R1 .
The multiplication rule says that the probability of getting to any node is just the product of
the probabilities along the path to get there. For example, the node labeled R2 at the far left
really represents the event R1 ∩ R2 because it comes from the R1 node. The multiplication
rule now says
P(R1 ∩ R2) = P(R1) · P(R2|R1) = (5/7) · (4/7),
which is exactly multiplying along the path to the node.
The law of total probability is just the statement that P (R2 ) is the sum of the probabilities
of all paths leading to R2 (the two circled nodes in the figure). In this case,
P(R2) = (5/7) · (4/7) + (2/7) · (6/7) = 32/49,
exactly as in the previous example.
The tree given above involves some shorthand. For example, the node marked R2 at the
far left really represents the event R1 ∩ R2 , since it ends the path from the root through
R1 to R2 . Here is the same tree with everything labeled precisely. As you can see this tree
is more cumbersome to make and use. We usually use the shorthand version of trees. You
should make sure you know how to interpret them precisely.
[The same tree with every node labeled by the event it represents; the four leaves are R1 ∩ R2, R1 ∩ G2, G1 ∩ R2, G1 ∩ G2]
6 Independence
Two events are independent if knowledge that one occurred does not change the probability
that the other occurred. Informally, events are independent if they do not influence one
another.
Example 6. Toss a coin twice. We expect the outcomes of the two tosses to be independent
of one another. In real experiments this always has to be checked. If my coin lands in honey
and I don’t bother to clean it, then the second toss might be affected by the outcome of the
first toss.
More seriously, the independence of experiments can be undermined by the failure to clean or recalibrate equipment between experiments or to isolate supposedly independent observers
from each other or a common influence. We’ve all experienced hearing the same ‘fact’ from
different people. Hearing it from different sources tends to lend it credence until we learn
that they all heard it from a common source. That is, our sources were not independent.
Translating the verbal description of independence into symbols gives
P(A|B) = P(A).
That is, knowing that B occurred does not change the probability that A occurred. In
terms of events as subsets, knowing that the realized outcome is in B does not change the
probability that it is in A.
If A and B are independent in the above sense, then the multiplication rule gives
P(A ∩ B) = P(A) · P(B).   (6)
This is a nice symmetric definition which makes clear that A is independent of B if and only
if B is independent of A. Unlike the equation with conditional probabilities, this definition
makes sense even when P (B) = 0. In terms of conditional probabilities, we have:
1. If P (B) ̸= 0 then A and B are independent if and only if P (A|B) = P (A).
2. If P (A) ̸= 0 then A and B are independent if and only if P (B|A) = P (B).
Independent events commonly arise as different trials in an experiment, as in the following
example.
Example 7. Toss a fair coin twice. Let H1 = ‘heads on first toss’ and let H2 = ‘heads on
second toss’. Are H1 and H2 independent?
Solution: Since H1 ∩ H2 is the event ‘both tosses are heads’ we have
P(H1 ∩ H2) = 1/4 = (1/2) · (1/2) = P(H1) · P(H2),
so H1 and H2 are independent.
Example 9. Draw one card from a standard deck of playing cards. Let’s examine the
independence of 3 events ‘the card is an ace’, ‘the card is a heart’ and ‘the card is red’.
Define the events as A = ‘ace’, H = ‘hearts’, R = ‘red’.
(a) We know that P (A) = 4/52 (4 out of 52 cards are aces), P (A|H) = 1/13 (1 out of 13
hearts are aces). Since P (A) = P (A|H) we have that A is independent of H.
(b) P (A|R) = 2/26 = 1/13 = P (A). So A is independent of R. That is, whether the card
is an ace is independent of whether it is red.
(c) Finally, what about H and R? Since P (H) = 1/4 and P (H|R) = 1/2, H and R are not
independent. We could also see this the other way around: P (R) = 1/2 and P (R|H) = 1,
so H and R are not independent. That is, the suit of a card is not independent of its color.
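All three checks from the example can be written out exactly; a sketch in Python (rather than R):

```python
from fractions import Fraction as F

# Exact independence checks for one card drawn from a 52-card deck.
P_A, P_H, P_R = F(4, 52), F(13, 52), F(26, 52)  # ace, heart, red

assert F(1, 52) == P_A * P_H   # P(A and H): the ace of hearts -> independent
assert F(2, 52) == P_A * P_R   # P(A and R): the two red aces  -> independent
assert F(13, 52) != P_H * P_R  # P(H and R) = 1/4, but the product is 1/8
print("A,H independent; A,R independent; H,R not independent")
```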
An event A with probability 0 is independent of itself, since in this case both sides of
equation (6) are 0. This appears paradoxical because knowledge that A occurred certainly
gives information about whether A occurred. We resolve the paradox by noting that since
P (A) = 0 the statement ‘A occurred’ is vacuous.
Think: For what other value(s) of P (A) is A independent of itself?
7 Bayes’ Theorem
Bayes’ theorem is a pillar of both probability and statistics and it is central to the rest of
this course. For two events A and B Bayes’ theorem (also called Bayes’ rule and Bayes’
formula) says
P(B|A) = P(A|B) · P(B) / P(A).   (7)
Comments: 1. Bayes’ rule tells us how to ‘invert’ conditional probabilities, i.e. to find
P (B|A) from P (A|B).
2. In practice, P (A) is often computed using the law of total probability.
Proof of Bayes’ rule
The key point is that A ∩ B is symmetric in A and B. So the multiplication rule gives
P(B|A) · P(A) = P(A ∩ B) = P(A|B) · P(B).
Dividing both sides by P(A) gives equation (7).
A common mistake is to confuse the meanings of P (A|B) and P (B|A). They can be very
different. This is illustrated in the next example.
Example 10. Toss a coin 5 times. Let H1 = ‘first toss is heads’ and let HA = ‘all 5 tosses
are heads’. Then P (H1 |HA ) = 1 but P (HA |H1 ) = 1/16.
For practice, let’s use Bayes’ theorem to compute P(H1|HA) from P(HA|H1). The terms are P(HA|H1) = 1/16, P(H1) = 1/2 and P(HA) = 1/32, so Bayes’ theorem gives
P(H1|HA) = P(HA|H1) · P(H1) / P(HA) = (1/16)(1/2)/(1/32) = 1,
as expected.
Statistics Class 3, Conditional Probability, Independence and Bayes’ Theorem 9
The base rate fallacy is one of many examples showing that it’s easy to confuse the meaning
of P (B|A) and P (A|B) when a situation is described in words. This is one of the key
examples from probability and it will inform much of our practice and interpretation of
statistics. You should strive to understand it thoroughly.
Example 11. The Base Rate Fallacy
Consider a routine screening test for a disease. Suppose the frequency of the disease in the
population (base rate) is 0.5%. The test is fairly accurate with a 5% false positive rate and
a 10% false negative rate.
You take the test and it comes back positive. What is the probability that you have the
disease?
Solution: We will do the computation three times: using trees, tables and symbols. We’ll
use the following notation for the relevant events:
D+ = ‘you have the disease’
D− = ‘you do not have the disease’
T + = ‘you tested positive’
T − = ‘you tested negative’.
We are given P(D+) = 0.005 and therefore P(D−) = 0.995. The false positive and false negative rates are (by definition) conditional probabilities:
P(false positive) = P(T+|D−) = 0.05 and P(false negative) = P(T−|D+) = 0.1.
The complementary probabilities are known as the true negative and true positive rates:
P(T−|D−) = 0.95 and P(T+|D+) = 0.9.
We organize all of this in a tree:
[Probability tree: the root branches to D− with probability 0.995 and D+ with probability 0.005; D− branches to T+ (0.05) and T− (0.95); D+ branches to T+ (0.9) and T− (0.1).]
The question asks for the probability that you have the disease given that you tested positive,
i.e. what is the value of P (D+ |T + ). We aren’t given this value, but we do know P (T + |D+ ),
so we can use Bayes’ theorem.
P(D+|T+) = P(T+|D+) · P(D+) / P(T+).
The two probabilities in the numerator are given. We compute the denominator P(T+) using the law of total probability. Using the tree, we just have to sum the probabilities for each of the nodes marked T+:
P(T+) = 0.05 · 0.995 + 0.9 · 0.005 = 0.05425.
Thus,
P(D+|T+) = (0.9 × 0.005)/0.05425 = 0.082949 ≈ 8.3%.
Remarks: This is called the base rate fallacy because the base rate of the disease in the
population is so low that the vast majority of the people taking the test are healthy, and
even with an accurate test most of the positives will be healthy people. Ask your doctor
for his/her guess at the odds.
To summarize the base rate fallacy with specific numbers:
‘95% of all tests are accurate’ does not imply ‘95% of positive tests are accurate’.
We will refer back to this example frequently. It and similar examples are at the heart of
many statistical misunderstandings.
Other ways to work Example 11
Tables: Another trick that is useful for computing probabilities is to make a table. Let’s redo the previous example using a table built with 10000 total people divided according to the probabilities in this example.
We construct the table as follows. Pick a number, say 10000 people, and place it as the
grand total in the lower right. Using P (D+ ) = 0.005 we compute that 50 out of the 10000
people are sick (D+ ). Likewise 9950 people are healthy (D− ). At this point the table looks
like:
D+ D− total
T+
T−
total 50 9950 10000
Using P(T+|D+) = 0.9 we can compute the number of sick people who tested positive as 90% of 50, or 45. The other entries are similar. At this point the table looks like the
table below; then we sum the T+ and T− rows to get the completed table.

        D+     D−    total
T+      45    498
T−       5   9452
total   50   9950   10000

        D+     D−    total
T+      45    498     543
T−       5   9452    9457
total   50   9950   10000
Using the complete table we can compute
P(D+|T+) = |D+ ∩ T+| / |T+| = 45/543 ≈ 8.3%.
Symbols: For completeness, we show how the solution looks when written out directly in
symbols.
P(D+|T+) = P(T+|D+) · P(D+) / P(T+)
         = P(T+|D+) · P(D+) / [P(T+|D+) · P(D+) + P(T+|D−) · P(D−)]
         = 0.9 × 0.005 / (0.9 × 0.005 + 0.05 × 0.995)
         ≈ 8.3%
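The whole computation fits in a few lines of code. Here is a Python sketch of the same arithmetic (the variable names are ours, chosen for readability):

```python
# Numbers from Example 11: base rate 0.5%, false positive rate 5%,
# false negative rate 10% (so the true positive rate is 90%).
p_disease = 0.005           # P(D+)
p_pos_given_disease = 0.9   # P(T+|D+), the true positive rate
p_pos_given_healthy = 0.05  # P(T+|D-), the false positive rate

# Denominator P(T+) by the law of total probability.
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 5), round(p_disease_given_pos, 4))  # 0.05425 0.0829
```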
Visualization: The figure below illustrates the base rate fallacy. The large blue rectangle
represents all the healthy people. The much smaller orange rectangle represents the sick
people. The shaded rectangle represents the people who test positive. The shaded area
covers most of the orange area and only a small part of the blue area. Even so, most of the shaded area is over the blue. That is, most of the positive tests are of healthy people.
As we said at the start of this section, Bayes’ rule is a pillar of probability and statistics.
We have seen that Bayes’ rule allows us to ‘invert’ conditional probabilities. When we study
statistics we will see that the art of statistical inference involves deciding how to proceed
when one (or more) of the terms on the right side of Bayes’ rule is unknown.
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who
appear in the title, and all praise for this work must go to them. The only modifications made to fit it to "Statistics"
(DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but
also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT
OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following Terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public,
but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without
their help, this course would not have nearly the quality of its current state.
Discrete Random Variables: Expected Value
Class 4, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2. Know the expected value of Bernoulli, binomial and geometric random variables.
2 Expected Value
In the R reading questions for this lecture, you simulated the average value of rolling a die
many times. You should have gotten a value close to the exact answer of 3.5. To motivate
the formal definition of the average, or expected value, we first consider some examples.
Example 1. Suppose we have a six-sided die marked with five 3’s and one 6. (This was
the red one from our non-transitive dice.) What would you expect the average of 6000 rolls
to be?
Solution: If we knew the value of each roll, we could compute the average by summing
the 6000 values and dividing by 6000. Without knowing the values, we can compute the
expected average as follows.
Since there are five 3’s and one six we expect roughly 5/6 of the rolls will give 3 and 1/6 will
give 6. Assuming this to be exactly true, we have the following table of values and counts:
value: 3 6
expected counts: 5000 1000
The average of these 6000 values is then
(5000 · 3 + 1000 · 6)/6000 = (5/6) · 3 + (1/6) · 6 = 3.5
We consider this the expected average in the sense that we ‘expect’ each of the possible
values to occur with the given frequencies.
Example 2. We roll two standard 6-sided dice. You win $1000 if the sum is 2 and lose
$100 otherwise. How much do you expect to win on average per trial?
Solution: The probability of a 2 is 1/36. If you play N times, you can ‘expect’ (1/36) · N of the trials to give a 2 and (35/36) · N of the trials to give something else. Thus your total expected winnings are
1000 · (N/36) − 100 · (35N/36).
To get the expected average per trial we divide the total by N:
expected average = 1000 · (1/36) − 100 · (35/36) = −69.44.
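A quick check of this arithmetic with exact fractions; this Python snippet is illustrative, not part of the original notes.

```python
from fractions import Fraction

p_win = Fraction(1, 36)  # P(sum of two dice is 2)
payoff_win, payoff_lose = 1000, -100

# Expected average winnings per trial: weight each payoff by its probability.
expected = p_win * payoff_win + (1 - p_win) * payoff_lose
print(float(expected))  # ≈ -69.44
```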
Statistics Class 4, Discrete Random Variables: Expected Value 2
Think: Would you be willing to play this game one time? Multiple times?
Notice that in both examples the sum for the expected average consists of terms which are a value of the random variable times its probability. This leads to the following definition.
Definition: Suppose X is a discrete random variable that takes values x1 , x2 , . . . , xn with
probabilities p(x1 ), p(x2 ), . . . , p(xn ). The expected value of X is denoted E[X] and defined
by
E[X] = ∑_{j=1}^{n} p(xj) xj = p(x1)x1 + p(x2)x2 + . . . + p(xn)xn.
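The definition translates directly into code. Here is a minimal Python sketch (the function name `expected_value` is our own):

```python
from fractions import Fraction

def expected_value(values, probs):
    """E[X] = sum of p(x_j) * x_j for a discrete random variable X."""
    assert sum(probs) == 1, "probabilities must sum to 1"
    return sum(p * x for p, x in zip(probs, values))

# The weighted die from Example 1: five 3's and one 6.
print(expected_value([3, 6], [Fraction(5, 6), Fraction(1, 6)]))  # 7/2
```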
Notes:
1. The expected value is also called the mean or average of X and often denoted by µ
(“mu”).
2. As seen in the above examples, the expected value need not be a possible value of the
random variable. Rather it is a weighted average of the possible values.
4. If all the values are equally probable then the expected value is just the usual average of
the values.
For X ∼ Bernoulli(p), which takes the value 1 with probability p and 0 with probability 1 − p, we have
E[X] = p · 1 + (1 − p) · 0 = p.
Important: Be sure to remember that the expected value of a Bernoulli(p) random variable is p.
Think: What is the expected value of the sum of two dice?
You may have wondered why we use the name ‘probability mass function’. Here’s one
reason: if we place an object of mass p(xj ) at position xj for each j, then E[X] is the
position of the center of mass. Let’s recall the latter notion via an example.
Example 5. Suppose we have two masses along the x-axis, mass m1 = 500 at position
x1 = 3 and mass m2 = 100 at position x2 = 6. Where is the center of mass?
Solution: Intuitively we know that the center of mass is closer to the larger mass. Precisely, it is the weighted average of the positions:
center of mass = (m1 x1 + m2 x2)/(m1 + m2) = (500 · 3 + 100 · 6)/600 = 3.5.
When we add, scale or shift random variables the expected values do the same. The
shorthand mathematical way of saying this is that E[X] is linear.
1. If X and Y are random variables on a sample space Ω then
E[X + Y] = E[X] + E[Y].
2. If a and b are constants then
E[aX + b] = aE[X] + b.
Before proving these properties, let’s see them in action with a few examples.
Example 6. Roll two dice and let X be the sum. Find E[X].
Solution: Let X1 be the value on the first die and let X2 be the value on the second
die. Since X = X1 + X2 we have E[X] = E[X1 ] + E[X2 ]. Earlier we computed that
E[X1 ] = E[X2 ] = 3.5, therefore E[X] = 7.
Now we can use the Algebraic Property (1) to make the calculation simple. Writing X ∼ binomial(n, p) as the sum of n independent Bernoulli(p) random variables Xj we get
X = ∑_{j=1}^{n} Xj ⇒ E[X] = ∑_{j=1}^{n} E[Xj] = ∑_{j=1}^{n} p = np.
It is possible to show that the sum of this series is indeed np. We think you’ll agree that
the method using Property (1) is much easier.
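You can confirm that the direct sum from the definition really does equal np. Here is a Python sketch with exact fractions (the helper name is ours):

```python
from fractions import Fraction
from math import comb

def binomial_mean_by_sum(n, p):
    """E[X] for X ~ binomial(n, p), summed term by term from the definition."""
    return sum(k * Fraction(comb(n, k)) * p**k * (1 - p)**(n - k)
               for k in range(n + 1))

n, p = 10, Fraction(1, 3)
print(binomial_mean_by_sum(n, p), n * p)  # both 10/3
```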
Example 8. (For infinite random variables the mean does not always exist.) Suppose X
has an infinite number of values according to the following table.
values x:  2    2²    2³    . . .   2^k    . . .
pmf p(x): 1/2  1/2²  1/2³   . . .  1/2^k   . . .
Try to compute the mean.
Solution: The mean is
E[X] = ∑_{k=1}^{∞} 2^k · (1/2^k) = ∑_{k=1}^{∞} 1 = ∞.
The mean does not exist! This can happen with infinite series.
Example 9. Mean of a geometric distribution
Let X ∼ geo(p). Recall this means X takes values k = 0, 1, 2, . . . with probabilities
p(k) = (1 − p)k p. (X models the number of tails before the first heads in a sequence of
Bernoulli trials.) The mean is given by
E[X] = (1 − p)/p.
To see this requires a clever trick. Mathematicians love this sort of thing and we hope you
are able to follow the logic and enjoy it. In this class we will not ask you to come up with
something like this on an exam.
Here’s the trick: to compute E[X] we have to sum the infinite series
E[X] = ∑_{k=0}^{∞} k(1 − p)^k p.

Now, we know the sum of the geometric series: ∑_{k=0}^{∞} x^k = 1/(1 − x).

Differentiate both sides: ∑_{k=0}^{∞} k x^(k−1) = 1/(1 − x)².

Multiply by x: ∑_{k=0}^{∞} k x^k = x/(1 − x)².

Replace x by 1 − p: ∑_{k=0}^{∞} k(1 − p)^k = (1 − p)/p².

Multiply by p: ∑_{k=0}^{∞} k(1 − p)^k p = (1 − p)/p.
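The derivation can be checked numerically: partial sums of ∑ k(1 − p)^k p approach (1 − p)/p. A Python sketch, with p = 0.25 as an arbitrary test value:

```python
# Partial sum of the series for E[X] with X ~ geo(p), where p(k) = (1-p)^k * p.
p = 0.25
partial = sum(k * (1 - p)**k * p for k in range(1000))
print(partial, (1 - p) / p)  # both ≈ 3.0
```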
Example 10. Flip a fair coin until you get heads for the first time. What is the expected
number of times you flipped tails?
Solution: The number of tails before the first head is modeled by X ∼ geo(1/2). From the previous example E[X] = (1/2)/(1/2) = 1. This is a surprisingly small number.
Example 11. Michael Jordan, perhaps the greatest basketball player ever, made 80% of his free throws. In a game, what is the expected number he would make before his first miss?
Solution: Here is an example where we want the number of successes before the first failure. Using the neutral language of heads and tails: success is tails (probability 1 − p) and failure is heads (probability p). Therefore p = 0.2 and the number of tails (made free throws) before the first heads (missed free throw) is modeled by X ∼ geo(0.2). We saw in Example 9 that this is
E[X] = (1 − p)/p = 0.8/0.2 = 4.
Consider a game based on rolling two dice: let X be the sum and let your payoff be Y = X² − 6X + 1. Is this bet worth taking? The values and probabilities are:

X     2     3     4     5     6     7     8     9    10    11    12
Y    -7    -8    -7    -4     1     8    17    28    41    56    73
prob 1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Here’s the R code I used to compute E[Y]:
x = 2:12
y = x^2 - 6*x + 1
p = c(1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1)/36
ave = sum(p*y)
It gave E[Y] = 13.833.
To answer the question above: since the expected payoff is positive it looks like a bet worth
taking.
Quiz: If Y = h(X) does E[Y ] = h(E[X])? Solution: NO!!! This is not true in general!
Think: Is it true in the previous example?
Quiz: If Y = 3X + 77 does E[Y ] = 3E[X] + 77?
Solution: Yes. By property (2), scaling and shifting does behave like this.
Variance of Discrete Random Variables
Class 5a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2 Spread
The expected value (mean) of a random variable is a measure of location or central tendency.
If you had to summarize a random variable with a single number, the mean would be a good
choice. Still, the mean leaves out a good deal of information. For example, the random
variables X and Y below both have mean 0, but their probability mass is spread out about
the mean quite differently.
values X:   -2    -1     0     1     2
pmf p(x): 1/10  2/10  4/10  2/10  1/10

values Y:   -3     3
pmf p(y):  1/2   1/2
It’s probably a little easier to see the different spreads in plots of the probability mass
functions. We use bars instead of dots to give a better sense of the mass.
pmf’s for two different distributions both with mean 0
In the next section, we will learn how to quantify this spread.
Taking the mean as the center of a random variable’s probability distribution, the variance
is a measure of how much the probability mass is spread out around this center. We’ll start
with the formal definition of variance and then unpack its meaning.
Definition: If X is a random variable with mean E[X] = µ, then the variance of X is defined by
Var(X) = E[(X − µ)²].
The standard deviation σ of X is defined to be the square root of the variance: σ = √Var(X).
Statistics Class 5a, Variance of Discrete Random Variables 2
If the relevant random variable is clear from context, then the variance and standard devi-
ation are often denoted by σ 2 and σ (‘sigma’), just as the mean is µ (‘mu’).
What does this mean? First, let’s rewrite the definition explicitly as a sum. If X takes
values x1 , x2 , . . . , xn with probability mass function p(xi ) then
Var(X) = E[(X − µ)²] = ∑_{i=1}^{n} p(xi)(xi − µ)².
In words, the formula for Var(X) says to take a weighted average of the squared distance
to the mean. By squaring, we make sure we are averaging only non-negative values, so that
the spread to the right of the mean won’t cancel that to the left. By using expectation,
we are weighting high probability values more than low probability values. (See Example 2
below.)
Note on units:
1. σ has the same units as X.
2. Var(X) has the same units as the square of X. So if X is in meters, then Var(X) is in
meters squared.
Because σ and X have the same units, the standard deviation is a natural measure of
spread.
Let’s work some examples to make the notion of variance clear.
Example 1. Compute the mean, variance and standard deviation of the random variable
X with the following table of values and probabilities.
value x 1 3 5
pmf p(x) 1/4 1/4 1/2
Solution: First we compute E[X] = 7/2. Then we extend the table to include (X − 7/2)2 .
value x 1 3 5
p(x) 1/4 1/4 1/2
(x − 7/2)2 25/4 1/4 9/4
Now the computation of the variance is similar to that of expectation:
Var(X) = (25/4) · (1/4) + (1/4) · (1/4) + (9/4) · (1/2) = 11/4.
Taking the square root we have the standard deviation σ = √(11/4).
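The same computation with exact fractions, as a Python sketch (illustrative only):

```python
from fractions import Fraction

values = [1, 3, 5]
probs = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

mu = sum(p * x for p, x in zip(probs, values))             # E[X]
var = sum(p * (x - mu)**2 for p, x in zip(probs, values))  # Var(X)
print(mu, var)  # 7/2 11/4
```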
Example 2. For each random variable X, Y , Z, and W plot the pmf and compute the
mean and variance.
(i) value x 1 2 3 4 5
pmf p(x) 1/5 1/5 1/5 1/5 1/5
(ii) value y 1 2 3 4 5
pmf p(y) 1/10 2/10 4/10 2/10 1/10
(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(iv) value w 1 2 3 4 5
pmf p(w) 0 0 1 0 0
Solution: Each random variable has the same mean 3, but the probability is spread out
differently. In the plots below, we order the pmf’s from largest to smallest variance: Z, X,
Y , W.
Next we’ll verify our visual intuition by computing the variance of each of the variables.
All of them have mean µ = 3. Since the variance is defined as an expected value, we can
compute it using the tables.
(i) value x 1 2 3 4 5
pmf p(x) 1/5 1/5 1/5 1/5 1/5
(X − µ)2 4 1 0 1 4
Var(X) = E[(X − µ)²] = 4/5 + 1/5 + 0 + 1/5 + 4/5 = 2.
(ii) value y 1 2 3 4 5
p(y) 1/10 2/10 4/10 2/10 1/10
(Y − µ)2 4 1 0 1 4
Var(Y) = E[(Y − µ)²] = 4/10 + 2/10 + 0 + 2/10 + 4/10 = 1.2.
(iii) value z 1 2 3 4 5
pmf p(z) 5/10 0 0 0 5/10
(Z − µ)2 4 1 0 1 4
Var(Z) = E[(Z − µ)²] = 20/10 + 20/10 = 4.
(iv) value w    1  2  3  4  5
pmf p(w)        0  0  1  0  0
(W − µ)²        4  1  0  1  4
Var(W) = E[(W − µ)²] = 0, since all of the probability sits at the mean µ = 3.
So far we have been using the notion of independent random variable without ever carefully
defining it. For example, a binomial distribution is the sum of independent Bernoulli trials.
This may (should?) have bothered you. Of course, we have an intuitive sense of what inde-
pendence means for experimental trials. We also have the probabilistic sense that random
variables X and Y are independent if knowing the value of X gives you no information
about the value of Y .
In a few classes we will work with continuous random variables and joint probability func-
tions. After that we will be ready for a full definition of independence. For now we can use
the following definition, which is exactly what you expect and is valid for discrete random
variables.
Definition: The discrete random variables X and Y are independent if
P(X = a, Y = b) = P(X = a) · P(Y = b)
for any values a and b.
We will use the following properties of variance:
1. If X and Y are independent then Var(X + Y) = Var(X) + Var(Y).
2. For constants a and b, Var(aX + b) = a²Var(X).
3. Var(X) = E[X²] − µ² = E[X²] − E[X]².
Property 3 gives a formula for Var(X) that is often easier to use in hand calculations. The computer is happy to use the definition! We’ll prove Properties 2 and 3 after some examples.
Example 3. Suppose X and Y are independent and Var(X) = 3 and Var(Y ) = 5. Find:
(i) Var(X + Y ), (ii) Var(3X + 4), (iii) Var(X + X), (iv) Var(X + 3Y ).
Solution: To compute these variances we make use of Properties 1 and 2.
(i) Since X and Y are independent, Var(X + Y ) = Var(X) + Var(Y ) = 8.
(ii) Using Property 2, Var(3X + 4) = 9 · Var(X) = 27.
(iii) Don’t be fooled! Property 1 fails since X is certainly not independent of itself. We can use Property 2: Var(X + X) = Var(2X) = 4 · Var(X) = 12. (Note: if we mistakenly used Property 1, we would get the wrong answer of 6.)
(iv) We use both Properties 1 and 2.
Var(X + 3Y ) = Var(X) + Var(3Y ) = 3 + 9 · 5 = 48.
Suppose X ∼ binomial(n, p). Since X is the sum of independent Bernoulli(p) variables and
each Bernoulli variable has variance p(1 − p) we have
X ∼ binomial(n, p) ⇒ Var(X) = np(1 − p).
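As a check, the alternative formula Var(X) = E[X²] − E[X]² (Property 3) reproduces np(1 − p) when summed directly. A Python sketch with exact fractions (the helper name is ours):

```python
from fractions import Fraction
from math import comb

def binomial_var_by_sum(n, p):
    """Var(X) = E[X^2] - E[X]^2 for X ~ binomial(n, p), computed exactly."""
    pmf = [Fraction(comb(n, k)) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    ex = sum(k * q for k, q in enumerate(pmf))
    ex2 = sum(k**2 * q for k, q in enumerate(pmf))
    return ex2 - ex**2

n, p = 8, Fraction(2, 5)
print(binomial_var_by_sum(n, p), n * p * (1 - p))  # both 48/25
```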
Proof of Property 2: This follows from the properties of E[X] and some algebra.
Let µ = E[X]. Then E[aX + b] = aµ + b and
Var(aX + b) = E[(aX + b − (aµ + b))²] = E[(a(X − µ))²] = a²E[(X − µ)²] = a²Var(X).
Proof of Property 3: We use the properties of E[X] and a bit of algebra. Remember that µ is a constant and that E[X] = µ.
Var(X) = E[(X − µ)²] = E[X² − 2µX + µ²] = E[X²] − 2µE[X] + µ² = E[X²] − 2µ² + µ² = E[X²] − µ².
Continuous Random Variables
Class 5b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2. Know the definition of the probability density function (pdf) and cumulative distribution
function (cdf).
3. Be able to explain why we use probability density for continuous random variables.
2 Introduction
We now turn to continuous random variables. All random variables assign a number to
each outcome in a sample space. Whereas discrete random variables take on a discrete set
of possible values, continuous random variables have a continuous set of values.
Computationally, to go from discrete to continuous we simply replace sums by integrals. It
will help you to keep in mind that (informally) an integral is just a continuous sum.
Example 1. Since time is continuous, the amount of time Jon is early (or late) for class is
a continuous random variable. Let’s go over this example in some detail.
Suppose you measure how early Jon arrives to class each day (in units of minutes). That
is, the outcome of one trial in our experiment is a time in minutes. We’ll assume there are
random fluctuations in the exact time he shows up. Since in principle Jon could arrive, say,
3.43 minutes early, or 2.7 minutes late (corresponding to the outcome -2.7), or at any other
time, the sample space consists of all real numbers. So the random variable which gives the
outcome itself has a continuous range of possible values.
It is too cumbersome to keep writing ‘the random variable’, so in future examples we might
write: Let T = “time in minutes that Jon is early for class on any given day.”
3 Calculus Warmup
While we will assume you can compute the most familiar forms of derivatives and integrals
by hand, we do not expect you to be calculus whizzes. For tricky expressions, we’ll let the
computer do most of the calculating. Conceptually, you should be comfortable with two
views of a definite integral.
1. ∫_a^b f(x) dx = area under the curve y = f(x).
2. ∫_a^b f(x) dx = ‘sum of f(x) dx’.
Statistics Class 5b, Continuous Random Variables 2
[Figures: the area under a curve between a and b, and the corresponding Riemann sum with subdivisions x0, x1, . . . , xn of width ∆x.]
A continuous random variable takes a range of values, which may be finite or infinite in
extent. Here are a few examples of ranges: [0, 1], [0, ∞), (−∞, ∞), [a, b].
Definition: A random variable X is continuous if there is a function f (x) such that for
any c ≤ d we have
P(c ≤ X ≤ d) = ∫_c^d f(x) dx.   (1)
The function f (x) is called the probability density function (pdf).
The pdf always satisfies the following properties:
1. f (x) ≥ 0 (f is nonnegative).
2. ∫_{−∞}^{∞} f(x) dx = 1 (this is equivalent to: P(−∞ < X < ∞) = 1).
The probability density function f (x) of a continuous random variable is the analogue of
the probability mass function p(x) of a discrete random variable. Here are two important
differences:
1. Unlike p(x), the pdf f (x) is not a probability. You have to integrate it to get proba-
bility. (See section 4.2 below.)
2. Since f (x) is not a probability, there is no restriction that f (x) be less than or equal
to 1.
Note: In Property 2, we integrated over (−∞, ∞) since we did not know the range of values
taken by X. Formally, this makes sense because we just define f (x) to be 0 outside of the
range of X. In practice, we would integrate between bounds given by the range of X.
If you graph the probability density function of a continuous random variable X then
P (c ≤ X ≤ d) = area under the graph between c and d.
[Figure: the shaded area under the graph of f(x) between c and d equals P(c ≤ X ≤ d).]
Why do we use the terms mass and density to describe the pmf and pdf? What is the
difference between the two? The simple answer is that these terms are completely analogous
to the mass and density you saw in physics and calculus. We’ll review this first for the
probability mass function and then discuss the probability density function.
Mass as a sum:
If masses m1 , m2 , m3 , and m4 are set in a row at positions x1 , x2 , x3 , and x4 , then the
total mass is m1 + m2 + m3 + m4 .
We can define a ‘mass function’ p(x) with p(xj ) = mj for j = 1, 2, 3, 4, and p(x) = 0
otherwise. In this notation the total mass is p(x1 ) + p(x2 ) + p(x3 ) + p(x4 ).
The probability mass function behaves in exactly the same way, except it has the dimension
of probability instead of mass.
Mass as an integral of density:
Suppose you have a rod of length L meters with varying density f (x) kg/m. (Note the units
are mass/length.)
[Figure: a rod from 0 to L divided into pieces of width ∆x at positions x1, x2, . . . , xn = L; the mass of the ith piece ≈ f(xi)∆x.]
If the density varies continuously, we must find the total mass of the rod by integration:
total mass = ∫_0^L f(x) dx.
This formula comes from dividing the rod into small pieces and ‘summing’ up the mass of each piece. That is:
total mass ≈ ∑_{i=1}^{n} f(xi) ∆x.
In the limit as ∆x goes to zero the sum becomes the integral.
The probability density function behaves exactly the same way, except it has units of
probability/(unit x) instead of kg/m. Indeed, equation (1) is exactly analogous to the
above integral for total mass.
While we’re on a physics kick, note that for both discrete and continuous random variables,
the expected value is simply the center of mass or balance point.
Example 2. Suppose X has pdf f (x) = 3 on [0, 1/3] (this means f (x) = 0 outside of
[0, 1/3]). Graph the pdf and compute P (0.1 ≤ X ≤ 0.2) and P (0.1 ≤ X ≤ 1).
Solution: P (0.1 ≤ X ≤ 0.2) is shown below at left. We can compute the integral:
P(0.1 ≤ X ≤ 0.2) = ∫_{0.1}^{0.2} f(x) dx = ∫_{0.1}^{0.2} 3 dx = 0.3.
P (0.1 ≤ X ≤ 1) is shown below at right. Since there is only area under f (x) up to 1/3, we
have P (0.1 ≤ X ≤ 1) = 3 · (1/3 − 0.1) = 0.7.
[Graphs: f(x) = 3 on [0, 1/3] with the area over [0.1, 0.2] shaded (left) and the area over [0.1, 1/3] shaded (right).]
Think: In the previous example f (x) takes values greater than 1. Why does this not
violate the rule that probabilities are always between 0 and 1?
Note on notation. We can define a random variable by giving its range and probability
density function. For example we might say, let X be a random variable with range [0,1]
and pdf f (x) = x/2. Implicitly, this means that X has no probability density outside of the
given range. If we wanted to be absolutely rigorous, we would say explicitly that f (x) = 0
outside of [0,1], but in practice this won’t be necessary.
Example 3. Let X be a random variable with range [0,1] and pdf f (x) = Cx2 . What is
the value of C?
Solution: Since the total probability must be 1, we have
∫_0^1 f(x) dx = 1 ⇔ ∫_0^1 Cx² dx = 1 ⇒ C/3 = 1 ⇒ C = 3.
Note: We say the constant C above is needed to normalize the density so that the total
probability is 1.
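We can sanity-check the normalization numerically, with a midpoint Riemann sum standing in for the integral. A Python sketch (the helper `integral` is our own, not a library routine):

```python
def integral(f, a, b, n=100_000):
    """Midpoint Riemann sum approximation of the integral of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

# Example 3: f(x) = C x^2 on [0, 1] with C = 3 should integrate to 1.
C = 3
total = integral(lambda x: C * x**2, 0, 1)
print(total)  # ≈ 1.0
```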
Think: What is P(a ≤ X ≤ a)? What is P(X = 0)? If P(X = a) = 0, does that mean X can never take the value a?
In words the above questions get at the fact that the probability that a random person’s
height is exactly 5’9” (to infinite precision, i.e. no rounding!) is 0. Yet it is still possible
that someone’s height is exactly 5’9”. So the answers to the thinking questions are 0, 0,
and No.
Definition: The cumulative distribution function (cdf) of a continuous random variable X is defined by
F(b) = P(X ≤ b).
Note well that the definition is about probability. When using the cdf you should first think
of it as a probability. Then when you go to calculate it you can use
F(b) = P(X ≤ b) = ∫_{−∞}^{b} f(x) dx, where f(x) is the pdf of X.
Notes:
1. For discrete random variables, we defined the cumulative distribution function but did
not have much occasion to use it. The cdf plays a far more prominent role for continuous
random variables.
2. As before, we started the integral at −∞ because we did not know the precise range of
X. Formally, this still makes sense since f (x) = 0 outside the range of X. In practice, we’ll
know the range and start the integral at the start of the range.
3. In practice we often say ‘X has distribution F (x)’ rather than ‘X has cumulative distri-
bution function F (x).’
Example 5. Find the cumulative distribution function for the density in Example 2.
Solution: For a in [0, 1/3] we have F(a) = ∫_0^a f(x) dx = ∫_0^a 3 dx = 3a.
Since f (x) is 0 outside of [0,1/3] we know F (a) = P (X ≤ a) = 0 for a < 0 and F (a) = 1
for a > 1/3. Putting this all together we have
F(a) = 0 if a < 0,
F(a) = 3a if 0 ≤ a ≤ 1/3,
F(a) = 1 if 1/3 < a.
[Graphs of the pdf f(x) = 3 on [0, 1/3] and the cdf F(a).]
Note the different scales on the vertical axes. Remember that the vertical axis for the pdf
represents probability density and that of the cdf represents probability.
Example 6. Find the cdf for the pdf in Example 3, f (x) = 3x2 on [0, 1]. Suppose X is a
random variable with this distribution. Find P (X < 1/2).
Solution: f(x) = 3x² on [0, 1] ⇒ F(a) = ∫_0^a 3x² dx = a³ on [0, 1]. Therefore,
F(a) = 0 if a < 0,
F(a) = a³ if 0 ≤ a ≤ 1,
F(a) = 1 if 1 < a.
Thus, P (X < 1/2) = F (1/2) = 1/8. Here are the graphs of f (x) and F (x):
[Graphs of f(x) = 3x² and F(x) = x³ on [0, 1].]
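The cdf in this example is simple enough to code directly. A Python sketch (the function name `F` just mirrors the notation above):

```python
def F(a):
    """cdf of X with pdf f(x) = 3x^2 on [0, 1]."""
    if a < 0:
        return 0.0
    if a <= 1:
        return a**3
    return 1.0

print(F(0.5))         # 0.125, i.e. P(X < 1/2) = 1/8
print(F(1) - F(0.5))  # 0.875, i.e. P(1/2 <= X <= 1)
```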
Properties of the cdf F(x) of a continuous random variable X:
1. (Definition) F(x) = P(X ≤ x)
2. 0 ≤ F(x) ≤ 1
3. F(x) is non-decreasing, i.e. if x1 ≤ x2 then F(x1) ≤ F(x2)
4. lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0
5. P(a ≤ X ≤ b) = F(b) − F(a)
6. F′(x) = f(x).
Properties 2, 3, 4 are identical to those for discrete distributions. The graphs in the previous
examples illustrate them.
Property 5 can be seen algebraically:
∫_{−∞}^{b} f(x) dx = ∫_{−∞}^{a} f(x) dx + ∫_{a}^{b} f(x) dx
⇔ ∫_{a}^{b} f(x) dx = ∫_{−∞}^{b} f(x) dx − ∫_{−∞}^{a} f(x) dx
⇔ P(a ≤ X ≤ b) = F(b) − F(a).
Property 5 can also be seen geometrically. The orange region below represents F (b) and
the striped region represents F (a). Their difference is P (a ≤ X ≤ b).
[Figure: the area F(b), the striped area F(a), and their difference P(a ≤ X ≤ b).]
We find it helpful to think of sampling values from a continuous random variable as throwing darts at a funny dartboard. Consider the region underneath the graph of a pdf as a
dartboard. Divide the board into small equal size squares and suppose that when you throw
a dart you are equally likely to land in any of the squares. The probability the dart lands
in a given region is the fraction of the total area under the curve taken up by the region.
Since the total area equals 1, this fraction is just the area of the region. If X represents
the x-coordinate of the dart, then the probability that the dart lands with x-coordinate
between a and b is just
P (a ≤ X ≤ b) = area under f (x) between a and b = ∫_a^b f (x) dx.
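The dartboard picture can be turned into a small simulation. The following Python sketch (our own illustration, not part of the original notes) throws uniform darts at the bounding box of f (x) = 3x² on [0, 1], keeps those that land under the curve, and compares the fraction landing between a and b with F (b) − F (a):

```python
# Dartboard sampling for the pdf f(x) = 3x^2 on [0, 1].
import random

random.seed(0)

def sample_x(n):
    # throw darts uniformly at the box [0,1] x [0,3]; keep those under f
    xs = []
    while len(xs) < n:
        x = random.uniform(0.0, 1.0)
        y = random.uniform(0.0, 3.0)
        if y < 3 * x**2:          # dart landed under the curve
            xs.append(x)
    return xs

xs = sample_x(100_000)
a, b = 0.5, 1.0
frac = sum(1 for x in xs if a <= x <= b) / len(xs)
exact = b**3 - a**3               # F(b) - F(a) = 1 - 1/8 = 0.875
print(frac, exact)                # the fraction is close to 0.875
```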
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who appear in the title, and all praise for this work must go to them. The only modifications made to fit it to "Statistics" (DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public, but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without their help, this course would be far from the quality of its current state.
Gallery of Continuous Random Variables
Class 5c, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to give examples of what uniform, exponential and normal distributions are used
to model.
2. Be able to give the range and pdf’s of uniform, exponential and normal distributions.
2 Introduction
Here we introduce a few fundamental continuous distributions. These will play important
roles in the statistics part of the class. For each distribution, we give the range, the pdf,
the cdf, and a short description of situations that it models. These distributions all depend
on parameters, which we specify.
As you look through each distribution do not try to memorize all the details; you can always
look those up. Rather, focus on the shape of each distribution and what it models.
Although it comes towards the end, we call your attention to the normal distribution. It is
easily the most important distribution defined here.
When we studied discrete random variables we learned, for example, about the Bernoulli(p)
distribution. The probability p used to define the distribution is called a parameter and
Bernoulli(p) is called a parametrized distribution. For example, tosses of fair coin follow a
Bernoulli distribution where the parameter p = 0.5. When we study statistics one of the
key questions will be to estimate the parameters of a distribution. For example, if I have
a coin that may or may not be fair then I know it follows a Bernoulli(p) distribution, but
I don’t know the value of the parameter p. I might run experiments and use the data to
estimate the value of p.
As another example, the binomial distribution Binomial(n, p) depends on two parameters
n and p.
In the following sections we will look at specific parametrized continuous distributions.
The applet https://mathlets.org/mathlets/probability-distributions/ allows you
to visualize the pdf and cdf of these distributions and to dynamically change the parameters.
3 Uniform distribution
1. Parameters: a, b.
2. Range: [a, b].
3. Notation: uniform(a, b) or U(a, b).
4. Density: f (x) = 1/(b − a) for a ≤ x ≤ b.
5. Distribution: F (x) = (x − a)/(b − a) for a ≤ x ≤ b.
6. Models: Situations where all outcomes in the range have equal probability (more precisely, all outcomes have the same probability density).
Graphs:
[pdf f (x) = 1/(b − a) and cdf F (x) for the uniform(a, b) distribution.]
Example 1. 1. Suppose we have a tape measure with markings at each millimeter. If we
measure (to the nearest marking) the length of items that are roughly a meter long, the
rounding error will be uniformly distributed between -0.5 and 0.5 millimeters.
2. Many board games use spinning arrows (spinners) to introduce randomness. When spun,
the arrow stops at an angle that is uniformly distributed between 0 and 2π radians.
3. In most pseudo-random number generators, the basic generator simulates a uniform
distribution and all other distributions are constructed by transforming the basic generator.
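For concreteness, here is the rounding-error example in Python with scipy.stats.uniform (a sketch; the notes use R, and SciPy parametrizes the uniform distribution by loc = a and scale = b − a rather than by the endpoints):

```python
# Tape-measure rounding error: uniform on [-0.5, 0.5] millimeters.
from scipy.stats import uniform

err = uniform(loc=-0.5, scale=1.0)     # loc = a, scale = b - a

print(err.pdf(0.0))                    # density 1/(b-a) = 1 on the range
print(err.cdf(0.25))                   # F(0.25) = (0.25 - (-0.5))/1 = 0.75
print(err.cdf(0.25) - err.cdf(-0.25))  # P(|error| <= 0.25 mm) = 0.5
```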
4 Exponential distribution
1. Parameter: λ.
2. Range: [0, ∞).
3. Notation: exponential(λ) or exp(λ).
4. Density: f (x) = λe^{−λx} for 0 ≤ x.
5. Distribution: F (x) = 1 − e^{−λx} for 0 ≤ x. Right tail: P (X > x) = e^{−λx}.
6. Models: The waiting time for a continuous process to change state.
Example 2. If I step out to 77 Mass Ave after class and wait for the next taxi, my waiting
time in minutes is exponentially distributed. We will see that in this case λ is given by
1/(average number of taxis that pass per minute).
Example 3. The exponential distribution models the waiting time until an unstable isotope
undergoes nuclear decay. In this case, the value of λ is related to the half-life of the isotope.
Memorylessness: There are other distributions that also model waiting times, but the
exponential distribution has the additional property that it is memoryless. Here’s what this
means in the context of Example 2: suppose that the probability that a taxi arrives within
the first five minutes is p. If I wait five minutes and, in this case, no taxi arrives, then the
probability that a taxi arrives within the next five minutes is still p. That is, my previous
wait of 5 minutes has no impact on the length of my future wait!
By contrast, suppose I were to instead go to Kendall Square subway station and wait for
the next inbound train. Since the trains are coordinated to follow a schedule (e.g., roughly
12 minutes between trains), if I wait five minutes without seeing a train then there is a far
greater probability that a train will arrive in the next five minutes. In particular, waiting
time for the subway is not memoryless, and a better model would be the uniform distribution
on the range [0,12].
The memorylessness of the exponential distribution is analogous to the memorylessness
of the (discrete) geometric distribution, where having flipped 5 tails in a row gives no
information about the next 5 flips. Indeed, the exponential distribution is precisely the
continuous counterpart of the geometric distribution, which models the waiting time for a
discrete process to change state. More formally, memoryless means that the probability of waiting t more minutes is independent of the amount of time already waited. In symbols,
P (X > s + t | X > s) = P (X > t).
To verify this, note that P (X > s + t and X > s) = P (X > s + t), since the event ‘waited at least s minutes’ contains the event ‘waited at least s + t minutes’. Therefore the formula for conditional probability gives
P (X > s + t | X > s) = P (X > s + t)/P (X > s) = e^{−λ(s+t)}/e^{−λs} = e^{−λt} = P (X > t).
The probability P (X > s + t) = e−λ(s+t) is the formula for the right tail probability given
above.
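Memorylessness is easy to verify numerically. A Python sketch with scipy.stats.expon (the choices λ = 0.2 and s = t = 5 are ours, purely for illustration):

```python
# Check P(X > s+t | X > s) = P(X > t) for an exponential waiting time.
from scipy.stats import expon

lam = 0.2
X = expon(scale=1/lam)          # SciPy parametrizes by scale = 1/lambda

s, t = 5.0, 5.0
lhs = X.sf(s + t) / X.sf(s)     # P(X > s+t | X > s); sf is the right tail
rhs = X.sf(t)                   # P(X > t)
print(lhs, rhs)                 # equal: both are e^{-lambda t}
```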
Graphs:
[pdf f (x) = λe^{−λx} and cdf F (x) = 1 − e^{−λx} for the exponential(λ) distribution.]
5 Normal distribution
In 1809, Carl Friedrich Gauss published a monograph introducing several notions that have
become fundamental to statistics: the normal distribution, maximum likelihood estimation,
and the method of least squares (we will cover all three in this course). For this reason,
the normal distribution is also called the Gaussian distribution, and it is by far the most
important continuous distribution.
1. Parameters: µ, σ.
2. Range: (−∞, ∞).
3. Notation: normal(µ, σ²) or N (µ, σ²).
4. Density: f (x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.
5. Distribution: F (x) has no closed formula, so use tables or software such as pnorm in R to compute it.
The standard normal distribution N (0, 1) has mean 0 and variance 1. We reserve Z for a standard normal random variable, φ(z) = (1/√(2π)) e^{−z²/2} for the standard normal density, and Φ(z) for the standard normal cdf.
Note: we will define mean and variance for continuous random variables next time. They
have the same interpretations as in the discrete case. As you might guess, the normal
distribution N (µ, σ 2 ) has mean µ, variance σ 2 , and standard deviation σ.
Here are some graphs of normal distributions. Note that they are shaped like a bell curve.
Note also that as σ increases they become more spread out.
The bell curve: First we show the standard normal probability density and cumulative
distribution functions. Below that is a selection of normal densities. Notice that the graph
is centered on the mean and the bigger the variance the more spread out the curve.
[Figures: the standard normal pdf φ(z) and cdf Φ(z); below, a selection of normal densities with different means and variances.]
Notation note. In the figure above we use our notation N (µ, σ²). So, for example, N (8, 0.5) has variance 0.5 and standard deviation σ = √0.5 ≈ 0.7071.
To make approximations it is useful to remember the following rule of thumb for three
approximate probabilities from the standard normal distribution:
within 1 · σ ≈ 68%
within 2 · σ ≈ 95%
within 3 · σ ≈ 99%
[Figure: standard normal density with the central 68%, 95%, and 99% regions between ±σ, ±2σ, ±3σ.]
Symmetry calculations
We can use the symmetry of the standard normal distribution about z = 0 to make some
calculations.
Example 4. The rule of thumb says P (−1 ≤ Z ≤ 1) ≈ 0.68. Use this to estimate Φ(1).
Solution: Φ(1) = P (Z ≤ 1). In the figure, the two tails (in blue) have combined area
1 − 0.68 = 0.32. By symmetry the left tail has area 0.16 (half of 0.32), so P (Z ≤ 1) ≈
0.68 + 0.16 = 0.84.
[Figure: standard normal density split into P (Z ≤ −1) = 0.16, P (−1 ≤ Z ≤ 1) = 0.34 + 0.34, and P (Z ≥ 1) = 0.16.]
In R, normal cdf values are computed with the function pnorm:
pnorm(0,0,1)
[1] 0.5
pnorm(1,0,2)
[1] 0.6914625
pnorm(1,0,1) - pnorm(-1,0,1)
[1] 0.6826895
pnorm(5,0,5) - pnorm(-5,0,5)
[1] 0.6826895
# Of course z can be a vector of values
pnorm(c(-3,-2,-1,0,1,2,3),0,1)
[1] 0.001349898 0.022750132 0.158655254 0.500000000 0.841344746 0.977249868 0.998650102
Note: The R function pnorm(x, µ, σ) uses σ, whereas our notation for the normal distribution N (µ, σ²) uses σ².
Here’s a table of values with fewer decimal places of accuracy:
z:     -2      -1      0       0.3     0.5     1       2       3
Φ(z):  0.0228  0.1587  0.5000  0.6179  0.6915  0.8413  0.9772  0.9987
In 18.05, we only have time to work with a few of the many wonderful distributions that are
used in probability and statistics. We hope that after this course you will feel comfortable
learning about new distributions and their properties when you need them. Wikipedia is
often a great starting point.
The Pareto distribution is one common, beautiful distribution that we will not have time
to cover in depth.
Models: The Pareto distribution models a power law, where the probability that
an event occurs varies as a power of some attribute of the event. Many phenomena
follow a power law, such as the size of meteors, income levels across a population, and
population levels across cities. See Wikipedia for loads of examples:
https://en.wikipedia.org/wiki/Pareto_distribution#Applications
Expectation, Variance and Standard Deviation for
Continuous Random Variables
Class 6a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to compute and interpret expectation, variance, and standard deviation for
continuous random variables.
2. Be able to compute and interpret quantiles for discrete and continuous random variables.
2 Introduction
So far we have looked at expected value, standard deviation, and variance for discrete
random variables. These summary statistics have the same meaning for continuous random
variables.
To move from discrete to continuous, we will simply replace the sums in the formulas by
integrals. We will do this carefully and go through many examples in the following sections.
In the last section, we will introduce another type of summary statistic, quantiles. You may
already be familiar with the 0.5 quantile of a distribution, otherwise known as the median
or 50th percentile.
Definition: Let X be a continuous random variable with range [a, b] and probability
density function f (x). The expected value of X is defined by
E[X] = ∫_a^b x f (x) dx.
Let’s see how this compares with the formula for a discrete random variable:
E[X] = Σ_{i=1}^n x_i p(x_i).
The discrete formula says to take a weighted sum of the values xi of X, where the weights
are the probabilities p(x_i). Recall that f (x) is a probability density; its units are probability/(unit of x).
3.1 Examples
Let’s go through several example computations. Where the solution requires an integration
technique, we push the computation of the integral to the appendix.
Example 1. Let X ∼ uniform(0, 1). Find E[X].
Solution: X has range [0, 1] and density f (x) = 1. Therefore,
E[X] = ∫_0^1 x dx = x²/2 |_0^1 = 1/2.
Not surprisingly the mean is at the midpoint of the range.
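As a numerical check (a Python sketch; the notes themselves use R), we can evaluate the defining integral directly:

```python
# E[X] for uniform(0, 1) by numerical integration of x * f(x).
from scipy.integrate import quad

f = lambda x: 1.0                           # uniform(0, 1) density
mean, _err = quad(lambda x: x * f(x), 0, 1)
print(mean)                                 # 0.5, the midpoint of the range
```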
[Figure: a density on [0, 2] with mean µ = 1.5.] µ is “pulled” to the right of the midpoint 1 because there is more mass to the right.
[Figures: the exponential density f (x) = λe^{−λx} has mean µ = 1/λ; the standard normal density φ(z) has mean µ = 0.]
The properties of E[X] for continuous random variables are the same as for discrete ones:
1. If X and Y are random variables on a sample space Ω then
E[X + Y ] = E[X] + E[Y ]. (linearity I)
2. If a and b are constants then
E[aX + b] = aE[X] + b. (linearity II)
This works exactly the same as the discrete case. If h(x) is a function, then Y = h(X) is a random variable and
E[Y] = E[h(X)] = ∫_{−∞}^∞ h(x) f_X(x) dx.
4 Variance
Now that we’ve defined expectation for continuous random variables, the definition of variance is identical to that of discrete random variables.
Definition: Let X be a continuous random variable with mean µ. The variance of X is
Var(X) = E[(X − µ)²] = ∫_a^b (x − µ)² f (x) dx.
For the standard normal Z this gives
Var(Z) = (1/√(2π)) ∫_{−∞}^∞ z² e^{−z²/2} dz
(using integration by parts with u = z, v′ = z e^{−z²/2} ⇒ u′ = 1, v = −e^{−z²/2})
= (1/√(2π)) [−z e^{−z²/2}]_{−∞}^∞ + (1/√(2π)) ∫_{−∞}^∞ e^{−z²/2} dz.
The first term equals 0 because the exponential goes to zero much faster than z grows at both ±∞. The second term equals 1 because it is exactly the total probability integral of the pdf φ(z) for N(0, 1). So Var(Z) = 1.
The integral in the last line is the same one we computed for Var(Z).
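The Var(Z) integral is also easy to check numerically (a Python sketch with SciPy, not part of the original notes):

```python
# Numerical check that Var(Z) = 1 for the standard normal.
import math
from scipy.integrate import quad

phi = lambda z: math.exp(-z**2 / 2) / math.sqrt(2 * math.pi)
var, _err = quad(lambda z: z**2 * phi(z), -math.inf, math.inf)
print(var)   # 1.0
```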
5 Quantiles
Definition: The median of X is the value x for which P (X ≤ x) = 0.5, i.e. the value
of x such that P (X ≤ x) = P (X ≥ x). In other words, X has equal probability of
being above or below the median, and each probability is therefore 1/2. In terms of the
cdf F (x) = P (X ≤ x), we can equivalently define the median as the value x satisfying
F (x) = 0.5.
Think: What is the median of Z?
Solution: By symmetry, the median is 0.
Definition: More generally, the p-th quantile of X is the value q_p for which P (X ≤ q_p) = p, i.e. F (q_p) = p.
Example. Let X ∼ uniform(0, 1). Find q_{0.6}, the 0.6 quantile.
Solution: The cdf for X is F (x) = x on the range [0, 1]. So q_{0.6} = 0.6.
[Figures: for uniform(0, 1), the graphs of f (x) and F (x) with q_{0.6} = 0.6 marked; for the standard normal, the graphs of φ(z) and Φ(z) with q_{0.6} ≈ 0.253 marked, i.e. the left tail area up to q_{0.6} is 0.6.]
Quantiles give a useful measure of location for a random variable. We will use them more
in coming lectures.
For convenience, quantiles are often described in terms of percentiles, deciles or quartiles.
The 60th percentile is the same as the 0.6 quantile. For example you are in the 60th percentile
for height if you are taller than 60 percent of the population, i.e. the probability that you
are taller than a randomly chosen person is 60 percent.
Likewise, deciles represent steps of 1/10. The third decile is the 0.3 quantile. Quartiles are
in steps of 1/4. The third quartile is the 0.75 quantile and the 75th percentile.
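Numerically, quantiles are computed with the inverse cdf, which SciPy calls ppf (a Python sketch; the values match the examples above):

```python
# Quantiles via the inverse cdf (SciPy's ppf).
from scipy.stats import norm, uniform

print(uniform(0, 1).ppf(0.6))   # q_0.6 for uniform(0, 1) is 0.6
print(norm(0, 1).ppf(0.6))      # q_0.6 for Z, approximately 0.253
print(norm(0, 1).ppf(0.5))      # the median of Z is 0
```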
For the exponential(λ) distribution, integration by parts (u = x, v′ = λe^{−λx} ⇒ v = −e^{−λx}) gives the mean:
E[X] = ∫_0^∞ x λe^{−λx} dx = [−x e^{−λx}]_0^∞ + ∫_0^∞ e^{−λx} dx = 0 − [e^{−λx}/λ]_0^∞ = 1/λ.
Similarly for the second moment:
E[X²] = ∫_0^∞ x² λe^{−λx} dx = [−x² e^{−λx}]_0^∞ + ∫_0^∞ 2x e^{−λx} dx
(the first term is 0; for the second term use integration by parts: u = 2x, v′ = e^{−λx} ⇒ u′ = 2, v = −e^{−λx}/λ)
= [−2x e^{−λx}/λ]_0^∞ + ∫_0^∞ (2/λ) e^{−λx} dx = 0 − [2 e^{−λx}/λ²]_0^∞ = 2/λ².
Central Limit Theorem and the Law of Large Numbers
Class 6b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
3. Be able to use the central limit theorem to approximate probabilities of averages and
sums of independent identically-distributed random variables.
2 Introduction
We all understand intuitively that the average of many measurements of the same unknown
quantity tends to give a better estimate than a single measurement. Intuitively, this is
because the random error of each measurement cancels out in the average. In these notes
we will make this intuition precise in two ways: the law of large numbers (LoLN) and the
central limit theorem (CLT).
Briefly, both the law of large numbers and the central limit theorem are about many independent samples from the same distribution. The LoLN tells us two things:
1. The average of many independent samples is (with high probability) close to the mean
of the underlying distribution.
2. The density histogram of many independent samples is (with high probability) close
to the graph of the density of the underlying distribution.
The mathematics of the LoLN says that the average of a lot of independent samples from a
random variable will almost certainly approach the mean of the variable. The mathematics
cannot tell us if the tool or experiment is producing data worth averaging. For example,
if the measuring device is defective or poorly calibrated then the average of many measurements will be a highly accurate estimate of the wrong thing! This is an example of
systematic error or sampling bias, as opposed to the random error controlled by the law of
large numbers.
Note that X̄n is itself a random variable. The law of large numbers and central limit theorem tell us about the value and distribution of X̄n, respectively.
LoLN: As n grows, the probability that X̄n is close to µ goes to 1.
CLT: As n grows, the distribution of X̄n converges to the normal distribution N (µ, σ²/n).
Before giving a more formal statement of the LoLN, let’s unpack its meaning through a
concrete example (we’ll return to the CLT later on).
The law of large numbers says that this probability goes to 1 as the number of flips n gets large. Our R code produces the following values for P (0.4 ≤ X̄n ≤ 0.6).
n = 10: pbinom(6, 10, 0.5) - pbinom(3, 10, 0.5) = 0.65625
n = 50: pbinom(30, 50, 0.5) - pbinom(19, 50, 0.5) = 0.8810795
n = 100: pbinom(60, 100, 0.5) - pbinom(39, 100, 0.5) = 0.9647998
n = 500: pbinom(300, 500, 0.5) - pbinom(199, 500, 0.5) = 0.9999941
n = 1000: pbinom(600, 1000, 0.5) - pbinom(399, 1000, 0.5) = 1
As predicted by the LoLN the probability goes to 1 as n grows.
We redo the computations to see the probability of being within 0.01 of the mean. Our R code produces the following values for P (0.49 ≤ X̄n ≤ 0.51).
This says precisely that as n increases the probability of being within a of the mean goes
to 1. Think of a as a small tolerance of error from the true mean µ.
Looking back at Example 1, we see that for tosses of a fair coin: If we choose the number of tosses n = 500, then with probability p = 0.99999, the experimental frequency of heads X̄n will be within a = 0.1 of 0.5. In words, this tells us that, on average, only 1 in 100,000 experiments will produce an experimental frequency outside this range. If we decrease the tolerance a and/or increase the probability p, then n will need to be larger.
4 Histograms
1. Pick an interval of the real line and divide it into m intervals, with endpoints b0 , b1 , . . . ,
bm . Usually these are equally sized, so let’s assume this to start.
[Figure: an interval of the real line divided into bins with endpoints b0, b1, …, b6.]
2. Place each xi into the bin that contains its value. If xi lies on the boundary of two bins,
we’ll put it in the left bin (this is the R default, though it can be changed).
3. To draw a frequency histogram: put a vertical bar above each bin. The height of the
bar should equal the number of xi in the bin.
4. To draw a density histogram: put a vertical bar above each bin. The area of the bar
should equal the fraction of all data points that lie in the bin.
Notes:
1. When all the bins have the same width, the frequency histogram bars have area proportional to the count. So the density histogram is obtained simply by dividing the height of each bar by the total area of the frequency histogram. Ignoring the vertical scale, the two histograms look identical.
2. Caution: if the bin widths differ, the frequency and density histograms may look very
different. There is an example of this below. Don’t let anyone fool you by manipulating
bin widths to produce a histogram that suits their mischievous purposes!
In 18.05, we’ll stick with equally-sized bins. In general, we prefer the density histogram
since its vertical scale is the same as that of the pdf.
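The frequency/density bookkeeping can be sketched in Python with numpy.histogram, using the example data [0.5, 1, 1, 1.5, 1.5, 1.5, 2, 2, 2, 2] from these notes (note that NumPy's bins are left-closed, unlike R's right-closed default; this does not matter here since no data point falls on an interior bin boundary):

```python
# Frequency vs. density histogram heights, computed by hand.
import numpy as np

x = np.array([0.5, 1, 1, 1.5, 1.5, 1.5, 2, 2, 2, 2])
bins = np.array([0.25, 0.75, 1.25, 1.75, 2.25])    # equal width 0.5

freq, _ = np.histogram(x, bins=bins)                # heights = counts
dens, _ = np.histogram(x, bins=bins, density=True)  # bar areas sum to 1

print(freq)                           # counts per bin
print(dens)                           # density heights = count/(n * width)
print((dens * np.diff(bins)).sum())   # total area of density histogram = 1
```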
Examples. Here are some examples of histograms, all with the data [0.5,1,1,1.5,1.5,1.5,2,2,2,2].
The R code that drew them is in the file ’class6-prep-b.r’. You can find it in the usual place
on our website.
1. Here the density and frequency plots look the same but have different vertical scales.
[Figure: frequency histogram (left, counts) and density histogram (right) of the data; the bars have the same shape but different vertical scales.]
Bins centered at 0.5, 1, 1.5, 2, i.e. width 0.5, bounds at 0.25, 0.75, 1.25, 1.75, 2.25.
2. Note the values are all on the bin boundaries and are put into the left-hand bin. That is, the bins are right-closed, e.g. the first bin is for values in the right-closed interval (0, 0.5].
[Figure: frequency histogram (left) and density histogram (right) with right-closed bins of width 0.5 on [0, 2].]
3. Here we show density histograms based on different bin widths. Note that the scale
keeps the total area equal to 1. The gaps are bins with zero counts.
[Figure: density histograms of the same data with two different bin widths; the vertical scale keeps the total area equal to 1.]
4. Here we use unequal bin widths, so the density and frequency histograms look different.
[Figure: frequency histogram (left) and density histogram (right) with unequal bin widths.]
Be careful if you try to make a frequency histogram with unequal bin widths. Compare the frequency histogram with unequal bin widths with all the other histograms we drew for this data. It clearly looks different. What happened is that by combining the data in bins (0.5, 1] and (1, 1.5] into one bin (0.5, 1.5] we effectively made the height of both smaller bins greater.
The reason the density histogram is nice is discussed in the next section.
The law of large numbers has an important consequence for density histograms.
LoLN for histograms: With high probability the density histogram of a large number
of samples from a distribution is a good approximation of the graph of the underlying pdf
f (x) over the range of the histogram.
Let’s illustrate this by generating a density histogram with bin width 0.1 from 100000 draws
from a standard normal distribution. As you can see, the density histogram very closely
tracks the graph of the standard normal pdf ϕ(z).
Density histogram of 100000 draws from a standard normal distribution, with ϕ(z) in blue.
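The same demonstration is easy to reproduce; here is a sketch in Python (the notes use R):

```python
# LoLN for histograms: a density histogram of many standard normal draws
# closely tracks the standard normal pdf phi(z).
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)

bins = np.linspace(-4, 4, 81)                      # bin width 0.1
dens, edges = np.histogram(z, bins=bins, density=True)
centers = (edges[:-1] + edges[1:]) / 2
phi = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)

# largest gap between histogram height and pdf over all bins: small
print(np.max(np.abs(dens - phi)))
```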
5.1 Standardization
Given a random variable X with mean µ and standard deviation σ, we define the standardization of X as the new random variable
Z = (X − µ)/σ.
Note that Z has mean 0 and standard deviation 1. Note also that if X has a normal
distribution, then the standardization of X is the standard normal distribution Z with
mean 0 and variance 1. This explains the term ‘standardization’ and the notation of Z
above.
Suppose X1 , X2 , . . . , Xn , . . . are i.i.d. random variables each having mean µ and standard
deviation σ. For each n, let Sn denote the sum and let X n be the average of X1 , . . . , Xn .
Sn = X1 + X2 + . . . + Xn = Σ_{i=1}^n Xi    and    X̄n = (X1 + X2 + . . . + Xn)/n = Sn/n.
The properties of mean and variance show
E[Sn] = nµ,   Var(Sn) = nσ²,   σ_{Sn} = √n · σ
E[X̄n] = µ,   Var(X̄n) = σ²/n,   σ_{X̄n} = σ/√n.
Since they are multiples of each other, Sn and X̄n have the same standardization
Zn = (Sn − nµ)/(σ√n) = (X̄n − µ)/(σ/√n).
The proof of the Central Limit Theorem is more technical than we want to get in 18.05. It
is accessible to anyone with a decent calculus background.
To apply the CLT, we will want to have some normal probabilities at our fingertips. The
following probabilities appeared in Class 5. Let Z ∼ N (0, 1), a standard normal random
variable. Then with rounding we have:
1. P (|Z| < 1) ≈ 0.68
2. P (|Z| < 2) ≈ 0.95; more precisely P (|Z| < 1.96) ≈ 0.95.
3. P (|Z| < 3) ≈ 0.997
These numbers are easily computed in R using pnorm. However, they are well worth remembering as rules of thumb. You should think of them as:
1. The probability that a normal random variable is within 1 standard deviation of its
mean is 0.68.
2. The probability that a normal random variable is within 2 standard deviations of its
mean is 0.95.
3. The probability that a normal random variable is within 3 standard deviations of its
mean is 0.997.
[Figure: normal pdf with the central 68%, 95%, and 99% regions between µ ± σ, µ ± 2σ, and µ ± 3σ.]
Example 2. Flip a fair coin 100 times. Estimate the probability of more than 55 heads.
Solution: Let Xj be the result of the jth flip, so Xj = 1 for heads and Xj = 0 for tails.
The total number of heads is
S = X1 + X2 + . . . + X100 .
The central limit theorem says that the standardization of S is approximately N(0, 1). The
question asks for P (S > 55). Standardizing and using the CLT we get
P (S > 55) = P ((S − 50)/5 > (55 − 50)/5) ≈ P (Z > 1) = 0.16.
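The comparison between the CLT estimate and the exact binomial answer can be scripted directly. A Python/SciPy sketch of Example 2 (the notes do the equivalent computation with R's pbinom and pnorm):

```python
# CLT approximation vs. exact binomial probability of more than 55 heads.
from scipy.stats import binom, norm

n, p = 100, 0.5
mu, sigma = n * p, (n * p * (1 - p)) ** 0.5   # 50 and 5

exact = 1 - binom.cdf(55, n, p)               # P(S > 55), S binomial
approx = 1 - norm.cdf((55 - mu) / sigma)      # CLT: P(Z > 1) ~ 0.16
print(exact, approx)
```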
Example 3. Estimate the probability of more than 220 heads in 400 flips.
Solution: This is nearly identical to the previous example. Now µS = 200 and σS = 10
and we want P (S > 220). Standardizing and using the CLT we get:
P (S > 220) = P ((S − µ_S)/σ_S > (220 − 200)/10) ≈ P (Z > 2) = 0.025.
Example 4. Estimate the probability of between 40 and 60 heads in 100 flips.
Solution: As in Example 2, µ_S = 50 and σ_S = 5. Standardizing gives
P (40 ≤ S ≤ 60) = P ((40 − 50)/5 ≤ (S − 50)/5 ≤ (60 − 50)/5) ≈ P (−2 ≤ Z ≤ 2).
We can compute the right-hand side using our rule of thumb. For a more accurate answer
we use R:
pnorm(2) - pnorm(-2) = 0.954 . . .
Recall that in Section 3 we used the binomial distribution to compute an answer of 0.965. . . .
So our approximate answer using CLT is off by about 1%.
Think: Would you expect the CLT method to give a better or worse approximation of
P (200 < S < 300) with n = 500?
We encourage you to check your answer using R.
Example 5. Polling. When taking a political poll the results are often reported as a
number with a margin of error. For example 52% ± 3% favor candidate A. The rule of
thumb is that if you poll n people then the margin of error is ±1/√n. We will now see exactly what this means and that it is an application of the central limit theorem.
Suppose there are 2 candidates A and B. Suppose further that the fraction of the population
who prefer A is p0. That is, if you ask a random person who they prefer, then the probability they’ll answer A is p0.
To run the poll a pollster selects n people at random and asks ‘Do you support candidate
A or candidate B?’ Thus we can view the poll as a sequence of n independent Bernoulli(p0 )
trials, X1 , X2 , . . . , Xn , where Xi is 1 if the ith person prefers A and 0 if they prefer B. The
fraction of people polled that prefer A is just the average X̄.
We know that each Xi ∼ Bernoulli(p0), so
E[Xi] = p0 and σ_{Xi} = √(p0(1 − p0)).
In a normal distribution 95% of the probability is within 2 standard deviations of the mean. This means that in 95% of polls of n people the sample mean X̄ will be within 2σ/√n of the true mean p0. The final step is to note that for any value of p0 we have σ ≤ 1/2. (It is an easy calculus exercise to see that 1/4 is the maximum value of σ² = p0(1 − p0).) This means that we can conservatively say that in 95% of polls of n people the sample mean X̄ is within 1/√n of the true mean. The frequentist statistician then takes the interval X̄ ± 1/√n and calls it the 95% confidence interval for p0.
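The 1/√n rule of thumb can be checked by simulation. In this Python sketch (our own; the true fraction p0 = 0.52 and the poll size n = 400 are arbitrary choices), about 95% or more of simulated polls land within the margin:

```python
# Simulate many polls and count how often x-bar lands within 1/sqrt(n) of p0.
import random

random.seed(1)
p0, n, polls = 0.52, 400, 2000
margin = 1 / n**0.5                 # conservative 95% margin of error

hits = 0
for _ in range(polls):
    xbar = sum(random.random() < p0 for _ in range(n)) / n
    if abs(xbar - p0) <= margin:
        hits += 1
print(hits / polls)                 # roughly 0.95 or a bit more
```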
A word of caution: it is tempting and common, but wrong, to think that there is a 95%
probability the true fraction p0 is in a particular confidence interval. This is subtle, but
the error is the same one as thinking you have a disease if a test with a 95% true positive
rate comes back positive. We will go into this in much more detail when we learn about
confidence intervals.
Since the probabilities in the above examples can be computed exactly using the binomial
distribution, you may be wondering what is the point of finding an approximate answer
using the CLT. In fact, we were only able to compute these probabilities exactly because
the Xi were Bernoulli and so the sum S was binomial. In general, the distribution of the
Xi may not be familiar, or may not even be known, so you will not be able to compute the
probabilities for S exactly. It can also happen that the exact computation is possible in
theory but too computationally intensive in practice, even for a computer. The power of
the CLT is that it applies whenever Xi has a mean and a variance. Though the CLT applies
to many distributions, we will see in the next section that some distributions require larger
n for the approximation to be a good one.
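This convergence is easy to see by simulation. A Python sketch (our own illustration) of the standardized average of n = 64 exponential samples, which should look approximately N(0, 1):

```python
# Standardized averages of exponential samples are approximately N(0, 1).
import random
import statistics

random.seed(2)

def standardized_avg(n):
    # exponential(lambda = 1) has mean 1 and standard deviation 1
    xbar = sum(random.expovariate(1.0) for _ in range(n)) / n
    return (xbar - 1.0) / (1.0 / n**0.5)

zs = [standardized_avg(64) for _ in range(5000)]
frac = sum(abs(z) < 1 for z in zs) / len(zs)

print(statistics.mean(zs), statistics.stdev(zs))  # near 0 and 1
print(frac)                                       # near the 68% rule of thumb
```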
First we show the standardized average of n i.i.d. uniform random variables with n =
1, 2, 4, 8, 12. The pdf of the average is in blue and the standard normal pdf is in red. By
the time n = 12 the fit between the standardized average and the true normal looks very
good.
[Figure: pdf of the standardized average of n i.i.d. Uniform random variables for n = 1, 2, 4, 8, 12 (blue), each compared with the standard normal pdf (red); x from −3 to 3.]
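The convergence in the figure can also be checked numerically. The Python sketch below (our illustration) draws standardized averages of Uniform(0, 1) variables and records how often they fall within one standard deviation of 0; for a standard normal that probability is about 0.683, and n = 12 already comes close:

```python
import math, random

random.seed(0)

def standardized_uniform_mean(n):
    """One draw of the standardized average of n i.i.d. Uniform(0, 1)."""
    xbar = sum(random.random() for _ in range(n)) / n
    mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of Uniform(0, 1)
    return (xbar - mu) / (sigma / math.sqrt(n))

trials = 20_000
coverage = {}
for n in (1, 12):
    zs = [standardized_uniform_mean(n) for _ in range(trials)]
    coverage[n] = sum(abs(z) <= 1 for z in zs) / trials

print(coverage)  # for N(0, 1), P(|Z| <= 1) is about 0.683
```

For n = 1 the exact value is 2/√12 ≈ 0.577, visibly non-normal.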
Next we show the standardized average of n i.i.d. exponential random variables with
n = 1, 2, 4, 8, 16, 64. Notice that this asymmetric density takes more terms to converge to
the normal density.
[Figure: pdf of the standardized average of n i.i.d. Exponential random variables for n = 1, 2, 4, 8, 16, 64 (blue), each compared with the standard normal pdf (red).]
The central limit theorem works for discrete variables also. Here is the standardized average
of n i.i.d. Bernoulli(0.5) random variables with n = 1, 2, 12, 64. Notice that as n grows, the
average can take more values, which allows the discrete distribution to 'fill in' the normal
density.
[Figure: pdf of the standardized average of n i.i.d. Bernoulli(0.5) random variables for n = 1, 2, 12, 64, each compared with the standard normal pdf; x from −3 to 3.]
Note. In order to put the binomial (sum of Bernoulli) and normal distribution on the same
axes, we had to convert the binomial probability mass function to a density. We did this
by making it a bar graph with bars centered on each value and with bar width equal to the
distance between values. Then the height of each bar is chosen so that the area equals the
probability of the corresponding value.
Finally we show the (non-standardized) average of n Bernoulli(0.5) random variables, with n = 4, 12, 64. Notice how the standard deviation gets smaller, resulting in a spikier (more peaked) density. (In these figures, rather than plotting colored bars, we made the bars white and only plotted a blue line at the center of each bar.)
Statistics Class 6b, Central Limit Theorem and the Law of Large Numbers 13
[Figure: pmf of the (non-standardized) average of n Bernoulli(0.5) random variables for n = 4, 12, 64, plotted as densities; the density becomes more peaked as n grows.]
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who appear in the title, and all credit for this work must go to them. The only modifications made to fit it to "Statistics" (DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public, but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without their help, this course would not have nearly the quality of its current state.
Covariance and Correlation
Class 7b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2 Covariance
Covariance is a measure of how much two random variables vary together. For example,
height and weight of giraffes have positive covariance because when one is big the other
tends also to be big.
Definition: Suppose X and Y are random variables with means µX and µY. The covariance of X and Y is defined as

Cov(X, Y) = E[(X − µX)(Y − µY)].

Covariance has the following properties:
1. Cov(aX + b, cY + d) = ac Cov(X, Y) for constants a, b, c, d.
2. Cov(X1 + X2, Y) = Cov(X1, Y) + Cov(X2, Y).
3. Cov(X, X) = Var(X).
4. Cov(X, Y) = E[XY] − µX µY.
5. If X and Y are independent then Cov(X, Y) = 0.
6. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y).

Notes. 1. Property 4 is like the similar property for variance. Indeed, if X = Y it is exactly that property: Var(X) = E[X²] − µX².
2. By Property 5, the formula in Property 6 reduces to our earlier formula Var(X + Y) = Var(X) + Var(Y) when X and Y are independent.
We give the proofs below. However, understanding and using these properties is more
important than memorizing their proofs.
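The properties can also be checked empirically. Here is a Python sketch (our own illustration, with X and Y built from a shared uniform variable so that they are dependent) verifying Property 6 on sample moments and estimating Cov(X, Y):

```python
import random

random.seed(1)
N = 100_000

# A dependent pair: X = U + V and Y = V + W with U, V, W i.i.d. Uniform(0, 1).
u = [random.random() for _ in range(N)]
v = [random.random() for _ in range(N)]
w = [random.random() for _ in range(N)]
x = [a + b for a, b in zip(u, v)]
y = [b + c for b, c in zip(v, w)]

def cov(s, t):
    """Sample covariance (dividing by N)."""
    ms, mt = sum(s) / len(s), sum(t) / len(t)
    return sum((a - ms) * (b - mt) for a, b in zip(s, t)) / len(s)

sums = [a + b for a, b in zip(x, y)]
lhs = cov(sums, sums)                          # Var(X + Y)
rhs = cov(x, x) + cov(y, y) + 2 * cov(x, y)    # Property 6, sample version
print(round(lhs, 4), round(rhs, 4), round(cov(x, y), 4))
```

Here Cov(X, Y) estimates Var(V) = 1/12 ≈ 0.083, since V is the only term shared by X and Y.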
Statistics Class 7b, Covariance and Correlation 2
Since covariance is defined as an expected value we compute it in the usual way as a sum
or integral.
Discrete case: If X and Y have joint pmf p(xi, yj) then

Cov(X, Y) = Σ_{i=1}^{n} Σ_{j=1}^{m} p(xi, yj)(xi − µX)(yj − µY) = Σ_{i=1}^{n} Σ_{j=1}^{m} p(xi, yj) xi yj − µX µY.

Continuous case: If X and Y have joint pdf f(x, y) over range [a, b] × [c, d] then

Cov(X, Y) = ∫_c^d ∫_a^b (x − µX)(y − µY) f(x, y) dx dy = ∫_c^d ∫_a^b xy f(x, y) dx dy − µX µY.
2.3 Examples
Example 1. Flip a fair coin 3 times. Let X be the number of heads in the first 2 flips
and let Y be the number of heads on the last 2 flips (so there is overlap on the middle flip).
Compute Cov(X, Y ).
Solution: We’ll do this twice, first using the joint probability table and the definition of
covariance, and then using the properties of covariance.
With 3 tosses there are only 8 outcomes {HHH, HHT,...}, so we can create the joint prob-
ability table directly.
X\Y 0 1 2 p(xi )
0 1/8 1/8 0 1/4
1 1/8 2/8 1/8 1/2
2 0 1/8 1/8 1/4
p(yj ) 1/4 1/2 1/4 1
From the marginals we compute E[X] = 1 = E[Y]. Now we use the definition:

Cov(X, Y) = E[(X − µX)(Y − µY)] = Σ_{i,j} p(xi, yj)(xi − 1)(yj − 1).

We write out the sum leaving out all the terms that are 0, i.e. all the terms where xi = 1 or yj = 1 or the probability is 0:

Cov(X, Y) = (1/8)(0 − 1)(0 − 1) + (1/8)(2 − 1)(2 − 1) = 1/4.
We could also have used Property 4 to do the computation: From the full table we compute

E[XY] = 1·(2/8) + 2·(1/8) + 2·(1/8) + 4·(1/8) = 5/4.

So Cov(X, Y) = E[XY] − µX µY = 5/4 − 1 = 1/4.
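This table computation is easy to mechanize. The Python sketch below (our illustration) stores the joint pmf from the table and applies Property 4 with exact fractions:

```python
from fractions import Fraction as F

# Joint pmf from the table above: p[x][y] for X, Y in {0, 1, 2}.
p = {
    0: {0: F(1, 8), 1: F(1, 8), 2: F(0)},
    1: {0: F(1, 8), 1: F(2, 8), 2: F(1, 8)},
    2: {0: F(0),    1: F(1, 8), 2: F(1, 8)},
}

mu_x = sum(x * p[x][y] for x in p for y in p[x])
mu_y = sum(y * p[x][y] for x in p for y in p[x])
e_xy = sum(x * y * p[x][y] for x in p for y in p[x])
cov = e_xy - mu_x * mu_y
print(mu_x, mu_y, e_xy, cov)  # 1 1 5/4 1/4
```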
Next we redo the computation of Cov(X, Y ) using the properties of covariance. As usual,
let Xi be the result of the ith flip, so Xi ∼ Bernoulli(0.5). We have
X = X1 + X2 and Y = X2 + X3 .
We know E[Xi] = 1/2 and Var(Xi) = 1/4. Therefore, using Property 2 of covariance, we have

Cov(X, Y) = Cov(X1 + X2, X2 + X3) = Cov(X1, X2) + Cov(X1, X3) + Cov(X2, X2) + Cov(X2, X3).

Since the different tosses are independent, the covariances between distinct Xi are all 0. Looking at the expression for Cov(X, Y) there is only one non-zero term:

Cov(X, Y) = Cov(X2, X2) = Var(X2) = 1/4.
Example 2. (Zero covariance does not imply independence.) Let X be a random variable
that takes values −2, −1, 0, 1, 2; each with probability 1/5. Let Y = X 2 . Show that
Cov(X, Y ) = 0 but X and Y are not independent.
Solution: We make a joint probability table:
Y \X -2 -1 0 1 2 p(yj )
0 0 0 1/5 0 0 1/5
1 0 1/5 0 1/5 0 2/5
4 1/5 0 0 0 1/5 2/5
p(xi ) 1/5 1/5 1/5 1/5 1/5 1
For instance, P(X = −2, Y = 0) = 0, while P(X = −2) P(Y = 0) = (1/5)(1/5) = 1/25. Since these are not equal, X and Y are not independent. Finally, we compute the covariance using Property 4:

Cov(X, Y) = E[XY] − µX µY = (1/5)(−8 − 1 + 0 + 1 + 8) − µX µY = 0.
Discussion: This example shows that Cov(X, Y ) = 0 does not imply that X and Y are
independent. In fact, X and X 2 are as dependent as random variables can be: if you know
the value of X then you know the value of X 2 with 100% certainty.
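A short exact check of this example in Python (our illustration):

```python
from fractions import Fraction as F

xs = [-2, -1, 0, 1, 2]
p = F(1, 5)  # X is uniform on xs and Y = X**2

mu_x = sum(x * p for x in xs)
mu_y = sum(x**2 * p for x in xs)
e_xy = sum(x**3 * p for x in xs)  # XY = X * X**2 = X**3
cov = e_xy - mu_x * mu_y
print(mu_x, mu_y, cov)  # 0 2 0
```

Despite Cov(X, Y) = 0, Y is a deterministic function of X.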
The key point is that Cov(X, Y ) measures the linear relationship between X and Y . In
the above example X and X 2 have a quadratic relationship that is completely missed by
Cov(X, Y ).
Continuous covariance works the same way, except our computations are done with integrals
instead of sums. Here is an example.
Example 3. Continuous covariance. Suppose X and Y are jointly distributed random variables with range the unit square [0, 1] × [0, 1] and joint pdf f(x, y) = 2x³ + 2y³.
(i) Verify that f(x, y) is a valid probability density.
(ii) Compute µX and µY.
(iii) Compute the covariance Cov(X, Y).
Solution: Part of the point of this example is to show how to set up and compute the inte-
grals using a joint density function. Since the pdf here is a polynomial, these computations
are relatively easy.
(i) A valid pdf has two properties: it is nonnegative and the total integral over the entire
joint range is 1.
Nonnegativity is clear: f(x, y) ≥ 0. The integral is not hard to compute:

∫_0^1 ∫_0^1 f(x, y) dx dy = ∫_0^1 ∫_0^1 (2x³ + 2y³) dx dy.

Inner integral: ∫_0^1 (2x³ + 2y³) dx = [x⁴/2 + 2xy³]_0^1 = 1/2 + 2y³.

Outer integral: ∫_0^1 (1/2 + 2y³) dy = [y/2 + y⁴/2]_0^1 = 1.

So, the integral over the entire joint range is 1. Thus, f(x, y) = 2x³ + 2y³ is a valid probability density.
(ii) We need to compute integrals to find the means. We will write down the integrals, but
not show the details of their computation. (Also, by symmetry, we know the two means are
the same.)
µX = ∫_0^1 ∫_0^1 x f(x, y) dx dy = ∫_0^1 ∫_0^1 (2x⁴ + 2xy³) dx dy = 13/20,

µY = ∫_0^1 ∫_0^1 y f(x, y) dx dy = ∫_0^1 ∫_0^1 (2yx³ + 2y⁴) dx dy = 13/20.
(iii) We know Cov(X, Y) = E[(X − µX)(Y − µY)]. This is an integral. Again, we will write down the integral, but not show the details of its computation:

Cov(X, Y) = E[(X − µX)(Y − µY)] = ∫_0^1 ∫_0^1 (x − 13/20)(y − 13/20) f(x, y) dx dy
          = ∫_0^1 ∫_0^1 (x − 13/20)(y − 13/20)(2x³ + 2y³) dx dy = −9/400.
(In fact, we wrote down the integral in the most straightforward way, but secretly we did
the computation by computing E[XY ] − E[X]E[Y ].)
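Since we skipped the details of the integrals, here is a Python sketch (our illustration) that approximates all three of them with a simple midpoint rule and reproduces 1, 13/20, and −9/400:

```python
def f(x, y):
    return 2 * x**3 + 2 * y**3

def dbl_integral(g, steps=200):
    """Midpoint-rule approximation of the integral of g over [0,1] x [0,1]."""
    h = 1 / steps
    total = 0.0
    for i in range(steps):
        for j in range(steps):
            total += g((i + 0.5) * h, (j + 0.5) * h)
    return total * h * h

total_mass = dbl_integral(f)                   # should be 1
mu_x = dbl_integral(lambda x, y: x * f(x, y))  # should be 13/20
cov = dbl_integral(lambda x, y: (x - 13/20) * (y - 13/20) * f(x, y))
print(round(total_mass, 4), round(mu_x, 4), round(cov, 5))  # cov near -9/400
```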
Here’s a plot of the pseudo-random samples generated from this distribution. Because the
R code could do it easily, we also include a plot with a more extreme density function.
[Figure: scatter plots of pseudo-random samples from f(x, y) = 2x³ + 2y³ on the unit square (left) and from a more extreme density (right).]
3 Correlation
The units of covariance Cov(X, Y ) are ‘units of X times units of Y ’. This makes it hard to
compare covariances: if we change scales then the covariance changes as well. Correlation
is a way to remove the scale from the covariance.
Definition: The correlation coefficient between X and Y is defined by

Cor(X, Y) = ρ = Cov(X, Y) / (σX σY).
Properties of correlation:
1. ρ is the covariance of the standardizations of X and Y.
2. ρ is dimensionless (it is a ratio of quantities with the same units).
3. −1 ≤ ρ ≤ 1. Furthermore,
ρ = +1 if and only if Y = aX + b with a > 0,
ρ = −1 if and only if Y = aX + b with a < 0.
Property 3 shows that ρ measures the linear relationship between variables. If the correlation is positive, then when X is large, Y will tend to be large as well. If the correlation is negative, then when X is large, Y will tend to be small.
Example 2 above shows that correlation can completely miss higher order relationships.
Example 4. Continuing Example 1, compute the correlation Cor(X, Y).

Solution: We computed Cov(X, Y) = 1/4 above. With X = X1 + X2 we have Var(X) = 2 Var(Xj) = 1/2. So σX = 1/√2. Likewise σY = 1/√2. Thus

Cor(X, Y) = Cov(X, Y) / (σX σY) = (1/4) / (1/2) = 1/2.
We see a positive correlation, which means that larger X tend to go with larger Y and
smaller X with smaller Y . In Example 1 this happens because toss 2 is included in both X
and Y , so it contributes to the size of both.
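We can confirm this value by simulating the three flips directly (a Python sketch, our illustration):

```python
import random

random.seed(2)

def corr(s, t):
    """Sample correlation of two equal-length lists."""
    n = len(s)
    ms, mt = sum(s) / n, sum(t) / n
    cov = sum((a - ms) * (b - mt) for a, b in zip(s, t)) / n
    vs = sum((a - ms) ** 2 for a in s) / n
    vt = sum((b - mt) ** 2 for b in t) / n
    return cov / (vs * vt) ** 0.5

xs, ys = [], []
for _ in range(100_000):
    f1, f2, f3 = (random.randint(0, 1) for _ in range(3))
    xs.append(f1 + f2)  # X: heads among the first two flips
    ys.append(f2 + f3)  # Y: heads among the last two flips
r = corr(xs, ys)
print(round(r, 3))  # theory: Cor(X, Y) = 1/2
```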
Example 5. Look back at Example 3. See if you can compute the following.

Var(X) = 31/400, so σX = √(31/400) ≈ 0.28.
Var(Y) = Var(X), so σY ≈ 0.28.
Cor(X, Y) = Cov(X, Y) / (σX σY) = (−9/400) / (31/400) ≈ −0.29.
[Figure: scatter plots of bivariate samples with correlations rho = 0.00, rho = 0.30, rho = 0.70, rho = 1.00, rho = −0.50, and rho = −0.90.]
We ran simulations in R of the following scenario. X1 , X2 , . . . , X20 are i.i.d and follow a
U (0, 1) distribution. X and Y are both sums of the same number of Xi . We call the number
of Xi common to both X and Y the overlap. The notation in the figures below indicates
the number of Xi being summed and the number which overlap. For example, 5,3 indicates
that X and Y were each the sum of 5 of the Xi and that 3 of the Xi were common to both
sums. (The data was generated using rand(1,1000);)
Using the linearity of covariance it is easy to compute the theoretical correlation. For
each plot we give both the theoretical correlation and the correlation of the data from the
simulated sample.
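Here is a Python version of the same experiment (our illustration; the original simulation was run in R). By linearity of covariance, Cov(X, Y) = overlap · Var(Xi) and Var(X) = Var(Y) = (number summed) · Var(Xi), so the theoretical correlation is the overlap divided by the number summed, e.g. 3/5 for the '5,3' case:

```python
import random

random.seed(3)

def corr(s, t):
    """Sample correlation of two equal-length lists."""
    n = len(s)
    ms, mt = sum(s) / n, sum(t) / n
    cov = sum((a - ms) * (b - mt) for a, b in zip(s, t)) / n
    vs = sum((a - ms) ** 2 for a in s) / n
    vt = sum((b - mt) ** 2 for b in t) / n
    return cov / (vs * vt) ** 0.5

def overlap_corr(k, m, trials=50_000):
    """X and Y are each sums of k i.i.d. U(0,1) terms; m terms are shared."""
    xs, ys = [], []
    for _ in range(trials):
        u = [random.random() for _ in range(2 * k - m)]
        xs.append(sum(u[:k]))      # the first k variables
        ys.append(sum(u[k - m:]))  # the last k variables (overlap of size m)
    return corr(xs, ys)

r = overlap_corr(5, 3)
print(round(r, 3))  # theoretical correlation is m/k = 3/5
```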
[Figure: scatter plots of the simulated (X, Y) pairs for several choices of the number of Xi summed and the overlap, each showing the theoretical and sample correlations.]
Proof that independence implies zero covariance (Property 5): if X and Y are independent, the expectation of the product factors, so

Cov(X, Y) = E[(X − µX)(Y − µY)] = E[X − µX] E[Y − µY] = 0.

Proof that −1 ≤ ρ ≤ 1: since a variance is never negative,

0 ≤ Var(X/σX − Y/σY) = Var(X)/σX² + Var(Y)/σY² − 2 Cov(X, Y)/(σX σY) = 2 − 2ρ.

This implies ρ ≤ 1.

Likewise 0 ≤ Var(X/σX + Y/σY) = 2 + 2ρ, so −1 ≤ ρ.

If ρ = 1 then 0 = Var(X/σX − Y/σY) ⇒ X/σX − Y/σY = c for some constant c, i.e. Y = aX + b with a = σY/σX > 0.
Joint Distributions, Independence
Class 7a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Understand what is meant by a joint pmf, pdf and cdf of two random variables.
2 Introduction
In science and in real life, we are often interested in two (or more) random variables at the
same time. For example, we might measure the height and weight of giraffes, or the IQ
and birthweight of children, or the frequency of exercise and the rate of heart disease in
adults, or the level of air pollution and rate of respiratory illness in cities, or the number of
Facebook friends and the age of Facebook members.
Think: What relationship would you expect in each of the five examples above? Why?
In such situations the random variables have a joint distribution that allows us to compute
probabilities of events involving both variables and understand the relationship between the
variables. This is simplest when the variables are independent. When they are not, we use
covariance and correlation as measures of the nature of the dependence between them.
3 Joint Distribution
Suppose X and Y are two discrete random variables and that X takes values {x1 , x2 , . . . , xn }
and Y takes values {y1 , y2 , . . . , ym }. The ordered pair (X, Y ) take values in the product
{(x1 , y1 ), (x1 , y2 ), . . . (xn , ym )}. The joint probability mass function (joint pmf) of X and Y
is the function p(xi , yj ) giving the probability of the joint outcome X = xi , Y = yj .
We organize this in a joint probability table, as in the examples below.
Statistics Class 7a, Joint Distributions, Independence 2
Example 1. Roll two dice. Let X be the value on the first die and let Y be the value on
the second die. Then both X and Y take values 1 to 6 and the joint pmf is p(i, j) = 1/36
for all i and j between 1 and 6. Here is the joint probability table:
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Example 2. Roll two dice. Let X be the value on the first die and let T be the total on
both dice. Here is the joint probability table:
X\T 2 3 4 5 6 7 8 9 10 11 12
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36
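Joint probability tables like these are easy to build and query in code. This Python sketch (our illustration) encodes the joint pmf of X and T and recovers the marginal pmf of T by summing the columns:

```python
from fractions import Fraction as F

# Joint pmf of X (first die) and T (total): p(i, t) = 1/36 exactly when the
# second die shows t - i, i.e. when 1 <= t - i <= 6.
def p(i, t):
    return F(1, 36) if 1 <= t - i <= 6 else F(0)

# Marginal pmf of T: sum each column of the joint table.
marginal_T = {t: sum(p(i, t) for i in range(1, 7)) for t in range(2, 13)}
print(marginal_T[2], marginal_T[7], marginal_T[12])  # 1/36 1/6 1/36
```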
The continuous case is essentially the same as the discrete case: we just replace discrete sets
of values by continuous intervals, the joint probability mass function by a joint probability
density function, and the sums by integrals.
If X takes values in [a, b] and Y takes values in [c, d] then the pair (X, Y ) takes values in
the product [a, b] × [c, d]. The joint probability density function (joint pdf) of X and Y
is a function f (x, y) giving the probability density at (x, y). That is, the probability that
(X, Y ) is in a small rectangle of width dx and height dy around (x, y) is f (x, y) dx dy.
[Figure: the rectangle [a, b] × [c, d] with a small dx × dy box at the point (x, y); the probability that (X, Y ) lands in the small box is f (x, y) dx dy.]
Note: as with the pdf of a single random variable, the joint pdf f (x, y) can take values
greater than 1; it is a probability density, not a probability.
In 18.05 we won’t expect you to be experts at double integration. Here’s what we will
expect.
For a non-rectangular region, when f (x, y) = c is constant, you should know that the
double integral equals c × (the area of the region).
3.3 Events
Random variables are useful for describing events. Recall that an event is a set of outcomes
and that random variables assign numbers to outcomes. For example, the event ‘X > 1’
is the set of all outcomes for which X is greater than 1. These concepts readily extend to
pairs of random variables and joint outcomes.
Example 3. Roll two dice as in Example 1 and let B be the event ‘Y − X ≥ 2’. As a set of outcomes,
B = {(1, 3), (1, 4), (1, 5), (1, 6), (2, 4), (2, 5), (2, 6), (3, 5), (3, 6), (4, 6)}.
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
The probability of B is the sum of the probabilities in the 10 squares corresponding to the
outcomes in B, so P (B) = 10/36.
Example 4. Suppose X and Y both take values in [0,1] with uniform density f (x, y) = 1.
Visualize the event ‘X > Y ’ and find its probability.
Solution: Jointly X and Y take values in the unit square. The event ‘X > Y ’ corresponds
to the shaded lower-right triangle below. Since the density is constant, the probability is
just the fraction of the total area taken up by the event. In this case, it is clearly 0.5.
[Figure: the event ‘X > Y ’ in the unit square — the triangle below the diagonal y = x.]
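A quick way to sanity-check this answer is Monte Carlo simulation. The Python sketch below (an illustration, not part of the original notes) samples points uniformly from the unit square and estimates P(X > Y):

```python
import random

random.seed(0)
N = 100_000
# Draw (X, Y) uniformly from the unit square and count how often X > Y.
hits = sum(1 for _ in range(N) if random.random() > random.random())
est = hits / N
print(est)  # close to 0.5
```

Because the density is uniform, the estimate is just the fraction of sampled points that land in the triangle.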
Example 5. Suppose X and Y both take values in [0,1] with density f (x, y) = 4xy. Show
f (x, y) is a valid joint pdf, visualize the event A = ‘X < 0.5 and Y > 0.5’ and find its
probability.
Solution: Jointly X and Y take values in the unit square.
[Figure: the event A = ‘X < 0.5 and Y > 0.5’ in the unit square — the upper-left quadrant.]
To show f (x, y) is a valid joint pdf we must check that it is positive (which it clearly is)
and that the total probability is 1.
Total probability = ∫_0^1 ∫_0^1 4xy dx dy = ∫_0^1 [2x²y]_{x=0}^{1} dy = ∫_0^1 2y dy = 1. QED
The event A is just the upper-left-hand quadrant. Because the density is not constant we
must compute an integral to find the probability.
P (A) = ∫_0^{0.5} ∫_{0.5}^1 4xy dy dx = ∫_0^{0.5} [2xy²]_{y=0.5}^{1} dx = ∫_0^{0.5} (3x/2) dx = 3/16.
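The double integral can also be checked numerically. Here is a Python sketch using a simple midpoint rule (the helper `double_integral` is our own, not a library function):

```python
def double_integral(f, x0, x1, y0, y1, n=400):
    """Midpoint-rule approximation of the double integral of f over
    the rectangle [x0, x1] x [y0, y1], using an n-by-n grid."""
    dx = (x1 - x0) / n
    dy = (y1 - y0) / n
    total = 0.0
    for i in range(n):
        x = x0 + (i + 0.5) * dx
        for j in range(n):
            y = y0 + (j + 0.5) * dy
            total += f(x, y) * dx * dy
    return total

f = lambda x, y: 4 * x * y
# P(A) = integral of 4xy over [0, 0.5] x [0.5, 1]
p = double_integral(f, 0, 0.5, 0.5, 1)
print(p)  # ≈ 3/16 = 0.1875
```

The midpoint rule is exact for functions that are linear in each variable, so here it recovers 3/16 up to floating-point rounding.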
Suppose X and Y are jointly-distributed random variables. We will use the notation ‘X ≤
x, Y ≤ y’ to mean the event ‘X ≤ x and Y ≤ y’. The joint cumulative distribution function
(joint cdf) is defined as
F (x, y) = P (X ≤ x, Y ≤ y).
Continuous case: If X and Y are continuous random variables with joint density f (x, y)
over the range [a, b] × [c, d] then the joint cdf is given by the double integral
F (x, y) = ∫_c^y ∫_a^x f (u, v) du dv.
To recover the joint pdf, we differentiate the joint cdf. Because there are two variables we
need to use partial derivatives:
f (x, y) = ∂²F/∂x∂y (x, y).
Discrete case: If X and Y are discrete random variables with joint pmf p(xi , yj ) then the
joint cdf is given by the double sum
F (x, y) = Σ_{xi ≤ x} Σ_{yj ≤ y} p(xi , yj ).
Example 6. Find the joint cdf for the random variables in Example 5.
Solution: The event ‘X ≤ x and Y ≤ y’ is a rectangle in the unit square.
[Figure: the event ‘X ≤ x and Y ≤ y’ is the rectangle in the unit square with upper-right corner (x, y).]
Integrating the joint pdf f (u, v) = 4uv over this rectangle gives
F (x, y) = ∫_0^y ∫_0^x 4uv du dv = x²y².
Example 7. Compute F (3.5, 4) for the two dice of Example 1, where X and Y are the
values on the first and second die.
X\Y 1 2 3 4 5 6
1 1/36 1/36 1/36 1/36 1/36 1/36
2 1/36 1/36 1/36 1/36 1/36 1/36
3 1/36 1/36 1/36 1/36 1/36 1/36
4 1/36 1/36 1/36 1/36 1/36 1/36
5 1/36 1/36 1/36 1/36 1/36 1/36
6 1/36 1/36 1/36 1/36 1/36 1/36
Adding up the probability in the squares with X ≤ 3.5 and Y ≤ 4 (the first three rows and first four columns) we get F (3.5, 4) = 12/36 = 1/3.
Note. One unfortunate difference between the continuous and discrete visualizations is that
for continuous variables the value increases as we go up in the vertical direction while the
opposite is true for the discrete case. We have experimented with changing the discrete
tables to match the continuous graphs, but it causes too much confusion. We will just have
to live with the difference!
When X and Y are jointly-distributed random variables, we may want to consider only one
of them, say X. In that case we need to find the pmf (or pdf or cdf) of X without reference
to Y . This is called the marginal pmf (or pdf or cdf). The next example illustrates the
way to compute this and the reason for the term ‘marginal’.
Example 8. In Example 2 we rolled two dice and let X be the value on the first die and
T be the total on both dice. Compute the marginal pmf for X and for T .
Solution: In the table each row represents a single value of X. So the event ‘X = 3’ is
the third row of the table. To find P (X = 3) we simply have to sum up the probabilities in
this row. We put the sum in the right-hand margin of the table. Likewise P (T = 5) is just
the sum of the column with T = 5. We put the sum in the bottom margin of the table.
X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(tj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1
Note: Of course in this case we already knew the pmf of X and of T . It is good to see that
our computation here is in agreement!
As motivated by this example, marginal pmfs are obtained from the joint pmf by summing:
pX (xi ) = Σ_j p(xi , yj ),    pY (yj ) = Σ_i p(xi , yj ).
The term marginal refers to the fact that the values are written in the margins of the table.
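The row and column sums can be automated. The Python sketch below (a check, not part of the original notes) rebuilds the (X, T) table of Example 8 and computes both marginals by summing:

```python
from fractions import Fraction

F = Fraction(1, 36)
# Joint pmf of Example 2 as nested lists: rows are X = 1..6,
# columns are T = 2..12 (same layout as the table above).
# P(X = i, T = t) = 1/36 exactly when i + 1 <= t <= i + 6.
table = [[F if i + 1 <= t <= i + 6 else Fraction(0) for t in range(2, 13)]
         for i in range(1, 7)]

# Marginal of X: sum each row; marginal of T: sum each column.
p_X = [sum(row) for row in table]
p_T = [sum(col) for col in zip(*table)]

print(p_X)     # six entries of 1/6
print(p_T[3])  # P(T = 5) = 4/36 = 1/9
```

The computed marginals match the right-hand and bottom margins of the table.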
For a continuous joint density f (x, y) with range [a, b] × [c, d], the marginal pdfs are:
fX (x) = ∫_c^d f (x, y) dy,    fY (y) = ∫_a^b f (x, y) dx.
Compare these with the marginal pmfs above; as usual the sums are replaced by integrals.
We say that to obtain the marginal for X, we integrate out Y from the joint pdf and vice
versa.
Example 9. Suppose (X, Y ) takes values on the square [0, 1] × [1, 2] with joint pdf
f (x, y) = (8/3)x³y. Find the marginal pdfs fX (x) and fY (y).
Solution: To find fX (x) we integrate out y and to find fY (y) we integrate out x.
fX (x) = ∫_1^2 (8/3)x³y dy = [(4/3)x³y²]_{y=1}^{2} = 4x³
fY (y) = ∫_0^1 (8/3)x³y dx = [(2/3)x⁴y]_{x=0}^{1} = (2/3)y.
Example 10. Suppose (X, Y ) takes values on the unit square [0, 1] × [0, 1] with joint pdf
f (x, y) = (3/2)(x² + y²). Find the marginal pdf fX (x) and use it to find P (X < 0.5).
Solution:
fX (x) = ∫_0^1 (3/2)(x² + y²) dy = (3/2)[x²y + y³/3]_{y=0}^{1} = (3/2)x² + 1/2.
P (X < 0.5) = ∫_0^{0.5} fX (x) dx = ∫_0^{0.5} ((3/2)x² + 1/2) dx = [x³/2 + x/2]_{0}^{0.5} = 5/16.
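As a numerical check of Example 10 (again a Python sketch, not part of the notes), we can integrate with a midpoint rule and compare against the closed-form answers:

```python
def integrate(g, a, b, n=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

f = lambda x, y: 1.5 * (x**2 + y**2)

# Marginal density at x = 0.3: integrate out y, compare with (3/2)x^2 + 1/2.
fx = integrate(lambda y: f(0.3, y), 0, 1)
print(fx, 1.5 * 0.3**2 + 0.5)   # both ≈ 0.635

# P(X < 0.5) using the marginal formula (3/2)x^2 + 1/2:
p = integrate(lambda x: 1.5 * x**2 + 0.5, 0, 0.5)
print(p)   # ≈ 5/16 = 0.3125
```

With 2000 subintervals the midpoint rule agrees with the exact values to well under 10⁻⁶.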
Finding the marginal cdf from the joint cdf is easy. If X and Y jointly take values on
[a, b] × [c, d] then
FX (x) = F (x, d), FY (y) = F (b, y).
If d is ∞ then this becomes a limit: FX (x) = lim_{y→∞} F (x, y). Likewise for FY (y).
Example 11. The joint cdf in the last example was F (x, y) = (1/2)(x³y + xy³) on [0, 1] × [0, 1].
Find the marginal cdfs and use FX (x) to compute P (X < 0.5).
Solution: We have FX (x) = F (x, 1) = (1/2)(x³ + x) and FY (y) = F (1, y) = (1/2)(y + y³). So
P (X < 0.5) = FX (0.5) = (1/2)(0.5³ + 0.5) = 5/16: exactly the same as before.
3.10 3D visualization
We visualized P (a < X < b) as the area under the pdf f (x) over the interval [a, b]. Since
the range of values of (X, Y ) is already a two-dimensional region in the plane, the graph of
f (x, y) is a surface over that region. We can then visualize probability as volume under the
surface.
Think: Summoning your inner artist, sketch the graph of the joint pdf f (x, y) = 4xy and
visualize the probability P (A) as a volume for Example 5.
4 Independence
Recall that events A and B are independent if P (A ∩ B) = P (A)P (B).
Random variables X and Y define events like ‘X ≤ 2’ and ‘Y > 5’. So, X and Y are
independent if any event defined by X is independent of any event defined by Y . The
formal definition that guarantees this is the following.
Definition: Jointly-distributed random variables X and Y are independent if their joint
cdf is the product of the marginal cdfs:
F (x, y) = FX (x)FY (y).
For discrete variables this is equivalent to the joint pmf being the product of the marginal
pmfs:
p(xi , yj ) = pX (xi )pY (yj ).
For continuous variables this is equivalent to the joint pdf being the product of the marginal
pdfs:
f (x, y) = fX (x)fY (y).
Once you have the joint distribution, checking for independence is usually straightforward
although it can be tedious.
Example 12. For discrete variables independence means the probability in a cell must be
the product of the marginal probabilities of its row and column. In the first table below
this is true: every marginal probability is 1/6 and every cell contains 1/36, i.e. the product
of the marginals. Therefore X and Y are independent.
In the second table below most of the cell probabilities are not the product of the marginal
probabilities. For example, none of the marginal probabilities are 0, so none of the cells with
0 probability can be the product of the marginals.
X\Y 1 2 3 4 5 6 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 1/6
2 1/36 1/36 1/36 1/36 1/36 1/36 1/6
3 1/36 1/36 1/36 1/36 1/36 1/36 1/6
4 1/36 1/36 1/36 1/36 1/36 1/36 1/6
5 1/36 1/36 1/36 1/36 1/36 1/36 1/6
6 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/6 1/6 1/6 1/6 1/6 1/6 1
X\T 2 3 4 5 6 7 8 9 10 11 12 p(xi )
1 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 0 1/6
2 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 0 1/6
3 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 0 1/6
4 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 0 1/6
5 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 0 1/6
6 0 0 0 0 0 1/36 1/36 1/36 1/36 1/36 1/36 1/6
p(yj ) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36 1
Example 13. For continuous variables independence means you can factor the joint pdf
or cdf as the product of a function of x and a function of y.
(i) Suppose X has range [0, 1/2], Y has range [0, 1] and f (x, y) = 96x²y³. Then X and Y
are independent. The marginal densities are fX (x) = 24x² and fY (y) = 4y³.
(ii) If f (x, y) = 1.5(x² + y²) over the unit square then X and Y are not independent because
there is no way to factor f (x, y) into a product fX (x)fY (y).
(iii) If F (x, y) = (1/2)(x³y + xy³) over the unit square then X and Y are not independent
because the cdf does not factor into a product FX (x)FY (y).
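For joint tables, the check in Example 12 can be mechanized: compare each cell with the product of its row and column marginals. A Python sketch (the helper `is_independent` is our own name):

```python
from fractions import Fraction

def is_independent(table):
    """Check whether a discrete joint pmf table (nested lists) factors as
    the product of its row and column marginals."""
    row_m = [sum(row) for row in table]
    col_m = [sum(col) for col in zip(*table)]
    return all(table[i][j] == row_m[i] * col_m[j]
               for i in range(len(table)) for j in range(len(table[0])))

F = Fraction(1, 36)
dice_xy = [[F] * 6 for _ in range(6)]                   # Example 1: X, Y
dice_xt = [[F if i + 1 <= t <= i + 6 else Fraction(0)   # Example 2: X, T
            for t in range(2, 13)] for i in range(1, 7)]

print(is_independent(dice_xy))  # True
print(is_independent(dice_xt))  # False
```

Exact fractions avoid the floating-point comparisons that would otherwise make the equality test fragile.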
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who
appear in the title, and all credit for this work belongs to them. The only modifications made to fit it to "Statistics"
(DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but
also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT
OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following Terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public,
but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without
their help, this course would be far from the quality of its current state.
Introduction to Statistics
Class 8a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2 Introduction to statistics
Statistics deals with data. Generally speaking, the goal of statistics is to make inferences
based on data. We can divide this process into three phases: collecting data, describing
data and analyzing data. This fits into the paradigm of the scientific method. We make
hypotheses about what’s true, collect data in experiments, describe the results, and then
infer from the results the strength of the evidence concerning our hypotheses.
The design of an experiment is crucial to making sure the collected data is useful. The
adage ‘garbage in, garbage out’ applies here. A poorly designed experiment will produce
poor quality data, from which it may be impossible to draw useful, valid inferences. To
quote R.A. Fisher, one of the founders of modern statistics: "To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of."
Raw data often takes the form of a massive list, array, or database of labels and numbers.
To make sense of the data, we can calculate summary statistics like the mean, median, and
interquartile range. We can also visualize the data using graphical devices like histograms,
scatterplots, and the empirical cdf. These methods are useful for both communicating and
exploring the data to gain insight into its structure, such as whether it might follow a
familiar probability distribution.
Ultimately we want to draw inferences about the world. Often this takes the form of
specifying a statistical model for the random process by which the data arises. For example,
suppose the data takes the form of a series of measurements whose error we believe follows
a normal distribution. (Note this is always an approximation since we know the error must
Statistics Class 8a, Introduction to Statistics 2
have some bound while a normal distribution has range (−∞, ∞).) We might then use the
data to provide evidence for or against this hypothesis. Our focus in 18.05 will be on how
to use data to draw inferences about model parameters. For example, assuming gestational
length follows a N (µ, σ) distribution, we’ll use the data of the gestational lengths of, say,
500 pregnancies to draw inferences about the values of the parameters µ and σ. Similarly,
we may model the result of a two-candidate election by a Bernoulli(p) distribution, and use
poll data to draw inferences about the value of p.
We can rarely make definitive statements about such parameters because the data itself
comes from a random process (such as choosing who to poll). Rather, our statistical evidence
will always involve probability statements. Unfortunately, the media and public at large
are wont to misunderstand the probabilistic meaning of statistical statements. In fact,
researchers themselves often commit the same errors. In this course, we will emphasize the
meaning of statistical statements alongside the methods which produce them.
Example 1. To study the effectiveness of new treatment for cancer, patients are recruited
and then divided into an experimental group and a control group. The experimental group
is given the new treatment and the control group receives the current standard of care.
Data collected from the patients might include demographic information, medical history,
initial state of cancer, progression of the cancer over time, treatment cost, and the effect of
the treatment on tumor size, remission rates, longevity, and quality of life. The data will
be used to make inferences about the effectiveness of the new treatment compared to the
current standard of care.
Notice that this study will go through all three phases described above. The experimental
design must specify the size of the study, who will be eligible to join, how the experimental
and control groups will be chosen, how the treatments will be administered, whether or
not the subjects or doctors know who is getting which treatment, and precisely what data
will be collected, among other things. Once the data is collected it must be described and
analyzed to determine whether it supports the hypothesis that the new treatment is more
(or less) effective than the current one(s), and by how much. These statistical conclusions
will be framed as precise statements involving probabilities.
As noted above, misinterpreting the exact meaning of statistical statements is a common
source of error which has led to tragedy on more than one occasion.
Example 2. In 1999 in Great Britain, Sally Clark was convicted of murdering her two
children after each child died weeks after birth (the first in 1996, the second in 1998). Her
conviction was largely based on a faulty use of statistics to rule out sudden infant death
syndrome. Though her conviction was overturned in 2003, she developed serious psychiatric
problems during and after her imprisonment and died of alcohol poisoning in 2007. See
https://en.wikipedia.org/wiki/Sally_Clark
This TED talk discusses the Sally Clark case and other instances of poor statistical intuition:
https://www.youtube.com/watch?v=kLmzxmRcUTo
Definition. A statistic is anything that can be computed from the collected data.
Example 3. Consider the data of 1000 rolls of a die. All of the following are statistics:
the average of the 1000 rolls; the number of times a 6 was rolled; the sum of the squares
of the rolls minus the number of even rolls. It’s hard to imagine how we would use the
last example, but it is a statistic. On the other hand, the probability of rolling a 6 is not a
statistic, whether or not the die is truly fair. Rather this probability is a property of the die
(and the way we roll it) which we can estimate using the data. Such an estimate is given
by the statistic ‘proportion of the rolls that were 6’.
Example 4. Suppose we treat a group of cancer patients with a new procedure and collect
data on how long they survive post-treatment. From the data we can compute the average
survival time of patients in the group. We might employ this statistic as an estimate of the
average survival time for future cancer patients following the new procedure. The “expected
survival time” for the new procedure (if that even has a meaning) is not a statistic.
Example 5. Suppose we ask 1000 residents whether or not they support the proposal to
legalize marijuana in Massachusetts. The proportion of the 1000 who support the proposal
is a statistic. The proportion of all Massachusetts residents who support the proposal is
not a statistic since we have not queried every single one (note the word “collected” in the
definition). Rather, we hope to draw a statistical conclusion about the state-wide proportion
based on the data of our random sample.
The following are two general types of statistics we will use in 18.05.
1. Point statistics: a single value computed from data, such as the sample average x̄n or
the sample standard deviation sn .
2. Interval statistics: an interval [a, b] computed from the data. This is really just a pair of
point statistics, and will often be presented in the form x̄ ± s.
We cannot stress strongly enough how important Bayes’ theorem is to our view of inferential
statistics. Recall that Bayes’ theorem allows us to ‘invert’ conditional probabilities. That
is, if H and D are events, then Bayes’ theorem says
P (H|D) = P (D|H) P (H) / P (D).
In scientific experiments we start with a hypothesis and collect data to test the hypothesis.
We will often let H represent the event ‘our hypothesis is true’ and let D be the collected
data. In these words Bayes’ theorem says
P (hypothesis is true | data) = P (data | hypothesis is true) · P (hypothesis is true) / P (data)
The left-hand term is the probability our hypothesis is true given the data we collected.
This is precisely what we’d like to know. When all the probabilities on the right are known
exactly, we can compute the probability on the left exactly. This will be our focus next
week. Unfortunately, in practice we rarely know the exact values of all the terms on the
right. Statisticians have developed a number of ways to cope with this lack of knowledge
and still make useful inferences. We will be exploring these methods for the rest of the
course.
Example 6. Screening for a disease redux
Suppose a screening test for a disease has a 1% false positive rate and a 1% false negative
rate. Suppose also that the rate of the disease in the population is 0.002. Finally suppose
a randomly selected person tests positive. In the language of hypothesis and data we have:
Hypothesis: H = ‘the person has the disease’
Data: D = ‘the test was positive.’
What we want to know: P (H|D) = P (the person has the disease | a positive test)
In this example all the probabilities on the right are known so we can use Bayes’ theorem
to compute what we want to know.
Before the test we would have said the probability the person had the disease was 0.002.
After the test we see the probability is 0.166. That is, the positive test provides some
evidence that the person has the disease.
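The computation behind the 0.166 figure is a direct application of Bayes' theorem, with the law of total probability in the denominator. A Python sketch (not part of the original notes):

```python
# Bayes' theorem for the screening test: 1% false positive and 1% false
# negative rates, disease prevalence 0.002.
p_H = 0.002                 # P(disease)
p_pos_given_H = 0.99        # sensitivity (1% false negative rate)
p_pos_given_not_H = 0.01    # 1% false positive rate

# P(D) by the law of total probability, then P(H|D) by Bayes' theorem.
p_pos = p_pos_given_H * p_H + p_pos_given_not_H * (1 - p_H)
p_H_given_pos = p_pos_given_H * p_H / p_pos
print(round(p_H_given_pos, 3))  # 0.166
```

Most positives come from the large healthy population, which is why the posterior probability is far below 99%.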
Maximum Likelihood Estimates
Class 8b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to define the likelihood function for a parametric model given data.
2. Be able to compute the maximum likelihood estimate of unknown parameter(s).
2 Introduction
There are many methods for estimating unknown parameters from data. We will first
consider the maximum likelihood estimate (MLE), which answers the question:
For which parameter value does the observed data have the biggest probability?
The MLE is an example of a point estimate because it gives a single value for the unknown
parameter (later our estimates will involve intervals and probabilities). Two advantages of
Statistics Class 8b, Maximum Likelihood Estimates 2
the MLE are that it is often easy to compute and that it agrees with our intuition in simple
examples. We will explain the MLE through a series of examples.
Example 1. A coin is flipped 100 times. Given that there were 55 heads, find the maximum
likelihood estimate for the probability p of heads on a single toss.
Before actually solving the problem, let’s establish some notation and terms.
We can think of counting the number of heads in 100 tosses as an experiment. For a given
value of p, the probability of getting 55 heads in this experiment is the binomial probability
P (55 heads) = C(100, 55) p^55 (1 − p)^45,
where C(100, 55) is the binomial coefficient ‘100 choose 55’.
The probability of getting 55 heads depends on the value of p, so let’s include p in the
notation by using conditional probability:
P (55 heads | p) = C(100, 55) p^55 (1 − p)^45.
Experiment: Flip the coin 100 times and count the number of heads.
Data: The data is the result of the experiment. In this case it is ‘55 heads’.
Likelihood, or likelihood function: this is P (data | p). Note it is a function of both the
data and the parameter p. In this case the likelihood is
P (55 heads | p) = C(100, 55) p^55 (1 − p)^45.
Definition: Given data the maximum likelihood estimate (MLE) for the parameter p is
the value of p that maximizes the likelihood P (data | p). That is, the MLE is the value of
p for which the data is most likely.
Solution: For the problem at hand, we saw above that the likelihood is
P (55 heads | p) = C(100, 55) p^55 (1 − p)^45.
We’ll use the notation p̂ for the MLE. We use calculus to find it by taking the derivative of
the likelihood function and setting it to 0.
d/dp P (data | p) = C(100, 55) (55 p^54 (1 − p)^45 − 45 p^55 (1 − p)^44) = 0.
Dividing out C(100, 55) p^54 (1 − p)^44 leaves 55(1 − p) − 45p = 0, so p̂ = 0.55.
Note: 1. The MLE for p turned out to be exactly the fraction of heads we saw in our data.
2. The MLE is computed from the data. That is, it is a statistic.
3. Officially we need to check that this critical point is actually the maximum. We could
use the second derivative test. Another way is to notice that we are interested only in
0 ≤ p ≤ 1; that the probability is bigger than zero for 0 < p < 1; and that the probability
is equal to zero for p = 0 and for p = 1. From these facts it follows that the critical point
must be the unique maximum.
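We can confirm the calculus result with a brute-force grid search over p. A Python sketch (the grid resolution 1/1000 is an arbitrary choice, not from the notes):

```python
from math import comb, log

# Log likelihood of p for 55 heads in 100 tosses.
def log_lik(p):
    return log(comb(100, 55)) + 55 * log(p) + 45 * log(1 - p)

# Grid search over p in (0, 1); endpoints excluded to avoid log(0).
grid = [k / 1000 for k in range(1, 1000)]
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.55
```

Since the log likelihood is concave with its maximum at 0.55, which lies exactly on the grid, the search recovers the calculus answer.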
It is often easier to work with the natural log of the likelihood function. For short this is
simply called the log likelihood. Since ln(x) is an increasing function, the maxima of the
likelihood and log likelihood coincide.
Maximizing likelihood is the same as maximizing log likelihood. We check that calculus
gives us the same answer as before:
d/dp (log likelihood) = d/dp [ln C(100, 55) + 55 ln(p) + 45 ln(1 − p)]
= 55/p − 45/(1 − p) = 0
⇒ 55(1 − p) = 45p
⇒ p̂ = 0.55
For continuous distributions, we use the probability density function to define the likelihood.
We show this in a few examples. In the next section we explain how this is analogous to
what we did in the discrete case.
Example 3. Light bulbs
Suppose that the lifetime of Badger brand light bulbs is modeled by an exponential distri-
bution with (unknown) parameter λ. We test 5 bulbs and find they have lifetimes of 2, 3,
1, 3, and 4 years, respectively. What is the MLE for λ?
Solution: We need to be careful with our notation. With five different values it is best to
use subscripts. Let Xi be the lifetime of the ith bulb and let xi be the value Xi takes. Then
each Xi has pdf fXi (xi ) = λe−λxi . We assume the lifetimes of the bulbs are independent,
so the joint pdf is the product of the individual densities:
f (x1 , x2 , x3 , x4 , x5 | λ) = (λe^(−λx1))(λe^(−λx2))(λe^(−λx3))(λe^(−λx4))(λe^(−λx5)) = λ^5 e^(−λ(x1+x2+x3+x4+x5)).
Note that we write this as a conditional density, since it depends on λ. Viewing the data
as fixed and λ as variable, this density is the likelihood function. Our data had values
x1 = 2, x2 = 3, x3 = 1, x4 = 3, x5 = 4.
So the likelihood and log likelihood functions with this data are
f (2, 3, 1, 3, 4 | λ) = λ^5 e^(−13λ),    ln(f (2, 3, 1, 3, 4 | λ)) = 5 ln(λ) − 13λ.
Finally we use calculus to find the MLE:
d/dλ (log likelihood) = 5/λ − 13 = 0  ⇒  λ̂ = 5/13.
Note: 1. In this example we used an uppercase letter for a random variable and the
corresponding lowercase letter for the value it takes. This will be our usual practice.
2. The MLE for λ turned out to be the reciprocal of the sample mean x̄, so X ∼ exp(λ̂)
satisfies E[X] = x̄.
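The MLE λ̂ = 5/13 is just n divided by the sum of the data, i.e. the reciprocal of the sample mean. A quick Python check (not part of the original notes):

```python
# MLE for the exponential rate from the five bulb lifetimes:
# the log likelihood is n*ln(lam) - lam*sum(x), maximized at n/sum(x).
data = [2, 3, 1, 3, 4]
lam_hat = len(data) / sum(data)
print(lam_hat)  # 5/13 ≈ 0.3846

# Equivalently, lam_hat is the reciprocal of the sample mean.
mean = sum(data) / len(data)
print(abs(lam_hat - 1 / mean) < 1e-12)  # True
```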
The following example illustrates how we can use the method of maximum likelihood to
estimate multiple parameters at once.
Example 4. Normal distributions
Suppose the data x1 , x2 , . . . , xn is drawn from a N(µ, σ 2 ) distribution, where µ and σ are
unknown. Find the maximum likelihood estimate for the pair (µ, σ 2 ).
Solution: Let’s be precise and phrase this in terms of random variables and densities. Let
uppercase X1 , . . . , Xn be i.i.d. N(µ, σ 2 ) random variables, and let lowercase xi be the value
Xi takes. The density for each Xi is
fXi (xi ) = (1/(√(2π) σ)) e^(−(xi − µ)²/(2σ²)).
Since the Xi are independent their joint pdf is the product of the individual pdf’s:
f (x1 , . . . , xn | µ, σ) = (1/(√(2π) σ))^n e^(−Σ_{i=1}^n (xi − µ)² / (2σ²)).
For the fixed data x1 , . . . , xn , the likelihood and log likelihood are
f (x1 , . . . , xn | µ, σ) = (1/(√(2π) σ))^n e^(−Σ_{i=1}^n (xi − µ)² / (2σ²)),
ln(f (x1 , . . . , xn | µ, σ)) = −n ln(√(2π)) − n ln(σ) − Σ_{i=1}^n (xi − µ)² / (2σ²).
Since ln(f (x1 , . . . , xn | µ, σ)) is a function of the two variables µ, σ we use partial derivatives
to find the MLE. The easy value to find is µ̂:
∂/∂µ ln(f (x1 , . . . , xn | µ, σ)) = Σ_{i=1}^n (xi − µ)/σ² = 0  ⇒  Σ_{i=1}^n xi = nµ  ⇒  µ̂ = (Σ_{i=1}^n xi )/n = x̄.
We already know µ̂ = x̄, so we use that as the value for µ in the formula for σ̂. We get the
maximum likelihood estimates
µ̂ = x̄ = the mean of the data
σ̂² = (1/n) Σ_{i=1}^n (xi − µ̂)² = (1/n) Σ_{i=1}^n (xi − x̄)² = the unadjusted variance of the data.
(Later we will learn that the sample variance is Σ_{i=1}^n (xi − µ̂)² / (n − 1).)
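The two formulas can be verified on any data set. The Python sketch below uses a small hypothetical sample (the numbers are made up for illustration):

```python
# MLE for (mu, sigma^2) of a normal sample: the sample mean and the
# unadjusted (divide-by-n) variance, as derived above.
data = [2.1, 1.9, 2.4, 2.0, 1.6]  # hypothetical measurements
n = len(data)
mu_hat = sum(data) / n
var_hat = sum((x - mu_hat) ** 2 for x in data) / n
print(mu_hat, var_hat)  # 2.0 and 0.068
```

Note the divisor is n, not the n − 1 used by the sample variance mentioned above.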
Example 5. Uniform distributions
Suppose our data x1 , . . . xn are independently drawn from a uniform distribution U (a, b).
Find the MLE for a and b.
Solution: This example is different from the previous ones in that we won’t use calculus
to find the MLE. The density for U (a, b) is 1/(b − a) on [a, b]. Therefore our likelihood
function is
f (x1 , . . . , xn | a, b) = (1/(b − a))^n if all xi are in the interval [a, b], and 0 otherwise.
This is maximized by making b − a as small as possible. The only restriction is that the
interval [a, b] must include all the data. Thus the MLE for the pair (a, b) is
â = min(x1 , . . . , xn ) b̂ = max(x1 , . . . , xn ).
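In code, the uniform MLE needs no optimization at all. A Python sketch with a hypothetical sample:

```python
# MLE for U(a, b): the smallest interval containing all the data,
# i.e. a_hat = min of the sample, b_hat = max of the sample.
data = [1.3, 0.7, 2.9, 1.8, 2.2]  # hypothetical sample
a_hat, b_hat = min(data), max(data)
print(a_hat, b_hat)  # 0.7 2.9
```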
Example 6. (Capture-recapture) To estimate the number n of animals in a wild population,
10 animals are captured, tagged, and released. Later, 20 animals are captured and 4 of
them are found to be tagged. Find the maximum likelihood estimate for n.
Solution: Our unknown parameter n is the number of animals in the wild. Our data is
that 4 out of 20 recaptured animals were tagged (and that there are 10 tagged animals).
The likelihood function is
P (data | n animals) = C(n − 10, 16) · C(10, 4) / C(n, 20).
(The numerator is the number of ways to choose 16 animals from among the n − 10 untagged
ones times the number of ways to choose 4 out of the 10 tagged animals. The denominator
is the number of ways to choose 20 animals from the entire population of n.) We can use
R to compute that the likelihood function is maximized when n = 50. This should make
some sense. It says our best estimate is that the fraction of all animals that are tagged is
10/50 which equals the fraction of recaptured animals which are tagged.
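The notes use R for this maximization; the same computation in Python with exact rational arithmetic is sketched below. (Exact arithmetic reveals that the discrete likelihood actually ties at n = 49 and n = 50; the estimate n̂ = 50 = 10 · 20/4 matches the intuition in the text.)

```python
from math import comb
from fractions import Fraction

def lik(n):
    """Exact likelihood of the data (4 tagged among 20 recaptured)
    for a population of n animals, 10 of which are tagged."""
    return Fraction(comb(n - 10, 16) * comb(10, 4), comb(n, 20))

# n must be at least 26 so that there are 16 untagged animals to choose.
values = {n: lik(n) for n in range(26, 301)}
peak = max(values.values())
maximizers = [n for n, v in values.items() if v == peak]
print(maximizers)  # [49, 50]: the likelihood ties at these two values
```

Using `Fraction` makes the comparison of likelihoods exact, which matters here because the two top values are equal.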
Example 7. Suppose a gene has two alleles, A and a, where allele A occurs with probability
θ. Then the three genotypes have the following probabilities:
genotype:     AA     Aa             aa
probability:  θ²     2θ(1 − θ)      (1 − θ)²
Suppose we test a random sample of people and find that k1 are AA, k2 are Aa, and k3 are
aa. Find the MLE of θ.
Solution: The likelihood function is given by
P (k1 , k2 , k3 | θ) = C(k1 + k2 + k3 , k1 ) C(k2 + k3 , k2 ) C(k3 , k3 ) θ^(2k1) (2θ(1 − θ))^(k2) (1 − θ)^(2k3).
So the log likelihood is given by
constant + 2k1 ln(θ) + k2 ln(2θ(1 − θ)) + 2k3 ln(1 − θ).
Setting the derivative with respect to θ equal to 0 and solving gives the MLE
θ̂ = (2k1 + k2 ) / (2k1 + 2k2 + 2k3 ).
The idea for the maximum likelihood estimate is to find the value of the parameter(s) for
which the data has the highest probability. In this section we’ll see that this is really what
we are doing with the densities. We will do this by considering a smaller version of the
light bulb example.
Example 8. Suppose we have two light bulbs whose lifetimes follow an exponential(λ)
distribution. Suppose also that we independently measure their lifetimes and get data
x1 = 2 years and x2 = 3 years. Find the value of λ that maximizes the probability of this
data.
Solution: The main paradox to deal with is that for a continuous distribution the proba-
bility of a single value, say x1 = 2, is zero. We resolve this paradox by remembering that a
single measurement really means a range of values, e.g. in this example we might check the
light bulb once a day. So the data x1 = 2 years really means x1 is somewhere in a range of
1 day around 2 years.
If the range is small we call it dx1 . The probability that X1 is in the range is approximated
by fX1 (x1 |λ) dx1 . This is illustrated in the figure below. The data value x2 is treated in
exactly the same way.
[Figure: graphs of the densities fX1 (x1 |λ) and fX2 (x2 |λ); the probability that Xi lies in a small range dxi around xi is approximately fXi (xi |λ) dxi.]
The usual relationship between density and probability for small ranges.
Since the data is collected independently the joint probability is the product of the individual
probabilities. Stated carefully
P (X1 in range, X2 in range|λ) ≈ fX1 (x1 |λ) dx1 · fX2 (x2 |λ) dx2
Finally, using the values x1 = 2 and x2 = 3 and the formula for an exponential pdf we have
P (X1 in range, X2 in range|λ) ≈ λe−2λ dx1 · λe−3λ dx2 = λ2 e−5λ dx1 dx2 .
Now that we have a genuine probability we can look for the value of λ that maximizes it.
Looking at the formula above we see that the factor dx1 dx2 will play no role in finding the
maximum. So for the MLE we drop it and simply call the density the likelihood:
likelihood = f(x1, x2 | λ) = λ² e^(−5λ).
The value of λ that maximizes this is found just like in the example above. It is λ̂ = 2/5.
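As a quick numerical sanity check (our addition, not part of the original notes), we can grid-search the likelihood λ² e^(−5λ) over candidate rates and confirm the maximizer is 2/5:

```python
import math

def likelihood(lam, data):
    # Joint density of independent exponential(lam) lifetimes,
    # viewed as a function of lam (the dx factors are dropped).
    return math.prod(lam * math.exp(-lam * x) for x in data)

data = [2, 3]  # observed lifetimes in years
grid = [i / 10000 for i in range(1, 20000)]  # candidate rates in (0, 2)
lam_hat = max(grid, key=lambda lam: likelihood(lam, data))
print(lam_hat)  # ≈ 0.4 = 2/5, matching the calculus answer
```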
For the interested reader, we note several nice features of the MLE. These are quite technical
and will not be on any exams.
The MLE behaves well under transformations. That is, if p̂ is the MLE for p and g is a
one-to-one function, then g(p̂) is the MLE for g(p). For example, if σ̂ is the MLE for the
standard deviation σ then (σ̂)2 is the MLE for the variance σ 2 .
Furthermore, under some technical smoothness assumptions, the MLE is asymptotically
unbiased and has asymptotically minimal variance. To explain these notions, note that
the MLE is itself a random variable since the data is random and the MLE is computed
from the data. Let x1 , x2 , . . . be an infinite sequence of samples from a distribution with
parameter p. Let p̂n be the MLE for p based on the data x1 , . . . , xn .
Asymptotically unbiased means that as the amount of data grows, the mean of the MLE
converges to p. In symbols: E[p̂n ] → p as n → ∞. Of course, we would like the MLE to be
close to p with high probability, not just on average, so the smaller the variance of the MLE
the better. Asymptotically minimal variance means that as the amount of data grows, the
MLE has the minimal variance among all unbiased estimators of p. In symbols: for any
unbiased estimator p̃n and ϵ > 0 we have that Var(p̃n ) + ϵ > Var(p̂n ) as n → ∞.
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who
appear in the title, and all praise for this work must go to them. The only modifications made to fit it to "Statistics"
(DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but
also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT
OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate
The OpenCourseWare material is distributed under the following Terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public,
but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without
their help, this course would not have nearly the quality of its current state.
Bayesian Updating with Discrete Priors
Class 9, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to apply Bayes’ theorem to compute probabilities.
2. Be able to define and to identify the roles of prior probability, likelihood (Bayes term),
posterior probability, data and hypothesis in the application of Bayes’ Theorem.
3. Be able to use a Bayesian update table to compute posterior probabilities.
Recall that Bayes’ theorem allows us to ‘invert’ conditional probabilities. If H and D are
events, then:
P(H | D) = P(D | H) P(H) / P(D)
Our view is that Bayes’ theorem forms the foundation for inferential statistics. We will
begin to justify this view today.
When we first learned Bayes’ theorem we worked an example about screening tests showing
that P (D|H) can be very different from P (H|D). In the appendix we work a similar example.
If you are not comfortable with Bayes’ theorem you should read the example in the appendix
now.
We now use a coin tossing problem to introduce terminology and a tabular format for Bayes’
theorem. This will provide a simple, uncluttered example that shows our main points.
Example 1. There are three types of coins which have different probabilities of landing
heads when tossed. Type A coins land heads with probability 0.5, type B coins with
probability 0.6, and type C coins with probability 0.9. A drawer contains 5 coins: 2 of
type A, 2 of type B, and 1 of type C. You pick a coin at random from the drawer, toss it,
and get heads. What is the probability that it is type A? Type B? Type C?
Solution: Let A, B, and C be the event that the chosen coin was type A, type B, and
type C. Let D be the event that the toss is heads. The problem asks us to find
P (A|D), P (B|D), P (C|D).
Before applying Bayes’ theorem, let’s introduce some terminology.
Experiment: pick a coin from the drawer at random, flip it, and record the result.
Data: the result of our experiment. In this case the event D = ‘heads’. We think of
D as data that provides evidence for or against each hypothesis.
Hypotheses: we are testing three hypotheses: the coin is type A, B or C.
Prior probability: the probability of each hypothesis prior to tossing the coin (collect-
ing data). Since the drawer has 2 coins of type A, 2 of type B and 1 of type C we
have
P (A) = 0.4, P (B) = 0.4, P (C) = 0.2.
Likelihood: (This is the same likelihood we used for the MLE.) The likelihood function
is P (D|H), i.e., the probability of the data assuming that the hypothesis is true. Most
often we will consider the data as fixed and let the hypothesis vary. For example,
P (D|A) = probability of heads if the coin is type A. In our case the likelihoods are
P (D|A) = 0.5, P (D|B) = 0.6, P (D|C) = 0.9.
The name likelihood is so well established in the literature that we have to teach
it to you. However in colloquial language likelihood and probability are synonyms.
This leads to the likelihood function often being confused with the probability of a
hypothesis. Because of this we’d prefer to use the name Bayes’ term. However since
we are stuck with ‘likelihood’ we will try to use it very carefully and in a way that
minimizes any confusion.
Posterior probability: the probability of each hypothesis posterior to (i.e., after seeing)
the data from tossing the coin.
P (A|D), P (B|D), P (C|D).
These posterior probabilities are what the problem asks us to find.
We now use Bayes’ theorem to compute each of the posterior probabilities. We are going
to write this out in complete detail so we can pick out each of the parts. (Remember that
the data D is that the toss was heads.)
First we organize the probabilities into a tree:
[Probability tree: the root branches to coin types A, B, C; from each type, branches H and T with probabilities 0.5/0.5 for A, 0.6/0.4 for B, and 0.9/0.1 for C.]
Bayes' theorem says, e.g., P(A|D) = P(D|A)P(A) / P(D). The denominator P(D) is computed
using the law of total probability:
P (D) = P (D|A)P (A) + P (D|B)P (B) + P (D|C)P (C) = 0.5 · 0.4 + 0.6 · 0.4 + 0.9 · 0.2 = 0.62.
We can now organize the whole computation in a Bayesian update table:
Bayes
hypothesis prior likelihood numerator posterior
H P(H) P(D|H) P(D|H)P(H) P(H|D)
A 0.4 0.5 0.2 0.3226
B 0.4 0.6 0.24 0.3871
C 0.2 0.9 0.18 0.2903
total 1 NO SUM P(D) = 0.62 1
There are several things to notice about this table:
1. There are two types of probabilities: Type one is the standard probability of data, e.g.
the probability of heads is p = 0.9. Type two is the probability of the hypotheses, e.g.
the probability the chosen coin is type A, B or C. This second type has prior (before
the data) and posterior (after the data) values.
2. The posterior (after the data) probabilities for each hypothesis are in the last column.
We see that coin B is now the most probable, though its probability has decreased from
a prior probability of 0.4 to a posterior probability of 0.39. Meanwhile, the probability
of type C has increased from 0.2 to 0.29.
3. The Bayes numerator column determines the posterior probability column. To compute
the latter, we simply divided each numerator by P (D), i.e. rescaled the Bayes numerators
so that they sum to 1.
4. If all we care about is finding the most likely hypothesis, the Bayes numerator works as
well as the normalized posterior.
5. The likelihood column does not sum to 1. The likelihood function is not a probability
function.
6. The posterior probability represents the outcome of a ‘tug-of-war’ between the likelihood
and the prior. When calculating the posterior, a large prior may be deflated by a small
likelihood, and a small prior may be inflated by a large likelihood.
7. The maximum likelihood estimate (MLE) for Example 1 is hypothesis C, with a likeli-
hood P (D|C) = 0.9. The MLE is useful, but you can see in this example that it is not
the entire story, since type B has the greatest posterior probability.
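The whole update fits in a few lines of code. The following Python sketch (our addition, using the priors and likelihoods from Example 1) reproduces the Bayes numerators and posteriors:

```python
priors = {"A": 0.4, "B": 0.4, "C": 0.2}       # P(hypothesis)
likelihoods = {"A": 0.5, "B": 0.6, "C": 0.9}  # P(D = heads | hypothesis)

# Bayes numerators: likelihood times prior for each hypothesis.
numerators = {h: likelihoods[h] * priors[h] for h in priors}
# Law of total probability gives P(D).
p_data = sum(numerators.values())
# Normalizing the numerators gives the posteriors.
posteriors = {h: numerators[h] / p_data for h in priors}

print(p_data)                               # ≈ 0.62
print(max(posteriors, key=posteriors.get))  # 'B' is most probable a posteriori
```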
Recall the two forms of Bayes' theorem:
P(H | D) = P(D | H) P(H) / P(D),
that is,
P(hypothesis | data) = P(data | hypothesis) P(hypothesis) / P(data).
With the data fixed, the denominator P (D) just serves to normalize the total posterior prob-
ability to 1. So we can also express Bayes’ theorem as a statement about the proportionality
of two functions of H (i.e., of the last two columns of the table).
This leads to the most elegant form of Bayes' theorem in the context of Bayesian updating:
posterior ∝ likelihood × prior.
Earlier in the course we saw that it is convenient to use random variables and probability
mass functions. To do this we had to assign values to events (head is 1 and tails is 0). We
will do the same thing in the context of Bayesian updating.
Our standard notations will be:
p(θ) is the prior probability mass function of the hypothesis,
p(x|θ) is the likelihood of the data given the hypothesis,
p(θ|x) is the posterior probability mass function of the hypothesis given the data.
In Example 1 we can represent the three hypotheses A, B, and C by θ = 0.5, 0.6, 0.9. For
the data we’ll let x = 1 mean heads and x = 0 mean tails. Then the prior and posterior
probabilities in the table define the prior and posterior probability mass functions.
[Figure: stick plots of the prior pmf p(θ) (values 0.4, 0.4, 0.2 at θ = 0.5, 0.6, 0.9) and the posterior pmf p(θ|x = 1) for Example 1.]
If the data was different then the likelihood column in the Bayesian update table would be
different. We can plan for different data by building the entire likelihood table ahead of
time. In the coin example there are two possibilities for the data: the toss is heads or the
toss is tails. So the full likelihood table has two likelihood columns:
hypothesis likelihood p(x|θ)
θ p(x = 0|θ) p(x = 1|θ)
0.5 0.5 0.5
0.6 0.4 0.6
0.9 0.1 0.9
Important convention. Notice that in the above table we used the value of θ as the
hypothesis. Of course, hypothesizing ‘θ = 0.5’ is exactly the same as hypothesizing ‘the coin
is type A’. It is also useful in settings where we haven’t named all the possible hypotheses.
Example 2. Using the notation p(θ), etc., redo Example 1 assuming the flip was tails.
Solution: Since the data has changed, the likelihood column in the Bayesian update table
is now for x = 0. That is, we must take the p(x = 0|θ) column from the likelihood table.
Bayes
hypothesis prior likelihood numerator posterior
θ p(θ) p(x = 0 | θ) p(x = 0 | θ)p(θ) p(θ | x = 0)
0.5 0.4 0.5 0.2 0.5263
0.6 0.4 0.4 0.16 0.4211
0.9 0.2 0.1 0.02 0.0526
total 1 NO SUM 0.38 1
Now the probability that θ = 0.5, i.e. the coin is type A, has increased from 0.4 to 0.5263,
while the probability that θ = 0.9, i.e. the coin is type C, has decreased from 0.2 to only
0.0526. Here are the corresponding plots:
[Figure: stick plots of the prior pmf p(θ) and the posterior pmf p(θ|x = 0) at θ = 0.5, 0.6, 0.9.]
Suppose that in Example 1 you didn't know how many coins of each type were in the
drawer. You picked one at random and got heads. How would you go about deciding which
hypothesis (coin type), if any, was most supported by the data?
In life we are continually updating our beliefs with each new experience of the world. In
Bayesian inference, after updating the prior to the posterior, we can take more data and
update again! For the second update, the posterior from the first data becomes the prior
for the second data.
Example 3. Suppose you have picked a coin as in Example 1. You flip it once and get
heads. Then you flip the same coin and get heads again. What is the probability that the
coin was type A? Type B? Type C?
Solution: As we update several times the table gets big, so we use a smaller font to fit it
in:
Bayes Bayes
hypothesis prior likelihood 1 numerator 1 likelihood 2 numerator 2 posterior 2
θ p(θ) p(x1 = 1|θ) p(x1 = 1|θ)p(θ) p(x2 = 1|θ) p(x2 = 1|θ)p(x1 = 1|θ)p(θ) p(θ|x1 = 1, x2 = 1)
0.5 0.4 0.5 0.2 0.5 0.1 0.2463
0.6 0.4 0.6 0.24 0.6 0.144 0.3547
0.9 0.2 0.9 0.18 0.9 0.162 0.3990
total 1 NO SUM NO SUM 0.406 1
Note that the second Bayes numerator is computed by multiplying the first Bayes numerator
and the second likelihood; since we are only interested in the final posterior, there is no
need to normalize until the last step. As shown in the last column and plot, after two heads
the type C hypothesis has finally taken the lead!
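Sequential updating is just repeated multiply-and-normalize. Here is a small Python sketch of Example 3 (our addition), with hypotheses labeled by θ:

```python
prior = {0.5: 0.4, 0.6: 0.4, 0.9: 0.2}  # P(theta): coin types A, B, C

def update(dist, likelihood):
    # One Bayesian update: multiply each probability by its likelihood, then normalize.
    numer = {th: likelihood(th) * p for th, p in dist.items()}
    total = sum(numer.values())
    return {th: v / total for th, v in numer.items()}

heads = lambda th: th  # p(x = 1 | theta) = theta
after_one = update(prior, heads)      # posterior after the first head
after_two = update(after_one, heads)  # posterior after the second head
print(after_two)  # theta = 0.9 (type C) now has the largest probability
```

Since normalization only happens at the end of each call, chaining updates gives exactly the 'posterior becomes the new prior' scheme from the table.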
[Figure: stick plots at θ = 0.5, 0.6, 0.9 of the prior p(θ), the posterior after one head p(θ|x1 = 1), and the posterior after two heads p(θ|x1 = 1, x2 = 1).]
Example 4. A screening test for a disease is both sensitive and specific. By that we mean
it is usually positive when testing a person with the disease and usually negative when
testing someone without the disease. Let’s assume the true positive rate is 99% and the
false positive rate is 2%. Suppose the prevalence of the disease in the general population is
0.5%. If a random person tests positive, what is the probability that they have the disease?
Solution: As a review we first do the computation using trees. Next we will redo the
computation using tables.
Let’s use notation established above for hypotheses and data: let H+ be the hypothesis
(event) that the person has the disease and let H− be the hypothesis they do not. Likewise,
let T+ and T− represent the data of a positive and negative screening test respectively. We
are asked to compute P (H+ |T+ ).
We are given
P(T+ | H+) = 0.99, P(T+ | H−) = 0.02, P(H+) = 0.005.
From these we can compute the false negative and true negative rates:
P(T− | H+) = 0.01, P(T− | H−) = 0.98.
[Probability tree: H+ with probability 0.005 and H− with 0.995; from H+, branches T+ and T− with probabilities 0.99 and 0.01; from H−, branches T+ and T− with probabilities 0.02 and 0.98.]
Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (T+ |H) P (T+ |H)P (H) P (H|T+ )
H+ 0.005 0.99 0.00495 0.19920
H− 0.995 0.02 0.01990 0.80080
total 1 NO SUM P (T+ ) = 0.02485 1
The table shows that the posterior probability P (H+ |T+ ) that a person with a positive test
has the disease is about 20%. This is far less than the sensitivity of the test (99%) but
much higher than the prevalence of the disease in the general population (0.5%).
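The computation is easy to replicate. A Python sketch of the screening-test update (our addition):

```python
prevalence = 0.005          # P(H+)
sensitivity = 0.99          # P(T+ | H+), the true positive rate
false_positive_rate = 0.02  # P(T+ | H-)

# Law of total probability: P(T+).
p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
# Bayes' theorem: P(H+ | T+).
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_positive)                # ≈ 0.02485
print(p_disease_given_positive)  # ≈ 0.1992
```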
Bayesian Updating: Probabilistic Prediction
Class 10a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to use the law of total probability to compute prior and posterior predictive
probabilities.
2 Introduction
In the previous class we looked at updating the probability of hypotheses based on data.
We can also use the data to update the probability of each possible outcome of a future
experiment. In this class we will look at how this is done.
Prediction using words of estimative probability (WEP): “It is likely to rain tomor-
row.”
Probabilistic prediction: “Tomorrow it will rain with probability 60% (and not rain
with probability 40%).”
Weather forecasting
Climate change
Sports betting
Elections
...
These are all situations where there is uncertainty about the outcome and we would like as
precise a description of what could happen as possible.
3 Predictive Probabilities
You have a drawer containing 4 coins: 2 of type A, 1 of type B, and 1 of type C. You reach
into the drawer and pick a coin at random. We let A stand for the event ‘the chosen coin
is of type A’. Likewise for B and C.
Before taking data we can compute the probability that our chosen coin will land heads (or
tails) if flipped. Let DH be the event it lands heads and let DT be the event it lands tails. We
can use the law of total probability to determine the probabilities of these events. Either
by drawing a tree or directly proceeding to the algebra, we get:
P (DH ) = P (DH |A)P (A) + P (DH |B)P (B) + P (DH |C)P (C)
= 0.5 · 0.5 + 0.6 · 0.25 + 0.9 · 0.25 = 0.625
P (DT ) = P (DT |A)P (A) + P (DT |B)P (B) + P (DT |C)P (C)
= 0.5 · 0.5 + 0.4 · 0.25 + 0.1 · 0.25 = 0.375
Definition: These probabilities give a (probabilistic) prediction of what will happen if the
coin is tossed. Because they are computed before we collect any data they are called prior
predictive probabilities.
Suppose we flip the coin once and it lands heads. We now have data D, which we can use
to update the prior probabilities of our hypotheses to posterior probabilities. Last class we
learned to use a Bayes table to facilitate this computation:
Bayes
hypothesis prior likelihood numerator posterior
H P (H) P (D|H) P (D|H)P (H) P (H|D)
A 0.5 0.5 0.25 0.4
B 0.25 0.6 0.15 0.24
C 0.25 0.9 0.225 0.36
total 1 NO SUM 0.625 1
Having flipped the coin once and gotten heads, we can compute the probability that our
chosen coin will land heads (or tails) if flipped a second time. We proceed just as before, but
using the posterior probabilities P (A|D), P (B|D), P (C|D) in place of the prior probabilities
P (A), P (B), P (C).
P (DH |D) = P (DH |A)P (A|D) + P (DH |B)P (B|D) + P (DH |C)P (C|D)
= 0.5 · 0.4 + 0.6 · 0.24 + 0.9 · 0.36 = 0.668
P (DT |D) = P (DT |A)P (A|D) + P (DT |B)P (B|D) + P (DT |C)P (C|D)
= 0.5 · 0.4 + 0.4 · 0.24 + 0.1 · 0.36 = 0.332
Definition: These probabilities give a (probabilistic) prediction of what will happen if the
coin is tossed again. Because they are computed after collecting data and updating the
prior to the posterior, they are called posterior predictive probabilities.
Note that heads on the first toss increases the probability of heads on the second toss.
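Both predictive computations can be checked in a few lines. A Python sketch (our addition), using the priors P(A) = 0.5 and P(B) = P(C) = 0.25 from this section:

```python
prior = {0.5: 0.5, 0.6: 0.25, 0.9: 0.25}  # P(type), keyed by P(heads | type)

def predictive(dist):
    # Law of total probability: P(heads) = sum over types of P(heads|type)P(type).
    return sum(th * p for th, p in dist.items())

prior_pred = predictive(prior)  # prior predictive probability of heads

# Update on one observed head, then predict the second toss.
numer = {th: th * p for th, p in prior.items()}
total = sum(numer.values())
posterior = {th: v / total for th, v in numer.items()}
post_pred = predictive(posterior)  # posterior predictive probability of heads
print(prior_pred, post_pred)  # ≈ 0.625, ≈ 0.668
```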
3.3 Review
Bayesian Updating with Continuous Priors
Class 11a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2. Be able to state Bayes' theorem and the law of total probability for continuous densities.
2 Introduction
Up to now we have only done Bayesian updating when we had a finite number of hypotheses,
e.g. our dice example had five hypotheses (4, 6, 8, 12 or 20 sides). Now we will study
Bayesian updating when there is a continuous range of hypotheses. The Bayesian update
process will be essentially the same as in the discrete case. As usual when moving from
discrete to continuous we will need to replace the probability mass function by a probability
density function, and sums by integrals.
The first few sections of this note are devoted to working with pdfs. In particular we will
cover the law of total probability and Bayes’ theorem. We encourage you to focus on how
these are essentially identical to the discrete versions. After that, we will apply Bayes’
theorem and the law of total probability to Bayesian updating.
Example 1. Suppose you have a system that can succeed or fail with probability p. Then
we can hypothesize that p is anywhere in the range [0, 1]. That is, we have a continuous
range of hypotheses. We will often model this example with a ‘bent’ coin with unknown
probability p of heads.
In all of these examples we modeled the random process giving rise to the data by a dis-
tribution with parameters –called a parametrized distribution. Every possible choice of the
parameter(s) is a hypothesis, e.g. we can hypothesize that the probability of success in
Example 1 is p = 0.7313. We have a continuous set of hypotheses because we could take
any value between 0 and 1.
4 Notational conventions
As in the examples above, our hypotheses often take the form 'a certain parameter has value
θ'. We will often use the letter θ to stand for an arbitrary hypothesis. This will leave
symbols like p, f , and x to take their usual meanings as pmf, pdf, and data. Also, rather
than saying 'the hypothesis that the parameter of interest has value θ' we will simply say
'the hypothesis θ'.
In the coin example we might have H0.6 = 'the chosen coin has probability 0.6 of heads' and D
= '3 flips landed HHT', so P(D|H0.6) = (0.6)²(0.4).
2. (Small letters) Hypothesis values θ and data values x both have probabilities or proba-
bility densities:
p(θ) p(x) p(θ|x) p(x|θ)
f (θ) f (x) f (θ|x) f (x|θ)
In the coin example we might have θ = 0.6 and x the sequence 1, 1, 0. So, p(x|θ) =
(0.6)²(0.4). We might also write p(x = 1, 1, 0 | θ = 0.6) to emphasize the values of x and θ.
Although we will still use both types of notation, from now on we will mostly use the small
letter notation involving pmfs and pdfs. Hypotheses will usually be parameters represented
by Greek letters (θ, λ, µ, σ, . . . ) while data values will usually be represented by English
letters (x, xi , y, . . . ).
Suppose X is a random variable with pdf f (x). Recall f (x) is a density; its units are
probability/(units of x).
[Figure: two graphs of f(x). Left: P(c ≤ X ≤ d) shown as the area under f(x) between c and d. Right: the probability f(x)dx shown as the area of a thin strip of width dx at x.]
The probability that X is in an infinitesimal range dx around x is f (x) dx. In fact, the
integral formula is just the ‘sum’ of these infinitesimal probabilities. We can visualize these
probabilities by viewing the integral as area under the graph of f (x).
In order to manipulate probabilities instead of densities in what follows, we will make
frequent use of the notion that f (x) dx is the probability that X is in an infinitesimal range
around x of width dx. Please make sure that you fully understand this notion.
In the Bayesian framework we have probabilities of hypotheses –called prior and posterior
probabilities– and probabilities of data given a hypothesis –called likelihoods. In earlier
classes both the hypotheses and the data had discrete ranges of values. We saw in the
introduction that we might have a continuous range of hypotheses. The same is true for
the data, but for today we will assume that our data can only take a discrete set of values.
In this case, the likelihood of data x given hypothesis θ is written using a pmf: p(x|θ).
We will use the following coin example to explain these notions. We will carry this example
through in each of the succeeding sections.
Example 4. Suppose we have a bent coin with unknown probability θ of heads. In this
case, we’ll say the coin is of ’type θ’ and we’ll label the hypothesis that a random coin is
of type θ by Hθ . The value of θ is random and could be anywhere between 0 and 1. For
this and the examples that follow we’ll suppose that the value of θ follows a distribution
with continuous prior probability density f (θ) = 2θ. We have a discrete likelihood because
tossing a coin has only two outcomes, x = 1 for heads and x = 0 for tails.
As we stated earlier, we will often write θ for the hypothesis Hθ . So the above probabilities
become
p(x = 1|θ) = θ, p(x = 0|θ) = 1 − θ.
Think: This can be tricky to wrap your mind around. We have a continuous range of
types of coins –we identify the type by the value of the parameter θ. We are able to choose
a coin at random and the type chosen has a probability density f (θ).
It may help to see that the discrete examples we did in previous classes are similar. In one
example, we had three types of coin with probability of heads 0.5, 0.6, or 0.9. So, we called
our hypotheses H0.5 , H0.6 , H0.9 and these had prior probabilities P (H0.5 ) etc. In other
words, we had a type of coin with an unknown probability of heads, we had hypotheses
about that probability and each of these hypotheses had a prior probability.
The law of total probability for continuous probability distributions is essentially the same
as for discrete distributions. We replace the prior pmf by a prior pdf and the sum by an
integral. We start by reviewing the law for the discrete case.
Recall that for a discrete set of hypotheses H1 , H2 , . . . Hn the law of total probability says
P(D) = Σ_{i=1}^n P(D|Hi) P(Hi).   (1)
This is the total prior probability of D because we used the prior probabilities P(Hi).
In the little letter notation with θ1 , θ2 , . . . , θn for hypotheses and x for data the law of total
probability is written
p(x) = Σ_{i=1}^n p(x|θi) p(θi).   (2)
We also called this the prior predictive probability of the outcome x to distinguish it from
the prior probability of the hypothesis θ.
Likewise, there is a law of total probability for continuous pdfs. We state it as a theorem
using little letter notation.
Theorem. Law of total probability. Suppose we have a continuous parameter θ in the
range [a, b], and discrete random data x. Assume θ is itself random with density f (θ) and
that x and θ have likelihood p(x|θ). In this case, the total probability of x is given by the
formula
p(x) = ∫_a^b p(x|θ) f(θ) dθ.   (3)
Proof. Our proof will be by analogy to the discrete version: The probability term p(x|θ)f (θ) dθ
is perfectly analogous to the term p(x|θi )p(θi ) in Equation 2 (or the term P (D|Hi )P (Hi )
in Equation 1). Continuing the analogy: the sum in Equation 2 becomes the integral in
Equation 3.
As in the discrete case, when we think of θ as a hypothesis explaining the probability of the
data we call p(x) the prior predictive probability for x.
Example 5. (Law of total probability.) Continuing with Example 4. We have a bent coin
with probability θ of heads. The value of θ is random with prior pdf f (θ) = 2θ on [0, 1].
Suppose I am about to flip the coin. What is the total probability of heads, i.e. what is the
prior predictive probability of heads?
Solution: In Example 4 we noted that the likelihoods are p(x = 1|θ) = θ and p(x = 0|θ) =
1 − θ. So the total probability of x = 1 is
p(x = 1) = ∫_0^1 p(x = 1|θ) f(θ) dθ = ∫_0^1 θ · 2θ dθ = ∫_0^1 2θ² dθ = 2/3.
Since the prior is weighted towards higher probabilities of heads, so is the total probability
of heads.
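The integral is simple enough to check numerically. A Python sketch (our addition) approximating the total probability with a midpoint Riemann sum:

```python
def total_prob(likelihood, prior_pdf, n=100_000):
    # Midpoint Riemann sum for p(x) = integral over [0,1] of p(x|theta) f(theta) dtheta.
    dt = 1.0 / n
    return sum(likelihood((i + 0.5) * dt) * prior_pdf((i + 0.5) * dt) * dt
               for i in range(n))

# Likelihood of heads is theta; prior pdf is f(theta) = 2*theta.
p_heads = total_prob(lambda th: th, lambda th: 2 * th)
print(p_heads)  # ≈ 2/3
```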
The statement of Bayes’ theorem for continuous pdfs is essentially identical to the statement
for pmfs. We state it including dθ so we have genuine probabilities:
Theorem. Bayes’ Theorem. Use the same assumptions as in the law of total probability,
i.e. θ is a continuous parameter with pdf f (θ) and range [a, b]; x is random discrete data;
together they have likelihood p(x|θ). With these assumptions:
f(θ|x) dθ = p(x|θ) f(θ) dθ / p(x).   (4)
Proof. Since this is a statement about probabilities it is just the usual statement of Bayes’
theorem. We hope this is clear.
It is important enough to spell out somewhat formally: Let Θ be the random variable that
produces the value θ. Consider the events
H = 'Θ is in an infinitesimal range of width dθ around the value θ'
and
D = 'the value of the data is x'.
Then P(H) = f(θ) dθ, P(D) = p(x), and P(D|H) = p(x|θ). Now our usual form of Bayes'
theorem becomes
f(θ|x) dθ = P(H|D) = P(D|H)P(H)/P(D) = p(x|θ) f(θ) dθ / p(x).
Looking at the first and last terms in this equation we see the new form of Bayes’ theorem.
Finally, we firmly believe that it is more conducive to careful thinking about probability
to keep the factor of dθ in the statement of Bayes’ theorem. But because it appears in the
numerator on both sides of Equation 4 many people drop the dθ and write Bayes’ theorem
in terms of densities as
f(θ|x) = p(x|θ) f(θ) / p(x) = p(x|θ) f(θ) / (∫_a^b p(x|θ) f(θ) dθ).
Now that we have Bayes’ theorem and the law of total probability we can finally get to
Bayesian updating. Before continuing with Example 4, we point out two features of the
Bayesian updating table that appears in the next example:
1. The table for continuous priors is very simple: since we cannot have a row for each of
an infinite number of hypotheses we’ll have just one row which uses a variable to stand for
all hypotheses Hθ .
2. By including dθ, all the entries in the table are probabilities and all our usual probability
rules apply.
Example 6. (Bayesian updating.) Continuing Examples 4 and 5. We have a bent coin
with unknown probability θ of heads. The value of θ is random with prior pdf f(θ) = 2θ.
Suppose we flip the coin three times and get the sequence HHT. Compute the posterior
pdf for θ.
Solution: We make the usual update table, with an added column giving the range of
values that θ can take. We make the first row an abstract version of Bayesian updating and
the second row is Bayesian updating for this particular example. In later examples we will
skip that abstract version.
Bayes
hypothesis range prior likelihood numerator posterior
θ [0, 1] 2θ dθ p(x = 1, 1, 0|θ) = θ²(1 − θ) 2θ³(1 − θ) dθ 20θ³(1 − θ) dθ
total [0, 1] ∫_0^1 2θ dθ = 1 no sum p(x = 1, 1, 0) = ∫_0^1 2θ³(1 − θ) dθ = 1/10 1
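We can verify the continuous update numerically. This Python sketch (our addition) takes the data x = 1, 1, 0 with likelihood θ²(1 − θ) and prior 2θ, computes p(x) on a grid, and checks that the posterior integrates to 1:

```python
n = 100_000
dt = 1.0 / n
mids = [(i + 0.5) * dt for i in range(n)]

# Bayes numerator on the grid: prior density 2*theta times likelihood theta^2*(1-theta).
numer = [2 * t * t**2 * (1 - t) for t in mids]
p_x = sum(v * dt for v in numer)      # total probability of the data
posterior = [v / p_x for v in numer]  # posterior density values on the grid

print(p_x)                             # ≈ 1/10
print(sum(f * dt for f in posterior))  # ≈ 1, as a density must integrate to
```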
3. (i) As always p(x) is the total probability. Since we have a continuous distribution
instead of a sum we compute an integral.
(ii) Notice that by including dθ in the table, it is clear what integral we need to compute
to find the total probability p(x).
4. The table organizes the continuous version of Bayes' theorem. Namely, the posterior pdf
satisfies
f(θ|x) dθ = p(x|θ) f(θ) dθ / p(x).
Removing the dθ in the numerator of both sides we have the statement in terms of densities.
5. Regarding both sides as functions of θ, we can again express Bayes’ theorem in the form:
f (θ|x) ∝ p(x|θ) · f (θ)
posterior ∝ likelihood × prior.
One important prior is called a flat or uniform prior. A flat prior assumes that every
hypothesis is equally probable. For example, if θ has range [0, 1] then f (θ) = 1 is a flat
prior.
Example 7. (Flat priors.) We have a bent coin with unknown probability θ of heads.
Suppose we toss it once and get heads. Assume a flat prior and find the posterior probability
for θ.
Solution: This is similar to Example 6 with a different prior and data.
Bayes
hypothesis range prior likelihood numerator posterior
θ range f(θ) dθ p(x = 1|θ) p(x = 1|θ)f(θ) dθ f(θ|x = 1) dθ
θ [0, 1] 1 · dθ θ θ dθ 2θ dθ
total [0, 1] ∫_0^1 1 dθ = 1 no sum p(x = 1) = ∫_0^1 θ dθ = 1/2 1
Example 8. In the previous example the prior probability was flat. First show that this
means that a priori the coin is equally likely to be biased towards heads or tails. Then, after
observing one heads, what is the (posterior) probability that the coin is biased towards
heads?
Solution: Since the parameter θ is the probability the coin lands heads, the first part of the
problem asks us to show P (θ > 0.5) = 0.5 and the second part asks for P (θ > 0.5 | x = 1).
These are easily computed from the prior and posterior pdfs respectively.
The prior probability that the coin is biased towards heads is

P(θ > 0.5) = ∫_{0.5}^{1} f(θ) dθ = ∫_{0.5}^{1} 1 · dθ = [θ]_{0.5}^{1} = 1/2.
The probability of 1/2 means the coin is equally likely to be biased toward heads or tails.
The posterior probability that it's biased towards heads is

P(θ > 0.5 | x = 1) = ∫_{0.5}^{1} f(θ | x = 1) dθ = ∫_{0.5}^{1} 2θ dθ = [θ²]_{0.5}^{1} = 3/4.
We see that observing one heads has increased the probability that the coin is biased towards
heads from 1/2 to 3/4.
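As a quick numeric sanity check, both probabilities can be reproduced with a midpoint Riemann sum. This is a sketch in Python, not part of the original notes; the helper `integrate` is ours.

```python
# Numeric check of Example 8: integrate the prior f(θ) = 1 and the
# posterior f(θ|x=1) = 2θ over [0.5, 1] with a midpoint Riemann sum.

def integrate(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n slices."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

prior = lambda theta: 1.0            # flat prior on [0, 1]
posterior = lambda theta: 2 * theta  # posterior after one heads

p_prior = integrate(prior, 0.5, 1.0)      # P(θ > 0.5)          -> 0.5
p_post = integrate(posterior, 0.5, 1.0)   # P(θ > 0.5 | x = 1)  -> 0.75

print(round(p_prior, 4), round(p_post, 4))
```

Any standard quadrature routine (e.g. `scipy.integrate.quad`) would work equally well here.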
10 Predictive probabilities
Just as in the discrete case we are also interested in using the posterior probabilities of the
hypotheses to make predictions for what will happen next.
Example 9. (Prior and posterior prediction.) Continuing Examples 4, 5, 6: we have a
coin with unknown probability θ of heads and the value of θ has prior pdf f (θ) = 2θ. Find
the prior predictive probability of heads. Then suppose the first flip was heads and find the
posterior predictive probabilities of both heads and tails on the second flip.
Solution: For notation let x1 be the result of the first flip and let x2 be the result of the
second flip. The prior predictive probability is exactly the total probability computed in
Examples 5 and 6.
p(x₁ = 1) = ∫₀¹ p(x₁ = 1 | θ) f(θ) dθ = ∫₀¹ 2θ² dθ = 2/3.
The posterior predictive probabilities are the total probabilities computed using the posterior pdf. From Example 6 we know the posterior pdf is f(θ | x₁ = 1) = 3θ². So the posterior predictive probabilities are

p(x₂ = 1 | x₁ = 1) = ∫₀¹ p(x₂ = 1 | θ, x₁ = 1) f(θ | x₁ = 1) dθ = ∫₀¹ θ · 3θ² dθ = 3/4

p(x₂ = 0 | x₁ = 1) = ∫₀¹ p(x₂ = 0 | θ, x₁ = 1) f(θ | x₁ = 1) dθ = ∫₀¹ (1 − θ) · 3θ² dθ = 1/4
(More simply, we could have computed p(x2 = 0|x1 = 1) = 1 − p(x2 = 1|x1 = 1) = 1/4.)
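The prior and posterior predictive probabilities can also be checked on a θ-grid. A short Python sketch (grid size and variable names are ours):

```python
# Sketch of Example 9 on a grid: prior f(θ) = 2θ, data = one heads.
n = 100_000
dt = 1.0 / n
thetas = [(i + 0.5) * dt for i in range(n)]

prior = [2 * t for t in thetas]
# prior predictive probability of heads: ∫ θ f(θ) dθ
p_heads = sum(t * f for t, f in zip(thetas, prior)) * dt           # ≈ 2/3
# posterior ∝ likelihood × prior, normalized by p_heads
post = [t * f / p_heads for t, f in zip(thetas, prior)]            # ≈ 3θ²
# posterior predictive probability of heads on the second flip
p_heads2 = sum(t * f for t, f in zip(thetas, post)) * dt           # ≈ 3/4

print(round(p_heads, 3), round(p_heads2, 3))
```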
11 From discrete to continuous Bayesian updating

This section is optional. In it we will try to develop intuition for the transition from discrete
to continuous Bayesian updating. We’ll walk a familiar road from calculus. Namely we will:
(i) divide the continuous range of hypotheses into a finite number of short intervals.
(ii) create the discrete updating table for the finite number of hypotheses.
(iii) consider how the table changes as the number of hypotheses goes to infinity.
In this way, we’ll see the prior and posterior pmfs converge to the prior and posterior pdfs.
Example 10. To keep things concrete, we will work with the same prior and data as in
Example 7. We have a ‘bent’ coin with a flat prior f (θ) = 1. Our data is we tossed the
coin once and got heads.
Our goal is to go from discrete to continuous by increasing the number of hypotheses.
Statistics Class 11a, Bayesian Updating with Continuous Priors 9
4 hypotheses. Suppose we have four types of coins that have probability of heads 1/8,
3/8, 5/8 and 7/8 respectively. If one coin is chosen at random, our hypotheses for its type
are
H1 : θ = 1/8, H2 : θ = 3/8, H3 : θ = 5/8, H4 : θ = 7/8.
To get this, we divided [0, 1] into 4 equal intervals: [0, 1/4], [1/4, 1/2], [1/2, 3/4], [3/4, 1].
Each interval has width ∆θ = 1/4. We put the value of θ for each coin type at the center of its interval.
(Just as with forming Riemann sums in calculus, it’s not important where in each interval
we choose θ. The center is one easy choice.)
Let’s name each of these values θj = j/8, where j = 1, 3, 5, 7.
The flat prior gives each hypothesis a probability of 1/4 = 1 · ∆θ. We have the table:
hypothesis     prior   likelihood   Bayes num.          posterior
θ = θ₁ = 1/8   1/4     1/8          (1/4) · (1/8)       0.0625
θ = θ₂ = 3/8   1/4     3/8          (1/4) · (3/8)       0.1875
θ = θ₃ = 5/8   1/4     5/8          (1/4) · (5/8)       0.3125
θ = θ₄ = 7/8   1/4     7/8          (1/4) · (7/8)       0.4375
Total          1       –            Σᵢ θᵢ ∆θ = 1/2      1
Here are the density histograms of the prior and posterior pmf. The prior and posterior
pdfs from Example 7 are superimposed on the histograms in red.
8 hypotheses. Next we slice [0,1] into 8 intervals each of width ∆θ = 1/8 and use the
center of each slice for our 8 hypotheses θi .
θ1 : ’θ = 1/16’, θ2 : ’θ = 3/16’, θ3 : ’θ = 5/16’, θ4 : ’θ = 7/16’
θ5 : ’θ = 9/16’, θ6 : ’θ = 11/16’, θ7 : ’θ = 13/16’, θ8 : ’θ = 15/16’
The flat prior gives each hypothesis the probability 1/8 = 1 · ∆θ. Here are the table and
density histograms.
[density histograms of the prior (left) and posterior (right) for the 8 hypotheses]
20 hypotheses. Finally we slice [0,1] into 20 pieces. This is essentially identical to the
previous two cases. Let’s skip right to the density histograms.
[density histograms of the prior (left) and posterior (right) for the 20 hypotheses]
Looking at the sequence of plots we see how the prior and posterior density histograms
converge to the prior and posterior probability density functions.
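We can mimic this limiting process in a few lines of Python (a sketch; the helper `discrete_posterior` is our own). For this linear likelihood, choosing interval centers makes the discrete density histogram agree with the pdf 2θ exactly at the slice centers, for every n:

```python
# Discretization sketch from Example 10: slice [0,1] into n intervals,
# run the discrete update with a flat prior and one observed heads,
# and compare the posterior density histogram to the limiting pdf 2θ.

def discrete_posterior(n):
    dt = 1.0 / n
    centers = [(j + 0.5) * dt for j in range(n)]
    prior = [1.0 * dt] * n                            # flat prior: each slice gets 1·Δθ
    numer = [t * p for t, p in zip(centers, prior)]   # likelihood of heads = θ
    total = sum(numer)
    return centers, [b / total for b in numer], dt

for n in (4, 8, 20):
    centers, post, dt = discrete_posterior(n)
    # density at each center = probability / Δθ; compare against 2θ
    max_err = max(abs(p / dt - 2 * t) for t, p in zip(centers, post))
    print(n, max_err)
```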
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who appear in the title, and all praise for this work must go to them. The only modifications made to fit it to "Statistics" (DIT, Faculty AI, study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public, but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without their help, this course would be far from the quality of its current state.
Notational conventions
Class 11, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be able to work with the various notations and terms we use to describe probabilities
and likelihood.
2 Introduction
We’ve introduced a number of different notations for probability, hypotheses and data. We
collect them here, to have them in one place.
The problem of labeling data and hypotheses is a tricky one. When we started the course
we talked about outcomes, e.g. heads or tails. Then when we introduced random variables
we gave outcomes numerical values, e.g. 1 for heads and 0 for tails. This allowed us to do
things like compute means and variances. We need to do something similar now. Recall
our notational conventions:
The connection between values and events: 'X = x' is the event that X takes the value x.
A discrete random variable has a probability mass function p(x) (small p). The connection between P and p is that P(X = x) = p(x).
A continuous random variable has a probability density function f(x). The connection between P and f is that P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx.
We use lower case letters, especially θ, to indicate the hypothesized value of a model
parameter, e.g. the probability the coin lands heads is θ = 0.5.
We use upper case letters, especially D, when talking about data as events. For example, D = 'the sequence of tosses was HTH'.
We use lower case letters, especially x, when talking about data as values. For exam-
ple, the sequence of data was x1 , x2 , x3 = 1, 0, 1.
When the set of hypotheses is discrete we can use the probability of individual hy-
potheses, e.g. p(θ). When the set is continuous we need to use the probability for an
infinitesimal range of hypotheses, e.g. f (θ) dθ.
The following table summarizes this for discrete θ and continuous θ. In both cases we
are assuming a discrete set of possible outcomes (data) x. Tomorrow we will deal with a
continuous set of outcomes.
               hypothesis   prior     likelihood   Bayes numerator    posterior
               H            P(H)      P(D|H)       P(D|H)P(H)         P(H|D)
Discrete θ:    θ            p(θ)      p(x|θ)       p(x|θ)p(θ)         p(θ|x)
Continuous θ:  θ            f(θ) dθ   p(x|θ)       p(x|θ)f(θ) dθ      f(θ|x) dθ
Remember the continuous hypothesis θ is really a shorthand for ‘the parameter θ is in an
interval of width dθ around θ’.
Continuous Data with Continuous Priors
Class 11c, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
Note by Prof. Mayer: This reading is not assigned (not in the original MIT course, and not
in the AIN-B course). It goes into a little more detail on Bayesian updating where both
hypotheses and data are continuous. Just for those interested!
1 Learning Goals
1. Be able to construct a Bayesian update table for continuous hypotheses and continuous
data.
2. Be able to recognize the pdf of a normal distribution and determine its mean and variance.
2 Introduction
We are now ready to do Bayesian updating when both the hypotheses and the data take
continuous values. The pattern is the same as what we’ve done before, so let’s first review
the previous two cases.
3 Previous cases

Discrete hypotheses, discrete data

Notation:
Hypotheses: H
Data: x
Prior: P(H)
Likelihood: p(x | H)
Posterior: P(H | x).
Example 1. Suppose we have data x and three possible explanations (hypotheses) for the
data that we’ll call A, B, C. Suppose also that the data can take two possible values, -1
and 1.
In order to use the data to help estimate the probabilities of the different hypotheses we
need a prior pmf and a likelihood table. Assume the prior and likelihoods are given in
the following table. (For this example we are only concerned with the formal process of
Bayesian updating. So we just made up the prior and likelihoods.)
Question: Suppose we run one trial and obtain the data x1 = 1. Use this to find the
posterior probabilities for the hypotheses.
Solution: The data picks out one column from the likelihood table which we then use in
our Bayesian update table.
hypothesis   prior   likelihood      Bayes numerator   posterior
H            P(H)    p(x = 1 | H)    p(x|H)P(H)        P(H|x) = p(x|H)P(H)/p(x)
A            0.1     0.8             0.08              0.195
B            0.3     0.5             0.15              0.366
C            0.6     0.3             0.18              0.439
total        1       no sum          p(x) = 0.41       1
To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total probability p(x) is the sum of the probabilities in the Bayes numerator column; and
we divide by p(x) to normalize the Bayes numerator.
Note: As usual, the term ‘no sum’ in the likelihood column is not literally true. What it
means is that the sum is not meaningful to us. In particular, we don’t expect the likelihood
column to sum to 1.
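The arithmetic in this table is easy to script. A Python sketch (the variable names are ours, not part of the original notes):

```python
# Reproducing the discrete update table for Example 1 (hypotheses A, B, C,
# observed data x = 1): Bayes numerator = prior × likelihood, then normalize.
priors = {"A": 0.1, "B": 0.3, "C": 0.6}
likelihoods = {"A": 0.8, "B": 0.5, "C": 0.3}   # p(x = 1 | H)

numer = {h: priors[h] * likelihoods[h] for h in priors}
p_x = sum(numer.values())                       # total probability p(x) = 0.41
posterior = {h: numer[h] / p_x for h in numer}

print(round(p_x, 2))
print({h: round(p, 3) for h, p in posterior.items()})
```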
Continuous hypotheses, discrete data

Notation:
Hypotheses: θ
Data: x
Prior: f(θ) dθ
Likelihood: p(x | θ)
Posterior: f(θ | x) dθ.
Example 2. Suppose the prior for a bent coin's probability of heads is f(θ) = 2θ and the likelihood of x heads in 5 tosses is binomial: p(x | θ) = (5 choose x) θˣ (1 − θ)^{5−x}.
Question: Suppose we run one trial and obtain the data x = 2. Use this to find the
posterior pdf for the parameter (hypotheses) θ.
Solution: As before, the data picks out one column from the likelihood table which we
can use in our Bayesian update table. Since we want to work with probabilities we write
f(θ) dθ and f(θ | x) dθ for the pdfs.

hypothesis   prior    likelihood              Bayes numerator               posterior
θ            2θ dθ    (5 choose 2) θ²(1−θ)³   2 (5 choose 2) θ³(1−θ)³ dθ    f(θ | x) dθ = (7!/(3!·3!)) θ³(1−θ)³ dθ
total        1        no sum                  p(x) = ∫₀¹ 2 (5 choose 2) θ³(1−θ)³ dθ = 2 (5 choose 2) · 3!·3!/7! = 1/7    1
To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total probability p(x) is the integral of the probabilities in the Bayes numerator column;
and we divide by p(x) to normalize the Bayes numerator.
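We can verify the total probability and the normalized posterior numerically. A Python sketch (grid size arbitrary; variable names ours):

```python
# Numeric check of the Example-2 table: prior f(θ) = 2θ, likelihood
# C(5,2) θ²(1−θ)³ for x = 2 heads in 5 tosses. We verify p(x) = 1/7 and
# that the normalized posterior matches (7!/(3!·3!)) θ³(1−θ)³ = 140 θ³(1−θ)³.
from math import comb

n = 100_000
dt = 1.0 / n
thetas = [(i + 0.5) * dt for i in range(n)]

def numer(t):  # Bayes numerator density: likelihood × prior
    return comb(5, 2) * t**2 * (1 - t) ** 3 * 2 * t

p_x = sum(numer(t) for t in thetas) * dt       # ≈ 1/7
post_at_half = numer(0.5) / p_x                # ≈ 140 · (1/2)³ · (1/2)³ = 2.1875

print(round(p_x, 4), round(post_at_half, 4))
```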
When both data and hypotheses are continuous, the only change to the previous example is
that the likelihood function uses a pdf ϕ(x | θ) instead of a pmf p(x | θ). The general shape
of the Bayesian update table is the same.
Notation
Statistics Class 11c, Continuous Data with Continuous Priors 4
Hypotheses θ. For continuous hypotheses, this really means that we hypothesize that
the parameter is in a small interval of size dθ around θ.
Data x. For continuous data, this really means that the data is in a small interval of
size dx around x.
Prior f (θ)dθ. This is our initial belief about the probability that the parameter is in
a small interval of size dθ around θ.
Likelihood ϕ(x | θ) dx. This is the (calculated) probability that the data is in a small
interval of size dx around x, ASSUMING the hypothesis θ.
Posterior f (θ | x) dθ. This is the (calculated) probability that the parameter is in a
small interval of size dθ around θ, GIVEN the data x.
Simplifying the notation. In the previous cases we included dθ so that we were working
with probabilities instead of densities. When both data and hypotheses are continuous
we will need both dθ and dx. This makes things conceptually simpler, but notationally
cumbersome. To simplify the notation we will sometimes allow ourselves to drop dx in our
tables. This is fine because the data x is fixed in each calculation. We keep the dθ because
the hypothesis θ is allowed to vary.
For comparison, we first show the general table in simplified notation followed immediately
afterward by the table showing both infinitesimals.
hypoth.   prior     likelihood   Bayes numerator     posterior
θ         f(θ) dθ   ϕ(x | θ)     ϕ(x | θ)f(θ) dθ     f(θ | x) dθ = ϕ(x | θ)f(θ) dθ / ϕ(x)
total     1         no sum       ϕ(x) = ∫ ϕ(x | θ)f(θ) dθ   1
                                 (integrate over θ) = prior prob. density for data x

With both infinitesimals shown:

hypoth.   prior     likelihood     Bayes numerator        posterior
θ         f(θ) dθ   ϕ(x | θ) dx    ϕ(x | θ) dx f(θ) dθ    f(θ | x) dθ = ϕ(x | θ) dx f(θ) dθ / (ϕ(x) dx)
total     1         no sum         ϕ(x) dx = (∫ ϕ(x | θ)f(θ) dθ) dx   1

To summarize: the Bayes numerator is the product of the prior and likelihood; the total probability ϕ(x) dx is the integral of the probabilities in the Bayes numerator column; we divide by ϕ(x) dx to normalize the Bayes numerator.
We have chosen to use the notation ϕ(x), ϕ(x | θ) for the pdfs of data and f (θ), f (θ | x) for
the pdfs of hypotheses. This is nice because ϕ is a Greek f , but the different symbols help
us distinguish the two types of pdfs. Many, perhaps most, writers use the same letter f for
both. This forces the reader to look at the arguments to the function to understand what
is meant. That is, f (x|θ) is the probability of data given an hypothesis, i.e. the likelihood
and f (θ|x) is the probability of an hypothesis given the data, i.e. the posterior pdf.
As mathematicians this makes us pull our hair out. But, to be fair, there is a philosoph-
ical underpinning to this notation. We can think of f as a universal probability density
which gives the probability of absolutely any combination of things. Thus f (x, y) is the
joint probability density for the quantities denoted by x and y. If we just write f (x) the
implication is that this means the marginal density for x, i.e. the density for x when y is
allowed to take any value. Similarly we can write f (x, y|z) for the conditional density of x
and y given z.
A standard example of continuous hypotheses and continuous data assumes that both the
data and prior follow normal distributions. The following example assumes that the variance
of the data is known.
Example 3. Suppose we have data x = 5 which was drawn from a normal distribution
with unknown mean θ and standard deviation 1.
x ∼ N(θ, 1)
Suppose further, that our prior distribution for the unknown parameter θ is θ ∼ N(2, 1).
Let x represent an arbitrary data value.
(a) Make a Bayesian table with prior, likelihood, and Bayes numerator.
(b) Show that the posterior distribution for θ is normal as well.
(c) Find the mean and variance of the posterior distribution.
Solution: As we did with the tables above, a good compromise on the notation is to include
dθ but not dx. The reason for this is that the total probability is computed by integrating
over θ and the dθ reminds of us that.
Our prior pdf is

f(θ) = (1/√(2π)) e^{−(θ−2)²/2}.

The likelihood function is

ϕ(x = 5 | θ) = (1/√(2π)) e^{−(5−θ)²/2}.
We know we are going to multiply the prior and the likelihood, so we carry out that algebra
first. In the very last step we give the complicated constant factor the name c1 .
prior · likelihood = (1/√(2π)) e^{−(θ−2)²/2} · (1/√(2π)) e^{−(5−θ)²/2}
                   = (1/(2π)) e^{−(2θ²−14θ+29)/2}
                   = (1/(2π)) e^{−(θ²−7θ+29/2)}          (complete the square)
                   = (1/(2π)) e^{−((θ−7/2)²+9/4)}
                   = (e^{−9/4}/(2π)) e^{−(θ−7/2)²}
                   = c₁ e^{−(θ−7/2)²}
Here the posterior is f(θ | x = 5) dθ = ϕ(x = 5 | θ)f(θ) dθ / ϕ(x = 5).

hypothesis   prior                        likelihood                Bayes numerator        posterior f(θ | x = 5) dθ
θ            (1/√(2π))e^{−(θ−2)²/2} dθ    (1/√(2π))e^{−(5−θ)²/2}    c₁ e^{−(θ−7/2)²} dθ    c₂ e^{−(θ−7/2)²} dθ
total        1                            no sum                    ϕ(x = 5) = ∫ ϕ(x = 5 | θ)f(θ) dθ    1
We can see by the form of the posterior pdf that it is a normal distribution. Because the exponential for a normal distribution is e^{−(θ−µ)²/(2σ²)}, we have mean µ = 7/2 and 2σ² = 1, so variance σ² = 1/2.
We don't need to bother computing the total probability; it is just used for normalization, and we already know the normalization constant 1/(σ√(2π)) for a normal distribution. To summarize,
The posterior pdf follows a N(7/2, 1/2) distribution.
Here is the graph of the prior and the posterior pdfs for this example. Note how the data
‘pulls’ the prior (the wider bell on the left) towards the data. The posterior is the narrower
bell on the right. After collecting data, we have a new opinion about the mean, and we are
more sure of this new opinion.
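The conclusion can be checked numerically. A Python sketch (the helper `npdf` and the grid are ours): computing the posterior on a θ-grid and then its mean and variance should give approximately 7/2 and 1/2.

```python
# Grid check of Example 3: prior θ ~ N(2,1), one observation x = 5 from
# N(θ,1). The posterior should be N(7/2, 1/2).
from math import exp, sqrt, pi

def npdf(z, mu, var):
    """Normal pdf with mean mu and variance var."""
    return exp(-(z - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

lo, hi, n = -4.0, 11.0, 150_000
dt = (hi - lo) / n
thetas = [lo + (i + 0.5) * dt for i in range(n)]

numer = [npdf(5, t, 1) * npdf(t, 2, 1) for t in thetas]  # likelihood × prior
total = sum(numer) * dt
post = [b / total for b in numer]

mean = sum(t * p for t, p in zip(thetas, post)) * dt               # ≈ 7/2
var = sum((t - mean) ** 2 * p for t, p in zip(thetas, post)) * dt  # ≈ 1/2
print(round(mean, 3), round(var, 3))
```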
Now we’ll repeat the previous example for general x. When reading this if you mentally
substitute 5 for x you will understand the algebra.
Example 4. Suppose our data x is drawn from a normal distribution with unknown mean
θ and standard deviation 1.
x ∼ N(θ, 1)
Suppose further, that our prior distribution for the unknown parameter θ is θ ∼ N(2, 1).
Solution: As before, we show the algebra used to simplify the Bayes numerator. The prior pdf and likelihood function are

f(θ) = (1/√(2π)) e^{−(θ−2)²/2},    ϕ(x | θ) = (1/√(2π)) e^{−(x−θ)²/2}.

The Bayes numerator is the product of the prior and the likelihood:

prior · likelihood = (1/√(2π)) e^{−(θ−2)²/2} · (1/√(2π)) e^{−(x−θ)²/2}
                   = (1/(2π)) e^{−(2θ²−(4+2x)θ+4+x²)/2}
                   = (1/(2π)) e^{−(θ²−(2+x)θ+(4+x²)/2)}          (complete the square)
                   = (1/(2π)) e^{−((θ−(1+x/2))²−(1+x/2)²+(4+x²)/2)}
                   = c₁ e^{−(θ−(1+x/2))²}
Just as in the previous example, in the last step we replaced all the constants, including the exponentials that just involve x, by the simple constant c₁.
hypothesis   prior                        likelihood                Bayes numerator           posterior f(θ | x) dθ
θ            (1/√(2π))e^{−(θ−2)²/2} dθ    (1/√(2π))e^{−(x−θ)²/2}    c₁ e^{−(θ−(1+x/2))²} dθ   c₂ e^{−(θ−(1+x/2))²} dθ
total        1                            no sum                    ϕ(x) = ∫ ϕ(x | θ)f(θ) dθ   1
As in the previous example we can see by the form of the posterior that it must be a normal
distribution with mean 1 + x/2 and variance 1/2. That is,
The posterior pdf follows a N(1 + x/2, 1/2) distribution.
You should compare this with the case x = 5 in the previous example.
7 Predictive probabilities
Since the data x is continuous it has prior and posterior predictive pdfs. The prior predictive pdf is the total probability density computed at the bottom of the Bayes numerator column:

ϕ(x) = ∫ ϕ(x | θ) f(θ) dθ.

The posterior predictive pdf (for new data x₂ after observing x₁) has the same form, with the posterior pdf in place of the prior:

ϕ(x₂ | x₁) = ∫ ϕ(x₂ | θ) f(θ | x₁) dθ.
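For the running normal example (posterior θ | x₁ = 5 ~ N(7/2, 1/2), likelihood variance 1), this integral can be checked numerically; by the standard fact about sums of independent normals, the result should be the N(7/2, 3/2) pdf. A Python sketch (helper names are ours):

```python
# Posterior predictive sketch: ϕ(x₂|x₁) = ∫ ϕ(x₂|θ) f(θ|x₁) dθ, with
# f(θ|x₁) the N(3.5, 0.5) posterior and ϕ(x₂|θ) the N(θ, 1) likelihood.
from math import exp, sqrt, pi

def npdf(z, mu, var):
    """Normal pdf with mean mu and variance var."""
    return exp(-(z - mu) ** 2 / (2 * var)) / sqrt(2 * pi * var)

def predictive(x2, lo=-4.0, hi=11.0, n=100_000):
    """Midpoint-rule approximation of the posterior predictive pdf at x2."""
    dt = (hi - lo) / n
    return sum(npdf(x2, t, 1) * npdf(t, 3.5, 0.5)
               for t in (lo + (i + 0.5) * dt for i in range(n))) * dt

# compare the numeric integral against the N(3.5, 1.5) pdf
for x2 in (2.0, 3.5, 5.0):
    print(round(predictive(x2), 4), round(npdf(x2, 3.5, 1.5), 4))
```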
Conjugate priors: Beta and normal
Class 12, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Be familiar with the 2-parameter family of beta distributions and its normalization.
4. Understand and be able to use the formula for updating a normal prior given a normal
likelihood with known variance.
2 Introduction
Our main goal here is to introduce the idea of conjugate priors and look at some specific
conjugate pairs. These simplify the job of Bayesian updating to simple arithmetic. We’ll
start by introducing the beta distribution and using it as a conjugate prior with a binomial
likelihood. After that we’ll look at other conjugate pairs.
Note by Prof. Markus Mayer: In AIN-B Statistics, we deviate from this path. We skip the beta distribution (it is removed from this script). The idea of conjugate priors will become clear using only the normal distribution as an example.
3 Conjugate priors
The beta distribution is called a conjugate prior for the binomial distribution. This means that if the likelihood function is binomial, then a beta prior gives a beta posterior. In fact, the beta distribution is a conjugate prior for the Bernoulli and geometric distributions as well.
We will soon see another important example: the normal distribution is its own conjugate
prior. In particular, if the likelihood function is normal with known variance, then a normal
prior gives a normal posterior.
Conjugate priors are useful because they reduce Bayesian updating to modifying the parameters of the prior distribution (so-called hyperparameters) rather than computing integrals. For many more examples see:
https://en.wikipedia.org/wiki/Conjugate_prior_distribution
We now give a definition of conjugate prior. It is best understood through the examples in
the subsequent sections.
Definition. Suppose we have data with likelihood function ϕ(x|θ) depending on a hy-
pothesized parameter θ. Also suppose the prior distribution for θ is one of a family of
1
Statistics Class 12, Conjugate priors: Beta and normal 2
parametrized distributions. If the posterior distribution for θ is in this family then we say that the family of priors are conjugate priors for the likelihood.
This definition will be illustrated with specific examples in the sections below.
The idea here is essentially identical to the Bayesian updating we’ve already done. The
only change is, with a continuous likelihood, we have to compute the total probability of
the data (i.e. sum of the Bayes numerator column, i.e. normalizing factor) as an integral
instead of a sum. We will cover this briefly. For those who are interested, a bit more detail
is given in an optional note.
Notation
Hypotheses θ. For continuous hypotheses, this really means that we hypothesize that
the parameter is in a small interval of size dθ around θ.
Data x. For continuous data, this really means that the data is in a small interval of
size dx around x.
Prior f (θ)dθ. This is our initial belief about the probability that the parameter is in
a small interval of size dθ around θ.
Likelihood ϕ(x | θ). So the probability that the data is in a small interval of size dx
around x, ASSUMING the hypothesis θ is ϕ(x | θ) dx
hypoth.   prior     likelihood   Bayes numerator     posterior
θ         f(θ) dθ   ϕ(x | θ)     ϕ(x | θ)f(θ) dθ     f(θ | x) dθ = ϕ(x | θ)f(θ) dθ / ϕ(x)
total     1         no sum       ϕ(x) = ∫ ϕ(x | θ)f(θ) dθ   1
                                 (integrate over θ) = prior prob. density for data x
To summarize: the prior probabilities of hypotheses and the likelihoods of data given hy-
pothesis were given; the Bayes numerator is the product of the prior and likelihood; the
total likelihood ϕ(x) is the integral of the probabilities in the Bayes numerator column; we
divide by ϕ(x) to normalize the Bayes numerator.
µ_post/σ²_post = µ_prior/σ²_prior + x/σ²,     1/σ²_post = 1/σ²_prior + 1/σ²     (1)

The following form of these formulas is easier to read and shows that µ_post is a weighted average between µ_prior and the data x.

a = 1/σ²_prior,   b = 1/σ²,   µ_post = (a µ_prior + b x)/(a + b),   σ²_post = 1/(a + b).     (2)
With these formulas in mind, we can express the update via the table:
We leave the proof of the general formulas to the problem set. It is an involved algebraic
manipulation which is essentially the same as the following numerical example.
Example 1. Suppose we have prior θ ∼ N(4, 8), and likelihood x ∼ N(θ, 5). Suppose also that we have one measurement x₁ = 3. Show the posterior distribution is normal.
Solution: We will show this by grinding through the algebra which involves completing the square.

prior: f(θ) = c₁ e^{−(θ−4)²/16};    likelihood: ϕ(x₁ | θ) = c₂ e^{−(x₁−θ)²/10} = c₂ e^{−(3−θ)²/10}

Multiplying the prior and likelihood gives an exponent

−(θ−4)²/16 − (3−θ)²/10 = −13(θ − 44/13)²/80 + constant     (complete the square)

This has the form of the pdf for N(44/13, 40/13). QED

We can also get this directly from the updating formulas (2):

µ_prior = 4,  σ²_prior = 8,  σ² = 5  ⇒  a = 1/8,  b = 1/5.

Therefore

µ_post = (a µ_prior + b x)/(a + b) = 44/13 ≈ 3.38,
σ²_post = 1/(a + b) = 40/13 ≈ 3.08.
The updating formulas (2) give µ_post as a weighted average of µ_prior and the data. The weight on µ_prior is a/(a + b), and the weight on the data is b/(a + b). These weights are always positive numbers summing to 1. If b is very large (that is, if the data has a tiny variance) then most of the weight is on the data. If a is very large (that is, if σ²_prior is small, i.e. if you are very confident in your prior) then most of the weight is on the prior.
In the above example the variance of the prior was bigger than the variance of the data, so a was smaller than b; so the weight was mostly on the data. The posterior mean 3.38 was closer to the data 3 than to the prior mean 4.
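Formulas (2) translate directly into code. A short Python sketch (the function name `normal_update` is ours):

```python
# Sketch of the updating formulas (2): normal prior N(mu_prior, s2_prior),
# normal likelihood with known variance s2, single data point x.
def normal_update(mu_prior, s2_prior, x, s2):
    a = 1 / s2_prior
    b = 1 / s2
    mu_post = (a * mu_prior + b * x) / (a + b)
    s2_post = 1 / (a + b)
    return mu_post, s2_post

# Example 1: prior N(4, 8), likelihood variance 5, data x1 = 3.
mu, s2 = normal_update(4, 8, 3, 5)
print(round(mu, 2), round(s2, 2))   # 44/13 ≈ 3.38, 40/13 ≈ 3.08
```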
Example 2. Suppose that we know the data x ∼ N(θ, 4/9) and we have prior N(0, 1). We
get one data value x = 6.5. Describe the changes to the pdf for θ in updating from the
prior to the posterior.
Solution: µ_prior = 0, σ²_prior = 1, σ² = 4/9. So, using the updating formulas (2) we have

a = 1,   b = 1/(4/9) = 9/4,   µ_post = (a µ_prior + b x)/(a + b) = 4.5,   σ²_post = 1/(a + b) = 4/13.
Statistics Class 12, Conjugate priors: Beta and normal 5
Here is a graph of the prior and posterior pdfs with the data point marked by a red line.
σ²_post = 1/(a + b) < 1/a = σ²_prior.

That is, the posterior has smaller variance than the prior, i.e. data makes us more certain about where in its range θ lies.

Likewise, σ²_post = 1/(a + b) < 1/b = σ². So, the posterior variance is smaller than σ².
Example 4. Suppose we have data x1 , x2 , x3 . Use the formulas (1) to update sequentially.
Solution: Let’s label the prior mean and variance as µ0 and σ02 . The updated means and
Statistics Class 12, Conjugate priors: Beta and normal 6
1 1 1 µ1 µ0 x1
2 = 2 + 2; 2 = 2+ 2
σ1 σ0 σ σ1 σ0 σ
1 1 1 1 2 µ2 µ1 x2 µ0 x 1 + x 2
2 = 2 + 2 = 2 + 2; 2 = 2+ 2 = 2+
σ2 σ1 σ σ0 σ σ2 σ1 σ σ0 σ2
1 1 1 1 3 µ3 µ2 x3 µ0 x 1 + x 2 + x 3
= 2 + 2 = 2 + 2; = 2+ 2 = 2+
σ32 σ2 σ σ0 σ σ32 σ2 σ σ0 σ2
Again we give the easier to read form, showing µ_post is a weighted average of µ_prior and the sample average x̄:

a = 1/σ²_prior,   b = n/σ²,   µ_post = (a µ_prior + b x̄)/(a + b),   σ²_post = 1/(a + b).     (4)
Interpretation: µ_post is a weighted average of µ_prior and x̄. If the number of data points is large then the weight b is large and x̄ will have a strong influence on the posterior. If σ²_prior is small then the weight a is large and µ_prior will have a strong influence on the posterior.
To summarize:
1. Lots of data has a big influence on the posterior.
2. High certainty (low variance) in the prior has a big influence on the posterior.
The actual posterior is a balance of these two influences.
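We can check that updating one data point at a time agrees with the batch formula (4). A Python sketch with made-up numbers (prior, data, and variance are ours):

```python
# Sequential updating sketch (Example 4 / formula (4)): fold in the data
# one point at a time and compare with the batch formula using x̄.
def normal_update(mu_prior, s2_prior, x, s2):
    a, b = 1 / s2_prior, 1 / s2
    return (a * mu_prior + b * x) / (a + b), 1 / (a + b)

mu, s2_post = 0.0, 1.0           # prior N(0, 1)   (made-up numbers)
s2 = 4 / 9                       # known data variance
data = [6.0, 7.0, 6.5]
for x in data:
    mu, s2_post = normal_update(mu, s2_post, x, s2)

# batch version with formula (4): a = 1/σ²_prior, b = n/σ²
a, b = 1 / 1.0, len(data) / s2
xbar = sum(data) / len(data)
mu_batch = (a * 0.0 + b * xbar) / (a + b)

print(round(mu, 4), round(mu_batch, 4))   # the two agree
```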
Choosing priors
Class 13a, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
1. Learn that the choice of prior affects the posterior.
2. See that too rigid a prior can make it difficult to learn from the data.
3. See that more data lessens the dependence of the posterior on the prior.
4. Be able to make a reasonable choice of prior, based on prior understanding of the system
under consideration.
2 Introduction
Up to now we have always been handed a prior pdf. In this case, statistical inference from
data is essentially an application of Bayes’ theorem. When the prior is known there is no
controversy on how to proceed. The art of statistics starts when the prior is not known
with certainty. There are two main schools on how to proceed in this case: Bayesian and
frequentist. For now we are following the Bayesian approach. Starting next week we will
learn the frequentist approach.
Recall that given data D and a hypothesis H we used Bayes' theorem to write

P(H | D) = P(D | H) · P(H) / P(D),    that is,

posterior ∝ likelihood · prior.
Bayesian: Bayesians make inferences using the posterior P (H|D), and therefore always
need a prior P (H). If a prior is not known with certainty the Bayesian must try to make
a reasonable choice. There are many ways to do this and reasonable people might make
different choices. In general it is good practice to justify your choices and to explore a range
of priors to see if they all point to the same conclusion.
Frequentist: Very briefly, frequentists do not try to create a prior. Instead, they make
inferences using the likelihood P (D|H).
We will compare the two approaches in detail once we have more experience with each. For
now we simply list two benefits of the Bayesian approach.
1. The posterior probability P (H|D) for the hypothesis given the evidence is usually exactly
what we’d like to know. The Bayesian can say something like ‘the parameter of interest has
probability 0.95 of being between 0.49 and 0.51.’
2. The assumptions that go into choosing the prior can be clearly spelled out.
More good data: It is always the case that more and better data allows for stronger
conclusions and lessens the influence of the prior. The emphasis should be as much on
better data (quality) as on more data (quantity).
3 Example: Dice
Suppose we have a drawer full of dice, each of which has either 4, 6, 8, 12, or 20 sides. This
time, we do not know how many of each type are in the drawer. A die is picked at random
from the drawer and rolled 5 times. The results in order are 4, 2, 4, 7, and 5.
Suppose we have no idea what the distribution of dice in the drawer might be. In this case
it’s reasonable to use a flat prior. Here is the update table for the posterior probabilities
that result from updating after each roll. In order to fit all the columns, we leave out the
Bayes numerators.
hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 1/5 1/4 0.370 1/4 0.542 1/4 0.682 0 0.000 0 0.000
H6 1/5 1/6 0.247 1/6 0.241 1/6 0.202 0 0.000 1/6 0.000
H8 1/5 1/8 0.185 1/8 0.135 1/8 0.085 1/8 0.818 1/8 0.876
H12 1/5 1/12 0.123 1/12 0.060 1/12 0.025 1/12 0.161 1/12 0.115
H20 1/5 1/20 0.074 1/20 0.022 1/20 0.005 1/20 0.021 1/20 0.009
This should look familiar. Given the data, the final posterior is heavily weighted towards
hypothesis H8 that the 8-sided die was picked.
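The sequential update is easy to carry out by machine. Here is a short sketch in Python (the course's own snippets use R; Python is used here purely for illustration) that reproduces the posterior columns of the table above:

```python
# Sequential Bayesian update for the dice example. Hypotheses, rolls, and
# the flat prior are taken from the text above.
sides = [4, 6, 8, 12, 20]
rolls = [4, 2, 4, 7, 5]

posterior = {n: 1 / len(sides) for n in sides}  # flat prior: 1/5 each
for roll in rolls:
    # likelihood of this roll under H_n: 1/n if the roll is possible, else 0
    numer = {n: posterior[n] * ((1 / n) if roll <= n else 0.0) for n in sides}
    total = sum(numer.values())
    posterior = {n: numer[n] / total for n in sides}

print({n: round(p, 3) for n, p in posterior.items()})
# H8 ends up with posterior ≈ 0.876, matching the last column of the table
```

Each pass through the loop is one column pair (lik, post) of the update table; normalizing by `total` is the division by the sum of the Bayes numerators.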
To see how much the above posterior depended on our choice of prior, let’s try some other
priors. Suppose we have reason to believe that there are ten times as many 20-sided dice
in the drawer as there are each of the other types. The table becomes:
hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0.071 1/4 0.222 1/4 0.453 1/4 0.650 0 0.000 0 0.000
H6 0.071 1/6 0.148 1/6 0.202 1/6 0.193 0 0.000 1/6 0.000
H8 0.071 1/8 0.111 1/8 0.113 1/8 0.081 1/8 0.688 1/8 0.810
H12 0.071 1/12 0.074 1/12 0.050 1/12 0.024 1/12 0.136 1/12 0.107
H20 0.714 1/20 0.444 1/20 0.181 1/20 0.052 1/20 0.176 1/20 0.083
Next, suppose we believe there are 100 times as many 20-sided dice in the drawer as there
are each of the other types, so the prior on H20 is 100/104 ≈ 0.9615. The table becomes:
hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0.0096 1/4 0.044 1/4 0.172 1/4 0.443 0 0.000 0 0.000
H6 0.0096 1/6 0.030 1/6 0.077 1/6 0.131 0 0.000 1/6 0.000
H8 0.0096 1/8 0.022 1/8 0.043 1/8 0.055 1/8 0.266 1/8 0.464
H12 0.0096 1/12 0.015 1/12 0.019 1/12 0.016 1/12 0.053 1/12 0.061
H20 0.9615 1/20 0.889 1/20 0.689 1/20 0.354 1/20 0.681 1/20 0.475
With such a strong prior belief in the 20-sided die, the final posterior gives a lot of weight
to the theory that the data arose from a 20-sided die, even though it is extremely unlikely that the
20-sided die would produce a maximum of 7 in 5 rolls. The posterior now gives roughly
even odds that an 8-sided die versus a 20-sided die was picked.
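The dependence of the final posterior on the prior can be checked directly. The sketch below (Python for illustration; the course uses R) computes the posterior after all five rolls under the three priors discussed above:

```python
# Effect of the prior on the final posterior for the dice example.
# The three priors (flat, 10:1 for H20, 100:1 for H20) are from the text.
sides = [4, 6, 8, 12, 20]
rolls = [4, 2, 4, 7, 5]

def final_posterior(prior):
    """Posterior over die types after all five rolls, for a prior dict."""
    weights = {}
    for n in sides:
        lik = 1.0
        for roll in rolls:
            lik *= (1 / n) if roll <= n else 0.0
        weights[n] = prior[n] * lik
    total = sum(weights.values())
    return {n: w / total for n, w in weights.items()}

flat    = {n: 1 / 5 for n in sides}
ten     = {n: (10 / 14 if n == 20 else 1 / 14) for n in sides}
hundred = {n: (100 / 104 if n == 20 else 1 / 104) for n in sides}

for name, prior in [("flat", flat), ("10:1", ten), ("100:1", hundred)]:
    post = final_posterior(prior)
    print(name, round(post[8], 3), round(post[20], 3))
# The stronger the pull toward H20, the more posterior weight it keeps:
# P(H20 | data) is roughly 0.009, 0.083, and 0.475 respectively.
```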
Mild cognitive dissonance. Too rigid a prior belief can overwhelm any amount of data.
Suppose I’ve got it in my head that the die has to be 20-sided. So I set my prior to
P (H20 ) = 1 with the other 4 hypotheses having probability 0. Look what happens in the
update table.
hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 post4 lik5 post5
H4 0 1/4 0 1/4 0 1/4 0 0 0 0 0
H6 0 1/6 0 1/6 0 1/6 0 0 0 1/6 0
H8 0 1/8 0 1/8 0 1/8 0 1/8 0 1/8 0
H12 0 1/12 0 1/12 0 1/12 0 1/12 0 1/12 0
H20 1 1/20 1 1/20 1 1/20 1 1/20 1 1/20 1
No matter what the data, a hypothesis with prior probability 0 will have posterior probabil-
ity 0. In this case I’ll never get away from the hypothesis H20 , although I might experience
some mild cognitive dissonance.
Severe cognitive dissonance. Rigid priors can also lead to absurdities. Suppose I now
have it in my head that the die must be 4-sided. So I set P (H4 ) = 1 and the other prior
probabilities to 0. With the given data on the fourth roll I reach an impasse. A roll of 7
can’t possibly come from a 4-sided die. Yet this is the only hypothesis I’ll allow. My Bayes
numerator is a column of all zeros which cannot be normalized.
hyp. prior lik1 post1 lik2 post2 lik3 post3 lik4 Bayes numer4 post4
H4 1 1/4 1 1/4 1 1/4 1 0 0 ???
H6 0 1/6 0 1/6 0 1/6 0 0 0 ???
H8 0 1/8 0 1/8 0 1/8 0 1/8 0 ???
H12 0 1/12 0 1/12 0 1/12 0 1/12 0 ???
H20 0 1/20 0 1/20 0 1/20 0 1/20 0 ???
I must adjust my belief about what is possible or, more likely, I'll suspect you of accidentally
or deliberately messing up the data.
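The impasse shows up mechanically as a zero normalizer. A minimal Python sketch (for illustration; any implementation of the update hits the same wall):

```python
# A rigid prior at work: with P(H4) = 1 and a roll of 7, every Bayes
# numerator is zero and the update cannot be normalized.
sides = [4, 6, 8, 12, 20]
prior = {n: (1.0 if n == 4 else 0.0) for n in sides}
roll = 7

numer = {n: prior[n] * ((1 / n) if roll <= n else 0.0) for n in sides}
total = sum(numer.values())
if total == 0:
    print("impasse: the data are impossible under every allowed hypothesis")
else:
    posterior = {n: numer[n] / total for n in sides}
```

This is exactly the column of zeros in the table: dividing by `total` would be division by zero, so no posterior exists.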
4 Example: Malaria
Here is a real example adapted from Statistics, A Bayesian Perspective by Donald Berry:
By the 1950s scientists had begun to formulate the hypothesis that carriers of the sickle-cell
gene were more resistant to malaria than noncarriers. There was a fair amount of circum-
stantial evidence for this hypothesis. It also helped explain the persistence of an otherwise
deleterious gene in the population. In one experiment scientists injected 30 African volun-
teers with malaria. Fifteen of the volunteers carried one copy of the sickle-cell gene and the
other 15 were noncarriers. Fourteen out of 15 noncarriers developed malaria while only 2
out of 15 carriers did. Does this small sample support the hypothesis that the sickle-cell
gene protects against malaria?
Let S represent a carrier of the sickle-cell gene and N represent a non-carrier. Let D+
indicate developing malaria and D− indicate not developing malaria. The data can be put
in a table.
        D+   D−  total
S        2   13    15
N       14    1    15
total   16   14    30
Before analysing the data we should say a few words about the experiment and experimental
design. First, it is clearly unethical: to gain some information they infected 16 people with
malaria. We also need to worry about bias. How did they choose the test subjects? Is
it possible the noncarriers were weaker and thus more susceptible to malaria than the
carriers? Berry points out that it is reasonable to assume that an injection is similar to
a mosquito bite, but it is not guaranteed. This last point means that if the experiment
shows a relation between sickle-cell and protection against injected malaria, we need to
consider the hypothesis that the protection from mosquito transmitted malaria is weaker or
non-existent. Finally, we will frame our hypothesis as ’sickle-cell protects against malaria’,
but really all we can hope to say from a study like this is that ’sickle-cell is correlated with
protection against malaria’.
Model. For our model let θS be the probability that an injected carrier S develops malaria
and likewise let θN be the probability that an injected noncarrier N develops malaria. We
assume independence between all the experimental subjects. With this model, the likelihood
is a function of both θS and θN :
P(data | θS, θN) = c · θS^2 (1 − θS)^13 · θN^14 (1 − θN)^1
As usual we leave the constant factor c as a letter. (It is a product of two binomial
coefficients: c = C(15, 2) · C(15, 14).)
Hypotheses. Each hypothesis consists of a pair (θN , θS ). To keep things simple we will
only consider a finite number of values for these probabilities. We could easily consider
many more values or even a continuous range of hypotheses. Assume θS and θN are each
one of 0, 0.2, 0.4, 0.6, 0.8, 1. This leads to two-dimensional tables.
First is a table of hypotheses. The color coding indicates the following:
1. Light blue squares along the diagonal are where θS = θN , i.e. sickle-cell makes no
difference one way or the other.
2. Orange and darker blue squares above the diagonal are where θN > θS , i.e. sickle-cell
provides some protection against malaria.
3. In the orange squares θN − θS ≥ 0.6, i.e. sickle-cell provides a lot of protection.
4. White squares below diagonal are where θS > θN , i.e. sickle-cell actually increases the
probability of developing malaria.
Suppose we have no opinion whatsoever on whether and to what degree sickle-cell protects
against malaria. In this case it is reasonable to use a flat prior. Since there are 36 hypotheses
each one gets a prior probability of 1/36. This is given in the table below. Remember each
square in the table represents one hypothesis. Because it is a probability table we include
the marginal pmfs.
Some protection: P (θN > θS ) = sum of orange and darker blue = 0.99995
The experiment was not run without prior information. There was a lot of circumstantial
evidence that the sickle-cell gene offered some protection against malaria. For example it
was reported that a greater percentage of carriers survived to adulthood.
Here’s one way to build an informed prior. We’ll reserve a reasonable amount of probability
for the hypotheses that S gives no protection. Let’s say 24% split evenly among the 6 (light
blue) cells where θN = θS . We know we shouldn’t set any prior probabilities to 0, so let’s
spread 6% of the probability evenly among the 15 white cells below the diagonal. That
leaves 70% of the probability for the 15 orange and darker blue squares above the diagonal.
Informed prior p(θS , θN ): makes use of prior information that sickle-cell is protective.
We then compute the posterior pmf.
Some protection: P (θN > θS ) = sum of orange and darker blue = 0.99996
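The two posterior probabilities quoted above can be reproduced numerically. A Python sketch (the course uses R; this is an illustration) that sums the posterior over the hypotheses where sickle-cell is protective:

```python
# Posterior probability that sickle-cell is protective, P(θN > θS),
# on the 6 x 6 grid of hypotheses described in the text.
thetas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def likelihood(ts, tn):
    # 2 of 15 carriers and 14 of 15 noncarriers developed malaria;
    # the binomial constant c cancels when we normalize, so drop it.
    return ts**2 * (1 - ts)**13 * tn**14 * (1 - tn)

def prob_protective(prior):
    """P(θN > θS | data) for a prior given as a function prior(ts, tn)."""
    weights = {(ts, tn): prior(ts, tn) * likelihood(ts, tn)
               for ts in thetas for tn in thetas}
    total = sum(weights.values())
    return sum(w for (ts, tn), w in weights.items() if tn > ts) / total

flat = lambda ts, tn: 1 / 36

def informed(ts, tn):
    # 24% spread over the 6 diagonal cells, 6% over the 15 cells below
    # the diagonal, 70% over the 15 cells above it, as in the text.
    if tn == ts:
        return 0.24 / 6
    return 0.70 / 15 if tn > ts else 0.06 / 15

print(round(prob_protective(flat), 5))      # ≈ 0.99995
print(round(prob_protective(informed), 5))  # ≈ 0.99996
```

Note how little the answer moves between the flat and informed priors: with this much data the likelihood dominates.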
4.3 PDALX (probability the difference is at least x)
The following plot is based on the flat prior. For each x, it gives the probability that
θN − θS ≥ x. To make it smooth we used many more hypotheses.
[Figure: plot of P(θN − θS ≥ x) against x; vertical axis "Prob. diff. at least x" running from 0.0 to 1.0.]
Note by Prof. Mayer: This script is the work of the original authors (Jeremy Orloff and Jonathan Bloom), who
appear in the title; all credit for this work belongs to them. The only modifications made to fit it to "Statistics"
(DIT, Faculty AI, Study course AIN-B) are leaving out certain topics (e.g. the complete frequentist statistics, but
also other topics like the beta distribution) and changing the class numbering.
The complete original script (and more: slides, exams, question sheets, and solutions) can be downloaded at MIT
OpenCourseWare: Introduction To Probability And Statistics, 18.05, Spring 2014, Undergraduate.
The OpenCourseWare material is distributed under the following Terms of use and license:
https://ocw.mit.edu/pages/privacy-and-terms-of-use/
Prof. Mayer thanks Jeremy Orloff and Jonathan Bloom not only for making the course material open to the public,
but also for giving him access to the original LaTeX sources to tailor the material better to the DIT's needs. Without
their help, this course would not have nearly the quality it has today.
Probability intervals
Class 13b, Statistics, AIN-B
Jeremy Orloff and Jonathan Bloom
1 Learning Goals
2 Probability intervals
Suppose we have a pmf p(θ) or pdf f (θ) describing our belief about the value of an unknown
parameter of interest θ.
Definition: A p-probability interval for θ is an interval [a, b] with P (a ≤ θ ≤ b) = p.
Notes.
1. In the discrete case with pmf p(θ), this means Σ_{a ≤ θi ≤ b} p(θi) = p.
2. In the continuous case with pdf f(θ), this means ∫_a^b f(θ) dθ = p.
3. We may say 90%-probability interval to mean 0.9-probability interval. Probability
intervals are also called credible intervals to contrast them with confidence intervals, which
we will introduce in the frequentist unit.
Example 1. Between the 0.05 and 0.55 quantiles is a 0.5 probability interval. There are
many 50% probability intervals, e.g. the interval from the 0.25 to the 0.75 quantiles.
In particular, notice that the p-probability interval for θ is not unique.
Q-notation. We can phrase probability intervals in terms of quantiles. Recall that the
s-quantile for θ is the value qs with P (θ ≤ qs ) = s. So for s ≤ t, the amount of probability
between the s-quantile and the t-quantile is just t − s. In these terms, a p-probability
interval is any interval [qs , qt ] with t − s = p.
Example 2. We have 0.5 probability intervals [q0.25 , q0.75 ] and [q0.05 , q0.55 ].
1. Different p-probability intervals for θ may have different widths. We can make the width
smaller by centering the interval under the highest part of the pdf. Such an interval is
usually a good choice since it contains the most likely values. See the examples below for
normal and beta distributions.
2. Since the width can vary for fixed p, a larger p does not always mean a larger width.
Here’s what is true: if a p1 -probability interval is fully contained in a p2 -probability interval,
then p1 is smaller than p2 .
Probability intervals for a normal distribution. The figure shows a number of prob-
ability intervals for the standard normal.
1. All of the blue bars span a 0.68-probability interval. Notice that the smallest blue bar
runs between -1 and 1. This runs from the 16th percentile to the 84th percentile so it is a
symmetric interval.
2. All the green bars span a 0.9-probability interval. They are longer than the blue bars
because they include more probability. Note again that the shortest green bar is symmetric.
[Figure: pdf of N(0, 1) with the 0.68- and 0.9-probability intervals drawn as horizontal bars; horizontal axis from −3 to 3.]
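Both claims about the normal intervals can be checked numerically. A minimal Python sketch (the course uses R; Φ is built here from `math.erf`):

```python
# Check: [-1, 1] is a 0.68-probability interval for N(0, 1), and the
# symmetric 0.68-interval is the shortest one.
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def quantile(p, lo=-10.0, hi=10.0):
    """Invert Phi by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(round(Phi(1) - Phi(-1), 4))  # 0.6827

# Among 0.68-probability intervals [q_s, q_{s+0.68}], the symmetric
# choice s = 0.16 gives the smallest width.
widths = {s: quantile(s + 0.68) - quantile(s) for s in [0.02, 0.16, 0.30]}
assert widths[0.16] == min(widths.values())
```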
Probability intervals for a beta distribution. The following figure shows probability
intervals for a beta distribution. Notice how the two blue bars have very different lengths
yet cover the same probability p = 0.68.
[Figure: pdf of beta(10, 4) with two 0.68-probability intervals of different lengths drawn as horizontal bars.]
Probability intervals are an intuitive and effective way to summarize and communicate your
beliefs. It’s hard to describe an entire function f (θ) to a friend in words. If the function isn’t
from a parameterized family then it’s especially hard. Even with a beta distribution, it’s
easier to interpret “I think θ is between 0.45 and 0.65 with 50% probability” than “I think θ
follows a beta(8,6) distribution”. An exception to this rule of communication might be the
normal distribution, but only if the recipient is also comfortable with standard deviation.
Of course, what we gain in clarity we lose in precision, since the function contains more
information than the probability interval.
Probability intervals also play well with Bayesian updating. If we update from the prior
f (θ) to the posterior f (θ|x), then the p-probability interval for the posterior will tend to be
shorter than the p-probability interval for the prior. In this sense, the data has made
us more certain. See for example the election example below.
Probability intervals are also useful when we do not have a pmf or pdf at hand. In this
case, subjective probability intervals give us a method for constructing a reasonable prior
for θ “from scratch”. The thought process is to ask yourself a series of questions, e.g., ‘what
is my expected value for θ?’; ‘my 0.5-probability interval?’; ‘my 0.9-probability interval?’
Then build a prior that is consistent with these intervals.
Example 3. In 2013, Republican Mark Sanford ran against Democrat Elizabeth Colbert
Busch in a special election for a South Carolina congressional seat. Let θ be the fraction of
the vote Sanford will receive. We will build a prior for θ from two pieces of evidence:
In the district in the 2012 presidential election the Republican Romney beat the
Democrat Obama 58% to 40%.
The Colbert bump: Elizabeth Colbert Busch is the sister of well-known comedian
Stephen Colbert.
Our strategy will be to use our intuition to construct some probability intervals and then
find a beta distribution that approximately matches these intervals. This is subjective so
someone else might give a different answer.
Step 1. Use the evidence to construct 0.5 and 0.9 probability intervals for θ.
We’ll start by thinking about the 90% interval. The single strongest prior evidence is the
58% to 40% of Romney over Obama. Given the negatives for Sanford we don’t expect he’ll
win much more than 58% of the vote. So we’ll put the top of the 0.9 interval at 0.65. With
all of Sanford’s negatives he could lose big. So we’ll put the bottom at 0.3.
0.9 interval: [0.3, 0.65]
For the 0.5 interval we’ll pull these endpoints in. It really seems unlikely Sanford will get
more votes than Romney, so we can leave 0.25 probability that he’ll get above 57%. The
lower limit seems harder to predict. So we’ll leave 0.25 probability that he’ll get under 42%.
0.5 interval: [0.42, 0.57]
Step 2. Use our 0.5 and 0.9 probability intervals to pick a beta distribution that approximates
these intervals. We used the R function pbeta and a little trial and error to choose
beta(11,12). Here is our R code.
a = 11
b = 12
# probability mass in each proposed interval
pbeta(0.65, a, b) - pbeta(0.3, a, b)    # 0.9 interval [0.3, 0.65]
pbeta(0.57, a, b) - pbeta(0.42, a, b)   # 0.5 interval [0.42, 0.57]
This computed P ([0.3, 0.65]) = 0.91 and P ([0.42, 0.57]) = 0.52. So our intervals are actually
0.91 and 0.52-probability intervals. This is pretty close to what we wanted!
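The same check can be done without R. A Python stand-in for pbeta (an illustration, not the course's code), with the beta(11, 12) CDF computed by numerical integration:

```python
# Reproduce the pbeta check for beta(11, 12) by integrating its density.
from math import factorial

a, b = 11, 12
# normalizing constant: B(a, b) = (a-1)! (b-1)! / (a+b-1)!
B = factorial(a - 1) * factorial(b - 1) / factorial(a + b - 1)

def beta_cdf(x, steps=20_000):
    """Trapezoid-rule integral of the beta(a, b) density over [0, x]."""
    h = x / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * t**(a - 1) * (1 - t)**(b - 1)
    return total * h / B

p90 = beta_cdf(0.65) - beta_cdf(0.3)   # ≈ 0.91
p50 = beta_cdf(0.57) - beta_cdf(0.42)  # ≈ 0.52
print(p90, p50)
```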
The plot below shows the density of beta(11,12). The horizontal orange line shows our
interval [0.42, 0.57] and the blue line shows our interval [0.3, 0.65].
[Figure: PDF for beta(11, 12) with the intervals [0.42, 0.57] (orange) and [0.3, 0.65] (blue) marked.]
The method in Example 3 gives a good feel for building priors from probability intervals.
Here we illustrate a slightly different way of building a prior by estimating quantiles. The
basic strategy is to first estimate the median, then divide and conquer to estimate the first
and third quartiles. Finally you choose a prior distribution that fits these estimates.
Example 4. Redo the Sanford vs. Colbert-Busch election example using quantiles.
Solution: We start by estimating the median. Just as before the single strongest evidence
is the 58% to 40% victory of Romney over Obama. However, given Sanford’s negatives and
Busch’s Colbert bump we’ll estimate the median at 0.47.
In a district that went 58 to 40 for the Republican Romney it’s hard to imagine Sanford’s
vote going a lot below 40%. So we'll estimate Sanford's 25th percentile as 0.40. Likewise,
given his negatives it’s hard to imagine him going above 58%, so we’ll estimate his 75th
percentile as 0.55.
We used R to search through values of a and b for the beta distribution that matches these
quartiles the best. Since the beta distribution does not require a and b to be integers we
looked for the best fit to 1 decimal place. We found beta(9.9, 11.0). Below is a plot of
beta(9.9,11.0) with its actual quartiles shown. These match the desired quartiles pretty
well.
[Figure: PDF for beta(9.9, 11.0) with its quartiles marked: q0.25 = 0.399, q0.5 = 0.472, q0.75 = 0.547.]
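The quartiles of beta(9.9, 11.0) can be verified numerically. A Python sketch (the course's search over a and b was done in R and is not reproduced here; `math.gamma` handles the non-integer parameters):

```python
# Verify the quartiles of beta(9.9, 11.0) by inverting its CDF.
from math import gamma

a, b = 9.9, 11.0
B = gamma(a) * gamma(b) / gamma(a + b)  # normalizing constant

def beta_cdf(x, steps=4_000):
    """Trapezoid-rule integral of the beta(a, b) density over [0, x]."""
    h = x / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        w = 0.5 if i in (0, steps) else 1.0
        total += w * t**(a - 1) * (1 - t)**(b - 1)
    return total * h / B

def beta_quantile(p):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if beta_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for p in (0.25, 0.5, 0.75):
    print(p, round(beta_quantile(p), 3))
# quartiles ≈ 0.399, 0.472, 0.547, matching the figure
```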
Historic note. In the election Sanford won 54% of the vote and Busch won 45.2%. (Source:
https://elections.huffingtonpost.com/2013/mark-sanford-vs-elizabeth-colbert-busch-sc1