
5 Conditional probability

When studying a random experiment, it sometimes happens that we obtain some new information.
This information might change our perspective significantly; perhaps some outcomes are now much
more likely, while others are impossible. It is therefore natural to change the probability distribution
in light of this information, and this gives rise to the important topic of conditional probability.

5.1 Conditional probability

To begin, it is helpful to focus on an example.


Example 5.1. A deck of 52 cards is shuffled and you are dealt a card from the top. As in Section 1, we can take Ω = {A♥, A♦, A♣, A♠, K♥, . . . , 2♥, 2♦, 2♣, 2♠}. As each outcome is equally likely to occur, we take P to be uniform on Ω, so that each card has probability 1/52 of appearing.
Now suppose that before your card is dealt, you glimpse that A♥ is on the bottom of the deck. This
card can no longer appear, so it no longer seems reasonable to assign probability 1/52 to this outcome...
but what should the new probabilities be?
Definition 5.2 (Conditional probability). Let P be a probability distribution on a sample space Ω and let A, B ⊆ Ω be events with P(B) > 0. The conditional probability of A given B is defined by

P(A|B) := P(A ∩ B) / P(B).

If P(B) = 0, then we set P(A|B) := 0.
Remark 5.3.

• It is a good exercise to show that if P(B) > 0 then the map A ↦ P(A|B) is a probability distribution¹. This is often called the conditional distribution of P given B.

• Note that if P(B) > 0 then P(A|B) = P(A) if and only if A and B are independent.

• Intuitively, P(A|B) is the probability of the event A conditional on the occurrence of the event B. Noting that P(A ∩ B) = P(B) · P(A|B), we can view the probability that 'both A and B occur' as the probability that 'B occurs' times the probability that 'A occurs conditional on the occurrence of B'.
Example 5.4. Returning to Example 5.1, we are conditioning on the event B = Ω \ {A♥}, since A♥ is now impossible. Definition 5.2 then gives that the new probability distribution should be P(·|B). In particular, for any outcome ω ≠ A♥ we have P(ω|B) = (1/52)/(51/52) = 1/51. Thus, in this case, we get the uniform distribution on the remaining 51 cards, which should seem quite natural.

Example 5.5. We roll a fair die. What is the probability that the die shows a prime, given the information that the outcome is at most five?
The sample space is Ω = {1, . . . , 6} and P is the uniform distribution on Ω. Writing A = {2, 3, 5} and B = {1, 2, 3, 4, 5}, we are asked for P(A|B) = P(A ∩ B)/P(B) = (3/6)/(5/6) = 3/5.
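The definition also has a frequency interpretation: among many independent repetitions in which B occurs, the proportion in which A also occurs approximates P(A|B). The following Python sketch (not part of the original notes; the seed and trial count are arbitrary choices) estimates the answer of Example 5.5 this way.

```python
# A minimal sketch: estimating P(A|B) from Example 5.5 by simulation.
# We roll a fair die many times, keep only the rolls lying in B = {1,...,5},
# and record how often the kept roll lies in A = {2, 3, 5}.
import random

random.seed(0)
n_trials = 100_000
count_b = 0          # rolls for which B occurs
count_a_and_b = 0    # rolls for which both A and B occur

for _ in range(n_trials):
    roll = random.randint(1, 6)
    if roll <= 5:                 # event B: the outcome is at most five
        count_b += 1
        if roll in (2, 3, 5):     # event A: the outcome is prime
            count_a_and_b += 1

print(count_a_and_b / count_b)    # close to 3/5 = 0.6
```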
¹ That is, check that the properties from Definition 1.6 hold for A ↦ P(A|B), provided P(B) > 0.

Figure 1: The blue set in the left-hand square denotes an event A, and the red set in the middle square
denotes a second event B from the same sample space. Thinking of the areas of these sets as representing their
probabilities, the conditional probability P(A|B) equals (area of green set)/(area of green and yellow set).

Example 5.6. Mrs Smith has two children. If at least one of them is a girl, what is the probability
that both children are girls?

We can take our sample space Ω = {(B, B), (B, G), (G, B), (G, G)}, where the first entry in each pair indicates the older child. The uniform distribution on Ω is the natural choice. The probability that both children are girls, given that at least one is a girl, is equal to²

P({(G, G)} | {(G, B), (B, G), (G, G)}) = P({(G, G)}) / P({(G, G), (G, B), (B, G)}) = (1/4)/(3/4) = 1/3.
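To see why the answer is 1/3 rather than 1/2, it can help to simulate the experiment. The sketch below (illustrative only, not from the notes) draws both children's sexes uniformly at random, discards the families without a girl, and reports how often both children are girls.

```python
# Simulating Example 5.6: condition on "at least one girl" by discarding
# the other families, then estimate the probability that both are girls.
import random

random.seed(1)
with_girl = 0
two_girls = 0

for _ in range(200_000):
    older, younger = random.choice("BG"), random.choice("BG")
    if "G" in (older, younger):                   # at least one girl
        with_girl += 1
        if older == "G" and younger == "G":
            two_girls += 1

print(two_girls / with_girl)    # close to 1/3
```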

5.2 The law of total probability and Bayes’ formula

Theorem 5.7 (Law of total probability). For events A, B ⊆ Ω, we have

P(A) = P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ).

Proof. Note that the sets A ∩ B and A ∩ Bᶜ are disjoint and that their union is A. As P is additive for disjoint events, we find

P(A) = P((A ∩ B) ∪ (A ∩ Bᶜ)) = P(A ∩ B) + P(A ∩ Bᶜ) = P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ),

where the final equality uses the definition of P(A|B) and P(A|Bᶜ).

Remark 5.8. Theorem 5.7 is very useful, as it relates the probability of an event to certain conditional probabilities. The result also easily generalizes. Let B1, B2, . . . ⊆ Ω be a finite or infinite partition of Ω, i.e. B1, B2, . . . are pairwise disjoint and ⋃_{i≥1} Bi = Ω. Then given any event A ⊆ Ω, we have

P(A) = Σ_{i≥1} P(A|Bi) · P(Bi).

² This problem is the so-called Boy or Girl paradox, dating back to a 1959 article of Martin Gardner. There are many amusing variants. For example, what is the probability that both children are girls given that at least one child is a girl born on a Tuesday? The answer now is neither 1/2 nor 1/3.

Figure 2: The law of total probability: P(A) can be decomposed into P(A ∩ B) and P(A ∩ Bᶜ) as illustrated.

Example 5.9 (Snow and ice). If there is snow and ice on a morning then the probability that we are late to work is 0.6. Without such conditions this probability is just 0.1. Suppose snow and ice are expected on nine days in January. What is the probability of being late to work on a given day in January?
Write A for the event 'we are late to work' and B for the event 'there is snow and ice'. Then we are given P(B) = 9/31, P(A|B) = 0.6 and P(A|Bᶜ) = 0.1. Hence, the desired probability is

P(A) = P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ) = 0.6 · (9/31) + 0.1 · (22/31) = 0.245 [3dp].
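The same arithmetic can be written as a few lines of Python (a hedged illustration only; the variable names are ours, not the notes').

```python
# Law of total probability for Example 5.9: weight the two conditional
# probabilities of being late by the probability of snow and ice.
p_snow = 9 / 31                  # P(B)
p_late_given_snow = 0.6          # P(A|B)
p_late_given_clear = 0.1         # P(A|B^c)

p_late = p_late_given_snow * p_snow + p_late_given_clear * (1 - p_snow)
print(round(p_late, 3))          # 0.245
```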
Example 5.10 (Phone shop). A shop sells mobile phones from three factories: 60% of the phones
are produced at factory G(enuine), 30% are produced at factory F (ake) and the remainder at factory
M (ixed). Phones from factory G work with probability 0.98. Phones from factory F only work with
probability 0.3. Factory M is in between: half of the phones produced there are identical to those
from G, and the other half are identical to those from factory F . If we buy a phone from the shop,
what is the probability that it works?
Write A for the event ‘the phone works’ and B for the event ‘the phone is genuine’. Then, P(A|B) = 0.98 and P(A|Bᶜ) = 0.3. Next we calculate P(B). Let G_d denote the event that ‘the phone comes from factory G’, F_d the event that ‘it comes from F’ and M_d the event that ‘it comes from M’. As G only makes genuine phones, we have P(B|G_d) = 1. Similarly P(B|F_d) = 0, while P(B|M_d) = 1/2. By the generalized law of total probability (Remark 5.8),

P(B) = P(B|G_d) · P(G_d) + P(B|F_d) · P(F_d) + P(B|M_d) · P(M_d) = 1 · 0.6 + 0 · 0.3 + 0.5 · 0.1 = 0.65.
Finally, the desired probability is
P(A) = P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ) = (0.98) · (0.65) + (0.3) · (0.35) = 0.742.
Example 5.11 (Coin and dice). We flip a fair coin and roll two fair dice. Let X and Y denote the outcomes of the first and second dice rolls. If the coin comes up tails, we set Z = X · Y; otherwise, if it comes up heads, we set Z = X + Y. What is P(Z = 10)? Let B denote the event ‘the coin comes up tails’. Using the law of total probability, we can compute

P(Z = 10) = P(Z = 10|B) · P(B) + P(Z = 10|Bᶜ) · P(Bᶜ) = (1/2) · P(X · Y = 10) + (1/2) · P(X + Y = 10)
= (1/2) · (1/18) + (1/2) · (1/12) = 5/72.
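A short simulation (ours, not part of the notes) is a useful cross-check of this value, since the two conditional probabilities P(X · Y = 10) = 1/18 and P(X + Y = 10) = 1/12 are easy to miscount.

```python
# Simulating Example 5.11: flip a fair coin, roll two dice, and set
# Z = X*Y on tails or Z = X+Y on heads; estimate P(Z = 10).
import random

random.seed(2)
n_trials = 500_000
hits = 0

for _ in range(n_trials):
    x, y = random.randint(1, 6), random.randint(1, 6)
    tails = random.random() < 0.5          # fair coin
    z = x * y if tails else x + y
    hits += (z == 10)

print(hits / n_trials, 5 / 72)             # both close to 0.0694
```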

3
Theorem 5.12 (Bayes’ formula). Let A ⊆ Ω be an event with P(A) > 0. Then, for any event B ⊆ Ω,

P(B|A) = P(A|B) · P(B) / (P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ)).

Proof. From Definition 5.2 we have

P(B|A) = P(A ∩ B) / P(A) = P(A|B) · P(B) / (P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ)),

where the second equality again uses Definition 5.2 and Theorem 5.7.

Remark 5.13. Again, as in the law of total probability, there is the following more general variant
of Bayes’ formula: for a finite or infinite partition B1 , B2 , . . . ⊆ Ω of Ω and all i ≥ 1,
P(Bi|A) = P(A|Bi) · P(Bi) / (Σ_{k≥1} P(A|Bk) · P(Bk)).

Example 5.14 (Diseases). Researchers develop a new medical test for a disease. From surveys, it is
known that about 1% of the population carries the disease. It is also known that with probability 0.9
the test is positive when applied to an infected person. Unfortunately, the test also occasionally gives
a false positive; the test is positive for a non-infected patient with probability 0.05. If a patient tests
positive for the disease, what is the probability that they are infected?
Letting A denote the event that ‘the test result is positive’ and B denote the event that ‘the selected person carries the disease’, we are given that P(B) = 0.01, P(A|B) = 0.9, and that P(A|Bᶜ) = 0.05. Applying Bayes’ formula with these values then gives

P(B|A) = P(A|B) · P(B) / (P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ)) = (0.9 · 0.01) / (0.9 · 0.01 + 0.05 · 0.99) = 0.153 . . .
Note: This test initially appears quite effective in screening for the disease, but our calculation gives
that in a series of tests of randomly selected people, one can expect that more than five out of six
positive test results are false positives.
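Bayes’ formula is easy to package as a small function. The helper below is hypothetical (it is not defined anywhere in these notes) and simply evaluates Theorem 5.12 for the numbers of Example 5.14.

```python
def bayes(p_b, p_a_given_b, p_a_given_bc):
    """Return P(B|A) computed from P(B), P(A|B) and P(A|B^c) via Theorem 5.12."""
    numerator = p_a_given_b * p_b
    return numerator / (numerator + p_a_given_bc * (1 - p_b))

# 1% prevalence, 90% true-positive rate, 5% false-positive rate.
print(bayes(0.01, 0.9, 0.05))   # ≈ 0.1538: most positive results are false positives
```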
Example 5.15 (Twins). About 0.3% of all birth events result in identical twins and 0.7% lead to fraternal twins³. Identical twins have the same sex, but the sexes of fraternal twins are independent.
If two girls are twins, what is the probability that they are fraternal twins?
Let A denote the event that ‘the twins are girls’ and B the event that ‘the twins are fraternal twins’. Since we already know the children are twins, we are given P(B) = 0.007/(0.003 + 0.007) = 0.7, P(A|B) = 0.25 and P(A|Bᶜ) = 0.5. Hence, by Bayes’ formula,

P(B|A) = P(A|B) · P(B) / (P(A|B) · P(B) + P(A|Bᶜ) · P(Bᶜ)) = (0.25) · (0.7) / ((0.25) · (0.7) + (0.5) · (0.3)) = 7/13.

Bayes’ formula in court. Bayes’ formula and other statements involving conditional probabilities are often used to support arguments in court. For example, a defendant might argue that the probability that they are guilty given the evidence is very small⁴. A very famous case from the last century,
which featured several dubious applications of Bayes’ formula, was the O.J. Simpson trial (California
vs Simpson). You might like to look up prosecutor’s fallacy on the web to find out more on this topic.
³ This excludes babies conceived through in-vitro fertilisation, where fraternal twin rates are much higher.
⁴ A typical example would be DNA or blood factors recorded at a crime scene, or similarities in paternity tests.


Figure 3: Mass function for X ∼ geo_p for p = 1/5 (left) and p = 1/3 (right).

5.3 Memoryless distributions

In Section 1 we met a probability distribution which described the time taken until the first success
in a repeated random experiment, where in each iteration the probability of success was p and all
successes were independent (see Example 1.16). This is a prototype for the geometric distribution.

Definition 5.16 (Geometric distribution). The geometric distribution⁵ with parameter p ∈ (0, 1] is the probability distribution on {1, 2, . . .} given by

geo_p(k) = p(1 − p)^(k−1), for k ∈ {1, 2, . . .}.

A random variable X follows this distribution if S_X = {1, 2, . . .} and P(X = k) = geo_p(k) for k ∈ S_X. We then write X ∼ geo_p. The distribution function F_X is then given, for t ∈ R, by

F_X(t) = 1 − (1 − p)^⌊t⌋ for t ≥ 0, and F_X(t) = 0 for t < 0.

Example 5.17 (Lottery). The probability of winning the National Lottery jackpot in the UK⁶ is roughly p = 7.15 · 10⁻⁸. How many times would we need to play in order to have at least a 1% chance of winning the jackpot at least once?
Let X denote the first time we would win the jackpot if we were to repeatedly play. Then X ∼ geo_p. Therefore, if we wish to find k ∈ {1, 2, . . .} with F_X(k) ≥ 0.01, we are led to the calculation

0.99 ≥ 1 − F_X(k) = (1 − p)^k ⟺ k ≥ log 0.99 / log(1 − p) ≈ 140564.

For context, if we were to play twice a week then we would need to play for more than 1350 years!
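The calculation is easily reproduced numerically (a sketch under the same assumptions as the example: independent plays, each with win probability p = 7.15 · 10⁻⁸).

```python
import math

p = 7.15e-8
k_min = math.log(0.99) / math.log(1 - p)
print(k_min)                        # ≈ 140564, as in the calculation above

k = math.ceil(k_min)                # smallest whole number of plays
print(1 - (1 - p) ** k >= 0.01)     # True: k plays give at least a 1% chance
print(k / (2 * 52))                 # more than 1350 years at two plays a week
```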
⁵ Many textbooks define the geometric distribution with mass function p(1 − p)^k for k = 0, 1, . . .. You could view this as the number of failures before the first success. As the number of failures differs by exactly one from the time of the first success, there is no conceptual difference between the two models – although this shift is a nuisance.
⁶ This refers to the old 6 out of 49 game played until October 2015.

Example 5.18 (Birth control). A couple decides to continue to have children until the first girl is
born (and at that point, they stop). Let X denote the number of their children. Assuming that there
are no medical reasons for which they would have to adjust their strategy, we have X ∼ geo_0.5. Thus, the probability they have at least 3 boys is P(X ≥ 4) = 1 − P(X ≤ 3) = 1/8.

Proposition 5.19. Let X ∼ geo_p with p ∈ (0, 1). Then

P(X ≥ n + m | X ≥ n) = P(X ≥ m + 1) for all n ≥ 1, m ≥ 0. (1)

Proof. For all n ∈ {1, 2, . . .} we have P(X ≥ n) = 1 − F_X(n − 1) = (1 − p)^(n−1). Since {X ≥ n + m} ⊆ {X ≥ n}, it follows that

P(X ≥ n + m | X ≥ n) = P(X ≥ n + m) / P(X ≥ n) = (1 − p)^(n+m−1) / (1 − p)^(n−1) = (1 − p)^m = P(X ≥ m + 1),

as required.
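The memoryless property can also be observed empirically. The sketch below (not from the notes; p, n and m are arbitrary choices) samples geometric random variables and compares the conditional tail probability with the unconditional one.

```python
# Checking Proposition 5.19 by simulation: P(X >= n+m | X >= n) should be
# close to P(X >= m+1) = (1-p)^m.
import random

random.seed(3)
p, n, m = 0.3, 4, 5

def geometric(p):
    """Number of independent trials (success probability p) up to and including the first success."""
    k = 1
    while random.random() >= p:
        k += 1
    return k

samples = [geometric(p) for _ in range(200_000)]
conditioned = [x for x in samples if x >= n]

lhs = sum(x >= n + m for x in conditioned) / len(conditioned)
rhs = sum(x >= m + 1 for x in samples) / len(samples)
print(lhs, rhs, (1 - p) ** m)      # all three should be close
```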

Remark 5.20. Property (1) is often referred to as the memoryless property of geometric distributions.
Identity (1) becomes a little nicer if we consider Y = X − 1, that is, the number of failures before the
first success. Then, the memoryless property states that

P(Y ≥ n + m | Y ≥ n) = P(Y ≥ m), for all n, m ∈ {0, 1, 2, . . .}.

The continuous analogue of the geometric distribution is the exponential distribution⁷.

Definition 5.21 (Exponential distribution). Let λ > 0 and

f(x) = λe^(−λx) for x ≥ 0, and f(x) = 0 for x < 0.

The distribution with density f is called the exponential distribution with parameter λ. A continuous random variable X follows this distribution if X has density f. We then write X ∼ exp_λ, and

F_X(t) = 1 − e^(−λt) for t ≥ 0, and F_X(t) = 0 for t < 0.

The next proposition shows that the exponential distribution also has the memoryless property.

Proposition 5.22. Let X ∼ exp_λ with λ > 0. Then, for t, s ≥ 0, we have

P(X ≥ t + s | X ≥ t) = P(X ≥ s).

Proof. Given t ≥ 0, we have P(X ≥ t) = 1 − F_X(t) = e^(−λt). Then

P(X ≥ t + s | X ≥ t) = P(X ≥ t + s) / P(X ≥ t) = e^(−λ(t+s)) / e^(−λt) = e^(−λs) = P(X ≥ s),

as required.
⁷ In fact we already met this distribution too – see Question 12 on Problem Sheet 3.

Example 5.23 (Radioactive decay). Uranium-235 is a uranium isotope used as fuel in nuclear power plants (and weapons). Its half-life is ≈ 703.8 million years, which means that the lifetime X (in years) of a single atom is (approximately) exponentially distributed⁸ with λ = log 2 · 7.038⁻¹ · 10⁻⁸, since

P(X ≥ 7.038 · 10⁸) = exp(−λ · 7.038 · 10⁸) = 1/2.
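As a quick numerical sanity check (ours, not part of the notes), we can recover λ from the half-life and confirm that the exponential tail at 703.8 million years is exactly one half.

```python
import math

half_life = 7.038e8                  # years
lam = math.log(2) / half_life        # decay rate λ ≈ 9.85e-10 per year
print(lam)
print(math.exp(-lam * half_life))    # 0.5: an atom survives the half-life with probability 1/2
```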




Due to the memoryless property, the exponential distribution is often used to model

• the life-time of electronic devices,

• the time between two car accidents on the same road or plane crashes.

To conclude this section we calculate the expectation and variance of both of these distributions.
Proposition 5.24 (Geometric distribution). Let X ∼ geo_p with 0 < p < 1. Then,

E[X] = 1/p, Var(X) = (1 − p)/p², and σ_X = √(1 − p)/p.

Note that E[X] = 1/p is very natural: since the probability for a die to land on six is 1/6, one should expect to roll the die six times on average until the first six appears.

Proof. Write q := 1 − p. Then

E[X] = Σ_{k≥1} k(1 − q)q^(k−1) = Σ_{k≥1} kq^(k−1) − Σ_{k≥1} kq^k = Σ_{k≥0} (k + 1)q^k − Σ_{k≥0} kq^k = Σ_{k≥0} q^k = 1/p.

Similarly,

E[X²] = Σ_{k≥1} k²(1 − q)q^(k−1) = Σ_{k≥0} (k + 1)²q^k − Σ_{k≥0} k²q^k = 2 Σ_{k≥0} kq^k + Σ_{k≥0} q^k = 2 · (q/p) · E[X] + 1/p.

The result now follows from the formula Var(X) = E[X²] − (E[X])² = 2q/p² + 1/p − 1/p² = q/p² = (1 − p)/p².
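A Monte Carlo check of these formulas is straightforward; the sketch below (illustrative only, with an arbitrary choice of p) compares sample moments with 1/p and (1 − p)/p².

```python
# Sampling X ~ geo_p by counting trials until the first success, then
# comparing the empirical mean and variance with Proposition 5.24.
import random

random.seed(4)
p = 0.25
samples = []
for _ in range(200_000):
    k = 1
    while random.random() >= p:
        k += 1
    samples.append(k)

mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, 1 / p)                 # both ≈ 4
print(var, (1 - p) / p ** 2)       # both ≈ 12
```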


 

Proposition 5.25 (Exponential distribution). Let X ∼ exp_λ with λ > 0. We have

E[X] = 1/λ, Var(X) = 1/λ², and σ_X = 1/λ.

Proof. To evaluate these expressions we integrate by parts (with u = x, v = −e^(−λx)) to get

E[X] = ∫₀^∞ λx e^(−λx) dx = [−x e^(−λx)]₀^∞ + ∫₀^∞ e^(−λx) dx = 1/λ,

and (with u = x², v = −e^(−λx))

E[X²] = ∫₀^∞ λx² e^(−λx) dx = [−x² e^(−λx)]₀^∞ + 2 ∫₀^∞ x e^(−λx) dx = 2/λ².

Hence, Var(X) = E[X²] − (E[X])² = 2/λ² − 1/λ² = λ⁻².
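The two integrals can also be checked numerically with a simple Riemann sum (a rough sketch, not from the notes; λ = 2 and the truncation point are arbitrary).

```python
# Approximating E[X] and E[X^2] for X ~ exp_λ by a left Riemann sum on [0, 20/λ];
# the tail beyond that point is negligible.
import math

lam = 2.0
dx = 1e-4
xs = [i * dx for i in range(int(20 / lam / dx))]

e_x = sum(lam * x * math.exp(-lam * x) for x in xs) * dx
e_x2 = sum(lam * x ** 2 * math.exp(-lam * x) for x in xs) * dx

print(e_x, 1 / lam)                    # ≈ 0.5
print(e_x2, 2 / lam ** 2)              # ≈ 0.5
print(e_x2 - e_x ** 2, 1 / lam ** 2)   # variance ≈ 0.25
```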
 
⁸ The law of large numbers, which is discussed in the next section, will explain why, in a large sample of uranium-235 atoms, roughly 50% have decayed after 703.8 million years.

Example 5.26 (Birth control). Coming back to Example 5.18, recall that if X denotes the number of children born to the couple, we have E[X] = 2 as X ∼ geo_0.5. As one child is a girl, on average, the
couple has one son. This is a little surprising, as the strategy seems to favour girls. (More precisely, if
the entire population adopted this strategy, then there would be roughly as many girls born as boys.)

Example 5.27 (Dice). We roll a fair die repeatedly and let X be the number of rolls necessary for all faces to show up at least once. What is E[X]?
To calculate this, for each i ∈ {1, . . . , 6} let T_i denote the roll when the i-th new face first appears⁹. In particular, we have T_1 = 1 and T_6 = X. Then we can write

X = T_1 + (T_2 − T_1) + (T_3 − T_2) + (T_4 − T_3) + (T_5 − T_4) + (T_6 − T_5).

By linearity of expectation, our calculation now reduces to calculating E[T_{i+1} − T_i] for i ∈ {1, . . . , 5}. What is the distribution of T_{i+1} − T_i for i ∈ {1, . . . , 5}? At time T_i, we have seen i faces. The number of rolls until the next new face appears follows the geometric distribution with parameter p_i = (6 − i)/6. Hence, T_{i+1} − T_i ∼ geo_{p_i} and E[T_{i+1} − T_i] = 1/p_i = 6/(6 − i). By linearity of expectation, we find

E[X] = 1 + Σ_{i=1}^{5} E[T_{i+1} − T_i] = 1 + Σ_{i=1}^{5} 6/(6 − i) = 1 + 6 Σ_{i=1}^{5} 1/i = 14.7.

This calculation is a variant of the coupon collector problem, which appears frequently in applications¹⁰.
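A simulation (ours, not part of the notes) confirms the expected value of 14.7 rolls.

```python
# Simulating Example 5.27: roll a fair die until every face has appeared
# and average the number of rolls over many repetitions.
import random

random.seed(5)

def rolls_until_all_faces():
    seen = set()
    rolls = 0
    while len(seen) < 6:
        seen.add(random.randint(1, 6))
        rolls += 1
    return rolls

n_trials = 100_000
average = sum(rolls_until_all_faces() for _ in range(n_trials)) / n_trials
print(average)    # ≈ 14.7
```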

Most important takeaways in this chapter. You should

• understand the meaning of a conditional probability and know its formal definition,

• be able to compute conditional probabilities in simple scenarios,

• know the law of total probability and Bayes’ formula and be able to apply these in standard
situations,

• know the definitions of the geometric and the exponential distribution, their expectations and
variances and be able to use them in standard situations,

• be familiar with the memoryless property of distributions.

⁹ Note: T_i is not the first time that face i appears.
¹⁰ You might like to see the Wikipedia article on the coupon collector problem for more information.
