Introduction To Probability QMUL
Introduction to Probability
Rosemary J. Harris
School of Mathematical Sciences
0 Prologue 1
0.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
0.2 This Course . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
0.3 Further Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
0.4 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Properties of Probabilities 13
2.1 Kolmogorov’s Axioms . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Deductions from the Axioms . . . . . . . . . . . . . . . . . . . . . 15
2.3 Inclusion-Exclusion Formulae . . . . . . . . . . . . . . . . . . . . . 18
2.4 Equally-Likely Outcomes . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Further Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3 Sampling 24
3.1 Basics for Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Ordered Sampling with Replacement . . . . . . . . . . . . . . . . . 25
3.3 Ordered Sampling without Replacement . . . . . . . . . . . . . . . 26
3.4 Unordered Sampling without Replacement . . . . . . . . . . . . . . 28
3.5 Sampling in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.6 Further Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4 Conditional Probability 34
4.1 Introducing Extra Information . . . . . . . . . . . . . . . . . . . . 34
4.2 Implications of Extra Information . . . . . . . . . . . . . . . . . . . 36
4.3 The Multiplication Rule . . . . . . . . . . . . . . . . . . . . . . . . 38
4.4 Ordered Sampling Revisited . . . . . . . . . . . . . . . . . . . . . . 39
4.5 Further Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Independence 42
5.1 Independence for Two Events – Basic Definition . . . . . . . . . . 42
5.2 Independence for Two Events – More Details . . . . . . . . . . . . 43
5.3 Independence for Three or More Events . . . . . . . . . . . . . . . 45
5.4 Conditional Independence . . . . . . . . . . . . . . . . . . . . . . . 47
5.5 Further Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
12 Covariance and Conditional Expectation 103
12.1 Covariance and Correlation . . . . . . . . . . . . . . . . . . . . . . 103
12.2 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . 105
12.3 Law of Total Probability for Expectations . . . . . . . . . . . . . . 107
12.4 Further Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
A Errata 113
Chapter 0
Prologue
0.1 Motivation
Before starting the module properly, we will spend a bit of time thinking about
what probability is and why it is important to study. As a warm-up, I invite you
to think about the following question.
[You may well have seen a calculation similar to this before; if not, then try to
guess what the answer might be and perhaps discuss it with your class in the first
tutorial.]
In general terms, probability theory is about “chance”; it helps to describe
situations where there is some randomness, i.e., events we cannot predict with
certainty. Such situations could be truly random (arguably like tossing a coin)
or they may simply be beyond our knowledge (like the birthdays). Of course,
probability is about more than just description – we want to be able to quantify
the randomness (loosely speaking, “give it a number”). Indeed, for Exercise 0.1
the best answer would be not just something vague like “very low” but a fraction
or decimal with value depending on the size of your group. [You should certainly
be able to do this calculation by the end of Chapter 3. In fact, most people have
rather poor intuition for the birthday problem and similar probability questions
so you might find that your initial guess was a long way from the true answer.]
Introductory probability courses are typically full of examples involving birth-
days, dice, coins, playing cards, etc. and I’m afraid you will see plenty of them
here too. They enable us to clearly demonstrate the mathematical framework of
sample spaces and events (as introduced in Chapter 1) but you should not assume
that the applications of probability are limited to such artificial scenarios. To
name just a few more “real world” examples: forecasting the weather, predicting
financial markets, and modelling the spread of diseases all rely crucially on prob-
ability theory. At QMUL you will encounter probability in many later courses,
e.g., if you take “Actuarial Mathematics” modules you will be concerned with the
probabilities of life and death, together with their implications for life insurance.
Notice that in the above paragraphs we have not actually defined probability.
In fact, the question of what probability is does not have an entirely satisfactory
answer. We need to associate a number to an event which will measure the like-
lihood of it occurring but what does this really mean? Well, you can think of
the required number as some kind of limiting frequency – an informal definition is
given by the following procedure:
• Repeat the experiment a large number N of times;
• Suppose the event A comes up m times (among the N repetitions of the experiment);
• Then, in the limit of very large values of N, the ratio m/N gives the probability of the event A.
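This limiting-frequency idea is easy to simulate. The following Python sketch is my own illustration (the die-rolling experiment and the function names are not from the notes); it estimates a probability as the ratio m/N:

```python
import random

def estimate_probability(event, experiment, trials):
    """Estimate P(event) as m/N: the fraction of N repetitions of the
    experiment in which the event occurs."""
    m = sum(1 for _ in range(trials) if event(experiment()))
    return m / trials

random.seed(1)  # fixed seed so that repeated runs agree

roll_die = lambda: random.randint(1, 6)      # the experiment
is_even = lambda outcome: outcome % 2 == 0   # the event A

# For large N the ratio m/N settles near the true value 1/2.
print(estimate_probability(is_even, roll_die, 100_000))
```

Increasing `trials` makes the estimate cluster ever more tightly around the true probability, which is exactly the limiting-frequency picture described above.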
[The above exercise shows the danger of assuming that all outcomes are equally
likely – we will discuss this more carefully later.]
• ...be given £1, or toss a coin 10 times winning £1000 if it comes up heads every
time?
[This last exercise is not entirely a mathematical one and there is no right or wrong
answer for each part. However some tools from probability can help to explain the
various choices. In each case the average amount we expect to win and the degree
of variation in our gain are relevant. Properties of so-called random variables
(the focus of the second half of this course) can be used to describe the choices
quantitatively. Of course there are lots of extra ingredients which will influence
an individual’s decision (for instance how useful a particular sum of money is to
them, and how much they enjoy the excitement of taking a risk). One attempt to
model some of these extra factors mathematically is the idea of utility functions
from game theory but this is beyond the scope of the present module.]
must be sure to understand what these mean, both in everyday language (could you
explain them to your grandmother or your next-door neighbour?) and in terms
of the associated mathematical formalism. The notes also contain a number of
“examples” and “exercises”. For the former, you will find full details of the working
which you should study carefully to see not just how to get correct answers but
how to present your solutions (an important skill for university-level mathematics).
For the latter, at most the final answers are given in the notes, although some solutions will
be presented in interactive sessions (see below). A good way to check your own
understanding would be to read the text and try the associated exercises – please
ask if you have any difficulty getting answers. I intend the unstarred exercises to
correspond roughly to the Key Objectives for the course (i.e., everyone should be
able to do them); single-starred exercises should be attempted by those aiming
for a high grade and double-starred exercises are for those seeking a real challenge
(beyond the syllabus).
The content of the notes will also be presented in a series of recorded video
lectures (typically one per section) in which the main points will be emphasized
and further explanation and exam tips given. These lectures will feature some
“real-time” written demonstrations and you are strongly recommended to follow
along and fill in the gaps in the slides or annotate these notes – for most people,
the practice of writing out maths (rather than just looking at a screen) helps the
material to be absorbed and the necessary notation to become automatic. Corre-
sponding to each video lecture there will also be an interactive session in which
the solutions to some of the associated exercises will be presented and there will
be an opportunity for queries/misunderstandings to be addressed. The QMplus
page contains full information about course practicalities (timetable, coursework,
assessment details, etc.) and important announcements/corrections will also be
posted there.
0.4 Acknowledgements
Large sections of these notes are based on those of previous lecturers, Dr Robert
Johnson and Dr Wolfram Just, and I am greatly indebted to both of them. How-
ever, the responsibility for errors and typos remains mine; if you find any, please
contact me at [email protected].
Chapter 1
1.1 Framework
The general setting is that we perform an experiment and record an outcome.
Outcomes must be precisely specified, mutually exclusive and cover all possibilities.
Definition 1.1. The sample space is the set of all possible outcomes for the
experiment. It is denoted by S (or Ω).
Definition 1.2. An event is a subset of the sample space. The event occurs if
the actual outcome is an element of this subset.
To see how the framework is applied in practice, let’s consider some examples.
We start with a very simple one.
Example 1.1: Die roll
Suppose your experiment is to roll an ordinary six-sided die and record the number showing
as the outcome.
(a) Use set notation to write down the sample space.
(b) Denote by A the event “You roll an even number” and write it as a subset of the
sample space.
Solution:
(a) The sample space is the set containing the integers 1 to 6 (inclusive) so we write
S = {1, 2, 3, 4, 5, 6}.
(b) The event corresponding to the rolling of an even number is A = {2, 4, 6}.
The situation is more complicated when the outcome is not just an observation
of a single thing. In this case, in choosing your notation you need to think about
whether order matters.
Solution:
(a) Denoting a Head with h and a Tail with t, we can write the sample space as
where, for example, htt means the first toss is a Head, the second is a Tail, and the
third is also a Tail. Note that here order matters: htt is not the same as tth so we have
to include both. In general, if we need to include information about order, we can
either write an ordered list as above or use round brackets and commas, e.g., (h, t, t).
However, this is not the only way to write the sample space for this example. If you are
a (lazy) experimenter recording outcomes you might realise that you only need to note
which tosses are, say, Heads and you have all the information about what happened.
So, adding a tilde (“squiggle”) to the S to indicate we’ve changed notation, we could
write
S̃ = {{1, 2, 3}, {1, 2}, {1, 3}, {2, 3}, {1}, {2}, {3}, {}}
where we record the set of tosses which are Heads so the outcome {1, 3} corresponds
to hth. Note that the curly braces indicate that order doesn’t matter; {3, 1} would
mean exactly the same thing as {1, 3}. [Here we are representing the outcomes within
the sample space as sets and in set notation the order of elements is unimportant –
for a brief review of set theory see the next section.]
(b) Using the first notation for the sample space and denoting the events “Exactly one
Head is seen” and “The second toss is a Head” by B and C respectively, we have
B = {htt, tht, tth} and C = {hhh, hht, thh, tht}; in the tilde notation these become
B̃ = {{1}, {2}, {3}} and C̃ = {{2}, {1, 2}, {2, 3}, {1, 2, 3}}.
[Normally you should avoid using two different notations within the same solution but, if
for some reason you need to, you should distinguish them clearly, e.g., by S and S̃, S and
S1, or S and S′; in the last case, the prime is not to be confused with the set complement
for which we will use a c superscript. Notice too that the number of elements is always
invariant under a change of notation and this can sometimes provide a way to check your
answers are sensible.]
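For this three-toss example, both notations can be generated mechanically and compared. This short Python sketch is my own illustration, not part of the notes:

```python
from itertools import product

# Ordered notation: all 3-letter words over {h, t}.
S = [''.join(w) for w in product('ht', repeat=3)]

# Tilde notation: the set of toss numbers showing Heads.
S_tilde = [frozenset(i + 1 for i, c in enumerate(w) if c == 'h') for w in S]

B = [w for w in S if w.count('h') == 1]   # "Exactly one Head is seen"
C = [w for w in S if w[1] == 'h']         # "The second toss is a Head"

print(len(S), len(set(S_tilde)))  # sizes agree across notations
print(sorted(B))
```

Note that the two sample spaces have the same number of elements (eight), illustrating the remark that the number of outcomes is invariant under a change of notation.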
1.2 Basic Set Theory
If A is a set we write x ∈ A to mean that the object x is in the set A and say that x
is an element of A. If x is not an element of A then we write x ∉ A. For a finite set,
the size (or cardinality) is just the number of elements; if A = {a1 , a2 , . . . , an },
we write |A| = n. [Do not confuse the size of a set with the absolute value of a
number.]
We now summarize some facts which will be useful in the rest of this course.
Let A and B both be sets.
A ∪ B = {x : x ∈ A or x ∈ B}. (1.1)
• If all the elements of A are contained in the set B, we say that A is a subset
of B and write A ⊆ B.
• If all sets are subsets of some fixed set S, then Ac (“the complement of A”) is the
set of all elements of S which do not belong to A:
Ac = S \ A. (1.5)
[Footnote: Some books, including [Ros20, ASV18, Tij12], use AB without a “cap” in the
middle to denote the intersection of two events.]
• We say two sets A and B are disjoint (or mutually exclusive) if they have
no element in common, i.e., A ∩ B = {}. The empty set {} is often denoted
by ∅.
Exercise 1.6: Venn diagrams
Draw Venn diagrams to illustrate the bullet points above.
(a) Can the set D = {2, 4, 8} be expressed in terms of A, B, and C using intersections,
unions, and complements?
(b) Can the set E = {4, 5} be expressed in terms of A, B, and C using intersections,
unions, and complements?
It is very useful to remember the following two identities which are known as
De Morgan’s laws:
(A ∪ B)c = Ac ∩ B c , (1.6)
(A ∩ B)c = Ac ∪ B c . (1.7)
(a) You may have seen a proof of De Morgan’s laws elsewhere. Show that, in fact, you
can derive (1.7) from (1.6) so you only really need to remember one of them.
(b) Write down (or look up) the generalization of De Morgan’s laws to n sets.
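De Morgan’s laws can also be checked exhaustively on a small universe. This Python sketch is my own check (the universe {1, …, 8} is an arbitrary choice), not a substitute for a proof:

```python
from itertools import combinations

S = set(range(1, 9))   # an arbitrary small universe for the check

def subsets(universe):
    """Generate every subset of the universe."""
    items = sorted(universe)
    return (set(c) for r in range(len(items) + 1)
            for c in combinations(items, r))

# Check (A ∪ B)^c = A^c ∩ B^c and (A ∩ B)^c = A^c ∪ B^c for all pairs,
# taking complements relative to S.
ok = all((S - (A | B)) == (S - A) & (S - B) and
         (S - (A & B)) == (S - A) | (S - B)
         for A in subsets(S) for B in subsets(S))
print(ok)
```

Of course an exhaustive check over one universe is evidence, not a proof; the proof works for arbitrary sets.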
1.3 More on Events
• If A is an event then Ac contains the elements of the sample space which are
not contained in A, i.e., Ac is the event that “A does not occur”.
• If A and B are events then the event E1 “A and B both occur” consists of
all elements of both A and B, i.e., E1 = A ∩ B.
• The event E3 “A occurs but B does not” consists of all elements in A but
not in B, i.e., E3 = A \ B.
Solution:
We have A = {2, 4, 6} (see Example 1.1) and B = {2, 3, 5}. Using these, the required
events are
F1 = A ∪ B = {2, 3, 4, 5, 6},
and
F2 = A △ B = {3, 4, 5, 6}.
(a) Write down the sample space, explaining your notation carefully.
(b) Write down the event “Exactly three of them attend the last lecture” as a set.
(c) Write down the event “Amina attends the last lecture but Daniel does not” as a set.
1.4 Further Exercises
Chapter 2
Properties of Probabilities
(b) P(S) = 1,
then
P(A1 ∪ A2 ∪ · · · ∪ An) = P(A1) + P(A2) + · · · + P(An) = Σ_{k=1}^{n} P(Ak).
Remarks:
• Pairwise disjoint in Definition 2.1(c) simply means that every pair of events
is disjoint. [Equivalently, we can use the term mutually exclusive.]
• Definition 2.1(c) is here stated for a finite number of events. In fact, there
is a version for a countably infinite number of events as well but we do not
state that formally in this module as we would need to clarify the notion of
an infinite sum first.² Further subtleties occur if S is not countable.
Show that this defines a probability measure, i.e., that the given probabilities satisfy
Kolmogorov’s Axioms.
Solution:
All four events (i.e., all four subsets of the sample space {h, t}) have probabilities greater
than or equal to zero so Definition 2.1(a) is obviously satisfied. We also straightforwardly
have P(S) = P({h, t}) = 1 so Definition 2.1(b) is also satisfied. For Definition 2.1(c) we
[Footnote 2: A concept many of you will meet in the module “Calculus II”.]
P({h} ∪ {t}) = P({h, t}) = 1 = 1/3 + 2/3 = P({h}) + P({t}),
P(∅ ∪ {h}) = P({h}) = 1/3 = 0 + 1/3 = P(∅) + P({h}),
P(∅ ∪ {t}) = P({t}) = 2/3 = 0 + 2/3 = P(∅) + P({t}),
P(∅ ∪ {h, t}) = P({h, t}) = 1 = 0 + 1 = P(∅) + P({h, t}),
P(∅ ∪ {h} ∪ {t}) = P({h, t}) = 1 = 0 + 1/3 + 2/3 = P(∅) + P({h}) + P({t}).
Hence Definition 2.1(c) is satisfied and we have a probability measure as required.
Note that for simple events you may sometimes see notation like P(h) as a
shorthand for P({h}) but it is better to include the curly braces as a reminder
that events are always sets.
Exercise 2.3: A more complicated construction
Let S = {1, 2, 3, 4, 5, 6, 7, 8} and for A ⊆ S define a probability measure by
P(A) = (1/12) (|A ∩ {1, 2, 3, 4}| + 2|A ∩ {5, 6, 7, 8}|).
(a) Verify that this satisfies the axioms for probability.
(b) Give an example of a physical situation which this probability measure could describe.
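As a way to explore Exercise 2.3, the measure can be implemented and the axioms checked by brute force over all 256 subsets. This Python sketch is my own and does not replace the pen-and-paper verification asked for in part (a):

```python
from fractions import Fraction
from itertools import combinations

S = set(range(1, 9))

def P(A):
    # The measure from Exercise 2.3, kept exact with Fraction.
    return Fraction(len(A & {1, 2, 3, 4}) + 2 * len(A & {5, 6, 7, 8}), 12)

subsets = [set(c) for r in range(9) for c in combinations(sorted(S), r)]

assert all(P(A) >= 0 for A in subsets)    # Definition 2.1(a)
assert P(S) == 1                          # Definition 2.1(b)
assert all(P(A | B) == P(A) + P(B)        # Definition 2.1(c) for pairs
           for A in subsets for B in subsets if not A & B)
print("axioms hold on all", len(subsets), "subsets")
```

Checking additivity for pairs of disjoint events suffices here, since the finite case of Definition 2.1(c) then follows by induction.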
Proposition 2.2. For any event A, P(Ac) = 1 − P(A).
This statement makes perfect sense: if P(A) is the probability of the event A
then the probability of the complementary event Ac should be 1 − P(A). [This
can be very useful in calculations; sometimes it is much easier to calculate P(Ac )
than P(A).] Although the proposition may be obvious, we want to provide a
formal proof starting from Definition 2.1 to provide evidence that our axioms are
consistent with the real world and to demonstrate the structure of probability
theory.
Proof:
Let A be any event. Then we can set A1 = A and A2 = Ac . By definition of the
complement, A1 ∩ A2 = ∅ and so we can apply Definition 2.1(c), with n = 2, to
get
P(A1 ∪ A2) = P(A1) + P(A2) = P(A) + P(Ac). (2.1)
But A1 ∪ A2 = A ∪ Ac = S, so P(A1 ∪ A2) = P(S) = 1 by Definition 2.1(b). Combining
this with (2.1) gives
1 = P(A) + P(Ac),
and rearranging yields the result.
We can use what we have just proved to deduce further results which are
called here “corollaries” as they are straightforward consequences of the preceding
proposition.
Corollary 2.3.
P(∅) = 0.
This statement makes perfect sense as well. The probability of “no outcome” is
zero. [We already had this in Example 2.2 but we now see that it has to be true.]
Proof:
By definition of the complement, Sc = S \ S = ∅. Hence, by Proposition 2.2,
P(∅) = P(Sc) = 1 − P(S) = 1 − 1 = 0,
as required.
Corollary 2.4. For any event A, P(A) ≤ 1.
Again the statement agrees with our intuition. Probabilities are always smaller
than or equal to one.
Proof:
By Proposition 2.2,
P(Ac ) = 1 − P(A).
Since P(Ac) ≥ 0 by Definition 2.1(a),
0 ≤ P(Ac) = 1 − P(A)
and hence
P(A) ≤ 1,
as required.
The following statements are less obvious consequences of Definition 2.1 and
the statements we have shown so far. Thus we call them again propositions.
Proposition 2.5. If A and B are events with A ⊆ B, then P(A) ≤ P(B).
This statement looks sensible as well. If an event B contains all the outcomes of
an event A then the probability of the former must be at least as big as that of
the latter.
Proof:
Consider the events A1 = A and A2 = B \ A, with A ⊆ B. Then A1 ∩ A2 = ∅ (i.e.,
the two events are pairwise disjoint) and A1 ∪ A2 = B. [A Venn diagram may help
you to see what is going on here.] So, by Definition 2.1(c), with n = 2,
P(B) = P(A1 ∪ A2) = P(A1) + P(A2) = P(A) + P(B \ A) ≥ P(A),
since P(B \ A) ≥ 0 by Definition 2.1(a), as required.
Proposition 2.6. If A = {a1, a2, . . . , an} is a finite event, then
P(A) = P({a1}) + P({a2}) + · · · + P({an}).
This statement is quite remarkable. The probability of a (finite) event is the sum
of the probabilities of the corresponding simple events. [Again, you may see P(ai )
for P({ai }) although writing the former should really be avoided.]
Proof:
Denote the simple events by Ai = {ai}, i = 1, . . . , n. These events are pairwise
disjoint and A = A1 ∪ A2 ∪ · · · ∪ An. Hence, by Definition 2.1(c),
P(A) = P(A1 ∪ A2 ∪ · · · ∪ An )
= P(A1 ) + P(A2 ) + · · · + P(An )
= P({a1 }) + P({a2 }) + · · · + P({an }),
as required.
Remarks:
• Note that combining Proposition 2.6 with Definition 2.1(b) leads to the ob-
vious fact that the sum of the individual probabilities for all outcomes in the
sample space is unity. In other words, this agrees with our intuition that
“probabilities sum to one”.
• Notice that most of the proofs in this section involve expressing events of in-
terest as the union of disjoint events so that Definition 2.1(c) can be applied.
This is a powerful general strategy.
Exercise 2.4: Rugby squad (revisited)
Consider the set-up of Exercise 1.13. Suppose that 50% of the squad are forwards, 25%
are injured and 10% are injured forwards and that the player chosen is equally likely to be
any member of the squad. Calculate the probability that the chosen player is a forward
who is not injured.
Proposition 2.7 (Inclusion-exclusion for two events). For any two events A and
B we have
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
This statement is not entirely obvious. For general events the probability of the
event “A or B” is not the sum of the probabilities of events A and B; in fact,
because of some “double counting” one needs to correct by the probability of the
event “A and B”.
Proof:
Consider the three events E1 = A \ B, E2 = A ∩ B and E3 = B \ A. The events
are pairwise disjoint and E1 ∪ E2 ∪ E3 = A ∪ B. [This should remind you of
Exercise 1.8.] Hence, by Definition 2.1(c) with n = 3,
P(A ∪ B) = P(E1) + P(E2) + P(E3).
Since A = E1 ∪ E2 and B = E2 ∪ E3 are also disjoint unions, we have
P(A) = P(E1) + P(E2) and P(B) = P(E2) + P(E3), while P(A ∩ B) = P(E2).
Substituting these in gives
P(A ∪ B) = P(A) + P(B) − P(A ∩ B),
as required.
Solution:
The probability of being able to travel via line 1 is P(A ∩ B). Using the inclusion-exclusion
formula we can write this as
but this only helps if we know P(A ∪ B). The trick here is to remember that the union of
A and B means at least one of A and B occurs. Since at least three of the four buses are
running, at least one of the two on line 1 must be running and hence P(A ∪ B) = 1.
The probability of being able to travel via line 1 is thus
Proposition 2.8 (Inclusion-exclusion for three events). For any three events A,
B, and C we have
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C).
As for two events, there exists an “intuitive” argument but that is not a proof.
Proof:
Essentially we will apply Proposition 2.7 three times. Let D = A ∪ B so that
A ∪ B ∪ C = C ∪ D. Then
P(A ∪ B ∪ C) = P(C ∪ D)
= P(C) + P(D) − P(C ∩ D) [by Proposition 2.7]
= P(C) + P(A ∪ B) − P(C ∩ D)
= P(C) + P(A) + P(B) − P(A ∩ B) − P(C ∩ D) [by Proposition 2.7].
(2.3)
Now C ∩ D = C ∩ (A ∪ B) = (C ∩ A) ∪ (C ∩ B), so applying Proposition 2.7 once
more, and noting that (C ∩ A) ∩ (C ∩ B) = A ∩ B ∩ C, we get
P(C ∩ D) = P(A ∩ C) + P(B ∩ C) − P(A ∩ B ∩ C).
Substituting this into (2.3) gives
P(A ∪ B ∪ C) = P(A) + P(B) + P(C) − P(A ∩ B) − P(A ∩ C) − P(B ∩ C) + P(A ∩ B ∩ C),
as required.
P(A) = P(B) = P(C) = 1/3.
Furthermore, assume that the probabilities for each of the events “A and B”, “A and C”,
and “B and C” are given by
P(A ∩ B) = P(A ∩ C) = P(B ∩ C) = 1/10.
What can be said about the probability of the event that none of A, B, or C occur?
(b) Now try to write down a general formula for n events. How could you prove it?
Remark: Notice how the few simple axioms of Section 2.1 led to a plethora of
results in Sections 2.2 and 2.3 – such structure is essentially what mathematics is
about. In the next section, we will see how the general framework applies to the
special case of equally-likely outcomes.
P(A) = |A|/|S|. (2.5)
For example, if we roll a fair die then the sample space is S = {1, 2, 3, 4, 5, 6} and
the probability of the event A “the number shown is smaller than 3” is given by
P(A) = |{1, 2}| / |{1, 2, 3, 4, 5, 6}| = 2/6 = 1/3.
We emphasize again that we are using here that every outcome in the sample space
is equally likely (we say “we pick an outcome at random”) and this special case
should not be assumed without justification; for example, it wouldn’t apply to a
biased die. We also run into difficulties if S is infinite. If S = N (the set of positive
integers) as in Exercise 1.4, then there is no reasonable way to choose an element
of S with all outcomes equally likely; there are, however, ways to choose so that
every positive integer has some chance of occurring.
Solution:
We need to check all three parts of Definition 2.1.
so
P(A1 ∪ A2 ∪ · · · ∪ An) = (|A1| + |A2| + · · · + |An|)/|S|
= |A1|/|S| + |A2|/|S| + · · · + |An|/|S|
= P(A1) + P(A2) + · · · + P(An).
Since Definition 2.1 is fulfilled, all the other results in Sections 2.2 and 2.3 also
hold. In particular, the inclusion-exclusion formula for two events reads
|A ∪ B|/|S| = |A|/|S| + |B|/|S| − |A ∩ B|/|S|
which implies
|A ∪ B| = |A| + |B| − |A ∩ B|.
This last statement is a statement about the sizes of finite sets which can be proved
in other ways as well. This cross-fertilization of ideas between different branches
of mathematics is an important part of higher-level study (and research!).
Note that in this set-up of equally-likely outcomes, calculating probabilities
becomes counting! In the next chapter we will see some combinatorial arguments
for finding |A| and |S| in different situations.
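The slogan that calculating probabilities becomes counting can be captured in a few lines of code. This Python sketch is my own (the function name `prob` is hypothetical, not from the notes); it reproduces the fair-die calculation above:

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(A) = |A|/|S| when all outcomes in S are equally likely,
    as in equation (2.5); Fraction keeps the answer exact."""
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}          # fair die
A = {s for s in S if s < 3}     # "the number shown is smaller than 3"
print(prob(A, S))               # 1/3
```

The whole difficulty then shifts to computing |A| and |S|, which is exactly what the combinatorial arguments of the next chapter address.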
(a) P(B c ),
(c) P(A ∩ B c ).
Make sure that each step in your proofs is justified by a definition, axiom or result from
the notes (or is a simple manipulation).
Chapter 3
Sampling
3.2 Ordered Sampling with Replacement
Solution:
This is an ordered selection of three things from a collection of five things with repetition
allowed – you can think of taking Scrabble tiles from a bag containing (one each of) A, B,
C, D, E, and replacing each letter after you have written it down. There are five choices
for the first letter. For each of these, there are five choices for the second letter and, for
each of these, there are five choices for the third letter. Hence, using the basic principle
of counting, there are a total of 5 × 5 × 5 = 5^3 = 125 possible words.
In formal terms, we let the set of letters be U = {A, B, C, D, E}. Then the experiment
is to pick three letters in order and each outcome can be written as a triple, e.g., (A, C, E).
The sample space of all words is the set of all such triples, i.e., S = {(s1 , s2 , s3 ) : si ∈ U }
and we have |S| = 5^3.
S = {(s1 , s2 , . . . , sr ) : si ∈ U }.
• If |U | = n there are n choices for u1 ; for each of these, there are n choices
for u2 , and so on. Hence, we have
|S| = |U|^r = n^r. (3.1)
Solution:
If the word contains no vowels then it must be made up solely of the letters B, C, and D.
Hence there are three choices for the first letter, three choices for the second letter, and
three choices for the third letter. It follows that
P(A) = |A|/|S| = (3 × 3 × 3)/125 = 27/125.
If the word begins and ends with the same letter then there are five choices for the first
letter and five choices for the second letter but only one choice for the third because it
must be the same as the first. It follows that
P(B) = |B|/|S| = (5 × 5 × 1)/125 = 25/125 = 1/5.
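Both of these probabilities can be confirmed by enumerating all 125 words. The following Python sketch is my own check, not part of the notes:

```python
from itertools import product
from fractions import Fraction

U = 'ABCDE'
S = list(product(U, repeat=3))   # ordered sampling with replacement

A = [w for w in S if not any(c in 'AE' for c in w)]   # no vowels
B = [w for w in S if w[0] == w[2]]                    # begins and ends alike

print(len(S))                    # 125
print(Fraction(len(A), len(S)))  # 27/125
print(Fraction(len(B), len(S)))  # 1/5
```

Enumeration is only feasible for small sample spaces, which is precisely why the counting formulae of this chapter matter.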
(a) What is the cardinality of the sample space for this experiment?
(b) By computing the cardinality of each of these events find the probability that:
Solution:
There are still five choices for the first letter but now there are only four choices for the
second letter (only four letters are “left in the bag”) and only three choices for the third
letter (only three letters “left in the bag”). Hence there are 5 × 4 × 3 = 60 different words.
selected only once) then the sample space is the set of all ordered r-tuples of
distinct elements of U . That is
|S| = n × (n − 1) × (n − 2) × · · · × (n − r + 1)
= [n × (n − 1) × · · · × (n − r + 1) × (n − r) × (n − r − 1) × · · · × 2 × 1] / [(n − r) × (n − r − 1) × · · · × 2 × 1]
= n!/(n − r)! (3.2)
P(C) = |C|/|S| = 60/125 = 12/25.
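The count of 60 ordered selections and the probability 12/25 can be confirmed with itertools; this is my own check, not part of the notes:

```python
from itertools import permutations, product
from fractions import Fraction

U = 'ABCDE'
S = list(product(U, repeat=3))            # all 125 three-letter words
C = [w for w in S if len(set(w)) == 3]    # all three letters distinct

print(len(list(permutations(U, 3))))      # ordered, no replacement: 60
print(Fraction(len(C), len(S)))           # 60/125 = 12/25
```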
3.4 Unordered Sampling without Replacement
Solution:
We already know that there are 5!/2! = 60 ways to make an ordered selection without
replacement (see Example 3.4) but that is overcounting because now we want, e.g., ECB,
EBC, BEC, BCE, CBE, CEB to all count as the same outcome. For each choice of three
letters there are 3!=6 permutations (see Example 3.5) so the overcounting is by a factor
of six and there are 60/6 = 10 ways to make an unordered selection without replacement.
S = {A ⊆ U : |A| = r}.
r! × |S| = n!/(n − r)!,
and so
|S| = n!/((n − r)! r!). (3.3)
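The quantity in (3.3) is the binomial coefficient, available directly in Python as `math.comb`. This sketch (my own, not from the notes) checks it against direct enumeration for n = 5, r = 3:

```python
from itertools import combinations
from math import comb, factorial

U = 'ABCDE'
S = list(combinations(U, 3))   # unordered selections without replacement

print(len(S))                                          # 10 subsets
print(comb(5, 3))                                      # binomial coefficient
print(factorial(5) // (factorial(2) * factorial(3)))   # n!/((n-r)! r!), as in (3.3)
```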
Solution:
There are (197 choose 2) ways to choose the MTH4107 students and (202 choose 2) ways to choose the
MTH4207 students. Hence, by the basic principle of counting,
Number of selections = (197 choose 2) × (202 choose 2)
= 197!/(195! 2!) × 202!/(200! 2!)
= (197 × 196)/2 × (202 × 201)/2
= 197 × 98 × 101 × 201
= 391931106.
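This arithmetic is easy to double-check in Python (my own check, not part of the notes):

```python
from math import comb

# Two MTH4107 students chosen from 197, and two MTH4207 students from 202.
selections = comb(197, 2) * comb(202, 2)
print(selections)  # 391931106
```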
(c) *What is the probability of matching five balls out of six in the current set-up? Can
you generalize to the probability of matching n balls?
(c) Unordered without replacement (no repetition): (n choose r) = n!/((n − r)! r!).
Remarks:
• We have not covered the case of unordered sampling with replacement (i.e.,
repetition allowed). This is rather more difficult to deal with but sometimes
the previous point provides a workaround.
Solution:
Since we pick coins at random, P(D) = |D|/|S|. To calculate the probability we there-
fore need to determine the size of the sample space and the size of the event. The
set U of objects contains seven different gold coins and three different copper coins,
U = {c1 , c2 , c3 , g1 , g2 , . . . , g6 , g7 }. It is important to note that the coins are all differ-
ent objects even if they may share the same colour – by definition the elements of a set
must be distinct.
Let us first consider the experiment as ordered sampling without replacement. In this
case, the outcomes are an ordered selection of four objects, i.e., a 4-tuple (s1 , s2 , s3 , s4 )
with sk ∈ U and no repetition (i.e., si ≠ sj for i ≠ j). Using n = 10 and r = 4 in
Theorem 3.1(b), the size of the sample space is
|S| = 10!/(10 − 4)! = 10!/6! = 10 × 9 × 8 × 7.
|D| = 7!/(7 − 4)! = 7!/3! = 7 × 6 × 5 × 4,
and hence
P(D) = |D|/|S| = (7 × 6 × 5 × 4)/(10 × 9 × 8 × 7) = 1/6.
Now let us consider the experiment as unordered sampling – imagine revealing all the
picked coins at once rather than one-by-one. The outcomes are now subsets of U with
cardinality four (i.e., {s1 , s2 , s3 , s4 } with sk ∈ U ) and to calculate the size of the sample
space we need to use Theorem 3.1(c), with n = 10 and r = 4:
|S| = (10 choose 4) = 10!/(6! 4!) = (10 × 9 × 8 × 7)/(4 × 3 × 2 × 1),
which, of course, is the same result as obtained using ordered sampling. [In the fortunate
circumstance that you are able to do a question using two different methods, comparison
of the results provides a good way to check your answer!]
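Both routes to P(D) = 1/6 can be verified by enumeration. This Python sketch is my own check; the coin labels g1, …, g7 and c1, c2, c3 follow the solution's notation:

```python
from itertools import combinations, permutations
from fractions import Fraction

gold = ['g%d' % i for i in range(1, 8)]     # seven distinct gold coins
copper = ['c%d' % i for i in range(1, 4)]   # three distinct copper coins
U = gold + copper

# Ordered sampling without replacement: 4-tuples of distinct coins.
S_ord = list(permutations(U, 4))
D_ord = [s for s in S_ord if all(c in gold for c in s)]

# Unordered sampling without replacement: 4-element subsets.
S_un = list(combinations(U, 4))
D_un = [s for s in S_un if all(c in gold for c in s)]

print(Fraction(len(D_ord), len(S_ord)))  # 1/6
print(Fraction(len(D_un), len(S_un)))    # 1/6 again, as the notes observe
```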
events:
(a) The event, E, that we pick two gold coins followed by two copper coins;
(b) The event, F , that we pick two gold and two copper coins in any order.
Solution:
We are sampling r = 5 objects from the set U = {1, 2, 3, 4, 5, 6} of size n = 6. The exper-
iment allows repetition so we have to consider ordered sampling with replacement. [Re-
member that we haven’t really covered how to treat unordered sampling with replacement!]
Outcomes are thus 5-tuples and the size of the sample space is given by Theorem 3.1(a):
|S| = 6^5.
Now let G be the event “we roll a pair”. We consider first those outcomes in the event
G which have the pair as the first two entries, i.e., outcomes of the form (p, p, r1 , r2 , r3 ).
There are obviously six choices for p; for each of those there are five choices for r1 ; for
each of those there are four choices for r2 ; for each of those there are three choices for
r3 . Hence by the basic principle of counting, there are 6 × 5 × 4 × 3 such outcomes.
However, the pair doesn’t have to appear as the first two entries of our 5-tuple; we need to
take other arrangements into account, e.g., (p, r1 , p, r2 , r3 ), (r1 , r2 , r3 , p, p). Since we are
effectively choosing two (distinct) dice from five to be the pair, there are
(5 choose 2) = 10 different arrangements. Hence |G| = 10 × 6 × 5 × 4 × 3,
and
P(G) = |G|/|S| = (10 × 6 × 5 × 4 × 3)/6^5 = 25/54.
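The value 25/54 can be confirmed by enumerating all 6^5 outcomes. This is my own check, with “a pair” interpreted as the count pattern 2, 1, 1, 1 (one value twice, the other three distinct), matching the counting argument above:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=5))   # five dice: 6^5 outcomes

def exactly_one_pair(outcome):
    # One value appears twice and the remaining three are all different.
    return sorted(Counter(outcome).values()) == [1, 1, 1, 2]

G = [s for s in S if exactly_one_pair(s)]
print(len(G))                    # 10 * 6 * 5 * 4 * 3 = 3600
print(Fraction(len(G), len(S)))  # 25/54
```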
(a) What is the probability that the team is made up of six batsmen and five bowlers?
(b) What is the probability that the team contains fewer than three bowlers?
(b) Without using your answer to part (a), calculate the probability that 1 is not an
element of the chosen subset.
Chapter 4
Conditional Probability
Solution:
(a) Just as in Example 1.1 we can simply write the sample space as S = {1, 2, 3, 4, 5, 6}
and we have A = {1, 3, 5} and B = {1, 2, 3}. Hence, since outcomes are equally likely,
P(A) = |A|/|S| = 3/6 = 1/2,
P(B) = |B|/|S| = 3/6 = 1/2.
(b) Again using the fact that outcomes are equally likely, we have
P(B given event A) = |{1, 3}| / |{1, 3, 5}| = 2/3.
This last expression gives a general definition of conditional probability (the proba-
bility of the event B conditioned on event A having occurred), which holds beyond
the special case of equally-likely outcomes.
Definition 4.1. If E1 and E2 are events and P(E1) ≠ 0 then the conditional
probability of E2 given E1 , usually denoted by P(E2 |E1 ), is
P(E2|E1) = P(E1 ∩ E2)/P(E1).
Remarks:
• The notation P(E2 |E1 ) is rather unfortunate since E2 |E1 is not an event.
Do not confuse the conditional probability P(E2 |E1 ) with P(E2 \ E1 ), the
probability for the event “E2 and not E1 ”.
• Note that the definition does not require that E2 happens after E1 , only
that we know about E1 but not about E2 . One way of thinking of this is
to imagine that the experiment is performed secretly and the fact that E1
occurred is revealed to you (without the full outcome being revealed). The
conditional probability of E2 given E1 is the new probability of E2 in these
circumstances.
(a) Find the conditional probability that the second pen is blue, given the first pen is red.
(b) Find the conditional probability that the first pen is red, given the second pen is blue.
• If P(E2 |E1 ) = P(E2 ) then the event E1 has no impact on the probability of
event E2 and we say the events are independent (see Chapter 5).
Solution:
(a) This situation can be treated as ordered sampling with replacement; we write the
outcomes as ordered pairs and |S| = 36 (see Exercise 1.3). Since outcomes are equally
likely we can simply count the number of outcomes in each event to find
P(A) = |A|/|S| = 6/36 = 1/6,
P(B) = |B|/|S| = 6/36 = 1/6,
P(C) = |C|/|S| = 27/36 = 3/4.
[If the last of these is not obvious, use the fact that there are 3 × 3 ways to get two
even numbers, together with Proposition 2.2.]
(i) Here A ∩ B = {(6, 6)} so

P(A ∩ B) = |A ∩ B|/|S| = 1/36

and hence

P(B|A) = P(A ∩ B)/P(A) = (1/36)/(1/6) = 1/6.
This fits with the intuition that rolling a six first does not change the probability
of a double.
(ii) Now we have
P(A|B) = P(A ∩ B)/P(B) = (1/36)/(1/6) = 1/6.
It is perhaps slightly less obvious that rolling a double does not change the
probability of getting a six on the first roll. However, remember that conditional
probability does not require the condition to happen “first”; indeed, here event
B happens “after” event A has happened.
(iii) Here B ∩ C = {(1, 1), (3, 3), (5, 5)} so
P(B ∩ C) = |B ∩ C|/|S| = 3/36 = 1/12
and hence
P(C|B) = P(B ∩ C)/P(B) = (1/12)/(1/6) = 1/2.
So rolling a double reduces the probability of having at least one odd number.
(iv) Now we have
P(B|C) = P(B ∩ C)/P(C) = (1/12)/(3/4) = 1/9.
Similarly, rolling at least one odd number reduces the probability of a double –
this is probably not obvious at all!
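Since the sample space here is small, all four conditional probabilities can be verified by direct counting. A minimal Python sketch (the event and helper names are my own, not from the notes):

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))  # 36 equally likely outcomes

def P(event):
    # Probability by counting, valid because outcomes are equally likely.
    return Fraction(sum(1 for w in S if event(w)), len(S))

def P_given(e2, e1):
    # P(e2 | e1) = P(e1 and e2) / P(e1), as in Definition 4.1.
    return P(lambda w: e1(w) and e2(w)) / P(e1)

A = lambda w: w[0] == 6                        # six on the first roll
B = lambda w: w[0] == w[1]                     # a double
C = lambda w: w[0] % 2 == 1 or w[1] % 2 == 1   # at least one odd number

print(P_given(B, A), P_given(A, B), P_given(C, B), P_given(B, C))
# 1/6 1/6 1/2 1/9
```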
Notice from the last example that for events E1 and E2 , the probability of E1
given E2 does not have to be equal to the probability of E2 given E1 ; in general
these are different probabilities, which relates to the discussion of Exercise 0.3.
(a) Prove that if P(E1 |E2 ) > P(E1 ) then P(E2 |E1 ) > P(E2 ).
(c) If P(E1 |E2 ) is known, what can be deduced about P(E1 |E2c )?
which is a useful formula for calculating P(E1 ∩ E2 ) from knowledge of the conditional
probability P(E2 |E1 ). In fact, one can generalize this to an arbitrary
number of events, as expressed in the following theorem.
Theorem 4.2 (Multiplication rule). If E1 , E2 , . . . , En are events with P(E1 ∩ E2 ∩ · · · ∩ En−1 ) > 0, then

P(E1 ∩E2 ∩· · ·∩En ) = P(E1 )×P(E2 |E1 )×P(E3 |E1 ∩E2 )×· · ·×P(En |E1 ∩E2 ∩· · ·∩En−1 ).
Proof:
For a given number of events, one can easily prove the theorem statement by plug-
ging in the definition of conditional probabilities and cancelling common factors
in the numerator and denominator. To prove it in general we will use induction.
As the base case, let us take n = 2. This is precisely what we already have
in (4.1) from direct rearrangement of Definition 4.1. Now, for the inductive step,
we assume we have shown that the statement of the theorem holds for n = k, i.e.,
P(E1 ∩E2 ∩· · ·∩Ek ) = P(E1 )×P(E2 |E1 )×P(E3 |E1 ∩E2 )×· · ·×P(Ek |E1 ∩E2 ∩· · ·∩Ek−1 )
(4.2)
and seek to prove the statement for n = k + 1. First note that we can write
E1 ∩ E2 ∩ · · · ∩ Ek+1 as F1 ∩ F2 with F1 = E1 ∩ E2 ∩ · · · ∩ Ek and F2 = Ek+1 . From
Definition 4.1, we have P(F1 ∩ F2 ) = P(F1 )P(F2 |F1 ), i.e.,

P(E1 ∩ E2 ∩ · · · ∩ Ek+1 ) = P(E1 ∩ E2 ∩ · · · ∩ Ek ) × P(Ek+1 |E1 ∩ E2 ∩ · · · ∩ Ek ).

Substituting (4.2) for the first factor on the right-hand side gives

P(E1 ∩ E2 ∩ · · · ∩ Ek+1 ) = P(E1 ) × P(E2 |E1 ) × · · · × P(Ek |E1 ∩ E2 ∩ · · · ∩ Ek−1 ) × P(Ek+1 |E1 ∩ E2 ∩ · · · ∩ Ek ),

which means the statement of the theorem also holds for n = k + 1 and hence, by
the principle of induction, for all n ≥ 2.
P(E1 ∩E2 ∩· · ·∩Er ) = P(E1 )×P(E2 |E1 )×P(E3 |E1 ∩E2 )×· · ·×P(Er |E1 ∩E2 ∩· · ·∩Er−1 ).
Note that this expression is valid whether we sample with or without replacement
but the form of the conditional probabilities is different in each case.
Solution:
In this case,
P(E1 ) = 1/n,
as we pick the element v1 at random from the set U of size |U | = n. Similarly,
P(E2 |E1 ) = 1/(n − 1),
as we pick the element v2 at random from the set U \ {v1 } of size n − 1. In general,
P(Ei |E1 ∩ E2 ∩ · · · ∩ Ei−1 ) = 1/(n − i + 1),
as we pick the element vi at random from the set U \ {v1 , v2 , . . . , vi−1 } of size n − i + 1.
Hence
P(A) = P(E1 ∩ E2 ∩ · · · ∩ Er )
     = 1/n × 1/(n − 1) × · · · × 1/(n − r + 1)
     = (n − r)!/n!.
Unsurprisingly, the results for P(A) in Example 4.7 and Exercise 4.8 agree
with 1/|S| when |S| is calculated from Theorem 3.1(b) and 3.1(a) respectively.
However, you may find the equally-likely assumption more transparent with the
method of this chapter.
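The chain of conditional probabilities above multiplies out to (n − r)!/n!, which is easy to confirm numerically. A sketch with illustrative sizes n = 10, r = 4 (my choice, not from the notes):

```python
from fractions import Fraction
from math import factorial

n, r = 10, 4  # illustrative sizes

# Multiply P(E1) * P(E2|E1) * ... = 1/n * 1/(n-1) * ... * 1/(n-r+1).
prob = Fraction(1)
for i in range(r):
    prob *= Fraction(1, n - i)

assert prob == Fraction(factorial(n - r), factorial(n))
print(prob)  # 1/5040
```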
(a) Find the probability that the sum of the two rolls is at least nine.
(b) Find the conditional probability that the first roll is a “four” given that the sum of
the two rolls is at least nine.
(c) Find the conditional probability that the first roll is not a “four” given that the sum
of the two rolls is at least nine.
(d) Find the conditional probability that the first roll is a “four” given that the sum of
the two rolls is less than nine.
(e) Find the conditional probability that the sum of the two rolls is at least nine given
that the first roll is a “four”.
(a) Show that the conditional probability that the train is more than 15 minutes late
given that the train is late is equal to 1/2.
(b) Show that the conditional probability that you get a seat given that the train is late
is equal to 1/6.
(a) For which of A and B is there a higher probability that a patient chosen randomly from
among those given that treatment recovers? Express this as an inequality between
two conditional probabilities.
(b) For which of A and B is there a higher probability that a man chosen randomly from
among those given that treatment recovers? Express this as an inequality between
two conditional probabilities.
(c) For which of A and B is there a higher probability that a woman chosen randomly from
among those given that treatment recovers? Express this as an inequality between
two conditional probabilities.
(d) Compare the inequality in part (a) with the inequalities in part (b) and (c). Are you
surprised by the result?
Chapter 5
Independence
Solution:
Just as in Example 4.4, we can treat this situation as ordered sampling with replacement
and write the outcomes as ordered pairs. We obviously have |S| = 36 and, counting the
number of possibilities for each die roll we have |A| = 3 × 6 = 18 and |B| = 6 × 2 = 12.
Hence, since all outcomes are equally likely and (2.5) applies,
P(A) = 18/36 = 1/2 and P(B) = 12/36 = 1/3.
For the event A∩B (“the first roll is even and the second roll is larger than four”), there are
obviously three possibilities for the first roll and two for the second so |A ∩ B| = 3 × 2 = 6
and
P(A ∩ B) = 6/36 = 1/6.
Notice here that
P(A ∩ B) = 1/6 = 1/2 × 1/3 = P(A)P(B).
As you may already know, events with such a property are said to be independent. We
also have that
P(A|B) = P(A ∩ B)/P(B) = (1/6)/(1/3) = 1/2,

and

P(B|A) = P(A ∩ B)/P(A) = (1/6)/(1/2) = 1/3.
Notice here that P(A|B) = P(A) and P(B|A) = P(B), i.e. the conditional probabilities do
not depend on the condition. We will further explore the connection between conditional
probabilities and independence in the next section.
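The factorization P(A ∩ B) = P(A)P(B) in this example can be confirmed by enumeration; a short Python sketch (names mine):

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))  # two fair dice

def P(event):
    return Fraction(sum(1 for w in S if event(w)), len(S))

A = lambda w: w[0] % 2 == 0   # first roll even
B = lambda w: w[1] > 4        # second roll larger than four

# Independence check: P(A and B) equals P(A) * P(B).
assert P(lambda w: A(w) and B(w)) == P(A) * P(B) == Fraction(1, 6)
```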
Definition 5.1. We say that the events E1 and E2 are (pairwise) independent
if
P(E1 ∩ E2 ) = P(E1 )P(E2 ).
[If this equation does not hold we say that the events are dependent.]
Remarks:
• Be careful not to assume independence without good reason. You may as-
sume that two events E1 and E2 are independent in the following situations:
(i) They are clearly physically unrelated (e.g., they are associated with
different tosses of a coin);
(ii) You calculate their probabilities and find that P(E1 ∩E2 ) = P(E1 )P(E2 )
(i.e., you check Definition 5.1);
(iii) A question tells you that they are independent!
Theorem 5.2. Let E1 and E2 be events with P(E1 ) > 0 and P(E2 ) > 0. The
following are equivalent:
(a) E1 and E2 are independent;
(b) P(E1 |E2 ) = P(E1 );
(c) P(E2 |E1 ) = P(E2 ).
Proof:
We prove the cycle (a) ⇒ (b) ⇒ (c) ⇒ (a), which establishes the equivalence of
all three statements.
• (a) implies (b): We start by assuming (a) is true, i.e., P(E1 ∩ E2 ) = P(E1 )P(E2 ).
Then, by Definition 4.1,

P(E1 |E2 ) = P(E1 ∩ E2 )/P(E2 ) = P(E1 )P(E2 )/P(E2 ) = P(E1 ),

i.e., (b) is also true.
• (b) implies (c): We start by assuming (b) is true, i.e., P(E1 |E2 ) = P(E1 ).
Then, by Definition 4.1,
P(E1 ∩ E2 )/P(E2 ) = P(E1 ).

Multiplying both sides by P(E2 ) gives P(E1 ∩ E2 ) = P(E1 )P(E2 ) and hence,
dividing by P(E1 ),

P(E2 ∩ E1 )/P(E1 ) = P(E2 ),
which means, again by Definition 4.1, P(E2 |E1 ) = P(E2 ), i.e., (c) is also true.
• (c) implies (a): We start by assuming (c) is true, i.e., P(E2 |E1 ) = P(E2 ).
Then, once again by Definition 4.1,
P(E2 ∩ E1 )/P(E1 ) = P(E2 )
which implies P(E2 ∩E1 ) = P(E1 )P(E2 ), i.e., E1 and E2 satisfy Definition 5.1
and are independent. Hence (a) is also true and the proof is complete.
Solution:
We start from
P(E1 ∪ E2 ) = P(E1 )P(E2c ) + P(E2 )
and use the inclusion-exclusion formula (Proposition 2.7) on the left-hand side and the
obvious Proposition 2.2 on the right-hand side to obtain:
P(E1 ) + P(E2 ) − P(E1 ∩ E2 ) = P(E1 )(1 − P(E2 )) + P(E2 ).

Cancelling P(E1 ) and P(E2 ) from both sides and rearranging gives
P(E1 ∩ E2 ) = P(E1 )P(E2 ), i.e., E1 and E2 are independent (Definition 5.1).
Solution:
Obviously |S| = 36 and |D| = |E| = 18 so P(D) = P(E) = 1/2 (cf. Example 5.2). The
cardinality of F is slightly less obvious but note that if the sum is odd, one roll must be odd
and the other even; there are nine ordered pairs which are odd followed by even and nine
ordered pairs which are even followed by odd. Hence, we also have P(F ) = 18/36 = 1/2.
Turning our attention to independence, we now consider each pair of events in turn.
• D and E relate to different die rolls so we can argue that they are physically unre-
lated. Hence D and E are independent. This is easily confirmed since
P(D ∩ E) = |D ∩ E|/|S| = 9/36 = 1/4 = 1/2 × 1/2 = P(D)P(E).
• The event D ∩ F contains all pairs which are odd followed by even so
P(D ∩ F ) = |D ∩ F |/|S| = 9/36 = 1/4 = 1/2 × 1/2 = P(D)P(F ).
• Similarly, the event E ∩ F contains all pairs which are even followed by odd so
P(E ∩ F ) = |E ∩ F |/|S| = 9/36 = 1/4 = 1/2 × 1/2 = P(E)P(F ).
Each pair of events is independent so we say that D, E, and F are pairwise independent.
However, notice that the event D ∩ E ∩ F is impossible since when both outcomes are odd
the sum is even. Hence
P(D ∩ E ∩ F ) = P(∅) = 0 ≠ 1/2 × 1/2 × 1/2 = P(D)P(E)P(F ).
In fact, for three or more events the notion of independence is slightly more
subtle than for two events. For example, for three events we have the following
definition.
Definition 5.3. We say that the events E1 , E2 , and E3 are mutually independent if

P(E1 ∩ E2 ) = P(E1 )P(E2 ), P(E1 ∩ E3 ) = P(E1 )P(E3 ), P(E2 ∩ E3 ) = P(E2 )P(E3 ),

and

P(E1 ∩ E2 ∩ E3 ) = P(E1 )P(E2 )P(E3 ).
Armed with this definition, we see that although the events in Example 5.6 are
pairwise independent they are not mutually independent.
Definition 5.3 can be generalized to four, five, and more events. The formal
definition looks awkward but the basic idea is that, for mutual independence, the
probability of the intersection of any finite subset of the events should factorize
into the probabilities of the individual events.
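The pairwise-but-not-mutual independence of D, E, and F above can be checked mechanically. A sketch (the events are coded as predicates on ordered pairs; this mirrors the example rather than adding anything new):

```python
from fractions import Fraction
from itertools import product

S = list(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(sum(1 for w in S if event(w)), len(S))

D = lambda w: w[0] % 2 == 1            # first roll odd
E = lambda w: w[1] % 2 == 1            # second roll odd
F = lambda w: (w[0] + w[1]) % 2 == 1   # sum of the rolls odd

# Each pair factorizes...
assert P(lambda w: D(w) and E(w)) == P(D) * P(E)
assert P(lambda w: D(w) and F(w)) == P(D) * P(F)
assert P(lambda w: E(w) and F(w)) == P(E) * P(F)
# ...but the triple intersection is empty, so mutual independence fails.
assert P(lambda w: D(w) and E(w) and F(w)) == 0
assert P(D) * P(E) * P(F) == Fraction(1, 8)
```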
Solution:
(a) Assuming we pick the fair coin, the probability of a Head on each toss is 1/2, i.e.,
P(H1 |F ) = 1/2, and P(H2 |F ) = 1/2.
Furthermore, coin tosses of the same coin are considered to be independent (see the
discussion below Definition 5.1) so
P(H1 ∩ H2 |F ) = P(H1 |F )P(H2 |F ) = 1/2 × 1/2 = 1/4,
i.e., H1 and H2 are conditionally independent given F .
(b) Assuming we pick the biased coin, the probability of a Head on each toss is 3/4, i.e.,
P(H1 |F c ) = 3/4, and P(H2 |F c ) = 3/4.
Again, coin tosses of the same coin are considered to be independent so, by construc-
tion of the experiment,
P(H1 ∩ H2 |F c ) = P(H1 |F c )P(H2 |F c ) = 3/4 × 3/4 = 9/16,
i.e., H1 and H2 are conditionally independent given F c .
(c) To check Definition 5.1 and determine whether H1 and H2 are independent, we need
to know P(H1 ), P(H2 ), and P(H1 ∩ H2 ). To calculate the first of these note that¹

P(H1 ) = P(H1 ∩ F ) + P(H1 ∩ F c )
       = P(H1 |F )P(F ) + P(H1 |F c )P(F c ),

where in the first line we have employed Kolmogorov’s third axiom, Definition 2.1(c),
and in the second line we have used the multiplication rule, Theorem 4.2. Now,
since the coin is picked at random, we obviously have P(F ) = P(F c ) = 1/2. Hence,
substituting numbers, we find
P(H1 ) = 1/2 × 1/2 + 3/4 × 1/2 = 5/8.
¹ This is a taster of a general method which will appear in the next chapter.
Similarly, we have P(H2 ) = 5/8.
(a) Find an example with two events, A and B, which are independent but not condi-
tionally independent with respect to another event C.
(b) Find an example with two events, D and E, which are conditionally independent with
respect to an event F but not with respect to F c .
likely. Let E be the event “the chosen number is even”, O be the event “the chosen number
is odd”, Q be the event “the chosen number is a perfect square”, and Dk be the event
“the chosen number is a multiple of k”. Carefully justify your answers to the following.
(a) *If A, B, and C are mutually independent events then A and B ∪ C are independent.
(b) If E and F are mutually exclusive events with positive probabilities, then P(E|F ) = 0.
Chapter 6
Solution:
With the same (obvious) notation as in Example 1.2, each outcome is a list of Heads (h)
and Tails (t) in the order in which they are seen. We have
Notice that every outcome in the sample space appears in exactly one of these sets. The
three events are pairwise disjoint (Ei ∩ Ej = ∅ for i ≠ j) and S = E1 ∪ E2 ∪ E3 . Loosely
speaking, the three events “split” the sample space into three parts; more formally we say
that E1 , E2 , and E3 partition the sample space.
Remarks:
Proof:
Let Ak = A ∩ Ek , for k = 1, 2, . . . , n. Note that, by Definition 6.1, the sets
E1 , E2 , . . . , En are pairwise disjoint and E1 ∪ E2 ∪ · · · ∪ En = S. Since Ak ⊆ Ek
the events A1 , A2 , . . . , An are also pairwise disjoint and, furthermore,
A1 ∪ A2 ∪ · · · ∪ An = A ∩ (E1 ∪ E2 ∪ · · · ∪ En ) = A ∩ S = A.
P(Ak ) = P(A ∩ Ek ) = (P(A ∩ Ek )/P(Ek )) × P(Ek ) = P(A|Ek )P(Ek ). (6.2)

Summing (6.2) over k and using Definition 2.1(c) together with the disjointness
of the Ak ’s then gives

P(A) = P(A1 ∪ A2 ∪ · · · ∪ An ) = Σ_k P(A|Ek )P(Ek ),

which is the statement of the theorem.
In fact, we have already seen an example of the use of Theorem 6.2 in Exam-
ple 5.8: by definition F and F c partition S. More generally, the approach is very
widely applicable but for different problems one needs to think carefully about
what partition to use. The technique is called conditioning.
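As a small numerical illustration of conditioning, the two-coin calculation from Example 5.8 (a fair coin or a coin with Head probability 3/4, each picked with probability 1/2) can be reproduced with exact arithmetic; a sketch, variable names mine:

```python
from fractions import Fraction

# Partition of S: F = "fair coin picked" and its complement Fc.
P_F = Fraction(1, 2)
P_Fc = 1 - P_F
P_H_given_F = Fraction(1, 2)    # fair coin
P_H_given_Fc = Fraction(3, 4)   # biased coin

# Law of total probability: P(H1) = P(H1|F)P(F) + P(H1|Fc)P(Fc).
P_H1 = P_H_given_F * P_F + P_H_given_Fc * P_Fc
print(P_H1)  # 5/8
```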
from the results shows the number of survey participants in different age categories and
the percentage of them saying that they did consider Watford as part of London.
Age                                        18–24   25–49   50–64   65+
Number of participants                       124     544     247   173
Percentage saying Watford is in London        31      34      15    19
Theorem 6.3. Let E1 , E2 , . . . , En partition the sample space S, and let A and B
be events with P(B ∩ Ek ) > 0 for all k. Then

P(A|B) = P(A|B ∩ E1 )P(E1 |B) + P(A|B ∩ E2 )P(E2 |B) + · · · + P(A|B ∩ En )P(En |B)
       = Σ_{k=1}^{n} P(A|B ∩ Ek )P(Ek |B).
Proof:
The idea is to use the definition of conditional probability together with the result
we proved in the previous section. Specifically, we start from Definition 4.1
P(A|B) = P(A ∩ B)/P(B),
and then use Theorem 6.2 (conditioning on the partition E1 , E2 , . . . , En ) to write
the numerator as P(A ∩ B) = Σ_{k=1}^{n} P(A ∩ B|Ek )P(Ek ), so that

P(A|B) = Σ_{k=1}^{n} (1/P(B)) × P(A ∩ B|Ek )P(Ek ).

For each term in this sum we have

(1/P(B)) × P(A ∩ B|Ek )P(Ek ) = (1/P(B)) × (P(A ∩ B ∩ Ek )/P(Ek )) × P(Ek )   [by Definition 4.1]
= (1/P(B)) × P(A ∩ B ∩ Ek ) × (P(B ∩ Ek )/P(B ∩ Ek ))   [using P(B ∩ Ek ) > 0]
= (P(A ∩ B ∩ Ek )/P(B ∩ Ek )) × (P(B ∩ Ek )/P(B))
= P(A|B ∩ Ek )P(Ek |B)   [by Definition 4.1]. (6.4)

Summing (6.4) over k then gives the statement of the theorem.
(a) Supposing there is a Head on the first toss, determine the probability that the coin is
fair.
(b) Use the result from (a) together with Theorem 6.3 to find the probability of getting a
Head on the second toss given there is a Head on the first toss.
Solution:
(a) We already have P(F ) = 1/2, P(H1 |F ) = 1/2, and P(H1 ) = 5/8 (see Example 5.8).
We expect P(F |H1 ) to be different to P(H1 |F ); to calculate the former, we start with
the definition of conditional probability. Using the results we already know, we have¹

P(F |H1 ) = P(F ∩ H1 )/P(H1 )   [by Definition 4.1]
= P(H1 |F )P(F )/P(H1 )   [by Theorem 4.2]
= ((1/2) × (1/2))/(5/8)
= 2/5.
(b) Using Theorem 6.3 with the partition {F, F c }, we have

P(H2 |H1 ) = P(H2 |H1 ∩ F )P(F |H1 ) + P(H2 |H1 ∩ F c )P(F c |H1 ). (6.5)
We have P(F |H1 ) = 2/5 [from (a)] and P(F c |H1 ) = 1 − P(F |H1 ) = 3/5 [see Exercise 4.5].
The other two conditional probabilities on the right-hand side of (6.5) look
more complicated at first sight. However, the property of conditional independence
discussed in Example 5.8, leads to considerable simplification. Starting once again
with the definition of conditional probability, we find
P(H2 |H1 ∩ F ) = P(H2 ∩ H1 ∩ F )/P(H1 ∩ F )   [by Definition 4.1]
= P(H2 ∩ H1 |F )P(F )/(P(H1 |F )P(F ))   [by Theorem 4.2]
= P(H2 ∩ H1 |F )/P(H1 |F )
= P(H2 |F )P(H1 |F )/P(H1 |F )   [by conditional independence of H1 and H2 given F ]
= P(H2 |F )
= 1/2.
¹ We’ll see this method again in the next section.
Similarly,

P(H2 |H1 ∩ F c ) = P(H2 |F c ) = 3/4.
Hence, putting everything together, we conclude
P(H2 |H1 ) = 1/2 × 2/5 + 3/4 × 3/5 = 13/20.
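The value 13/20 can be double-checked by summing the probabilities of the eight outcomes (coin, first toss, second toss) directly, instead of conditioning; a sketch with my own variable names:

```python
from fractions import Fraction

head_prob = {"fair": Fraction(1, 2), "biased": Fraction(3, 4)}

def P(event):
    # Sum the weight of each outcome (coin, toss1, toss2) in the event.
    total = Fraction(0)
    for coin, p in head_prob.items():
        for t1 in (True, False):        # True means a Head
            for t2 in (True, False):
                weight = Fraction(1, 2) * (p if t1 else 1 - p) * (p if t2 else 1 - p)
                if event((coin, t1, t2)):
                    total += weight
    return total

H1 = lambda w: w[1]               # Head on the first toss
H1H2 = lambda w: w[1] and w[2]    # Heads on both tosses

print(P(H1), P(H1H2) / P(H1))  # 5/8 13/20
```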
Theorem 6.4 (Bayes’ theorem). If A and B are events with P(A), P(B) > 0, then
P(B|A) = P(A|B)P(B)/P(A).
Proof:
Starting again from Definition 4.1 (and using that P(A), P(B) > 0) we have
P(B|A) = P(B ∩ A)/P(A)
= (P(A ∩ B)/P(A)) × (P(B)/P(B))
= (P(A ∩ B)/P(B)) × (P(B)/P(A))
= P(A|B) × P(B)/P(A).
Remarks:
• We often need to use Theorem 6.2 (law of total probability) to calculate the
probability in the denominator of Theorem 6.4.
Solution:
Let us define the events:
D: “The selected person has the disease”;
P : “The test for the selected person is positive”.
We know
P(D) = 1/1000, P(P |D) = 99/100, and P(P |Dc ) = 5/1000.
We want to compute P(D|P ) so, using Bayes’ theorem (Theorem 6.4), we write
P(D|P ) = P(P |D)P(D)/P(P ).
To calculate P(P ) we can use Theorem 6.2 with the partition {D, Dc }:
P(D|P ) = ((99/100) × (1/1000))/((99/100) × (1/1000) + (5/1000) × (999/1000))
= 990/5985
= 22/133
= 0.1654 (to 4 decimal places).
So assuming the test is positive, there is only about a 17% chance that the person has the
disease. In other words, about 83% of positive tests are false positives. Does this mean
the test is useless or is there anything one can do about the problem?
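The whole calculation takes only a few lines with exact fractions; a sketch (variable names mine):

```python
from fractions import Fraction

P_D = Fraction(1, 1000)                 # prevalence of the disease
P_pos_given_D = Fraction(99, 100)       # true-positive rate
P_pos_given_notD = Fraction(5, 1000)    # false-positive rate

# Denominator via the law of total probability, then Bayes' theorem.
P_pos = P_pos_given_D * P_D + P_pos_given_notD * (1 - P_D)
P_D_given_pos = P_pos_given_D * P_D / P_pos
print(P_D_given_pos)  # 22/133
```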
(b) Suppose that London has a population of 10 million and the murderer is assumed to
be one of these. If there is no evidence against the suspect apart from the fingerprint
match then it is reasonable to regard the suspect as a randomly-chosen citizen. Under
this assumption, what is the probability the suspect is innocent?
(c) How does the argument change if one knows that the suspect is one of only 100 people
who had access to the building at the time Professor Damson was killed?
Solution:
(a) Let us write I for the event “the suspect is innocent” and F for the event “the
fingerprints of the suspect match those at the crime scene”. The prosecutor notes that
P(F |I) = 1/50000 and deduces that P(I|F ) = 1/50000. This is nonsense. In general
there is no reason why two conditional probabilities P(A|B) and P(B|A) should be
equal or even close to equal.
(b) By Bayes’ theorem combined with the law of total probability (using partition {I, I c }),

P(I|F ) = P(F |I)P(I)/(P(F |I)P(I) + P(F |I c )P(I c )).

Now, we are told P(F |I) and it is reasonable to assume that P(F |I c ) = 1 since if the
suspect is guilty the fingerprints should certainly match.² The quantity P(I) in Bayes’
theorem is the probability that the suspect is innocent in the absence of any evidence
² In fact, you could set P(F |I c ) to be any reasonably large probability without changing the
general conclusions.
at all. We are told that the suspect should be regarded as a randomly-chosen citizen
from a city of 10 million people so P(I c ) = 1/10000000 and P(I) = 1 − 1/10000000 =
9999999/10000000. This gives
P(I|F ) = ((1/50000) × (9999999/10000000))/((1/50000) × (9999999/10000000) + 1 × (1/10000000))
= 9999999/(9999999 + 50000)
= 0.9950 (to 4 decimal places).
(c) This new information will decrease our initial value of P(I) which, remember, is the
probability that our suspect is innocent before we consider the fingerprint evidence.
From the information given it is now reasonable to treat the suspect as randomly-
chosen from among the 100 people with access to the building. Hence we take P(I c ) =
1/100 and P(I) = 99/100 which gives
P(I|F ) = ((1/50000) × (99/100))/((1/50000) × (99/100) + 1 × (1/100))
= 99/(99 + 50000)
= 0.0020 (to 4 decimal places).
Hence there is now only about a 0.2% chance that the suspect is innocent.
The above example illustrates the so-called “prosecutor’s fallacy” which is not just
of academic interest – it has been associated with several high-profile miscarriages
of justice.
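The effect of the prior P(I) on the posterior is the crux of the fallacy, and is easy to explore numerically; a sketch (the function name is mine; the choice P(F |I c ) = 1 follows the example above):

```python
from fractions import Fraction

def posterior_innocent(p_match_given_innocent, p_innocent):
    # Bayes' theorem with partition {I, I^c}, taking P(F | I^c) = 1.
    num = p_match_given_innocent * p_innocent
    return num / (num + (1 - p_innocent))

# Prior: a random citizen of a city of 10 million.
print(posterior_innocent(Fraction(1, 50000), Fraction(9999999, 10000000)))
# Prior: one of 100 people with access to the building.
print(posterior_innocent(Fraction(1, 50000), Fraction(99, 100)))
```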
• Read actively, not passively. Highlight your notes or annotate them as you
read and ask yourself questions as if you’re an annoying lecturer! For exam-
ple, when you read a definition, see if you can think of cases which satisfy
it and cases which don’t. In a proof, check you understand how to get from
every line to the next; even better, cover the proof up and see if you can
work it out for yourself, uncovering to give yourself a clue where necessary.
• Do the examples and exercises. As you read, you should try to redo the
examples/exercises which are already solved and attempt the “Further Ex-
ercises” if you have not done so. Remember that the starred exercises are
somewhat harder so could be skipped on a first pass of the notes; come
back to them later if you’re aiming for a high mark. For more advice on
examples/exercises, see the next section...
• Identify what you know and what you’re trying to find. If a problem is
phrased entirely in words, you will usually need to establish some notation
before you can do this (e.g., to define events) – often there are many valid
notational choices but you need to be clear and consistent.
• Think about the main tools that might help get from what you know to what
you’re trying to find. If you suspect a particular theorem or definition will
be useful, make sure you have the exact statement to hand.
• In a written solution, show all your working. For a proof you should try to
justify every step; for a more applied calculation you should indicate at least
the main methods you are using (e.g., “by inclusion-exclusion” or “using
Proposition 2.7”).
• Consider how to check your answer. This is a really important skill since
in the real world (and in exams!) there are rarely solutions to consult. For
instance, you can ask yourself whether a calculated probability is plausible
and whether there might be another way to do the same question.
Chapter 8
Remarks:
• If S is uncountable then this definition is, in fact, not quite correct. It turns
out that some functions are too complicated to regard as random variables
(just as some sets are too complicated to regard as events). This subtlety is
well beyond the scope of this module and will not concern us at all.
• Random variables are usually denoted by capital letters but should not be
confused with events. To aid the distinction, in this course we generally use
letters from towards the beginning of the alphabet for events and letters from
towards the end for random variables.
Random variables and events are different concepts but, as we now discuss,
events can be described in terms of random variables. If X is a random variable
then P(X) makes no sense as X is not an event. The set of all outcomes ω ∈ S
such that X(ω) = x is, however, an event. Note that we use a lowercase letter
(sometimes labelled with a subscript) to denote a particular value of a random
variable. We use the shorthand “X = x” for the set {ω ∈ S : X(ω) = x}.
Hence, for example, P(X = 2) does make sense; it is the probability that the
random variable X takes the value two. Similarly, “X ≤ x” denotes the set
{ω ∈ S : X(ω) ≤ x} so we can write things like P(X ≤ 6).
Another type of event involves the relationship between the values of different
random variables for the same experiment. For example, if Y and Z are both
random variables on the same sample space (i.e., functions with the same domain),
then “Y > 2Z” is shorthand for the set {ω ∈ S : Y (ω) > 2Z(ω)}; in other words,
P(Y > 2Z) is the probability of the event that the value of the random variable Y is
more than double the value of the random variable Z. There will be more detailed
analysis of cases where several random variables are of interest in Chapters 11
and 12.
(b) Evaluate each of the following or explain why it does not make sense: X( (5, 2) ),
X( (6, 4) ), X( (4, 6) ), X( (2, 2) ), X(5, 6), X(∅).
(c) Determine the following probabilities: P(X = 5), P(X = 3), P(X = 1), P(X ≤ 2),
P(X ≤ 12).
Solution:
The sample space is the set of all ordered pairs with elements which are integers between
1 and 6 (inclusive), i.e., S = {(j, k) : j, k ∈ {1, 2, 3, 4, 5, 6}}, and |S| = 36 (cf. Exercise 1.3,
amongst others).
(a) To get the sum of the two dice, for an outcome (j, k) we write down j + k.
This recipe is a function (j, k) 7→ j + k from S to the set of integers Z (or, if you prefer,
the set of natural numbers N). With this definition, we see that the domain is S and
the co-domain is Z (or N). It is also obvious that the range, i.e., the set of values that
X can actually take, is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}.
[Note that the function X, in common with most random variables, is not injective.]
However, rather sneakily, X(5, 6) does not make sense as the input must be a pair
(an element of S), not just two numbers. Similarly, X(∅) does not make sense as the
empty set is not an element of S here.
(c) The event “X = 5” contains all outcomes (pairs) such that the sum of the two rolls
is five, i.e., it is the set {(1, 4), (2, 3), (3, 2), (4, 1)}. Since the cardinality of this set is
four and all outcomes are equally likely, we have P(X = 5) = 4/36 = 1/9. The other
probabilities are similarly determined:
P(X = 3) = P({(1, 2), (2, 1)}) = 2/36 = 1/18,
P(X = 1) = P(∅) = 0,
P(X ≤ 2) = P(X = 2) = P({(1, 1)}) = 1/36,
P(X ≤ 12) = P(S) = 1.
(a) Let X be the random variable counting the number of Heads, and Y be the random
variable counting the number of Tails.
(i) State the range of the functions X and Y , i.e., list the values they can take.
(ii) Evaluate X(hht) and Y (hht).
(i) State the range of the function Z, i.e., list the values it can take.
(ii) Evaluate Z(hht).
(c) Determine the probability that we see more Tails than Heads.
Definition 8.2. A random variable X is discrete if the set of values that X takes
is either finite or countably infinite.
In this case we can label the possible values x1 , x2 , x3 , etc. and use xk for a generic
value. In this course we only really consider such discrete random variables.
Now let us turn to the central question of how to describe the probability
distribution of a discrete random variable. We need to associate probabilities to
the events of the random variable taking each possible value; this information is
encoded in the so-called probability mass function.
Definition 8.3. The probability mass function (p.m.f.) of a discrete random
variable X is the function p given by p(x) = P(X = x), i.e., the mapping

x 7→ P(X = x).
Remarks:
• Do not confuse the lower case p, which is the “name” of a function, with the
P for probability; p(X = x) and P(x) are both wrong notation.
• In situations with more than one random variable we can label each p.m.f.
with a subscript, e.g., pX (x) = P(X = x) and pY (y) = P(Y = y).
In the next section we will return to general properties of the p.m.f.; for the present
we note that it can be given either by a closed-form expression or by a table, as
illustrated in the following examples and exercises.
Solution:
The random variable X takes values in the set {2, 3, 4, . . . , 12}. Since the dice are fair, we
can calculate probabilities from the cardinalities of the associated events just as we did earlier:
² If X is a continuous random variable, what can you say about P(X = x)?
P(X = 2) = |{(1, 1)}|/36 = 1/36,
P(X = 3) = |{(1, 2), (2, 1)}|/36 = 2/36 = 1/18,
P(X = 4) = |{(1, 3), (2, 2), (3, 1)}|/36 = 3/36 = 1/12,
P(X = 5) = |{(1, 4), (2, 3), (3, 2), (4, 1)}|/36 = 4/36 = 1/9,
P(X = 6) = |{(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}|/36 = 5/36,
P(X = 7) = |{(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)}|/36 = 6/36 = 1/6,
P(X = 8) = |{(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)}|/36 = 5/36,
P(X = 9) = |{(3, 6), (4, 5), (5, 4), (6, 3)}|/36 = 4/36 = 1/9,
P(X = 10) = |{(4, 6), (5, 5), (6, 4)}|/36 = 3/36 = 1/12,
P(X = 11) = |{(5, 6), (6, 5)}|/36 = 2/36 = 1/18,
P(X = 12) = |{(6, 6)}|/36 = 1/36.
These results can simply be displayed in a table of the p.m.f.:
x          2     3     4     5     6     7     8     9     10    11    12
P(X = x)   1/36  1/18  1/12  1/9   5/36  1/6   5/36  1/9   1/12  1/18  1/36
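The whole table can also be generated programmatically by tallying X over the 36 outcomes; a sketch (Counter-based, names mine):

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Tally X((j, k)) = j + k over the 36 equally likely outcomes.
counts = Counter(j + k for j, k in product(range(1, 7), repeat=2))
pmf = {x: Fraction(c, 36) for x, c in sorted(counts.items())}

assert pmf[2] == pmf[12] == Fraction(1, 36)
assert pmf[7] == Fraction(1, 6)
assert sum(pmf.values()) == 1   # a sanity check (cf. Proposition 8.4)
```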
Solution:
Denoting as usual a Head by h and a Tail by t, the sample space can be written as
{t, ht, hht, hhht, . . .} where, e.g., hhht means three Heads followed by a Tail. The random
variable T takes values 1, 2, 3, 4, . . . , i.e., values from the countably infinite set of natural
numbers. It is easy to see that the function T is injective, so to calculate the p.m.f. we just
have to calculate the probabilities of the corresponding simple events. Since the coin is fair and different tosses
to calculate the probabilities of simple events. Since the coin is fair and different tosses
are physically unrelated (so we can assume independence and multiply probabilities) we
have
P(T = 1) = P({t}) = 1/2,
P(T = 2) = P({ht}) = 1/2 × 1/2 = 1/4,
P(T = 3) = P({hht}) = 1/2 × 1/2 × 1/2 = 1/8,
...
It is easy to spot the pattern and we can write the p.m.f. as a table
n          1    2    3    4     . . .
P(T = n)   1/2  1/4  1/8  1/16  . . .
or as a compact formula
P(T = n) = 1/2^n  for n ∈ N,
P(T = n) = 0      otherwise.
[Note that the “0 otherwise” line is sometimes not written in p.m.f. formulae, it being
simply assumed that the probability is zero for values of the random variable which are
not explicitly listed.]
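The geometric p.m.f. P(T = n) = 1/2^n can be probed numerically: each partial sum falls short of 1 by exactly the probability of an unbroken run of Heads. A sketch:

```python
from fractions import Fraction

def partial_sum(N):
    # Sum of P(T = n) = 1/2**n for n = 1, ..., N.
    return sum(Fraction(1, 2**n) for n in range(1, N + 1))

# The partial sum misses 1 by the probability 1/2**N of N initial Heads.
assert partial_sum(10) == 1 - Fraction(1, 2**10)
assert partial_sum(50) == 1 - Fraction(1, 2**50)
```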
Proposition 8.4. If X is a discrete random variable taking values x1 , x2 , x3 , . . . , then

Σ_k P(X = xk ) = 1,

where the sum is over all values which X takes (a finite or infinite set).
Proof:
The random variable X takes the values xk (with k = 1, 2, 3, . . .); we let Ak be
the event “X = xk ”. The Ak ’s are pairwise disjoint, which can easily be proved by
contradiction. [If Ai and Aj were not disjoint for some i ≠ j then Ai ∩ Aj would
contain at least one common element, say ω, but that would mean that X(ω) = xi
and X(ω) = xj , which is impossible if i ≠ j.] Furthermore A1 ∪ A2 ∪ · · · = S since,
for any ω ∈ S, X(ω) takes some value. [X(ω) = xi means ω ∈ Ai so there is no
ω ∈ S which is not in one of the Ak ’s.] In other words, the Ak ’s partition the
sample space; together with Kolmogorov’s axioms, this yields
Σ_k P(X = xk ) = P(X = x1 ) + P(X = x2 ) + P(X = x3 ) + · · ·
= P(A1 ) + P(A2 ) + P(A3 ) + · · ·
= P(A1 ∪ A2 ∪ A3 ∪ · · · ) [using Definition 2.1(c)]
= P(S)
=1 [using Definition 2.1(b)]
Note that Proposition 8.4 provides a good way to check that a calculated p.m.f. is
at least reasonable.
Solution:
For the p.m.f. of X, we have to check a finite sum:
Σ_{x=2}^{12} P(X = x) = P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) + P(X = 6) + P(X = 7)
+ P(X = 8) + P(X = 9) + P(X = 10) + P(X = 11) + P(X = 12)
= 1/36 + 2/36 + 3/36 + 4/36 + 5/36 + 6/36 + 5/36 + 4/36 + 3/36 + 2/36 + 1/36
= 36/36
= 1.
For the p.m.f. of T , we instead have an infinite sum:

Σ_{n=1}^{∞} P(T = n) = 1/2 + 1/4 + 1/8 + · · · = (1/2)/(1 − 1/2) = 1.
[We have used here the formula for the sum of a geometric series (see Exercise 8.14); you
will learn much more about infinite series in the module Calculus II.] Hence, in both cases
Proposition 8.4 holds, as of course it must.
The fact that the events “X = xk ” are pairwise disjoint means that we can
find the probabilities of other events by summing values of the probability mass
function. For example, if the random variable X takes values in the integers, then
P(0 ≤ X < 3) = P(X = 0) + P(X = 1) + P(X = 2). We can also find the p.m.f.
of another random variable, say Y , which is itself a function of X, by considering
which values of X are mapped to which values of Y .
x          −2    −1    0    1    2
P(X = x)   1/10  2/5  1/4  1/5  1/20
³ For continuous (non-discrete) random variables, one can still define the c.d.f. but the p.m.f.
is replaced by a probability density function (p.d.f.).
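To illustrate finding the p.m.f. of a function of a random variable using the table above, here is a sketch computing the p.m.f. of Y = X² (the choice of function is mine, purely for illustration):

```python
from collections import defaultdict
from fractions import Fraction

pmf_X = {-2: Fraction(1, 10), -1: Fraction(2, 5), 0: Fraction(1, 4),
         1: Fraction(1, 5), 2: Fraction(1, 20)}

# Push the probability of each value x onto the value y = x**2 it maps to.
pmf_Y = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_Y[x**2] += p

# E.g. P(Y = 4) = P(X = -2) + P(X = 2) = 1/10 + 1/20 = 3/20.
assert pmf_Y[4] == Fraction(3, 20)
assert pmf_Y[1] == Fraction(3, 5)
assert pmf_Y[0] == Fraction(1, 4)
```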
(c) Confirm the statement in Proposition 8.4 for the p.m.f. calculated in (b).
1 + z + z^2 + z^3 + · · · + z^(n−1) = Σ_{k=0}^{n−1} z^k = (1 − z^n)/(1 − z).
(b) Now assume that |z| < 1. By taking the limit n → ∞, derive the sum of the geometric
series
1 + z + z^2 + z^3 + · · · = Σ_{k=0}^{∞} z^k = 1/(1 − z).
(c) Suppose you toss the coin of Exercise 8.13 until either a Head appears or a total
of n Tails has been seen. Let the random variable T be the number of flips made.
Determine the probability mass function of T and use your result from part (a) to
verify that Proposition 8.4 holds.
(a) Let the random variable Yn be the number of Tails appearing when a fair coin is
tossed n times. Determine the probability mass function of Yn and hence deduce a formula for the probability of obtaining exactly two Tails.
(b) You play a game where you first choose a positive integer n and then flip a fair coin
n times. You win a prize if you get exactly two Tails. How should you choose n
to maximize your chances of winning? What is the probability of winning with an
optimal choice of n?
Chapter 9
Remarks:
• The sum again ranges over all the possible values of the random variable;
there are some further subtleties in the case of infinite sums (largely beyond
the scope of this course).
• The expected value does not have to be one of the possible values of the
random variable.
9.1 Expected Value
Solution:
The random variable W obviously has p.m.f. P(W = w) = 1/6 for w ∈ {1, 2, 3, 4, 5, 6} so
\begin{align*}
E(W) &= \sum_{w=1}^{6} w\,P(W = w) \\
&= 1 \times \tfrac{1}{6} + 2 \times \tfrac{1}{6} + 3 \times \tfrac{1}{6} + 4 \times \tfrac{1}{6} + 5 \times \tfrac{1}{6} + 6 \times \tfrac{1}{6} \\
&= \tfrac{7}{2}.
\end{align*}
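The weighted sum in Definition 9.1 is mechanical enough to verify in a couple of lines; here is a minimal sketch (illustrative code, not part of the original notes).

```python
from fractions import Fraction

# p.m.f. of a single fair die: P(W = w) = 1/6 for w = 1, ..., 6.
pmf = {w: Fraction(1, 6) for w in range(1, 7)}

# Definition 9.1: E(W) is the probability-weighted sum of the values.
expectation = sum(w * p for w, p in pmf.items())
assert expectation == Fraction(7, 2)   # 3.5 -- not itself a possible roll
```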
Example 9.1 and Exercise 9.2 clearly illustrate that the expectation may or
may not be one of the values the random variable can actually take: it is possible
for the sum of two dice rolls to be seven (in fact the most likely sum we see) but
it is certainly not possible to roll 3.5 on a single die! What then can we say about
the expectation in general? Is there any way to check our calculations? Well, you
would (hopefully!) have been surprised if you had calculated E(W ) for the single
die as 7.1 or E(X) for the two dice as 1.5. This leads to the following proposition.
Proposition. If every value taken by the discrete random variable X lies between m and M, then
\[
m \le E(X) \le M.
\]
Proof:
If every value xk (k = 1, 2, 3, . . . ) that X takes is less than or equal to M , xk ≤ M ,
we have that xk P(X = xk ) ≤ M P(X = xk ) [since probabilities are non-negative
by Definition 2.1(a)] and, using also Proposition 8.4,
\[
E(X) = \sum_k x_k P(X = x_k) \le \sum_k M P(X = x_k) = M \sum_k P(X = x_k) = M.
\]
In general, it turns out that if a random variable can take infinitely many values,
its expectation may be infinite or even not well defined.1 From the point of view
of the present course, this is a complication that need not trouble you but see
Exercise 9.17 for a non-examinable challenge.
Solution:
Since we already calculated the p.m.f. of Y we can simply use that to determine E(Y):
\[
E(Y) = \sum_j y_j\, P(Y = y_j) = \frac{26}{5}.
\]
Notice, however, that we can write this calculation in another way; with f(X) = X² + 4, we have
\[
E(Y) = E(f(X)) = \sum_k f(x_k)\, P(X = x_k) = \sum_k (x_k^2 + 4)\, P(X = x_k).
\]
You should check that by using the values in the p.m.f. of X (see Exercise 8.10) we obtain again E(Y) = 26/5.
¹This is related to the question of whether or not an infinite series converges, which you can learn about in “Calculus II” and similar modules.
9.2 Expectation of a Function of a Random Variable
The above example illustrates a general principle which is stated in the next
proposition.
Solution:
Definition 9.4. The nth moment of the random variable X is the expectation
E(X n ).
Such expectations can easily be calculated for discrete random variables using
Proposition 9.3.2 Their values give information about the “shape” of the proba-
bility mass function. In particular, the second moment is related to the variance
which quantifies the spread of the distribution.
Remarks:
• Armed with Proposition 9.3, we see that Var(X) = E([X − E(X)]2 ), i.e., it
is the expectation of the square of the difference between X and E(X).
• Since the square of any real number is non-negative and the values of the
p.m.f. are also non-negative [from Definition 2.1(a) of course], it is clear that
Var(X) ≥ 0.
• The square root of the variance is called the standard deviation. Mathe-
matically it is usually more convenient to work with the variance than the
standard deviation.
Solution:
\[
E(X) = 99 \times \tfrac{1}{2} + 101 \times \tfrac{1}{2} = 100,
\]
and
\[
E(Y) = 90 \times \tfrac{1}{2} + 110 \times \tfrac{1}{2} = 100.
\]
[These results are also obvious from a symmetry argument.] Hence the expectations
of the amounts gained from the two investments are the same.
\[
\mathrm{Var}(X) = (99 - 100)^2 \times \tfrac{1}{2} + (101 - 100)^2 \times \tfrac{1}{2} = (1)^2 = 1,
\]
and
\[
\mathrm{Var}(Y) = (90 - 100)^2 \times \tfrac{1}{2} + (110 - 100)^2 \times \tfrac{1}{2} = (10)^2 = 100.
\]
So the variance of Y is much bigger than that of X; we can interpret this as the second
investment being, in some sense, riskier.
Proof:
As usual we write x1 , x2 , x3 , . . . for the possible values of X. Starting from Defini-
tion 9.5 and the fact that [xk − E(X)]2 = (xk )2 − 2E(X)xk + [E(X)]2 , we have
\begin{align*}
\mathrm{Var}(X) &= \sum_k [x_k - E(X)]^2\, P(X = x_k) \\
&= \sum_k (x_k)^2 P(X = x_k) - 2E(X) \sum_k x_k P(X = x_k) + [E(X)]^2 \sum_k P(X = x_k) \\
&= E(X^2) - 2E(X) \sum_k x_k P(X = x_k) + [E(X)]^2 \sum_k P(X = x_k) && \text{[from Proposition 9.3]} \\
&= E(X^2) - 2E(X) \times E(X) + [E(X)]^2 \times 1 && \text{[using Definition 9.1 and Proposition 8.4]} \\
&= E(X^2) - [E(X)]^2.
\end{align*}
Remarks:
• You can remember this expression for the variance as “the mean of the square
minus the square of the mean”.
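Both routes to the variance are easy to compare numerically; the sketch below (illustrative code, using the two investment p.m.f.s of the example above) checks that Definition 9.5 and the shortcut of Proposition 9.6 agree.

```python
from fractions import Fraction

def var_by_definition(pmf):
    # Definition 9.5: Var(X) = sum_k [x_k - E(X)]^2 * P(X = x_k).
    mean = sum(x * p for x, p in pmf.items())
    return sum((x - mean) ** 2 * p for x, p in pmf.items())

def var_by_shortcut(pmf):
    # Proposition 9.6: Var(X) = E(X^2) - [E(X)]^2,
    # "the mean of the square minus the square of the mean".
    mean = sum(x * p for x, p in pmf.items())
    return sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2

half = Fraction(1, 2)
X = {99: half, 101: half}   # first investment of the example above
Y = {90: half, 110: half}   # second investment

assert var_by_definition(X) == var_by_shortcut(X) == 1
assert var_by_definition(Y) == var_by_shortcut(Y) == 100
```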
Solution:
Using the formula in Proposition 9.6 we have
E(aX + b) = aE(X) + b.
Remarks:
\[
\mathrm{Var}(aX + b) = a^2\,\mathrm{Var}(X).
\]
Proof:
Starting from Definition 9.5, we have
\begin{align*}
\mathrm{Var}(aX + b) &= E\big( [aX + b - E(aX + b)]^2 \big) \\
&= E\big( [aX + b - aE(X) - b]^2 \big) && \text{[using Proposition 9.7]} \\
&= E\big( a^2 [X - E(X)]^2 \big) \\
&= a^2\, E\big( [X - E(X)]^2 \big) = a^2\,\mathrm{Var}(X).
\end{align*}
Remarks:
• Note that adding a constant to a random variable does not change the vari-
ance; this is intuitively reasonable as the spread of the distribution is un-
changed by a shift.
Solution:
Using the results of Exercises 9.2 and 9.9, together with Propositions 9.7 and 9.8, we have
\[
E(Y) = E\!\left(\frac{X}{2}\right) = \frac{1}{2}\,E(X) = \frac{7}{2},
\]
and
\[
\mathrm{Var}(Y) = \mathrm{Var}\!\left(\frac{X}{2}\right) = \left(\frac{1}{2}\right)^{\!2} \mathrm{Var}(X) = \frac{35/6}{4} = \frac{35}{24}.
\]
Notice that E(Y ) is the same as the expectation of the number on a single die (Example 9.1)
but Var(Y ) is smaller than the variance of the number on a single die (Example 9.8). Can
you understand why?
In the next chapter we will derive formulae for the expectation and variance of
various distributions which appear so frequently that they are given special names.
(a) E(3X),
(b) Var(3X),
(e) E(4 − 3X²).
(a)
\[
1 + 2z + 3z^2 + 4z^3 + \cdots = \sum_{k=1}^{\infty} k z^{k-1} = \frac{1}{(1-z)^2},
\]
(b)
\[
1 + 4z + 9z^2 + 16z^3 + \cdots = \sum_{k=1}^{\infty} k^2 z^{k-1} = \frac{2}{(1-z)^3} - \frac{1}{(1-z)^2}.
\]
(b) Deduce that if E(Z) < 1, then Z takes the value 0 with non-zero probability.
\[
P(Z \ge t) \le \frac{E(Z)}{t}.
\]
(a) Consider a game where you flip a fair coin until it comes up Heads, starting with £2
and doubling the prize fund with every appearance of Tails. Let the random variable
N be the number of flips; if N takes value n, you win £$2^n$. Let X denote the prize
you win (in pounds). Find E(X) and discuss how much you would be prepared to pay
to enter such a game.
(b) Now consider the following two-player game. Angela and Boris flip a fair coin until
it comes up Tails. The number of flips needed is again a random variable N taking
values n = 1, 2, 3, . . . . If n is odd then Boris pays Angela $2^n$ €; if n is even then Angela
pays Boris $2^n$ €. Let Y denote Boris’ net reward (in euros). Show that E(Y) does not
exist.
Chapter 10
\[
\begin{array}{c|cc}
k & 0 & 1 \\
\hline
P(X = k) & 1-p & p
\end{array}
\]
This p.m.f. is called the Bernoulli distribution, with parameter p, and we write
X ∼ Bernoulli(p), where the symbol “∼” loosely means “has the distribution of”.
As with all the distributions in this chapter, we are interested in the expectation
and variance. In this case they are very easily calculated; from Definitions 9.1
and 9.5, we have
\begin{align}
E(X) &= 0 \times (1-p) + 1 \times p = p, \tag{10.1} \\
\mathrm{Var}(X) &= (0-p)^2 \times (1-p) + (1-p)^2 \times p = p(1-p)[p + (1-p)] = p(1-p). \tag{10.2}
\end{align}
Note that the Bernoulli distribution applies whenever the sample space of an exper-
iment is partitioned into “success” and “failure” and we are interested in whether
or not a “success” occurs.
10.2 Binomial Distribution
Solution:
The sample space of the experiment is {1, 2, 3, 4, 5, 6}. Letting A be the event that we
roll a six (i.e., A = {6} and Ac = {1, 2, 3, 4, 5}), then we have Y (A) = 1 and Y (Ac ) = 0.
In this experiment, rolling a six is identified with “success”; the success probability is
P(Y = 1) = P(A) = 1/6 and so Y ∼ Bernoulli(1/6). Hence, from (10.1) and (10.2) above,
we obtain
\[
E(Y) = \frac{1}{6} \quad \text{and} \quad \mathrm{Var}(Y) = \frac{1}{6}\left(1 - \frac{1}{6}\right) = \frac{5}{36}.
\]
• First consider the special outcome that the first k trials are successes and the
remaining n − k trials are failures. Using mutual independence, the probability
of this simple event is $p^k (1-p)^{n-k}$.
We call this the binomial distribution, with parameters n and p, and write
X ∼ Bin(n, p). Note that the Bernoulli(p) distribution is just Binomial(1, p).
¹You can think of this as unordered sampling of k trials from n without replacement: the order doesn’t matter but once a trial has been picked to be a success, it can’t be picked again.
In dealing with the binomial distribution, the following identities are often
useful:
\begin{align}
(a+b)^n &= \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k} && \text{[binomial theorem]}, \tag{10.3} \\
\binom{n}{k} &= \frac{n}{k} \binom{n-1}{k-1}. \tag{10.4}
\end{align}
(a) Use (10.3) to verify Proposition 8.4 for the binomial distribution.
(b) Starting from the definition of the binomial coefficient, prove (10.4).
Armed with (10.3) and (10.4), we now turn to the expectation and variance. The
expectation is calculated as
\begin{align*}
E(X) &= \sum_{k=0}^{n} k\,P(X = k) \\
&= \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} \\
&= \sum_{k=1}^{n} n \binom{n-1}{k-1} p^k (1-p)^{n-k} && \text{[using (10.4)]} \\
&= np \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{n-k} \\
&= np \sum_{\ell=0}^{n-1} \binom{n-1}{\ell} p^{\ell} (1-p)^{n-1-\ell} && \text{[setting } \ell = k-1\text{]} \\
&= np,
\end{align*}
since by (10.3) the remaining sum is $(p + (1-p))^{n-1} = 1$.
In fact, we will see a much easier argument for (10.5) and (10.6) in the next chapter!
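Identity (10.4) and the results (10.5), (10.6) can be checked mechanically; the sketch below (illustrative code, with n and p chosen arbitrarily) verifies them by exact summation over the binomial p.m.f.

```python
from fractions import Fraction
from math import comb

n, p = 10, Fraction(1, 3)

# Identity (10.4): C(n, k) = (n/k) C(n-1, k-1), rearranged to stay in integers.
assert all(k * comb(n, k) == n * comb(n - 1, k - 1) for k in range(1, n + 1))

# Binomial p.m.f.: P(X = k) = C(n, k) p^k (1-p)^(n-k).
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
mean = sum(k * q for k, q in pmf.items())
variance = sum(k**2 * q for k, q in pmf.items()) - mean**2

assert sum(pmf.values()) == 1        # Proposition 8.4, via (10.3)
assert mean == n * p                 # (10.5): E(X) = np
assert variance == n * p * (1 - p)   # (10.6): Var(X) = np(1-p)
```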
10.3 Geometric Distribution
Solution:
Let the number of Heads seen in ten tosses be Z. For a fair coin the probability of a Head
on each toss (i.e., the success probability in each trial) is 1/2. Hence Z ∼ Bin(10, 1/2)
and, from (10.5) and (10.6),
\[
E(Z) = 10 \times \frac{1}{2} = 5 \quad \text{and} \quad \mathrm{Var}(Z) = 10 \times \frac{1}{2} \times \left(1 - \frac{1}{2}\right) = \frac{5}{2}.
\]
We say that T has the geometric distribution with parameter p and write
T ∼ Geom(p). A word of warning is in order here: there is an alternative defini-
tion of the geometric distribution which involves counting the number of failures
(0, 1, 2, . . .) before the first success; in this course, we always use the definition
above but you should check carefully if consulting other books or websites.
In order to derive the expectation and variance of the geometric distribution
we will need the following two results for the sums of infinite series with |z| < 1:
\begin{align}
\sum_{k=1}^{\infty} k z^{k-1} &= 1 + 2z + 3z^2 + 4z^3 + \cdots = \frac{1}{(1-z)^2}, \tag{10.7} \\
\sum_{k=1}^{\infty} k^2 z^{k-1} &= 1 + 4z + 9z^2 + 16z^3 + \cdots = \frac{2}{(1-z)^3} - \frac{1}{(1-z)^2}. \tag{10.8}
\end{align}
[One way to prove these is to start from the known formula for the sum of a
geometric series and differentiate both sides; the derivations belong more properly
in a calculus or analysis course but see Exercise 9.15 if you want to have a try.]
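Short of a full proof, (10.7) and (10.8) are easy to check numerically; the sketch below (illustrative code, with z chosen arbitrarily inside the unit interval) compares truncated series against the closed forms.

```python
# Numerical check of (10.7) and (10.8) for a sample value with |z| < 1.
z, N = 0.3, 200   # 200 terms leave a negligible tail for z = 0.3

s1 = sum(k * z**(k - 1) for k in range(1, N + 1))
s2 = sum(k**2 * z**(k - 1) for k in range(1, N + 1))

assert abs(s1 - 1 / (1 - z)**2) < 1e-12                      # (10.7)
assert abs(s2 - (2 / (1 - z)**3 - 1 / (1 - z)**2)) < 1e-12   # (10.8)
```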
10.4 Poisson Distribution
Substituting from (10.9) and (10.10) into Proposition 9.6, we have for the variance
\[
\mathrm{Var}(T) = E(T^2) - [E(T)]^2 = \frac{1-p}{p^2}.
\]
Now suppose that the random variable X takes values in the non-negative integers with probability mass function
\[
P(X = k) = \frac{\lambda^k}{k!}\, e^{-\lambda} \quad \text{for } k = 0, 1, 2, \ldots,
\]
where λ > 0 is a constant. In this case we say that X has the Poisson distribution with parameter λ and
we write X ∼ Poisson(λ). If 0 < λ ≤ 1 then the p.m.f. P(X = k) is monotonically
decreasing; if λ > 1 then the p.m.f. has a maximum at a finite value k > 0.
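This shape claim follows from the ratio P(X = k+1)/P(X = k) = λ/(k+1), and can be confirmed numerically; the sketch below (illustrative code, with the two λ values chosen arbitrarily) checks both regimes.

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    # P(X = k) = lambda^k e^(-lambda) / k!
    return lam**k * exp(-lam) / factorial(k)

# lambda = 0.7 <= 1: the p.m.f. decreases from k = 0 onwards,
# since P(X = k+1)/P(X = k) = lambda/(k+1) < 1 for every k.
small = [poisson_pmf(0.7, k) for k in range(20)]
assert all(small[k] > small[k + 1] for k in range(19))

# lambda = 3.4 > 1: the ratio exceeds 1 while k + 1 < lambda,
# so the maximum sits at the finite value k = 3.
large = [poisson_pmf(3.4, k) for k in range(20)]
assert max(range(20), key=large.__getitem__) == 3
```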
For the variance, we use the trick of first considering E(X²) − E(X) = E(X(X−1)), which can be
obtained as
\[
E(X(X-1)) = \sum_{k=0}^{\infty} k(k-1)\, \frac{\lambda^k}{k!}\, e^{-\lambda} = \lambda^2 \sum_{k=2}^{\infty} \frac{\lambda^{k-2}}{(k-2)!}\, e^{-\lambda} = \lambda^2,
\]
so that
\[
\mathrm{Var}(X) = E(X^2) - [E(X)]^2 = \lambda^2 + E(X) - [E(X)]^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.
\]
A Senior Elf produces W_S presents per hour according to a Poisson distribution with parameter λ = 2, i.e.,
\[
P(W_S = k) = \frac{2^k e^{-2}}{k!} \quad \text{for } k \in \{0, 1, 2, 3, \ldots\}.
\]
A Junior Elf produces WJ presents per hour, also according to a Poisson distribution but
with parameter λ = 1. One morning, a snap inspection is organised to check whether
the elves are meeting the minimum performance criterion of each producing at least one
present per hour.
(a) Determine the probability that a Senior Elf fails to meet the criterion.
(b) For a team of one Senior Elf and two Junior Elves (all working independently), find
the probability that at least one elf fails to meet the criterion.
[You may leave powers of e in your answers.]
10.5 Distributions in Practice
\[
\begin{array}{llccc}
\text{Distribution} & \text{Values} & P(X = k) & E(X) & \mathrm{Var}(X) \\
\hline
X \sim \text{Bernoulli}(p) & 0, 1 & 1-p \text{ for } k = 0;\;\; p \text{ for } k = 1 & p & p(1-p) \\[4pt]
X \sim \text{Bin}(n, p) & 0, 1, \ldots, n & \dbinom{n}{k} p^k (1-p)^{n-k} & np & np(1-p) \\[4pt]
X \sim \text{Geom}(p) & 1, 2, 3, \ldots & (1-p)^{k-1} p & \dfrac{1}{p} & \dfrac{1-p}{p^2} \\[4pt]
X \sim \text{Poisson}(\lambda) & 0, 1, 2, \ldots & \dfrac{\lambda^k}{k!} e^{-\lambda} & \lambda & \lambda
\end{array}
\]
Note, however, that the first row is not really necessary since Bernoulli(p) is just
Bin(1, p).
An oft-asked question is how to determine which distribution applies in a par-
ticular situation. Of course, sometimes an exercise or exam problem will explicitly
give a distribution (especially in the Poissonian case) but, if not, a very good clue
is the range of the random variable in question – note the differences in the second
column of the table above. Loosely speaking, we can say the following.
• If a random variable takes only two possible values, it can always be related
to a Bernoulli random variable.
In real life of course, a random variable may not have any of the above distributions
but one of them may serve as a good approximation. You will see more of this in
future courses but, for now, note that it is always a good idea to state clearly any
assumptions you are making in modelling a situation.
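The expectation and variance columns of the table can be sanity-checked by summing the p.m.f.s directly; the sketch below (illustrative code, with parameters chosen arbitrarily and the infinite sums truncated) does this for the two distributions with infinite range.

```python
from fractions import Fraction
from math import exp, factorial

p, lam, CUT = Fraction(2, 5), 1.5, 150   # CUT truncates the infinite sums

# Geometric(p) on k = 1, 2, 3, ...: P(T = k) = (1-p)^(k-1) p.
geo = {k: (1 - p)**(k - 1) * p for k in range(1, CUT)}
g_mean = sum(k * q for k, q in geo.items())
g_var = sum(k**2 * q for k, q in geo.items()) - g_mean**2
assert abs(g_mean - 1 / p) < 1e-9            # E(T) = 1/p
assert abs(g_var - (1 - p) / p**2) < 1e-9    # Var(T) = (1-p)/p^2

# Poisson(lam): P(X = k) = lam^k e^(-lam) / k!.
poi = {k: lam**k * exp(-lam) / factorial(k) for k in range(CUT)}
p_mean = sum(k * q for k, q in poi.items())
p_var = sum(k**2 * q for k, q in poi.items()) - p_mean**2
assert abs(p_mean - lam) < 1e-9 and abs(p_var - lam) < 1e-9
```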
Solution:
If W = a, then X = (a − a)/(b − a) = 0; if W = b then X = (b − a)/(b − a) = 1. Hence
P(X = 0) = P(W = a) = 1 − p and P(X = 1) = P(W = b) = p, i.e., X ∼ Bernoulli(p).
[Note that one can rearrange to get W = a + (b − a)X and obtain the expectation and
variance of W from the known Bernoulli results for E(X) and Var(X); in most cases,
however, it is probably easier to calculate E(W ) and Var(W ) directly.]
Solution:
(a) Each random pick results in a red ball with probability M/N and the outcome of each
pick is independent of all the others (i.e. we perform n independent Bernoulli trials
with p = M/N ). Hence R ∼ Bin(n, M/N ) and we have
\[
P(R = k) = \binom{n}{k} \left(\frac{M}{N}\right)^{\!k} \left(1 - \frac{M}{N}\right)^{\!n-k} \quad \text{for } k = 0, 1, 2, \ldots, n,
\]
\[
E(R) = n\,\frac{M}{N}, \qquad \mathrm{Var}(R) = n\,\frac{M}{N}\left(1 - \frac{M}{N}\right).
\]
(b) Treating the situation as unordered sampling without replacement, the sample space
has cardinality $\binom{N}{n}$ since we choose n balls from N. The event “R = k” corresponds
to choosing k red balls from M red balls, which can be done in $\binom{M}{k}$ ways, and choosing
n − k non-red balls from N − M non-red balls, which can be done in $\binom{N-M}{n-k}$ ways.
Hence, since the outcomes are all equally likely,
\[
P(R = k) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}. \tag{10.14}
\]
Obviously, R cannot be larger than M and n − R cannot be larger than N − M. Hence,
the possible values k of R satisfy max(0, n − (N − M)) ≤ k ≤ min(n, M). However,
with the convention that $\binom{a}{k} = 0$ for integers k > a ≥ 0, the above formula for
the p.m.f. is valid for k = 0, 1, 2, . . . , n and gives zero probability to any impossible
values of R. [You can check that one gets the same p.m.f. by treating the situation as
ordered sampling without replacement.]
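Formula (10.14) and the zero-probability convention can be checked directly; the sketch below (illustrative code, with N, M, n chosen arbitrarily) also verifies the standard fact, not derived in these notes, that the hypergeometric mean is nM/N.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(N, M, n):
    # (10.14): P(R = k) = C(M, k) C(N-M, n-k) / C(N, n); math.comb already
    # returns 0 when k exceeds the top argument, matching the convention.
    return {k: Fraction(comb(M, k) * comb(N - M, n - k), comb(N, n))
            for k in range(n + 1)}

N, M, n = 10, 4, 3
pmf = hypergeom_pmf(N, M, n)
assert sum(pmf.values()) == 1   # Proposition 8.4
# Standard fact (assumed, not proved here): E(R) = nM/N.
assert sum(k * q for k, q in pmf.items()) == Fraction(n * M, N)
```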
In fact, (10.14) gives the probability mass function of the so-called hypergeometric
distribution; you don’t need to know formulae for its expectation and variance but
10.6 Further Exercises
You may leave any powers of e in your answers but you should simplify all factorials and
other powers.
[Answers: 9/(2e3 ), 1 − 17/(2e3 ), 4/27, 19/27, 25/216]
(b) Say in words what the conclusion of part (a) means for the lifetime of the component.
Why do you think this is sometimes called the “memoryless property” of the geometric
distribution?
(a) Suppose that the fisherman catches m fish. What is the probability that k of them
are salmon?
(c) Using part (b), find the probability mass function of Y . What is the name of the
distribution of Y ?
(a) Your friend proposes the following solution: There are three possible ways in which
we could have a Head followed by another Head (at the first and second, the second
and third, or third and fourth toss). We have a probability 1/2 × 1/2 = 1/4 of
getting a Head followed by another Head at each of these positions. Hence N is the
number of successes in three Bernoulli trials each with success probability 1/4 and so
N ∼ Bin(3, 1/4). Explain carefully what is wrong with this argument.
(b) Determine the correct probability mass function and then the expectation and variance
of N .
Definition 11.1. Let X and Y be two discrete random variables defined on the
same sample space and taking values x1 , x2 , . . . and y1 , y2 , . . . respectively. The
function
\[
(x_k, y_\ell) \mapsto P( (X = x_k) \cap (Y = y_\ell) )
\]
is called the joint probability mass function (joint p.m.f.) of X and Y; we abbreviate
the probability on the right as P(X = x_k, Y = y_\ell).
Remarks:
• The values of the joint p.m.f. must be non-negative and sum up to one:
\[
\sum_k \sum_\ell P(X = x_k, Y = y_\ell) = 1 \quad \Big(\text{also written as } \sum_{k,\ell} P(X = x_k, Y = y_\ell) = 1\Big).
\]
• The definition can be easily extended to joint distributions of three (or more)
random variables but three dimensional tables are more difficult to construct!
11.1 Joint and Marginal Distributions
Solution:
We can treat this situation as unordered sampling without replacement (see Section 3.4).
As we are picking three balls from a set of seven balls, we have
\[
|S| = \binom{7}{3} = \frac{7!}{4!\,3!} = 35.
\]
The event (R = 1) ∩ (Y = 1) is the event that we pick one red ball (from three red balls),
one yellow ball (from two yellow balls), and one green ball (from two green balls). Since
all outcomes are equally likely, we can calculate the probability of this event as:
\[
P(R = 1, Y = 1) = \frac{|(R = 1) \cap (Y = 1)|}{|S|} = \frac{\binom{3}{1}\binom{2}{1}\binom{2}{1}}{35} = \frac{3 \times 2 \times 2}{35} = \frac{12}{35}.
\]
It is even easier to calculate P(R = 3, Y = 1); it is impossible to draw three red balls and
one yellow ball if we only draw three balls in total so (R = 3) ∩ (Y = 1) = ∅ and
P(R = 3, Y = 1) = 0.
\[
\begin{array}{c|cccc}
Y \backslash R & 0 & 1 & 2 & 3 \\
\hline
0 & & & & \\
1 & & 12/35 & & 0 \\
2 & & & &
\end{array}
\]
The next proposition relates the joint distribution P(X = xk , Y = y` ) and the
so-called marginals P(X = xk ) and P(Y = y` ).
Proposition 11.2. Let X and Y be two discrete random variables defined on the
same sample space and taking values x1 , x2 , . . . and y1 , y2 , . . . respectively. The
marginal distribution of X can be obtained from the joint distribution as
\[
P(X = x_k) = \sum_{\ell} P(X = x_k, Y = y_\ell).
\]
11.2 Expectations in the Multivariate Context
Loosely speaking, the idea is that if we only care about the probability of X
taking a particular value, we need to sum over all possible values of Y (and vice
versa, of course). The values of the marginals are the column sums and the row
sums in a table of the joint p.m.f.; they can be written in the margins, hence the
name.
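The row-sum/column-sum picture is easy to mechanize; here is a minimal sketch (illustrative code, with a small hypothetical joint p.m.f. whose values are invented purely for illustration).

```python
from fractions import Fraction

F = Fraction
# Hypothetical joint p.m.f., indexed as (x, y).
joint = {(0, 0): F(1, 8), (0, 1): F(3, 8),
         (1, 0): F(1, 4), (1, 1): F(1, 4)}

# Proposition 11.2: sum over the other variable to get each marginal --
# exactly the row sums and column sums of the table.
marginal_X, marginal_Y = {}, {}
for (x, y), prob in joint.items():
    marginal_X[x] = marginal_X.get(x, F(0)) + prob
    marginal_Y[y] = marginal_Y.get(y, F(0)) + prob

assert marginal_X == {0: F(1, 2), 1: F(1, 2)}
assert marginal_Y == {0: F(3, 8), 1: F(5, 8)}
assert sum(marginal_X.values()) == sum(marginal_Y.values()) == 1
```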
\[
E\big( g(X, Y) \big) = \sum_{k} \sum_{\ell} g(x_k, y_\ell)\, P(X = x_k, Y = y_\ell),
\]
where the sum ranges over all possible values x_k, y_\ell of the two random variables.
Remarks:
• Again we implicitly assume (in this course) that such expectations are well
defined, even when they involve infinite sums.
• Setting g(X, Y ) = 1 recovers the result that the values of the joint p.m.f.
sum to one; setting g(X, Y ) = Y gives an expression for E(Y ), and so on.
\[
\begin{array}{c|cc}
V \backslash U & 1 & 2 \\
\hline
1 & 1/2 & 1/6 \\
3 & 1/3 & 0
\end{array}
\]
Solution:
From Proposition 11.3, the required expectation can be written as a sum of four terms:
\begin{align*}
E(UV + U) &= (1 \times 1 + 1) \times \tfrac{1}{2} + (2 \times 1 + 2) \times \tfrac{1}{6} + (1 \times 3 + 1) \times \tfrac{1}{3} + (2 \times 3 + 2) \times 0 \\
&= 2 \times \tfrac{1}{2} + 4 \times \tfrac{1}{6} + 4 \times \tfrac{1}{3} \\
&= 3.
\end{align*}
Proof:
Starting from Proposition 11.3, and using series properties, we have
\begin{align*}
E(X + Y) &= \sum_k \sum_\ell (x_k + y_\ell)\, P(X = x_k, Y = y_\ell) \\
&= \sum_k \sum_\ell x_k\, P(X = x_k, Y = y_\ell) + \sum_k \sum_\ell y_\ell\, P(X = x_k, Y = y_\ell) \\
&= \sum_k x_k \left( \sum_\ell P(X = x_k, Y = y_\ell) \right) + \sum_\ell y_\ell \left( \sum_k P(X = x_k, Y = y_\ell) \right) \\
&= \sum_k x_k\, P(X = x_k) + \sum_\ell y_\ell\, P(Y = y_\ell) && \text{[from Proposition 11.2]} \\
&= E(X) + E(Y).
\end{align*}
In fact, using the properties of expectations (Proposition 9.7) one can show a useful
general result.
11.3 Independence for Random Variables
for all y` and zm . [In this sense, the definition of independence for three random variables
is simpler than the definition of (mutual) independence for three events.]
Solution:
We clearly have P(U = 2) = 1/6 and P(V = 3) = 1/3 but P(U = 2, V = 3) = 0, so
\[
P(U = 2, V = 3) \ne P(U = 2)\,P(V = 3)
\]
and hence U and V are not independent.
which can be easily checked for sums over small numbers of values. Using this,
we have
\begin{align*}
E(XY) &= \sum_k \sum_\ell x_k y_\ell\, P(X = x_k, Y = y_\ell) \\
&= \sum_k \sum_\ell x_k y_\ell\, P(X = x_k)\, P(Y = y_\ell) && \text{[by independence]} \\
&= \left( \sum_k x_k\, P(X = x_k) \right) \left( \sum_\ell y_\ell\, P(Y = y_\ell) \right) \\
&= E(X)E(Y).
\end{align*}
(b) Starting from Proposition 9.6, and employing Theorem 11.4/Corollary 11.5,
we have
\begin{align*}
\mathrm{Var}(X + Y) &= E\big( (X + Y)^2 \big) - [E(X + Y)]^2 \\
&= E(X^2) + 2E(XY) + E(Y^2) - [E(X) + E(Y)]^2 \\
&= E(X^2) - [E(X)]^2 + E(Y^2) - [E(Y)]^2 + 2\big[ E(XY) - E(X)E(Y) \big] \\
&= \mathrm{Var}(X) + \mathrm{Var}(Y),
\end{align*}
where the last step uses E(XY) = E(X)E(Y) for independent X and Y.
Theorem 11.7 generalizes to three and more random variables. For instance, if
X, Y and Z are independent random variables, then
\[
E(XYZ) = E(X)E(Y)E(Z) \quad \text{and} \quad \mathrm{Var}(X + Y + Z) = \mathrm{Var}(X) + \mathrm{Var}(Y) + \mathrm{Var}(Z).
\]
Indeed, using Theorem 11.7 repeatedly together with properties of the variance
(Proposition 9.8), one arrives at the following corollary.
Remarks:
• Note that while Corollary 11.5 applies to all random variables, Corollary 11.8
applies only to independent random variables.
\[
X = X_1 + X_2 + \cdots + X_n
\]
E(X) = E(X1 + X2 + · · · + Xn )
= E(X1 ) + E(X2 ) + · · · + E(Xn )
= np.
This conclusion does not require the trials to be independent. However, if they
are independent, we can also employ Corollary 11.8 to obtain
Var(X) = Var(X1 + X2 + · · · + Xn )
= Var(X1 ) + Var(X2 ) + · · · + Var(Xn )
= np(1 − p).
Hence we easily recover (10.5) and (10.6) for the expectation and variance of the
number of successes in n independent Bernoulli trials, i.e., the expectation and
variance of the binomial distribution.
(a) E(X + Y ),
(d) E(X 2 + Y 2 + Z 2 ),
(e) Var(X + Y ),
[Answers (in jumbled order): cannot be determined, 151/3, 19/6, 107/36, 43/6, cannot be
determined, 139/6 ]
\[
\begin{array}{c|cc}
x_k & 0 & 1 \\
\hline
P(X = x_k) & 1/2 & 1/2
\end{array}
\]
and
\[
\begin{array}{c|ccc}
y_\ell & 0 & 1 & 2 \\
\hline
P(Y = y_\ell) & 1/3 & 1/3 & 1/3
\end{array}
\]
Furthermore assume that
\[
P(X = 0, Y = 0) = P(X = 1, Y = 2) = 0.
\]
Definition 12.1. Let X and Y be discrete random variables defined on the same
sample space. The covariance of X and Y is defined by
\[
\mathrm{Cov}(X, Y) = E\big( [X - E(X)][Y - E(Y)] \big).
\]
If Var(X) > 0 and Var(Y) > 0, then the correlation coefficient of X and Y
is defined by
\[
\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}.
\]
We shall consider how to interpret these quantities shortly. However, we can
immediately see by comparison with Definition 9.5, that Cov(X, X) = Var(X).
Just as there is an alternative formula for the variance (Proposition 9.6), there
is an alternative formula for the covariance which, in practice, is often easier for
calculations.
Proposition 12.2. If X and Y are discrete random variables defined on the same
sample space, then
Cov(X, Y ) = E(XY ) − E(X)E(Y ).
¹As in previous chapters, we assume here that all expectations are finite, even if they involve infinite sums.
12.1 Covariance and Correlation
Proof:
From Definition 12.1, we have
\begin{align*}
\mathrm{Cov}(X, Y) &= E\big( [X - E(X)][Y - E(Y)] \big) \\
&= E\big( XY - X E(Y) - E(X) Y + E(X)E(Y) \big) \\
&= E(XY) - E(X)E(Y) - E(X)E(Y) + E(X)E(Y) \\
&= E(XY) - E(X)E(Y),
\end{align*}
where we use Proposition 9.7 / Corollary 11.5 in going from the second to the
third line.
Remarks:
• Cov(X, Y ) > 0 if, on average, [X − E(X)] and [Y − E(Y )] have the same
sign. In other words, when [X − E(X)] is positive, [Y − E(Y )] tends to be
positive too; when [X − E(X)] is negative, [Y − E(Y )] tends to be negative
too. Loosely we can say in this case that X and Y tend to deviate together
above or below their expectation. An example of such positive correlation
might be the number of hours of revision and the final exam score.
Solution:
From Proposition 12.2, we have
Proposition 12.3. Let X and Y be discrete random variables defined on the same
sample space and with Var(X) > 0 and Var(Y ) > 0.
Corr(aX + b, cY + d) = Corr(X, Y ).
(b) If a > 0 and c > 0, use the result of part (a) to show that
Corr(aX + b, cY + d) = Corr(X, Y ).
(c) Show that if Y = aX+b with a > 0, then Corr(X, Y ) = 1 (perfect positive correlation).
What relationship between X and Y would give Corr(X, Y ) = −1 (perfect negative
correlation)?
Solution:
By Definition 4.1 for conditional probability, we have
\[
P(R = r \mid Y = 1) = \frac{P( (R = r) \cap (Y = 1) )}{P(Y = 1)} = \frac{P(R = r, Y = 1)}{P(Y = 1)}.
\]
Hence the values of P(R = r|Y = 1) for different r can be obtained by dividing the values
in the second row of the joint probability mass function (Exercise 11.2) by the marginal
P(Y = 1). We can calculate P(R = 1|Y = 1), for example, as
\begin{align*}
P(R = 1 \mid Y = 1) &= \frac{P(R = 1, Y = 1)}{P(Y = 1)} \\
&= \frac{12/35}{4/7} && \text{[see Exercise 11.4]} \\
&= \frac{3}{5}.
\end{align*}
The resulting conditional p.m.f. is given by the following table:
\[
\begin{array}{c|cccc}
r & 0 & 1 & 2 & 3 \\
\hline
P(R = r \mid Y = 1) & 1/10 & 3/5 & 3/10 & 0
\end{array}
\]
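The division by the marginal can be mechanized; the sketch below (illustrative code) reconstructs the conditional p.m.f. from the "Y = 1" row of the joint distribution, whose remaining entries follow from the conditional values above together with P(Y = 1) = 4/7.

```python
from fractions import Fraction

F = Fraction
# Row "Y = 1" of the joint p.m.f. of R and Y (r = 0, 1, 2, 3).
joint_row = {0: F(2, 35), 1: F(12, 35), 2: F(6, 35), 3: F(0)}
p_Y1 = F(4, 7)

# Conditioning: P(R = r | Y = 1) = P(R = r, Y = 1) / P(Y = 1).
conditional = {r: prob / p_Y1 for r, prob in joint_row.items()}

assert conditional == {0: F(1, 10), 1: F(3, 5), 2: F(3, 10), 3: 0}
assert sum(conditional.values()) == 1   # still a valid p.m.f.
```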
The above example illustrates the general idea of conditioning an event involv-
ing a random variable on another event.
Definition 12.4. Let X be a discrete random variable and A be an event with
P(A) > 0. The conditional probability
\[
P(X = x_k \mid A) = \frac{P( (X = x_k) \cap A )}{P(A)},
\]
considered as a function of x_k, is called the conditional probability mass function of X given A.
Solution:
The conditional expectation E(R|Y = 1) is easily calculated from the table of the condi-
tional p.m.f. in Example 12.3:
\begin{align*}
E(R \mid Y = 1) &= \sum_{r=0}^{3} r\, P(R = r \mid Y = 1) \\
&= 0 \times \tfrac{1}{10} + 1 \times \tfrac{3}{5} + 2 \times \tfrac{3}{10} + 3 \times 0 \\
&= \tfrac{6}{5}.
\end{align*}
Note that E(R|Y = 1) < E(R) = 9/7. This makes sense since, if we pick one yellow ball,
we have more than the expected number of yellow balls [E(Y ) = 6/7] so the expected
number of red balls is reduced. [R and Y are not independent, see Exercise 11.10.]
Proof:
Using the law of total probability (Theorem 6.2) for the event “X = xk ” with
partition formed by E1 , E2 , . . . , En , we have
\[
P(X = x_k) = \sum_{i=1}^{n} P(X = x_k \mid E_i)\, P(E_i).
\]
12.3 Law of Total Probability for Expectations
Solution:
Let H be the event “the coin shows Heads”. Clearly {H, H c } is a partition of S and, since
the coin is fair, P(H) = P(H c ) = 1/2.
Now, the random variable X|H is the number shown by the die given that the coin
shows Heads, i.e., the number shown given we roll the six-sided die. Clearly its expectation
is
\[
E(X \mid H) = \frac{1+2+3+4+5+6}{6} = \frac{7}{2}.
\]
Similarly, X|H c is the number shown by the die given that the coin shows Tails, i.e., the
number shown given we roll the four-sided die, and so
\[
E(X \mid H^c) = \frac{1+2+3+4}{4} = \frac{5}{2}.
\]
Finally, Theorem 12.5 with the partition {H, H c } gives
\[
E(X) = E(X \mid H)\,P(H) + E(X \mid H^c)\,P(H^c) = \frac{7}{2} \times \frac{1}{2} + \frac{5}{2} \times \frac{1}{2} = 3.
\]
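The same answer drops out of a direct enumeration of the combined experiment; the sketch below (illustrative code, not part of the original notes) checks it against the conditioning calculation above.

```python
from fractions import Fraction

F = Fraction
# Enumerate the joint outcomes directly: a fair coin, then Heads -> roll a
# fair six-sided die, Tails -> roll a fair four-sided die.
outcomes = {}
for x in range(1, 7):
    outcomes[('H', x)] = F(1, 2) * F(1, 6)
for x in range(1, 5):
    outcomes[('T', x)] = F(1, 2) * F(1, 4)

E_X = sum(x * prob for (_, x), prob in outcomes.items())
assert sum(outcomes.values()) == 1
assert E_X == 3   # matches E(X|H)P(H) + E(X|H^c)P(H^c) = 7/4 + 5/4
```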
(c) Using the law of total probability, or otherwise, compute the expectation of Y .
(d) Using the law of total probability, or otherwise, compute the expectation of the prod-
uct XY and hence compute the covariance of the random variables X and Y .
[based on 2018 exam question (part)]
12.4 Further Exercises
• Read the question. Take a few moments to be clear about what is being asked
and, if necessary, work out how to convert the words to symbols. Make sure
you transcribe any relevant material correctly; if you copy something down
incorrectly you may even make a question impossible to solve!
• Show that you understand what you are doing. In a written question, you
must clearly show all your working except where you are simply asked to
“state” or “write down” a result. If you give a correct answer without jus-
tification, you may only be awarded partial marks. If you give an incorrect
answer without justification, you are likely to get zero marks! However, if
you give an incorrect answer but show signs of a correct method, you will
earn partial marks.
• Use your time wisely. Don’t spend ages struggling with a question (or ques-
tion part) worth only a few marks, when you know you can do something
else. Once the easier marks are safely “in the bag”, you can always come
back to the harder ones.
13.2 Probability in Perspective
• Check your answers. Try to avoid losing marks for “silly mistakes”! Don’t
just assume that you know how to do something and have done it correctly.
Think about whether each answer makes sense (e.g., is a calculated proba-
bility plausible) and if there is another way you could check it.
Appendix A
Errata
This appendix lists the points in these notes where there are non-trivial correc-
tions/clarifications from earlier released versions.
• Exercise 6.4 corrected to read “Show that the answer to Example 6.3(b) ...”.