Probability Theory
Definition. A simple experiment is some action that leads to the occurrence of a single
outcome s from a set of possible outcomes S. Note that the single outcome s is referred to
as a sample point, and the set of possible outcomes S is referred to as the sample space.
Example 1. Suppose that you flip a coin n ≥ 2 times and record the number of times
you observe a “heads”. The sample space is S = {0, 1, . . . , n}, where s = 0 corresponds to
observing no heads and s = n corresponds to observing only heads.
Example 2. Suppose that you pick a card at random from a standard deck of 52 playing
cards. The sample points are the individual cards in the deck (e.g., the Queen of Spades is
one possible sample point), and the sample space is the collection of all 52 cards.
Example 3. Suppose that you roll two standard (six-sided) dice and sum the obtained
numbers. The sample space is S = {2, 3, . . . , 11, 12}, where s = 2 corresponds to rolling
“snake eyes” (i.e., two 1’s) and s = 12 corresponds to rolling “boxcars” (i.e., two 6’s).
Definition. An event A refers to any possible subset of the sample space S, i.e., A ⊆ S,
and an elementary event is an event that contains a single sample point s.
Example 4. For the coin flipping example, we could define the events
Example 5. For the playing card example, we could define the events
Example 6. For the dice rolling example, we could define the events
For each of the above examples, A is an elementary event, whereas B and C are not
elementary events. Note that this assumes that 0 is considered an even number, which
ensures that C is a non-elementary event when there are only n = 2 coin flips.
Definition. A sure event is an event that always occurs, and an impossible event (or null
event) is an event that never occurs.
Example 8. For the playing card example, E = {e | e is a Club, Diamond, Heart, or Spade}
is a sure event and I = {Joker} is an impossible event.
Definition. Two events A and B are mutually exclusive if they cannot occur together,
i.e., if A ∩ B = ∅, and a collection of events is exhaustive if the events together cover the
sample space, i.e., if their union equals S.
Example 10. For the coin flipping example, the two events A = {0} and B = {n} are mutu-
ally exclusive events, whereas A = {a | a is an even number} and B = {b | b is an odd number}
are exhaustive events.
Example 11. For the playing card example, the two events A = {a | a is a Spade} and
B = {b | b is a Club} are mutually exclusive events, whereas A = {a | a is a Club or Spade}
and B = {b | b is a Diamond or Heart} are exhaustive events.
Example 12. For the dice rolling example, the two events A = {2} and B = {12} are mutu-
ally exclusive events, whereas A = {a | a is an even number} and B = {b | b is an odd number}
are exhaustive events.
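These set relationships are easy to verify computationally. Below is a minimal Python sketch using the dice-sum events of Example 12 (the code and variable names are illustrative, not part of the notes):

```python
# Dice-sum events from Example 12, represented as Python sets.
S = set(range(2, 13))                  # sample space for the sum of two dice
A = {s for s in S if s % 2 == 0}       # even sums
B = {s for s in S if s % 2 == 1}       # odd sums

mutually_exclusive = (A & B == set())  # no sample point in common
exhaustive = (A | B == S)              # together the events cover S
print(mutually_exclusive, exhaustive)  # True True
```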
2 What is a Probability?
Definition. A probability is a real number (between 0 and 1) that we assign to events in
a sample space to represent their likelihood of occurrence. The notation P (A) denotes the
probability of the event A ⊆ S.
There are two differing perspectives on how to interpret what a probability actually means
(for a discussion, see https://en.wikipedia.org/wiki/Probability_interpretations):
• The “physical” interpretation: a probability is the long-run relative frequency with
which an event occurs across many repetitions of the simple experiment.
• The “evidential” interpretation: a probability quantifies a degree of belief that an
event will occur, given the available evidence.
We will use the “physical” interpretation, given that this course is focused on Frequentist
statistical inference. But there is some merit to the “evidential” interpretation of probability
in a variety of real-world applications (because the long run isn’t always relevant).
3 Axioms of Probability
Regardless of which interpretation you prefer, a probability must satisfy the three axioms of
probability (Kolmogorov, 1933), which are the building blocks of all probability theory.
1. P (A) ≥ 0 (non-negativity)
2. P (A) ≤ 1 and P (S) = 1 (normalization)
3. P (A ∪ B) = P (A) + P (B) for any mutually exclusive events A and B (additivity)
Together, these three axioms define a probability measure that makes it possible to calculate
the probability of events.
• The first axiom states that the probability of an event A ⊆ S must be non-negative.
• The second axiom states that (a) the probability of an event A ⊆ S must not exceed
one, and (b) the probability that at least one elementary event s in the sample space
S occurs must equal one. This axiom is a requirement on the sample space S, such
that some valid outcome must be observed when the simple experiment is conducted.
• The third axiom states that the probability of the union of mutually exclusive events
must be the summation of the probabilities of the individual events.
Together, these three axioms are all that is needed to compute probabilities for any simple
experiment—which is pretty remarkable! For each of the three examples (i.e., coin flipping,
card drawing, and dice rolling), you can verify that
(i) the probability of observing any event is greater than or equal to zero
(ii) the probability of observing the entire sample space is equal to one
(iii) the probability of observing mutually exclusive events is the summation of probabilities
Note that it is okay if these points seem somewhat opaque given that we have yet to formally
specify the concept of a probability distribution, which we will do in the next section.
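As a concrete check, the card-drawing example can be verified against the three axioms. A small Python sketch follows; the integer card labels 0–51 are an assumption made for brevity, not part of the notes:

```python
from fractions import Fraction

# Card-drawing example: each of the 52 cards has probability 1/52.
# The integer labels 0..51 are an illustrative encoding of the deck.
p = {card: Fraction(1, 52) for card in range(52)}
P = lambda event: sum(p[c] for c in event)

assert all(prob >= 0 for prob in p.values())      # (i) non-negativity
assert sum(p.values()) == 1                       # (ii) P(S) = 1
kings, queens = set(range(4)), set(range(4, 8))   # two disjoint 4-card events
assert P(kings | queens) == P(kings) + P(queens)  # (iii) additivity
print("all three axioms hold")
```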
4 Probability Distributions
Definition. A probability distribution F (·) is a mathematical function that assigns proba-
bilities to outcomes of a simple experiment. Thus, a probability distribution is a function
from the sample space S to the interval [0, 1], which can be denoted as F : S → [0, 1].
Example 13. Consider the coin flipping example with n = 3 coin flips. The sample space
is S = {0, 1, 2, 3}. If we assume that the coin is “fair” (i.e., equal chance of observing Heads
and Tails) and that the n flips are “independent” (i.e., unrelated to one another), then the
probability of each elementary event is as follows:

s    sequences with s heads    P ({s})
0    TTT                       1/8
1    HTT, THT, TTH             3/8
2    HHT, HTH, THH             3/8
3    HHH                       1/8
Although there are only four elements in the sample space, i.e., |S| = 4, there are a
total of 2^n = 8 possible sequences that we could observe when flipping the coin n = 3 times. Given our
assumptions, each of the 8 possible sequences is equally likely. As a result, to compute the
probability of each s ∈ S, we simply need to count all of the relevant sequences and divide by
the total number of possible sequences, which is displayed in the P ({s}) column of the table.
The probability distribution is specified by P ({s}), such that P ({s}) defines the probability
of observing each elementary event s ∈ S. Note that the probability distribution satisfies the
three probability axioms, given that (i) P (A) ≥ 0 for any event A ⊆ S, (ii) ∑_{s=0}^{3} P ({s}) = 1,
and (iii) P ({s} ∪ {s′}) = P ({s}) + P ({s′}) for any s, s′ ∈ S (with s ≠ s′). For example, the
elementary events {0} and {3} are mutually exclusive:
• P ({0} ∩ {3}) = 0
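The counting argument above can be reproduced by brute-force enumeration. A short Python sketch, assuming (as in the text) a fair coin and independent flips:

```python
from itertools import product
from fractions import Fraction

# Enumerate the 2^3 = 8 equally likely sequences of three fair coin flips
# and count heads, reproducing the P({s}) probabilities of Example 13.
n = 3
counts = {s: 0 for s in range(n + 1)}
for seq in product("HT", repeat=n):
    counts[seq.count("H")] += 1

dist = {s: Fraction(c, 2 ** n) for s, c in counts.items()}
assert sum(dist.values()) == 1              # axiom (ii) holds
print(dist[0], dist[1], dist[2], dist[3])   # 1/8 3/8 3/8 1/8
```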
Example 14. Consider the dice rolling example. The sample space is S = {2, 3, . . . , 11, 12}.
If we assume that the dice are “fair” (i.e., equal chance of observing each outcome {1, . . . , 6}
on a single roll) and that the two rolls are “independent” (i.e., unrelated to one another),
then the probability of each elementary event is as follows:

s     # of sequences    P ({s})
2     1                 1/36
3     2                 2/36
4     3                 3/36
5     4                 4/36
6     5                 5/36
7     6                 6/36
8     5                 5/36
9     4                 4/36
10    3                 3/36
11    2                 2/36
12    1                 1/36
Although there are only 11 elements in the sample space, i.e., |S| = 11, there are a
total of 6^2 = 36 possible sequences that we could observe when rolling two dice. Given our
assumptions, each of the 36 possible sequences is equally likely. As a result, to compute the
probability of each s ∈ S, we simply need to count all of the relevant sequences and divide by
the total number of possible sequences, which is displayed in the P ({s}) column of the table.
The probability distribution is specified by P ({s}), such that P ({s}) defines the probability
of observing each elementary event s ∈ S. Note that the probability distribution satisfies the
three probability axioms, given that (i) P (A) ≥ 0 for any event A ⊆ S, (ii) ∑_{s=2}^{12} P ({s}) = 1,
and (iii) P ({s} ∪ {s′}) = P ({s}) + P ({s′}) for any s, s′ ∈ S (with s ≠ s′). For example, the
elementary events {2} and {12} are mutually exclusive:
• P ({2} ∩ {12}) = 0
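The same enumeration strategy recovers the dice-sum distribution. A short Python sketch, again assuming fair, independent dice as in the text:

```python
from itertools import product
from fractions import Fraction

# Enumerate the 36 equally likely ordered pairs of two fair dice and tally
# each sum, reproducing the P({s}) probabilities of Example 14.
counts = {s: 0 for s in range(2, 13)}
for a, b in product(range(1, 7), repeat=2):
    counts[a + b] += 1

dist = {s: Fraction(c, 36) for s, c in counts.items()}
assert sum(dist.values()) == 1     # axiom (ii) holds
print(dist[2], dist[7], dist[12])  # 1/36 1/6 1/36
```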
5 Joint Events
Thus far, we have considered simple experiments where the outcome of interest is a singular
event (e.g., the sum of two dice). In such cases, the sample space consists of sample points
that are one-dimensional elements. We could easily extend the ideas of probability theory
to experiments where the sample points are d-dimensional elements with d ≥ 2.
Definition. A joint event refers to an outcome of a simple experiment where the sample
point is two-dimensional. In this case, the sample points have the form s = (a, b), where a
and b are the two events that combine to form the joint event.
Example 15. Suppose that you flip a coin n = 2 times and record the outcome of each
coin flip (instead of recording the number of heads). In this case, the sample space is
S = {(a, b) | a ∈ {H, T }, b ∈ {H, T }}, where a and b denote the outcomes of the first and
second coin flip, respectively. Note that the sample space has size |S| = 4 and the elementary
events are defined as S = {(T, T ), (H, T ), (T, H), (H, H)}.
Example 16. Suppose that you pick a card at random from a standard deck of 52 playing
cards and record both the value and suit of the card separately. In this case, the sample
space is S = {(a, b) | a ∈ {2, 3, . . . , 9, 10, J, Q, K, A}, b ∈ {Club, Diamond, Heart, Spade}}.
Note that the sample space has size |S| = 52, given that a could take 13 different values and
b could take 4 different values (and 13 × 4 = 52).
Example 17. Suppose that we roll two dice and record the value of each die (instead of
summing the values). In this case, the sample space is S = {(a, b) | 1 ≤ a ≤ 6, 1 ≤ b ≤ 6},
where a and b denote the outcomes of the first and second dice roll, respectively. Note that
the sample space has size |S| = 36. See Example 14 for the 36 elementary events.
Definition. The conditional probability of an event A given that an event B has occurred
is P (A|B) = P (A ∩ B)/P (B), which is defined whenever P (B) > 0.
Definition. Two events are independent of one another if the probability of the joint event
is the product of the probabilities of the separate events, i.e., if P (A ∩ B) = P (A)P (B).
If A and B are independent of one another, then P (A|B) = P (A) and P (B|A) = P (B).
In other words, when A and B are independent, knowing that one of the events has occurred
tells us nothing about the likelihood of the other event occurring.
Example 18. For the coin flipping example, if we assume that the coin is fair and the
two flips are independent, then P (s) = (1/2)(1/2) = 1/4 for any s ∈ S. In other words,
if we independently flip a fair coin two times, each of the possible outcomes in the sample
space S = {(T, T ), (H, T ), (T, H), (H, H)} is equally likely to occur. Furthermore, if we
define A = {first flip is heads} and B = {second flip is heads}, then P (B|A) = P (B) = 1/2.
Thus, the events A and B are independent of one another—which we already knew because
we assumed that the two coin flips were independent. Now suppose that we define another
event as C = {both flips are heads}. Then we have the following probabilities:
• P (A ∩ C) = P (B ∩ C) = 1/4
• P (Aᶜ ∩ C) = P (Bᶜ ∩ C) = 0
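These independence claims can be checked by enumerating the four outcomes. In the Python sketch below, events are encoded as predicates on sample points; this encoding is an illustrative choice, not part of the notes:

```python
from itertools import product
from fractions import Fraction

# All four equally likely outcomes of two fair coin flips (Example 18).
S = list(product("HT", repeat=2))
P = lambda event: Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] == "H"      # first flip is heads
B = lambda s: s[1] == "H"      # second flip is heads
C = lambda s: s == ("H", "H")  # both flips are heads

# A and B are independent: P(A and B) = P(A) * P(B)
assert P(lambda s: A(s) and B(s)) == P(A) * P(B)
# A and C are NOT independent: P(A and C) = 1/4 but P(A) * P(C) = 1/8
assert P(lambda s: A(s) and C(s)) == Fraction(1, 4)
assert P(A) * P(C) == Fraction(1, 8)
```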
Example 19. For the card drawing example, note that P (s) = 1/52 for any s ∈ S, given that
we have equal probability of drawing any card in the deck. Suppose that we define the events
A = {the card is a King} and B = {the card is a face card}. Note that P (A) = 4/52 given
that there are four Kings in a deck, and P (B) = 12/52 given that there are 12 face cards in a
deck. The probability of the joint event is P (A ∩ B) = 4/52 given that A ⊂ B. This implies
that P (A|B) = (4/52)/(12/52) = 4/12, i.e., if we draw a face card, then the probability of it
being a King is 1/3. The opposite conditional probability is P (B|A) = (4/52)/(4/52) = 1,
i.e., if we draw a King, then it must be a face card. Thus, the events A and B are dependent.
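The conditional probabilities in this example can be verified by direct counting over the deck. A Python sketch, with the (value, suit) encoding following Example 16:

```python
from fractions import Fraction

# Build the 52-card deck as (value, suit) pairs, as in Example 16.
values = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]
suits = ["Club", "Diamond", "Heart", "Spade"]
deck = [(v, s) for v in values for s in suits]

P = lambda event: Fraction(sum(1 for card in deck if event(card)), len(deck))
A = lambda card: card[0] == "K"              # the card is a King
B = lambda card: card[0] in {"J", "Q", "K"}  # the card is a face card

P_A_given_B = P(lambda c: A(c) and B(c)) / P(B)  # P(A|B) = P(A and B)/P(B)
P_B_given_A = P(lambda c: A(c) and B(c)) / P(A)
print(P_A_given_B, P_B_given_A)  # 1/3 1
```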
Example 20. For the dice rolling example, if we assume that the dice are fair and the two
rolls are independent, then P (s) = (1/6)(1/6) = 1/36 for any s ∈ S. Suppose that we define
the events A = {the sum of the dice is equal to 7} and B = {the first die is a 1 or 2}. The
probabilities of the marginal events are P (A) = 6/36 and P (B) = 2/6, and the probability
of the joint event is P (A ∩ B) = 2/36 (see Example 14). This implies that P (A|B) =
(2/36)/(2/6) = 2/12, i.e., if the first roll is 1 or 2, then the probability of the sum being 7
is equal to 1/6. The opposite conditional probability is P (B|A) = (2/36)/(6/36) = 2/6, i.e.,
if the sum of the dice is 7, then the probability of the first roll being 1 or 2 is equal to 1/3.
Thus, the events A and B are dependent.
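Again, direct enumeration confirms these conditional probabilities. A Python sketch over the 36 ordered pairs:

```python
from itertools import product
from fractions import Fraction

# The 36 equally likely ordered pairs of two fair dice (Example 20).
S = list(product(range(1, 7), repeat=2))
P = lambda event: Fraction(sum(1 for s in S if event(s)), len(S))

A = lambda s: s[0] + s[1] == 7  # the sum of the dice equals 7
B = lambda s: s[0] in (1, 2)    # the first die shows 1 or 2

P_AB = P(lambda s: A(s) and B(s))  # joint probability, 2/36
print(P_AB / P(B), P_AB / P(A))    # 1/6 1/3
```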
6 Bayes’ Theorem
Bayes’ theorem (due to Reverend Thomas Bayes, 1763) states that

P (A|B) = P (B|A)P (A) / P (B),

which is due to the fact that P (A ∩ B) = P (B|A)P (A) = P (A|B)P (B). Note that Bayes’
theorem has important consequences because it allows us to derive unknown conditional
probabilities from known quantities. This theorem is the foundation of Bayesian statistics,
where the goal is to derive the posterior distribution P (A|B) given the assumed distribution
for the data given the parameters P (B|A) and the prior distribution P (A).
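As a numerical check, Bayes’ theorem recovers the conditional probability computed by counting in Example 19. A Python sketch:

```python
from fractions import Fraction

# Bayes' theorem applied to Example 19: A = {King}, B = {face card}.
P_A = Fraction(4, 52)      # probability of drawing a King
P_B = Fraction(12, 52)     # probability of drawing a face card
P_B_given_A = Fraction(1)  # every King is a face card

# P(A|B) = P(B|A) P(A) / P(B)
P_A_given_B = P_B_given_A * P_A / P_B
print(P_A_given_B)  # 1/3
```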
Note that many useful properties follow directly from the three axioms of probability, e.g.,
1. 0 ≤ P (A) ≤ 1
2. P (Aᶜ) = 1 − P (A)
3. P (A ∪ Aᶜ) = 1
4. P (S) = 1
5. P (∅) = 1 − P (S) = 0
6. P (A ∪ B) = P (A) + P (B) − P (A ∩ B)
7. P (A ∪ B) ≤ P (A) + P (B)
8. P (A ∩ B) ≤ P (A ∪ B)
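Several of these properties can be spot-checked numerically. A Python sketch using the two-dice experiment; the choice of events A and B is illustrative:

```python
from itertools import product
from fractions import Fraction

# Check properties 2, 5, 6, and 7 for the two-dice experiment.
S = set(product(range(1, 7), repeat=2))
P = lambda event: Fraction(len(event), len(S))

A = {s for s in S if s[0] + s[1] == 7}  # sum equals 7
B = {s for s in S if s[0] == 1}         # first die shows 1

assert P(S - A) == 1 - P(A)                # property 2: complement rule
assert P(set()) == 1 - P(S) == 0           # property 5: null event
assert P(A | B) == P(A) + P(B) - P(A & B)  # property 6: addition rule
assert P(A | B) <= P(A) + P(B)             # property 7: union bound
print("properties verified")
```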