
Review of Probability Theory

January 12, 2025

1 Elements of Probability
In order to define a probability on a set we need a few basic elements.
• Sample space Ω: The set of all the outcomes of a random experiment. Here, each outcome ω ∈ Ω
can be thought of as a complete description of the state of the real world at the end of the
experiment.
• Set of events (or event space) F: A set whose elements A ∈ F (called events) are subsets of Ω
(i.e., A ⊆ Ω is a collection of possible outcomes of an experiment).
• Probability measure: A function P : F → R that satisfies the following properties:
  – P(A) ≥ 0 for all A ∈ F
  – P(Ω) = 1
  – If A1, A2, . . . are disjoint events (i.e., Ai ∩ Aj = ∅ if i ≠ j), then P(∪i Ai) = Σi P(Ai).
We interpret P(A ∪ B) as the probability of A or B happening, and P(A ∩ B) as the probability of A
and B happening.
Example 1. Consider the experiment of tossing a six-sided die. The sample space is Ω = {1, 2, . . . , 6}. We
can define different event spaces on this sample space. For example, the simplest event space is the
trivial event space F = {∅, Ω}. Another event space is the set of all subsets of Ω. For the first event
space, the unique probability measure satisfying the requirements above is given by P(∅) = 0, P(Ω) = 1.
For the second event space, one valid probability measure is to assign the probability of each set in the
event space to be i/6, where i is the number of elements of the set; for example, P({1, 2, 3, 4}) = 4/6
and P({1, 2, 3}) = 3/6.
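
As an illustration, the following Python sketch (the helper names P and powerset are ours, not part of the notes) enumerates the second event space for the die and checks the three defining properties of this probability measure:

    from itertools import combinations

    omega = {1, 2, 3, 4, 5, 6}

    def powerset(s):
        # all subsets of s: the largest possible event space on Omega
        return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

    def P(event):
        # the measure from Example 1: |A| / |Omega|
        return len(event) / len(omega)

    events = powerset(omega)
    assert all(P(A) >= 0 for A in events)           # P(A) >= 0 for all A in F
    assert P(frozenset(omega)) == 1                 # P(Omega) = 1
    A, B = frozenset({1, 2, 3}), frozenset({4, 5})  # disjoint events
    assert abs(P(A | B) - (P(A) + P(B))) < 1e-12    # additivity for disjoint events
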
Definition 1 (Conditional probability). Let B be an event with non-zero probability. The conditional
probability of any event A given B is defined as

P(A|B) = P(A ∩ B)/P(B).

That is, P(A|B) is the probability measure of the event A after observing the occurrence of the event
B.
Definition 2 (Independence). We say A and B are independent if P(A ∩ B) = P(A)P(B) (or
equivalently, P(A|B) = P(A)). That is, independence means that observing B does not have any
effect on the probability of A.
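
For instance, with the fair die of Example 1, the events A = {1, 2} and B = {2, 4, 6} are independent, as a quick sketch confirms (P here is our shorthand for the uniform measure):

    A, B = {1, 2}, {2, 4, 6}
    P = lambda E: len(E) / 6                      # uniform measure on the fair die
    assert abs(P(A & B) - P(A) * P(B)) < 1e-12    # 1/6 = (1/3)(1/2)
    assert abs(P(A & B) / P(B) - P(A)) < 1e-12    # P(A|B) = P(A)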

2 Random variables
Consider an experiment in which we flip 10 coins, and we want to know the number of coins that
come up heads. Here, the elements of the sample space Ω are 10-length sequences of heads and tails.
For example, we might have

ω = (H, H, T, H, T, H, H, T, T, T ) ∈ Ω.


However, in practice, we usually do not care about the probability of obtaining any particular sequence
of heads and tails. Instead we usually care about real-valued functions of outcomes, such as the number
of heads that appear among our 10 tosses, or the length of the longest run of tails. These functions,
under some technical conditions, are known as random variables.
More formally, a random variable X is a function X : Ω → R. Typically, we will denote random
variables using upper case letters X(ω) or more simply X (where the dependence on the random
outcome ω is implicit). We will denote the value that a random variable may take on using lower case
letters x.

Example 2. In our experiment above, suppose that X(ω) is the number of heads which occur in the
sequence of tosses ω. Given that only 10 coins are tossed, X(ω) can take only a finite number of values,
so it is known as a discrete random variable. Here, the probability of the set associated with a random
variable X taking on some specific value k is

P(X = k) := P({ω : X(ω) = k}).
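
A short simulation sketch (fair coin assumed; the seed and trial count are arbitrary choices) estimates this probability empirically and compares P(X = 5) with its exact value C(10, 5)/2^10 ≈ 0.246:

    import random
    from math import comb

    random.seed(0)
    n_trials = 100_000
    hits = sum(1 for _ in range(n_trials)
               if sum(random.random() < 0.5 for _ in range(10)) == 5)
    print(hits / n_trials)        # empirical estimate of P(X = 5)
    print(comb(10, 5) / 2**10)    # exact value: 0.24609375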

Example 3. Suppose that X(ω) is a random variable indicating the amount of time it takes for a
radioactive particle to decay. In this case, X(ω) takes on an infinite number of possible values, so it is
called a continuous random variable. We denote the probability that X takes on a value between two
real constants a and b (where a < b) as
 
P(a ≤ X ≤ b) := P({ω : a ≤ X(ω) ≤ b}).

2.1 Cumulative distribution functions


In order to specify the probability measures used when dealing with random variables, it is often
convenient to specify alternative functions (CDFs, PDFs, and PMFs) from which the probability
measure governing an experiment immediately follows. In this section and the next two sections, we
describe each of these types of functions in turn.

Definition 3. A cumulative distribution function (CDF) is a function FX : R → [0, 1] which specifies
a probability measure as

FX (x) := P(X ≤ x).
By using this function one can calculate the probability of any event in F. According to the
definition, we know that
lim_{x→−∞} FX (x) = 0,    lim_{x→∞} FX (x) = 1.

Furthermore, if x ≤ y we have FX (x) ≤ FX (y).
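
These properties can be observed empirically. The sketch below (our own illustration, not from the notes) builds an empirical CDF from Uniform[0, 1] samples, whose true CDF is FX (x) = x on [0, 1]:

    import bisect
    import random

    random.seed(1)
    xs = sorted(random.random() for _ in range(100_000))  # X ~ Uniform[0, 1]

    def F_hat(x):
        # fraction of samples <= x, an estimate of F_X(x) = P(X <= x)
        return bisect.bisect_right(xs, x) / len(xs)

    print(F_hat(-1.0), round(F_hat(0.25), 2), F_hat(2.0))  # 0.0, ≈ 0.25, 1.0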

2.2 Probability mass functions


When a random variable X takes on a finite set of possible values (i.e., X is a discrete random
variable), a simpler way to represent the probability measure associated with a random variable is
to directly specify the probability of each value that the random variable can take. In particular, a
probability mass function (PMF) is a function pX : R → [0, 1] such that

pX (x) := P(X = x).

In the case of a discrete random variable, we use the notation Val(X) for the set of possible values that
the random variable X can take. For example, if X(ω) is a random variable indicating the number of
heads out of ten tosses of a coin, then Val(X) = {0, 1, . . . , 10}. It is clear that, for any A ⊆ Val(X),

Σ_{x∈A} pX (x) = P(X ∈ A).
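
As a concrete sketch (the binomial PMF here is our illustrative choice, matching the ten-toss example), a PMF can be stored as a dictionary and summed over a set A:

    from math import comb

    # PMF of X = number of heads in ten fair tosses
    pmf = {k: comb(10, k) / 2**10 for k in range(11)}
    assert abs(sum(pmf.values()) - 1) < 1e-12   # total mass is 1

    A = {0, 1, 2}
    p_A = sum(pmf[x] for x in A)                # P(X in A) = sum of the PMF over A
    print(p_A)                                  # = (1 + 10 + 45) / 1024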


2.3 Probability density functions


For some continuous random variables, the cumulative distribution function FX (x) is differentiable
everywhere. In these cases, we define the probability density function (PDF) as the derivative of the
CDF, i.e.,
fX (x) := dFX (x)/dx.
According to the properties of differentiation, for very small ∆x

P(x ≤ X ≤ x + ∆x) ≈ fX (x)∆x.
Both CDFs and PDFs can be used for calculating the probabilities of different events. But it should
be emphasized that the value of the PDF at a given point x is not the probability of an event, i.e.,
fX (x) ≠ P(X = x). For example, fX (x) can take on values larger than 1 (but the integral of fX (x)
over any subset of R will be at most one).
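
A density exceeding 1 is easy to exhibit: Uniform[0, 1/2] has fX (x) = 2 on its support. The sketch below (the step size is an arbitrary choice) checks that this density nevertheless integrates to 1 via a Riemann sum:

    # density of Uniform[0, 0.5]: 2 on the support, 0 elsewhere
    f = lambda x: 2.0 if 0.0 <= x <= 0.5 else 0.0

    dx = 1e-5
    total = sum(f(i * dx) * dx for i in range(int(1 / dx)))  # Riemann sum on [0, 1]
    print(f(0.25), round(total, 3))  # density 2.0 > 1, yet the integral is 1.0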

2.4 Expectation
Suppose that X is a discrete random variable with PMF pX (x) and g : R 7→ R is an arbitrary
function. In this case, g(X) can be considered as a random variable, and we define the expectation as
E[g(X)] := Σ_{x∈Val(X)} g(x) pX (x).

If X is a continuous random variable with PDF fX (x), then the expected value of g(X) is defined as
E[g(X)] := ∫_{−∞}^{∞} g(x) fX (x) dx.

Intuitively, the expectation of g(X) can be thought of as a “weighted average” of the values that g(x)
can take on for different values of x, where the weights are given by pX (x) or fX (x).
Below we list some useful properties:
• E[af (X)] = aE[f (X)] for any constant a ∈ R.
• (Linearity of expectation) E[f (X) + g(X)] = E[f (X)] + E[g(X)].
• For a discrete random variable X, E[I[X = k]] = P(X = k), where I[·] is an indicator function,
taking value 1 if the event happens and 0 otherwise.
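
As a sketch tying these definitions together (the binomial PMF and g(x) = x² are our illustrative choices), the exact expectation computed from the PMF can be compared against a Monte Carlo estimate:

    import random
    from math import comb

    pmf = {k: comb(10, k) / 2**10 for k in range(11)}   # heads in ten fair tosses
    g = lambda x: x**2

    exact = sum(g(x) * p for x, p in pmf.items())       # E[g(X)] = sum of g(x) p_X(x)

    random.seed(2)
    samples = (sum(random.random() < 0.5 for _ in range(10)) for _ in range(50_000))
    mc = sum(map(g, samples)) / 50_000                  # Monte Carlo estimate of E[g(X)]
    print(exact, round(mc, 2))                          # both near 27.5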

2.5 Variance
The variance of a random variable X is a measure of how concentrated its distribution is around its
mean. Formally, the variance of X is defined as

Var[X] := E[(X − E[X])²].

We can give an alternate expression for the variance:

E[(X − E[X])²] = E[X² − 2E[X]X + (E[X])²] = E[X²] − 2E[X]E[X] + (E[X])²
              = E[X²] − (E[X])².

For any constant a ∈ R, we know that Var[af (X)] = a²Var[f (X)].
Example 4. Calculate the mean and the variance of the uniform random variable X with PDF
fX (x) = 1, ∀x ∈ [0, 1] and 0 otherwise. Then
E[X] = ∫_{−∞}^{∞} x fX (x) dx = ∫_0^1 x dx = 1/2,

E[X²] = ∫_{−∞}^{∞} x² fX (x) dx = ∫_0^1 x² dx = 1/3,

Var[X] = E[X²] − (E[X])² = 1/3 − 1/4 = 1/12.
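
A quick numerical check of this example (sample size and seed are arbitrary):

    import random

    random.seed(3)
    xs = [random.random() for _ in range(200_000)]   # X ~ Uniform[0, 1]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    print(round(mean, 3), round(var, 4))             # ≈ 0.5 and ≈ 0.0833 = 1/12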


Example 5. Suppose that g(x) = I[x ∈ A] for some subset A ⊆ R. Then

E[g(X)] = ∫_{−∞}^{∞} I[x ∈ A] fX (x) dx = ∫_{x∈A} fX (x) dx = P(X ∈ A).

3 Two Random Variables


Thus far, we have considered single random variables. In many situations, however, there may be
more than one quantity that we are interested in knowing during a random experiment. For instance,
in an experiment where we flip a coin ten times, we may care about both X(ω) = the number of heads
that come up as well as Y (ω) = the length of the longest run of consecutive heads. In this section, we
consider the setting of two random variables.

3.1 Joint and marginal distributions


Suppose that we have two random variables X and Y . One way to work with these two random vari-
ables is to consider each of them separately. If we do that we will only need FX (x) and FY (y). But if we
want to know about the values that X and Y assume simultaneously during outcomes of a random ex-
periment, we require a more complicated structure known as the joint cumulative distribution function
of X and Y , defined by
FXY (x, y) = P(X ≤ x, Y ≤ y).
It can be shown that by knowing the joint cumulative distribution function, the probability of any
event involving X and Y can be calculated. The joint CDF and the marginal distribution functions
FX (x) and FY (y) of each variable separately are related by

FX (x) = lim_{y→∞} FXY (x, y),    FY (y) = lim_{x→∞} FXY (x, y).

3.2 Joint and marginal probability mass functions


If X and Y are discrete random variables, then the joint probability mass function pXY : R × R →
[0, 1] is defined by

pXY (x, y) = P(X = x, Y = y).

We know that Σ_{x∈Val(X)} Σ_{y∈Val(Y)} pXY (x, y) = 1. We also have the following relationship between
the joint PMF and the marginal PMF of each variable:

pX (x) = Σ_{y∈Val(Y)} pXY (x, y).
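
As a sketch (the joint table below is a made-up example, not from the notes), marginalization is just a row or column sum over a joint PMF table:

    # joint PMF of two binary random variables, stored as a dictionary
    p_xy = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}
    assert abs(sum(p_xy.values()) - 1) < 1e-12

    def p_x(x):
        # marginal PMF: p_X(x) = sum over y of p_XY(x, y)
        return sum(p for (xx, _), p in p_xy.items() if xx == x)

    print(p_x(0), p_x(1))   # 0.5, 0.5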

3.3 Joint and marginal probability density functions


Let X and Y be two continuous random variables with joint distribution function FXY . In the case
that FXY (x, y) is everywhere differentiable in both x and y, then we can define the joint probability
density function
fXY (x, y) = ∂²FXY (x, y) / ∂x∂y.
Like in the single-dimensional case, fXY (x, y) ̸= P(X = x, Y = y), but we have
∫∫_{(x,y)∈A} fXY (x, y) dx dy = P((X, Y ) ∈ A).

Analogous to the discrete case, we define


fX (x) = ∫_{−∞}^{∞} fXY (x, y) dy

as the marginal probability density function of X.


3.4 Conditional distributions


Conditional distributions seek to answer the question: what is the probability distribution over Y ,
when we know that X must take on a certain value x? In the discrete case, the conditional probability
mass function of Y given X is defined as

pY |X (y|x) = pXY (x, y)/pX (x), if pX (x) ≠ 0.
In the continuous case, the situation is technically a little more complicated because the probability
that a continuous random variable X takes on a specific value x is equal to 0. Ignoring this technical
point, we simply define the conditional probability density of Y given X = x to be
fY |X (y|x) = fXY (x, y)/fX (x), if fX (x) ≠ 0.
Example 6. Suppose we know that a die roll was odd, and we want to know the probability that a
“one” was thrown. Let X be the random variable of the die roll, and let Y be an indicator variable
that takes on the value 1 if the roll turns up odd. Then we can write our desired probability as follows:

P(X = 1|Y = 1) = P(X = 1, Y = 1)/P(Y = 1) = (1/6)/(1/2) = 1/3.
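
The same computation as a brute-force sketch over the sample space (our own illustration):

    faces = range(1, 7)                                   # sample space of the die
    P = lambda pred: sum(1 for w in faces if pred(w)) / 6
    p_joint = P(lambda w: w == 1 and w % 2 == 1)          # P(X = 1, Y = 1) = 1/6
    p_odd = P(lambda w: w % 2 == 1)                       # P(Y = 1) = 1/2
    print(p_joint / p_odd)                                # 0.333... = 1/3
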
The idea of conditional probability extends naturally to the case when the distribution of a random
variable is conditioned on several variables, namely
P(X = a|Y = b, Z = c) = P(X = a, Y = b, Z = c)/P(Y = b, Z = c).

3.5 Independence
Two random variables X and Y are independent if FXY (x, y) = FX (x)FY (y) for all values of x
and y. For discrete random variables, this is equivalent to saying that

pXY (x, y) = pX (x)pY (y) ⇐⇒ pY |X (y|x) = pY (y) ∀x, y.

For continuous random variables, this is equivalent to saying that

fXY (x, y) = fX (x)fY (y) ⇐⇒ fY |X (y|x) = fY (y) ∀x, y.

Informally, two random variables X and Y are independent if “knowing” the value of one variable will
never have any effect on the conditional probability distribution of the other variable. That is, you
know all the information about the pair (X, Y ) by just knowing fX (x) and fY (y). The following lemma
formalizes this observation.
Lemma 1. If X and Y are independent, then for any subsets A, B ⊂ R we have

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B).

Sometimes we also talk about conditional independence, meaning that if we know the value of a
random variable (or more generally, a set of random variables), then some other random variables will
be independent of each other. Formally, we say “X and Y are conditionally independent given Z” if

pX|Z (x|z) = pX|Y,Z (x|y, z) ⇐⇒ pX,Y |Z (x, y|z) = pX|Z (x|z)pY |Z (y|z).

3.6 Expectations and covariance


Suppose that we have two discrete random variables X, Y and g : R² → R is a function of these
two random variables. Then, the expected value of g is defined by
E[g(X, Y )] := Σ_{x∈Val(X)} Σ_{y∈Val(Y)} g(x, y) pXY (x, y).


For continuous random variables X, Y , the analogous expression is


E[g(X, Y )] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) fXY (x, y) dx dy.

The covariance of two random variables X and Y is defined by

Cov[X, Y ] := E[(X − E[X])(Y − E[Y ])].

Using an argument similar to that for variance, we can rewrite this as

Cov[X, Y ] := E[(X − E[X])(Y − E[Y ])] = E[XY − XE[Y ] − Y E[X] + E[X]E[Y ]]
            = E[XY ] − E[X]E[Y ] − E[Y ]E[X] + E[X]E[Y ] = E[XY ] − E[X]E[Y ].

Properties:
• (Linearity of expectation) E[f (X, Y ) + g(X, Y )] = E[f (X, Y )] + E[g(X, Y )].
• Var[X + Y ] = Var[X] + Var[Y ] + 2Cov[X, Y ].
• If X and Y are independent, then Cov[X, Y ] = 0 and E[f (X)g(Y )] = E[f (X)]E[g(Y )].
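
A Monte Carlo sketch (independent Uniform[0, 1] pairs; seed and sample size are arbitrary choices) checks the last two properties numerically:

    import random

    random.seed(4)
    n = 200_000
    xs = [random.random() for _ in range(n)]
    ys = [random.random() for _ in range(n)]   # drawn independently of X
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var = lambda zs, m: sum((z - m) ** 2 for z in zs) / n
    var_sum = var([x + y for x, y in zip(xs, ys)], mx + my)
    print(round(cov, 4))                       # ≈ 0, since X and Y are independent
    print(round(var_sum, 4),                   # Var[X + Y] matches ...
          round(var(xs, mx) + var(ys, my) + 2 * cov, 4))  # ... Var[X] + Var[Y] + 2Cov[X, Y]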
