
Random Variables and Probability Theory Review

In the field of pattern recognition, patterns are typically represented as random vectors, each consisting of multiple random variables. When combined, these random variables exhibit joint probabilistic behavior, forming a comprehensive representation of the pattern. This approach allows pattern recognition systems to incorporate uncertainty and statistical characteristics into their modeling and decision-making processes, making them well suited for tasks such as image recognition and speech processing, where patterns often exhibit complex statistical relationships.

Random Variables and PDF:

A random variable is a fundamental concept in probability theory and statistics that serves as a
mathematical representation of uncertain or random events. It quantifies the possible outcomes of a
random experiment. Consider a classic example: a coin flip. In this scenario, the outcome, either "heads"
or "tails," is inherently uncertain. We use a random variable to represent this uncertainty. Let's call this
random variable $X$. It takes on one of two values: $X = 1$ for heads and $X = 0$ for tails. Importantly, these values are
not fixed; they depend on the outcome of the coin flip. For a fair coin, the probability associated with
each value of the random variable is $P(X = 1) = P(X = 0) = 1/2$. This means that there's an equal chance of getting a head or a tail
when you perform the coin flip. This specific example illustrates a discrete random variable because it
can only take on a finite set of distinct values (0 and 1, in this case). However, random variables can also
be continuous, where the possible values form a continuous range. For instance, if you measure the
height of individuals, the height values can take on any real number within a certain range, making it a
continuous random variable.
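As a quick illustration, here is a minimal Python sketch (assuming NumPy; the 0/1 coin encoding and the height parameters are illustrative choices, not from the notes) that samples both kinds of random variable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete RV: a fair coin encoded as 1 ("heads") or 0 ("tails").
flips = rng.integers(0, 2, size=10_000)      # each value has probability 1/2
print("empirical P(heads):", flips.mean())   # close to 0.5

# Continuous RV: heights in cm, modeled here as Gaussian for illustration.
heights = rng.normal(loc=170.0, scale=8.0, size=10_000)
print("three sampled heights:", heights[:3])
```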

Probability Density Function (PDF):

A Probability Density Function, often denoted as $f_X(x)$, is a mathematical function used to specify the
likelihood of a continuous random variable taking on a specific value or falling within a particular range.
Unlike discrete random variables, which have a finite set of possible values, continuous random
variables can take on an infinite number of values within a specified interval.

Properties of PDF:

A valid PDF must satisfy:

• Non-negativity: $f_X(x) \ge 0$ for all $x$.
• Normalization: $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$.
• Probabilities are areas under the PDF: $P(a \le X \le b) = \int_a^b f_X(x)\, dx$.

There is also the cumulative distribution function (CDF), $F_X(x) = P(X \le x)$, but we are generally less concerned with it in pattern recognition.
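These properties can be checked numerically; the following sketch (assuming SciPy, with a standard Gaussian as the example PDF) does so:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

pdf = norm(loc=0.0, scale=1.0).pdf  # standard Gaussian as the example PDF

# A valid PDF integrates to 1 over the whole real line.
total, _ = quad(pdf, -np.inf, np.inf)
print("integral of pdf:", total)    # ~1.0

# Probabilities are areas under the PDF.
prob, _ = quad(pdf, -1.0, 1.0)
print("P(-1 <= X <= 1):", prob)     # ~0.683 for the standard Gaussian
```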

Mean and Expectation:

Consider a scenario where we have a random variable, denoted as $X$. Understanding the distribution of
this random variable can sometimes be challenging due to its complexity. To make sense of its behavior
more easily, it's valuable to summarize its characteristics using summary statistics. The most encountered
ones are the mean, the variance, and the standard deviation.
Before we start with the summary statistics, let's review expectation. The expectation of a
random variable $X$, denoted as $E[X]$, is a measure of the central tendency or "average" value that $X$ is
likely to take on over a large number of trials or observations. For discrete random variables, the
expectation is computed as the weighted sum of all possible values that $X$ can take, where the weights
are given by the probabilities associated with each value. In mathematical notation:

$$E[X] = \sum_{i} x_i \, P(X = x_i)$$

where $x_i$ represents each possible value of $X$, and $P(X = x_i)$ is the probability of $X$ taking on the value $x_i$.

For continuous random variables, the expectation is calculated as the integral of the random variable
multiplied by its probability density function (PDF) over its entire range:

$$E[X] = \int_{-\infty}^{\infty} x \, f_X(x) \, dx$$

The way we should interpret the mean (albeit with caution) is that it tells us essentially where the
random variable tends to be located. The mean is the long-run average value of the RV, which is not necessarily its most likely value.

Example:
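For instance, for a fair six-sided die (a stand-in illustration), each face $1, \ldots, 6$ occurs with probability $1/6$, so

$$E[X] = \sum_{i=1}^{6} i \cdot \frac{1}{6} = \frac{21}{6} = 3.5$$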

Means are useful for understanding the average behavior of a random variable; however, the mean alone is
insufficient for a full intuitive understanding. Making a steady profit per sale is very different from
making a profit that swings wildly from sale to sale, even when both have the same average value. The second one has a much larger
degree of fluctuation and thus represents a much larger risk. Thus, to understand the behavior of a
random variable, we will need at minimum one more measure: some measure of how widely a random
variable fluctuates.

Variance:
Variance is a quantitative measure of how far a random variable deviates from the mean. Consider the
expression $X - E[X]$: this is the deviation of the random variable from its mean. This value can be positive
or negative, so we square it to make it positive, ensuring that we are measuring the magnitude of the
deviation.
Variance formula:

$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$

Let's consider the same example above where we calculated the mean:
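Continuing the fair-die illustration:

$$E[X^2] = \sum_{i=1}^{6} i^2 \cdot \frac{1}{6} = \frac{91}{6}, \qquad \mathrm{Var}(X) = \frac{91}{6} - (3.5)^2 = \frac{35}{12} \approx 2.92$$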
Common distribution models

Gaussian:

$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Uniform:

$$f_X(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b \\ 0 & \text{otherwise} \end{cases}$$

Exponential:

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
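As a minimal sketch of how these models are evaluated in practice (assuming SciPy; the parameter values are illustrative, not from the notes):

```python
from scipy import stats

x = 1.0  # point at which to evaluate each density (arbitrary choice)

# Gaussian with mean 0 and standard deviation 1.
print("Gaussian pdf at 1:", stats.norm(loc=0, scale=1).pdf(x))

# Uniform on [0, 2]: loc is the lower bound a, scale is the width b - a.
print("Uniform pdf at 1:", stats.uniform(loc=0, scale=2).pdf(x))  # = 1/2

# Exponential with rate lambda = 1.5; SciPy parameterizes by scale = 1/lambda.
print("Exponential pdf at 1:", stats.expon(scale=1 / 1.5).pdf(x))
```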

Joint Statistics:
The above work assumes we are working with a single real-valued random variable. However, what
happens when we encounter situations with two or more random variables that may be highly
interrelated? This scenario is quite common in machine learning, where we often encounter pairs or
groups of correlated random variables. Consider, for instance, random variables like $R_{ij}$, which encode
the red value of the pixel at coordinate $(i, j)$ in an image. In this example, adjacent pixels in an image
tend to exhibit similar colors. Treating these variables as independent entities and trying to build a
successful model using this assumption can be challenging. We can use multiple integrals to characterize
the relationship of correlated random variables. Let's start with two random variables $X$ and $Y$ with joint density $f_{X,Y}(x, y)$, which satisfies

$$f_{X,Y}(x, y) \ge 0, \qquad \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1$$

When working with multiple variables, there are situations where we want to disregard the
interdependencies and focus solely on one variable at a time. This concept involves examining the
distribution of a single variable in isolation, irrespective of the others, and it's referred to as a "marginal
distribution." Let's consider the random variables $X$ and $Y$ with joint density given by $f_{X,Y}(x, y)$. When we discuss
the marginal distribution, we are essentially taking this joint density function, which encompasses both
$x$ and $y$, and using it to determine the distribution of just one variable, $f_X(x)$. The subscript here indicates
which random variable the density is for:

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dy$$

For this, hold $x$ fixed and integrate over all values of $y$ when finding the marginal distribution for $x$.

Similar to the single-variable case, we can determine summary statistics for joint densities.

Expectations for joint probabilities:

$$E[g(X, Y)] = \int_{-\infty}^{\infty}\!\!\int_{-\infty}^{\infty} g(x, y)\, f_{X,Y}(x, y)\, dx\, dy$$

Covariance:

When working with multiple random variables, there's an additional summary statistic that proves to be
quite useful: covariance. Covariance quantifies the extent to which two random variables tend to vary or
fluctuate together:

$$\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] = E[XY] - E[X]\, E[Y]$$

Let us see some properties of covariances:

• $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
• Symmetry: $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
• Linearity: $\mathrm{Cov}(aX + b, Y) = a\, \mathrm{Cov}(X, Y)$
• If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$
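A short numerical sketch of the linearity property (assuming NumPy; the coefficients and the synthetic data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two correlated variables: y shares a linear component with x.
x = rng.normal(size=100_000)
y = 0.8 * x + rng.normal(size=100_000)

cov_xy = np.cov(x, y)[0, 1]          # off-diagonal entry is Cov(x, y)
print("Cov(x, y):", cov_xy)          # close to 0.8

# Linearity: Cov(a*x + b, y) = a * Cov(x, y); the shift b drops out.
a, b = 3.0, 5.0
print("Cov(3x + 5, y):", np.cov(a * x + b, y)[0, 1])  # close to 3 * 0.8 = 2.4
```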

Correlation:

Let's turn our attention to units of quantities. If one variable, let's call it $X$, is measured in one unit (for
example, inches), and another variable, say $Y$, is measured in a different unit (like dollars), the
covariance between them is measured in the product of these two units. These units can be hard to
interpret. In many cases, what we're really interested in is a measure of relatedness between variables
that is independent of their specific units of measurement. We often don't require an exact quantitative
correlation but rather seek to understand if the variables move in the same direction and the strength of
this relationship.
To gain a clearer understanding, let's convert our random variables, one measured in inches and the
other in dollars, into inches and cents. In this conversion, the random variable $X$, initially measured in
inches, remains unchanged. However, the random variable $Y$, originally measured in dollars, is now
multiplied by 100 to represent cents. If we work through the definition, this means that $\mathrm{Cov}(X, Y)$ will be
multiplied by 100. To arrive at a unit-invariant measure of correlation, we need to counteract this unit
change by dividing the covariance by something that also scales in the same way. The natural candidate
for this role is the standard deviation. Indeed, if we define the correlation coefficient as

$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$

we see that this is a unit-less value.

Properties of correlation:

• $-1 \le \rho_{XY} \le 1$
• $\rho_{XY} = \pm 1$ if and only if $Y$ is an exact linear function of $X$
• $\rho_{XY}$ is unchanged by positive linear rescaling of either variable
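The unit-invariance argument can be checked numerically; the following sketch (assuming NumPy, with invented inches/dollars data) rescales dollars to cents and observes that the covariance scales by 100 while the correlation is unchanged:

```python
import numpy as np

rng = np.random.default_rng(2)

inches = rng.normal(loc=68.0, scale=3.0, size=50_000)
dollars = 2.0 * inches + rng.normal(scale=5.0, size=50_000)
cents = 100.0 * dollars               # unit change: dollars -> cents

print("Cov in dollar units:", np.cov(inches, dollars)[0, 1])
print("Cov in cent units:  ", np.cov(inches, cents)[0, 1])        # 100x larger

print("corr in dollar units:", np.corrcoef(inches, dollars)[0, 1])
print("corr in cent units:  ", np.corrcoef(inches, cents)[0, 1])  # identical
```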

Independence:

Two random variables $X$ and $Y$ are independent if:

$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y) \quad \text{for all } x, y$$

This means that knowing $x$ tells us nothing about $y$. Example: When rolling two fair dice, the outcomes of
each die are independent random variables because the probability of each die's outcome is not
influenced by the other die's outcome.

The two RVs are uncorrelated if:

$$\mathrm{Cov}(X, Y) = 0, \quad \text{equivalently} \quad E[XY] = E[X]\, E[Y]$$

Uncorrelated random variables imply that there is no linear relationship between them. However, they
may still be dependent in a nonlinear or non-monotonic way. In other words, knowing the value of one
variable does not provide information about the linear relationship with the other, but they can still
exhibit other forms of statistical dependence. Consider $x$ distributed uniformly on $[-1, 1]$ and $y = x^2$. These two variables are uncorrelated
because there is no linear trend between them, but they are clearly dependent, since the value of $x$ completely determines the value of $y$.
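A quick numerical sketch of this uncorrelated-but-dependent example (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)

x = rng.uniform(-1.0, 1.0, size=200_000)  # symmetric about zero
y = x ** 2                                # fully determined by x

# Correlation is ~0 (no linear trend), yet y is a deterministic function of x.
print("corr(x, y):", np.corrcoef(x, y)[0, 1])
```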

Conditional Statistics

Now that we know joint probabilities, let's talk about conditional statistics, which form the core of
supervised machine learning. For random variables $X$ and $Y$, $f_{X|Y}(x \mid y)$ is the PDF for $X$ conditioned on $Y = y$:

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}$$

A conditional statistic like this is very important for pattern recognition, since we are assessing the
unknown (e.g., identity of pattern) conditioned on what's known (e.g., measurement).

Bayes’ Rule:

In machine learning and Bayesian statistics, we are often interested in making inferences about unobserved
(latent) random variables given that we have observed other random variables. Let us assume we have
some prior knowledge about an unobserved random variable $x$ and some relationship between $x$
and a second random variable $y$, which we can observe. If we observe $y$, we can use Bayes’ theorem to
draw some conclusions about $x$ given the observed values of $y$:

$$f_{X|Y}(x \mid y) = \frac{f_{Y|X}(y \mid x)\, f_X(x)}{f_Y(y)}$$

Here, $f_X(x)$ represents the prior distribution, which encapsulates our subjective understanding of the
unobserved (latent) variable $x$ before we have observed any data. We have the flexibility to select any
prior distribution that aligns with our reasoning, but it's of utmost importance to guarantee that this
prior has a non-zero probability density function (PDF) for all conceivable values of $x$, even if these
values are exceptionally infrequent or rare.

The likelihood $f_{Y|X}(y \mid x)$ describes how $x$ and $y$ are related, and in the case of discrete probability
distributions, it is the probability of the data $y$ if we were to know the latent variable $x$. Note that the
likelihood is not a distribution in $x$, but only in $y$. We call $f_{Y|X}(y \mid x)$ either the “likelihood of $x$ (given $y$)” or the
“probability of $y$ given $x$” but never the likelihood of $y$.

The posterior $f_{X|Y}(x \mid y)$ is the quantity of interest in Bayesian statistics and in pattern recognition because it
expresses exactly what we are interested in, i.e., what we know about $x$ after having observed $y$.
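As a minimal sketch of Bayes' rule in the discrete case (the two-coin setup and its probabilities are invented for illustration):

```python
# Infer which of two coins was flipped (latent x) from one observed
# outcome (y = heads).
prior = {"fair": 0.5, "biased": 0.5}             # prior over the latent x
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # P(y = heads | x)

# Evidence P(y = heads): marginalize the joint over the latent variable.
evidence = sum(prior[c] * likelihood_heads[c] for c in prior)

# Posterior P(x | y = heads) via Bayes' theorem.
posterior = {c: prior[c] * likelihood_heads[c] / evidence for c in prior}
print(posterior)  # {'fair': ~0.357, 'biased': ~0.643}
```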
Random Vectors:

In pattern recognition, we will not be looking at one or two variables, but a large number of random
variables. We can represent them as a random vector $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$. The joint statistics that we developed above for
two variables transfer to the general case of random vectors with many random variables. For
example, the mean becomes a vector and the covariance becomes a matrix:

$$\boldsymbol{\mu} = E[\mathbf{x}], \qquad \Sigma = E\big[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\big]$$

Note that the diagonal terms of the covariance matrix are the variances of the individual random variables (hence, always non-negative).

Sample Statistics:

Given the probability density, you can work out any expectation or correlation you want. However, in
reality, we often don't know the probability density, or even the mean or covariance of a random
vector! Instead, we will probably be given some training samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$, and we will have to infer the
probability density and other statistics from these samples. Such inferred statistics are
called sample statistics. For example, the sample mean and sample covariance are commonly defined as

$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{x}_n, \qquad \hat{\Sigma} = \frac{1}{N-1} \sum_{n=1}^{N} (\mathbf{x}_n - \hat{\boldsymbol{\mu}})(\mathbf{x}_n - \hat{\boldsymbol{\mu}})^T$$
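A minimal sketch of computing sample statistics (assuming NumPy; the synthetic 3-dimensional data stands in for real training samples):

```python
import numpy as np

rng = np.random.default_rng(4)

# N = 1000 training samples of a 3-dimensional random vector (one per row).
samples = rng.normal(size=(1000, 3))

sample_mean = samples.mean(axis=0)           # estimates the mean vector
sample_cov = np.cov(samples, rowvar=False)   # estimates the covariance matrix
# (np.cov uses the 1/(N-1) normalization shown above)

print("sample mean:", sample_mean)
print("sample covariance:\n", sample_cov)
```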

Gaussian Distribution (Multivariate):

Since the Gaussian distribution is by far the most common statistical model used in pattern recognition,
it is important to understand how it can be used in the multivariate case. For a $d$-dimensional random vector $\mathbf{x}$ with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, the density is

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)$$

The Gaussian distribution has many convenient properties, which we will discuss later.

Let's see some interesting special cases:

Single variable: with $d = 1$, the density reduces to the familiar form

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Diagonal covariances: $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$.

This means that all components of $\mathbf{x}$ are uncorrelated! This case is much easier to deal with, since the following then
holds true:

$$f(\mathbf{x}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)$$
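A quick numerical check of this factorization (assuming SciPy; the mean, variances, and evaluation point are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

mu = np.array([1.0, -2.0])
sigmas = np.array([0.5, 2.0])
cov = np.diag(sigmas ** 2)  # diagonal covariance: uncorrelated components

x = np.array([1.3, -1.1])   # an arbitrary evaluation point

joint = multivariate_normal(mean=mu, cov=cov).pdf(x)
product = np.prod(norm.pdf(x, loc=mu, scale=sigmas))  # product of 1-D Gaussians

print(joint, product)  # equal up to floating-point error
```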

Visualization of Gaussian Distribution:

One important way of visualizing bivariate Gaussian distributions is to sketch an equiprobability
contour (all points along the contour have equal probability density), defined as:

$$(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) = c \quad \text{for some constant } c > 0$$

After some derivation for the Gaussian distribution, what we need to sketch an equiprobability contour
is: the mean of the distribution, which will be the center of the ellipse; the eigenvectors of the
covariance matrix $\Sigma$, which will be the axes; and the lengths of the axes, which will be set to the square
roots of the eigenvalues of $\Sigma$.

Steps for getting the equiprobability contour of a Gaussian distribution:

• Given the mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$, compute the eigenvalues $\lambda_1, \lambda_2$ of $\Sigma$.

• Compute the corresponding eigenvectors $\mathbf{v}_1, \mathbf{v}_2$.

• Sketch the ellipse, centered at $\boldsymbol{\mu}$, with axes along $\mathbf{v}_1, \mathbf{v}_2$ and axis lengths proportional to $\sqrt{\lambda_1}, \sqrt{\lambda_2}$.

Example: Suppose we are given the following data:

How do we sketch its equiprobability contour?
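Since the example's data is not reproduced above, here is a sketch with an assumed mean and covariance (illustrative values only), following the steps listed:

```python
import numpy as np

# Assumed, illustrative mean and covariance matrix.
mu = np.array([2.0, 1.0])
Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])

# Eigendecomposition of the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)
print("axis directions (eigenvectors, one per column):\n", eigvecs)
print("axis lengths (sqrt of eigenvalues):", np.sqrt(eigvals))

# Trace the c = 1 contour: mu + V @ diag(sqrt(lambda)) @ (cos t, sin t).
t = np.linspace(0.0, 2.0 * np.pi, 200)
circle = np.vstack([np.cos(t), np.sin(t)])
ellipse = mu[:, None] + eigvecs @ (np.sqrt(eigvals)[:, None] * circle)
# `ellipse` is a 2 x 200 array of contour points, ready to plot.
```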
