Unit 2
Probability theory
Probability theory is a branch of mathematics that deals with the analysis of random
phenomena. It aims to assign numerical values to the likelihood of events occurring. The
main concepts and ideas in probability theory include:
Complementary Probability is the probability that an event will not occur; it is calculated by
subtracting the probability of the event from 1, i.e. P(A') = 1 - P(A).
1. Addition Rule (Union): P(A ∪ B) = P(A) + P(B) - P(A ∩ B), where A and B are events.
2. Multiplication Rule (Intersection): P(A ∩ B) = P(A) * P(B|A), where A and B are events, and
P(B|A) is the probability of B given A has occurred.
3. Independence: Two events A and B are independent if P(B|A) = P(B).
4. Conditional Probability: P(A|B) is the probability of A occurring given that B has occurred.
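A minimal Python sketch of the rules above, using a toy equally likely sample space (a fair six-sided die); the events A and B below are illustrative assumptions, not from the text:
```python
from fractions import Fraction

# Toy sample space: a fair six-sided die (illustrative assumption).
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & omega), len(omega))

A = {2, 4, 6}        # "roll is even"
B = {4, 5, 6}        # "roll is greater than 3"

# Complementary probability: P(not A) = 1 - P(A)
print(1 - prob(A))                       # 1/2

# Addition rule: P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
print(prob(A) + prob(B) - prob(A & B))   # 2/3
print(prob(A | B))                       # same result, computed directly

# Conditional probability: P(B|A) = P(A ∩ B) / P(A)
p_B_given_A = prob(A & B) / prob(A)

# Multiplication rule: P(A ∩ B) = P(A) * P(B|A)
print(prob(A) * p_B_given_A == prob(A & B))   # True

# Independence check: A and B are independent iff P(B|A) = P(B)
print(p_B_given_A == prob(B))            # False here (2/3 vs 1/2)
```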
Random variables are functions that assign numerical values to the outcomes of a random
experiment. Probability distributions describe the likelihood of the different values that a
random variable can take; common examples include the Bernoulli, binomial, Poisson,
uniform, and normal (Gaussian) distributions.
Probability Density Functions (PDFs) and Cumulative Distribution Functions (CDFs) are used
to describe continuous probability distributions: the PDF gives the relative likelihood of the
variable taking a value near a given point, while the CDF gives the probability that the
variable takes a value less than or equal to that point.
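As a small illustration (assuming SciPy is available), the standard normal distribution's PDF and CDF can be evaluated as follows; the evaluation points are arbitrary choices:
```python
from scipy.stats import norm

# Standard normal distribution: mean 0, standard deviation 1.
x = 1.0

pdf_value = norm.pdf(x)   # density at x (relative likelihood, not a probability)
cdf_value = norm.cdf(x)   # P(X <= x)

print(pdf_value)          # ~0.2420
print(cdf_value)          # ~0.8413

# The CDF is the integral of the PDF, so P(a < X <= b) = CDF(b) - CDF(a).
print(norm.cdf(1.0) - norm.cdf(-1.0))   # ~0.6827 (the "68%" of the 68-95-99.7 rule)
```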
Finally, Expectation (or the mean) and Variance are essential concepts for understanding the
behavior of random variables and their distributions. Expectation is a measure of the central
tendency, while Variance is a measure of the dispersion or spread of a distribution.
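A short sketch computing the expectation and variance of a discrete random variable directly from its distribution; the distribution below (a fair die) is an illustrative assumption:
```python
# Discrete random variable: outcome of a fair six-sided die (illustrative assumption).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Expectation (mean): E[X] = sum of x * P(x)
mean = sum(x * p for x, p in zip(values, probs))

# Variance: Var(X) = E[(X - E[X])^2] = sum of (x - mean)^2 * P(x)
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(mean)      # 3.5
print(variance)  # ~2.9167
```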
Bayes rule:
Bayes' Theorem, named after the Reverend Thomas Bayes, is a
fundamental concept in probability theory that allows us to reverse the conditional
probability relationship between two events. It is particularly useful in situations where we
have prior knowledge or information about one event and want to update our beliefs when
new evidence becomes available.
P(A|B) = P(B|A) * P(A) / P(B)
Here,
- P(A|B) is the probability of event A occurring given that event B has occurred.
- P(B|A) is the probability of event B occurring given that event A has occurred.
- P(A) is the prior probability of event A, which is the probability of A occurring without
considering B.
- P(B) is the probability of event B occurring, also known as the marginal probability of B.
By using Bayes' Theorem, we can calculate the conditional probability P(A|B), which might
be difficult or impossible to determine directly. This is particularly useful when we want to
update our beliefs about the probability of an event based on new evidence or information.
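A worked numerical sketch of this belief update, using a hypothetical diagnostic-test scenario; the prevalence, sensitivity, and false-positive rate below are made-up numbers for illustration only:
```python
# Hypothetical numbers for illustration only.
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Marginal probability of a positive test, P(B), via total probability.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # ~0.161: the prior 0.01 is updated to ~0.16
```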
Bayes' Theorem plays a crucial role in various fields, including statistics, machine learning,
data science, and scientific inquiry. Some common applications include spam filtering,
medical diagnosis, and Bayesian inference in machine learning.
Concept learning is the task of inferring a general concept, or classification rule, from
labeled training examples. Here are the key components and steps involved in concept learning:
Examples: Concept learning begins with a set of examples or instances that are labeled with
their corresponding classes or categories. These examples serve as the training data for the
learning algorithm.
Training Algorithm: The training algorithm is used to search the hypothesis space and find a
concept that best fits the training data. This involves evaluating and comparing different
hypotheses based on how well they explain the examples.
Generalization: Once a concept is learned from the training data, the goal is to generalize
this concept to new, unseen examples. Generalization ensures that the learned concept can
accurately classify instances that were not part of the training set.
Evaluation: The learned concept is evaluated using performance metrics such as accuracy,
precision, recall, and F1 score to assess its effectiveness in classifying new instances.
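As an illustration of these evaluation metrics, here is a small sketch that computes accuracy, precision, recall, and F1 score from binary predictions; the labels below are made up:
```python
# Made-up true labels and predicted labels for a binary classification task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.8, 0.8, 0.8, 0.8
```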
Bayes' Theorem:
Bayes' theorem states that the probability of a hypothesis (class) given the evidence
(features) equals the probability of the evidence given the hypothesis, multiplied by the
prior probability of the hypothesis and divided by the probability of the evidence:
P(class|features) = P(features|class) * P(class) / P(features).
The Naive Bayes classifier applies this rule under the simplifying ("naive") assumption that
the features are conditionally independent given the class.
The Naive Bayes algorithm is computationally efficient, especially for high-dimensional data,
and can perform well even with relatively small training datasets. However, its assumption of
feature independence may not hold true in all cases, leading to potential inaccuracies,
especially when features are correlated.
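A minimal Gaussian Naive Bayes sketch in NumPy, showing how the theorem and the independence assumption combine; the tiny dataset is made up, and in practice a library implementation (e.g. scikit-learn's GaussianNB) would normally be used instead:
```python
import numpy as np

# Made-up 2-feature training data: class 0 clustered near (1, 1), class 1 near (5, 5).
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [4.8, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

classes = np.unique(y)
priors, means, variances = {}, {}, {}
for c in classes:
    Xc = X[y == c]
    priors[c] = len(Xc) / len(X)              # P(class)
    means[c] = Xc.mean(axis=0)                # per-feature mean
    variances[c] = Xc.var(axis=0) + 1e-9      # per-feature variance (smoothed)

def log_gaussian(x, mean, var):
    """Log of the Gaussian density, evaluated per feature."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def predict(x):
    # For each class: log P(class) + sum of per-feature log-likelihoods.
    # The sum encodes the "naive" conditional-independence assumption;
    # P(features) is a shared constant, so it can be ignored when comparing classes.
    scores = {c: np.log(priors[c]) + log_gaussian(x, means[c], variances[c]).sum()
              for c in classes}
    return max(scores, key=scores.get)

print(predict(np.array([1.0, 1.1])))  # 0
print(predict(np.array([4.9, 5.1])))  # 1
```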
The Expectation-Maximization (EM) algorithm, on the other hand, can also handle latent variables
(variables that are not directly observable and must instead be inferred from the values of other
observed variables) and estimate their values, provided that the general form of the probability
distribution governing those latent variables is known. This algorithm is at the base of many
unsupervised clustering algorithms in machine learning. It was proposed, explained, and given its
name in a 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find local
maximum-likelihood parameters of a statistical model in cases where latent variables are involved
and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, start from a set of initial parameters.
2. Expectation step (E-step): Using the observed data of the dataset and the
current parameters, estimate (guess) the values of the missing data.
3. Maximization step (M-step): Use the complete data generated in the
E-step to update the parameters.
4. Repeat steps 2 and 3 until convergence.
The essence of the Expectation-Maximization algorithm is to use the available
observed data of the dataset to estimate the missing data and then use that
completed data to update the values of the parameters. Let us understand the EM
algorithm in detail.
• Initially, a set of initial values of the parameters is chosen. A set of
incomplete observed data is given to the system, with the assumption that the
observed data comes from a specific model.
• The next step is known as the “Expectation” step, or E-step. In this step, we use
the observed data and the current parameter estimates to estimate or guess the
values of the missing or incomplete data. It is basically used to update the variables.
• The next step is known as the “Maximization” step, or M-step. In this step, we
use the complete data generated in the preceding E-step to update the values of
the parameters. It is basically used to update the hypothesis.
• Finally, it is checked whether the values are converging. If they are, we stop;
otherwise we repeat the E-step and the M-step until convergence occurs.
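A compact sketch of these E- and M-steps for a two-component, one-dimensional Gaussian mixture in NumPy; here the latent variable is each point's unknown component membership, and the data and initial parameters are made up for illustration:
```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up observed data: a mixture of two 1-D Gaussians (component labels are latent).
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Step 1: initial parameter guesses (mixture weights, means, standard deviations).
weights = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
stds = np.array([1.0, 1.0])

for iteration in range(100):
    # E-step: "fill in" the latent memberships as responsibilities,
    # i.e. the posterior probability of each component for each data point.
    likelihoods = np.stack([w * gaussian_pdf(data, m, s)
                            for w, m, s in zip(weights, means, stds)])
    responsibilities = likelihoods / likelihoods.sum(axis=0)

    # M-step: update the parameters using the responsibility-weighted data.
    n_k = responsibilities.sum(axis=1)
    new_means = (responsibilities * data).sum(axis=1) / n_k
    new_stds = np.sqrt((responsibilities * (data - new_means[:, None]) ** 2).sum(axis=1) / n_k)
    new_weights = n_k / len(data)

    # Convergence check: stop when the means no longer change appreciably.
    if np.allclose(new_means, means, atol=1e-6):
        break
    weights, means, stds = new_weights, new_means, new_stds

print(means)   # close to the true component means, roughly [0, 5]
```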
(Figure: Flow chart for the EM algorithm)
Usage of EM algorithm
• It can be used to fill the missing data in a sample.
• It can be used as the basis of unsupervised learning of clusters.
• It can be used to estimate the parameters of a Hidden Markov Model (HMM).
• It can be used for discovering the values of latent variables.
Advantages of EM algorithm
• It is guaranteed that the likelihood will not decrease with each iteration.
• The E-step and M-step are often straightforward to implement for many problems.
• Solutions to the M-step often exist in closed form.