
UNIT-4

Uncertainty Measure: Probability Theory

Uncertainty in artificial intelligence (AI) refers to the inherent limitations in predictions, decisions, or classifications due to incomplete, ambiguous, or noisy data, as well as model limitations. AI systems, especially those employing machine learning, often encounter uncertainty when dealing with real-world data that may be imperfect or incomplete. Managing uncertainty is crucial to ensure robust, reliable, and accurate performance in AI applications.

There are several ways to model uncertainty in AI. Bayesian approaches quantify
uncertainty by treating model parameters as probabilistic entities, offering confidence
intervals or probability distributions for predictions. Fuzzy logic addresses uncertainty
by allowing partial truth values between 0 and 1, making it useful for systems where
binary decisions (true/false) are inadequate. Probabilistic graphical models like Hidden
Markov Models or Bayesian Networks handle uncertainty by modelling relationships
between variables and their likelihoods.

Additionally, deep learning models handle uncertainty through techniques like dropout
as a regularization method, which can be interpreted to provide uncertainty estimates
in predictions. Uncertainty measures play a critical role in applications like autonomous
systems, healthcare, and decision-making processes, where incorrect or overconfident
predictions can have significant consequences.
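The dropout-based uncertainty estimate mentioned above is often implemented as Monte Carlo dropout: keeping dropout active at prediction time and averaging many stochastic forward passes. The sketch below only illustrates the idea; the tiny model, the input, and the use of PyTorch are assumptions made purely for illustration, not a prescribed implementation.

import torch
import torch.nn as nn

# A hypothetical model with a dropout layer (any architecture with dropout works).
model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(16, 1))

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                               # keep dropout active at prediction time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)  # mean prediction and its spread

x = torch.randn(1, 4)                           # a single hypothetical input
mean, std = mc_dropout_predict(model, x)
print(mean.item(), std.item())                  # the spread serves as an uncertainty estimate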
Uncertainty in artificial intelligence (AI) refers to the lack of complete information or the
presence of variability in data and models. Understanding and modeling uncertainty is
crucial for making informed decisions and improving the robustness of AI systems.
There are several types of uncertainty in AI, including:
1. Aleatoric Uncertainty: This type of uncertainty arises from the inherent
randomness or variability in data. It is often referred to as “data uncertainty.” For
example, in a classification task, aleatoric uncertainty may arise from variations
in sensor measurements or noisy labels.
2. Epistemic Uncertainty: Epistemic uncertainty is related to the lack of
knowledge or information about a model. It represents uncertainty that can
potentially be reduced with more data or better modeling techniques. It is also
known as “model uncertainty” and arises from model limitations, such as
simplifications or assumptions.
3. Parameter Uncertainty: This type of uncertainty is specific to probabilistic
models, such as Bayesian neural networks. It reflects uncertainty about the
values of model parameters and is characterized by probability distributions over
those parameters.
4. Uncertainty in Decision-Making: Uncertainty in AI systems can affect the
decision-making process. For instance, in reinforcement learning, agents often
need to make decisions in environments with uncertain outcomes, leading to
decision-making uncertainty.
5. Uncertainty in Natural Language Understanding: In natural language
processing (NLP), understanding and generating human language can be
inherently uncertain due to language ambiguity, polysemy (multiple meanings),
and context-dependent interpretations.
6. Uncertainty in Probabilistic Inference: Bayesian methods and probabilistic
graphical models are commonly used in AI to model uncertainty. Uncertainty can
arise from the process of probabilistic inference itself, affecting the reliability of
model predictions.
7. Uncertainty in Reinforcement Learning: In reinforcement learning,
uncertainty may arise from the stochasticity of the environment or the
exploration-exploitation trade-off. Agents must make decisions under uncertainty
about the outcomes of their actions.
8. Uncertainty in Autonomous Systems: Autonomous systems, such as self-
driving cars or drones, must navigate uncertain and dynamic environments. This
uncertainty can pertain to the movement of other objects, sensor
measurements, and control actions.
9. Uncertainty in Safety-Critical Systems: In applications where safety is
paramount, such as healthcare or autonomous vehicles, managing uncertainty is
critical. Failure to account for uncertainty can lead to dangerous consequences.
10.Uncertainty in Transfer Learning: When transferring a pre-trained AI model to
a new domain or task, uncertainty can arise due to domain shift or differences in
data distributions. Understanding this uncertainty is vital for adapting the model
effectively.
11.Uncertainty in Human-AI Interaction: When AI systems interact with
humans, there can be uncertainty in understanding and responding to human
input, as well as uncertainty in predicting human behavior and preferences.
Addressing and quantifying these various types of uncertainty is an ongoing research
area in AI, and techniques such as probabilistic modeling, Bayesian inference, and
Monte Carlo methods are commonly used to manage and mitigate uncertainty in AI
systems.
Techniques for Addressing Uncertainty in AI
We’ve just discussed the different types of uncertainty in AI. Now, let’s switch gears
and learn techniques for addressing uncertainty in AI. It’s like going from understanding
the problem to finding solutions for it.
Probabilistic Logic Programming
Probabilistic logic programming (PLP) is a way to mix logic and probability to handle
uncertainty in computer programs. This is useful for computer programmers when they
are not completely sure about the facts and rules they are working with. PLP uses
probabilities to help them make decisions and learn from data. They can use different
techniques, like Bayesian logic programs or Markov logic networks, to put PLP into
action. PLP is handy in various areas of artificial intelligence, such as reasoning when we are not sure, planning when risks are involved, and building graphical and symbolic models.
Fuzzy Logic Programming
To deal with uncertainty in logic programming, there’s a method called fuzzy logic
programming (FLP). FLP combines regular logic with something called “fuzzy” logic.
This helps programmers express things that are a bit unclear or not black and white.
FLP also helps them make decisions and learn from this uncertain information. They can
use different ways to do FLP, like fuzzy Prolog, fuzzy answer set programming, and
fuzzy description logic. FLP is useful in various areas of artificial intelligence, like
understanding language, working with images, and making decisions when things are
not very clear.
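As a small illustration of the fuzzy-logic idea that FLP builds on, the sketch below works with partial truth values between 0 and 1 and combines them with the common min/max operators; the membership values are hypothetical.

# Partial truth values in [0, 1] instead of strict True/False.
def fuzzy_and(a, b):
    return min(a, b)      # a common choice for fuzzy AND

def fuzzy_or(a, b):
    return max(a, b)      # a common choice for fuzzy OR

def fuzzy_not(a):
    return 1.0 - a

hot, humid = 0.7, 0.4                  # degrees to which today is "hot" and "humid"
discomfort = fuzzy_and(hot, humid)     # 0.4: "hot AND humid" holds to degree 0.4
pleasant = fuzzy_not(discomfort)       # 0.6
print(discomfort, pleasant)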

Probability Theory
Introduction to Probabilistic Reasoning
Probabilistic reasoning provides a mathematical framework for representing and
manipulating uncertainty. Unlike deterministic systems, which operate under the
assumption of complete and exact information, probabilistic systems acknowledge that
the real world is fraught with uncertainties. By employing probabilities, AI systems can
make informed decisions even in the face of ambiguity.
Need for Probabilistic Reasoning in AI
Probabilistic reasoning in artificial intelligence is important for many different tasks, such as:
 Machine Learning: Helps algorithms learn from possibly incomplete or noisy
data.
 Robotics: Provides robots the capability to act in and interact with dynamic and
uncertain environments.
 Natural Language Processing: Gives computers an understanding of human
language in all its ambiguity and sensitivity to context.
 Decision Making Systems: It empowers AI systems for well-informed decisions
and judgments by considering the likelihood of alternative outcomes.
Probabilistic reasoning represents uncertainty explicitly, allowing the AI system to operate
sensibly in the real world and make effective predictions.
Key Concepts in Probabilistic Reasoning
1. Bayesian Networks
 Imagine a kind of spider web cluttered with factors—one might say, a type of
detective board associating suspects, motives, and evidence. This, in a nutshell,
is your basic intuition behind a Bayesian network: a graphical model showing the
relationships between variables and their conditional probabilities.
 Advantages: Bayesian networks are very effective for expressing cause and effect and for reasoning about missing information. They have found wide application in medical diagnosis, where symptoms are treated as variables that have different degrees of association with diseases, which are treated as other variables.
2. Markov Models
 Consider a weather forecast. A Markov model predicts the future state of a system from its current state alone (the Markov property). For instance, according to a simple Markov model of weather, the probability that a sunny day will be followed by another sunny day is greater than the probability that a sunny day will be followed by a rainy day.
 Advantages: Markov models are effective and easy to implement. They are widely used, for example in speech recognition and language modelling, where the probability of the next word is predicted from the choice of the previous words.
3. Hidden Markov Models (HMMs)
 Consider, for example, a weather-predicting scenario that includes observable states and also hidden states, such as humidity. HMMs are a generalization of Markov models in which the underlying states are hidden and only related observations are visible.
 Advantages: HMMs are found to be very powerful in cases where hidden
variables are taken into account. Such tasks usually involve stock market
prediction, where the factors that govern prices are not fully transparent.
4. Probabilistic Graphical Models
 Probabilistic Graphical Models give a broader framework encompassing both
Bayesian networks and HMMs. In general, PGMs are an approach for
representation and reasoning in a framework of uncertain information, given in
graphical structure.
 Advantages: PGMs offer a powerful, flexible, and expressive language for doing
probabilistic reasoning, which is well suited for complex relationships that may
capture many different types of uncertainty.
These techniques are not mutually exclusive; rather, they can be combined and
extended to handle more and more specific problems in AI. For instance, the particular
technique that may be used will depend on the character of the uncertainty and the
type of result that may be sought. In turn, probabilistic reasoning can allow AI systems
to make not just predictions but quantifiable ones, thus leading to more robust and
reliable decision-making.
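To make the weather Markov model from point 2 concrete, the short sketch below defines a hypothetical transition matrix and propagates today's known state forward; the probability values are invented only for illustration.

import numpy as np

states = ["sunny", "rainy"]
# transition[i][j] = P(tomorrow = j | today = i)
transition = np.array([[0.8, 0.2],    # from sunny
                       [0.4, 0.6]])   # from rainy

today = np.array([1.0, 0.0])          # we know today is sunny
tomorrow = today @ transition         # distribution over tomorrow: sunny 0.8, rainy 0.2
day_after = tomorrow @ transition     # two steps ahead: sunny 0.72, rainy 0.28
print(dict(zip(states, tomorrow)))
print(dict(zip(states, day_after)))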
Techniques in Probabilistic Reasoning
1. Inference: The process of computing the probability distribution of certain
variables given known values of other variables. Exact inference methods include
variable elimination and the junction tree algorithm, while approximate inference
methods include Markov Chain Monte Carlo (MCMC) and belief propagation.
2. Learning: Involves updating the parameters and structure of probabilistic
models based on observed data. Techniques include maximum likelihood
estimation, Bayesian estimation, and expectation-maximization (EM).
3. Decision Making: Utilizing probabilistic models to make decisions that
maximize expected utility. Techniques involve computing expected rewards and
selecting actions accordingly, often implemented using frameworks like POMDPs.
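As a minimal example of the learning step, maximum likelihood estimation of a single Bernoulli parameter reduces to counting; the observations below are hypothetical.

observations = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # e.g. 1 = "it rained", 0 = "it did not"

# For a Bernoulli variable, the maximum likelihood estimate of P(rain)
# is simply the fraction of observed successes.
p_mle = sum(observations) / len(observations)
print(p_mle)   # 0.7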
How Does Probabilistic Reasoning Empower AI Systems?
Suppose for a moment that you find yourself in a maze with nothing but an out-of-focus map. Traditional, rule-based reasoning would grind to a halt, unable to reason about the likelihood of a dead end or an unclear way to go. Probabilistic reasoning is like a powerful flashlight that can show the path ahead even in circumstances of uncertainty.
This is the way in which probabilistic reasoning empowers AI systems:
 Quantifying Uncertainty: Probabilistic reasoning does not shrink from
uncertainty. It turns to the tools of probability theory to represent uncertainty by
attaching degrees of likelihood. For example, instead of a simple “true” or “false”
to whether it will rain tomorrow, probabilistic reasoning might assign a 60%
chance that it will.
 Reasoning with Evidence: AI systems cannot enjoy the luxury of making
decisions in isolation. They have to consider the available evidence and act
accordingly to help refine the probabilities. For example, the probability for a
rainy day can be refined to increase to 80% if dark clouds come in the afternoon.
 Learning from Past Experience: AI systems can learn from past experiences. Probabilistic reasoning factors prior knowledge into decisions. For example, an AI system trained on historical weather data for your location might therefore consider seasonal trends when calculating the probability of rain.
 Effective Decision-Making: Probabilistic reasoning also enables AI systems to make effective, well-informed decisions based on quantified uncertainty, evidence, and prior knowledge. Returning to our maze analogy, the AI can weigh the probability of different paths, given the map and whatever it has observed along the way, making it much more likely to reach the goal.
Probabilistic reasoning is not about achieving perfection in a world full of uncertainty
but about realizing the limits of perfect knowledge and working best with the
information available. This enables AI systems to perform soundly in the realistic world,
full of vagueness and where information is, in general, not complete.
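The evidence-based updating described above is simply Bayes' rule. The sketch below uses the rain example: the likelihoods assumed for "dark clouds" are hypothetical, but they are chosen so that the 60% prior is refined to 80% once the clouds appear.

p_rain = 0.6                      # prior probability of rain tomorrow
p_clouds_given_rain = 0.8         # P(dark clouds | rain), assumed for illustration
p_clouds_given_dry = 0.3          # P(dark clouds | no rain), assumed for illustration

# Bayes' rule: P(rain | clouds) = P(clouds | rain) * P(rain) / P(clouds)
p_clouds = p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)
p_rain_given_clouds = p_clouds_given_rain * p_rain / p_clouds
print(p_rain_given_clouds)        # about 0.8, i.e. an 80% chance of rain given the clouds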
Applications of Probabilistic Reasoning in AI
Probabilistic reasoning is widely applicable in a variety of domains:
1. Robotics: Probabilistic reasoning enables robots to navigate and interact with
uncertain environments. For instance, simultaneous localization and mapping
(SLAM) algorithms use probabilistic techniques to construct maps of unknown
environments while tracking the robot’s location.
2. Healthcare: In medical diagnosis, probabilistic models help in assessing the
likelihood of diseases given symptoms and test results. Bayesian networks, for
example, can model the relationships between various medical conditions and
diagnostic indicators.
3. Natural Language Processing (NLP): Probabilistic models, such as HMMs and
Conditional Random Fields (CRFs), are used for tasks like part-of-speech tagging,
named entity recognition, and machine translation.
4. Finance: Probabilistic reasoning is used to model market behavior, assess risks,
and make investment decisions. Techniques like Bayesian inference and Monte
Carlo simulations are commonly employed in financial modeling.
Advantages of Probabilistic Reasoning
 Flexibility: Probabilistic models can handle a wide range of uncertainties and
are adaptable to various domains.
 Robustness: These models are robust to noise and incomplete data, making
them reliable in real-world applications.
 Interpretable: Probabilistic models provide a clear framework for understanding
and quantifying uncertainty, which can aid in transparency and explainability.
Conclusion
Probabilistic reasoning is one of the most important methods to empower AI
applications and is widely used, dealing with the uncertainty of the problem to make
logical decisions. With the built-in probabilities, AI systems can navigate through
complexities in the real world, ultimately improving both reliability and performance.

Bayesian Belief Networks

Bayesian Belief Network in artificial intelligence


A Bayesian belief network is a key technique for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of
variables and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian
model.
Bayesian networks are probabilistic, because these networks are built from
a probability distribution, and also use probability theory for prediction and anomaly
detection.
Real world applications are probabilistic in nature, and to represent the relationship
between multiple events, we need a Bayesian network. It can also be used in various
tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links),
where:
o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent the causal relationships or conditional probabilities between random variables. These directed links or arrows connect pairs of nodes in the graph.
A link indicates that one node directly influences the other node; if there is no directed link, the nodes are independent of each other.
o In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A by a
directed arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph, or DAG.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which determines the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ....., xn, then the joint probability distribution P[x1, x2, x3, ....., xn] gives the probability of each combination of values of x1, x2, x3, .., xn. By the chain rule, it can be written as:
P[x1, x2, x3, ....., xn] = P[x1 | x2, x3, ....., xn]. P[x2, x3, ....., xn]
= P[x1 | x2, x3, ....., xn]. P[x2 | x3, ....., xn]..... P[xn-1 | xn]. P[xn]
In a Bayesian network, each variable is conditionally independent of its other predecessors given its parents, so for each variable Xi we can write:
P(Xi | Xi-1, ........., X1) = P(Xi | Parents(Xi))
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed
acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm responds reliably to a burglary but also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken responsibility for informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. Sophia, on the other hand, likes to listen to loud music, so she sometimes misses the alarm. Here we would like to compute the probability of the burglar alarm.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on the Alarm.
o The network represents the assumption that David and Sophia do not directly perceive the burglary, do not notice the minor earthquake, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1 because the entries in a row represent an exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k rows of probabilities. Hence, if there are two parents, the CPT will contain 4 rows of probability values.
List of all events occurring in this network:
o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of the probability P[D, S, A, B, E], and rewrite it using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E]. P[S, A, B, E]
= P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A, B, E]. P[A, B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B, E]
= P[D | A]. P[S | A]. P[A | B, E]. P[B | E]. P[E]
Since Burglary and Earthquake are independent, P[B | E] = P[B].

Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, which is the probability that an earthquake did not occur.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and earthquake:

B E P(A= True) P(A= False)


True True 0.94 0.06

True False 0.95 0.05

False True 0.31 0.69

False False 0.001 0.999

Conditional probability table for David Calls:


The conditional probability that David will call depends on the state of the Alarm.

A P(D= True) P(D= False)

True 0.91 0.09

False 0.05 0.95

Conditional probability table for Sophia Calls:


The conditional probability that Sophia calls depends on its parent node "Alarm."

A P(S= True) P(S= False)

True 0.75 0.25

False 0.02 0.98

From the formula of joint distribution, we can write the problem statement in the form
of probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using
Joint distribution.
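The same calculation can be reproduced with a short sketch that reads the probability values straight from the tables above; this only evaluates the single query from this example, whereas answering arbitrary queries would require an inference algorithm such as variable elimination.

# Probability values taken from the CPTs in this example.
p_b = {True: 0.002, False: 0.998}                   # P(Burglary)
p_e = {True: 0.001, False: 0.999}                   # P(Earthquake)
p_a = {(True, True): 0.94, (True, False): 0.95,     # P(Alarm=True | B, E)
       (False, True): 0.31, (False, False): 0.001}
p_d = {True: 0.91, False: 0.05}                     # P(David calls=True | Alarm)
p_s = {True: 0.75, False: 0.02}                     # P(Sophia calls=True | Alarm)

# P(S, D, A, ~B, ~E) = P(S|A) P(D|A) P(A|~B,~E) P(~B) P(~E)
b, e, a = False, False, True
prob = p_s[a] * p_d[a] * p_a[(b, e)] * p_b[b] * p_e[e]
print(prob)   # approximately 0.00068045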
The semantics of the Bayesian network:
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To understand the network as a representation of the joint probability distribution.
This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence statements.
This view is helpful for designing inference procedures.

Dempster Shafer Theory

Uncertainty is a pervasive aspect of AI systems, as they often deal with incomplete or conflicting information. Dempster–Shafer Theory, named after its inventors Arthur P. Dempster and Glenn Shafer, offers a mathematical framework to represent and reason with uncertain information. By utilizing belief functions, Dempster–Shafer Theory enables artificial intelligence systems to handle imprecise and conflicting evidence, making it a powerful tool in decision-making processes.
Introduction
In recent times, the scientific and engineering community has come to realize the
significance of incorporating multiple forms of uncertainty. This expanded perspective
on uncertainty has been made feasible by notable advancements in computational
power within the field of artificial intelligence. As computational systems become more
adept at handling intricate analyses, the limitations of relying solely on traditional
probability theory to encompass the entirety of uncertainty have become apparent.
Traditional probability theory falls short in its ability to effectively address consonant,
consistent, or arbitrary evidence without the need for additional assumptions about
probability distributions within a given set. Moreover, it fails to express the extent of
conflict that may arise between different sets of evidence. To overcome these
limitations, Dempster-Shafer theory has emerged as a viable framework, blending the
concept of probability with the conventional understanding of sets. Dempster-Shafer
theory provides the means to handle diverse types of evidence, and it incorporates
various methods to account for conflicts when combining multiple sources of
information in the context of artificial intelligence.
What Is Dempster – Shafer Theory (DST)?
Dempster-Shafer Theory (DST) is a theory of evidence that has its roots in the work of
Dempster and Shafer. While traditional probability theory is limited to assigning
probabilities to mutually exclusive single events, DST extends this to sets of events in a
finite discrete space. This generalization allows DST to handle evidence associated with
multiple possible events, enabling it to represent uncertainty in a more meaningful way.
DST also provides a more flexible and precise approach to handling uncertain
information without relying on additional assumptions about events within an evidential
set.
Where sufficient evidence is present to assign probabilities to single events, the
Dempster-Shafer model can collapse to the traditional probabilistic formulation.
Additionally, one of the most significant features of DST is its ability to handle different
levels of precision regarding information without requiring further assumptions. This
characteristic enables the direct representation of uncertainty in system responses,
where an imprecise input can be characterized by a set or interval, and the resulting
output is also a set or interval.
The incorporation of Dempster Shafer theory in artificial intelligence allows for a more
comprehensive treatment of uncertainty. By leveraging the unique features of this
theory, AI systems can better navigate uncertain scenarios, leveraging the potential of
multiple evidentiary types and effectively managing conflicts. The utilization of
Dempster Shafer theory in artificial intelligence empowers decision-making processes
in the face of uncertainty and enhances the robustness of AI systems. Therefore,
Dempster-Shafer theory is a powerful tool for building AI systems that can handle
complex uncertain scenarios.
The Uncertainty in this Model
At its core, DST represents uncertainty using a mathematical object called a belief
function. This belief function assigns degrees of belief to various hypotheses or
propositions, allowing for a nuanced representation of uncertainty. Three crucial points
illustrate the nature of uncertainty within this theory:
1. Conflict: In DST, uncertainty arises from conflicting evidence or incomplete
information. The theory captures these conflicts and provides mechanisms to
manage and quantify them, enabling AI systems to reason effectively.
2. Combination Rule: DST employs a combination rule known as Dempster's rule
of combination to merge evidence from different sources. This rule handles
conflicts between sources and determines the overall belief in different
hypotheses based on the available evidence.
3. Mass Function: The mass function, denoted as m(K), quantifies the belief
assigned to a set of hypotheses, denoted as K. It provides a measure of
uncertainty by allocating probabilities to various hypotheses, reflecting the
degree of support each hypothesis has from the available evidence.
Example
Consider a scenario in artificial intelligence (AI) where an AI system is tasked with
solving a murder mystery using Dempster–Shafer Theory. The setting is a room with
four individuals: A, B, C, and D. Suddenly, the lights go out, and upon their return, B is
discovered dead, having been stabbed in the back with a knife. No one entered or
exited the room, and it is known that B did not commit suicide. The objective is to
identify the murderer.
To address this challenge using Dempster–Shafer Theory, we can explore various
possibilities:
1. Possibility 1: The murderer could be either A, C, or D.
2. Possibility 2: The murderer could be a combination of two individuals, such as
A and C, C and D, or A and D.
3. Possibility 3: All three individuals, A, C, and D, might be involved in the crime.
4. Possibility 4: None of the individuals present in the room is the murderer.
To find the murderer using Dempster–Shafer Theory, we can examine the evidence and assign measures of plausibility to each possibility. We create a set of possible conclusions P with individual elements {p1, p2, ..., pn}, where at least one element p must be true. These elements must be mutually exclusive.
By constructing the power set, which contains all possible subsets, we can analyze the evidence. For instance, if P = {a, b, c}, the power set would be {∅, {a}, {b}, {c}, {a,b}, {b,c}, {a,c}, {a,b,c}}, comprising 2^3 = 8 elements.
Mass function m(K)
In Dempster–Shafer Theory, the mass function m(K) represents evidence for a
hypothesis or subset K. It denotes that evidence for {K or B} cannot be further divided
into more specific beliefs for K and B.
Belief in K
The belief in K, denoted Bel(K), is calculated by summing the masses of all subsets of K. For example, if K = {a, d, c}, then Bel(K) = m(a) + m(d) + m(c) + m(a,d) + m(a,c) + m(d,c) + m(a,d,c).
Plausibility in K
The plausibility of K, denoted Pl(K), is determined by summing the masses of all sets that intersect K. It represents the cumulative evidence supporting the possibility of K being true: Pl(K) = m(a) + m(d) + m(c) + m(a,d) + m(d,c) + m(a,c) + m(a,d,c).
By leveraging Dempster–Shafer Theory in AI, we can analyze the evidence, assign
masses to subsets of possible conclusions, and calculate beliefs and plausibilities to
infer the most likely murderer in this murder mystery scenario.
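A small sketch of these belief and plausibility calculations is given below for the suspect frame {A, C, D}; the particular mass values are hypothetical and only need to sum to 1 over subsets of the frame.

# Hypothetical mass assignment over subsets of the frame {A, C, D}.
masses = {
    frozenset({"A"}): 0.30,
    frozenset({"C"}): 0.10,
    frozenset({"D"}): 0.10,
    frozenset({"A", "C"}): 0.20,
    frozenset({"A", "C", "D"}): 0.30,   # mass on the whole frame = remaining ignorance
}

def belief(k, masses):
    # sum of masses of all subsets contained in K
    return sum(m for s, m in masses.items() if s <= k)

def plausibility(k, masses):
    # sum of masses of all subsets that intersect K
    return sum(m for s, m in masses.items() if s & k)

K = frozenset({"A"})
print(belief(K, masses), plausibility(K, masses))   # Bel({A}) = 0.3, Pl({A}) = 0.8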
Characteristics of Dempster Shafer Theory
Dempster Shafer Theory in artificial intelligence (AI) exhibits several notable
characteristics:
1. Handling Ignorance: Dempster Shafer Theory encompasses a unique aspect
related to ignorance, where the aggregation of probabilities for all events sums
up to 1. This peculiar trait allows the theory to effectively address situations
involving incomplete or missing information.
2. Reduction of Ignorance: In this theory, ignorance is gradually diminished
through the accumulation of additional evidence. By incorporating more and
more evidence, Dempster Shafer Theory enables AI systems to make more
informed and precise decisions, thereby reducing uncertainties.
3. Combination Rule: The theory employs a combination rule to effectively merge
and integrate various types of possibilities. This rule allows for the synthesis of
different pieces of evidence, enabling AI systems to arrive at comprehensive and
robust conclusions by considering the diverse perspectives presented.
By leveraging these distinct characteristics, Dempster Shafer Theory proves to be a
valuable tool in the field of artificial intelligence, empowering systems to handle
ignorance, reduce uncertainties, and combine multiple types of evidence for more
accurate decision-making.
Advantages and Disadvantages
Dempster Shafer Theory in Artificial Intelligence (AI) Offers Numerous
Benefits:
1. Firstly, it presents a systematic and well-founded framework for effectively
managing uncertain information and making informed decisions in the face of
uncertainty.
2. Secondly, the application of Dempster–Shafer Theory allows for the integration
and fusion of diverse sources of evidence, enhancing the robustness of decision-
making processes in AI systems.
3. Moreover, this theory caters to the handling of incomplete or conflicting
information, which is a common occurrence in real-world scenarios encountered
in artificial intelligence.
Nevertheless, it is Crucial to Acknowledge Certain Limitations Associated with
the Utilization of Dempster Shafer Theory in Artificial Intelligence:
1. One drawback is that the computational complexity of DST increases significantly
when confronted with a substantial number of events or sources of evidence,
resulting in potential performance challenges.
2. Furthermore, the process of combining evidence using Dempster–Shafer Theory
necessitates careful modeling and calibration to ensure accurate and reliable
outcomes.
3. Additionally, the interpretation of belief and plausibility values in DST may
possess subjectivity, introducing the possibility of biases influencing decision-
making processes in artificial intelligence.

MACHINE LEARNING

Machine learning (ML) is a subdomain of artificial intelligence (AI) that focuses on developing systems that learn, or improve performance, based on the data they ingest. Artificial intelligence is a broad term that refers to systems or machines that resemble human intelligence. Machine learning and AI are frequently discussed together, and the terms are occasionally used interchangeably, although they do not mean the same thing. A crucial distinction is that, while all machine learning is AI, not all AI is machine learning.
What is Machine Learning?
Machine Learning is the field of study that gives computers the capability to learn without being explicitly programmed. ML is one of the most exciting technologies one could come across. As is evident from the name, it gives the computer the ability that makes it more similar to humans: the ability to learn. Machine learning is actively being used today, perhaps in many more places than one would expect.

Features of Machine Learning


 Machine learning is a data-driven technology. A large amount of data is
generated by organizations daily, enabling them to identify notable relationships
and make better decisions.
 Machines can learn from past data and automatically improve their performance.
 Given a dataset, ML can detect various patterns in the data.
 For large organizations, branding is crucial, and targeting a relatable customer
base becomes easier.
 It is similar to data mining, as both deal with substantial amounts of data.
Advantages and Disadvantages of Machine Learning
Advantages:
1. Improved Accuracy and Precision: Machine learning (ML) excels in analyzing
vast data sets, identifying patterns, and improving accuracy in predictions, such
as diagnosing diseases or detecting anomalies that human analysis might miss.
2. Automation of Repetitive Tasks: ML automates routine tasks, such as data
entry and customer service, leading to increased productivity, efficiency, and the
allocation of human resources to more creative tasks.
3. Enhanced Decision-Making: By analyzing large datasets, ML provides valuable
insights, aiding in data-driven decision-making across industries, including
finance, healthcare, and marketing.
4. Personalization and Customer Experience: ML algorithms allow businesses
to personalize products and services based on user behavior, enhancing
customer satisfaction, such as through personalized recommendations in e-
commerce or content platforms.
5. Predictive Analytics: ML can predict future trends or events by analyzing
historical data, such as forecasting demand or identifying potential disease
outbreaks, helping industries plan more effectively.
6. Scalability: Machine learning models can efficiently handle and process large
datasets, making them essential for big data applications like social media
analysis or real-time business operations.
7. Improved Security: ML helps in detecting cybersecurity threats by identifying
abnormal patterns in data. It’s used in fraud detection by monitoring transactions
and analyzing network activity for suspicious behavior.
8. Cost Reduction: By automating tasks and optimizing processes, ML reduces
operational costs, such as predictive maintenance in manufacturing that
prevents costly machine failures.
9. Innovation and Competitive Advantage: Companies adopting ML gain a
competitive edge by innovating and responding to customer demands faster. ML-
driven products and insights can lead to new revenue streams and market
leadership.
10.Enhanced Human Capabilities: ML amplifies human potential, offering tools
that provide insights and help professionals, such as assisting doctors in
diagnosing diseases or researchers in processing complex data.
Disadvantages:
1. Data Dependency: ML models require vast amounts of data to function
effectively. The quality, quantity, and diversity of data are crucial to the model’s
performance, and biased or insufficient data can lead to poor results.
2. High Computational Costs: Training ML models can be resource-intensive,
often requiring expensive hardware like GPUs or TPUs. The energy consumption
is also significant, raising concerns about sustainability.
3. Complexity and Interpretability: Complex ML models, especially deep neural
networks, are difficult to interpret, leading to a "black-box" problem where
understanding how a model arrived at a decision becomes challenging,
especially in sensitive fields like healthcare.
4. Overfitting and Underfitting: ML models can suffer from overfitting (when
they memorize the training data) or underfitting (when they are too simplistic),
leading to poor generalization to new data.
5. Ethical Concerns: ML raises ethical issues around privacy, as models often rely
on sensitive personal data. Biases in data can also perpetuate social inequalities,
resulting in unfair treatment.
6. Lack of Generalization: ML models are often designed for specific tasks and
may struggle when applied to different datasets or domains. Generalizing across
diverse contexts remains a challenge in machine learning.
7. Dependency on Expertise: Developing ML models requires specialized
knowledge in algorithms, data preprocessing, and model evaluation. A shortage
of skilled professionals can limit the adoption of ML.
8. Security Vulnerabilities: ML models can be vulnerable to adversarial attacks,
where manipulated input data is used to trick the model into making incorrect
predictions, posing risks in applications like autonomous vehicles and
cybersecurity.
9. Maintenance and Updates: ML models require ongoing maintenance and
retraining as data changes over time. Data drift, where the underlying data
distribution shifts, can degrade model performance if not addressed.
10.Legal and Regulatory Challenges: The use of ML, especially in handling
personal data, faces legal and regulatory challenges like complying with GDPR. A
lack of clear regulations can create uncertainty for developers and businesses.
Conclusion:
Machine learning offers numerous advantages, such as automation, enhanced
accuracy, scalability, and personalization, making it highly valuable across industries.
However, it also faces challenges like data dependency, computational costs,
interpretability issues, and security vulnerabilities. Addressing these challenges is
essential for ethical and effective use of ML technologies.
Supervised Learning
In supervised learning, the machine is trained on a set of labeled data, which means
that the input data is paired with the desired output. The machine then learns to predict
the output for new input data. Supervised learning is often used for tasks such as
classification, regression, and object detection.
In unsupervised learning, the machine is trained on a set of unlabeled data, which
means that the input data is not paired with the desired output. The machine then
learns to find patterns and relationships in the data. Unsupervised learning is often
used for tasks such as clustering, dimensionality reduction, and anomaly detection.
What is Supervised learning?
Supervised learning is a type of machine learning algorithm that learns from labeled
data. Labeled data is data that has been tagged with a correct answer or classification.
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher. Supervised learning is when we teach or train the machine using data that is well labelled, which means some data is already tagged with the correct answer. After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm analyses the training data (the set of training examples) and produces a correct outcome from the labeled data.
For example, a labeled dataset of images of elephants, camels, and cows would have each image tagged with either “Elephant”, “Camel”, or “Cow”.

Key Points:
 Supervised learning involves training a machine from labeled data.
 Labeled data consists of examples with the correct answer or classification.
 The machine learns the relationship between inputs (fruit images) and outputs
(fruit labels).
 The trained machine can then make predictions on new, unlabeled data.
Example:
Let's say you have a basket of fruit that you want to identify. The machine would first analyze an image of a fruit to extract features such as its shape, color, and texture. Then, it would compare these features to the features of the fruits it has already learned about. If the new image's features are most similar to those of an apple, the machine would predict that the fruit is an apple.
For instance, suppose you are given a basket filled with different kinds of fruits. Now
the first step is to train the machine with all the different fruits one by one like this:
 If the shape of the object is rounded and has a depression at the top, is red in
color, then it will be labeled as –Apple.
 If the shape of the object is a long curving cylinder having Green-Yellow color,
then it will be labeled as –Banana.
Now suppose that, after training, you give the machine a new fruit from the basket, say a banana, and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use that knowledge wisely. It will first classify the fruit by its shape and color, confirm the fruit name as BANANA, and put it in the Banana category. Thus the machine learns from the training data (the basket of fruits) and then applies that knowledge to the test data (the new fruit).
Types of Supervised Learning
Supervised learning is classified into two categories of algorithms:
 Regression: A regression problem is when the output variable is a real value,
such as “dollars” or “weight”.
 Classification: A classification problem is when the output variable is a
category, such as “Red” or “blue” , “disease” or “no disease”.
Supervised learning deals with or learns with “labeled” data. This implies that some
data is already tagged with the correct answer.
1- Regression
Regression is a type of supervised learning that is used to predict continuous values,
such as house prices, stock prices, or customer churn. Regression algorithms learn a
function that maps from the input features to the output value.
Some common regression algorithms include:
 Linear Regression
 Polynomial Regression
 Support Vector Machine Regression
 Decision Tree Regression
 Random Forest Regression
2- Classification
Classification is a type of supervised learning that is used to predict categorical values,
such as whether a customer will churn or not, whether an email is spam or not, or
whether a medical image shows a tumor or not. Classification algorithms learn a
function that maps from the input features to a probability distribution over the output
classes.
Some common classification algorithms include:
 Logistic Regression
 Support Vector Machines
 Decision Trees
 Random Forests
 Naive Bayes
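To make the classification setting described above concrete, the sketch below trains one of the listed algorithms, logistic regression, on a standard labeled dataset; scikit-learn and the Iris data are assumptions made for illustration, standing in for any labeled dataset.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)                # one common classifier
model.fit(X_train, y_train)                              # learn from labeled data
y_pred = model.predict(X_test)                           # predict on unseen examples
print(accuracy_score(y_test, y_pred))                    # fraction of correct predictions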
Evaluating Supervised Learning Models
Evaluating supervised learning models is an important step in ensuring that the model
is accurate and generalizable. There are a number of different metrics that can be used
to evaluate supervised learning models, but some of the most common ones include:
For Regression
 Mean Squared Error (MSE): MSE measures the average squared difference
between the predicted values and the actual values. Lower MSE values indicate
better model performance.
 Root Mean Squared Error (RMSE): RMSE is the square root of
MSE, representing the standard deviation of the prediction errors. Similar to
MSE, lower RMSE values indicate better model performance.
 Mean Absolute Error (MAE): MAE measures the average absolute difference
between the predicted values and the actual values. It is less sensitive to outliers
compared to MSE or RMSE.
 R-squared (Coefficient of Determination): R-squared measures the
proportion of the variance in the target variable that is explained by the
model. Higher R-squared values indicate better model fit.
For Classification
 Accuracy: Accuracy is the percentage of predictions that the model makes
correctly. It is calculated by dividing the number of correct predictions by the
total number of predictions.
 Precision: Precision is the percentage of positive predictions that the model
makes that are actually correct. It is calculated by dividing the number of true
positives by the total number of positive predictions.
 Recall: Recall is the percentage of all positive examples that the model correctly
identifies. It is calculated by dividing the number of true positives by the total
number of positive examples.
 F1 score: The F1 score is a weighted average of precision and recall. It is
calculated by taking the harmonic mean of precision and recall.
 Confusion matrix: A confusion matrix is a table that shows the number of
predictions for each class, along with the actual class labels. It can be used to
visualize the performance of the model and identify areas where the model is
struggling.
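The classification metrics above can be computed with standard library functions, as in the sketch below; the label vectors are hypothetical, and scikit-learn is assumed (the regression metrics have analogous functions such as mean_squared_error and r2_score).

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]    # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]    # model predictions (hypothetical)

print(accuracy_score(y_true, y_pred))     # fraction of correct predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class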
Applications of Supervised learning
Supervised learning can be used to solve a wide variety of problems, including:
 Spam filtering: Supervised learning algorithms can be trained to identify and
classify spam emails based on their content, helping users avoid unwanted
messages.
 Image classification: Supervised learning can automatically classify images
into different categories, such as animals, objects, or scenes, facilitating tasks
like image search, content moderation, and image-based product
recommendations.
 Medical diagnosis: Supervised learning can assist in medical diagnosis by
analyzing patient data, such as medical images, test results, and patient history,
to identify patterns that suggest specific diseases or conditions.
 Fraud detection: Supervised learning models can analyze financial transactions
and identify patterns that indicate fraudulent activity, helping financial
institutions prevent fraud and protect their customers.
 Natural language processing (NLP): Supervised learning plays a crucial role
in NLP tasks, including sentiment analysis, machine translation, and text
summarization, enabling machines to understand and process human language
effectively.
Advantages of Supervised learning
 Supervised learning allows collecting data and produces data output from
previous experiences.
 Helps to optimize performance criteria with the help of experience.
 Supervised machine learning helps to solve various types of real-world
computation problems.
 It performs classification and regression tasks.
 It allows estimating or mapping the result to a new sample.
 We have complete control over choosing the number of classes we want in the
training data.
Disadvantages of Supervised learning
 Classifying big data can be challenging.
 Training for supervised learning needs a lot of computation time. So, it requires a
lot of time.
 Supervised learning cannot handle all complex tasks in Machine Learning.
 Computation time is vast for supervised learning.
 It requires a labelled data set.
 It requires a training process.
Unsupervised Learning.
Unsupervised learning is a type of machine learning that learns from unlabeled data.
This means that the data does not have any pre-existing labels or categories. The goal
of unsupervised learning is to discover patterns and relationships in the data without
any explicit guidance.
Unsupervised learning is the training of a machine using information that is neither
classified nor labeled and allowing the algorithm to act on that information without
guidance. Here the task of the machine is to group unsorted information according to
similarities, patterns, and differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training labels are given to the machine. Therefore, the machine must find the hidden structure in unlabeled data by itself.
You can use unsupervised learning to examine the animal data that has been gathered
and distinguish between several groups according to the traits and actions of the
animals. These groupings might correspond to various animal species, allowing you to categorize the creatures without depending on labels that already exist.

Key Points
 Unsupervised learning allows the model to discover patterns and relationships in
unlabeled data.
 Clustering algorithms group similar data points together based on their inherent
characteristics.
 Feature extraction captures essential information from the data, enabling the
model to make meaningful distinctions.
 Label association assigns categories to the clusters based on the extracted
patterns and characteristics.
Example
Imagine you have a machine learning model trained on a large dataset of unlabeled
images, containing both dogs and cats. The model has never seen an image of a dog or
cat before, and it has no pre-existing labels or categories for these animals. Your task is
to use unsupervised learning to identify the dogs and cats in a new, unseen image.
For instance, suppose the model is given an image containing both dogs and cats, which it has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot categorize the image as "dogs and cats" by name. But it can group the animals according to their similarities, patterns, and differences; that is, it can easily divide the picture into two parts. The first part may contain all the pictures with dogs in them, and the second part may contain all the pictures with cats in them, even though no training data or examples were provided beforehand.
Unsupervised learning thus allows the model to work on its own to discover patterns and information that were previously undetected. It deals mainly with unlabelled data.
Types of Unsupervised Learning
Unsupervised learning is classified into two categories of algorithms:
 Clustering: A clustering problem is where you want to discover the inherent
groupings in the data, such as grouping customers by purchasing behavior.
 Association: An association rule learning problem is where you want to discover
rules that describe large portions of your data, such as people that buy X also
tend to buy Y.
Clustering
Clustering is a type of unsupervised learning that is used to group similar data points together. Clustering algorithms work by iteratively moving data points closer to their own cluster centers and further away from data points in other clusters. Broad approaches to clustering include:
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Common clustering algorithms and related techniques include (a minimal K-means sketch follows this list):
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
6. Gaussian Mixture Models (GMMs)
7. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
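Here is the minimal K-means sketch mentioned above: a handful of hypothetical 2-D points is grouped into two clusters without any labels, with scikit-learn assumed for illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],     # one natural group of points
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])    # another natural group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment for each point
print(kmeans.cluster_centers_)   # learned cluster centers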
Association rule learning
Association rule learning is a type of unsupervised learning that is used to identify patterns in data. Association rule learning algorithms work by finding relationships between different items in a dataset.
Some common association rule learning algorithms include:
 Apriori Algorithm
 Eclat Algorithm
 FP-Growth Algorithm
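The intuition behind these algorithms can be shown with a tiny sketch that computes the support and confidence of one candidate rule from a few hypothetical transactions; real implementations such as Apriori search over many candidate itemsets efficiently.

# Hypothetical shopping transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]

x, y = {"bread"}, {"milk"}                                            # candidate rule: bread -> milk
n = len(transactions)
support_xy = sum(1 for t in transactions if x <= t and y <= t) / n    # fraction containing X and Y
support_x = sum(1 for t in transactions if x <= t) / n                # fraction containing X
confidence = support_xy / support_x                                   # estimate of P(Y | X)
print(support_xy, confidence)                                         # support 3/5, confidence 3/4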
Evaluating Unsupervised Learning Models
Evaluating unsupervised learning models is an important step in ensuring that the model is effective and useful. However, it can be more challenging than evaluating supervised learning models, as there is no ground truth data to compare the model's predictions to.
There are a number of different metrics that can be used to evaluate unsupervised learning models, but some of the most common ones include (a short computation sketch follows this list):
 Silhouette score: The silhouette score measures how well each data point is
clustered with its own cluster members and separated from other clusters. It
ranges from -1 to 1, with higher scores indicating better clustering.
 Calinski-Harabasz score: The Calinski-Harabasz score measures the ratio
between the variance between clusters and the variance within clusters. It
ranges from 0 to infinity, with higher scores indicating better clustering.
 Adjusted Rand index: The adjusted Rand index measures the similarity
between two clusterings. It ranges from -1 to 1, with higher scores indicating
more similar clusterings.
 Davies-Bouldin index: The Davies-Bouldin index measures the average
similarity between clusters. It ranges from 0 to infinity, with lower scores
indicating better clustering.
 F1 score: The F1 score is a weighted average of precision and recall, which are
two metrics that are commonly used in supervised learning to evaluate
classification models. However, the F1 score can also be used to evaluate non-
supervised learning models, such as clustering models.
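The sketch promised above computes three of these internal clustering metrics on a small hypothetical dataset clustered with K-means; scikit-learn is assumed for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X = np.array([[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # higher is better (range -1 to 1)
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better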
Application of Unsupervised learning
Unsupervised learning can be used to solve a wide variety of problems, including:
 Anomaly detection: Unsupervised learning can identify unusual patterns or
deviations from normal behavior in data, enabling the detection of fraud,
intrusion, or system failures.
 Scientific discovery: Unsupervised learning can uncover hidden relationships and
patterns in scientific data, leading to new hypotheses and insights in various
scientific fields.
 Recommendation systems: Unsupervised learning can identify patterns and
similarities in user behavior and preferences to recommend products, movies, or
music that align with their interests.
 Customer segmentation: Unsupervised learning can identify groups of customers
with similar characteristics, allowing businesses to target marketing campaigns
and improve customer service more effectively.
 Image analysis: Unsupervised learning can group images based on their content,
facilitating tasks such as image classification, object detection, and image
retrieval.
Advantages of Unsupervised learning
 It does not require training data to be labeled.
 Dimensionality reduction can be easily accomplished using unsupervised
learning.
 Capable of finding previously unknown patterns in data.
 Unsupervised learning can help you gain insights from unlabeled data that you
might not have been able to get otherwise.
 Unsupervised learning is good at finding patterns and relationships in data
without being told what to look for. This can help you learn new things about
your data.
Disadvantages of Unsupervised learning
 Difficult to measure accuracy or effectiveness due to lack of predefined answers
during training.
 The results often have lesser accuracy.
 The user needs to spend time interpreting and labeling the classes that result from the
clustering.
 Unsupervised learning can be sensitive to data quality, including missing values,
outliers, and noisy data.
 Without labelled data, it can be difficult to evaluate the performance of
unsupervised learning models, making it challenging to assess their
effectiveness.
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning focused on making
decisions to maximize cumulative rewards in a given situation. Unlike supervised
learning, which relies on a training dataset with predefined answers, RL involves
learning through experience. In RL, an agent learns to achieve a goal in an uncertain,
potentially complex environment by performing actions and receiving feedback through
rewards or penalties.
Key Concepts of Reinforcement Learning
 Agent: The learner or decision-maker.
 Environment: Everything the agent interacts with.
 State: A specific situation in which the agent finds itself.
 Action: All possible moves the agent can make.
 Reward: Feedback from the environment based on the action taken.
How Reinforcement Learning Works
RL operates on the principle of learning optimal behavior through trial and error. The
agent takes actions within the environment, receives rewards or penalties, and adjusts
its behavior to maximize the cumulative reward. This learning process is characterized
by the following elements:
 Policy: A strategy used by the agent to determine the next action based on the
current state.
 Reward Function: A function that provides a scalar feedback signal based on
the state and action.
 Value Function: A function that estimates the expected cumulative reward from
a given state.
 Model of the Environment: A representation of the environment that helps in
planning by predicting future states and rewards.
Example: Navigating a Maze
The problem is as follows: We have an agent and a reward, with many hurdles in
between. The agent is supposed to find the best possible path to reach the reward.
Consider a grid containing a robot, a diamond, and fire. The goal of the robot is to reach the reward, the diamond, while avoiding the hurdles, the fire. The robot learns by trying all the possible paths and then choosing the path that reaches the reward with the fewest hurdles. Each right step gives the robot a reward, and each wrong step subtracts from its reward. The total reward is calculated when the robot reaches the final reward, the diamond.
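As a rough illustration of this trial-and-error learning, the sketch below runs tabular Q-learning on a tiny grid with a diamond cell (+1) and a fire cell (-1). The grid layout, rewards, and hyperparameters (alpha, gamma, epsilon) are illustrative assumptions, not values taken from the text.

# Toy tabular Q-learning sketch: the agent starts at (2, 0), the diamond (+1)
# is at (0, 3) and the fire (-1) is at (1, 3); all values are illustrative.
import random

ROWS, COLS = 3, 4
GOAL, FIRE, START = (0, 3), (1, 3), (2, 0)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
alpha, gamma, epsilon = 0.5, 0.9, 0.1

Q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(ROWS) for c in range(COLS)}

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (max(0, min(ROWS - 1, r + dr)), max(0, min(COLS - 1, c + dc)))
    if nxt == GOAL:
        return nxt, 1.0, True
    if nxt == FIRE:
        return nxt, -1.0, True
    return nxt, -0.04, False   # small step cost encourages short paths

for episode in range(500):
    state, done = START, False
    while not done:
        # epsilon-greedy policy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            action = random.choice(list(ACTIONS))
        else:
            action = max(Q[state], key=Q[state].get)
        nxt, reward, done = step(state, action)
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')
        Q[state][action] += alpha * (reward + gamma * max(Q[nxt].values()) - Q[state][action])
        state = nxt

print("Greedy first move from the start state:", max(Q[START], key=Q[START].get))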
Main points in Reinforcement learning:
 Input: The input should be an initial state from which the model will start.
 Output: There are many possible outputs, as there are a variety of solutions to a particular problem.
 Training: The training is based upon the input; the model will return a state, and the user will decide to reward or punish the model based on its output.
 The model continues to learn.
 The best solution is decided based on the maximum reward.
Difference between Reinforcement learning and Supervised learning:
 Reinforcement learning is all about making decisions sequentially: the output depends on the state of the current input, and the next input depends on the output of the previous input. In supervised learning, the decision is made on the initial input, i.e., the input given at the start.
 In reinforcement learning, decisions are dependent, so labels are given to sequences of dependent decisions. In supervised learning, decisions are independent of each other, so a label is given to each decision.
 Examples of reinforcement learning: a chess game, text summarization. Examples of supervised learning: object recognition, spam detection.
Types of Reinforcement:
1. Positive: Positive reinforcement occurs when an event, produced by a particular behavior, increases the strength and frequency of that behavior. In other words, it has a positive effect on behavior.
Advantages of positive reinforcement:
 Maximizes performance
 Sustains change for a long period of time
Disadvantage of positive reinforcement:
 Too much reinforcement can lead to an overload of states, which can diminish the results
2. Negative: Negative reinforcement is the strengthening of a behavior because a negative condition is stopped or avoided.
Advantages of negative reinforcement:
 Increases behavior
 Helps maintain a minimum standard of performance
Disadvantage of negative reinforcement:
 It only provides enough to meet the minimum behavior
Elements of Reinforcement Learning
i) Policy: Defines the agent’s behavior at a given time.
ii) Reward Function: Defines the goal of the RL problem by providing feedback.
iii) Value Function: Estimates long-term rewards from a state.
iv) Model of the Environment: Helps in predicting future states and rewards for
planning.
Support Vector Machine
A Support Vector Machine (SVM) is a powerful machine learning
algorithm widely used for both linear and nonlinear classification, as well
as regression and outlier detection tasks. SVMs are highly adaptable, making them
suitable for various applications such as text classification, image
classification, spam detection, handwriting identification, gene expression
analysis, face detection, and anomaly detection.
SVMs are particularly effective because they focus on finding the maximum
separating hyperplane between the different classes in the target feature, making
them robust for both binary and multiclass classification. In this outline, we will
explore the Support Vector Machine (SVM) algorithm, its applications, and how it
effectively handles both linear and nonlinear classification, as well
as regression and outlier detection tasks.
Support Vector Machine
A Support Vector Machine (SVM) is a supervised machine learning algorithm used
for both classification and regression tasks. While it can be applied to regression
problems, SVM is best suited for classification tasks. The primary objective of
the SVM algorithm is to identify the optimal hyperplane in an N-dimensional space
that can effectively separate data points into different classes in the feature space. The
algorithm ensures that the margin between the closest points of different classes,
known as support vectors, is maximized.
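A minimal sketch of fitting such a maximum-margin classifier, assuming scikit-learn is available; the synthetic blob data and the large C value (used to approximate a hard margin) are illustrative choices.

# Minimal sketch (assumes scikit-learn): fit a linear SVM on two blobs and
# inspect the learned hyperplane and its support vectors.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1000)   # large C approximates a hard margin
clf.fit(X, y)

print("w =", clf.coef_[0])            # normal vector of the hyperplane
print("b =", clf.intercept_[0])       # offset of the hyperplane
print("support vectors:", clf.support_vectors_)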
The dimension of the hyperplane depends on the number of features. For instance, if
there are two input features, the hyperplane is simply a line, and if there are three
input features, the hyperplane becomes a 2-D plane. As the number of features
increases beyond three, the complexity of visualizing the hyperplane also increases.
Consider two independent variables, x1 and x2, and one dependent variable
represented as either a blue circle or a red circle.
 In this scenario, the hyperplane is a line because we are working with two
features (x1 and x2).
 There are multiple lines (or hyperplanes) that can separate the data points.
 The challenge is to determine the best hyperplane that maximizes the
separation margin between the red and blue circles.
Linearly Separable Data points
From the figure above it’s very clear that there are multiple lines (our hyperplane here
is a line because we are considering only two input features x1, x2) that segregate our
data points or do a classification between red and blue circles. So how do we choose
the best line or in general the best hyperplane that segregates our data
points?
How does Support Vector Machine Algorithm Work?
One reasonable choice for the best hyperplane in a Support Vector Machine
(SVM) is the one that maximizes the separation margin between the two classes.
The maximum-margin hyperplane, also referred to as the hard margin, is selected
based on maximizing the distance between the hyperplane and the nearest data point
on each side.
Multiple hyperplanes separate the data from two classes
So we choose the hyperplane whose distance from it to the nearest data point on each
side is maximized. If such a hyperplane exists it is known as the maximum-margin
hyperplane/hard margin. So from the above figure, we choose L2. Let’s consider a
scenario like shown below
Selecting hyperplane for data with outlier
Here we have one blue ball inside the region of the red balls. So how does SVM classify the data? The blue ball among the red ones is an outlier of the blue class. The SVM algorithm has the ability to ignore the outlier and find the hyperplane that maximizes the margin, so SVM is robust to outliers.
Hyperplane which is the most optimized one
For this type of data, SVM finds the maximum margin as it did for the previous data sets, and in addition it adds a penalty each time a point crosses the margin. The margins in such cases are called soft margins. With a soft margin, the SVM tries to minimize (1/margin + λ·Σ penalty). Hinge loss is a commonly used penalty: if there are no violations there is no hinge loss, and if there are violations the hinge loss is proportional to the distance of the violation.
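A small NumPy sketch of this hinge-loss behaviour; the labels and decision values below are made-up illustrative numbers.

# Hinge loss sketch: loss = max(0, 1 - y * f(x)) for label y in {-1, +1} and
# decision value f(x) = w.x + b. Correctly classified points beyond the margin
# contribute 0; violations grow linearly with their distance.
import numpy as np

def hinge_loss(y, decision_values):
    return np.maximum(0.0, 1.0 - y * decision_values)

y = np.array([+1, +1, -1, -1])
f_x = np.array([2.0, 0.5, -3.0, 0.2])   # illustrative decision values
print(hinge_loss(y, f_x))                # [0.  0.5 0.  1.2]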
Till now, we were talking about linearly separable data (the groups of blue balls and red balls are separable by a straight line). What do we do if the data are not linearly separable?
Original 1D dataset for classification
Say our data are as shown in the figure above. SVM solves this by creating a new variable using a kernel. For a point x_i on the line, we create a new variable y_i as a function of its distance from the origin o; if we plot this, we get something like the figure shown below.
Mapping 1D data to 2D to be able to separate the two classes
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel.
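A tiny sketch of this idea: each 1-D point x is mapped to (x, x²), i.e., a second coordinate equal to its squared distance from the origin, after which a horizontal line separates the classes. The data and the chosen mapping are illustrative assumptions.

# Sketch of the kernel idea: 1-D points near the origin (class +1) cannot be
# separated from the outer points (class -1) by a single threshold, but the
# mapping x -> (x, x**2) makes them separable by a horizontal line in 2-D.
import numpy as np

x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
y = np.where(np.abs(x) <= 1.5, 1, -1)        # inner points form class +1
features_2d = np.column_stack([x, x ** 2])   # new variable: squared distance from origin

for point, label in zip(features_2d, y):
    print(point, "-> class", label)
# Any horizontal line x2 = constant between 1 and 4 (e.g. x2 = 3) separates the classes.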
Support Vector Machine Terminology
 Hyperplane: The hyperplane is the decision boundary used to separate data
points of different classes in a feature space. For linear classification, this is a
linear equation represented as wx+b=0.
 Support Vectors: Support vectors are the closest data points to the
hyperplane. These points are critical in determining the hyperplane and the
margin in Support Vector Machine (SVM).
 Margin: The margin refers to the distance between the support vector and
the hyperplane. The primary goal of the SVM algorithm is to maximize this
margin, as a wider margin typically results in better classification performance.
 Kernel: The kernel is a mathematical function used in SVM to map input data
into a higher-dimensional feature space. This allows the SVM to find a hyperplane
in cases where data points are not linearly separable in the original space.
Common kernel functions include linear, polynomial, radial basis function
(RBF), and sigmoid.
 Hard Margin: A hard margin refers to the maximum-margin hyperplane that
perfectly separates the data points of different classes without any
misclassifications.
 Soft Margin: When data contains outliers or is not perfectly separable, SVM
uses the soft margin technique. This method introduces a slack variable for
each data point to allow some misclassifications while balancing between
maximizing the margin and minimizing violations.
 C: The C parameter in SVM is a regularization term that balances margin maximization and the penalty for misclassifications. A higher C value imposes a stricter penalty for margin violations, leading to a smaller margin but fewer misclassifications (a short sketch of this trade-off follows this list).
 Hinge Loss: The hinge loss is a common loss function in SVMs. It penalizes
misclassified points or margin violations and is often combined with a
regularization term in the objective function.
 Dual Problem: The dual problem in SVM involves solving for the Lagrange
multipliers associated with the support vectors. This formulation allows for the
use of the kernel trick and facilitates more efficient computation.
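A brief sketch, assuming scikit-learn, of the trade-off controlled by C: a smaller C tolerates more margin violations (a wider, softer margin, usually with more support vectors), while a larger C penalizes violations more heavily. The blob data and the C values are illustrative choices.

# Sketch (assumes scikit-learn): vary the regularization parameter C on
# slightly overlapping blobs and watch how many support vectors are kept.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: total support vectors = {clf.n_support_.sum()}")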
Mathematical Computation: SVM
Consider a binary classification problem with two classes, labeled as +1 and -1. We
have a training dataset consisting of input feature vectors X and their corresponding
class labels Y.
The equation for the linear hyperplane can be written as:
w · x + b = 0
The vector w represents the normal vector to the hyperplane, i.e., the direction perpendicular to the hyperplane. The parameter b represents the offset or distance of the hyperplane from the origin along the normal vector w.
The distance between a data point x_i and the decision boundary can be calculated as:
d_i = (w · x_i + b) / ||w||
Optimization:
 For the hard-margin linear SVM classifier, we minimize (1/2)||w||² subject to y_i (w · x_i + b) ≥ 1 for every training point (x_i, y_i).
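A quick NumPy check of these formulas with assumed values of w, b, and two example points (one per class); the numbers are purely illustrative.

# Check the distance formula and the hard-margin constraint y_i*(w.x_i + b) >= 1
# for an assumed hyperplane w.x + b = 0 and two illustrative points.
import numpy as np

w = np.array([1.0, -1.0])
b = -0.5
X = np.array([[3.0, 1.0], [1.0, 3.0]])   # one point per class (illustrative)
y = np.array([+1, -1])

decision = X @ w + b                      # w.x_i + b for each point
distance = decision / np.linalg.norm(w)   # signed distance to the hyperplane
print("signed distances:", distance)
print("margin constraints satisfied:", y * decision >= 1)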
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector Machines (SVM) can be
divided into two main parts:
 Linear SVM: Linear SVMs use a linear decision boundary to separate the data
points of different classes. When the data can be precisely linearly separated,
linear SVMs are very suitable. This means that a single straight line (in 2D) or a
hyperplane (in higher dimensions) can entirely divide the data points into their
respective classes. A hyperplane that maximizes the margin between the classes
is the decision boundary.
 Non-Linear SVM: Non-Linear SVM can be used to classify data when it cannot
be separated into two classes by a straight line (in the case of 2D). By using
kernel functions, nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions into a higher-
dimensional feature space, where the data points can be linearly separated; the linear separator found in that space corresponds to a nonlinear decision boundary in the original input space (a short sketch contrasting the two follows this list).
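A short sketch, assuming scikit-learn, that contrasts a linear SVM and an RBF-kernel SVM on concentric-circle data that is not linearly separable; the dataset parameters are illustrative.

# Sketch (assumes scikit-learn): on concentric circles a linear SVM struggles,
# while an RBF-kernel SVM separates the classes by implicitly working in a
# higher-dimensional feature space.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ("linear", "rbf"):
    acc = SVC(kernel=kernel, C=1.0).fit(X, y).score(X, y)
    print(f"{kernel} kernel training accuracy: {acc:.2f}")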
Popular kernel functions in SVM
Commonly used kernel functions include:
 Linear: K(x, y) = x · y
 Polynomial: K(x, y) = (x · y + c)^d
 Radial basis function (RBF): K(x, y) = exp(−γ ||x − y||²)
 Sigmoid: K(x, y) = tanh(γ x · y + c)