
UNIT IV- SUPERVISED LEARNING

Neural Network: Introduction – Perceptron Networks – Adaline – Back propagation networks
Decision Tree: Entropy – Information gain – Gini Impurity – Classification algorithm – Rule based Classification – Naive Bayesian classification – Support Vector Machines (SVM)
1.Neural Networks
Neural networks extract identifying features from data, lacking pre-programmed understanding.
Network components include neurons, connections, weights, biases, propagation functions, and a
learning rule. Neurons receive inputs, governed by thresholds and activation functions. Connections
involve weights and biases that regulate information transfer. Learning, the adjustment of weights and biases, occurs in three stages: input computation, output generation, and iterative refinement, which enhances the network's proficiency in diverse tasks.
These include:
1. The neural network is stimulated by a new environment.
2. The free parameters of the neural network are then changed as a result of this stimulation.
3. The neural network then responds in a new way to the environment because of the changes in its free parameters.
Importance of Neural Networks
 The ability of neural networks to identify patterns, solve intricate puzzles, and adjust to changing surroundings is essential.
 Their capacity to learn from data has far-reaching effects, ranging from revolutionizing technology like natural language processing and self-driving automobiles to automating decision-making processes and increasing efficiency in numerous industries. The development of artificial intelligence is largely dependent on neural networks, which also drive innovation and influence the direction of technology.
How do Neural Networks work?
Let’s understand with an example of how a neural network works:
Consider a neural network for email classification. The input layer takes features like email
content, sender information, and subject. These inputs, multiplied by adjusted weights, pass through
hidden layers. The network, through training, learns to recognize patterns indicating whether an email
is spam or not. The output layer, with a binary activation function, predicts whether the email is spam
(1) or not (0). As the network iteratively refines its weights through backpropagation, it becomes
adept at distinguishing between spam and legitimate emails, showcasing the practicality of neural
networks in real-world applications like email filtering.
Working of a Neural Network
Neural networks are complex systems that mimic some features of the functioning of the human brain. A network is composed of an input layer, one or more hidden layers, and an output layer, each made up of interconnected artificial neurons. The basic process has two stages: forward propagation and backpropagation.

Forward Propagation
 Input Layer: Each feature in the input layer is represented by a node on the network, which receives input data.
 Weights and Connections: The weight of each neuronal connection indicates how strong the connection is. Throughout training, these weights are changed.
 Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by weights, adding them up, and then passing them through an activation function. This introduces non-linearity, enabling the network to recognize intricate patterns.
 Output: The final result is produced by repeating the process until the output layer is reached.
Backpropagation
 Loss Calculation: The network's output is evaluated against the real target values, and a loss function is used to compute the difference. For a regression problem, the Mean Squared Error (MSE) is commonly used as the cost function.

Loss Function: MSE = (1/n) ∑i (yi − ŷi)²
 Gradient Descent: Gradient descent is then used by the network to reduce the loss. To lower the error, weights are changed based on the derivative of the loss with respect to each weight.
 Adjusting weights: The weights at each connection are adjusted by applying this iterative process, or backpropagation, backward across the network.
 Training: During training with different data samples, the entire process of forward propagation, loss calculation, and backpropagation is done iteratively, enabling the network to adapt and learn patterns from the data.
 Activation Functions: Non-linearity is introduced into the model by activation functions like the rectified linear unit (ReLU) or sigmoid. Whether a neuron "fires" is decided based on the whole weighted input.
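To make forward propagation concrete, here is a minimal sketch for a binary classifier in the spirit of the spam example above. The feature encoding, layer sizes, and random weights are illustrative assumptions, not values from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative spam-style features: e.g. link count, sender score, subject length
x = np.array([0.8, 0.2, 0.5])

# Randomly initialized weights and biases (3 inputs -> 4 hidden -> 1 output)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

# Forward propagation: weighted sums passed through activation functions
h = sigmoid(x @ W1 + b1)        # hidden layer activations
y_hat = sigmoid(h @ W2 + b2)    # output in (0, 1)

# Binary decision: spam (1) or not spam (0)
print("spam" if y_hat[0] > 0.5 else "not spam", float(y_hat[0]))
```

Training (the backpropagation stage) would then adjust W1, b1, W2, b2 to reduce the loss; a full training-loop sketch is given in the back propagation section later in this unit.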
Types of Neural Networks
Commonly used types of neural networks include:
 Feedforward Networks: A feedforward neural network is a simple artificial neural network
architecture in which data moves from input to output in a single direction. It has input,
hidden, and output layers; feedback loops are absent. Its straightforward architecture makes it
appropriate for a number of applications, such as regression and pattern recognition.
 Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.
 Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. It employs convolutional
layers to automatically learn hierarchical features from input images, enabling effective image
recognition and classification. CNNs have revolutionized computer vision and are pivotal in
tasks like object detection and image analysis.
 Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial neural network type intended for sequential data processing. It makes use of feedback loops, which enable information to persist within the network, making it appropriate for applications where contextual dependencies are critical, such as time series prediction and natural language processing.
 Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.

2.Perceptron
Perceptron is a Machine Learning algorithm used for supervised learning of various binary classification tasks. A Perceptron can also be understood as an Artificial Neuron, or neural network unit, that helps to detect certain input data computations in business intelligence.
The Perceptron model is also treated as one of the best and simplest types of Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it a single-layer neural network with four main parameters, i.e., input values, weights and bias, net sum, and an activation function.
Basic Components of Perceptron
Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main components. These are as follows:
o Input Nodes or Input Layer: This is the primary component of Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias: Weight represents the strength of the connection between units and is another most important parameter of Perceptron components. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be considered as the intercept in a linear equation.
o Activation Function: This is the final and important component that helps to determine whether the neuron will fire or not. The activation function can be considered primarily as a step function.
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function

The data scientist chooses an activation function based on the problem statement and the desired outputs. The activation function used in a perceptron model may differ (e.g., Sign, Step, or Sigmoid) depending on considerations such as whether the learning process is slow or suffers from vanishing or exploding gradients.
How does Perceptron work?
In Machine Learning, Perceptron is considered a single-layer neural network that consists of four main parameters: input values (input nodes), weights and bias, net sum, and an activation function. The perceptron model begins by multiplying all input values with their weights, then adds these values together to create the weighted sum. This weighted sum is then applied to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is mapped
between required values (0,1) or (-1,1). It is important to note that the weight of input is indicative of
the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function
curve up or down.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply all input values with their corresponding weight values and then add them to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called bias 'b' to this weighted sum to improve the model's performance:
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output either in binary form or as a continuous value, as follows:
Y = f(∑wi*xi + b)
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
Single Layer Perceptron Model:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layered perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
Multi-Layered Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model has the same model structure but with a greater number of hidden layers.
The multi-layer perceptron model is also known as the Backpropagation algorithm, which executes in two stages as follows:
o Forward Stage: Activation functions start from the input layer in the forward stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. In this stage, the error between the actual output and the desired output is propagated backward, starting from the output layer and ending at the input layer.
Perceptron Function
Perceptron function 'f(x)' is obtained by multiplying the input 'x' with the learned weight coefficient 'w' and adding the bias 'b'. Mathematically, we can express it as follows:
f(x) = 1 if w.x + b > 0
f(x) = 0 otherwise
o 'w' represents the real-valued weights vector
o 'b' represents the bias
o 'x' represents the vector of input values.
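Step-1, Step-2, and the learning of 'w' and 'b' can be sketched in a few lines of Python. The AND-gate dataset, learning rate, and epoch count below are illustrative assumptions, not values from the text.

```python
import numpy as np

def predict(x, w, b):
    # Step-1: weighted sum; Step-2: step activation function
    return 1 if np.dot(w, x) + b > 0 else 0

# Toy linearly separable data: AND gate (assumed example)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w, b, lr = np.zeros(2), 0.0, 0.1

for epoch in range(10):
    for xi, ti in zip(X, y):
        error = ti - predict(xi, w, b)
        # Perceptron learning rule: w <- w + lr * (t - y) * x
        w += lr * error * xi
        b += lr * error

print([predict(xi, w, b) for xi in X])  # expected: [0, 0, 0, 1]
```

Because the AND gate is linearly separable, the weight updates stop once every sample is classified correctly, which is exactly the single-layer perceptron's objective described above.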

3.Adaline
An artificial neural network, inspired by the human neural system, is a network used to process data and consists of three types of layers: the input layer, the hidden layer, and the output layer. The most basic neural network contains only two layers, the input and output layers. The layers are connected by weighted paths, which are used to find the net input. In this section, we will discuss two basic types of neural networks: Adaline, which doesn't have any hidden layer, and Madaline, which has one hidden layer.

1. Adaline (Adaptive Linear Neuron):

 A network with a single linear unit is called Adaline (Adaptive Linear Neuron). A unit with a linear activation function is called a linear unit. In Adaline, there is only one output unit and the output values are bipolar (+1, -1). The weights between the input unit and the output unit are adjustable. It uses the delta rule, i.e.
wi(new) = wi(old) + α(t − yin)xi,
where wi, yin, and t are the weight, predicted output, and true value respectively.
 The learning rule is designed to minimize the mean squared error between activation and target values. Adaline consists of trainable weights; it compares the actual output with the calculated output, and based on the error, the training algorithm is applied.

Workflow:

First, calculate the net input to the Adaline network, then apply the activation function to its output and compare it with the original output. If both are equal, give the output; otherwise, send an error back to the network and update the weights according to the error, which is calculated by the delta learning rule, i.e.
wi(new) = wi(old) + α(t − yin)xi,
where wi, yin, and t are the weight, predicted output, and true value respectively.

Architecture:

In Adaline, all the input neurons are directly connected to the output neuron through weighted connection paths. There is also a bias whose activation is always 1.
Algorithm:
Step 1: Initialize the weights (not zero but small random values are used). Set the learning rate α.
Step 2: While the stopping condition is False, do steps 3 to 7.
Step 3: For each training pair, perform steps 4 to 6.
Step 4: Set the activation of the input units xi = si for i = 1 to n.
Step 5: Compute the net input to the output unit:
yin = b + ∑ xi wi (for i = 1 to n)
Here, b is the bias and n is the total number of input neurons.
Step 6: Update the weights and bias for i = 1 to n:
wi(new) = wi(old) + α(t − yin)xi
b(new) = b(old) + α(t − yin)
and calculate the error E = (t − yin)².
When the predicted output and the true value are the same, the weights will not change.
Step 7: Test the stopping condition. The stopping condition may be when the weights change at a low rate or not at all.
Implementations
Problem: Design an OR gate using an Adaline network.
Solution:
 Initially, all weights and the bias are assumed to be small random values, say 0.1, and the learning rate α is also set to 0.1.
 Also, set the least squared error threshold to 2.
 The weights will be updated until the total error per epoch is no longer greater than the least squared error threshold.

x1    x2    t
 1     1    1
 1    -1    1
-1     1    1
-1    -1   -1

 Calculate the net input (when x1 = x2 = 1):
yin = b + x1w1 + x2w2 = 0.1 + 0.1 + 0.1 = 0.3
 Now compute (t − yin) = (1 − 0.3) = 0.7
 Now, update the weights and bias:
∆w1 = α(t − yin)x1 = 0.1 × 0.7 × 1 = 0.07, so w1 = 0.1 + 0.07 = 0.17 (and similarly w2 = 0.17, b = 0.17)
 Calculate the error: E = (t − yin)² = 0.49
Similarly, repeat the same steps for the other input vectors and you will get:

x1    x2    t     yin       (t−yin)    ∆w1       ∆w2       ∆b        w1        w2        b         (t−yin)²
 1     1    1     0.3        0.7       0.07      0.07      0.07      0.17      0.17      0.17      0.49
 1    -1    1     0.17       0.83      0.083    -0.083     0.083     0.253     0.087     0.253     0.69
-1     1    1     0.087      0.913    -0.0913    0.0913    0.0913    0.1617    0.1783    0.3443    0.83
-1    -1   -1     0.0043    -1.0043    0.1004    0.1004   -0.1004    0.2621    0.2787    0.2439    1.01
(w1, w2, and b all start the epoch at 0.1.)

This is epoch 1, where the total error is 0.49 + 0.69 + 0.83 + 1.01 = 3.02, so more epochs will run until the total error becomes less than or equal to the least squared error threshold, i.e., 2.
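The epoch-1 numbers above can be reproduced with a short script. This is a minimal sketch of the Adaline delta rule under the stated assumptions (initial weights and bias 0.1, learning rate 0.1, least squared error threshold 2):

```python
import numpy as np

# Bipolar OR gate training data
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])
t = np.array([1, 1, 1, -1])

w = np.array([0.1, 0.1])  # initial weights
b = 0.1                   # initial bias
alpha = 0.1               # learning rate

epoch, total_error = 0, float("inf")
while total_error > 2:    # least squared error threshold
    total_error = 0.0
    for xi, ti in zip(X, t):
        y_in = b + xi @ w            # Step 5: net input
        err = ti - y_in
        w += alpha * err * xi        # Step 6: delta rule updates
        b += alpha * err
        total_error += err ** 2      # accumulate (t - y_in)^2
    epoch += 1
    print(f"epoch {epoch}: total error = {total_error:.2f}")
```

The first line printed should be "epoch 1: total error = 3.02", matching the table, and the loop stops once an epoch's total error is no longer greater than 2.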

4.Back propagation networks

WHAT IS BACKPROPAGATION AND WHY IS IT IMPORTANT?

Backpropagation is an algorithm that can discover the optimal weights relatively quickly, even for a network with millions of weights.
After a neural network is defined with initial weights, and a forward pass is performed to generate the initial prediction, there is an error function which defines how far away the model is from the true prediction. There are many possible algorithms that can minimize the error function; for example, one could do a brute-force search to find the weights that generate the smallest error. However, for large neural networks, a training algorithm is needed that is very computationally efficient.
HOW BACKPROPAGATION WORKS?
1. Forward pass—weights are initialized and inputs from the training set are fed into the
network. The forward pass is carried out and the model generates its initial prediction.
2. Error function—the error function is computed by checking how far away the prediction is
from the known true value.
3. Backpropagation with gradient descent—the backpropagation algorithm calculates how
much the output values are affected by each of the weights in the model. To do this, it
calculates partial derivatives, going back from the error function to a specific neuron and its
weight. This provides complete traceability from total errors, back to a specific weight which
contributed to that error. The result of backpropagation is a set of weights that minimize the
error function.
4. Weight update—weights can be updated after every sample in the training set, but this is usually not practical. Typically, a batch of samples is run in one big forward pass, and then backpropagation is performed on the aggregate result. The batch size and number of batches used in training, called iterations, are important hyperparameters that are tuned to get the best results. Running the entire training set through the backpropagation process is called an epoch.

Training algorithm of BPNN:

1. Inputs X arrive through the preconnected path.
2. Input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
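A minimal NumPy sketch of this training loop might look like the following. The XOR dataset, sigmoid activations, layer sizes, learning rate, and epoch count are all illustrative assumptions chosen to show steps 1–6, not a production implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Step 1: toy inputs X (XOR problem, assumed for illustration)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 2: weights are randomly selected (2 -> 4 -> 1 network)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.5

for epoch in range(5000):
    # Step 3: forward pass through hidden layer to output layer
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Step 4: error in the outputs
    error = out - y

    # Steps 5-6: travel back, adjusting weights to decrease the error
    d_out = error * out * (1 - out)
    d_hid = d_out @ W2.T * h * (1 - h)
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_hid
    b1 -= lr * d_hid.sum(axis=0)

print(out.round(3).ravel())  # should approach [0, 1, 1, 0]
```

Each pass through the loop is one epoch in the terminology above: a forward pass, an error calculation, and a backward sweep that adjusts every weight by its gradient.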
Architecture of back propagation network:
The architecture of a BPN has three interconnected layers with weights on them. The hidden layer as well as the output layer also has a bias, whose activation is always 1, attached to it. The working of a BPN is in two phases: one phase sends the signal from the input layer to the output layer, and the other phase back-propagates the error from the output layer to the input layer.

5.Decision Tree
Decision Tree is a Supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the output of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies
 Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step-3. Continue this process until a stage is reached where you cannot further classify the nodes; the final node is called a leaf node.
Metrics for Splitting
 Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it were randomly classified according to the distribution of classes in the dataset.
o Gini = 1 − ∑i=1..n (pi)², where pi is the probability of an instance being classified into a particular class.
 Entropy: Measures the amount of uncertainty or impurity in the dataset.
o Entropy = −∑i=1..n pi log2(pi), where pi is the probability of an instance being classified into a particular class.
 Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split on an attribute.
o Information Gain = Entropy(parent) − ∑i=1..n (|Di| / |D|) × Entropy(Di), where Di is the i-th subset of D after splitting by an attribute.
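All three metrics are easy to compute directly from class counts. The following sketch (plain Python; the 6 YES / 4 NO toy split is an assumed example, matching the one used later in this unit) implements the formulas above:

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def information_gain(parent_counts, child_counts_list):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * entropy(c) for c in child_counts_list)
    return entropy(parent_counts) - weighted

# Assumed toy example: parent node with 6 YES / 4 NO,
# split into children with counts (3, 3) and (3, 1)
print(entropy([6, 4]))                              # ~0.971
print(gini([6, 4]))                                 # 0.48
print(information_gain([6, 4], [[3, 3], [3, 1]]))   # ~0.046
```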

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process which a human follows while making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

6.Entropy
Entropy is a measure of disorder or impurity in the given dataset.
In the decision tree, messy data are split based on values of the feature vector associated with each data point. With each split, the data becomes more homogeneous, which decreases the entropy. However, the data in some nodes will not be fully homogeneous, and there the entropy value will not be small. The higher the entropy, the harder it is to draw any conclusion. When the tree finally reaches a terminal or leaf node, maximum purity is reached.
For a dataset that has C classes, where the probability of randomly choosing data from class i is Pi, the entropy E(S) can be mathematically represented as
E(S) = −∑i=1..C Pi log2(Pi)
Suppose we have a dataset of 10 observations belonging to two classes, YES and NO. If 6 observations belong to class YES and 4 observations belong to class NO, then the entropy can be written as below.
E(S) = −Pyes log2(Pyes) − Pno log2(Pno)
Pyes is the probability of choosing Yes and Pno is the probability of choosing No. Here Pyes is 6/10 and Pno is 4/10.
E(S) = −(6/10) log2(6/10) − (4/10) log2(4/10) ≈ 0.971
If all 10 observations belong to one class, then the entropy will be equal to zero, which implies the node is a pure node.
If both classes YES and NO have an equal number of observations, then the entropy will be equal to 1.
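These three values can be checked in a couple of lines of plain Python:

```python
from math import log2

def entropy2(p_yes):
    """Binary entropy from the probability of class YES."""
    return sum(-p * log2(p) for p in (p_yes, 1 - p_yes) if p > 0)

print(entropy2(6 / 10))  # ~0.971 (mixed node, 6 YES / 4 NO)
print(entropy2(1.0))     # 0 (pure node)
print(entropy2(0.5))     # 1 (perfectly balanced node)
```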

7.Information Gain
The Information Gain measures the expected reduction in entropy. Entropy measures impurity in the data, and information gain measures the reduction in impurity achieved by a split. The feature which yields the minimum impurity (i.e., the maximum gain) will be considered for the root node.
Information gain is used to decide which feature to split on at each step in building the tree. The creation of sub-nodes increases homogeneity, that is, decreases the entropy of these nodes. The more homogeneous the child nodes, the more the variance is decreased after the split. Thus information gain is the variance reduction and can be calculated by how much the variance decreases after each split.
The information gain of a split can be calculated as the entropy of the parent node minus the weighted average of the entropies of the child nodes.
As per the above example, the dataset has 10 observations belonging to two classes YES and NO, where 6 observations belong to class YES and 4 observations belong to class NO. Suppose a colour feature splits them into a red group and a yellow group.
Red has 3 Yes outcomes and 3 No outcomes, whereas yellow has 3 Yes outcomes and 1 No outcome.
E(S) we have already calculated, and it is approximately equal to 0.971. E(red) = 1 and E(yellow) ≈ 0.811, so
Information Gain = 0.971 − [(6/10) × 1 + (4/10) × 0.811] ≈ 0.046
For a dataset having many features, the information gain of each feature is calculated. The feature having the maximum information gain will be the most important feature, and it becomes the root node for the decision tree.
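Using the entropy helper from the metrics section, the gain for this colour split can be checked directly (a small sketch; the red/yellow counts come from the example above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

parent = [6, 4]              # 6 YES, 4 NO
red, yellow = [3, 3], [3, 1]

weighted = (sum(red) / 10) * entropy(red) + (sum(yellow) / 10) * entropy(yellow)
gain = entropy(parent) - weighted
print(round(gain, 3))        # ~0.046
```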

8.Gini Index or Gini Impurity

The Gini index can also be used for feature selection. The tree chooses the feature that minimizes the Gini impurity index. A higher value of the Gini Index indicates higher impurity. The terms Gini Index and Gini Impurity are used interchangeably. The Gini Index favors large partitions and is very simple to implement. It performs only binary splits. For categorical variables, it gives the results in terms of “success” or “failure”.
The Gini Index can be calculated from the below mathematical formula, where c is the number of classes and pi is the probability associated with the i-th class:
Gini = 1 − ∑i=1..c (pi)²


9.Classification Algorithm

A decision tree classification algorithm is a supervised machine learning algorithm that uses a
tree-like structure to classify data based on feature values:
 Structure
A decision tree has a hierarchical structure with a root node, branches, internal nodes, and leaf nodes.
 Function
The algorithm uses a series of questions to classify data into different classes. The root node starts
with a question, and the branches lead to more questions until the data reaches a leaf node.
 Purpose
Decision trees are used for classification and regression modeling. Classification trees are used to
classify data, while regression trees are used to predict outcomes.
 Visualization
Decision trees are often visualized as flowcharts, which can help people understand how decisions
were made.
Here are some steps for using a decision tree classification algorithm:
1. Fit the model: Fit the model to the data.
2. Predict values: Use the classifier model to predict values.
3. Evaluate the model: Calculate the accuracy score of the model using both the train and test
data.
4. Summarize performance: Use a confusion matrix to summarize the model's performance.
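These four steps map directly onto scikit-learn's API. A minimal sketch, assuming scikit-learn is installed and using the Iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Fit the model ('entropy' uses information gain; 'gini' is the default)
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# 2. Predict values
y_pred = clf.predict(X_test)

# 3. Evaluate the model on both train and test data
print("train accuracy:", accuracy_score(y_train, clf.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, y_pred))

# 4. Summarize performance with a confusion matrix
print(confusion_matrix(y_test, y_pred))
```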

Classification trees (Yes/No types)

What we have seen above is an example of a classification tree, where the outcome was a variable like 'fit' or 'unfit'. Here the decision variable is Categorical.

10.Rule-based classifiers
A rule-based classifier is just another type of classifier which makes the class decision by using various "if...else" rules. These rules are easily interpretable, and thus these classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
 Coverage: The percentage of records which satisfy the antecedent conditions of a particular rule.
 The rules generated by rule-based classifiers are generally not mutually exclusive, i.e. many rules can cover the same record.
 The rules generated by rule-based classifiers may not be exhaustive, i.e. there may be some records which are not covered by any of the rules.
 The decision boundaries created by them are linear, but they can be much more complex than those of a decision tree, because many rules can be triggered for the same record.
An obvious question, which comes to mind after knowing that the rules are not mutually exclusive, is how the class should be decided when different rules with different consequents cover the same record.
There are two solutions to the above problem:
 Either the rules can be ordered, i.e. the class corresponding to the highest-priority rule triggered is taken as the final class;
 or we can assign votes for each class depending on the rules' weights, i.e. the rules remain unordered.
Example:
Below is the dataset to classify mushrooms as edible or poisonous:
Class       Cap Shape   Cap Surface   Bruises   Odour     Stalk Shape   Population   Habitat
edible      flat        scaly         yes       anise     tapering      scattered    grasses
poisonous   convex      scaly         yes       pungent   enlarging     several      grasses
edible      convex      smooth        yes       almond    enlarging     numerous     grasses
edible      convex      scaly         yes       almond    tapering      scattered    meadows
edible      flat        fibrous       yes       anise     enlarging     several      woods
edible      flat        fibrous       no        none      enlarging     several      urban
poisonous   conical     scaly         yes       pungent   enlarging     scattered    urban
edible      flat        smooth        yes       anise     enlarging     numerous     meadows
poisonous   convex      smooth        yes       pungent   enlarging     several      urban

Rules:
Odour = pungent and Habitat = urban -> Class = poisonous
Bruises = yes -> Class = edible (this rule covers both negative and positive records)
The given rules are not mutually exclusive.
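An ordered rule list like this can be sketched in a few lines of Python. The record encoding, the rule order, and the default class for uncovered records are illustrative assumptions:

```python
# Ordered rule list: the first matching rule wins; the last entry is a default
rules = [
    (lambda r: r["odour"] == "pungent" and r["habitat"] == "urban", "poisonous"),
    (lambda r: r["bruises"] == "yes", "edible"),
    (lambda r: True, "poisonous"),  # assumed default for uncovered records
]

def classify(record):
    for antecedent, consequent in rules:
        if antecedent(record):
            return consequent

sample = {"odour": "pungent", "habitat": "urban", "bruises": "yes"}
print(classify(sample))  # 'poisonous': the higher-priority rule fires first
```

Note that the sample record satisfies both rules with conflicting consequents; ordering the rules is what resolves the conflict, exactly as described above.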

How to generate a rule:

Sequential Rule Generation

Rules can be generated using either a general-to-specific approach or a specific-to-general approach. In the general-to-specific approach, we start with a rule with no antecedent and keep adding conditions to it as long as we see improvements in our evaluation metric. In the specific-to-general approach, we keep removing conditions from a rule that covers a very specific case. The evaluation metric can be accuracy, information gain, likelihood ratio, etc.
11.Naïve Bayes Classifier Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simplest and most effective Classification algorithms; it helps in building fast machine learning models that can make quick predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental analysis,
and classifying articles.
Why is it called Naïve Bayes?
The Naïve Bayes algorithm is comprised of two words Naïve and Bayes, Which can be described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple. Hence each feature individually contributes to identifying that it is an apple, without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the
probability of a hypothesis with prior knowledge. It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play

0 Rainy Yes

1 Sunny Yes

2 Overcast Yes

3 Overcast Yes

4 Sunny No
5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No

9 Sunny No

10 Sunny Yes

11 Rainy No

12 Overcast Yes

13 Overcast Yes

Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Likelihood table of the weather conditions:

Weather     No          Yes
Overcast    0           5            5/14 = 0.35
Rainy       2           2            4/14 = 0.29
Sunny       2           3            5/14 = 0.35
All         4/14=0.29   10/14=0.71

Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|NO)= 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the Player can play the game.
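This worked example can be verified with a few lines of Python; the numbers are taken straight from the tables above:

```python
# Counts from the frequency table: 10 Yes, 4 No, 14 records total
p_yes, p_no, p_sunny = 10 / 14, 4 / 14, 5 / 14
p_sunny_given_yes = 3 / 10   # 3 of the 10 'Yes' days are sunny
p_sunny_given_no = 2 / 4     # 2 of the 4 'No' days are sunny

# Bayes' theorem: P(class | sunny) = P(sunny | class) * P(class) / P(sunny)
posterior_yes = p_sunny_given_yes * p_yes / p_sunny
posterior_no = p_sunny_given_no * p_no / p_sunny

print(round(posterior_yes, 2), round(posterior_no, 2))  # 0.6 0.4
```

The exact posteriors are 0.60 and 0.40 (they sum to 1); the 0.41 in the hand calculation above comes from rounding P(No) and P(Sunny) to two decimal places before dividing.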
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fast and easy ML algorithms to predict a class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present or not in a document. This model is also famous for document classification tasks.
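All three variants are available in scikit-learn under the same fit/predict interface. A sketch, assuming scikit-learn and using tiny made-up data of the appropriate kind for each model:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Continuous features -> GaussianNB
X_cont = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.8], [2.9, 3.0]])
# Word counts -> MultinomialNB; word presence (0/1) -> BernoulliNB
X_counts = np.array([[2, 0, 1], [3, 1, 0], [0, 4, 2], [0, 3, 3]])
y = np.array([0, 0, 1, 1])

for model, X in [(GaussianNB(), X_cont),
                 (MultinomialNB(), X_counts),
                 (BernoulliNB(), (X_counts > 0).astype(int))]:
    model.fit(X, y)
    print(type(model).__name__, model.predict(X))
```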

12.Support Vector Machines (SVM)

Support Vector Machine Algorithm
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms, which is used for Classification as well as Regression problems. However, it is primarily used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put new data points in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the below diagram, in which two different categories are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between the two classes (cat and dog) and chooses extreme cases (support vectors), it will see the extreme cases of cats and dogs. On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which means that if a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset, which means that if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
Since it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.
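A linear SVM and its support vectors can be obtained with scikit-learn in a few lines. This sketch assumes scikit-learn is installed and uses made-up 2-d points for the two tags:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-d data: two linearly separable tags (0 = "green", 1 = "blue")
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The fitted model keeps only the points closest to the hyperplane
print("support vectors:\n", clf.support_vectors_)
print("prediction for (3, 3):", clf.predict([[3.0, 3.0]]))
```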

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used the two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as in the below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, it becomes:

Hence we get a circumference of radius 1 in the case of non-linear data.
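The z = x² + y² trick can be demonstrated directly: after adding the new feature, a plain linear SVM separates circular data. The points below are made up for illustration; in practice the same effect is obtained implicitly with a kernel such as SVC(kernel="rbf"):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Made-up circular data: class 0 inside radius 1, class 1 outside
angles = rng.uniform(0, 2 * np.pi, 100)
radii = np.concatenate([rng.uniform(0.0, 0.8, 50), rng.uniform(1.2, 2.0, 50)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = (radii > 1.0).astype(int)

# Add the third dimension z = x^2 + y^2
z = (X ** 2).sum(axis=1)
X3 = np.column_stack([X, z])

# A straight-line (planar) boundary now separates the classes
clf = SVC(kernel="linear").fit(X3, y)
print("training accuracy with z added:", clf.score(X3, y))  # ~1.0
```

In the lifted space the separating plane sits near z = 1, which projects back to the circle of radius 1 described above.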
