AIMLF-UNIT4
Forward Propagation
Input Layer: Each input feature is represented by a node in the input layer, which receives the input data.
Weights and Connections: Each connection between neurons carries a weight that indicates how strong the connection is. These weights are adjusted throughout training.
Hidden Layers: Each hidden layer neuron processes inputs by multiplying them by weights, adding
them up, and then passing them through an activation function. By doing this, non-linearity is
introduced, enabling the network to recognize intricate patterns.
Output: The final result is produced by repeating the process until the output layer is reached.
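To make the forward pass concrete, here is a minimal NumPy sketch of one forward pass through a single hidden layer; the layer sizes, random initial weights, and sigmoid activation are illustrative choices, not fixed by the text:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input layer: one node per feature
W1 = rng.normal(size=(4, 3))    # weights on the input -> hidden connections
b1 = np.zeros(4)
W2 = rng.normal(size=(1, 4))    # weights on the hidden -> output connections
b2 = np.zeros(1)

h = sigmoid(W1 @ x + b1)        # hidden layer: weighted sum, then activation
y = W2 @ h + b2                 # output layer (linear output, e.g. for regression)
print(y)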
Backpropagation
Loss Calculation: The network’s output is evaluated against the real goal values, and a loss
function is used to compute the difference. For a regression problem, the Mean Squared
Error (MSE) is commonly used as the cost function.
Loss Function: MSE = (1/n) ∑(yi – ŷi)², where yi is the true target value and ŷi is the network's prediction.
Gradient Descent: The network then uses gradient descent to reduce the loss. To lower the error, each weight is changed based on the derivative of the loss with respect to that weight.
Adjusting Weights: The weights at each connection are adjusted by applying this iterative process, known as backpropagation, backward across the network.
Training: During training with different data samples, the entire process of forward
propagation, loss calculation, and backpropagation is done iteratively, enabling the network to
adapt and learn patterns from the data.
Activation Functions: Non-linearity is introduced into the model by activation functions such as the rectified linear unit (ReLU) or sigmoid. They decide whether a neuron should "fire" based on its total weighted input.
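The pieces above (forward propagation, MSE loss, backpropagation, gradient descent) fit together as in the following sketch. It is a minimal illustration for a one-hidden-layer regression network; the shapes, learning rate, random data, and sigmoid activation are assumptions made for the example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                # 8 samples, 3 features
t = rng.normal(size=(8, 1))                # regression targets
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
lr = 0.1                                   # learning rate

for epoch in range(100):
    h = sigmoid(X @ W1 + b1)               # forward propagation
    y = h @ W2 + b2
    loss = np.mean((y - t) ** 2)           # MSE loss
    dy = 2 * (y - t) / len(X)              # backpropagation: output layer first
    dW2, db2 = h.T @ dy, dy.sum(axis=0)
    dh = dy @ W2.T * h * (1 - h)           # chain rule through the sigmoid
    dW1, db1 = X.T @ dh, dh.sum(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1         # gradient descent weight updates
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final MSE: {loss:.4f}")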
Types of Neural Networks
Several types of neural networks can be used; the most common are described below.
Feedforward Networks: A feedforward neural network is a simple artificial neural network
architecture in which data moves from input to output in a single direction. It has input,
hidden, and output layers; feedback loops are absent. Its straightforward architecture makes it
appropriate for a number of applications, such as regression and pattern recognition.
Multilayer Perceptron (MLP): MLP is a type of feedforward neural network with three or
more layers, including an input layer, one or more hidden layers, and an output layer. It uses
nonlinear activation functions.
Convolutional Neural Network (CNN): A Convolutional Neural Network (CNN) is a
specialized artificial neural network designed for image processing. It employs convolutional
layers to automatically learn hierarchical features from input images, enabling effective image
recognition and classification. CNNs have revolutionized computer vision and are pivotal in
tasks like object detection and image analysis.
Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is an artificial neural network type intended for sequential data processing. It makes use of feedback loops, which enable information to persist within the network, making it appropriate for applications where contextual dependencies are critical, such as time series prediction and natural language processing.
Long Short-Term Memory (LSTM): LSTM is a type of RNN that is designed to overcome
the vanishing gradient problem in training RNNs. It uses memory cells and gates to
selectively read, write, and erase information.
2.Perceptron
The Perceptron is a Machine Learning algorithm for supervised learning of various binary classification tasks. Further, the Perceptron can also be understood as an Artificial Neuron, or neural network unit, that performs certain computations on input data, for example in business intelligence.
The Perceptron model is also treated as one of the best and simplest types of Artificial Neural Networks. It is a supervised learning algorithm for binary classifiers. Hence, we can consider it a single-layer neural network with four main parameters: input values, weights and bias, net sum, and an activation function.
Basic Components of Perceptron
Frank Rosenblatt invented the perceptron model as a binary classifier which contains three main components. These are as follows:
o Input Nodes or Input Layer: This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
o Weight and Bias: The weight parameter represents the strength of the connection between units and is another most important parameter of the Perceptron. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be considered as the intercept in a linear equation.
o Activation Function: This is the final and most important component, which helps to determine whether the neuron will fire or not. The activation function can be considered primarily as a step function.
Types of Activation functions:
o Sign function
o Step function, and
o Sigmoid function
The data scientist uses the activation function to make a decision suited to the given problem statement and to form the desired outputs. The activation function used (e.g., Sign, Step, or Sigmoid) may differ between perceptron models depending on whether the learning process is slow or suffers from vanishing or exploding gradients.
How does Perceptron work?
In Machine Learning, the Perceptron is considered a single-layer neural network that consists of four main parameters: input values (input nodes), weights and bias, net sum, and an activation function. The perceptron model begins by multiplying all input values by their corresponding weights and adding these products together to create the weighted sum. This weighted sum is then applied to the activation function 'f' to obtain the desired output. This activation function is also known as the step function and is represented by 'f'.
This step function or activation function plays a vital role in ensuring that the output is mapped between the required values, (0,1) or (-1,1). It is important to note that the weight of an input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
Perceptron model works in two important steps as follows:
Step-1
In the first step, multiply all input values with their corresponding weight values and then add them to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = x1*w1 + x2*w2 + … + xn*wn
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output that is either binary or a continuous value, as follows:
Y = f(∑wi*xi + b)
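These two steps can be written as a short Python sketch; the example inputs, weights, and the {-1, 1} step function are illustrative choices:

def perceptron_output(x, w, b):
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + b   # Step-1: sum(wi*xi) + b
    return 1 if weighted_sum >= 0 else -1                     # Step-2: step activation f

print(perceptron_output(x=[1.0, -2.0], w=[0.5, 0.3], b=0.1))  # prints 1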
Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model
Single Layer Perceptron Model:
This is one of the easiest types of Artificial Neural Networks (ANN). A single-layer perceptron model consists of a feed-forward network and also includes a threshold transfer function inside the model. The main objective of the single-layer perceptron model is to analyze linearly separable objects with binary outcomes.
Multi-Layered Perceptron Model: A multi-layer perceptron model has the same core structure but with one or more hidden layers between the input and output layers and non-linear activation functions, which allows it to learn patterns that are not linearly separable.
3.Adaline
An artificial neural network, inspired by the human neural system, is a network used to process data and consists of three types of layers: the input layer, the hidden layer, and the output layer. The most basic neural network contains only two layers, the input and output layers. The layers are connected by weighted paths, which are used to find the net input. In this section, we will discuss two basic types of neural networks: Adaline, which doesn't have any hidden layer, and Madaline, which has one hidden layer.
Workflow:
First, calculate the net input to the Adaline network, then apply the activation function to obtain its output, and compare that output with the target output. If the two are equal, produce the output; otherwise send the error back through the network and update the weights according to the error, which is calculated by the delta rule.
Architecture:
In Adaline, all the input neurons are directly connected to the output neuron through weighted paths. A bias b, whose input activation is always 1, is also present.
Algorithm:
Step 1: Initialize the weights (not zero, but small random values). Set the learning rate α.
Step 2: While the stopping condition is False do steps 3 to 7.
Step 3: For each training pair, perform steps 4 to 6.
Step 4: Set activation of input unit xi = si for (i=1 to n).
Step 5: Compute the net input to the output unit:
yin = b + ∑xi*wi
Step 6: Update the weights and bias using the delta rule:
wi(new) = wi(old) + α(t - yin)*xi
b(new) = b(old) + α(t - yin)
When the predicted output and the true value are the same, the weights will not change.
Step 7: Test the stopping condition. The stopping condition may be that the weights change at a very low rate or do not change at all.
Implementations
Problem: Design an OR gate using the Adaline network.
Solution:
Initially, all weights are assumed to be small random values, say 0.1, and the learning rate α is also set to 0.1.
Also, set the least squared error to 2.
The weights will keep being updated as long as the total error is greater than the least squared error.
x1 x2 t
1 1 1
1 -1 1
-1 1 1
-1 -1 -1
x1    x2    t     yin      (t-yin)    ∆w1        ∆w2        ∆b         w1 (0.1)   w2 (0.1)   b (0.1)    (t-yin)^2
1     1     1     0.3      0.7        0.07       0.07       0.07       0.17       0.17       0.17       0.49
1     -1    1     0.17     0.83       0.083      -0.083     0.083      0.253      0.087      0.253      0.69
-1    1     1     0.087    0.913      -0.0913    0.0913     0.0913     0.1617     0.1783     0.3443     0.83
-1    -1    -1    0.0043   -1.0043    0.1004     0.1004     -0.1004    0.2621     0.2787     0.2439     1.01
This is epoch 1, where the total error is 0.49 + 0.69 + 0.83 + 1.01 = 3.02, so more epochs will run until the total error becomes less than or equal to the least squared error, i.e. 2.
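A minimal Python sketch of this OR-gate training loop (using the same initial weights, learning rate, and stopping threshold as the worked example) reproduces the epoch-1 total error of 3.02:

import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])  # bipolar OR inputs
t = np.array([1, 1, 1, -1])                         # targets

w, b, alpha = np.array([0.1, 0.1]), 0.1, 0.1        # initial weights, bias, learning rate
epoch = 0
while True:
    epoch += 1
    total_error = 0.0
    for x, target in zip(X, t):
        y_in = b + np.dot(w, x)        # net input
        err = target - y_in
        w = w + alpha * err * x        # delta-rule weight update
        b = b + alpha * err
        total_error += err ** 2
    print(f"epoch {epoch}: total error = {total_error:.2f}")
    if total_error <= 2:               # least squared error threshold
        break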
5.Decision Tree
Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems. It is a tree-
structured classifier, where internal nodes represent the features of a dataset, branches represent
the decision rules and each leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make any decision and have multiple branches, whereas Leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a problem/decision
based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
General structure of a decision tree:
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given dataset
and problem is the main point to remember while creating a machine learning model. Below are the
two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is easy to
understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which
further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated further after
getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according
to the given conditions.
Branch/Sub Tree: A subtree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes are called
the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root node
of the tree. This algorithm compares the values of root attribute with the record (real dataset) attribute
and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in step-3. Continue this process until a stage is reached where the nodes cannot be classified further; such final nodes are called leaf nodes (see the sketch below).
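As a sketch of these steps in practice, the following snippet fits a small tree with scikit-learn (assuming it is available) and prints the learned splits; the Iris dataset and the depth limit are illustrative choices:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# criterion plays the role of the ASM: "gini" or "entropy"
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # the root split uses the attribute with the best ASM score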
Metrics for Splitting
Gini Impurity: Measures the likelihood of an incorrect classification of a new instance if it
was randomly classified according to the distribution of classes in the dataset.
o Gini = 1 – ∑i=1..n (pi)², where pi is the probability of an instance being classified into a particular class.
Entropy: Measures the amount of uncertainty or impurity in the dataset.
o Entropy = −∑i=1..n pi log2(pi), where pi is the probability of an instance being classified into a particular class.
Information Gain: Measures the reduction in entropy or Gini impurity after a dataset is split
on an attribute.
o Information Gain = Entropy(parent) – [weighted average entropy of the children]. These three measures are sketched in code below.
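The three measures can be written directly from their formulas; the toy label list at the end is only an illustration:

import math
from collections import Counter

def gini(labels):
    # Gini = 1 - sum(pi^2)
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    # Entropy = -sum(pi * log2(pi))
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    # entropy of the parent minus the weighted average entropy of the children
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

labels = ["yes"] * 6 + ["no"] * 4
print(round(gini(labels), 3), round(entropy(labels), 3))   # 0.48 0.971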
6.Entropy
Entropy is a measure of disorder or impurity in the given dataset.
In the decision tree, messy data are split based on the values of the feature vector associated with each data point. With each split, the data become more homogeneous, which decreases the entropy. However, the data in some nodes will not be homogeneous, and there the entropy value will not be small. The higher the entropy, the harder it is to draw any conclusion. When the tree finally reaches a terminal or leaf node, maximum purity is reached.
For a dataset that has C classes, where the probability of randomly choosing a data point from class i is Pi, the entropy E(S) can be mathematically represented as:
E(S) = −∑i=1..C Pi log2(Pi)
Suppose we have a dataset of 10 observations belonging to two classes, YES and NO. If 6 observations belong to class YES and 4 observations belong to class NO, then the entropy can be written as below:
E(S) = −Pyes log2(Pyes) − Pno log2(Pno)
Pyes is the probability of choosing YES and Pno is the probability of choosing NO. Here Pyes is 6/10 and Pno is 4/10, which gives E(S) = −0.6 log2(0.6) − 0.4 log2(0.4) ≈ 0.971.
If all 10 observations belong to one class, then the entropy will be equal to zero, which implies the node is a pure node.
If both classes YES and NO have an equal number of observations, then entropy will be equal to 1.
7.Information Gain
The Information Gain measures the expected reduction in entropy. Entropy measures the impurity in the data, and information gain measures the reduction in impurity. The feature that yields the minimum impurity after the split (i.e., the maximum information gain) will be considered as the root node.
Information gain is used to decide which feature to split on at each step in building the tree. The creation of sub-nodes increases the homogeneity, that is, decreases the entropy of these nodes. The more homogeneous the child nodes are, the more the variance is decreased by the split. Thus Information Gain can be viewed as the variance reduction: it is calculated as how much the variance decreases after each split.
The information gain of a parent node can be calculated as the entropy of the parent node minus the weighted average entropy of its child nodes:
Information Gain = E(parent) − ∑k (nk/n) × E(childk)
where nk is the number of observations in child node k and n is the number of observations in the parent.
As per the above example, the dataset has 10 observations belonging to two classes, YES and NO, where 6 observations belong to class YES and 4 belong to class NO. Suppose the observations are split by colour: the red branch has 3 YES outcomes and 3 NO outcomes, whereas the yellow branch has 3 YES outcomes and 1 NO outcome.
E(S) has already been calculated above and is approximately equal to 0.971.
For a dataset having many features, the information gain of each feature is calculated. The
feature having maximum information gain will be the most important feature which will be the root
node for the decision tree.
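The information gain for the red/yellow split above can be checked with a few lines of Python; the binary-entropy helper H is just a convenience introduced for this example:

import math

def H(p):
    # binary entropy, given the probability p of one class
    return 0.0 if p in (0, 1) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

e_parent = H(6 / 10)                  # ≈ 0.971, as computed above
e_red, e_yellow = H(3 / 6), H(3 / 4)  # 1.0 and ≈ 0.811
gain = e_parent - (6 / 10 * e_red + 4 / 10 * e_yellow)
print(round(gain, 3))                 # ≈ 0.046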
9.Classification Algorithm
A decision tree classification algorithm is a supervised machine learning algorithm that uses a
tree-like structure to classify data based on feature values:
Structure
A decision tree has a hierarchical structure with a root node, branches, internal nodes, and leaf nodes.
Function
The algorithm uses a series of questions to classify data into different classes. The root node starts
with a question, and the branches lead to more questions until the data reaches a leaf node.
Purpose
Decision trees are used for classification and regression modeling. Classification trees are used to
classify data, while regression trees are used to predict outcomes.
Visualization
Decision trees are often visualized as flowcharts, which can help people understand how decisions
were made.
Here are some steps for using a decision tree classification algorithm:
1. Fit the model: Fit the model to the data.
2. Predict values: Use the classifier model to predict values.
3. Evaluate the model: Calculate the accuracy score of the model using both the train and test
data.
4. Summarize performance: Use a confusion matrix to summarize the model's performance. A minimal sketch of all four steps follows.
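The following scikit-learn sketch walks through the four steps above; the breast-cancer dataset and default tree settings are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # 1. fit the model
pred = clf.predict(X_test)                                          # 2. predict values
print(accuracy_score(y_train, clf.predict(X_train)))                # 3. train accuracy
print(accuracy_score(y_test, pred))                                 # 3. test accuracy
print(confusion_matrix(y_test, pred))                               # 4. summarize performance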
10.Rule-based classifiers
A rule-based classifier is just another type of classifier which makes the class decision by using various "if…else" rules. These rules are easily interpretable, and thus such classifiers are generally used to generate descriptive models. The condition used with "if" is called the antecedent, and the predicted class of each rule is called the consequent.
Properties of rule-based classifiers:
Coverage: The percentage of records which satisfy the antecedent conditions of a particular
rule.
The rules generated by the rule-based classifiers are generally not mutually exclusive, i.e.
many rules can cover the same record.
The rules generated by the rule-based classifiers may not be exhaustive, i.e. there may be
some records which are not covered by any of the rules.
The decision boundaries created by the individual rules are linear, but the overall model can be much more complex than a decision tree because many rules may be triggered for the same record.
An obvious question, which comes to mind after learning that the rules are not mutually exclusive, is how the class should be decided when different rules with different consequents cover the same record.
There are two solutions to the above problem:
Either rules can be ordered, i.e. the class corresponding to the highest priority rule triggered
is taken as the final class.
Otherwise, we can assign votes to each class depending on the rules' weights, i.e. the rules remain unordered.
Example:
Below is the dataset to classify mushrooms as edible or poisonous:
Class       Cap Shape   Cap Surface   Bruises   Odour     Stalk Shape   Population   Habitat
edible      flat        fibrous       yes       anise     enlarging     several      woods
edible      flat        fibrous       no        none      enlarging     several      urban
poisonous   convex      smooth        yes       pungent   enlarging     …            urban
Rules:
Odour = pungent and habitat = urban -> Class = poisonous
Bruises = yes -> Class = edible : This rule covers both positive and negative records.
The given rules are not mutually exclusive.
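An ordered rule list resolves such overlaps: the first rule that fires wins. Here is a minimal sketch, where the field names and the default class are assumptions made for illustration:

def classify(record):
    if record.get("odour") == "pungent" and record.get("habitat") == "urban":
        return "poisonous"              # highest-priority rule fires first
    if record.get("bruises") == "yes":
        return "edible"
    return "edible"                     # default class for uncovered records

# Both rules cover this record; the ordering picks the first: poisonous
print(classify({"odour": "pungent", "habitat": "urban", "bruises": "yes"}))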
Naïve Bayes' Classifier is based on Bayes' theorem:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a hypothesis
is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of the weather conditions:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4
Likelihood table of the weather conditions:
Weather    No            Yes
Overcast   0             5             5/14 = 0.35
Rainy      2             2             4/14 = 0.29
Sunny      2             3             5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|NO)= 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
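The same arithmetic in a few lines of Python, with the numbers taken straight from the tables above (the exact value of P(No|Sunny) is 0.40, which the rounded hand calculation gives as 0.41):

p_sunny_given_yes, p_sunny_given_no = 3 / 10, 2 / 4
p_yes, p_no, p_sunny = 10 / 14, 4 / 14, 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # = 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # = 0.40
print(p_yes_given_sunny > p_no_given_sunny)               # True -> play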
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other Algorithms.
o It is the most popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the
relationship between features.
Applications of Naïve Bayes Classifier:
o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
o Gaussian: The Gaussian model assumes that features follow a normal distribution. This
means if predictors take continuous values instead of discrete, then the model assumes that
these values are sampled from the Gaussian distribution.
o Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining to which category a particular document belongs, such as Sports, Politics, Education, etc. The classifier uses the frequency of words as the predictors.
o Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
Example: SVM can be understood with the example that we have used in the KNN classifier. Suppose we see a strange cat that also has some features of dogs, and we want a model that can accurately identify whether it is a cat or a dog. Such a model can be created by using the SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can learn about the different features of cats and dogs, and then we test it with this strange creature. Since the support vector machine creates a decision boundary between these two classes (cat and dog) and chooses the extreme cases (the support vectors), it will see the extreme cases of cat and dog and, on the basis of the support vectors, classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes by using a single straight line, then such data is termed linearly separable data, and the classifier used is called the Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data. If a dataset cannot be classified by using a straight line, then such data is termed non-linear data, and the classifier used is called the Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-
dimensional space, but we need to find out the best decision boundary that helps to classify the data
points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum distance between the data points of the two classes.
Support Vectors:
The data points or vectors that are the closest to the hyperplane and which affect the position of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the below image:
As it is a 2-d space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:
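A short scikit-learn sketch contrasts the two cases; the make_moons toy dataset and the default SVC parameters are illustrative assumptions:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(noise=0.1, random_state=0)   # not linearly separable

linear_svm = SVC(kernel="linear").fit(X, y)    # straight-line decision boundary
rbf_svm = SVC(kernel="rbf").fit(X, y)          # kernel handles the non-linear data

print(linear_svm.score(X, y))                  # lower: one line cannot separate the moons
print(rbf_svm.score(X, y))                     # higher: a non-linear boundary fits
print(len(rbf_svm.support_vectors_))           # the support vectors defining the margin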