Csit (r22) 3-2 Machine Learning Digital Notes
MACHINE LEARNING
[R22A6602]
PREPARED BY
P.HARIKRISHNA
UNIT – I
Introduction: Introduction to Machine Learning, Supervised learning, Unsupervised learning, Reinforcement learning, Deep learning.
Feature Selection: Filter, Wrapper, Embedded methods.
Feature Normalization: min-max normalization, z-score normalization, and constant factor normalization.
Introduction to Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
UNIT – II
Supervised Learning – I (Regression/Classification)
Regression models: Simple Linear Regression, Multiple Linear Regression. Cost Function, Gradient Descent, Performance Metrics: Mean Absolute Error (MAE), Mean Squared Error (MSE), R-Squared error, Adjusted R Square.
Classification models: Decision Trees (ID3, CART), Naive Bayes, K-Nearest-Neighbours (KNN), Logistic Regression, Multinomial Logistic Regression,
Support Vector Machines (SVM) - Nonlinearity and Kernel Methods
UNIT – III
Supervised Learning – II (Neural Networks)
Neural Network Representation – Problems – Perceptrons , Activation Functions, Artificial
Neural Networks (ANN) , Back Propagation Algorithm.
Classification Metrics: Confusion matrix, Precision, Recall, Accuracy, F-Score, ROC curves.
UNIT - IV
Model Validation in Classification: Cross Validation - Holdout Method, K-Fold, Stratified K-Fold, Leave-One-Out Cross Validation. Bias-Variance tradeoff, Regularization, Overfitting, Underfitting. Ensemble Methods: Boosting, Bagging, Random Forest.
UNIT – V
Unsupervised Learning : Clustering-K-means, K-Modes, K-Prototypes, Gaussian Mixture
Models, Expectation-Maximization.
UNIT-I
Introduction: Introduction to Machine learning, Supervised learning, Unsupervised learning, Reinforcement learning, Deep learning. Feature Selection: Filter, Wrapper, Embedded methods. Feature Normalization: min-max normalization, z-score normalization, and constant factor normalization. Introduction to Dimensionality Reduction: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA)
Introduction
Arthur Samuel, an early American leader in the field of computer gaming and artificial
intelligence, coined the term “Machine Learning” in 1959 while at IBM. He defined machine
learning as “the field of study that gives computers the ability to learn without being explicitly
programmed.” However, there is no universally accepted definition for machine learning.
Different authors define the term differently. We give below two more definitions.
In the above definitions we have used the term “model” and we will be using this term at
several contexts later. It appears that there is no universally accepted one sentence definition
of this term. Loosely, it may be understood as some mathematical expression or equation, or
some mathematical structures such as graphs and trees, or a division of sets into disjoint
subsets, or a set of logical “if . . . then . . . else . . .” rules, or some such thing. It may be noted
that this is not an exhaustive list.
Definition of learning
A widely used definition is due to Tom Mitchell: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
Examples:
A computer program which learns from experience is called a machine learning program or
simply a learning program. Such a program is sometimes also referred to as a learner.
The learning process, whether by a human or a machine, can be divided into four components,
namely, data storage, abstraction, generalization and evaluation. Figure 1.1 illustrates the
various components and the steps involved in the learning process.
Data storage
• In a human being, the data is stored in the brain and retrieved using electrochemical signals.
• Computers use hard disk drives, flash memory, random access memory and similar devices to store data, and use cables and other technology to retrieve it.
Abstraction
The second component of the learning process is known as abstraction. Abstraction is the
process of extracting knowledge about stored data. This involves creating general concepts
about the data as a whole. The creation of knowledge involves application of known models
and creation of new models.
The process of fitting a model to a dataset is known as training. When the model has been
trained, the data is transformed into an abstract form that summarizes the original information.
Generalization
The term generalization describes the process of turning the knowledge about stored data into a form that can be utilized for future action. These actions are to be carried out on tasks that are similar, but not identical, to those that have been seen before. In generalization, the goal is to discover those properties of the data that will be most relevant to future tasks.
Evaluation
Evaluation is the last component of the learning process. It is the process of giving feedback to measure the usefulness of the learned knowledge, and this feedback is used to bring about improvements in the whole learning process.
Applications of machine learning
• Application of machine learning methods to large databases is called data mining. In data mining, a large volume of data is processed to construct a simple model with valuable use, for example, having high predictive accuracy.
• In finance, banks analyze their past data to build models to use in credit applications, fraud detection, and the stock market.
• In manufacturing, learning models are used for optimization, control, and troubleshooting.
• In science, large amounts of data in physics, astronomy, and biology can only be analyzed fast enough by computers. The World Wide Web is huge and constantly growing, and searching it for relevant information cannot be done manually.
• In artificial intelligence, it is used to teach a system to learn and adapt to changes so that the system designer need not foresee and provide solutions for all possible situations.
• It is used to find solutions to many problems in vision, speech recognition, and robotics.
• Machine learning methods have been used to develop programs for playing games such as chess, backgammon and Go.
Supervised learning:
Supervised learning is the machine learning task of learning a function that maps an input to
an output based on example input-output pairs.
In supervised learning, each example in the training set is a pair consisting of an input object
(typically a vector) and an output value. A supervised learning algorithm analyzes the training
data and produces a function, which can be used for mapping new examples. In the optimal
case, the function will correctly determine the class labels for unseen instances. Both
classification and regression problems are supervised learning problems.
A wide range of supervised learning algorithms are available, each with its strengths and
weaknesses. There is no single learning algorithm that works best on all supervised learning
problems.
Supervised learning is so called because the process of an algorithm learning from the training dataset can be thought of as a teacher supervising the learning process. We know the correct answers (that is, the correct outputs); the algorithm iteratively makes predictions on the training data and is corrected by the teacher. Learning stops when the algorithm achieves an acceptable level of performance.
Example :
Consider the following data regarding patients entering a clinic. The data consists of the gender
and age of the patients and each patient is labelled as “healthy” or “sick”.
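The table itself is not reproduced here, so the sketch below assumes a small, hypothetical version of this patient data (gender encoded as 0/1) and fits a scikit-learn classifier to the labelled pairs:

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labelled examples: inputs are [gender (0=F, 1=M), age].
X = [[1, 41], [0, 62], [1, 52], [0, 23], [1, 70], [0, 35]]
y = ["healthy", "sick", "healthy", "healthy", "sick", "healthy"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)                    # learn the input-to-output mapping
print(model.predict([[1, 60]]))    # predict the label for an unseen patient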
Unsupervised learning
Unsupervised learning is a type of machine learning algorithm used to draw inferences from
datasets consisting of input data without labeled responses.
The most common unsupervised learning method is cluster analysis, which is used for
exploratory data analysis to find hidden patterns or grouping in data.
Example :
Consider the following data regarding patients entering a clinic. The data consists of the
gender and age of the patients.
Based on this data, can we infer anything regarding the patients entering the clinic?
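Since no labels are given, a clustering algorithm can only group similar patients together. A minimal sketch, again with hypothetical [gender, age] data and scikit-learn's KMeans:

from sklearn.cluster import KMeans

# Hypothetical unlabeled patient data: [gender (0=F, 1=M), age]
X = [[1, 41], [0, 62], [1, 52], [0, 23], [1, 70], [0, 35]]
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)   # inferred grouping, e.g. younger vs. older patients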
Reinforcement learning
Reinforcement learning is the problem of getting an agent to act in the world so as to maximize
its rewards.
A learner (the program) is not told what actions to take as in most forms of machine learning,
but instead must discover which actions yield the most reward by trying them. In the most
interesting and challenging cases, actions may affect not only the immediate reward but also
the next situations and, through that, all subsequent rewards.
For example, consider teaching a dog a new trick: we cannot tell it what to do, but we can
reward/punish it if it does the right/wrong thing. It has to find out what it did that made it get
the reward/punishment. We can use a similar method to train computers to do many tasks, such
as playing backgammon or chess, scheduling jobs, and controlling robot limbs. Reinforcement
learning is different from supervised learning. Supervised learning is learning from examples
provided by a knowledgeable expert.
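A minimal sketch of this trial-and-error idea is an epsilon-greedy strategy on a two-armed bandit; the reward values below are assumptions for illustration, not part of the original text:

import random

true_reward = {"action_a": 1.0, "action_b": 2.0}   # hidden from the agent
estimates = {"action_a": 0.0, "action_b": 0.0}
counts = {"action_a": 0, "action_b": 0}

for step in range(1000):
    # Explore occasionally; otherwise exploit the action that looks best so far.
    if random.random() < 0.1:
        action = random.choice(list(estimates))
    else:
        action = max(estimates, key=estimates.get)
    reward = random.gauss(true_reward[action], 1.0)  # noisy feedback, no teacher
    counts[action] += 1
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)   # the estimates should approach the true average rewards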
Deep learning
Deep learning is the subset of machine learning methods based on neural networks with
representation learning. The adjective "deep" refers to the use of multiple layers in the network.
Methods used can be either supervised, semi-supervised or unsupervised.
Feature Selection
Feature selection is a way of selecting the subset of the most relevant features from the original feature set by removing redundant, irrelevant, or noisy features.
While developing a machine learning model, only a few variables in the dataset are useful for building the model; the remaining features are either redundant or irrelevant. If we input the dataset with all these redundant and irrelevant features, it may negatively impact the overall performance and accuracy of the model. Hence it is very important to identify and select the most appropriate features from the data and remove the irrelevant or less important ones, which is done with the help of feature selection in machine learning.
Feature selection is one of the important concepts of machine learning, which highly impacts
the performance of the model. As machine learning works on the concept of "Garbage In
Garbage Out", so we always need to input the most appropriate and relevant dataset to the
model in order to get a better result.
In this topic, we will discuss different feature selection techniques for machine learning. But
before that, let's first understand some basics of feature selection.
A feature is an attribute that has an impact on a problem or is useful for the problem, and choosing the important features for the model is known as feature selection. Each machine learning process depends on feature engineering, which mainly contains two processes: Feature Selection and Feature Extraction.
Although feature selection and extraction processes may have the same objective, both are
completely different from each other. The main difference between them is that feature
selection is about selecting the subset of the original feature set, whereas feature extraction
creates new features.
So, we can define feature Selection as, "It is a process of automatically or manually selecting
the subset of most appropriate and relevant features to be used in model building." Feature
selection is performed by either including the important features or excluding the irrelevant
features in the dataset without changing them.
Before implementing any technique, it is important to understand the need for it, and the same holds for feature selection. As we know, in machine learning it is necessary to provide a pre-processed and good input dataset in order to get better outcomes. We collect a huge amount of data to train our model and help it learn better. Generally, the dataset consists of noisy data, irrelevant data, and some portion of useful data. Moreover, the huge volume of data also slows down the training process of the model, and with noise and irrelevant data, the model may not predict and perform well. So it is very necessary to remove such noise and less important data from the dataset, and Feature Selection techniques are used to do this.
Selecting the best features helps the model to perform well. For example, suppose we want to create a model that automatically decides which car should be crushed for spare parts, and to do this, we have a dataset containing the Model of the car, Year, Owner's name, and Miles. In this dataset, the name of the owner does not contribute to the model's performance, as it does not help decide whether the car should be crushed; so we can remove this column and select the rest of the features (columns) for model building.
It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
There are mainly two types of Feature Selection techniques, which are:
Supervised Feature selection techniques consider the target variable and can be used for the
labelled dataset.
Unsupervised Feature selection techniques ignore the target variable and can be used for
the unlabelled dataset.
Filter Methods:
In Filter Method, features are selected on the basis of statistics measures. This method does
not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant feature and redundant columns from the model by
using different metrics through ranking.
The advantage of using filter methods is that they need low computational time and do not overfit the data. Some common techniques of filter methods are:
• Information Gain
• Chi-square Test
• Fisher's Score
• Missing Value Ratio
Information Gain: Information gain determines the reduction in entropy while transforming
the dataset. It can be used as a feature selection technique by calculating the information gain
of each variable with respect to the target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
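A minimal sketch of chi-square-based filtering with scikit-learn's SelectKBest (the Iris data here is just a stand-in for a labelled dataset):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)           # chi2 requires non-negative features
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print(selector.scores_)                     # chi-square score of each feature
print(X_new.shape)                          # only the 2 best-scoring features remain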
Fisher's Score:
Fisher's score is one of the popular supervised techniques of features selection. It returns the
rank of the variable on the fisher's criteria in descending order. Then we can select the variables
with a large fisher's score.
Missing Value Ratio: The value of the missing value ratio can be used for evaluating the feature set against a threshold value. The formula for obtaining the missing value ratio is the number of missing values in a column divided by the total number of observations.
Wrapper Methods:
In wrapper methodology, selection of features is done by considering it as a search problem, in
which different combinations are made, evaluated, and compared with other combinations. It
trains the algorithm by using the subset of features iteratively.
On the basis of the output of the model, features are added or removed, and with this feature set, the model is trained again.
Forward selection - Forward selection is an iterative process, which begins with an empty set
of features. After each iteration, it keeps adding on a feature and evaluates the performance to
check whether it is improving the performance or not. The process continues until the addition
of a new variable/feature does not improve the performance of the model.
Recursive feature elimination - Recursive feature elimination is a recursive greedy optimization approach, where features are selected by recursively taking a smaller and smaller subset of features. An estimator is trained on each set of features, and the importance of each feature is determined using the coef_ attribute or the feature_importances_ attribute.
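A minimal sketch of recursive feature elimination with scikit-learn's RFE, on synthetic data assumed for illustration:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 = kept; higher ranks were eliminated earlier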
Embedded Methods
Embedded methods combine the advantages of both filter and wrapper methods by considering the interaction of features along with low computational cost. They are fast processing methods, similar to filter methods, but more accurate.
These methods are also iterative: each iteration is evaluated, and the most important features that contribute the most to training in that iteration are found. Some techniques of embedded methods are:
• Regularization
• Random Forest Importance
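As a sketch of the regularization flavour of embedded selection: an L1 (Lasso) penalty drives the coefficients of unhelpful features to exactly zero during training itself (synthetic data assumed for illustration):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=8, n_informative=3, random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)   # features with non-zero coefficients are the ones "selected"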
Feature normalization:
Although there are so many feature normalization techniques in Machine Learning, few of
them are most frequently used. These are as follows:
Min-max normalization
Min-max normalization rescales each value x into the range [0, 1] using the minimum and maximum of the feature:
x' = (x - min) / (max - min)
Suppose the three example values are 28, 46 and 34. Then min = 28 and max = 46, and the min-max normalized values are: 0.00, 1.00, 0.33.
Z-score normalization refers to the process of normalizing every value in a dataset such
that the mean of all of the values is 0 and the standard deviation is 1.
We use the following formula to perform a z-score normalization on every value in a dataset:
New value = (x – μ) / σ
where:
x: Original value
μ: Mean of data
σ: Standard deviation of data
For the three example values, mean (μ) = (28 + 46 + 34) / 3 = 108 / 3 = 36.0. The standard deviation is the square root of the mean of the squared differences between each value and the mean: σ = sqrt(((28 - 36)² + (46 - 36)² + (34 - 36)²) / 3) = sqrt(168 / 3) ≈ 7.48.
Therefore, the z-score normalized values are: -1.07, 1.34, -0.27.
A z-score normalized value that is positive corresponds to an x value that is greater than the
mean value, and a z-score that is negative corresponds to an x value that is less than the mean.
Constant Factor Normalization:
In constant factor normalization, every value is divided by the same constant. Dividing the three example values by a constant factor of 100 gives: 0.28, 0.46, 0.34.
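The three normalizations can be checked against the worked numbers above with a few lines of NumPy:

import numpy as np

values = np.array([28.0, 46.0, 34.0])              # the three example values

minmax = (values - values.min()) / (values.max() - values.min())
print(minmax.round(2))                             # [0.   1.   0.33]

zscore = (values - values.mean()) / values.std()   # population std, as in the text
print(zscore.round(2))                             # [-1.07  1.34 -0.27]

print(values / 100)                                # constant factor: [0.28 0.46 0.34]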
In feature extraction, we are interested in finding a new set of k features that are combinations of the original n features. These methods may be supervised or unsupervised depending on whether or not they use the output information. The best known and most widely used feature extraction methods are Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), which are both linear projection methods, unsupervised and supervised respectively.
Linear Discriminant Analysis (LDA)
Linear Discriminant Analysis is one of the most popular dimensionality reduction techniques used for supervised classification problems in machine learning. It is also considered a pre-processing step for modeling differences in ML and applications of pattern classification.
Whenever there is a requirement to separate two or more classes having multiple features efficiently, the Linear Discriminant Analysis model is considered the most common technique to solve such classification problems. For example, if we have two classes with multiple features and need to separate them efficiently, classifying them using a single feature may show overlapping.
Consider a situation where you have plotted the relationship between two variables where each
color represents a different class. One is shown with a red color and the other with blue.
If you are willing to reduce the number of dimensions to 1, you can just project everything onto the x-axis; however, the classes may then overlap, which is why LDA instead projects onto the axis that best separates the classes.
LDA is specifically used to solve supervised classification problems for two or more classes, which is not possible using logistic regression in machine learning. LDA can also be used for other purposes; some real-world applications of LDA are given below:
o Face Recognition
Face recognition is the popular application of computer vision, where each face is
represented as the combination of a number of pixel values. In this case, LDA is used
to minimize the number of features to a manageable number before going through the
classification process. It generates a new template in which each dimension consists of
a linear combination of pixel values. If a linear combination is generated using Fisher's
linear discriminant, then it is called Fisher's face.
o Medical
In the medical field, LDA has a great application in classifying the patient disease on
the basis of various parameters of patient health and the medical treatment which is
going on. On such parameters, it classifies disease as mild, moderate, or severe. This
classification helps the doctors in either increasing or decreasing the pace of the
treatment.
o Customer Identification
In customer identification, LDA is currently being applied. It means with the help of
LDA; we can easily identify and select the features that can specify the group of
customers who are likely to purchase a specific product in a shopping mall. This can be
helpful when we want to identify a group of customers who mostly purchase a product
in a shopping mall.
o For Predictions
LDA can also be used for making predictions and so in decision making. For example,
"will you buy this product” will give a predicted result of either one or two possible
classes as a buying or not.
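A minimal sketch contrasting the two projection methods in scikit-learn (Iris is used as a stand-in dataset): PCA ignores the labels, while LDA uses them.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)       # directions of maximum variance
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # maximum class separation
print(X_pca.shape, X_lda.shape)                    # both reduce 4 features to 2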
UNIT-II
Supervised Learning – I (Regression/Classification)
Linear regression:
A linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence the name linear regression.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Cost Function
The cost function is defined as the measurement of the difference or error between actual values and predicted values at the current position, expressed as a single real number. For linear regression it is typically the Mean Squared Error: J = (1/n) Σ (y_pred_i - y_i)².
Gradient Descent
It is known as one of the most commonly used optimization algorithms to train machine
learning models by means of minimizing errors between actual and expected results. Further,
gradient descent is also used to train Neural Networks.
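A minimal sketch of gradient descent minimizing the MSE cost for simple linear regression; the data and learning rate are assumptions for illustration:

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])   # roughly y = 2x + 1 with noise

m, c = 0.0, 0.0     # parameters of y = m*x + c
lr = 0.01           # learning rate
n = len(X)

for _ in range(5000):
    y_pred = m * X + c
    # Partial derivatives of MSE = (1/n) * sum((y - y_pred)^2)
    dm = (-2 / n) * np.sum(X * (y - y_pred))
    dc = (-2 / n) * np.sum(y - y_pred)
    m -= lr * dm    # step downhill along the gradient
    c -= lr * dc

print(round(m, 2), round(c, 2))   # should end up close to slope 2, intercept 1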
Mean Squared Error represents the average of the squared difference between the
original and predicted values in the data set. It measures the variance of the residuals.
Root Mean Squared Error is the square root of Mean Squared error. It measures the
standard deviation of residuals.
Evaluation Metrics
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) penalize large prediction errors more heavily vis-à-vis Mean Absolute Error (MAE). However, RMSE is more widely used than MSE to evaluate the performance of a regression model against other models, as it has the same units as the dependent variable (Y-axis).
MSE is a differentiable function, which makes it easy to perform mathematical operations on, in comparison to a non-differentiable function like MAE. Therefore, in many models, RMSE is used as the default metric for the loss function despite being harder to interpret than MAE.
The lower value of MAE, MSE, and RMSE implies higher accuracy of a regression
model. However, a higher value of R square is considered desirable.
R-Squared and Adjusted R-Squared are used for explaining how well the independent variables in the linear regression model explain the variability in the dependent variable. The R-Squared value always increases with the addition of independent variables, which might lead to the addition of redundant variables to our model. Adjusted R-Squared solves this problem.
Adjusted R squared takes into account the number of predictor variables, and it is used
to determine the number of independent variables in our model. The value of Adjusted R
squared decreases if the increase in the R square by the additional variable isn’t significant
enough.
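These metrics can be computed directly with scikit-learn; the small arrays below are assumed for illustration:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.4, 6.9, 9.5])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)

# Adjusted R^2 penalizes extra predictors: n = samples, k = independent variables
n, k = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(mae, mse, rmse, r2, adj_r2)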
Decision Trees
In simple words, a decision tree is a structure that contains nodes (rectangular boxes) and edges (arrows) and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (known as a decision node) or to represent an outcome (known as a leaf node).
Decision tree Example
The picture above depicts a decision tree that is used to classify whether a person is
Fit or Unfit.
The decision nodes here are questions like 'Is the person less than 30 years of age?', 'Does the person eat junk food?', etc., and the leaves are one of the two possible outcomes, viz. Fit and Unfit.
Looking at the decision tree, we can make the following decisions: if a person is less than 30 years of age and doesn't eat junk food then he is Fit; if a person is less than 30 years of age and eats junk food then he is Unfit; and so on.
The initial node is called the root node (colored in blue), the final nodes are called the leaf
nodes (colored in green) and the rest of the nodes are called intermediate or internal nodes.
The root and intermediate nodes represent the decisions while the leaf nodes represent the
outcomes.
ID3
ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively
(repeatedly) dichotomizes(divides) features into two or more groups at each step.
CART
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does that by searching for the best homogeneity of the sub-nodes, with the help of the Gini index criterion.
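Before looking at the Gini index in detail, here is a small sketch of both splitting criteria, entropy/information gain (ID3) and Gini impurity (CART); the Fit/Unfit labels are hypothetical:

import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)), used by ID3 via information gain.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity = 1 - sum(p_i^2), the criterion used by CART.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = ["Fit", "Fit", "Unfit", "Fit", "Unfit"]
left, right = ["Fit", "Fit", "Fit"], ["Unfit", "Unfit"]   # a candidate split

# Information gain = entropy(parent) - weighted entropy of the children
gain = entropy(labels) - (3/5) * entropy(left) - (2/5) * entropy(right)
print(entropy(labels), gini(labels), gain)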
Gini index/Gini impurity
The Gini index is a metric for the classification tasks in CART. It stores the sum of squared
probabilities of each class. It computes the degree of probability of a specific variable that is
wrongly being classified when chosen randomly and a variation of the Gini coefficient. It works
on categorical variables, provides outcomes either “successful” or “failure” and hence conducts
binary splitting only.
The degree of the Gini index varies from 0 to 1 and is computed as Gini = 1 - Σ (p_i)², where p_i is the probability of class i.
Naïve Bayes' Classifier
Naïve Bayes is a probabilistic classifier based on Bayes' theorem:
P(A|B) = P(B|A) P(A) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability of a
hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
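A sketch of these steps with assumed, encoded weather data; scikit-learn's CategoricalNB builds the frequency and likelihood tables internally and applies Bayes' theorem:

from sklearn.naive_bayes import CategoricalNB

# Outlook encoded as 0=Sunny, 1=Overcast, 2=Rainy; Play encoded as 1=Yes, 0=No
X = [[0], [0], [1], [2], [2], [1], [0], [2], [1], [0]]
y = [0, 0, 1, 1, 0, 1, 1, 1, 1, 0]

clf = CategoricalNB().fit(X, y)
print(clf.predict([[0]]))          # decision: play or not on a Sunny day
print(clf.predict_proba([[0]]))    # posterior probabilities from Bayes' theorem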
K-Nearest Neighbours (KNN)
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = sqrt((x2 - x1)² + (y2 - y1)²)
o As we can see, the 3 nearest neighbors are from category A; hence this new data point must belong to category A.
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
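A minimal K-NN sketch with k = 5, as suggested above; the dataset is synthetic and assumed for illustration:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy: fraction of correctly classified points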
Logistic Regression
o In logistic regression, y can be between 0 and 1 only, so we divide the linear equation by (1 - y) and take the logarithm, giving the log-odds form: log[y / (1 - y)] = b0 + b1x1 + b2x2 + ... + bnxn.
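A brief sketch of (multinomial) logistic regression in scikit-learn, with Iris as a stand-in three-class dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # 3 classes, so a multinomial problem
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))              # class probabilities summing to 1
print(clf.predict(X[:1]))                    # predicted class label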
Support Vector Machines (SVM)
SVM divides the dataset into classes by finding the hyperplane with the maximum margin between them. When the data is not linearly separable, kernel methods map it into a higher-dimensional space in which a separating hyperplane exists.
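A sketch of the kernel trick on data that is not linearly separable (two concentric circles, generated for illustration): a linear kernel fails while an RBF kernel separates the classes.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)   # implicitly maps data to a higher-dimensional space
print(linear_svm.score(X, y))           # poor: no separating straight line exists
print(rbf_svm.score(X, y))              # near 1.0: separable after the kernel mapping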
The term " Neural Network" is derived from Biological neural networks that develop the
structure of a human brain. Similar to the human brain that has neurons interconnected to one
another, artificial neural networks also have neurons that are interconnected to one another in
various layers of the networks. These neurons are known as nodes.
Dendrites from Biological Neural Network represent inputs in Artificial Neural Networks,
cell nucleus represents Nodes, synapse represents Weights, and Axon represents Output.
There are on the order of 100 billion neurons in the human brain. Each neuron has an association point somewhere in the range of 1,000 to 100,000. In the human brain, data is stored in a distributed manner, and we can extract more than one piece of this data, when necessary, from our memory in parallel. We can say that the human brain is made up of incredibly amazing parallel processors.
We can understand an artificial neural network with the example of a digital logic gate that takes an input and gives an output. Consider an "OR" gate, which takes two inputs. If one or both inputs are "On", the output is "On". If both inputs are "Off", the output is "Off". Here the output depends upon the input. Our brain does not perform such a fixed mapping; the relationship between its outputs and inputs keeps changing, because its neurons keep learning.
Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
Hidden Layer:
The hidden layer is present in between the input and output layers. It performs all the calculations to find hidden features and patterns.
Output Layer:
The input goes through a series of transformations in the hidden layers, and the result is finally conveyed through this layer.
The artificial neural network takes input, computes the weighted sum of the inputs, and includes a bias. This computation is represented in the form of a transfer function.
Perceptrons
Perceptron model is also treated as one of the best and simplest types of Artificial Neural
networks. However, it is a supervised learning algorithm of binary classifiers. Hence, we can
consider it as a single-layer neural network with four main parameters, i.e., input values,
weights and Bias, net sum, and an activation function.
Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node contains a real numerical value.
Weight and Bias:
The weight parameter represents the strength of the connection between units. This is another most important parameter of the Perceptron's components. Weight is directly proportional to the strength of the associated input neuron in deciding the output. Further, bias can be considered as the intercept in a linear equation.
Activation Function:
These are the final and important components that help to determine whether the neuron will
fire or not. Activation Function can be considered primarily as a step function.
Sign function,
Step function, and
Sigmoid function
The data scientist uses the activation function to take a subjective decision based on various
problem statements and forms the desired outputs. Activation function may differ (e.g., Sign,
Step, and Sigmoid) in perceptron models by checking whether the learning process is slow or
has vanishing or exploding gradients.
Step-1
In the first step, multiply all input values with their corresponding weight values and add them to determine the weighted sum:
∑ wi * xi = x1 * w1 + x2 * w2 + ... + xn * wn
Add a special term called bias 'b' to this weighted sum to improve the model's performance:
∑ wi * xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
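The two steps can be written out directly; the weights, bias and input below are assumptions chosen for illustration:

import numpy as np

def perceptron_output(x, w, b):
    weighted_sum = np.dot(w, x) + b       # Step 1: sum(w_i * x_i) + b
    return 1 if weighted_sum > 0 else 0   # Step 2: step (hard-limit) activation

x = np.array([1.0, 0.0])   # input values
w = np.array([0.6, 0.6])   # weights
b = -0.5                   # bias

print(perceptron_output(x, w, b))   # fires (1), since 0.6 - 0.5 > 0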
Based on the layers, Perceptron models are divided into two types. These are as follows:
Single Layer Perceptron Model
This is one of the easiest types of artificial neural network (ANN). A single-layer perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model. Its main objective is to analyze linearly separable objects with binary outcomes.
In a single-layer perceptron model, the algorithm has no prior recorded data, so it begins with randomly allocated values for the weight parameters. It then sums all the weighted inputs; if the total sum exceeds a pre-determined value, the model is activated and shows the output value as +1.
If the outcome is the same as the pre-determined or threshold value, the performance of this model is stated as satisfied, and the weights do not change. However, this model has a few discrepancies, triggered when multiple weighted input values are fed into the model. Hence, to obtain the desired output and minimize errors, some changes to the weights may be necessary.
Multi-Layer Perceptron Model
Like a single-layer perceptron model, a multi-layer perceptron model has the same structure but with a greater number of hidden layers. It executes in two stages:
Forward Stage: Activation functions start from the input layer in the forward stage
and terminate on the output layer.
Backward Stage: In the backward stage, weight and bias values are modified as per the model's requirement. In this stage, the error between the actual and the demanded output is propagated backwards, starting at the output layer and ending at the input layer.
Hence, a multi-layered perceptron model is considered as multiple artificial neural networks having various layers, in which the activation function does not remain linear, unlike in a single-layer perceptron model. Instead of linear, the activation function can be sigmoid, TanH, ReLU, etc., for deployment.
A multi-layer perceptron model has greater processing power and can process linear and non-
linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND,
NOT, XNOR, NOR.
Perceptron Function
The Perceptron function f(x) is achieved as output by multiplying the input x with the learned weight coefficient w:
f(x) = 1 if w · x + b > 0
f(x) = 0 otherwise
Characteristics of Perceptron
The output of a perceptron can only be a binary number (0 or 1) due to the hard limit
transfer function.
Perceptron can only be used to classify the linearly separable sets of input vectors. If
input vectors are non-linear, it is not easy to classify them properly.
Activation Functions
Without an activation function, a neural network becomes a linear regression model. Introducing an activation function lets the network perform a non-linear transformation of the input, making it suitable for problems like image classification, sentence prediction, or language translation.
A neuron basically computes a weighted average of its inputs, and this sum is passed through an activation function to get an output:
Y = ∑ (weights * input) + bias
Here Y can be anything for a neuron, in the range -infinity to +infinity. So we have to bound the output to get the desired prediction or generalized results.
Without an activation function, weights and bias would only perform a linear transformation, and the neural network would just be a linear regression model. A linear equation is a polynomial of degree one only, which is simple to solve but limited in its ability to solve complex problems or model higher-degree relationships.
In contrast, the addition of an activation function lets the neural network execute non-linear transformations of the input, making it capable of solving complex problems such as language translation and image classification.
Linear Function
Equation :- f(x) = x
If all layers are linear in nature, the ultimate activation function of the last layer is nothing more than a linear function of the input to the first layer, regardless of how many layers we have. Its range is -inf to +inf.
Uses: The linear activation function is applied only at the output layer.
If we differentiate a linear function to add non-linearity, the result will no longer depend on the input "x": the function becomes constant, and our algorithm won't exhibit any novel behaviour.
A good example of a regression problem is determining the cost of a house. We can use
linear activation at the output layer since the price of a house may have any huge or little
value. The neural network's hidden layers must perform some sort of non-linear function
even in this circumstance.
It doesn’t help with the complexity or various parameters of usual data that is fed to the
neural networks.
Sigmoid Function
Equation :- A(x) = 1 / (1 + e^(-x))
It is non-linear in nature. Observe that while X values range from -2 to 2, the Y values are fairly steep: small changes in x cause significant shifts in the value of Y. Its value range is 0 to 1.
Uses: The sigmoid function is typically employed in the output nodes of a classification where the result may only be either 0 or 1. Since the value of the sigmoid function ranges only from 0 to 1, the result can easily be anticipated to be 1 if the value is more than 0.5 and 0 if it is not.
Tanh Function
The activation that almost always works better than the sigmoid function is the Tanh function, also known as the tangent hyperbolic function. It is actually a mathematically shifted version of the sigmoid function; both are similar and can be derived from each other.
Equation :- f(x) = tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Value Range :- -1 to +1
Nature :- non-linear
Uses :- Usually used in hidden layers of a neural network as it’s values lies between -
1 to 1 hence the mean for the hidden layer comes out be 0 or very close to it, hence
helps in centering the data by bringing mean close to 0. This makes learning for the
next layer much easier.
ReLU Function
Currently, ReLU is the most widely employed activation function globally, since practically all convolutional neural networks and deep learning systems use it.
However, all negative values instantly become zero, which can reduce the model's capacity to fit or learn from the data effectively: any negative input to a ReLU activation function immediately becomes zero, so negative values are not mapped appropriately.
It Stands for Rectified linear unit. It is the most widely used activation function.
Chiefly implemented in hidden layers of Neural network.
Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons being activated by the ReLU function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid because it involves simpler mathematical operations. At a time, only a few neurons are activated, making the network sparse and therefore efficient and easy to compute.
In simple words, RELU learns much faster than sigmoid and Tanh function.
Softmax Function
The softmax function is used frequently when handling multiple classes; it is typically present in the output nodes of image classification problems. The softmax function squeezes the output for each class between 0 and 1 and divides by the sum of the outputs.
The output unit of the classifier, where we are actually attempting to obtain the probabilities
to determine the class of each input, is where the softmax function is best applied.
The usual rule of thumb, if we are unsure which activation function to apply, is to use ReLU in the hidden layers, as it is employed in the majority of cases these days.
A very logical choice for the output layer is the sigmoid function if your input is for binary
classification. If our output involves multiple classes, Softmax can be quite helpful in
predicting the odds for each class.
The softmax function is also a type of sigmoid function but is handy when we are trying to
handle multi- class classification problems.
Nature :- non-linear
Uses :- Usually used when handling multiple classes; the softmax function is commonly found in the output layer of image classification problems.
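The four activation functions discussed above can be sketched in a few lines of NumPy:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))       # range (0, 1)

def tanh(x):
    return np.tanh(x)                 # range (-1, 1), zero-centered

def relu(x):
    return np.maximum(0, x)           # range [0, inf); negatives become 0

def softmax(x):
    e = np.exp(x - np.max(x))         # shift for numerical stability
    return e / e.sum()                # outputs are probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")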
Artificial Neural Networks (ANN)
Artificial Neural Networks (ANN) are algorithms based on brain function and are used to model complicated patterns and forecast issues. The Artificial Neural Network (ANN) is a deep learning method that arose from the concept of the human brain's biological neural networks.
The development of ANN was the result of an attempt to replicate the workings of the human
brain. The workings of ANN are extremely similar to those of biological neural networks,
although they are not identical. ANN algorithm accepts only numeric and structured data.
1. There are three layers in the network architecture: the input layer, the hidden layer (of which there can be more than one), and the output layer. Because of the numerous layers, such networks are sometimes referred to as MLPs (Multi-Layer Perceptrons).
This model captures the presence of non-linear relationships between the inputs.
It contributes to the conversion of the input into a more usable output.
The core component of ANNs is artificial neurons. Each neuron receives inputs from several
other neurons, multiplies them by assigned weights, adds them and passes the sum to one or
more neurons. Some artificial neurons might apply an activation function to the output before
passing it to the next variable.
Artificial neural networks are composed of an input layer, which receives data from outside
sources (data files, images, hardware sensors, microphone…), one or more hidden layers that
process the data, and an output layer that provides one or more data points based on the function
of the network. For instance, a neural network that detects persons, cars and animals will have
an output layer with three nodes. A network that classifies bank transactions between safe and
fraudulent will have a single output.
Artificial neural networks start by assigning random values to the weights of the connections
between neurons. The key for the ANN to perform its task correctly and accurately is to adjust
these weights to the right numbers. But finding the right weights is not very easy, especially
when you’re dealing with multiple layers and thousands of neurons.
This calibration is done by “training” the network with annotated examples. For instance, if
you want to train the image classifier mentioned above, you provide it with multiple photos,
each labeled with its corresponding class (person, car or animal). As you provide it with more
and more training examples, the neural network gradually adjusts its weights to map each input
to the correct outputs.
Basically, what happens during training is that the network adjusts itself to glean specific patterns from the data. Again, in the case of an image classifier network, when you train the AI model
with quality examples, each layer detects a specific class of features. For instance, the first
layer might detect horizontal and vertical edges, the next layers might detect corners and round
shapes. Further down the network, deeper layers will start to pick out more advanced features
such as faces and objects.
One of the challenges of training neural networks is to find the right amount and quality of
training examples. Also, training large AI models requires vast amounts of computing
resources. To overcome this challenge, many engineers use “transfer learning,” a training
technique where you take a pre-trained model and fine-tune it with new, domain-specific
examples. Transfer learning is especially efficient when there’s already an AI model that is
close to your use case.
Backpropagation
Backpropagation is an algorithm that propagates the errors from the output nodes back to the input nodes; therefore, it is simply referred to as the backward propagation of errors. It is used in vast applications of neural networks in data mining, such as character recognition and signature verification.
The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.
Consider the following backpropagation training algorithm:
Training Algorithm:
Step 4: Each input unit receives the input signal xi and transmits it to all the units in the hidden layer.
Step 5: Each hidden unit zj (j = 1 to p) sums its weighted input signals to calculate its net input:
zinj = v0j + Σi xi vij
Applying the activation function, zj = f(zinj), it sends this signal to all units in the layer above, i.e. the output units.
Each output unit yk (k = 1 to m) sums its weighted input signals:
yink = w0k + Σj zj wjk
and applies its activation function to compute the output: yk = f(yink).
Backpropagation of Error:
Step 6: Each output unit yk (k = 1 to m) receives a target pattern tk corresponding to the input pattern, and its error term is calculated as:
δk = (tk - yk) f'(yink)
The weight and bias correction terms are Δwjk = α δk zj and Δw0k = α δk.
Step 7: Each hidden unit zj (j = 1 to p) sums its delta inputs from the units in the layer above:
δinj = Σk δk wjk
Its error term is calculated as δj = δinj f'(zinj), and the weight and bias correction terms are Δvij = α δj xi and Δv0j = α δj.
Step 8: Each output unit updates its weights and bias, wjk(new) = wjk(old) + Δwjk, and each hidden unit updates its weights and bias, vij(new) = vij(old) + Δvij.
Step 9: Test the stopping condition. The stopping condition can be the minimization of error or a fixed number of epochs.
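A minimal NumPy sketch of the algorithm above: one hidden layer with sigmoid activations, trained on XOR. The network size, learning rate and iteration count are assumptions for illustration:

import numpy as np

def f(x):                 # sigmoid activation
    return 1 / (1 + np.exp(-x))

def f_prime(out):         # its derivative, expressed via the activation output
    return out * (1 - out)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
V, b_v = rng.normal(size=(2, 4)), np.zeros(4)   # input-to-hidden weights v_ij
W, b_w = rng.normal(size=(4, 1)), np.zeros(1)   # hidden-to-output weights w_jk
alpha = 0.5

for _ in range(10000):
    z = f(X @ V + b_v)                       # z_j = f(z_in_j)
    y = f(z @ W + b_w)                       # y_k = f(y_in_k)
    delta_k = (t - y) * f_prime(y)           # output error term
    delta_j = (delta_k @ W.T) * f_prime(z)   # hidden error term
    W += alpha * z.T @ delta_k               # delta_w_jk = alpha * delta_k * z_j
    b_w += alpha * delta_k.sum(axis=0)
    V += alpha * X.T @ delta_j               # delta_v_ij = alpha * delta_j * x_i
    b_v += alpha * delta_j.sum(axis=0)

print(y.round(2))   # should approach [0, 1, 1, 0]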
Types of Backpropagation
Advantages:
Disadvantages:
It is sensitive to noisy data and irregularities. Noisy data can lead to inaccurate results.
Performance is highly dependent on input data.
Spending too much time training.
The matrix-based approach is preferred over a mini-batch.
Precision
The ratio of correctly predicted positive observations to all predicted positives is known as precision: Precision = TP / (TP + FP).
It gauges how well the model forecasts the positive outcomes.
Recall
The ratio of correctly predicted positive observations to the total number of actual positive observations is known as recall: Recall = TP / (TP + FN).
It gauges how well the model can capture each pertinent instance.
Accuracy: One of the more obvious metrics, it is the measure of all the correctly identified cases: Accuracy = (TP + TN) / (TP + TN + FP + FN). It is most used when all the classes are equally important.
F1-score: This is the harmonic mean of Precision and Recall, F1 = 2 * (Precision * Recall) / (Precision + Recall), and gives a better measure of the incorrectly classified cases than the accuracy metric.
The Receiver Operating Characteristic (ROC) Curve is a graphical representation used to evaluate
the performance of a binary classification model. It plots the trade-off between sensitivity (True
Positive Rate) and 1-specificity (False Positive Rate) at different threshold settings.
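All of these metrics are available in scikit-learn; the labels and scores below are assumed for illustration:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3, 0.95, 0.2]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))     # area under the ROC curve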
UNIT-IV
Model Validation in Classification
CROSS VALIDATION
During the evaluation of machine learning (ML) models, the following questions might arise:
• Is this model the best one available from the hypothesis space of the algorithm in terms of generalization error on an unknown/future data set?
• Which training and testing techniques should be used for the model?
Consider training a model using an algorithm on a given dataset. Using the same training
data, you determine that the trained model has an accuracy of 95% or even 100%. What
does this mean? Can this model be used for prediction?
No. This is because your model has been trained on the given data: it knows the data and has generalized over it very well. In contrast, when you try to predict on a new set of data, the model has never seen it before and thus cannot generalize well over it. To deal with such problems, hold-out methods can be employed.
The hold-out method involves splitting the data into multiple parts, using one part for training the model and the rest for validating and testing it. It can be used for both model evaluation and model selection.
Hold-out method for model evaluation
Model evaluation using the hold-out method entails splitting the dataset into training and test datasets, evaluating model performance, and determining the most optimal model.
The dataset is split into two parts: one split is held aside as a training set, and the other is used as a test set. The split percentage is determined based on the amount of training data available. A typical split of 70-30% is used, in which 70% of the dataset is used for training and 30% is used for testing the model.
The objective of this technique is to select the best model based on its accuracy on the testing
dataset and compare it with other models. There is, however, the possibility that the model can
be well fitted to the test data using this technique. In other words, models are trained to improve
model accuracy on test datasets based on the assumption that the test dataset represents the
population. As a result, the test error becomes an optimistic estimation of the generalization
error. Obviously, this is not what we want. Since the final model is trained to fit well (or overfit) the test data, it won't generalize well to unknown or future datasets.
Follow the steps below for using the hold-out method for model evaluation:
1. Split the dataset in two (preferably 70-30%; however, the split percentage can vary depending on the amount of data available).
2. Now train the model on the training dataset, selecting some fixed set of hyperparameters.
3. Evaluate the model's performance on the test dataset.
4. Use the entire dataset to train the final model so that it can generalize better on future datasets.
In this process, the dataset is split into training and test sets, and a fixed set of hyperparameters
is used to evaluate the model. There is another process in which data can also be split into three
sets, and these sets can be used to select a model or to tune hyperparameters. We will discuss
that technique next.
Sometimes the model selection process is referred to as hyperparameter tuning. During the
hold-out method of selecting a model, the dataset is separated into three sets — training,
validation, and test.
Follow the steps below for using the hold-out method for model selection:
1. Divide the dataset into three parts: training dataset, validation dataset, and test dataset.
2. Now, different machine learning algorithms can be used to train different models. You can train your classification model, for example, using logistic regression, random forest, and XGBoost.
3. Tune the hyperparameters for models trained with different algorithms. Change the hyperparameter settings for each algorithm mentioned in step 2 and come up with multiple models.
4. On the validation dataset, test the performance of each of these models (associating with
each of the algorithms).
5. Choose the most optimal model from those tested on the validation dataset. The most
optimal model will be set up with the most optimal hyperparameters. Using the example
above, let’s suppose the model trained with XGBoost with the most optimal
hyperparameters is selected.
6. Finally, on the test dataset, test the performance of the most optimal model.
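A sketch of this three-way hold-out procedure; the dataset, models, and split ratios are assumptions for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
# Carve out the test set first, then split the rest into train / validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

models = {"logreg": LogisticRegression(max_iter=5000),
          "forest": RandomForestClassifier(random_state=0)}
scores = {name: m.fit(X_train, y_train).score(X_val, y_val) for name, m in models.items()}
best = max(scores, key=scores.get)                # select on the validation set
print(scores)
print(best, models[best].score(X_test, y_test))   # final check on the test set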
K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K groups of samples of equal size, called folds. For each learning set, the prediction function uses K-1 folds for training, and the remaining fold is used for testing.
Let's take an example of 5-fold cross-validation, where the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train it. In the 2nd iteration, the second fold is used to test the model, and the rest are used to train it. This process continues until each fold has been used as the test fold.
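A minimal sketch of 5-fold cross-validation with scikit-learn (Iris as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)          # one accuracy score per fold
print(scores.mean())   # averaged estimate of generalization performance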
Stratified K-Fold
This technique is similar to K-fold cross-validation with a few small changes. It works on the stratification concept: the data is rearranged to ensure that each fold or group is a good representative of the complete dataset, preserving the class proportions. It is one of the best approaches for dealing with bias and variance.
Leave-One-Out Cross Validation (LOOCV)
In this approach, each iteration uses a single data point for testing and all remaining points for training, repeated once for every point in the dataset.
o In this approach, the bias is minimal, as all the data points are used.
o The process is executed n times; hence the execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we iteratively check against a single data point.
It is important to understand prediction errors (bias and variance) when it comes to accuracy in any machine learning algorithm. There is a tradeoff between a model's ability to minimize bias and its ability to minimize variance; balancing the two also guides the choice of the regularization constant. A proper understanding of these errors helps avoid overfitting and underfitting of a dataset while training the algorithm.
Bias
Bias is the difference between the predictions of the ML model and the correct values. High bias gives a large error on both training and testing data. It is recommended that an algorithm always be low-bias, to avoid the problem of underfitting. With high bias, the predictions follow a straight-line format and do not fit the data in the dataset accurately. Such fitting is known as underfitting of the data. This happens when the hypothesis is too simple or linear in nature.
Variance
The variability of the model's predictions for a given data point, which tells us the spread of our data, is called the variance of the model. A model with high variance fits the training data very closely and thus is not able to fit accurately on data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data. When a model has high variance, it is said to be overfitting the data. Overfitting means fitting the training set accurately via a complex curve and a high-order hypothesis, but it is not the solution, as the error on unseen data is high. While training a model, variance should be kept low.
This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm
can’t be more complex and less complex at the same time. For the graph, the perfect tradeoff
will be like.
This is referred to as the best point chosen for the training of the algorithm which gives low
error in training as well as testing data.
Regularization :
Regularization is one of the most important concepts of machine learning. It is a technique to
prevent the model from overfitting by adding extra information to it. Sometimes the machine
learning model performs well with the training data but does not perform well with the test
data. It means the model is not able to predict the output when dealing with unseen data, because noise has also been learned into the output; hence the model is called overfitted. This problem can be dealt with using a regularization technique.
This technique can be used in such a way that it will allow to maintain all variables or features
in the model by reducing the magnitude of the variables. Hence, it maintains accuracy as well
as a generalization of the model.
Regularization mainly works by regularizing or reducing the coefficients of features toward zero. In simple words, "in the regularization technique, we reduce the magnitude of the features by keeping the same number of features."
Consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ... + βnxn + b
Here β0, β1, ..., βn are the weights or magnitudes attached to the features, β0 represents the bias of the model, and b represents the intercept.
Linear regression models try to optimize β0 and b to minimize the cost function. The loss function for linear regression is called RSS, the Residual Sum of Squares:
RSS = Σ (yi - ŷi)²   (summed over the training samples)
We now add a penalty to this loss and optimize the parameters so that the model predicts accurate values of Y while staying simple.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount of
bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the complexity
of the model. It is also called as L2 regularization.
o In this technique, the cost function is altered by adding a penalty term to it. The amount of bias added to the model is called the Ridge Regression penalty; it is calculated by multiplying lambda by the squared weight of each individual feature:
Cost = Σ (yi - ŷi)² + λ Σ βj²
o In the above equation, the penalty term regularizes the coefficients of the model; hence ridge regression reduces the amplitudes of the coefficients, which decreases the complexity of the model.
o As we can see from the above equation, if the value of λ tends to zero, the equation becomes the cost function of the linear regression model. Hence, for a very small value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity between
the independent variables, so to solve such problems, Ridge regression can be used.
Lasso Regression:
o Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso regression is:
Cost = Σ (yi - ŷi)² + λ Σ |βj|
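A sketch comparing the two penalties on synthetic data (assumed for illustration); note how Lasso drives some coefficients to exactly zero while Ridge only shrinks them:

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=6, n_informative=2,
                       noise=5.0, random_state=0)
print(Ridge(alpha=1.0).fit(X, y).coef_)   # small but non-zero coefficients
print(Lasso(alpha=1.0).fit(X, y).coef_)   # several coefficients exactly 0.0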
Overfitting:
Overfitting occurs when a model learns the training data too well, including its noise. One common cause is that the data used for training is not cleaned and contains noise (garbage values).
Underfitting:
When a model has not learned the patterns in the training data well and is unable to generalize
well on the new data, it is known as underfitting. An underfit model has poor performance on
the training data and will result in unreliable predictions. Underfitting occurs due to high bias
and low variance.
Suppose you want to buy a new car. You would likely browse a few web portals where people have posted their reviews and compare different car models, checking their features and prices. You will also probably ask your friends and colleagues for their opinion. In short, you wouldn't directly reach a conclusion, but would instead make a decision considering the opinions of other people as well.
Ensemble models in machine learning operate on a similar idea. They combine the decisions
from multiple models to improve the overall performance.
Statistical Problem –
Bagging:
BAGGing, or Bootstrap AGGregating, gets its name because it combines Bootstrapping and Aggregation to form one ensemble model. Given a sample of data, multiple bootstrapped subsamples are pulled. A decision tree is formed on each of the bootstrapped subsamples. After each subsample decision tree has been formed, an algorithm is used to aggregate over the decision trees to form the most efficient predictor.
Boosting:
Boosting trains weak learners sequentially, with each learner focusing on the mistakes of the
previous one:
1. A subset is created from the original dataset, and a weak learner (base model) is
trained on it.
2. The model is tested on the whole dataset.
3. Errors are identified by comparing the predictions with the actual values.
4. Each data point with the wrong prediction is sent into the second subset of data, and
this subset is given more weight.
5. Using this updated subset, we train and test the second weak learner.
6. We continue with the following subsets until the total number of subsets is reached.
7. We now have the total prediction: the overall prediction is obtained by aggregating
the predictions of all the weak learners.
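As a hedged sketch of these ideas, the snippet below compares a single decision tree with
bagging and boosting ensembles in scikit-learn; the synthetic dataset and the hyperparameters
are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    "Single tree": DecisionTreeClassifier(random_state=42),
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Boosting (AdaBoost)": AdaBoostClassifier(n_estimators=50, random_state=42),
}

# Ensembles typically score higher than the single tree on average
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")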
Unsupervised Learning:
Unsupervised learning is a machine learning technique in which models are not supervised
using a labeled training dataset. Instead, the model itself finds the hidden patterns and insights
from the given data. It can be compared to the learning which takes place in the human brain
while learning new things. It can be defined as:
"Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision."
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on their own. Unsupervised
learning algorithm will perform this task by clustering the image dataset into the groups
according to similarities between images.
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is very similar to how a human learns to think by their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning all the more important.
o In the real world, we do not always have input data with the corresponding output, so
to solve such cases, we need unsupervised learning.
Once the suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have few or no similarities with the
objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of those commonalities.
K-Means Clustering
K-Means clustering partitions the data into K clusters, each represented by its center (the
mean of the points assigned to it). The algorithm proceeds as follows:
Step-01: Choose the number of clusters K.
Step-02: Randomly select any K data points as the initial cluster centers.
Step-03: Calculate the distance between each data point and each cluster center.
Step-04: Assign each data point to the cluster whose center is nearest to it.
Step-05: Re-compute each cluster center as the mean of the data points assigned to it.
Step-06:
Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping
criteria is met-
o The centers of the newly formed clusters do not change,
o The data points remain in the same clusters, or
o The maximum number of iterations is reached.
Problem:
Cluster the following eight points (with (x, y) representing locations) into three clusters:
A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9)
Initial cluster centers are: A1(2, 10), A4(5, 8) and A7(1, 2).
The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as-
Ρ(a, b) = |x2 – x1| + |y2 – y1|
Use K-Means Algorithm to find the three cluster centers after the second iteration.
Solution-
We follow the above discussed K-Means Clustering Algorithm-
Iteration-01:
We calculate the distance of each point from each of the three cluster centers, using the given
distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and
each of the three cluster centers-
Calculating Distance Between A1(2, 10) and C1(2, 10)-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
=0
Calculating Distance Between A1(2, 10) and C2(5, 8)-
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |5 – 2| + |8 – 10|
= 3 + 2
= 5
Calculating Distance Between A1(2, 10) and C3(1, 2)-
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1 – 2| + |2 – 10|
= 1 + 8
= 9
In a similar manner, we calculate the distance of each of the other points from each of the
three cluster centers.
Next, we compare the computed distances and assign each point to the cluster whose center is
nearest to it.
Cluster-01:
First cluster contains points-
A1(2, 10)
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
A8(4, 9)
Cluster-03:
Third cluster contains points-
A2(2, 5)
A7(1, 2)
Now, we re-compute the new cluster centers as the mean of the points in each cluster.
For Cluster-01:
We have only one point A1(2, 10) in Cluster-01.
So, cluster center remains the same.
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5)
= (6, 6)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Iteration-02:
We calculate the distance of each point from each of the three cluster centers, using the given
distance function.
The following illustration shows the calculation of the distance between point A1(2, 10) and
each of the three cluster centers-
Ρ(A1, C1)
= |x2 – x1| + |y2 – y1|
= |2 – 2| + |10 – 10|
= 0
Ρ(A1, C2)
= |x2 – x1| + |y2 – y1|
= |6 – 2| + |6 – 10|
=4+4
=8
Ρ(A1, C3)
= |x2 – x1| + |y2 – y1|
= |1.5 – 2| + |3.5 – 10|
= 0.5 + 6.5
=7
In a similar manner, we calculate the distance of each of the other points from each of the
three cluster centers.
Next, we draw a table showing all the results and use it to decide which point belongs to
which cluster: the given point belongs to that cluster whose center is nearest to it.
Cluster-01:
First cluster contains points-
A1(2, 10)
A8(4, 9)
Cluster-02:
Second cluster contains points-
A3(8, 4)
A4(5, 8)
A5(7, 5)
A6(6, 4)
Cluster-03:
Third cluster contains points-
A2(2, 5)
A7(1, 2)
Now, we re-compute the new cluster centers as the mean of the points in each cluster.
For Cluster-01:
Center of Cluster-01
= ((2 + 4)/2, (10 + 9)/2)
= (3, 9.5)
For Cluster-02:
Center of Cluster-02
= ((8 + 5 + 7 + 6)/4, (4 + 8 + 5 + 4)/4)
= (6.5, 5.25)
For Cluster-03:
Center of Cluster-03
= ((2 + 1)/2, (5 + 2)/2)
= (1.5, 3.5)
Thus, after the second iteration, the three cluster centers are C1(3, 9.5), C2(6.5, 5.25) and
C3(1.5, 3.5).
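The hand computation above can be checked with a short from-scratch Python sketch of the
K-Means loop, using the Manhattan distance given in the problem statement (a minimal
sketch; library implementations such as scikit-learn's KMeans use the Euclidean distance
instead):

import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8],
                   [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = np.array([[2, 10], [5, 8], [1, 2]], dtype=float)  # A1, A4, A7

for iteration in range(2):
    # Step-03/04: assign each point to the nearest center (Manhattan distance)
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Step-05: re-compute each center as the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"Centers after iteration {iteration + 1}:\n{centers}")

# After the second iteration this prints (3, 9.5), (6.5, 5.25), (1.5, 3.5),
# matching the hand computation above.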
K-MODES CLUSTERING
K-Modes is a clustering algorithm used to group similar data points into clusters based on
their categorical attributes. Unlike traditional clustering algorithms that use numerical
distance metrics, K-Modes works by identifying the modes, i.e., the most frequent values
within each cluster, to represent each cluster.
K-Means can be applied to categorical data after converting it into a numerical form, but this
does not give good results, especially for high-dimensional data. So, some changes are made
to handle categorical data directly, which gives us K-Modes.
Similarity and dissimilarity measurements are used to determine the distance between the data
objects in the dataset. In the case of K-modes, these distances are calculated using a
dissimilarity measure called the Hamming distance. The Hamming distance between two data
objects is the number of categorical attributes that differ between the two objects.
For example, consider the following dataset with three categorical attributes (showing the
objects used in the distance calculations below):
Object    Attribute 1    Attribute 2    Attribute 3
1         A              B              C
3         A              C              E
4         B              C              E
To calculate the Hamming distance between objects 1 and 3, we compare their values for
each attribute and count the number of differences. In this case, there are two differences
(Attribute 2 is B for object 1 and C for object 3, and Attribute 3 is C for object 1 and E for
object 3), so the Hamming distance between objects 1 and 3 is 2.
To calculate the Hamming distance between objects 1 and 4, we again compare their values
for each attribute and count the number of differences. In this case, there are three differences
(Attribute 1 is A for object 1 and B for object 4, Attribute 2 is B for object 1 and C for object
4, and Attribute 3 is C for object 1 and E for object 4), so the Hamming distance between
objects 1 and 4 is 3.
Data objects with a smaller Hamming distance are considered more similar, while objects
with a larger Hamming distance are considered more dissimilar.
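Here is a minimal Python sketch of the Hamming distance between categorical objects, using
the example objects discussed above:

def hamming(a, b):
    """Number of attribute positions where the two objects differ."""
    return sum(x != y for x, y in zip(a, b))

obj1 = ("A", "B", "C")
obj3 = ("A", "C", "E")
obj4 = ("B", "C", "E")

print(hamming(obj1, obj3))  # 2 (attributes 2 and 3 differ)
print(hamming(obj1, obj4))  # 3 (all three attributes differ)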
Some variations of the K-modes algorithm may use different methods for updating the
centroids (modes) of the clusters, such as taking the weighted mode or the median of the
objects within each cluster.
Overall, the goal of K-modes clustering is to minimize the dissimilarities between the data
objects and the centroids (modes) of the clusters, using a measure of categorical similarity
such as the Hamming distance.
One of the conventional clustering methods commonly used and efficient for large data is the
K-Means algorithm. However, it is not suitable for data that contains categorical variables,
because its cost function uses the Euclidean distance, which is only meaningful for numerical
data. K-Modes, on the other hand, is suitable only for categorical data, not for mixed data
types.
Facing these problems, Huang proposed an algorithm called K-Prototype, created in order to
handle clustering with mixed data types (numerical and categorical variables). K-Prototype is
a partitioning-based clustering method; its algorithm is an improvement of the K-Means and
K-Modes clustering algorithms that handles clustering with mixed data types. K-Prototype
has the advantage that it is not too complex, is able to handle large data, and scales better than
hierarchical-based algorithms.
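As a sketch of how K-Prototype is used in practice, the snippet below assumes the third-party
kmodes package (installable with pip install kmodes); the small mixed-type dataset is made
up purely for illustration:

import numpy as np
from kmodes.kprototypes import KPrototypes

# Mixed data: column 0 is numerical, columns 1-2 are categorical
X = np.array([
    [25, "single", "city"],
    [27, "single", "city"],
    [52, "married", "rural"],
    [49, "married", "rural"],
    [31, "single", "city"],
], dtype=object)

kproto = KPrototypes(n_clusters=2, init="Cao", random_state=0)
# The 'categorical' argument lists the indices of the categorical columns
clusters = kproto.fit_predict(X, categorical=[1, 2])
print(clusters)                   # cluster label for each object
print(kproto.cluster_centroids_)  # mixed numeric/categorical prototypes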
Gaussian Mixture Models
A Gaussian mixture model assumes that the data is generated from a mixture of several
Gaussian distributions. For a univariate Gaussian, the probability density function is
f(x | μ, σ²) = (1 / √(2πσ²)) · exp( −(x − μ)² / (2σ²) ),
where μ and σ² are respectively the mean and variance of the distribution. For a multivariate
(let us say d-variate) Gaussian distribution, the probability density function is given by
f(x | μ, Σ) = (1 / ((2π)^(d/2) · |Σ|^(1/2))) · exp( −(1/2) (x − μ)ᵀ Σ⁻¹ (x − μ) ),
where μ is the d-dimensional mean vector and Σ is the d × d covariance matrix.
Suppose there are K clusters (for the sake of simplicity, it is assumed here that the number of
clusters is known and it is K). So the parameters μk and Σk are also to be estimated for each k.
Had it been only one distribution, they would have been estimated by the maximum-likelihood
method. But since there are K such clusters and the probability density is defined as a linear
function of the densities of all these K distributions, i.e.
p(x) = π1 f(x | μ1, Σ1) + π2 f(x | μ2, Σ2) + ⋯ + πK f(x | μK, ΣK),
where πk is the mixing coefficient of the k-th distribution (and the πk sum to 1), the parameters
cannot be estimated in closed form. This is where the Expectation-Maximization algorithm is
beneficial.
These are the two basic steps of the EM algorithm, namely the E Step (Expectation or
Estimation Step) and the M Step (Maximization Step).
Estimation step: initialize μk, Σk and πk with random values (or with the output of another
algorithm such as K-Means); then, for every data point, compute the posterior probability
(responsibility) that it belongs to each of the K clusters, using the current parameter values.
Maximization step: re-estimate μk, Σk and πk using the responsibilities computed in the E step,
so as to increase the expected log-likelihood. The two steps are repeated until the log-likelihood
converges.
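A brief scikit-learn sketch of fitting a Gaussian mixture with EM follows; the two-cluster
synthetic dataset is illustrative, and K = 2 is assumed known, as in the discussion above:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two hypothetical Gaussian clusters in 2-D
X = np.vstack([rng.normal(loc=[0, 0], scale=0.7, size=(150, 2)),
               rng.normal(loc=[4, 4], scale=1.0, size=(150, 2))])

gmm = GaussianMixture(n_components=2, random_state=0)  # fitted with EM
gmm.fit(X)

print(gmm.means_)    # estimated cluster means (mu_k)
print(gmm.weights_)  # estimated mixing coefficients (pi_k)
labels = gmm.predict(X)            # hard cluster assignments
probs = gmm.predict_proba(X[:3])   # E-step responsibilities for 3 points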
Reinforcement Learning
o Reinforcement learning does not require any labeled data for the learning process. It
learns through the feedback of the actions performed by the agent. Moreover, in
reinforcement learning, the agent also learns from past experience.
o Reinforcement learning aims to get the maximum positive feedback so that the agent
can improve its performance.
o Reinforcement learning involves a cycle of taking an action, moving to a changed (or
unchanged) state, and getting feedback; based on these interactions, the agent learns
and explores the environment.
Exploration and Exploitation Trade-offs
Before going into a detailed description of exploration and exploitation in machine learning,
let's first understand these terms in simple words. In reinforcement learning, whenever an
agent faces a difficult choice between continuing with what has worked so far and trying
something new at a specific time, it is in the Exploration-Exploitation Dilemma; this dilemma
arises because the agent's knowledge of the states, actions, rewards and resulting states is
always partial.
Exploitation is defined as a greedy approach in which the agent tries to get more rewards by
using the estimated values rather than the actual (unknown) values. In this technique, the agent
makes the best decision based on its current information.
Exploration, in contrast, is a searching approach in which the agent tries new actions to gather
more information about the environment, even though they may not yield the best immediate
reward.
Let's understand exploitation and exploration with some interesting real-world examples.
Let's suppose persons A and B are digging in a coal mine in the hope of finding a diamond.
Person B succeeds in finding a diamond before person A and walks off happily. Seeing this,
person A gets a bit greedy and thinks he too might find a diamond by digging at the same place
where person B was digging. This action performed by person A is called a greedy action, and
this policy is known as a greedy policy. But person A did not know that an even bigger diamond
was buried in the place where he was initially digging, so this greedy policy would fail in this
situation.
In this example, person A only had knowledge of the place where person B was digging, but
no knowledge of what lies deeper there. In the actual scenario, the diamond could be buried in
the same place where he was digging initially, or in some completely different place. Hence,
with only partial knowledge about rewards, our reinforcement learning agent will be in a
dilemma on whether to exploit its partial knowledge to receive some reward or to explore
unknown actions that could result in a much larger reward.
Both techniques cannot be applied simultaneously to the same action, but this issue can be
resolved by using an Epsilon Greedy Policy (explained below).
Here are a few other examples of exploitation and exploration in machine learning:
Example 1: Let's say we have a scenario of online restaurant selection for ordering food, where
you have two options. The first option is to choose your favorite restaurant, from where you
ordered food in the past; this is exploitation, because you only use the information you already
have about a specific restaurant. The second option is to try a new restaurant, to explore new
varieties and tastes of food; this is exploration. The food quality might be better with the first
option, but it is also possible that the food is even more delicious at the new restaurant.
Example 2: Suppose there is a game-playing platform where you can play chess against robots.
To win, you have two choices: either play the move that you believe is best, or play an
experimental move. Playing the move you believe is best is exploitation, while the
experimental move is exploration; you may be playing the best-known move, but who knows,
the new move might turn out to be even more strategic for winning the game.
Non-Associative Learning
In reinforcement learning, non-associative learning refers to a type of learning that does not
involve forming associations or relationships between different stimuli or actions. It is a
simpler form of learning compared to associative learning, which involves linking different
stimuli or actions together.
Examples
Common examples of non-associative learning are habituation, in which the response to a
repeated stimulus gradually decreases, and sensitization, in which the response to a repeated
(typically intense) stimulus increases.
Markov Decision Process
Reinforcement Learning is a type of Machine Learning. It allows machines and software
agents to automatically determine the ideal behavior within a specific context, in order to
maximize their performance. Simple reward feedback is required for the agent to learn its
behavior; this is known as the reinforcement signal.
There are many different algorithms that tackle this issue. As a matter of fact, Reinforcement
Learning is defined by a specific type of problem, and all its solutions are classed as
Reinforcement Learning algorithms. In the problem, an agent is supposed to decide the best
action to select based on its current state. When this step is repeated, the problem is known as
a Markov Decision Process.
A Markov Decision Process (MDP) model contains:
State:
A State S is a set of tokens that represent every state that the agent can be in.
Model:
A Model (sometimes called Transition Model) gives an action’s effect in a state. In
particular, T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’
takes us to state S’ (S and S’ may be the same). For stochastic actions (noisy, non-
deterministic) we also define a probability P(S’|S,a) which represents the probability of
reaching a state S’ if action ‘a’ is taken in state S. Note that the Markov property states that
the effects of an action taken in a state depend only on that state and not on the prior history.
Actions
An Action A is the set of all possible actions. A(s) defines the set of actions that can be taken
in state S.
Reward
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
Policy
A Policy is a solution to the Markov Decision Process. A policy is a mapping from states to
actions; it indicates the action ‘a’ to be taken while in state S.
Example: consider a small grid world in which the agent starts from the START cell and must
reach the Diamond cell while avoiding the Fire cell. The agent can take any one of these
actions: UP, DOWN, LEFT, RIGHT.
Walls block the agent path, i.e., if there is a wall in the direction the agent would have taken,
the agent stays in the same place. So for example, if the agent says LEFT in the START grid
he would stay put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
RIGHT RIGHT UP UP RIGHT
UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion.
The move is now noisy. 80% of the time the intended action works correctly. 20% of the time
the action agent takes causes it to move at right angles. For example, if the agent says UP the
probability of going UP is 0.8 whereas the probability of going LEFT is 0.1, and the probability
of going RIGHT is 0.1 (since LEFT and RIGHT are right angles to UP).
There is a small reward at each step (it can be negative, in which case it can also be termed a
punishment; in the above example, entering the Fire cell has a reward of -1).
Big rewards come at the end (good or bad).
The goal is to maximize the sum of rewards.
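The noisy transition model above can be sketched in a few lines of Python; the 0.8/0.1/0.1
probabilities follow the example, while the function and dictionary names are hypothetical:

import random

PERPENDICULAR = {
    "UP": ("LEFT", "RIGHT"),
    "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def sample_actual_action(intended):
    """Return the action that actually happens, per P(S'|S, a)."""
    r = random.random()
    if r < 0.8:
        return intended                     # intended move works
    left, right = PERPENDICULAR[intended]
    return left if r < 0.9 else right       # 0.1 probability each

# Rough empirical check of the probabilities
samples = [sample_actual_action("UP") for _ in range(10_000)]
print({a: samples.count(a) / len(samples) for a in set(samples)})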
Reinforcement learning algorithms are commonly categorized as model-based vs. model-free,
value-based vs. policy-based, and on-policy vs. off-policy.
Model-based algorithms use the transition and reward functions to estimate the optimal policy
and create a model of the environment. In contrast, model-free algorithms learn the
consequences of their actions through experience, without using the transition and reward
functions.
The value-based method trains the value function to learn which state is more valuable and
take action. On the other hand, policy-based methods train the policy directly to learn which
action to take in a given state.
In the off-policy setting, the algorithm evaluates and updates a policy that differs from the
policy used to take the action. Conversely, an on-policy algorithm evaluates and improves the
same policy that is used to take the action.
Q-Learning
Before we jump into how Q-learning works, we need to learn a few useful terminologies to
understand Q-learning's fundamentals.
Rewards: for every action, the agent receives a reward or a penalty.
Episodes: an episode ends when the agent can't take a new action, i.e., when it has either
achieved the goal or failed.
Q(St+1, a): the expected optimal Q-value of taking action a in the next state.
Q-Table: the agent maintains a Q-table over the sets of states and actions.
Q-Table
The agent will use a Q-table to take the best possible action based on the expected reward for
each state in the environment. In simple words, a Q-table is a data structure of sets of actions
and states, and we use the Q-learning algorithm to update the values in the table.
Q-Function
The Q-function uses the Bellman equation and takes the state (s) and action (a) as input; it
simplifies the calculation of state values and state-action values. The resulting Q-learning
update rule is:
Q(St, At) ← Q(St, At) + α [ Rt+1 + γ · max_a Q(St+1, a) − Q(St, At) ],
where α is the learning rate and γ is the discount factor.
Initialize Q-Table
We will first initialize the Q-table. We will build the table with columns based on the number
of actions and rows based on the number of states.
In our example, the character can move up, down, left, and right, so we have four possible
actions and four states (start, idle, wrong path, and end); you can think of the wrong path as
falling into a hole. We will initialize the Q-table with all values set to 0.
Choose an Action
The second step is quite simple: at the start, the agent chooses a random action (down or right),
and on subsequent runs it uses the updated Q-table to select its action.
Perform an Action
Choosing an action and performing the action will repeat multiple times until the training loop
stops. The first action and state are selected using the Q-Table. In our case, all values of the Q-
Table are zero.
Then, the agent will move down and update the Q-Table using the Bellman equation. With
every move, we will be updating values in the Q-Table and also using it for determining the
best course of action.
Initially, the agent is in exploration mode and chooses a random action to explore the
environment. The Epsilon Greedy Strategy is a simple method to balance exploration and
exploitation: epsilon stands for the probability of choosing to explore, and the agent exploits
(picks the best-known action) the rest of the time.
At the start, the epsilon rate is higher, meaning the agent is in exploration mode. As it explores
the environment, epsilon decreases, and the agent starts to exploit instead. During exploration,
with every iteration, the agent becomes more confident in its estimated Q-values.
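Here is a minimal sketch of epsilon-greedy action selection over a Q-table; the table size, the
decay factor, and the floor value are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))

def choose_action(state, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore: random action
    return int(np.argmax(q_table[state]))     # exploit: best-known action

epsilon = 1.0                                  # start fully exploratory
for episode in range(1000):
    action = choose_action(state=0, epsilon=epsilon)
    # ... environment step and Q-update would go here ...
    epsilon = max(0.05, epsilon * 0.995)       # decay toward exploitation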
Measuring Reward
After taking the action, we measure the outcome and the reward:
o The reward for reaching the end goal is +1.
o The reward for taking the wrong path (falling into the hole) is 0.
o The reward for idling or moving on the frozen lake is 0.
Update Q-Table
We will update the function Q(St, At) using the update rule given above. It combines the
previous episode's estimated Q-value, the learning rate, and the Temporal Difference (TD)
error. The TD error is calculated from the immediate reward, the discounted maximum
expected future reward, and the former Q-value estimate.
The process is repeated multiple times until the Q-Table is updated and the Q-value function is
maximized.
In the case of a frozen lake, the agent will learn to take the shortest path to reach the goal and
avoid jumping into the holes.
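Putting the pieces together, below is a hedged end-to-end sketch of tabular Q-learning on the
frozen-lake task, assuming the gymnasium package and its FrozenLake-v1 environment; the
hyperparameters are illustrative, not tuned:

import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma, epsilon = 0.1, 0.95, 1.0  # learning rate, discount, exploration

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-update: TD error = r + gamma * max_a Q(s', a) - Q(s, a)
        td_error = reward + gamma * np.max(q[next_state]) - q[state, action]
        q[state, action] += alpha * td_error
        state = next_state
    epsilon = max(0.05, epsilon * 0.999)  # decay exploration over episodes

print(q.round(2))  # learned Q-table; the greedy policy is argmax per row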