ML Notes
UNIT – I
Introduction and mathematical foundations
The life cycle of a Machine Learning program is straightforward and can be summarized in the
following points:
1. Define a question
2. Collect data
3. Visualize data
4. Train algorithm
5. Test the Algorithm
6. Collect feedback
7. Refine the algorithm
8. Repeat steps 4-7 until the results are satisfactory
• Association mining: Identifying sets of items in a data set that frequently occur
together.
• Fraud detection: Identifying cases of fraud when you only have a few positive
examples.
• Labelling data: Algorithms trained on small data sets can learn to apply data
labels to larger sets automatically.
• Robotics: Robots can learn to perform tasks in the physical world using this
technique.
• Video gameplay: Reinforcement learning has been used to teach bots to play a
number of video games.
Some important applications in which machine learning is widely used are given below:
• Healthcare
• Automation
• Banking and Finance
• Transportation and Traffic prediction
• Image recognition
• Speech recognition
• Product recommendation
• Virtual personal assistance
• Email spam and Malware detection and Filtering
• Self-driving cars
• Credit card fraud detection
• Stock market trading
• Language translation
There are mathematical foundations that are important for Machine Learning. Six math
subjects form the foundation of machine learning. These subjects are intertwined in
developing our machine learning model and reaching the “best” model for generalizing the
dataset.
Linear Algebra
What is Linear Algebra? It is a branch of mathematics that concerns the study of vectors
and certain rules to manipulate them. When we formalize intuitive concepts, the common
approach is to construct a set of objects (symbols) and a set of rules to manipulate these
objects. This is what we know as algebra.
When talking about vectors, people might flash back to their high-school study of vectors
with magnitude and direction, just like the image below.
Geometric Vector
This is a vector, but not the kind of vector discussed in Linear Algebra for Machine
Learning. Instead, we would talk about the kind shown in the image below.
What we have above is also a vector, but of another kind. You might be familiar with
the matrix form (the image below). A vector is a matrix with only one column, which is known
as a column vector. In other words, we can think of a matrix as a group of column vectors or
row vectors. In summary, vectors are special objects that can be added together and multiplied
by scalars to produce another object of the same kind. Many different kinds of objects can be
called vectors.
Matrix
An example of how linear algebra is used is in linear equations. Linear algebra is a
tool used for linear equations because so many problems can be presented systematically
in a linear way. A typical linear equation is presented in the form below.
Linear Equation
To solve the linear equation problem above, we use linear algebra to present the linear
equations in a systematic representation. This way, we can use matrix characterizations
to look for the optimal solution.
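As a brief illustration (the coefficients below are made up for the example, and NumPy is assumed to be available), a system of linear equations can be written in matrix form Ax = b and solved directly:

import numpy as np

# Example system (hypothetical coefficients):
#   2x + 3y = 8
#   1x - 1y = -1
A = np.array([[2.0, 3.0],
              [1.0, -1.0]])   # coefficient matrix
b = np.array([8.0, -1.0])     # right-hand side vector

x = np.linalg.solve(A, b)     # solves A @ x = b
print(x)                      # [1. 2.] -> x = 1, y = 2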
To summarize the Linear Algebra subject, there are three terms you might want to learn
more about as a starting point within this subject:
• Vector
• Matrix
• Linear Equation
Analytic geometry is a study in which we learn the position of data (points) using an
ordered pair of coordinates. This study is concerned with defining and representing
geometrical shapes numerically, and with extracting numerical information from the shapes'
numerical definitions and representations. In simpler terms, we project the data onto a plane
and obtain numerical information from there.
Cartesian Coordinate
Above is an example of how we acquire information from data points by projecting
the dataset onto the plane. How we acquire information from this representation is the
heart of analytic geometry. To help you start learning this subject, here are some important
terms you might need.
Distance Function
A distance function is a function that provides numerical information about the distance
between the elements of a set. If the distance is zero, the elements are equivalent; otherwise,
they are different from each other.
An example of a distance function is the Euclidean distance, which calculates the straight-line
distance between two data points.
Inner Product
The inner product is a concept that introduces intuitive geometrical concepts, such as
the length of a vector and the angle or distance between two vectors. It is often denoted as
⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).
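A minimal sketch (NumPy assumed; the two vectors are arbitrary examples) computing a Euclidean distance, an inner product and the angle between two vectors:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

distance = np.linalg.norm(x - y)           # Euclidean distance between x and y
inner = np.dot(x, y)                       # inner product <x, y>
cos_angle = inner / (np.linalg.norm(x) * np.linalg.norm(y))
angle = np.degrees(np.arccos(cos_angle))   # angle between x and y, in degrees

print(distance, inner, angle)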
Probability is the bedrock of ML; it tells how likely an event is to occur. The value
of a probability always lies between 0 and 1. It is a core concept as well as a primary
prerequisite for understanding ML models and their applications.
Probability can be calculated as the number of times the event occurs divided by the
total number of possible outcomes. Suppose we toss a coin; then the probability of
getting a head as the outcome can be calculated as:
P(H) = 1/2
P(H) = 0.5
where H denotes the event of getting a head.
Types of Probability
Empirical Probability: Empirical Probability can be calculated as the number of times the
event occurs divided by the total number of incidents observed.
Joint Probability: It tells the probability of two random events occurring simultaneously,
written P(A ∩ B).
Conditional Probability: It is given by the Probability of event A given that event B occurred.
P(A|B) = P(A∩B)/P(B)
Similarly, P(B|A) = P(A ∩ B)/P(A). We can write the joint probability of A and B as
P(A ∩ B) = P(A)·P(B|A), which means: "the chance of both things happening is the chance that
the first one happens, times the chance that the second one happens given that the first has happened."
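A small sketch (pure Python; the counts are invented purely for illustration) that estimates joint and conditional probabilities from a table of outcomes:

# Hypothetical counts over 100 observations of two events A and B
n_total = 100
n_A = 40          # A occurred
n_B = 50          # B occurred
n_A_and_B = 20    # both occurred

p_A = n_A / n_total
p_B = n_B / n_total
p_A_and_B = n_A_and_B / n_total

p_A_given_B = p_A_and_B / p_B    # P(A|B) = P(A ∩ B) / P(B)
p_B_given_A = p_A_and_B / p_A    # P(B|A) = P(A ∩ B) / P(A)

print(p_A_given_B, p_B_given_A)  # 0.4 and 0.5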
Statistics is also considered a base foundation of machine learning; it deals
with finding answers to the questions that we have about data. In general, statistics is the
discipline of collecting, analysing, interpreting and presenting data, and it is broadly divided into:
o Descriptive Statistics
o Inferential Statistics
Use of Statistics in ML
Statistical methods are used to understand the training data as well as to interpret the
results of testing different machine learning models. Further, statistics can be used to make
better-informed business and investing decisions.
Taking this into consideration, we can safely say that the probability of event B is dependent
on the occurrence of event A. Some examples of such conditional probabilities are:
• Drawing a second ace from a deck given we got the first ace
• Finding the probability of having a disease given you were tested positive
• Finding the probability of liking Harry Potter given we know the person likes fiction
And so on….
We can write the conditional probability P(A|B) as the probability of the occurrence of
event A given that B has already happened.
Bayes Theorem
It is basically a classification technique that involves the use of the Bayes theorem
to find conditional probabilities.
The Bayes theorem describes the probability of an event based on prior knowledge
of conditions that might be related to the event. The conditional probability of A given B,
represented by P(A|B), is the chance of occurrence of A given that B has occurred:
P(A|B) = P(A,B)/P(B) or
P(A,B) = P(A|B)P(B) = P(B|A)P(A), so that P(A|B) = P(B|A)P(A)/P(B)
where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the prior probability and P(B) is the evidence.
Our aim is to explore each of the components included in this theorem. Let’s explore
step by step:
• It represents the probability of how likely a feature X occurs given that it belongs to
a particular class wi. It is denoted by P(X|wi), where X is a particular feature and wi is
the class.
• Sometimes, it is also known as the Likelihood.
• It is the quantity that we have to evaluate while training on the data. During the training
process, we have input (features) X labeled with the corresponding class w, and we figure out
the likelihood of occurrence of that set of features given the class label.
(c) Evidence: It is the total probability of observing the features, P(X); it acts as a
normalizing term in the Bayes theorem.
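A tiny worked sketch (the prevalence and test accuracies below are invented numbers, purely for illustration) of applying the Bayes theorem to the disease-testing example mentioned earlier:

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease = 0.01            # prior: 1% of people have the disease (assumed)
p_pos_given_disease = 0.95  # likelihood / sensitivity (assumed)
p_pos_given_healthy = 0.05  # false positive rate (assumed)

# Evidence: total probability of testing positive
p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

p_disease_given_pos = p_pos_given_disease * p_disease / p_positive
print(round(p_disease_given_pos, 3))   # about 0.161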
You need to know how to differentiate a vector function and how to present the result
as a vector or a matrix. This is the tool behind the backpropagation algorithm in neural
network training.
What is a Vector?
Vectors are also relied upon heavily as the basis for some machine learning
techniques. One example in particular is support vector machines. A support vector
machine analyzes vectors across an n-dimensional space to find the optimal hyperplane for a
given data set. In essence, a support vector machine will attempt to find a line that has the
maximum distance to the data points of both classes. This allows future data points to be
classified with more confidence, due to the increased separation.
A vector is a data structure with at least two components, as opposed to a scalar, which
has just one. For example, a vector can represent velocity, an idea that combines speed and
direction: wind velocity = (50mph, 35 degrees North East). A scalar, on the other hand, can
represent something with one value like temperature or height: 50 degrees Celsius, 180
centimeters. Therefore, we can represent two-dimensional vectors as arrows on an x-y graph,
with the coordinates x and y each representing one of the vector’s values.
Consider how Archimedes reasoned about the area of a circle: he wanted to cut it into pieces
so that the total area would not be distorted in the process. First, he divided it into four
smaller shapes by tracing two straight lines as follows:
Then, into 8 smaller pieces and so on until reaching an object of similar shape to a
parallelogram:
Now, the properties of a parallelogram were well-studied at the time and so he knew
how to compute the area of such an object. If you remember from your high school geometry
classes, the area of the circle is π⋅r², where r stands for the radius. Well, after dividing the circle
into approximately 64 smaller pieces, Archimedes derived that formula by noting the similarity
between the shape of the resulting object and that of a parallelogram, whose area is the base
times the height. Incredible! This summarizes the whole idea behind Calculus. Can’t solve a
specific problem? Just split it into infinitesimally smaller pieces until the solution arises
intuitively. Riemann used the same process of thinking when trying to figure out how to
compute the area below a curve, for example. There is a reason why the great Richard
Feynman referred to Calculus as “the language in which God speaks”.
The derivative is nothing more than a special limit of a continuous function, used for
calculating the slope of a tangent line, the instantaneous rate of change of a function, and the
instantaneous velocity of an object at a specific point. The formal formula to calculate this
limit is the following:
f'(x) = lim (h → 0) [f(x + h) − f(x)] / h
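A minimal sketch (pure Python; f(x) = x² is an arbitrary example) that approximates this limit numerically with a small h:

def derivative(f, x, h=1e-6):
    # Finite-difference approximation of f'(x) using the limit definition
    return (f(x + h) - f(x)) / h

f = lambda x: x ** 2
print(derivative(f, 3.0))   # close to 6.0, the exact derivative of x**2 at x = 3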
Multivariate Calculus
As you may have inferred from its name, multivariate calculus is the extension of
calculus to multiple dimensions (i.e., multiple variables). The larger the number of dimensions,
the less intuitive the concepts and the results may get. First, note that in ordinary calculus we
use a common 2-dimensional Cartesian plane: we want to compute y with respect to some
input variable x. If you picture such a plane in your head, you will observe that there are only
two directions in which we could go: left or right. But how would this look in 3 dimensions?
We can now go left, right, up or down. So, to find the derivative of such a function we need to
abstract ourselves for a minute, split the problem into smaller pieces and solve the problem in
steps by putting the pieces back together. To do this, we need a slight modification on how we
go about computing derivatives. Enter partial derivatives. This new tool allows us to compute
the derivative of our shape at different points to then get an idea of the bigger picture. To do
this, pick a variable and hold the other one constant. Holding one of the variables as constant
equates to slicing a plane through the 3D object. Take the derivative of the function with
respect to the variable you chose by following the normal derivative rules we just reviewed.
Then, do the same thing but for the other variable. Here is the plot of a function f(x,y)=x²-y²:
How on earth can we calculate the derivative at a given point for this function? Well,
let’s go by steps. First, fix the y variable and compute the partial derivative for f(x,y) = x²-
y² with respect to x to get ∂f(x,y)/ ∂x = 2x-0, since y is a constant its derivative is 0 and,
following the power rule, we compute the derivative for x² which is 2x. Next, do the same
thing but fixing x instead to get ∂f(x,y)/ ∂y = 0-2y. We are left with two equations for planes
that slice our original equation which brings the problem down from 3 to 2 dimensions, and
we know how to find the derivative at any given point of a curve in 2D. Now that we have an
intuition of what computing derivatives in higher dimensions looks like, we can introduce some
notation. Enter vectors and matrices. Vectors are mathematical objects that possess size and
direction. This means that we can represent the size and direction of an object with vectors,
for example, the distance and direction in which a car moved or the direction in which a
derivative moves. We can use them to denote the various variables of a multivariate function.
For example, we can denote our previous function f(x,y)=x²-y² as [x² -y²]. Matrices, on the
other hand, are simply a way of representing a collection of vectors. We can use matrices to
simplify the notation for expressing systems of linear equations where each column is a vector
containing the information about a certain variable. Moreover, in computer science, we refer
to high dimensional matrices (larger than 2 × 2) as Tensors. Technically, we can expand this
terminology to refer to a scalar as a one-dimensional Tensor and a vector as a 2-dimensional
Tensor. Here is an illustration of these concepts:
The gradient vector can be interpreted as the “direction and rate of fastest
increase”. Note that, by definition, a gradient points in the direction that maximizes a function,
because it tells us the rate of fastest increase; in practice, we are often concerned with finding
local or global minima, so we follow the negative gradient to make sure we are minimizing instead.
1.7 Optimization:
Optimization is the process in which we train the model iteratively, resulting in a
maximum or minimum function evaluation. It is one of the most important procedures in
Machine Learning for getting better results. Two important optimization algorithms are
Gradient Descent and Stochastic Gradient Descent.
Maxima is the largest and Minima is the smallest value of a function within a given
range. We represent them as below:
Global Maxima and Minima: It is the maximum value and minimum value respectively on the
entire domain of the function
Local Maxima and Minima: It is the maximum value and minimum value respectively of the
function within a given range.
There can be only one global minimum and maximum, but there can be more than one local
minimum and maximum.
GRADIENT DESCENT
Gradient Descent is an optimization algorithm and it finds out the local minima of a
differentiable function. It is a minimization algorithm that minimizes a given function.
Let’s see the geometric intuition of Gradient Descent:
Here, the minimum is at the origin (0, 0). The slope is tan θ, so the slope on the right
side is positive, since 0 < θ < 90 and tan θ is positive there; the slope on the left side is negative,
since 90 < θ < 180 and tan θ is negative there.
One important observation from the graph is that the slope changes its sign (from negative
to positive, moving left to right) at the minimum. As we move closer to the minimum, the
magnitude of the slope reduces.
So, how does the Gradient Descent Algorithm work?
• Calculate Xᵢ = Xᵢ₋₁ − r·[df/dx at Xᵢ₋₁] for the sequence of points X₁, X₂, X₃, ……., Xᵢ₋₁, Xᵢ,
where r is the learning rate
• When (Xᵢ − Xᵢ₋₁) is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration and
declare X* = Xᵢ
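A minimal sketch (pure Python; the function f(x) = x² and the starting point are arbitrary choices) of this iteration:

def gradient_descent(dfdx, x0, r=0.1, tol=1e-6, max_iter=1000):
    # dfdx: derivative of the function to minimize
    # x0: starting point, r: learning rate, tol: convergence threshold
    x_prev = x0
    for _ in range(max_iter):
        x_next = x_prev - r * dfdx(x_prev)   # X_i = X_(i-1) - r * f'(X_(i-1))
        if abs(x_next - x_prev) < tol:       # stop when X_i and X_(i-1) converge
            break
        x_prev = x_next
    return x_next

# Minimize f(x) = x**2, whose derivative is 2x; the minimum is at x = 0
print(gradient_descent(lambda x: 2 * x, x0=4.0))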
LEARNING RATE
Learning Rate is a hyperparameter or tuning parameter that determines the step size
at each iteration while moving towards minima in the function. For example, if r = 0.1 in the
initial step, it can be taken as r=0.01 in the next step. Likewise it can be reduced exponentially
as we iterate further. It is used more effectively in deep learning.
In the above example, we took r=1. As we calculate the points Xᵢ, Xᵢ+₁, Xᵢ+₂,….to find the
local minima, X*, we can see that it is oscillating between X = -0.5 and X = 0.5.
The intuition behind quantifying information is the idea of measuring how much
surprise there is in an event. Those events that are rare (low probability) are more surprising
and therefore have more information than those events that are common (high probability).
1. Data Compression (source coding): More frequent events should have shorter
encodings
2. Error Correction (channel coding): We should be able to infer the encoded event even if the
message is corrupted by noise
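A small sketch (pure Python) of how this "surprise" is quantified: the information content of an event with probability p is −log₂(p), so rare events carry more bits of information:

import math

def information(p):
    # Self-information (surprise) of an event with probability p, in bits
    return -math.log2(p)

print(information(0.5))    # 1 bit (a fair coin flip)
print(information(0.01))   # about 6.64 bits (a rare, surprising event)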
UNIT – II
Supervised Learning
2.1 Introduction – Discriminative and Generative models
In today’s world, machine learning has become one of the most popular and exciting fields of
study; it gives machines the ability to learn and to become more accurate at predicting
outcomes for unseen data, i.e., data not seen before. The ideas in machine learning
overlap with and draw from Artificial Intelligence and many other related technologies.
Today, machine learning has evolved from Pattern Recognition and from the concept that
computers can learn, without being explicitly programmed, to perform specific tasks. We
can use machine learning algorithms (e.g., Logistic Regression, Naive Bayes, etc.) to build
discriminative or generative models. These two kinds of models have also been compared
while learning an artificial language; they had not previously been explored in human
learning, but they are related to known effects of causal direction, classification vs.
inference learning, and observational vs. feedback learning.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide if an
email is a spam or not spam based on the words present in a particular email. To solve this
problem, consider the joint distribution over the label and the words,
P(Y, X) = P(y, x1, x2, …, xn)
Now, our goal is to estimate the probability that an email is spam, i.e., P(Y = 1 | X). Both
generative and discriminative models can solve this problem, but in different ways.
Let’s see why and how they are different!
Some Examples of Discriminative Models
• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest
Generative models take the following approach:
• Assume some functional form for the probabilities such as P(Y), P(X|Y)
• With the help of training data, we estimate the parameters of P(X|Y), P(Y)
• Use the Bayes theorem to calculate the posterior probability P(Y |X)
Some Examples of Generative Models
• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model
Computational Cost
Discriminative models are computationally cheap as compared to generative models.
Comparison between Discriminative and Generative Models
Let’s see some of the comparisons based on the following criteria between Discriminative and
Generative Models:
• Performance
• Missing Data
• Accuracy Score
• Applications
Based on Performance
Generative models need less data to train compared with discriminative models,
since generative models are more biased as they make stronger assumptions, i.e., an
assumption of conditional independence.
Based on Missing Data
In general, if we have missing data in our dataset, generative models can still work
with it, while discriminative models cannot. This is because, in generative models, we can
still estimate the posterior by marginalizing over the unseen variables. However, for
discriminative models, we usually require all the features X to be observed.
Based on Accuracy Score
If the assumption of conditional independence is violated, then generative
models are less accurate than discriminative models.
Based on Applications
Discriminative models are called “discriminative” since they are useful for
discriminating Y’s label, i.e., the target outcome, so they can only solve classification problems,
while generative models have more applications besides classification, such as:
• Samplings,
• Bayes learning,
• MAP inference, etc.
Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence it is called linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent
variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:
y = a0 + a1x + ε
Here, y is the dependent variable, x is the independent variable, a0 is the intercept of the
line, a1 is the linear regression coefficient (slope) and ε is the random error.
The values of the x and y variables are the training dataset for the Linear Regression model
representation.
Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.
A linear line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable increases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is termed a Positive linear relationship.
o Negative Linear Relationship:
If the dependent variable decreases on the Y-axis as the independent variable
increases on the X-axis, then such a relationship is termed a Negative linear relationship.
When working with linear regression, our main goal is to find the best-fit line, which
means the error between the predicted values and the actual values should be minimized. The
best-fit line will have the least error.
The different values of the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best-fit line; to
calculate this, we use the cost function.
Cost function-
o The different values of the weights or coefficients of the line (a0, a1) give different
regression lines, and the cost function is used to estimate the values of the coefficients
for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values. It can
be written as:
MSE = (1/N) Σi (yi − (a0 + a1xi))²
where N is the total number of observations, yi is the actual value and (a0 + a1xi) is the
predicted value for the i-th observation.
Residuals: The distance between the actual value and the predicted value is called the residual. If
the observed points are far from the regression line, the residuals will be high, and so the
cost function will be high. If the scatter points are close to the regression line, the residuals
will be small, and hence the cost function will be small.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It is done by randomly selecting values of the coefficients and then iteratively updating
those values to reach the minimum of the cost function.
Model Performance:
The goodness of fit determines how well the regression line fits the set of observations.
The process of finding the best model out of various models is called optimization. It can be
achieved by the below method:
1. R-squared method:
o R-squared is a statistical method that determines the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates less difference between the predicted
values and the actual values, and hence represents a good model.
o It is also called a coefficient of determination, or coefficient of multiple
determination for multiple regression.
o It can be calculated from the below formula:
R² = Explained variation / Total variation = 1 − (Sum of squared residuals / Total sum of squares)
Notice that the line is as close as possible to all the scattered data points. This is what
an ideal best fit line looks like. To better understand the whole process let’s see how to
calculate the line using the Least Squares Regression.
Surely, you’ve come across this equation before: y = mx + c. It is a simple equation that represents
a straight line in 2-dimensional data, i.e. along the x-axis and y-axis. To better understand this,
let’s break down the equation:
• y: dependent variable
• m: the slope of the line
• x: independent variable
• c: y-intercept
So the aim is to calculate the values of slope, y-intercept and substitute the
corresponding ‘x’ values in the equation in order to derive the value of the dependent
variable.
Step 1: Compute the slope 'm' using the least-squares formula m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², where x̄ and ȳ are the means of x and y.
Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis): c = ȳ − m·x̄
Step 3: Substitute the values of m and c in the final equation y = mx + c.
Now let’s look at an example and see how you can use the least-squares regression
method to compute the line of best fit.
Let us use the concept of least squares regression to find the line of best fit for the
above data.
Step 1: Calculate the slope ‘m’ by using the following formula:
Once you substitute the values, it should look something like this:
Let’s construct a graph that represents the y=mx + c line of best fit:
Now Tom can use the above equation to estimate how many T-shirts priced at $8 he
can sell at the retail shop.
y = 1.518 x 8 + 0.305 = 12.45 T-shirts
This comes down to 13 T-shirts! That’s how simple it is to make predictions using
Linear Regression.
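A short sketch of the same least-squares calculation in Python (NumPy assumed). The price/sales table from the example is not reproduced in these notes, so the arrays below are the commonly used illustrative figures that give the line y ≈ 1.518x + 0.305 quoted above; substitute the actual data if it differs:

import numpy as np

# (price, units sold) pairs; assumed values for illustration
x = np.array([2.0, 3.0, 5.0, 7.0, 9.0])     # T-shirt price
y = np.array([4.0, 5.0, 7.0, 10.0, 15.0])   # number of T-shirts sold

x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
c = y_mean - m * x_mean                                              # y-intercept

print("y = %.3f x + %.3f" % (m, c))
print("predicted sales at price 8:", m * 8 + c)   # about 12.45 -> 13 T-shirts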
Overfitting and Underfitting are the two main problems that occur in machine learning
and degrade the performance of the machine learning models.
Before understanding the overfitting and underfitting, let's understand some basic
term that will help to understand this topic well:
o Signal: It refers to the true underlying pattern of the data that helps the machine
learning model to learn from the data.
o Noise: Noise is unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: Bias is a prediction error that is introduced in the model due to oversimplifying
the machine learning algorithms. Or it is the difference between the predicted values
and the actual values.
o Variance: If the machine learning model performs well with the training dataset, but
does not perform well with the test dataset, then variance occurs.
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points
or more than the required data points present in the given dataset. Because of this, the model
starts capturing noise and inaccurate values present in the dataset, and all these factors reduce
the efficiency and accuracy of the model. The overfitted model has low bias and high
variance.
Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:
As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not. Because the goal of
the regression model is to find the best-fit line, and here we have not got any best fit, it will
generate prediction errors.
Both overfitting and underfitting cause degraded performance of the machine
learning model. But the more common problem is overfitting, and there are several ways by
which we can reduce its occurrence in our model.
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling
Underfitting
Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting in the model, the feeding of training data
can be stopped at an early stage, due to which the model may not learn enough from the
training data. As a result, it may fail to find the best fit of the dominant trend in the data.
In the case of underfitting, the model is not able to learn enough from the training
data, and hence it reduces the accuracy and produces unreliable predictions. An underfitted
model has high bias and low variance.
Example: We can understand the underfitting using below output of the linear regression
model:
As we can see from the above diagram, the model is unable to capture the data points present
in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.
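To make this trade-off concrete, here is a brief sketch (NumPy only; the noisy quadratic data is synthetic) contrasting an underfit, a reasonable fit and an overfit model by polynomial degree:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2 * x**2 + rng.normal(scale=0.1, size=x.size)   # quadratic signal + noise

x_train, y_train = x[::2], y[::2]    # half the points for training
x_test,  y_test  = x[1::2], y[1::2]  # the other half for testing

for degree in (1, 2, 10):            # underfit, good fit, overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    test_error = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(test_error, 4))
# Degree 1 underfits (high bias); degree 10 overfits the training noise (high variance).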
Goodness of Fit
The "Goodness of fit" term is taken from the statistics, and the goal of the machine
learning models to achieve the goodness of fit. In statistics modeling, it defines how closely
the result or predicted values match the true values of the dataset.
The model with a good fit is between the underfitted and overfitted model, and
ideally, it makes predictions with 0 errors, but in practice, it is difficult to achieve it.
When we train our model for some time, the errors on the training data go down, and
the same happens with the test data. But if we train the model for too long, the
performance of the model may decrease due to overfitting, as the model also learns the
noise present in the dataset. The errors on the test dataset then start increasing, so the point
just before the errors start rising is the good point, and we can stop there to achieve a good
model. There are two other methods by which we can find a good point for our model: the
resampling method to estimate model accuracy, and a validation dataset.
Leave-P-out cross-validation
In this approach, p data points are left out of the training data. If there are a
total of n data points in the original input dataset, then n − p data points are used as the training
dataset and the p data points as the validation set. This complete process is repeated for all
possible combinations, and the average error is calculated to know the effectiveness of the model.
A disadvantage of this technique is that it can be computationally expensive
for large p.
K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of
equal sizes. These samples are called folds. For each learning set, the prediction function uses
k-1 folds, and the rest of the folds are used for the test set. This approach is a very popular CV
approach because it is easy to understand, and the output is less biased than other methods.
Let's take an example of 5-fold cross-validation. The dataset is grouped into 5
folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used
to train the model. In the 2nd iteration, the second fold is used to test the model, and the rest
are used to train the model. This process continues until each fold has been used as the test fold.
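A minimal sketch (scikit-learn assumed to be installed; the data is a small synthetic set) of 5-fold cross-validation:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression()

# cv=5 splits the data into 5 folds; each fold is used once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores)          # accuracy on each of the 5 folds
print(scores.mean())   # average accuracy across folds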
Holdout Method
This method is the simplest cross-validation technique among all. In this method, we
need to remove a subset of the training data and use it to get prediction results by training it
on the rest part of the dataset.
The error that occurs in this process tells how well our model will perform with the
unknown dataset. Although this approach is simple to perform, it still faces the issue of high
variance, and it also produces misleading results sometimes.
Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output. But for inconsistent data,
it may produce drastically different results. This is one of the big disadvantages of cross-
validation, as there is no certainty about the kind of data encountered in machine learning.
o In predictive modeling, the data evolves over a period of time, due to which there may
be differences between the training and validation sets. For example, if we create a model
for the prediction of stock market values and train it on the previous 5 years of
stock values, the realistic values for the next 5 years may be drastically different, so it is
difficult to expect correct output in such situations.
These two topics are quite famous and are the basic introduction topics in Machine
Learning. There are other types of regression, like
• Lasso regression,
• Ridge regression,
• Polynomial regression,
• Stepwise regression,
• ElasticNet regression
The above-mentioned techniques are majorly used in regression kinds of analytical problems.
Two popular methods among them are lasso and ridge regression. In our ridge regression
article we explained the theory behind ridge regression and also covered the
implementation part in Python.
What Is Regression?
For example,
What determines a person's salary?
Many factors, like educational qualification, experience, skills, job role, company, etc.,
play a role in deciding salary.
You can use regression analysis to predict the dependent variable – salary using the
mentioned factors.
Y = mx+c
It is nothing but the linear regression equation. In the above equation, the dependent
variable is estimated from the independent variable.
In mathematical terms, the same equation terms are named slightly differently in the
machine learning or statistical world.
The line in the above graph represents the linear regression model. You can see how
well the model fits the data. It looks like a good model, but sometimes the model fits the data
too much, resulting in overfitting.
To create the (red) line using the actual values, the regression model will iterate and
recalculate the m (coefficient) and c (bias) values while trying to reduce the loss value with
a proper loss function.
The model will have low bias and high variance due to overfitting. The model fit is
good in the training data, but it will not give good test data predictions. Regularization comes
into play to tackle this issue.
What Is Regularization?
Regularization solves the problem of overfitting. Overfitting causes low model
accuracy. It happens when the model learns the data as well as the noises in the training set.
Noise consists of random data points in the training set that don't represent the actual
properties of the data.
Regularization shrinks, or regularizes, these coefficients to avoid overfitting and make them
work better on different datasets.
This type of regression is used when the dataset shows high multicollinearity or when
you want to automate variable elimination and feature selection.
Choosing a model depends on the dataset and the problem statement you are dealing
with. It is essential to understand the dataset and how features interact with each other.
Lasso regression penalizes less important features of your dataset and makes their
respective coefficients zero, thereby eliminating them. Thus it provides you with the benefit
of feature selection and simple model creation. So, if the dataset has high dimensionality and
high correlation, lasso regression can be used.
The lasso regression penalty consists of all the estimated parameters. Lambda can be any
value from zero to infinity; this value decides how aggressively regularization is performed.
It is usually chosen using cross-validation.
Lasso penalizes the sum of the absolute values of the coefficients. As the lambda value
increases, coefficients shrink and eventually become zero. This way, lasso regression
eliminates insignificant variables from our model. Our regularized model may have a slightly
higher bias than linear regression but less variance for future predictions.
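A small sketch (scikit-learn assumed; the data is synthetic) showing how increasing the lasso penalty alpha, which plays the role of lambda above, drives some coefficients to exactly zero:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, only 3 of which are actually informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

for alpha in (0.1, 1.0, 10.0):            # alpha plays the role of lambda
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = np.sum(model.coef_ == 0)     # coefficients eliminated by the penalty
    print(alpha, n_zero, np.round(model.coef_, 2))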
2.7 – Classification
As the name suggests, Classification is the task of “classifying things” into sub-
categories, but by a machine! If that doesn’t sound like much, imagine your computer being
able to differentiate between you and a stranger, between a potato and a tomato, or between
an A grade and an F. Now it sounds interesting. In Machine Learning and Statistics,
Classification is the problem of identifying to which of a set of categories (sub-populations)
a new observation belongs, on the basis of a training set of data containing observations
whose category membership is known.
Types of Classification
Classification is of two types:
1. Binary Classification: When we have to categorize given data into 2 distinct
classes. Example – On the basis of given health conditions of a person, we have
to determine whether the person has a certain disease or not.
2. Multiclass Classification: The number of classes is more than 2. For example,
on the basis of data about different species of flowers, we have to determine
which species our observation belongs to.
Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables upon which the
class is predicted.
6. Quality Metric: Metric used for measuring the performance of the model.
7. ML Algorithm: The algorithm that is used to update weights w’, which updates
the model and “learns” iteratively.
2. Confusion Matrix:
o The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix consists of the prediction results in a summarized form, giving the total
numbers of correct predictions and incorrect predictions. For a binary classifier, the matrix
looks like the below table:

                         Actual: Positive        Actual: Negative
Predicted: Positive      True Positive (TP)      False Positive (FP)
Predicted: Negative      False Negative (FN)     True Negative (TN)
3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o To visualize the performance of the multi-class classification model, we use the AUC-
ROC Curve.
o The ROC curve is plotted as TPR against FPR, where TPR (True Positive Rate) is on the Y-axis
and FPR (False Positive Rate) is on the X-axis.
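A short sketch (scikit-learn assumed; the toy labels and scores are invented for illustration) computing a confusion matrix and the ROC AUC:

from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]              # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]              # predicted classes
y_scores = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print(confusion_matrix(y_true, y_pred))        # rows = actual, columns = predicted
print(roc_auc_score(y_true, y_scores))         # area under the ROC curve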
Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
So, as the support vector machine creates a decision boundary between these two classes (cat and
dog) and chooses extreme cases (support vectors), it will see the extreme cases of cat and dog.
On the basis of the support vectors, it will classify the creature as a cat. Consider the below diagram:
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
SVM can be of two types: Linear SVM, used for linearly separable data (data that can be
divided into two classes by a single straight line), and Non-linear SVM, used for data that
cannot be separated by a straight line.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support the
hyperplane, they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in
either green or blue. Consider the below image:
As it is a 2-d space, just by using a straight line we can easily separate these two
classes. But there can be multiple lines that can separate these classes. Consider the below
image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the
lines from both classes. These points are called support vectors. The distance between
the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximize
this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the
below image:
Since we are in 3-d Space, hence it is looking like a plane parallel to the x-axis. If we
convert it in 2d space with z=1, then it will become as:
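A minimal sketch (scikit-learn assumed; synthetic two-class "circle" data) of linear and non-linear SVMs:

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: one class inside a circle, one outside
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel='linear').fit(X, y)   # tries to separate with a straight line
rbf_svm = SVC(kernel='rbf').fit(X, y)         # kernel trick adds the extra dimension

print(linear_svm.score(X, y))            # poor accuracy: no straight line separates the circles
print(rbf_svm.score(X, y))               # near-perfect accuracy
print(rbf_svm.support_vectors_.shape)    # the support vectors found by the model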
o The KNN algorithm, at the training phase, just stores the dataset, and when it gets new data,
it classifies that data into a category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
but we want to know whether it is a cat or a dog. For this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of
the new image that are most similar to those of the cat and dog images and, based on the most
similar features, will put it in either the cat or the dog category.
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose the k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. It can be calculated as:
o By calculating the Euclidean distance we get the nearest neighbours: three nearest
neighbours in category A and two nearest neighbours in category B. Consider the below
image:
o As we can see, the 3 nearest neighbours are from category A; hence this new data point
must belong to category A.
Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best out of them. The most preferred value for K is 5.
o A very low value for K such as K=1 or K=2, can be noisy and lead to the effects of
outliers in the model.
o Large values for K are good, but they may cause some difficulties.
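A compact sketch (scikit-learn assumed; the Iris dataset is used just as a stand-in) of the K-NN steps above with k = 5:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# k = 5 neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)            # "training" just stores the dataset

print(knn.predict(X_test[:3]))       # majority vote among the 5 nearest neighbours
print(knn.score(X_test, y_test))     # accuracy on unseen data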
Tree-based ML methods predict a target value by learning simple decision rules inferred from
the training data. Generally, tree-based ML methods are simple and intuitive: to predict a
class label or value, we start from the top of the tree (the root) and, using branches, go down
to the nodes by comparing the features that provide the best split.
Tree-based methods also use the mean for continuous variables or mode for
categorical variables when making predictions on training observations in the regions they
belong to. Since the set of rules used to segment the predictor space can be summarized in a
visual representation with branches that show all the possible outcomes, these approaches
are commonly referred to as decision tree methods. The methods are flexible and can be
applied to either classification or regression problems. Classification and Regression
Trees (CART) is a term introduced by Leo Breiman, referring to the flexibility of these
methods in solving both linear and non-linear predictive modeling problems.
Decision trees can be classified based on the type of target or response variable.
i. Classification Trees
The default type of decision tree, used when the response variable is categorical, e.g.
predicting whether a team will win or lose a game.
ii. Regression Trees
Used when the response variable is continuous or numerical, e.g. predicting the temperature
of the day.
Advantages of decision tree methods:
1. Interpretability: Decision tree methods are easy to understand even for non-
technical people.
2. The data type isn’t a constraint, as the methods can handle both categorical and
numerical variables.
3. Data exploration — Decision trees help us easily identify the most significant
variables and their correlation.
Limitations of decision tree methods:
1. Large decision trees are complex, time-consuming and less accurate in predicting
outcomes.
2. Decision trees don’t fit well for continuous variables, as they lose important
information when segmenting the data into different regions.
i) Root node — this represents the entire population or the sample, which gets divided into
two or more homogenous subsets.
ii) Splitting — subdividing a node into two or more sub-nodes.
iii) Decision node — this is when a sub-node is divided into further sub-nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It
cannot be split further.
v) Pruning — removing unnecessary sub-nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub-section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub-node is a parent, while the
sub-node is the child node.
CART Algorithm
CART is a predictive algorithm used in machine learning that explains how the
values of a target variable can be predicted based on other variables. It is a decision tree in
which each fork is a split on a predictor variable and each leaf node holds a prediction for
the target variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value
of an attribute. The root node is taken as the training set and is split into two by considering
the best attribute and threshold value. Further, the subsets are also split using the same
logic. This continues until the last pure subset is found in the tree or the maximum number
of leaves possible in the growing tree is reached.
The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does
that by searching for the best homogeneity of the sub-nodes, with the help of the Gini
index criterion: for a node whose class proportions are pi, Gini = 1 − Σ pi².
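A minimal sketch (scikit-learn assumed; Iris used as a stand-in dataset) of fitting a CART-style classification tree with the Gini criterion:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion='gini' is the CART-style impurity measure; max_depth limits tree size
tree = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
tree.fit(X, y)

print(export_text(tree))        # the learned splits (thresholds on features)
print(tree.predict(X[:5]))      # class predictions for the first few samples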
Classification tree
A classification tree is an algorithm where the target variable is categorical. The
algorithm is then used to identify the “Class” within which the target variable is most likely
to fall. Classification trees are used when the dataset needs to be split into classes that
belong to the response variable(like yes or no)
Regression tree
A Regression tree is an algorithm where the target variable is continuous and the
tree is used to predict its value. Regression trees are used when the response variable is
continuous. For example, if the response variable is the temperature of the day.
1. Bagging
2. Boosting
3. Stacking
Bagging
Consider a scenario where you are looking at users’ ratings for a product. Instead
of relying on one user’s good/bad rating, we consider the average rating given to the product.
With an average rating, we can be considerably more sure of the quality of the product. Bagging makes
use of this principle. Instead of depending on one model, it runs the data through multiple
models in parallel and averages their outputs as the model’s final output.
a. Consider that there are 8 samples in the training dataset. Out of these 8 samples,
every weak learner gets 5 samples as training data for the model. These 5 samples
need not be unique, or non-repetitive.
b. The model (weak learner) is allowed to get a sample multiple times. For example,
as shown in the figure, Rec5 is selected 2 times by the model. Therefore, weak
learner1 gets Rec2, Rec5, Rec8, Rec5, Rec4 as training data.
c. All the samples are available for selection to next weak learners. Thus all 8 samples
will be available for next weak learner and any sample can be selected multiple
times by next weak learners.
• Bagging is a parallel method, which means several weak learners learn the data
pattern independently and simultaneously. This can be best shown in the below
diagram:
1. The output of each weak learner is averaged to generate final output of the model.
2. Since the weak learner’s outputs are averaged, this mechanism helps to reduce
variance or variability in the predictions. However, it does not help to reduce bias
of the model.
3. Since final prediction is an average of output of each weak learner, it means that
each weak learner has equal say or weight in the final output.
Boosting
We saw that in bagging every model is given equal preference, but if one model
predicts the data more correctly than another, then higher weightage should be given to that
model. Also, the model should attempt to reduce bias. These concepts are
applied in the second ensemble method that we are going to learn, that is, Boosting.
What is Boosting?
1. To start with, boosting assigns equal weights to all data points as all points are
equally important in the beginning. For example, if a training dataset has N
samples, it assigns weight = 1/N to each sample.
2. The weak learner classifies the data. The weak classifier classifies some samples
correctly, while making mistake in classifying others.
3. After classification, sample weights are changed. Weight of correctly classified
sample is reduced, and weight of incorrectly classified sample is increased. Then
the next weak classifier is run.
4. This process continues until model as a whole gives strong predictions.
Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble
is a decision tree classifier and is generated using a random selection of attributes
at each node to determine the split. During classification, each tree votes and the
most popular class is returned.
Implementation steps of Random Forest –
0. Multiple subsets are created from the original data set, selecting
observations with replacement.
This combination of multiple models is called Ensemble. Ensemble uses two methods:
1. Bagging: Creating a different training subset from sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combing weak learners into strong learners by creating sequential
models such that the final model has the highest accuracy is called Boosting.
Example: ADA BOOST, XG BOOST.
Bagging: From the principle mentioned above, we can understand that Random Forest uses the
bagging approach. Now, let us understand this concept in detail. Bagging, also known as
Bootstrap Aggregation, is used by random forest. The process begins with the original random
data, which is resampled with replacement and organised into samples known as bootstrap
samples; this process is known as Bootstrapping. Further, the models are trained individually
on these samples, yielding different results, which are then aggregated. In the last step, all the
results are combined and the generated output is based on majority voting. This step is known
as Bagging and is done using an Ensemble Classifier.
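A short sketch (scikit-learn assumed; the data is synthetic) of bagging decision trees via a random forest:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 100 trees, each trained on a bootstrap sample with a random subset of features;
# the final prediction is the majority vote of all trees
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict(X[:5]))     # majority-vote predictions
print(forest.score(X, y))        # accuracy on the training data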
With supervised training, the training data contains the input and target values.
The algorithm picks up a pattern that maps the input values to the output and uses this
pattern to predict values in the future. Unsupervised learning, on the other hand, uses
training data that does not contain the output values. The algorithm figures out the desired
output over multiple iterations of training. Finally, we have reinforcement learning. Here, the
algorithm is rewarded for every right decision made, and using this as feedback, the
algorithm can build stronger strategies.
In ID3, entropy is calculated for each remaining attribute. The attribute with the
smallest entropy after the split (equivalently, the largest information gain) is used to split the
set S in that particular iteration.
Entropy = 0 implies it is of pure class, that means all are of same category. Information Gain
IG(A) tells us how much uncertainty in S was reduced after splitting set S on attribute A.
Mathematical representation of Information gain is shown here -
IG(A, S) = H(S) − Σ t∈T p(t)·H(t)
Where,
• H(S) - Entropy of set S.
• T - The subsets created by splitting set S on attribute A.
• p(t) - The proportion of the number of elements in t to the number of elements in S.
• H(t) - Entropy of subset t.
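A small sketch (pure Python; the tiny label lists are made up) computing entropy and the information gain of a candidate split, as used by ID3:

from math import log2
from collections import Counter

def entropy(labels):
    # H(S) = -sum(p(c) * log2(p(c))) over the classes c in S
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, subsets):
    # IG(A, S) = H(S) - sum(p(t) * H(t)) over the subsets t produced by attribute A
    total = len(parent)
    return entropy(parent) - sum(len(t) / total * entropy(t) for t in subsets)

S = ['yes', 'yes', 'no', 'no', 'yes', 'no']           # hypothetical class labels
split = [['yes', 'yes', 'yes'], ['no', 'no', 'no']]   # a perfect split by some attribute
print(entropy(S))                  # 1.0 (maximally impure)
print(information_gain(S, split))  # 1.0 (all uncertainty removed)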
UNIT – III
Unsupervised learning and Reinforcement learning
3.1 Introduction – Clustering Algorithm
Clustering is an unsupervised machine learning task. You might also hear this referred
to as cluster analysis because of the way this method works. Using a clustering algorithm
means you're going to give the algorithm a lot of input data with no labels and let it find any
groupings in the data it can. Those groupings are called clusters. A cluster is a group of data
points that are similar to each other based on their relation to surrounding data points.
Clustering is used for things like feature engineering or pattern discovery. When you're
starting with data you know nothing about, clustering might be a good place to get some
insight.
Types of clustering algorithms
There are different types of clustering algorithms that handle all kinds of unique data.
Density-based
In density-based clustering, data is grouped by areas of high concentrations of data
points surrounded by areas of low concentrations of data points. Basically the algorithm finds
the places that are dense with data points and calls those clusters.
The great thing about this is that the clusters can be any shape; you aren't constrained to
expected conditions.
The clustering algorithms under this type don't try to assign outliers to clusters, so the
outliers get ignored.
Distribution-based
With a distribution-based clustering approach, all of the data points are considered
parts of a cluster based on the probability that they belong to a given cluster.
It works like this: there is a center-point, and as the distance of a data point from the center
increases, the probability of it being a part of that cluster decreases.
If you aren't sure what the distribution in your data might be, you should consider a different
type of algorithm.
Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little
sensitive to the initial parameters you give it, but it's fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data. Each
data point is assigned to a cluster based on its squared distance from the centroid. This is the
most commonly used type of clustering.
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get
from a company database or taxonomies. It builds a tree of clusters so everything is organized
from the top-down.
This is more restrictive than the other clustering types, but it's perfect for specific kinds of
data sets.
2. Initializing centroids
The centroid is the center of a cluster, but initially the exact centers of the data groups are
unknown, so we select random data points and define them as centroids for each cluster. We
will initialize 3 centroids in the dataset.
3. Assigning data points to the closest centroid
In this step, we will first calculate the distance between data point X and centroid C using
the Euclidean distance metric, and then assign each data point to the cluster whose centroid
is at the minimum distance.
K-means clustering
4. Re-initialize centroids
Next, we will re-initialize the centroids by calculating the average of all data points of
that cluster.
K-means clustering
Does this iterative process sound familiar? Well, K-means follows the same approach
as Expectation-Maximization (EM). EM is an iterative method for finding the maximum likelihood
of parameters when the machine learning model depends on unobserved features. This
approach consists of two steps, Expectation (E) and Maximization (M), and iterates between
the two.
For K-means, the Expectation (E) step is where each data point is assigned to the most
likely cluster, and the Maximization (M) step is where the centroids are recomputed using the
least-squares optimization technique.
• Naive Sharding
The sharding centroid initialization algorithm primarily depends on the composite
summation value of all the attributes for a particular instance or row in a dataset. The idea is
to calculate the composite value and then use it to sort the instances of the data. Once the
data set is sorted, it is then divided horizontally into k shards.
Finally, all the attributes from each shard will be summed and their mean will be
calculated. The shard attributes mean value collection will be identified as the set of centroids
that can be used for initialization.
Centroid initialization using sharding happens in linear time and the resultant
execution time is much better than random centroid initialization.
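A minimal sketch (scikit-learn assumed; synthetic blob data) of K-means with k = 3:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # assign each point to its nearest centroid

print(kmeans.cluster_centers_)      # the 3 re-estimated centroids
print(labels[:10])                  # cluster index of the first 10 points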
Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into
different sets based on their degree of similarity.
The various types of clustering are:
• Hierarchical clustering
• Partitioning clustering
Hierarchical clustering is further subdivided into:
• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:
• K-Means clustering
• Fuzzy C-Means clustering
Distance Measure
Distance measure determines the similarity between two elements and influences the
shape of clusters.
K-Means clustering supports various kinds of distance measures, such as the Euclidean
distance measure, the Manhattan distance measure, the squared Euclidean distance measure,
and the cosine distance measure.
1. Agglomerative clustering: a bottom-up approach that starts with each data point as its own
cluster and repeatedly merges the closest pair of clusters.
2. Divisive clustering: a top-down approach that starts with all data points in one cluster and
recursively splits it into smaller clusters.
Generally, cluster validity measures are categorized into 3 classes, they are –
1. Internal cluster validation : The clustering result is evaluated based on the data
clustered itself (internal information) without reference to external information.
2. External cluster validation : Clustering results are evaluated based on some
externally known result, such as externally provided class labels.
3. Relative cluster validation : The clustering results are evaluated by varying
different parameters for the same algorithm (e.g. changing the number of
clusters).
Besides the term cluster validity index, we need to know the inter-cluster
distance d(a, b) between two clusters a and b, and the intra-cluster index D(a) of cluster a.
Now, let’s discuss 2 internal cluster validity indices namely Dunn index and DB index.
Dunn index :
The Dunn index (DI) (introduced by J. C. Dunn in 1974), a metric for evaluating
clustering algorithms, is an internal evaluation scheme, where the result is based on the
clustered data itself. Like all other such indices, the aim of the Dunn index is to identify sets
of clusters that are compact, with a small variance between members of the cluster, and
well separated, where the means of different clusters are sufficiently far apart, as compared
to the within cluster variance.
The higher the Dunn index value, the better the clustering. The number of clusters that
maximizes the Dunn index is taken as the optimal number of clusters k. It also has some
drawbacks: as the number of clusters and the dimensionality of the data increase, the
computational cost also increases.
The Dunn index for c clusters is defined as the smallest inter-cluster distance divided by
the largest intra-cluster distance:
DI = min{ d(a, b) : a ≠ b } / max{ D(k) : 1 ≤ k ≤ c }
DB index :
The Davies–Bouldin index (DBI) (introduced by David L. Davies and Donald W.
Bouldin in 1979), a metric for evaluating clustering algorithms, is an internal evaluation
scheme, where the validation of how well the clustering has been done is made using
quantities and features inherent to the dataset.
The lower the DB index value, the better the clustering. It also has a drawback: a good value
reported by this method does not imply the best information retrieval.
The DB index for k clusters averages, over all clusters, the worst-case ratio of combined
intra-cluster scatter to inter-cluster separation:
DB = (1/k) * Σ over i of [ max over j ≠ i of (D(i) + D(j)) / d(i, j) ]
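A short sketch of both indices, using the d(a, b) / D(a) notation above. Here d(a, b) is taken as the distance between cluster centroids and D(a) as the mean distance of a cluster's points to its centroid; these are common but not the only possible definitions.

import numpy as np

def dunn_db(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    centroids = [c.mean(axis=0) for c in clusters]
    D = [np.linalg.norm(c - m, axis=1).mean() for c, m in zip(clusters, centroids)]
    k = len(clusters)
    d = lambda i, j: np.linalg.norm(centroids[i] - centroids[j])

    inter = [d(i, j) for i in range(k) for j in range(i + 1, k)]
    dunn = min(inter) / max(D)                        # higher is better

    db = np.mean([max((D[i] + D[j]) / d(i, j)         # lower is better
                      for j in range(k) if j != i)
                  for i in range(k)])
    return dunn, db

scikit-learn also ships a ready-made davies_bouldin_score in sklearn.metrics if you prefer not to hand-roll the DB index.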
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Some benefits of applying dimensionality reduction technique to the given dataset are given
below:
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Feature Extraction:
Feature extraction is the process of transforming the space containing many
dimensions into space with fewer dimensions. This approach is useful when we want to keep
the whole information but use fewer resources while processing the information.
Hence, we are left with a smaller number of eigenvectors, and some information may have been
lost in the process. However, the most important variance should be retained by the
remaining eigenvectors.
PCA generally tries to find a lower-dimensional surface onto which to project the high-
dimensional data. PCA works by considering the variance of each attribute, because high
variance indicates a good split between the classes, and hence it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation system,
optimizing the power allocation in various communication channels. It is a feature extraction
technique, so it contains the important variables and drops the least important variable.
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is
zero.
o The importance of each component decreases when going from 1 to n; this means the first
principal component has the most importance and the nth principal component has the least.
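A short scikit-learn sketch of PCA as a feature-extraction step; the random data and the choice of 2 components are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                   # toy data: 100 samples, 5 features
X_std = StandardScaler().fit_transform(X)    # PCA is variance-based, so scale first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)         # project onto the top-2 principal components

print(pca.explained_variance_ratio_)         # importance decreases from PC1 to PCn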
Collaborative filtering algorithms recommend items (this is the filtering part) based on
preference information from many users (this is the collaborative part). This approach uses
the similarity of user preference behavior: given previous interactions between users and items,
recommender algorithms learn to predict future interactions. These recommender systems
build a model from a user’s past behavior, such as items purchased previously or ratings given
to those items and similar decisions by other users. The idea is that if some people have made
similar decisions and purchases in the past, like a movie choice, then there is a high probability
they will agree on additional future selections. For example, if a collaborative filtering
recommender knows you and another user share similar tastes in movies, it might
recommend a movie to you that it knows this other user already likes.
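A minimal user-based collaborative filtering sketch on a made-up ratings matrix (rows = users, columns = items, 0 = not rated); the matrix and the similarity-weighted average are illustrative assumptions.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

user_sim = cosine_similarity(ratings)          # how similar each pair of users is

# Predict user 0's score for item 2 as a similarity-weighted average
# of the other users' ratings for that item.
others = [1, 2]
weights = user_sim[0, others]
pred = np.dot(weights, ratings[others, 2]) / weights.sum()
print(pred)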
Hybrid recommender systems combine the advantages of the types above to create a more
comprehensive recommending system.
▪ Improving retention.
▪ Increasing sales.
▪ Helping to form customer habits and trends.
▪ Speeding up the pace of work.
▪ Boosting cart value.
3.8 EM – Algorithm
The Expectation-Maximization (EM) algorithm is an iterative approach, used within various
unsupervised machine learning algorithms, for determining local maximum likelihood
estimates (MLE) or maximum a posteriori (MAP) estimates of parameters for unobservable
variables in statistical models. In other words, it is a technique for finding maximum likelihood
estimates when latent variables are present; the setting it operates in is referred to as a latent
variable model.
A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable. These
unobservable variables are known as latent variables.
Key Points:
o It is known as the latent variable model to determine MLE and MAP parameters for
latent variables.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values occurs.
EM Algorithm
The EM algorithm is the combination of various unsupervised ML algorithms, such as
the k-means clustering algorithm. Being an iterative approach, it consists of two modes. In
the first mode, we estimate the missing or latent variables. Hence it is referred to as
the Expectation/estimation step (E-step). Further, the other mode is used to optimize the
parameters of the models so that it can explain the data more clearly. The second mode is
known as the maximization-step or M-step.
o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to update
the values of the parameters in the M-step.
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. Further, the system
is provided with incomplete observed data with the assumption that data is obtained
from a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further,
E-step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are converging or
not. If it gets "yes", then stop the process; else, repeat the process from step 2 until
the convergence occurs.
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable model has a
broad range of real-life applications in machine learning. These are as follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the values of parameters in mixture models such as the Gaussian
Mixture Model, and in quantitative genetics.
o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction and structural engineering.
o It is used to determine the Gaussian density of a function.
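In practice, a common way to see EM at work is fitting a Gaussian Mixture Model, where the E and M steps are run internally by the library. The two-component one-dimensional data below is an illustrative assumption.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, max_iter=100).fit(X)   # EM runs inside fit()
print(gmm.means_.ravel())      # estimated component means
print(gmm.weights_)            # estimated mixing proportions
labels = gmm.predict(X)        # most likely component for each point (an E-step output)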
There are mainly three ways to implement reinforcement-learning in ML, which are:
1. Value-based:
The value-based approach is about finding the optimal value function, which gives the
maximum value at a state under any policy. The agent therefore expects the long-
term return at any state(s) under policy π.
2. Policy-based:
Policy-based approach is to find the optimal policy for the maximum future rewards
without using the value function. In this approach, the agent tries to apply such a
policy that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The same action is produced by the policy (π) at any state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that environment to learn it. There is no
particular solution or algorithm for this approach because the model representation
is different for each environment.
Let's take an example of a maze environment that the agent needs to explore. Consider
the below image:
In the above image, the agent is at the very first block of the maze. The maze consists of
an S6 block, which is a wall, S8, which is a fire pit, and S4, which is a diamond block.
The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the
S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can
take four actions: move up, move down, move left, and move right.
The agent can take any path to reach the final point, but it needs to do so in the fewest
possible steps. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get
the +1 reward.
The agent will try to remember the preceding steps that it has taken to reach the final
step. To memorize the steps, it assigns 1 value to each previous step. Consider the below
step:
Now the agent has successfully stored the previous steps, assigning a value of 1 to each
previous block. But what will the agent do if it starts from a block that has blocks with value 1
on both of its sides? Consider the below diagram:
It will be difficult for the agent to decide whether to go up or down, as each block has the
same value. So, the above approach is not suitable for the agent to reach the destination.
Hence, to solve the problem, we use the Bellman equation, which is the main concept behind
reinforcement learning.
The Bellman equation can be written as V(s) = max over a of [ R(s,a) + γV(s`) ],
Where,
V(s) = value calculated at a particular point.
R(s,a) = Reward at a particular state s by performing an action a.
γ = Discount factor
V(s`) = The value of the next state.
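A small value-iteration sketch of this Bellman update on a toy, deterministic chain of states; the states, rewards and transitions below are assumptions for illustration, not the exact maze from the figure.

gamma = 0.9
# transitions[s][action] -> (next_state, reward R(s, a))
transitions = {
    "S1": {"right": ("S2", 0)},
    "S2": {"right": ("S3", 0)},
    "S3": {"right": ("S4", 1)},   # reaching the diamond block gives +1
    "S4": {},                     # terminal state
}

V = {s: 0.0 for s in transitions}
for _ in range(50):
    for s, actions in transitions.items():
        if actions:
            # Bellman update: V(s) = max_a [ R(s,a) + gamma * V(s') ]
            V[s] = max(r + gamma * V[s2] for (s2, r) in actions.values())
print(V)   # values propagate back from the rewarding state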
Positive:
Positive reinforcement is defined as an event that occurs because of specific behavior. It
increases the strength and the frequency of the behavior and has a positive impact on the
actions taken by the agent.
This type of Reinforcement helps you to maximize performance and sustain change
for a more extended period. However, too much Reinforcement may lead to over-
optimization of state, which can affect the results.
Negative:
Negative reinforcement is defined as the strengthening of behavior that occurs because
a negative condition is stopped or avoided. It helps you to define the minimum standard of
performance. However, the drawback of this method is that it provides only enough to meet
that minimum behavior.
3.9 Elements
Where the reward at time t is the combination of discounted rewards in the future. It
implies that future rewards are valued less. The TD Error is the difference between the
ultimate correct reward (V*_t) and our current prediction (V_t).
And similar to other optimization methods, the current value will be updated by its
value + learning_rate * error:
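A sketch of this TD(0) update, V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)), where the bracketed term is the TD error; the episode below is a made-up sequence of (state, reward, next state) steps.

alpha, gamma = 0.1, 0.9
V = {"A": 0.0, "B": 0.0, "C": 0.0}

episode = [("A", 0, "B"), ("B", 0, "C"), ("C", 1, None)]   # assumed transitions
for s, r, s_next in episode:
    target = r + gamma * (V[s_next] if s_next is not None else 0.0)
    td_error = target - V[s]        # difference between target and current prediction
    V[s] += alpha * td_error        # value + learning_rate * error
print(V)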
Parameters
Alpha (α): learning rate. This parameter shows how much we should adjust our
estimates based on the error. The learning rate is between 0 and 1. A large learning rate adjusts
aggressively and might lead to fluctuating training results — not converging. A small learning
rate adjusts slowly, which will take more time to converge.
Gamma (γ): the discount rate, i.e., how much we value future rewards. The discount
rate is between 0 and 1. The bigger the discount rate, the more we value future rewards.
UNIT – IV
Probabilistic method for learning
4.1 Introduction – Naïve Bayes Algorithm
o Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
o It is mainly used in text classification that includes a high-dimensional training dataset.
o Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object.
o Some popular examples of Naïve Bayes Algorithm are spam filtration, Sentimental
analysis, and classifying articles.
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the probability
of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Problem: If the weather is sunny, then the Player should play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table of weather conditions:
Weather      Yes   No
Overcast      5     0
Rainy         2     2
Sunny         3     2
Total        10     4
Likelihood table of weather conditions:
Weather      No    Yes
Overcast      0     5     5/14 = 0.35
Rainy         2     2     4/14 = 0.29
Sunny         2     3     5/14 = 0.35
All        4/14 = 0.29   10/14 = 0.71
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|NO)= 2/4=0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
So, as we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny). Hence, on a sunny
day, the player can play the game.
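The same Bayes' theorem calculation can be done in a few lines of Python; the counts come directly from the frequency table above (10 Yes, 4 No; Sunny occurs on 5 of 14 days, 3 of them Yes).

p_sunny_given_yes = 3 / 10
p_sunny_given_no = 2 / 4
p_yes, p_no = 10 / 14, 4 / 14
p_sunny = 5 / 14

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # ≈ 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # ≈ 0.40 (0.41 with the rounded values above)
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")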
MLE is carried out by writing an expression known as the likelihood function for a set of
observations. This expression contains an unknown parameter, say θ, of the model. We obtain
the value of this parameter that maximizes the likelihood of the observations. This value is
called the maximum likelihood estimate.
Think of likelihood as a counterpart of probability: while the probability function gives
the probability of observing the samples for given parameter values, the likelihood measures
how plausible the parameter values are given the observed samples.
2. Properties of Maximum Likelihood Estimates. MLE has very desirable properties, especially
for very large sample sizes, some of which are:
• estimators based on the likelihood function are very efficient for testing hypotheses about
models and parameters
• they become unbiased, minimum-variance estimators as the sample size increases
• they have approximately normal distributions
3. Deriving the Likelihood Function. Assume a random sample x1, x2, x3, … ,xn which has a
joint probability density denoted by:
L(θ) = f(x1, x2, x3, … ,xn|θ)
We need to find the most likely value of the parameter θ given the set of observations.
To do this, we use a likelihood function.
The likelihood function is defined as:
L(θ) = f(x1, x2, x3, … ,xn|θ)
which is considered as a function of θ
If we assume that the sample is normally distributed, then we can define the likelihood
estimate for θ as the value of θ that maximizes the L(θ), that is the value of θ that makes the
data set most likely.
Because the observations are independent, we can split the function f(x1, x2, x3, … ,xn|θ)
into a product of univariate densities:
L(θ) = f(x1, x2, x3, … ,xn|θ) = f(x1|θ) · f(x2|θ) · f(x3|θ) · … · f(xn|θ)
which gives us the same result.
4. Log Likelihood. Maximizing the likelihood function derived above can be a complex
operation. To work around this, we can use the fact that the logarithm is a monotonically
increasing function, so maximizing the logarithm of the likelihood function is equivalent to
maximizing the likelihood function itself.
This is given as:
log L(θ) = log f(x1|θ) + log f(x2|θ) + … + log f(xn|θ)
The value of θ that maximizes this function is the maximum likelihood estimate for the
given model.
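A hedged sketch of MLE in code: estimating the mean of a Gaussian by numerically maximizing the log-likelihood (the sample, the known scale of 1.0 and the use of scipy's optimizer are assumptions). For the normal distribution the closed-form MLE of the mean is the sample mean, which the optimizer should recover.

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=1.0, size=500)     # assumed sample

def neg_log_likelihood(theta):
    # log L(theta) = sum_i log f(x_i | theta); we minimize its negative
    return -np.sum(norm.logpdf(x, loc=theta, scale=1.0))

theta_hat = minimize_scalar(neg_log_likelihood).x
print(theta_hat, x.mean())   # the two values should be very close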
5. Applications of Maximum Likelihood Estimation. MLE can be applied in different statistical
models including linear and generalized linear models, exploratory and confirmatory analysis,
communication system, econometrics and signal detection.
Solution:
Step-1: Calculating C1 and L1:
o In the first step, we will create a table that contains support count (The frequency of
each itemset individually in the dataset) of each itemset in the given dataset. This table
is called the Candidate set or C1.
o Now, we will take out all the itemsets that have a support count greater than or equal to the
Minimum Support (2). This gives us the table for the frequent itemset L1.
Since all the itemsets except E have a support count greater than or equal to the minimum
support, the itemset E will be removed.
o Again, we need to compare the C2 Support count with the minimum support count,
and after comparing, the itemset with less support count will be eliminated from the
table C2. It will give us the below table for L2.
o Now we will create the L3 table. As we can see from the above C3 table, there is only
one combination of itemset that has support count equal to the minimum support
count. So, the L3 will have only one combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
To generate the association rules, we first create a new table with the possible
rules from the occurring combination {A, B, C}. For each rule, we calculate the
Confidence using the formula sup(A ^ B)/sup(A). After calculating the confidence value for all
rules, we exclude the rules whose confidence is less than the minimum threshold (50%).
As the given threshold (minimum confidence) is 50%, the first three rules, A ^ B → C,
B ^ C → A, and A ^ C → B, can be considered strong association rules for the given
problem.
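The confidence calculation above can be sketched as follows; the support counts are assumed illustrative values (the original tables are in the figures), chosen so that exactly the first three rules pass the 50% threshold.

support = {            # assumed support counts for illustration
    frozenset("ABC"): 2,
    frozenset("AB"): 4, frozenset("BC"): 4, frozenset("AC"): 4,
    frozenset("A"): 6, frozenset("B"): 7, frozenset("C"): 5,
}
min_confidence = 0.5

rules = [("AB", "C"), ("BC", "A"), ("AC", "B"), ("A", "BC"), ("B", "AC"), ("C", "AB")]
for lhs, rhs in rules:
    # confidence(X -> Y) = sup(X ∪ Y) / sup(X)
    conf = support[frozenset(lhs) | frozenset(rhs)] / support[frozenset(lhs)]
    verdict = "strong" if conf >= min_confidence else "rejected"
    print(f"{lhs} -> {rhs}: confidence = {conf:.2f} ({verdict})")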
The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
In general for each variable Xi, we can write the equation as:
P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))
Problem:
Calculate the probability that alarm has sounded, but there is neither a burglary, nor
an earthquake occurred, and David and Sophia both called the Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure is
showing that burglary and earthquake is the parent node of the alarm and directly
affecting the probability of alarm's going off, but David and Sophia's calls depend on
alarm probability.
o The network represents our assumptions that David and Sophia do not directly perceive
the burglary, do not notice minor earthquakes, and do not confer with each other before
calling.
o The conditional distributions for each node are given as conditional probabilities table
or CPT.
o Each row in the CPT must sum to 1 because all the entries in the row represent an
exhaustive set of cases for the variable.
o In a CPT, a boolean variable with k boolean parents requires 2^k rows of probabilities. Hence,
if there are two parents, the CPT will contain 4 probability values.
We can write the events of the problem statement in the form of the probability P[D, S, A, B, E],
and rewrite this probability using the joint probability distribution:
P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P [D| A]. P [ S| A, B, E]. P[ A, B, E]
= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]
= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]
Let's take the observed probability for the Burglary and earthquake component:
P(B= True) = 0.002, which is the probability of burglary.
From the formula of joint distribution, we can write the problem statement in the
form of probability distribution:
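A sketch of the joint probability P[D, S, A, ¬B, ¬E] for this alarm network, using the factorization derived above. The CPT numbers here are assumed illustrative values; the notes' own figure gives the actual tables.

P_B, P_E = 0.002, 0.001                  # priors for burglary and earthquake
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}    # P(Alarm | B, E), assumed
P_D_given_A = {True: 0.91, False: 0.05}                # P(David calls | Alarm), assumed
P_S_given_A = {True: 0.75, False: 0.02}                # P(Sophia calls | Alarm), assumed

# P[D, S, A, ~B, ~E] = P(D|A) * P(S|A) * P(A|~B,~E) * P(~B) * P(~E)
p = (P_D_given_A[True] * P_S_given_A[True]
     * P_A[(False, False)] * (1 - P_B) * (1 - P_E))
print(p)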
If the model is Non-Probabilistic (Deterministic), it will usually output only the most likely class
that the input data instance belongs to. Vanilla “Support Vector Machines” is a popular non-
probabilistic classifier.
Let’s discuss an example to better understand probabilistic classifiers. Take the
task of classifying an image of an animal into five classes — {Dog, Cat, Deer, Lion, Rabbit} as
the problem. As input, we have an image (of a dog). For this example, let’s consider that the
classifier works well and provides correct/ acceptable results for the particular input we are
discussing. When the image is provided as the input to the probabilistic classifier, it will provide
an output such as (Dog (0.6), Cat (0.2), Deer(0.1), Lion(0.04), Rabbit(0.06)). But, if the classifier
is non-probabilistic, it will only output “Dog”.
Objective functions:
In order to identify whether a particular model is probabilistic or not, we can look
at its Objective Function. In machine learning, we aim to optimize a model to excel at a
particular task. The aim of having an objective function is to provide a value based on the
model’s outputs, so optimization can be done by either maximizing or minimizing the
particular value.
In machine learning, the goal is usually to minimize prediction error. So, we
define what is called a loss function as the objective function and try to minimize this loss
function during the training phase of an ML model.
If we take a basic machine learning model such as Linear Regression, the objective
function is based on the squared error. The objective of the training is to minimize the Mean
Squared Error / Root Mean Squared Error (RMSE). The intuition behind calculating Mean
Squared Error is, the loss/ error created by a prediction given to a particular data point is based
on the difference between the actual value and the predicted value.
The loss created by a particular data point will be higher if the prediction given by the
model is significantly higher or lower than the actual value. The loss will be low when the
predicted value is very close to the actual value. As you can see, the objective function here is
not based on probabilities, but on the (squared) difference between the actual value and the
predicted value.
Here, n indicates the number of data instances in the data set, y_true is the correct/ true value
and y_predict is the predicted value (by the linear regression model).
The intuition behind Cross-Entropy Loss is this: if the probabilistic model is able to predict
the correct class of a data point with high confidence, the loss will be less. In the example we
discussed about image classification, if the model provides a probability of 1.0 to the class ‘Dog’
(which is the correct class), the loss due to that prediction = -log(P(‘Dog’)) = -log(1.0)=0.
Instead, if the predicted probability for ‘Dog’ class is 0.8, the loss = -log(0.8)= 0.097. However,
if the model provides a low probability for the correct class, like 0.3, the loss = -log(0.3)
= 0.523, which can be considered as a significant loss.
In a binary classification model based on Logistic Regression, the loss function is usually
defined using the Binary Cross Entropy loss (BCE loss).
Here y_i is the class label (1 if similar, 0 otherwise) and p(s_i) is the predicted
probability of a point being class 1 for each point ‘i’ in the dataset. N is the number of data
points. Note that as this is a binary classification problem, there are only two classes, class 1
and class 0.
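A small NumPy sketch of Binary Cross Entropy: the loss is low when the predicted probability for the true class is high, as in the dog example above. The labels and predictions are made up for illustration.

import numpy as np

def bce(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.1, 0.8, 0.3])        # assumed predicted probabilities for class 1
print(bce(y_true, p_pred))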
where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function
Steps Involved:
Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.
Step 2 - Create the probability density function and fit it on the random sample.
Observe how it fits the histogram plot.
Density Estimation: It is the process of finding out the density of the whole population by
examining a random sample of data from that population. One of the best ways to achieve
a density estimate is by using a histogram plot.
When the data does not fit a common distribution, the probability distribution is estimated
without using distribution parameters like the mean and standard deviation; such an approach
is known as ‘nonparametric density estimation’.
One of the most common nonparametric approach is known as Kernel Density
Estimation. In this, the objective is to calculate the unknown density fh(x) using the equation
given below:
where,
K -> kernel (non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel
fh(x) -> density (to calculate)
n -> no. of samples in random sample.
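A hedged sketch of kernel density estimation with a Gaussian kernel, following the formula above, fh(x) = (1/(n*h)) * Σ K((x - xi)/h); the sample, grid and bandwidth are illustrative assumptions.

import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, samples, h=0.5):
    # average of scaled kernels centred on each sample
    return np.array([gaussian_kernel((x - samples) / h).sum() / (len(samples) * h)
                     for x in x_grid])

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
x_grid = np.linspace(-4, 8, 100)
density = kde(x_grid, samples)      # estimated fh(x) over the grid; plot against the histogram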
Sequential data is an ordered collection of observations, each reflecting an observation at a
certain point in time, such as a stock price or sensor data. Text sequences, DNA sequences, and
meteorological data are examples of sequential data. In other words, video data, audio data,
and, to some extent, images can also be treated as sequential data. Below are a few basic
examples of sequential data.
Listed below are some popular machine learning applications that are based on
sequential data:
• Time Series: a challenge of predicting time series, such as stock market projections.
• Text mining and sentiment analysis are two examples of natural language
processing (e.g., Learning word vectors for sentiment analysis)
• Machine Translation: Given an input in a single language, sequence models are used to
translate the input into other languages.
• Image captioning is assessing the current action and creating a caption for the
image.
• Speech recognition using deep recurrent neural networks.
• Recurrent neural networks are being used to create classical music.
• Recurrent Neural Network for Predicting Transcription Factor Binding Sites based
on DNA Sequence Analysis
Different tasks can be achieved using RNNs, depending on the input–output structure:
One-to-one
With one input and one output, this is the classic feed-forward neural network
architecture.
One-to-many
A typical example is image captioning: we have one fixed-size image as input, and
the output can be words or phrases of varying lengths.
Many-to-one
This is used to categorize emotions. A succession of words or even paragraphs of
words is anticipated as input. The result can be a continuous-valued regression output that
represents the likelihood of having a favourable attitude.
Many-to-many
This paradigm is suitable for machine translation, such as that seen on Google
Translate. The input could be a variable-length English sentence, and the output could be
a variable-length sentence in a different language. On a frame-by-frame basis, many-to-many
models can also be used for video classification.
Another two commonly applied types of Markov model are used when the system
being represented is controlled -- that is, when the system is influenced by a decision-making
agent. These are as follows:
1. Markov decision processes. These are used to model decision-making in discrete,
stochastic, sequential environments. In these processes, an agent makes decisions
based on reliable information. These models are applied to problems in artificial
intelligence (AI), economics and behavioral sciences.
2. Partially observable Markov decision processes. These are used in cases like
Markov decision processes but with the assumption that the agent doesn't always
have reliable information. Applications of these models include robotics, where the
agent cannot fully observe the state of its environment.
The transition probabilities in a Markov model can be estimated from
empirical observations that show, for example, that the most probable transitions from first
gear are to second or neutral.
The image below represents the toss of a coin. Two states are possible: heads
and tails. The transition from heads to heads or heads to tails is equally probable (.5) and is
independent of all preceding coin tosses.
Ashok believes that the weather operates as a discrete Markov chain in which there are
only two states: the weather is either rainy or sunny. The condition of the weather cannot be
observed directly by Ashok; the weather states are hidden from
Ashok. On each day, there is a certain chance that Rahul will perform one activity from the set
of the following activities {“jog”, “work”, “clean”}, depending on the weather. Since
Rahul tells Ashok what he has done, those activities are the observations. The entire system is
that of a hidden Markov model (HMM).
Here we can say that the parameters of the HMM are known to Ashok, because he has general
information about the weather and he also knows what Rahul likes to do on average.
So let's consider a day on which Rahul calls Ashok and tells him that he has cleaned his
residence. In that scenario, Ashok will believe there is a higher chance of it being a rainy day;
that belief is the start probability of the HMM, which might look like the
following.
Now the probability distribution puts more weight on the rainy state, so we can say there
is a higher chance of the next day being rainy again, and the probabilities for the next day's
weather states are as follows:
transition_probability = {
From the above, the day-to-day changes in the weather probabilities are the transition
probabilities, and according to those probabilities the probabilities of the work that Rahul
will perform are:
emission_probability = {
These probabilities can be considered the emission probabilities. Using the emission
probabilities, Ashok can infer the state of the weather from the work Rahul reports, and using
the transition probabilities he can predict the weather, and hence the work Rahul is likely to
perform, the next day.
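A minimal sketch of the start, transition and emission dictionaries introduced above, with assumed illustrative values, followed by a simple Viterbi decoder that answers the "decoding" question (finding the most likely weather sequence for a few observed activities).

states = ("Rainy", "Sunny")
start_probability = {"Rainy": 0.6, "Sunny": 0.4}        # assumed values
transition_probability = {                              # assumed values
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission_probability = {                                # assumed values
    "Rainy": {"jog": 0.1, "work": 0.4, "clean": 0.5},
    "Sunny": {"jog": 0.6, "work": 0.3, "clean": 0.1},
}

def viterbi(observations):
    # probability of the best path ending in each state, plus the path itself
    V = [{s: start_probability[s] * emission_probability[s][observations[0]]
          for s in states}]
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * transition_probability[p][s]
                              * emission_probability[s][obs], p) for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

print(viterbi(["clean", "clean", "jog"]))   # e.g. a Rainy-Rainy-Sunny sequence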
So here from the above intuition and the example we can understand how we can use
this probabilistic model to make a prediction. Now let’s just discuss the applications where it
can be used. Depending on the situation, we usually ask three different types of questions
regarding an HMM problem.
• Likelihood: How likely are the observations based on the current model or the
probability of being at a state at a specific time step.
• Decoding: Find the internal state sequence based on the current model and
observations.
• Learning. Learn the HMM model.
UNIT – V
Neural Networks and Deep Learning
5.1 Neural networks
Neural networks are one of the most significant discoveries in history. Neural networks
can solve problems that cannot easily be solved by conventional algorithms:
• Medical Diagnosis
• Face Detection
• Voice Recognition
Neural networks are the essence of deep learning.
Neurons
Scientists agree that our brain has around 100 billion neurons. These neurons have
hundreds of billions of connections between them.
Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system.
The neurons are responsible for receiving input from the external world, for sending output
(commands to our muscles), and for transforming the electrical signals in between.
Neural Networks
Artificial Neural Networks are normally called Neural Networks (NN). Neural
networks are in fact multi-layer Perceptrons. The perceptron defines the first step into multi-
layered neural networks.
"A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."
E: Experience (the number of times).
T: The Task (driving a car).
P: The Performance (good or bad).
The term "Artificial Neural Network" is derived from Biological neural networks that
develop the structure of a human brain. Similar to the human brain that has neurons
interconnected to one another, artificial neural networks also have neurons that are
interconnected to one another in various layers of the networks. These neurons are known
as nodes.
The typical Artificial Neural Network looks something like the given figure.
Biological Neural Network      Artificial Neural Network
Dendrites                      Inputs
Synapse                        Weights
Axon                           Output
Signals are transmitted between neurons by electrical pulses (spikes) travelling along
a long thin strand called the axon. These pulses are received by the receiving neuron at terminals
called synapses. (They are found on a set of branches emerging from the cell body (soma) and
known as dendrites). These pulses lead to certain chemical activity in the dendrites which may
inhibit or excite the generation of pulses in the receiving neuron – this depends on the
geometry of the synapse and type of chemical activity. The neuron sums up or integrates the
effects of thousands of impulses over its dendritic tree and over time. If the integrated
potential exceeds a threshold, the cell ‘fires’ and generates a spike which starts to travel along
its axon. This then initiates the whole sequence of events further in the connected neurons.
The motivation behind neural networks is the human brain. The human brain is called the best
processor, even though it works more slowly than other computers. Many researchers have
therefore tried to build machines that work on the principles of the human brain.
The human brain contains billions of neurons, each connected to many other neurons to form
a network, so that when it sees an image it recognizes the image and produces an output.
Artificial Neuron
Artificial neurons are also called perceptrons. They consist of the following basic
terms:
• Input
• Weight
• Bias
• Activation Function
• Output
A. All the inputs X1, X2, X3, …, Xn are multiplied by their respective weights.
Input Layer - Input layer contains inputs and weights. Example: X1, W1, etc.
Hidden Layer - In a neural network, there can be more than one hidden layer. Hidden layer
contains the summation and activation function.
Output Layer - The output layer consists of the set of results generated by the previous layer. It
also contains the desired values, i.e. values already present in the output layer to check against
the values generated by the previous layer. These may also be used to improve the end
results.
5.3 Perceptron
The Perceptron is a machine learning algorithm for the supervised learning of various binary
classification tasks. Further, a Perceptron can also be understood as an artificial neuron or
neural network unit that helps to detect certain input data computations in business intelligence.
Perceptron model is also treated as one of the best and simplest types of Artificial Neural
networks. However, it is a supervised learning algorithm of binary classifiers. Hence, we can
consider it as a single-layer neural network with four main parameters, i.e., input values,
weights and Bias, net sum, and an activation function.
o Input Nodes or Input Layer: - This is the primary component of Perceptron which
accepts the initial data into the system for further processing. Each input node
contains a real numerical value.
o Weight and Bias: - The weight parameter represents the strength of the connection
between units. This is another most important parameter of Perceptron components.
Weight is directly proportional to the strength of the associated input neuron in
deciding the output. Further, Bias can be considered as the line of intercept in a linear
equation.
o Activation Function: - These are the final and important components that help to
determine whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.
The data scientist uses the activation function to take a subjective decision based on
various problem statements and forms the desired outputs. Activation function may differ
(e.g., Sign, Step, and Sigmoid) in perceptron models by checking whether the learning process
is slow or has vanishing or exploding gradients.
This activation function is also known as the step function and is represented by 'f'.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the weight of
input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to
shift the activation function curve up or down.
Add a special term called bias ‘b’ to this weighted sum to improve the model’s performance.
∑wi*xi + b
Step-2 - In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a continuous value, as
follows:
Y = f(∑wi*xi + b)
Perceptron Function
The perceptron function 'f(x)' is obtained by multiplying the input 'x' with the learned weight
coefficient 'w', adding the bias, and applying a threshold.
Mathematically, we can express it as follows:
f(x)=1; if w.x+b>0
otherwise, f(x)=0
o 'w' represents real-valued weights vector
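A minimal NumPy sketch of this decision rule, f(x) = 1 if w·x + b > 0, otherwise 0; the weights, bias and input values are assumed for illustration.

import numpy as np

def perceptron(x, w, b):
    weighted_sum = np.dot(w, x) + b          # ∑ wi*xi + b
    return 1 if weighted_sum > 0 else 0      # step activation function

w = np.array([0.5, -0.6, 0.2])   # learned weight coefficients (assumed)
b = 0.1                          # bias
x = np.array([1.0, 0.3, 0.7])    # one input sample

print(perceptron(x, w, b))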
Such networks are trained using back-propagation. In such cases, each hidden layer within the
network is adjusted according to the output values produced by the final layer.
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
for static output. It is useful to solve static classification issues like optical character
recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is
achieved. After that, the error is computed and propagated backward. The main difference
between the two methods is that the mapping is immediate in static back-propagation, while
it is not in recurrent backpropagation.
Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we would
update the weights and biases of the neurons on the basis of the error at the output. This
process is known as back-propagation. Activation functions make the back-propagation
possible since the gradients are supplied along with the error to update the weights and
biases.
For example: calculating the price of a house is a regression problem. The house price may take
any large or small value, so we can apply a linear activation at the output layer. Even in this
case, the neural network must have a non-linear activation function at the hidden layers.
1) Sigmoid Function:
Description: Takes a real-valued number and scales it between 0 and 1. Large
negative numbers become 0 and large positive numbers become 1
Formula: 1 /(1 + e^-x)
Range: (0,1)
Pros: As its range is between 0 and 1, it is ideal for situations where we need to
predict the probability of an event as an output.
Cons: The gradient values are significant for the range -3 to 3 but become much closer
to zero beyond this range, which almost kills the impact of the neuron on the final
output. Also, sigmoid outputs are not zero-centered (they are centred around 0.5), which
leads to undesirable zig-zagging dynamics in the gradient updates for the weights.
Plot:
2) Tanh Function:
Description: Similar to sigmoid, but takes a real-valued number and scales it between
-1 and 1. It is better than sigmoid as it is centred around 0, which leads to better
convergence.
Range: (-1,1)
Pros: The derivatives of the tanh are larger than the derivatives of the sigmoid, which
leads to faster learning.
Cons: Similar to sigmoid, the gradient values become close to zero for a wide range of
input values (this is known as the vanishing gradient problem). Thus, the network can
stop learning.
Plot:
3. Softmax Function:
Description: Returns the probability of a data point belonging to each individual class in
a multi-class classification problem.
Formula:
Pros: Can handle multiple classes and give the probability of belonging to each class
Cons: Should not be used in hidden layers as we want the neurons to be independent.
4. ReLU Function:
Description: The rectified linear activation function or ReLU for short is a piecewise
linear function that will output the input directly if it is positive, otherwise, it will output
zero. This is the default function but modifying default parameters allows us to use non-
zero thresholds and to use a non-zero multiple of the input for values below the
threshold (called Leaky ReLU).
Formula: max(0,x)
Range: (0,inf)
Pros: Although RELU looks and acts like a linear function, it is a nonlinear function
allowing complex relationships to be learned and is able to allow learning through all
the hidden layers in a deep network by having large derivatives.
Cons: It should not be used as the final output layer for either classification/regression
tasks
Plot:
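NumPy sketches of the four activation functions discussed above; the example input vector is arbitrary.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # range (0, 1)

def tanh(x):
    return np.tanh(x)                    # range (-1, 1), zero-centred

def relu(x):
    return np.maximum(0, x)              # outputs x if positive, else 0

def softmax(x):
    e = np.exp(x - np.max(x))            # subtract the max for numerical stability
    return e / e.sum()                   # probabilities summing to 1

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh(z), relu(z), softmax(z), sep="\n")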
Loss Functions
The other key aspect in setting up the neural network infrastructure is selecting the
right loss functions. With neural networks, we seek to minimize the error (difference between
actual and predicted value), which is calculated by the loss function. We will be discussing 3
popular loss functions: Mean Squared Error (MSE), Binary Cross Entropy (BCE) and Categorical
Cross Entropy (CCE).
Description: MSE loss is used for regression tasks. As the name suggests, this loss is
calculated by taking the mean of squared differences between actual(target) and predicted
values.
Formula:
Range: (0,inf)
Pros: Preferred loss function if the distribution of the target variable is Gaussian as it has good
derivatives and helps the model converge quickly
Cons: Is not robust to outliers in the data (unlike loss functions like Mean Absolute Error) and
penalizes high and low predictions exponentially (unlike loss functions like Mean Squared
Logarithmic Error Loss)
Description: BCE loss is the default loss function used for binary classification tasks.
It requires one output node to classify the data into two classes, and the range of the output is
(0–1).
Formula:
where y is the actual label and ŷ is the classifier’s predicted probability of the positive class.
Range: (0,inf)
Pros: The continuous nature of the loss function helps the training process converge well.
Cons: Can only be used with a sigmoid activation function; other loss functions, like Hinge or
Squared Hinge loss, can serve as alternatives for binary classification.
Description: Categorical Cross Entropy (CCE) loss is the default loss function for the multi-class
classification task. It requires the same number of output nodes as the classes, with the final layer going
through a softmax activation so that each output node has a probability value between (0–1).
Formula:
where y is the actual label and p is the classifier’s predicted probability distributions for
predicting the class j
Range: (0,inf)
Pros: Similar to Binary Cross Entropy, the continuous nature of the loss function helps the
training process converge well.
Cons: May require a one-hot encoded target vector with many zero values if there are many
classes, which needs significant memory (Sparse Categorical Crossentropy should be used in
this case).
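NumPy sketches of the three loss functions discussed above (MSE, BCE and CCE); the example targets and predictions are made up for illustration.

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)                 # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p, eps=1e-12):
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

print(mse(np.array([3.0, 5.0]), np.array([2.5, 5.5])))
print(binary_cross_entropy(np.array([1, 0]), np.array([0.8, 0.2])))
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.7, 0.2]])))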
There are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car
development requires millions of images and thousands of hours of video.
2. Deep learning requires substantial computing power; high-performance GPUs with a
parallel architecture are efficient for training deep networks.
CNNs eliminate the need for manual feature extraction, so you do not need to identify
features used to classify images. The CNN works by extracting features directly from images.
The relevant features are not pretrained; they are learned while the network trains on a
collection of images. This automated feature extraction makes deep learning models highly
accurate for computer vision tasks such as object classification.
CNNs learn to detect different features of an image using tens or hundreds of hidden
layers. Every hidden layer increases the complexity of the learned image features. For
example, the first hidden layer could learn how to detect edges, and the last learns how to
detect more complex shapes specifically catered to the shape of the object we are trying to
recognize.
A key advantage of deep learning networks is that they often continue to improve as
the size of your data increases.
Comparing a machine learning approach to categorizing vehicles (left) with deep learning
(right).
In a CNN, each input image passes through a sequence of convolution layers with filters
(also known as kernels), pooling layers, and fully connected layers. After that, we apply
the softmax function to classify an object with probabilistic values between 0 and 1.
Convolution Layer
Convolution layer is the first layer to extract features from an input image. By learning
image features using a small square of input data, the convolutional layer preserves the
relationship between pixels. It is a mathematical operation which takes two inputs such as
image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1.
Let's start by considering a 5*5 image whose pixel values are 0 or 1, and a 3*3 filter matrix, as
shown below.
The convolution of the 5*5 image matrix with the 3*3 filter matrix is called the "Feature Map"
and is shown as the output.
Convolving an image with different filters can perform operations such as blurring, sharpening,
and edge detection.
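A NumPy sketch of this convolution step: a 5*5 binary image convolved with a 3*3 filter produces a (5-3+1)*(5-3+1) = 3*3 feature map. The pixel and filter values are assumed, and the stride parameter relates to the next subsection.

import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

def convolve2d(img, k, stride=1):
    fh, fw = k.shape
    out_h = (img.shape[0] - fh) // stride + 1
    out_w = (img.shape[1] - fw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = img[i*stride:i*stride+fh, j*stride:j*stride+fw]
            out[i, j] = np.sum(patch * k)     # element-wise multiply and sum
    return out

print(convolve2d(image, kernel))              # the 3*3 feature map

With stride=2 the filter jumps two pixels at a time and the feature map gets smaller, which is exactly the behaviour described in the Strides section below.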
Strides
Stride is the number of pixels by which the filter is shifted over the input matrix. When the
stride is 1, we move the filter 1 pixel at a time; similarly, when the stride is 2, we move the
filter 2 pixels at a time. The following figure shows how the convolution works with a stride
of 2.
Padding
Padding plays a crucial role in building a convolutional neural network. Every convolution
shrinks the image, so if we take a neural network with hundreds of layers, we will end up with
a very small image after the final filter.
If we take a three by three filter on top of a grayscale image and do the convolving
then what will happen?
It is clear from the above picture that a pixel in the corner is covered only once, while the
middle pixel is covered more than once, meaning we use more information from the middle
pixels than from the corners. So there are two downsides:
o Shrinking outputs
o Losing information on the corner of the image.
To overcome this, we have introduced padding to an image. "Padding is an additional
layer which can add to the border of an image."
Pooling Layer
Pooling layer plays an important role in pre-processing of an image. Pooling layer
reduces the number of parameters when the images are too large. Pooling is "downscaling"
of the image obtained from the previous layers. It can be compared to shrinking an image to
reduce its pixel density. Spatial pooling is also called downsampling or subsampling, which
reduces the dimensionality of each map but retains the important information.
There are the following types of spatial pooling:
Max Pooling
Max pooling is a sample-based discretization process. Its main objective is to
downscale an input representation, reducing its dimensionality and allowing for the
assumption to be made about features contained in the sub-region binned. Max pooling is
done by applying a max filter to non-overlapping sub-regions of the initial representation.
Average Pooling
Down-scaling is performed through average pooling by dividing the input into
rectangular pooling regions and computing the average value of each region.
Syntax
layer = averagePooling2dLayer(poolSize)
layer = averagePooling2dLayer(poolSize,Name,Value)
Sum Pooling
The sub-regions for sum pooling or mean pooling are set exactly the same as for max
pooling, but instead of using the max function, we use the sum or the mean.
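A short NumPy sketch of max and average pooling with a 2*2 window over non-overlapping regions of a 4*4 feature map; the values are assumed.

import numpy as np

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 8, 1],
                 [3, 4, 9, 5]])

def pool(x, size=2, mode="max"):
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h*size, :w*size].reshape(h, size, w, size)   # split into size*size blocks
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

print(pool(fmap, mode="max"))       # [[6, 4], [7, 9]]
print(pool(fmap, mode="average"))   # [[3.75, 2.25], [4.0, 5.75]]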
In the above diagram, the feature map matrix will be converted into the vector such
as x1, x2, x3... xn with the help of fully connected layers. We will combine features to create
a model and apply the activation function such as softmax or sigmoid to classify the outputs
as a car, dog, truck, etc.
A recurrent neural network (RNN) applies the same weights and biases to all the inputs and
hidden layers to produce the output. This reduces the complexity of parameters, unlike other
neural networks.
How RNN works
The working of a RNN can be understood with the help of below example:
Example:
Suppose there is a deeper network with one input layer, three
hidden layers and one output layer. Then like other neural networks,
each hidden layer will have its own set of weights and biases, let’s
say, for hidden layer 1 the weights and biases are (w1, b1), (w2, b2)
for second hidden layer and (w3, b3) for third hidden layer. This
means that each of these layers are independent of each other, i.e.
they do not memorize the previous outputs.
The formula for calculating the current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state
Applying the activation function tanh:
ht = tanh (Whh*ht-1 + Wxh*xt)
where:
Whh -> weight at the recurrent neuron
Wxh -> weight at the input neuron
The formula for calculating the output:
Yt = Why*ht
where:
Yt -> output
Why -> weight at the output layer
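A NumPy sketch of this recurrent forward step, ht = tanh(Whh·ht-1 + Wxh·xt) and Yt = Why·ht; the dimensions, random weights and toy input sequence are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)
hidden, inp, out = 4, 3, 2
Whh = rng.normal(scale=0.1, size=(hidden, hidden))   # recurrent weights
Wxh = rng.normal(scale=0.1, size=(hidden, inp))      # input weights
Why = rng.normal(scale=0.1, size=(out, hidden))      # output weights

h = np.zeros(hidden)                                 # initial state
sequence = [rng.normal(size=inp) for _ in range(5)]  # toy input sequence

for x_t in sequence:
    h = np.tanh(Whh @ h + Wxh @ x_t)   # current state depends on the previous state
    y_t = Why @ h                      # output at time t
print(y_t)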
Prediction problems
RNNs are generally useful for sequence prediction problems. Sequence
prediction problems come in many forms and are best described by the types of inputs and
outputs they support.