ML

The document provides an overview of machine learning, including its definitions, types (supervised, unsupervised, semi-supervised, and reinforcement learning), and applications across various industries. It also covers the mathematical foundations necessary for understanding machine learning, such as linear algebra, probability, and statistics, emphasizing their importance in model development and data analysis. Additionally, it introduces concepts like conditional probability and Bayes' theorem, which are crucial for making informed predictions in machine learning.
Copyright
© © All Rights Reserved

lOMoARcPSD|46256176

ML Notes

Machine learning (RVS Institute of Management Studies)

Studocu is not sponsored or endorsed by any college or university


Downloaded by 024_CSE_ DHARSHINI.A ([email protected])
UNIT – I
Introduction and mathematical foundations

1.1 What is machine learning? Need – History and Definitions - Applications


Machine Learning is a branch of Artificial Intelligence that allows machines to learn
and improve from experience automatically. It is defined as the field of study that gives
computers the capability to learn without being explicitly programmed. This makes it quite
different from traditional programming, where the rules are written by hand.

AI (Artificial Intelligence) is a machine’s ability to perform cognitive functions as
humans do, such as perceiving, learning, reasoning, and solving problems. The benchmark for
AI is human-level performance in terms of reasoning, speech, and vision.

Machine learning is important because it gives enterprises a view of trends in
customer behavior and business operational patterns, as well as supports the development
of new products. Many of today's leading companies, such as Facebook, Google and Uber,
make machine learning a central part of their operations.

The life cycle of a Machine Learning program is straightforward and can be summarized in the
following steps:

1. Define a question
2. Collect data
3. Visualize data
4. Train the algorithm
5. Test the algorithm
6. Collect feedback
7. Refine the algorithm
8. Repeat steps 4-7 until the results are satisfactory


Types of machine learning


Classical machine learning is often categorized by how an algorithm learns to become
more accurate in its predictions. There are four basic approaches:
supervised learning, unsupervised learning, semi-supervised learning and reinforcement
learning. The type of algorithm data scientists choose to use depends on what type of data
they want to predict.

• Supervised learning: In this type of machine learning, data scientists supply
algorithms with labeled training data and define the variables they want the
algorithm to assess for correlations. Both the input and the output of the
algorithm are specified.

• Unsupervised learning: This type of machine learning involves algorithms that
train on unlabeled data. The algorithm scans through data sets looking for any
meaningful connection. The data that algorithms train on as well as the
predictions or recommendations they output are predetermined.

• Semi-supervised learning: This approach to machine learning involves a mix of the
two preceding types. Data scientists feed an algorithm a small amount of labeled
training data together with unlabeled data, and the model is free to explore the
data on its own and develop its own understanding of the data set.

• Reinforcement learning: Data scientists typically use reinforcement learning to
teach a machine to complete a multi-step process for which there are clearly
defined rules. Data scientists program an algorithm to complete a task and give it
positive or negative cues as it works out how to complete the task. But for the most
part, the algorithm decides on its own what steps to take along the way.

How does supervised machine learning work?


Supervised machine learning requires the data scientist to train the algorithm with
both labeled inputs and desired outputs. Supervised learning algorithms are good for the
following tasks:

• Binary classification: Dividing data into two categories.

• Multi-class classification: Choosing between more than two types of answers.

• Regression modeling: Predicting continuous values.

• Ensembling: Combining the predictions of multiple machine learning models to
produce an accurate prediction.


How does unsupervised machine learning work?


Unsupervised machine learning algorithms do not require data to be labeled. They sift
through unlabeled data to look for patterns that can be used to group data points into
subsets. Some types of deep learning, such as autoencoders, are unsupervised
algorithms, although neural networks are also widely trained on labeled data.
Unsupervised learning algorithms are good for the following tasks:

• Clustering: Splitting the dataset into groups based on similarity.

• Anomaly detection: Identifying unusual data points in a data set.

• Association mining: Identifying sets of items in a data set that frequently occur
together.

• Dimensionality reduction: Reducing the number of variables in a data set.

How does semi-supervised learning work?


Semi-supervised learning works by data scientists feeding a small amount of labeled
training data to an algorithm. From this, the algorithm learns the dimensions of the data set,
which it can then apply to new, unlabeled data. The performance of algorithms typically
improves when they train on labeled data sets. But labeling data can be time consuming and
expensive. Semi-supervised learning strikes a middle ground between the performance of
supervised learning and the efficiency of unsupervised learning. Some areas where semi-
supervised learning is used include:

• Machine translation: Teaching algorithms to translate language based on less
than a full dictionary of words.

• Fraud detection: Identifying cases of fraud when you only have a few positive
examples.

• Labelling data: Algorithms trained on small data sets can learn to apply data
labels to larger sets automatically.

How does reinforcement learning work?


Reinforcement learning works by programming an algorithm with a distinct goal and
a prescribed set of rules for accomplishing that goal. Data scientists also program the
algorithm to seek positive rewards -- which it receives when it performs an action that is
beneficial toward the ultimate goal -- and avoid punishments -- which it receives when it
performs an action that gets it farther away from its ultimate goal. Reinforcement learning is
often used in areas such as:

• Robotics: Robots can learn to perform tasks in the physical world using this
technique.


• Video gameplay: Reinforcement learning has been used to teach bots to play a
number of video games.

• Resource management: Given finite resources and a defined goal, reinforcement
learning can help enterprises plan out how to allocate resources.

Some important applications in which machine learning is widely used are given below:

• Healthcare
• Automation
• Banking and Finance
• Transportation and Traffic prediction
• Image recognition
• Speech recognition
• Product recommendation
• Virtual personal assistance
• Email spam and Malware detection and Filtering
• Self-driving cars
• Credit card fraud detection
• Stock market trading
• Language translation

1.2 Mathematical foundations for Machine Learning:

Certain areas of mathematics are important foundations for Machine Learning. Six math
subjects form this foundation. The subjects are intertwined, and together they are used to
develop a machine learning model and reach the “best” model for generalizing from the
dataset.


Linear Algebra
What is Linear Algebra? It is a branch of mathematics that concerns the study of
vectors and certain rules for manipulating them. When we formalize intuitive concepts,
the common approach is to construct a set of objects (symbols) and a set of rules to manipulate
these objects. This is what we know as algebra.

In machine learning, Linear Algebra is defined as the part of
mathematics that uses vector spaces and matrices to represent linear equations.

When talking about vectors, people might think back to their high school study
of vectors with direction, just like the image below.

Geometric Vector

This is a vector, but not the kind of vector discussed in the Linear Algebra for Machine
Learning. Instead, it would be this image below we would talk about.

Vector 4x1 Matrix


What we had above is also a Vector, but another kind of vector. You might be familiar
with matrix form (the image below). The vector is a matrix with only 1 column, which is known
as a column vector. In other words, we can think of a matrix as a group of column vectors or
row vectors. In summary, vectors are special objects that can be added together and multiplied
by scalars to produce another object of the same kind. We could have various objects called
vectors.

Matrix

Linear algebra itself is a systematic representation of data that computers can
understand, and all the operations in linear algebra are systematic rules. That is why
linear algebra is an important area of study in modern machine learning.

An example of how linear algebra is used is the linear equation. Linear algebra is a
tool for working with linear equations because so many problems can be presented systematically
in a linear way. The typical linear equation is presented in the form below.

Linear Equation

To solve the linear equation problem above, we use Linear Algebra to present the linear
equation in a systematic representation. This way, we can use the properties of the matrix
to look for the optimal solution.

Linear Equation in Matrix Representation
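
As a minimal sketch of the matrix representation, the Python code below solves a small system A x = b by Gaussian elimination. The system itself (2x + y = 5, x + 3y = 10) and the helper name solve_linear_system are illustrative assumptions, not taken from these notes.

```python
def solve_linear_system(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [A | b] so row operations act on both sides.
    m = [list(map(float, a[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution, starting from the last equation.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


# Hypothetical system: 2x + 1y = 5 and 1x + 3y = 10.
A = [[2, 1],
     [1, 3]]
b = [5, 10]
solution = solve_linear_system(A, b)
print(solution)  # x = 1, y = 3 up to rounding
```

In practice a numerical library would normally be used for this step; the hand-written version is only meant to show that the matrix form turns the problem into systematic row operations.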


To summarize the Linear Algebra subject, there are three terms you might want to learn
more about as a starting point:
• Vector
• Matrix
• Linear Equation

Analytic Geometry (Coordinate Geometry)

Analytic geometry is a study in which we learn the position of data (points) using an
ordered pair of coordinates. This study is concerned with defining and representing
geometrical shapes numerically and extracting numerical information from the shapes'
numerical definitions and representations. In simpler terms, we project the data onto a
plane and read numerical information from there.

Cartesian Coordinate

Above is an example of how we acquired information from the data point by projecting
the dataset into the plane. How we acquire the information from this representation is the
heart of Analytical Geometry. To help you start learning this subject, here are some important
terms you might need.

Distance Function
A distance function is a function that provides numerical information for the distance
between the elements of a set. If the distance is zero, then elements are equivalent. Else, they
are different from each other.
An example of the distance function is Euclidean Distance which calculates the linear
distance between two data points.

Euclidean Distance Equation
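
The Euclidean distance between two points p and q with n coordinates is commonly written as d(p, q) = sqrt((p1 - q1)² + ... + (pn - qn)²). A short sketch in Python follows; the function name and the sample points are illustrative assumptions.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


print(euclidean_distance((0, 0), (3, 4)))  # 5.0
print(euclidean_distance((1, 2), (1, 2)))  # 0.0, the two points are equivalent
```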


Inner Product

The inner product is a concept that introduces intuitive geometrical concepts, such as
the length of a vector and the angle or distance between two vectors. It is often denoted as
⟨x,y⟩ (or occasionally (x,y) or ⟨x|y⟩).
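
For ordinary real-valued vectors the most common inner product is the dot product, and length and angle follow from it. The sketch below assumes that choice; the helper names and sample vectors are hypothetical.

```python
import math


def dot(x, y):
    """Dot product, the standard inner product on real vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Length of a vector, defined through the inner product."""
    return math.sqrt(dot(x, x))


def angle_degrees(x, y):
    """Angle between two non-zero vectors, from <x, y> = |x| |y| cos(theta)."""
    cos_theta = dot(x, y) / (norm(x) * norm(y))
    # Clamp to [-1, 1] to guard against tiny rounding errors.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


x = (1.0, 0.0)
y = (0.0, 2.0)
print(dot(x, y))            # 0.0, the vectors are perpendicular
print(norm(y))              # 2.0
print(angle_degrees(x, y))  # approximately 90 degrees
```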

1.3 Probability in machine learning:

Probability is the bedrock of ML; it tells how likely an event is to occur. The value
of a probability always lies between 0 and 1. It is a core concept as well as a primary
prerequisite for understanding ML models and their applications.

Probability can be calculated as the number of ways the event can occur divided by the
total number of equally likely possible outcomes. Suppose we toss a coin; the probability of
getting heads can be calculated with the formula below:

P(H) = Number of ways heads can occur / Total number of possible outcomes

P(H) = 1/2

P(H) = 0.5

Where:

P(H) = Probability of getting heads when tossing a coin.

Types of Probability

To understand Probability better, it can be categorized into different
types as follows:

Empirical Probability: Empirical Probability can be calculated as the number of times the
event occurs divided by the total number of trials observed.

Theoretical Probability: Theoretical Probability can be calculated as the number of ways the
particular event can occur divided by the total number of possible outcomes.
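
The difference between the two can be sketched with a simulated coin. The code below assumes a fair coin, a fixed random seed for repeatability, and 10,000 tosses; all three are illustrative choices rather than anything prescribed by these notes.

```python
import random


def empirical_probability_of_heads(num_tosses, seed=0):
    """Toss a simulated fair coin and return the observed fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(num_tosses) if rng.random() < 0.5)
    return heads / num_tosses


theoretical = 1 / 2  # one favourable outcome out of two possible outcomes
empirical = empirical_probability_of_heads(10_000)
print(theoretical, empirical)
```

The empirical value should land close to 0.5 and typically moves closer as the number of tosses grows.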

Joint Probability: It tells the Probability of two random events occurring simultaneously.
When the two events are independent, it is the product of their individual probabilities:

P(A ∩ B) = P(A) · P(B)

Where:

P(A ∩ B) = Probability of events A and B both occurring.

P(A) = Probability of event A

P(B) = Probability of event B

Conditional Probability: It is the Probability of event A given that event B has occurred.

The Probability of an event A conditioned on an event B is denoted and defined as:

P(A|B) = P(A ∩ B) / P(B)

Similarly, P(B|A) = P(A ∩ B) / P(A). We can write the joint Probability of A and B as
P(A ∩ B) = P(A) · P(B|A), which means: "The chance of both things happening is the chance that
the first one happens, multiplied by the chance that the second one happens given that the
first one happened."
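
A small worked example may help. The sketch below enumerates the 36 outcomes of two fair dice and applies P(A|B) = P(A ∩ B) / P(B); the events chosen (A: the sum is 8, B: the first die is even) are hypothetical and only serve the illustration.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def probability(event):
    """Theoretical probability: favourable outcomes over all outcomes."""
    favourable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favourable, len(outcomes))


def sum_is_eight(outcome):      # event A
    return outcome[0] + outcome[1] == 8


def first_die_is_even(outcome):  # event B
    return outcome[0] % 2 == 0


p_b = probability(first_die_is_even)
p_a_and_b = probability(lambda o: sum_is_eight(o) and first_die_is_even(o))
p_a_given_b = p_a_and_b / p_b

print(p_a_given_b)  # 1/6
```

Here P(A ∩ B) = 3/36 and P(B) = 18/36, so P(A|B) = 1/6, which differs from P(A) = 5/36 because the two events are not independent.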

We now have the basic understanding of Probability required to learn Machine Learning.
Next, we will give a basic introduction to Statistics for ML.

Statistics in Machine Learning

Statistics is also considered a foundation of machine learning; it deals
with finding answers to the questions that we have about data. In general, we can define
statistics as:

Statistics can be categorized into 2 major parts. These are as follows:

o Descriptive Statistics
o Inferential Statistics

Use of Statistics in ML

Statistical methods are used to understand the training data as well as to interpret the
results of testing different machine learning models. Further, Statistics can be used to make
better-informed business and investing decisions.

1.4 Conditional probability and Bayesian theorem:

Conditional probabilities arise naturally in the investigation of experiments where an
outcome of a trial may affect the outcomes of the subsequent trials. We try to calculate the
probability of the second event (event B) given that the first event (event A) has already
happened. If the probability of event B changes when we take the first event into
consideration, we can safely say that the probability of event B is dependent on the occurrence
of event A.

Let’s think of cases where this happens:

• Drawing a second ace from a deck given we got the first ace
• Finding the probability of having a disease given you were tested positive
• Finding the probability of liking Harry Potter given we know the person likes fiction

And so on….

Here we can define 2 events:

• Event A is the event whose probability we are trying to calculate.
• Event B is the condition that we know or the event that has happened.

We can write the conditional probability as P(A|B), the probability of the occurrence of
event A given that B has already happened.

Bayes Theorem

Bayesian decision theory refers to the statistical approach based on quantifying the tradeoffs
among various classification decisions using the concept of
Probability (Bayes Theorem) and the costs associated with each decision.

It is basically a classification technique that involves the use of the Bayes Theorem,
which is used to find the conditional probabilities.

The Bayes theorem describes the probability of an event based on prior knowledge
of the conditions that might be related to the event. The conditional probability of A given B,
represented by P(A | B), is the chance of occurrence of A given that B has occurred.

P(A | B) = P(A,B) / P(B)

Using the chain rule, the joint probability can also be written as:

P(A,B) = P(A|B) P(B) = P(B|A) P(A)

Substituting this into the definition gives:

P(A | B) = P(B|A) P(A) / P(B)     (1)


Where,

P(B) = P(B,A) + P(B,A’) = P(B|A)P(A) + P(B|A’)P(A’)

Here, equation (1) is known as the Bayes Theorem of probability
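
Equation (1) can be sketched in a few lines of Python using the disease-test case mentioned earlier. The numbers below (1% prevalence, 95% chance of a positive test when ill, 5% chance of a positive test when healthy) are made-up assumptions chosen only to show the arithmetic.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) P(A) / P(B), with P(B) from total probability."""
    # P(B) = P(B|A) P(A) + P(B|A') P(A')
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1 - prior_a)
    return p_b_given_a * prior_a / p_b


# A = "has the disease", B = "tested positive" (hypothetical figures).
p_disease_given_positive = bayes_posterior(
    prior_a=0.01,          # P(A): prevalence of the disease
    p_b_given_a=0.95,      # P(B|A): positive test when ill
    p_b_given_not_a=0.05,  # P(B|A'): positive test when healthy
)
print(round(p_disease_given_positive, 3))  # 0.161
```

Even with a fairly accurate test, the posterior stays low here because the prior P(A) is small, which is exactly the effect the theorem makes visible.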

Our aim is to explore each of the components included in this theorem. Let’s explore
step by step:

(a) Prior or State of Nature:

• Prior probabilities represent how likely each class is to occur.
• Priors are known before the training process.
• The state of nature is a random variable whose prior probability is P(wi).
• If there are only two classes and they are exhaustive, then the priors sum to one:
P(w1) + P(w2) = 1.

(b) Class Conditional Probabilities:

• It represents how likely a feature x is to occur given that it belongs to
a particular class wi. It is denoted by P(x | wi), where x is a particular feature.
• Sometimes, it is also known as the Likelihood.
• It is the quantity that we have to evaluate during training. During the training
process, we have inputs (features) X labeled with their corresponding class w, and we figure out
the likelihood of occurrence of that set of features given the class label.


(c) Evidence:

• It is the probability of occurrence of a particular feature, i.e. P(X).
• It can be calculated using the law of total probability as $P(X) = \sum_{i=1}^{n} P(X \mid w_i)\, P(w_i)$, where n is the number of classes.
• Because it is built from the class conditional probabilities (likelihoods) and the priors,
the evidence values are also worked out during training.

(d) Posterior Probabilities:

• It is the probability of occurrence of class wi when certain features are given.
• It is what we aim to compute in the test phase, in which we have test inputs or
features (the given entity) and have to find how likely it is, according to the trained
model, that the features belong to the particular class wi.
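
The four components (prior, likelihood, evidence, posterior) fit together as sketched below for a two-class problem. The class names w1 and w2 follow the notation above, while the prior and likelihood values are invented purely for illustration.

```python
def posteriors(priors, likelihoods):
    """Return P(w_i | x) for every class from P(w_i) and P(x | w_i)."""
    # Numerator of Bayes' theorem for each class: P(x | w_i) * P(w_i).
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    # Evidence P(x): the sum of the numerators over all classes.
    evidence = sum(joint.values())
    if evidence == 0:
        raise ValueError("the feature has zero probability under every class")
    return {c: joint[c] / evidence for c in joint}


priors = {"w1": 0.6, "w2": 0.4}       # hypothetical P(w_i)
likelihoods = {"w1": 0.2, "w2": 0.5}  # hypothetical P(x | w_i) for one observed x

post = posteriors(priors, likelihoods)
predicted_class = max(post, key=post.get)

print(post)             # w1 is about 0.375, w2 about 0.625
print(predicted_class)  # w2
```

Although w1 has the larger prior, the observed feature is more likely under w2, so the posterior favours w2.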

1.6 Vector calculus and optimization:

This brings differentiation to a higher dimension. Usually, machine learning algorithms
involve more than one parameter. Sometimes, there are multiple outputs from a single
model. We typically describe such machine learning algorithms with vector functions and use
multivariate calculus to describe their behavior.

You need to know how to differentiate a vector function and how to present
the result as a vector or a matrix. This is the tool behind backpropagation algorithms in neural
network training.

In addition to Linear Algebra, Vector calculus is a key component of any Machine
Learning project. At its core, Calculus is just a very special way of thinking about large
problems by splitting them into several smaller problems.

What is a Vector?

A vector is a mathematical object that encodes a length and direction. Conceptually,
vectors can be thought of as representing a position or even a change in some mathematical
framework or space. More formally, they are elements of a vector space: a collection of objects
that is closed under an addition rule and a rule for multiplication by scalars.

A vector is often represented as a 1-dimensional array of numbers, referred to as
components, and is displayed either in column form or row form. Represented geometrically,
vectors typically represent coordinates within an n-dimensional space, where n is the number
of dimensions. A simplistic representation of a vector might be an arrow in a vector space, with
an origin, direction, and magnitude (length). Vectors are commonly used in machine learning
as they lend a convenient way to organize data. Often one of the very first steps in making a
machine learning model is vectorizing the data.


Vectors are also relied upon heavily as the basis for some machine learning
techniques. One example in particular is support vector machines. A support vector
machine analyzes vectors across an n-dimensional space to find the optimal hyperplane for a
given data set. In essence, a support vector machine will attempt to find a line (or
hyperplane) that has the maximum distance from the nearest data points of both classes.
This wide margin allows future data points to be classified with more confidence.

A vector is a data structure with at least two components, as opposed to a scalar, which
has just one. For example, a vector can represent velocity, an idea that combines speed and
direction: wind velocity = (50mph, 35 degrees North East). A scalar, on the other hand, can
represent something with one value like temperature or height: 50 degrees Celsius, 180
centimeters. Therefore, we can represent two-dimensional vectors as arrows on an x-y graph,
with the coordinates x and y each representing one of the vector’s values.

Today’s Calculus is commonly divided into two main branches: Differential
calculus and Integral calculus. Differential Calculus focuses on the concept of rate of change.
By how much (and in which direction, as we will soon see) does a certain variable change in a
determined interval of time? This is the main question being analyzed in this subfield of
calculus. Archimedes, a Greek mathematician of the third century BC (yes, it was a while ago),
is credited with introducing the underlying way of thinking to the world. While trying to figure
out how to compute the area of curved shapes (like circles or cylinders), he encountered an
interesting challenge. At the time, mathematicians had already figured out a straightforward
approach to computing the area of straight-sided shapes (like squares or triangles) but nothing
for curved shapes. So, armed with his exceptional problem-solving skills, Archimedes theorized
that he could “split” a circle into several smaller shapes and that the area of the original shape
would not be distorted in the process. First, he divided it into four smaller shapes by tracing
two straight lines as follows:

Then, into 8 smaller pieces and so on until reaching an object of similar shape to a
parallelogram:

Now, the properties of a parallelogram were well-studied at the time and so he knew
how to compute the area of such an object. If you remember from your high school geometry
classes, the area of the circle is π⋅r², where r stands for the radius. Well, after dividing the circle
into approximately 64 smaller pieces, Archimedes derived that formula by noting the similarity
between the shape of the resulting object and that of a parallelogram, whose area is the base
times the height. Incredible! This summarizes the whole idea behind Calculus. Can’t solve a
specific problem? Just split it into infinitesimally smaller pieces until the solution arises
intuitively. Riemann used the same process of thinking when trying to figure out how to
compute the area below a curve, for example. There is a reason why the great Richard
Feynman referred to Calculus as “the language in which God speaks”.
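
The splitting idea can be imitated numerically. The sketch below is a modern illustration, not Archimedes' actual construction: it cuts a circle into equal wedges, replaces each wedge with the triangle formed by its two radii, and sums the triangle areas. The function name and the wedge counts are assumptions.

```python
import math


def circle_area_from_wedges(radius, num_wedges):
    """Approximate a circle's area by summing triangles inscribed in equal wedges."""
    apex_angle = 2 * math.pi / num_wedges
    # Area of a triangle with two sides of length `radius` enclosing `apex_angle`.
    triangle_area = 0.5 * radius ** 2 * math.sin(apex_angle)
    return num_wedges * triangle_area


for n in (4, 8, 64, 1024):
    print(n, circle_area_from_wedges(1.0, n))

print(math.pi)  # exact area for radius 1, for comparison
```

With 4 wedges the estimate for a unit circle is 2.0; as the number of pieces grows, the total approaches π·r².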


The derivative is nothing more than a special limit of a continuous function, used for
calculating the slope of a tangent line, the instantaneous rate of change of a function, and the
instantaneous velocity of an object at a specific point. The formal formula to calculate said
limit is the following:
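
The standard limit definition of the derivative, f'(x) = lim (h → 0) of (f(x + h) - f(x)) / h, can be approximated on a computer by using a small but finite h. The sketch below assumes h = 1e-6 and the sample function f(x) = x²; both are illustrative choices.

```python
def numerical_derivative(f, x, h=1e-6):
    """Forward-difference estimate of f'(x) using a small finite step h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


slope = numerical_derivative(square, 3.0)
print(slope)  # close to 6, the exact derivative 2x evaluated at x = 3
```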

Multivariate Calculus
As you may have inferred from its name, multivariate calculus is the extension of
calculus to multiple dimensions (aka multiple variables). The larger the number of dimensions,
the less intuitive the concepts and the results may get. First, note that in normal calculus we
use a common 2-dimensional Cartesian plane, where we want to compute y with respect to some
input variable x. If you picture such a plane in your head you will observe that there are only
two directions in which we could go: left or right. But what would this look like in 3 dimensions?
We can now go left, right, up or down. So, to find the derivative of such a function we need to
step back for a minute, split the problem into smaller pieces and solve it in
steps by putting the pieces back together. To do this, we need a slight modification to how we
go about computing derivatives. Enter partial derivatives. This new tool allows us to compute
the derivative of our surface along one direction at a time to then get an idea of the bigger
picture. To do this, pick a variable and hold the other one constant. Holding one of the
variables constant equates to slicing a plane through the 3D object. Take the derivative of the
function with respect to the variable you chose by following the normal derivative rules.
Then, do the same thing for the other variable. Here is the plot of a function f(x,y) = x² - y²:

How on earth can we calculate the derivative at a given point for this function? Well,
let’s go step by step. First, fix the y variable and compute the partial derivative of
f(x,y) = x² - y² with respect to x to get ∂f(x,y)/∂x = 2x - 0: since y is held constant, the
derivative of y² is 0 and, following the power rule, the derivative of x² is 2x. Next, do the same
thing but fixing x instead to get ∂f(x,y)/∂y = 0 - 2y. We are left with two equations for planes
that slice our original surface, which brings the problem down from 3 to 2 dimensions, and
we know how to find the derivative at any given point of a curve in 2D. Now that we have an
intuition of what computing derivatives in higher dimensions looks like, we can introduce some
notation. Enter vectors and matrices. Vectors are mathematical objects that possess size and
direction. This means that we can represent the size and direction of an object with vectors,
for example, the distance and direction in which a car moved or the direction in which a
function changes. We can use them to denote the various variables of a multivariate function.


For example, we can write the terms of our previous function f(x,y) = x² - y² as the vector
[x², -y²]. Matrices, on the other hand, are simply a way of representing a collection of
vectors. We can use matrices to simplify the notation for expressing systems of linear
equations, where each column is a vector containing the information about a certain variable.
Moreover, in computer science, we refer to arrays with more than two dimensions as Tensors.
Technically, we can expand this terminology to refer to a scalar as a 0-dimensional Tensor, a
vector as a 1-dimensional Tensor and a matrix as a 2-dimensional Tensor. Here is an
illustration of these concepts:

The gradient vector can be interpreted as the “direction and rate of fastest
increase”. Because the gradient points in the direction of fastest increase, following it
maximizes a function. In practice, we are often concerned with finding local or global minima,
so we move along the negative gradient to make sure we are minimizing instead.
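
For the function f(x,y) = x² - y² used above, the gradient collects the two partial derivatives, (2x, -2y). The sketch below compares that analytic gradient with a finite-difference estimate at the point (1, 2); the evaluation point and the step size h are assumptions made for the illustration.

```python
def f(x, y):
    return x ** 2 - y ** 2


def gradient(x, y):
    """Analytic gradient of f: the partial derivatives with respect to x and y."""
    return (2 * x, -2 * y)


def numerical_gradient(func, x, y, h=1e-6):
    """Central-difference estimate: vary one variable while holding the other fixed."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)


g = gradient(1.0, 2.0)
approx = numerical_gradient(f, 1.0, 2.0)
descent_direction = tuple(-component for component in g)

print(g)                  # (2.0, -4.0): direction of fastest increase at (1, 2)
print(approx)             # close to (2.0, -4.0)
print(descent_direction)  # (-2.0, 4.0): direction of fastest decrease
```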

1.7 Optimization:

Optimization is the process of training the model iteratively so that a function evaluation
reaches a maximum or minimum. It is one of the most important steps in
Machine Learning for getting better results. Two important optimization algorithms are Gradient
Descent and Stochastic Gradient Descent.

MAXIMA AND MINIMA

A maximum is the largest and a minimum is the smallest value of a function within a given
range. We represent them as below:

Global Maxima and Minima: The maximum value and minimum value, respectively, over the
entire domain of the function.

Local Maxima and Minima: The maximum value and minimum value, respectively, of the
function within a given range.
A function can have only one global minimum value and one global maximum value, but it can
have more than one local minimum and maximum.

GRADIENT DESCENT
Gradient Descent is an optimization algorithm that finds a local minimum of a
differentiable function. It is a minimization algorithm that minimizes a given function.
Let’s see the geometric intuition of Gradient Descent:

Here, the minimum is at the origin (0, 0). The slope at a point is tan θ, where θ is the angle
the tangent makes with the X axis. So the slope on the right side is positive, as 0° < θ < 90°
and tan θ is a positive value. The slope on the left side is negative, as 90° < θ < 180° and
tan θ is a negative value.


One important observation in the graph is that the slope changes its sign from negative
to positive at the minimum when moving from left to right. As we move closer to the minimum,
the magnitude of the slope reduces.
So, how does the Gradient Descent Algorithm work?

Objective: Calculate X*, the local minimum of the function Y = X².

• Pick an initial point X₀ at random.

• Calculate X₁ = X₀ - r[df/dx] at X₀. r is the Learning Rate (we’ll discuss r in the Learning
Rate section). Let us take r = 1. Here, df/dx is nothing but the gradient.

• Calculate X₂ = X₁ - r[df/dx] at X₁.

• Calculate for all the points: X₁, X₂, X₃, ..., Xᵢ₋₁, Xᵢ

• General formula for calculating the local minimum: Xᵢ = Xᵢ₋₁ - r[df/dx] at Xᵢ₋₁

• When (Xᵢ - Xᵢ₋₁) is small, i.e., when Xᵢ₋₁ and Xᵢ converge, we stop the iteration and
declare X* = Xᵢ
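
The steps above can be sketched as a short loop. Note the assumptions: the starting point X₀ = 0.5 is fixed rather than random so the run is repeatable, the learning rate is r = 0.1 rather than the r = 1 used in the text, and the tolerance and iteration cap are hypothetical stopping parameters.

```python
def gradient_descent(df_dx, x0, r, tolerance=1e-8, max_iterations=10_000):
    """Iterate X_i = X_(i-1) - r * df/dx until successive points stop changing."""
    x_prev = x0
    for _ in range(max_iterations):
        x_next = x_prev - r * df_dx(x_prev)
        if abs(x_next - x_prev) < tolerance:
            return x_next  # the points have converged, declare X* = X_i
        x_prev = x_next
    return x_prev  # iteration cap reached without meeting the tolerance


def derivative_of_square(x):
    """df/dx for Y = X**2."""
    return 2 * x


x_star = gradient_descent(derivative_of_square, x0=0.5, r=0.1)
print(x_star)  # very close to 0, the minimum of Y = X**2
```

With r = 0.1 every update multiplies X by 0.8, so the sequence shrinks steadily towards the minimum at 0.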

LEARNING RATE
Learning Rate is a hyperparameter or tuning parameter that determines the step size
at each iteration while moving towards a minimum of the function. It does not have to stay
fixed: for example, if r = 0.1 in the initial step, it can be taken as r = 0.01 in the next
step. Likewise, it can be reduced exponentially as we iterate further. Such schedules are
widely used in deep learning.

What happens if we keep the r value constant:

In the above example, we took r = 1. As we calculate the points Xᵢ, Xᵢ₊₁, Xᵢ₊₂, ... to find the
local minimum X*, we can see that X oscillates between X = -0.5 and X = 0.5: with df/dx = 2X
and r = 1, each update gives Xᵢ₊₁ = Xᵢ - 2Xᵢ = -Xᵢ.

When we keep r constant at a value this large, we end up with an oscillation problem. So, we
have to reduce the r value as the iteration step increases.
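
The oscillation and its remedy can be reproduced on Y = X². The sketch below assumes a start at X₀ = 0.5, 20 iterations, and an exponential schedule in which r is multiplied by 0.9 at every step; the decay factor is an illustrative choice, not a recommended setting.

```python
def descend(x0, learning_rates):
    """Run gradient descent on Y = X**2 and return every visited point."""
    path = [x0]
    for r in learning_rates:
        x = path[-1]
        path.append(x - r * 2 * x)  # dY/dX = 2X
    return path


steps = 20
constant = descend(0.5, [1.0] * steps)                          # r stays at 1
decayed = descend(0.5, [1.0 * 0.9 ** i for i in range(steps)])  # r shrinks each step

print(constant[:5])      # [0.5, -0.5, 0.5, -0.5, 0.5]
print(abs(decayed[-1]))  # a very small number, close to the minimum at 0
```

One caveat of this design: if r shrinks too quickly the steps can become too small before the minimum is reached, so the decay rate itself needs tuning.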

The disadvantage of Gradient Descent:

• When n (the number of data points) is large, the time it takes for k iterations to calculate
the optimum vector becomes very large.
• Time Complexity: O(kn²)

STOCHASTIC GRADIENT DESCENT(SGD)


In SGD, we do not use all the data points but a sample of it to calculate the local
minimum of the function. Stochastic basically means Probabilistic. So we select points
randomly from the population.

• SGD in Logistic Regression

• Here, m is the sample of data selected randomly from the population, n


• Time Complexity: O(km²). m is significantly lesser than n. So, it takes lesser time to
compute when compared to Gradient Descent.
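
A minimal sketch of the sampling idea follows. It is an illustration under assumptions: a least-squares line y = w·x + b stands in for the logistic-regression loss mentioned above, and the function name, the noise-free data and the values of m, k and r are all hypothetical.

```python
import random


def sgd_linear_fit(points, m, k, r, seed=0):
    """Fit y = w*x + b by SGD: each of the k steps uses m randomly sampled points."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(k):
        batch = rng.sample(points, m)          # m points drawn at random, m << n
        grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / m
        grad_b = sum(2 * (w * x + b - y) for x, y in batch) / m
        w -= r * grad_w
        b -= r * grad_b
    return w, b


# Hypothetical noise-free data on the line y = 2x + 1, with n = 1000 points in [0, 1)
n = 1000
data = [(i / n, 2 * (i / n) + 1) for i in range(n)]
w, b = sgd_linear_fit(data, m=10, k=3000, r=0.1)
print(w, b)
```

Each step touches only 10 of the 1000 points, yet the estimates end up close to w = 2 and b = 1.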

1.8 Information Theory:

Information theory is concerned with representing data in a compact fashion (a task
known as data compression or source coding), as well as with transmitting and storing it in a
way that is robust to errors (a task known as error correction or channel coding). Quantifying
information is the foundation of the field of information theory.

The intuition behind quantifying information is the idea of measuring how much
surprise there is in an event. Those events that are rare (low probability) are more surprising
and therefore have more information than those events that are common (high probability).

• Low Probability Event: High Information (surprising).


• High Probability Event: Low Information (unsurprising).
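
The usual way to turn this intuition into a number is self-information, I(x) = −log₂ P(x), measured in bits. The formula is the standard definition rather than something stated in these notes, and the function name below is an assumption.

```python
import math


def information(p):
    """Self-information, in bits, of an event with probability p (0 < p <= 1)."""
    return -math.log2(p)


print(information(0.5))    # a fair coin flip: 1.0 bit
print(information(0.125))  # a 1-in-8 event: 3.0 bits
```

The rarer event carries more bits, matching the two bullet points above.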


Essentially, information theory entails two broad techniques:

1. Data Compression (source coding): More frequent events should have shorter
encodings
2. Error Correction (channel coding): The receiver should be able to infer the encoded event even if
the message is corrupted by noise


UNIT – II
Supervised Learning
2.1 Introduction – Discriminative and Generative models
Machine learning has become one of the most popular and exciting fields of
study. It gives machines the ability to learn and to become more accurate at predicting
outcomes for unseen data, i.e., data not seen before. Its ideas overlap with and draw on
Artificial Intelligence and many other related technologies.
Machine learning evolved from Pattern Recognition and from the concept that
computers can learn without being explicitly programmed to perform specific tasks. We
can use Machine Learning algorithms (e.g., Logistic Regression, Naive Bayes, etc.) to

• Recognize spoken words,
• Perform data mining, and
• Build applications that learn from data, etc.
The accuracy of these algorithms improves over time.
Machine learning models can be classified into two types: discriminative and generative
models. In simple words, a discriminative model makes
predictions on unseen data based on conditional probability and can be used for either
classification or regression problems. In contrast, a generative model focuses
on the distribution of a dataset to return a probability for a given example.
Humans can adopt either of these two approaches when learning an artificial language.
The two approaches had not previously been explored in human learning, but they are
related to known effects of causal direction, classification vs.
inference learning, and observational vs. feedback learning.
Problem Formulation
Suppose we are working on a classification problem where our task is to decide whether an
email is spam or not spam based on the words present in that email. To solve this
problem, we have a joint model over

• Labels: Y=y, and
• Features: X={x1, x2, …xn}


Therefore, the joint distribution of the model can be represented as

P(Y, X) = P(y, x1, x2, …, xn)

Now, our goal is to estimate the probability of spam email i.e, P(Y=1|X). Both
generative and discriminative models can solve this problem but in different ways.
Let’s see why and how they are different!

The approach of Generative Models


In the case of generative models, to find the conditional probability P(Y|X), they
estimate the prior probability P(Y) and the likelihood P(X|Y) with the help of
the training data and use Bayes' theorem to calculate the posterior probability P(Y|X):

$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$

The approach of Discriminative Models


In the case of discriminative models, to find the probability, they directly assume some
functional form for P(Y|X) and then estimate the parameters of P(Y|X) with the help of the
training data.

What are Discriminative Models?


The discriminative model refers to a class of models used in statistical classification,
mainly in supervised machine learning. These models are also known
as conditional models, since they model the conditional probability P(Y|X) and thereby learn
the boundaries between classes or labels in a dataset.
Discriminative models (just as in the literal meaning) separate classes instead of modeling
how the data itself is distributed, and they make few assumptions about the data points. However, these
models are not capable of generating new data points. Therefore, the ultimate objective of
discriminative models is to separate one class from another.
If some outliers are present in the dataset, discriminative models work
better than generative models, i.e., discriminative models are more robust to outliers.
However, one major drawback of these models is the misclassification problem, i.e.,
wrongly classifying a data point.


Training a discriminative classifier involves estimating a function f: X -> Y, or the probability P(Y|X):

• Assume some functional form for the probability P(Y|X)
• With the help of training data, we estimate the parameters of P(Y|X)
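
The two steps can be sketched with logistic regression, one of the discriminative models listed in this section. This is a minimal illustration under assumptions: the functional form P(Y=1|x) = sigmoid(w·x + b), the one-dimensional feature, the learning rate and the number of steps are all chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, r=0.5, steps=500):
    """Estimate w, b in P(Y=1|x) = sigmoid(w*x + b) by gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [y - sigmoid(w * x + b) for x, y in zip(xs, ys)]
        w += r * sum(e * x for e, x in zip(errors, xs)) / n
        b += r * sum(errors) / n
    return w, b


# Hypothetical 1-D feature: negative values are "not spam" (0), positive values are "spam" (1)
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(xs, ys)
print(sigmoid(w * 1.2 + b))   # estimated P(Y=1 | x = 1.2)
```

Nothing here models how x itself is distributed; only the parameters of P(Y|X) are learned.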
Some Examples of Discriminative Models

• Logistic regression
• Support Vector Machines (SVMs)
• Traditional neural networks
• Nearest neighbor
• Conditional Random Fields (CRFs)
• Decision Trees and Random Forest

What are Generative Models?


Generative models are a class of statistical models that can generate
new data instances. These models are often used in unsupervised machine learning to
perform tasks such as

• Probability and likelihood estimation,
• Modeling data points,
• Describing the phenomenon in data,
• Distinguishing between classes based on these probabilities.
Because these models learn the joint probability and then rely on Bayes' theorem to obtain
the posterior, generative models tackle a more complex task than analogous
discriminative models.
So, Generative models focus on the distribution of individual classes in a dataset and
the learning algorithms tend to model the underlying patterns or distribution of the data
points. These models use the concept of joint probability and create the instances where a
given feature (x) or input and the desired output or label (y) exist at the same time.
These models use probability estimates and likelihood to model data points and
differentiate between different class labels present in a dataset. Unlike discriminative models,
these models are also capable of generating new data points.


Mathematical things involved in Generative Models


Training a generative classifier involves estimating a function f: X -> Y, or the probability P(Y|X):

• Assume some functional form for the probabilities such as P(Y), P(X|Y)
• With the help of training data, we estimate the parameters of P(X|Y), P(Y)
• Use the Bayes theorem to calculate the posterior probability P(Y |X)
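
The three steps can be sketched with Naïve Bayes on the spam problem. This is a minimal illustration under assumptions: the four training emails are invented, words are treated as conditionally independent given the label, and add-one (Laplace) smoothing is added so that unseen word/label pairs do not give a zero probability.

```python
from collections import Counter


def train_naive_bayes(emails):
    """Estimate P(Y) and per-class word counts for P(X|Y) from (text, label) pairs."""
    class_counts = Counter(label for _, label in emails)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in emails:
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocab


def posterior_spam(text, class_counts, word_counts, vocab):
    """Bayes' theorem: P(Y=1|X) = P(X|Y=1) P(Y=1) / sum over y of P(X|Y=y) P(Y=y)."""
    total = sum(class_counts.values())
    joint = {}
    for label in class_counts:
        p = class_counts[label] / total                      # prior P(Y)
        n_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocab:                                # ignore words never seen in training
                # likelihood P(word|Y) with add-one smoothing
                p *= (word_counts[label][word] + 1) / (n_words + len(vocab))
        joint[label] = p                                     # P(X, Y)
    return joint[1] / sum(joint.values())


# Hypothetical training emails: label 1 = spam, 0 = not spam
emails = [
    ("win money now", 1),
    ("win prize money", 1),
    ("meeting at noon", 0),
    ("project meeting notes", 0),
]
model = train_naive_bayes(emails)
print(posterior_spam("win money", *model))       # about 0.9
print(posterior_spam("meeting notes", *model))   # about 0.14
```

The model never fits P(Y|X) directly; the posterior is derived from the estimated P(Y) and P(X|Y).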
Some Examples of Generative Models

• Naïve Bayes
• Bayesian networks
• Markov random fields
• Hidden Markov Models (HMMs)
• Latent Dirichlet Allocation (LDA)
• Generative Adversarial Networks (GANs)
• Autoregressive Model

Difference between Discriminative and Generative Models


Let’s see some of the differences between Discriminative and Generative Models.
Core Idea
Discriminative models draw boundaries in the data space, while generative models try
to model how data is placed throughout the space. A generative model focuses on explaining
how the data was generated, while a discriminative model focuses on predicting the labels of
the data.
Mathematical Intuition
In mathematical terms, a discriminative model is trained
by learning parameters that maximize the conditional probability P(Y|X), while
a generative model learns parameters by maximizing the joint probability P(X, Y).
Applications
Discriminative models recognize existing data, i.e., discriminative modeling identifies
tags, sorts data, and can be used to classify data, while generative modeling produces
new data.
Since these models take different approaches to machine learning, each is suited
to specific tasks: generative models are useful for unsupervised learning tasks, while
discriminative models are useful for supervised learning tasks.
Outliers
Outliers have more impact on generative models than on discriminative models.

Computational Cost
Discriminative models are computationally cheap as compared to generative models.
Comparison between Discriminative and Generative Models


Let’s see some of the comparisons based on the following criteria between Discriminative and
Generative Models:

• Performance
• Missing Data
• Accuracy Score
• Applications
Based on Performance
Generative models need less data to train than discriminative models,
since generative models are more biased: they make stronger assumptions, such as the assumption
of conditional independence.
Based on Missing Data
In general, if we have missing data in our dataset, generative models can work
with the missing data, while discriminative models can't. This is because, in
generative models, we can still estimate the posterior by marginalizing over the unseen
variables. For discriminative models, however, we usually require all the features X to be
observed.
Based on Accuracy Score
If the assumption of conditional independence is violated, generative
models are less accurate than discriminative models.
Based on Applications
Discriminative models are called “discriminative” since they are useful for
discriminating Y’s label, i.e., the target outcome, so they are limited to prediction problems such as classification,
while generative models have more applications besides classification, such as

• Samplings,
• Bayes learning,
• MAP inference, etc.

2.2 Linear Regression

Linear regression is one of the easiest and most popular Machine Learning algorithms.
It is a statistical method that is used for predictive analysis. Linear regression makes
predictions for continuous/real or numeric variables such as sales, salary, age, product
price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) and
one or more independent (x) variables, hence the name linear regression. Because the
relationship is linear, the model describes how the value of the dependent
variable changes with the value of the independent variable.


The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:

y= a0+a1x+ ε

Here,

y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The values of the x and y variables form the training dataset used to fit the Linear Regression
model.

Types of Linear Regression

Linear regression can be further divided into two types of the algorithm:
o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, then such a Linear Regression algorithm is called Simple Linear
Regression.
o Multiple Linear regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, then such a Linear Regression algorithm is called
Multiple Linear Regression.


Linear Regression Line

A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of relationship:
o Positive Linear Relationship:
If the dependent variable (Y-axis) increases as the independent variable (X-axis)
increases, the relationship is termed a positive linear relationship.

o Negative Linear Relationship:
If the dependent variable (Y-axis) decreases as the independent variable (X-axis)
increases, the relationship is called a negative linear relationship.

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line,
meaning that the error between the predicted values and the actual values should be minimized. The best
fit line has the least error.


Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line. To
do this we use a cost function.
Cost function-
o The cost function is used to estimate the values of the coefficients
for the best fit line.
o Minimizing the cost function optimizes the regression coefficients or weights. It measures how well a
linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which
maps the input variable to the output variable. This mapping function is also known
as the hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is
the average of the squared errors between the predicted values and the actual values. For
the above linear equation, MSE can be calculated as:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(Y_i - (a_1 x_i + a_0)\bigr)^2$$

Where,

N = Total number of observations
Yi = Actual value
(a1xi + a0) = Predicted value

Residuals: The distance between the actual value and the predicted value is called the residual. If
the observed points are far from the regression line, the residuals will be large, and so the
cost function will be high. If the scatter points are close to the regression line, the residuals
will be small, and hence so will the cost function.
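
As a small check on these definitions, the residuals and the MSE of a candidate line can be computed directly. The function name and the five observations below are assumptions made for illustration.

```python
def mean_squared_error(xs, ys, a0, a1):
    """MSE of the line y = a0 + a1*x: the average of the squared residuals."""
    residuals = [y - (a1 * x + a0) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)


# Hypothetical observations lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
print(mean_squared_error(xs, ys, a0=1, a1=2))   # the line passes through every point
print(mean_squared_error(xs, ys, a0=0, a1=2))   # every residual is 1
```

The line that passes through every point has zero cost; shifting it by one unit raises each residual, and the cost, to 1.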
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts from randomly selected coefficient values and then iteratively updates
them to reach the minimum of the cost function.
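
A minimal sketch of this procedure for the line y = a0 + a1x is shown below. The function name, learning rate r, step count and data are assumptions, and the coefficients start at zero rather than at random values so that the run is repeatable.

```python
def fit_line(xs, ys, r=0.05, steps=2000):
    """Minimise the MSE of y = a0 + a1*x by gradient descent on (a0, a1)."""
    a0, a1 = 0.0, 0.0          # zeros instead of random starting values, for repeatability
    n = len(xs)
    for _ in range(steps):
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        grad_a0 = 2 * sum(errors) / n                              # dMSE/da0
        grad_a1 = 2 * sum(e * x for e, x in zip(errors, xs)) / n   # dMSE/da1
        a0 -= r * grad_a0
        a1 -= r * grad_a1
    return a0, a1


# Hypothetical observations lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a0, a1 = fit_line(xs, ys)
print(a0, a1)
```

For this data the estimates approach a0 = 1 and a1 = 2, the coefficients of the line the points were generated from.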


Model Performance:

The goodness of fit determines how well the regression line fits the set of observations.
The process of finding the best model out of various models is called optimization. It can be
achieved by the method below:

1. R-squared method:
o R-squared is a statistical measure of the goodness of fit.
o It measures the strength of the relationship between the dependent and independent
variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted
values and the actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of multiple
determination for multiple regression.
o It can be calculated from the formula below:

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}} = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$
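
A common way to compute R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses that standard form; the function name and the sample values are assumptions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


actual = [3, 5, 7, 9, 11]
print(r_squared(actual, [3, 5, 7, 9, 11]))   # perfect predictions: 1.0
print(r_squared(actual, [7, 7, 7, 7, 7]))    # always predicting the mean: 0.0
```

A value of 1 corresponds to 100% on the scale above, and 0 means the model does no better than predicting the mean.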

2.3 Least Squares Method


The least-squares regression method is a technique commonly used in regression
analysis. It is a mathematical method used to find the best fit line that represents
the relationship between an independent and a dependent variable.
To understand the least-squares regression method, let's get familiar with the concepts
involved in formulating the line of best fit.

What is the Line Of Best Fit?

The line of best fit is drawn to represent the relationship between 2 or more variables. To
be more specific, the best fit line is drawn across a scatter plot of data points in order to
represent the relationship between those data points.
Regression analysis makes use of mathematical methods such as least squares to
obtain a definite relationship between the predictor variable(s) and the target variable. The
least-squares method is one of the most effective ways to draw the line of best fit. It is
based on the idea that the sum of the squares of the errors must be minimized as far as
possible, hence the name least-squares method.
If we were to plot the best fit line that depicts the sales of a company over
a period of time, it would look something like this:


Notice that the line is as close as possible to all the scattered data points. This is what
an ideal best fit line looks like. To better understand the whole process let’s see how to
calculate the line using the Least Squares Regression.

Steps to calculate the Line of Best Fit


To start constructing the line that best depicts the relationship between variables in
the data, we first need to get our basics right. Take a look at the equation below:

y = mx + c

Surely, you've come across this equation before. It is a simple equation that represents
a straight line in 2-dimensional data, i.e., along the x-axis and y-axis. To better understand this,
let's break down the equation:

• y: dependent variable
• m: the slope of the line
• x: independent variable
• c: y-intercept
So the aim is to calculate the values of slope, y-intercept and substitute the
corresponding ‘x’ values in the equation in order to derive the value of the dependent
variable.

Let’s see how this can be done.


As an assumption, let's consider that there are 'n' data points.
Step 1: Calculate the slope 'm' by using the following formula:

$$m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

Step 2: Compute the y-intercept (the value of y at the point where the line crosses the y-axis):

$$c = \frac{\sum y - m\sum x}{n}$$

Step 3: Substitute the values in the final equation:

$$y = mx + c$$


Now let’s look at an example and see how you can use the least-squares regression
method to compute the line of best fit.

Least Squares Regression Example


Consider an example. Tom, who is the owner of a retail shop, recorded the price of
different T-shirts against the number of T-shirts sold at his shop over a period of one week.
He tabulated the data as shown below:

Let us use the concept of least squares regression to find the line of best fit for the
above data.
Step 1: Calculate the slope ‘m’ by using the following formula:

After you substitute the respective values, m = 1.518 approximately.


Step 2: Compute the y-intercept value

After you substitute the respective values, c = 0.305 approximately.


Step 3: Substitute the values in the final equation

Once you substitute the values, it should look something like this:

y = 1.518x + 0.305


Let’s construct a graph that represents the y=mx + c line of best fit:

Now Tom can use the above equation to estimate how many T-shirts priced at $8
he can sell at the retail shop.
y = 1.518 × 8 + 0.305 = 12.45 T-shirts
This rounds to about 12 T-shirts. That's how simple it is to make predictions using
Linear Regression.
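
The three steps can be sketched in Python using the standard least-squares formulas for the slope and intercept. The price and sales figures below are assumed values, not taken from Tom's table; they were chosen because they reproduce m ≈ 1.518 and c ≈ 0.305. The function name is also an assumption.

```python
def least_squares_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # Step 1: slope
    c = (sum_y - m * sum_x) / n                                    # Step 2: y-intercept
    return m, c


# Assumed data (not from the notes): T-shirt prices and number of T-shirts sold
prices = [2, 3, 5, 7, 9]
sold = [4, 5, 7, 10, 15]
m, c = least_squares_line(prices, sold)
print(round(m, 3), round(c, 3))
print(round(m * 8 + c, 2))   # Step 3: predicted sales at a price of $8
```

For this assumed table the fitted line agrees with the equation in the worked example, and the prediction at $8 is about 12.45.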

2.4 Underfitting and Overfitting


Overfitting and Underfitting in Machine Learning

Overfitting and Underfitting are the two main problems that occur in machine learning
and degrade the performance of the machine learning models.

The main goal of each machine learning model is to generalize well.


Here, generalization is the ability of an ML model to produce suitable output for
a given set of previously unseen inputs. It means that after being trained on the dataset, the model
can produce reliable and accurate output. Hence, underfitting and overfitting are the two
conditions that need to be checked to assess the performance of the model and whether it is
generalizing well or not.

Before understanding overfitting and underfitting, let's understand some basic
terms that will help in understanding this topic well:


o Signal: The true underlying pattern of the data that the machine
learning model should learn.
o Noise: Unnecessary and irrelevant data that reduces the performance of the
model.
o Bias: A prediction error introduced into the model by oversimplifying
the machine learning algorithm. It can also be described as the difference between the predicted values
and the actual values.
o Variance: The sensitivity of the model to the particular training set. When a model performs
well on the training dataset but does not perform well on the test dataset, it has high variance.

Overfitting

Overfitting occurs when our machine learning model tries to cover all the data points,
or more than the required data points, present in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these factors reduce
the efficiency and accuracy of the model. The overfitted model has low bias and high
variance.

The chance of overfitting increases the more we train our model: the longer the
training, the more likely the model is to become overfitted.

Overfitting is the main problem that occurs in supervised learning.

Example: The concept of the overfitting can be understood by the below graph of the linear
regression output:

As we can see from the above graph, the model tries to cover all the data points
present in the scatter plot. It may look efficient, but in reality it is not. The goal of
the regression model is to find the best fit line, but here we have not got a best fit, so the
model will generate prediction errors.


How to avoid the Overfitting in Model

Both overfitting and underfitting degrade the performance of the machine
learning model. Overfitting is the more common cause, and there are several ways to
reduce its occurrence in our model:
o Cross-Validation
o Training with more data
o Removing features
o Early stopping the training
o Regularization
o Ensembling

Underfitting

Underfitting occurs when our machine learning model is not able to capture the
underlying trend of the data. To avoid overfitting, the feeding of training data
can be stopped at an early stage, but then the model may not learn enough from the
training data. As a result, it may fail to find the best fit of the dominant trend in the data.

In the case of underfitting, the model is not able to learn enough from the training
data, which reduces its accuracy and produces unreliable predictions. An underfitted
model has high bias and low variance.

Example: We can understand the underfitting using below output of the linear regression
model:

As we can see from the above diagram, the model is unable to capture the data points present
in the plot.
How to avoid underfitting:
o By increasing the training time of the model.
o By increasing the number of features.


Goodness of Fit

The term "goodness of fit" is taken from statistics, and the goal of machine
learning models is to achieve a good fit. In statistical modeling, it describes how closely
the predicted values match the true values of the dataset.
The model with a good fit lies between the underfitted and the overfitted model.
Ideally it makes predictions with 0 errors, but in practice this is difficult to achieve.
When we train our model for a while, the errors on the training data go down, and
the same happens with the test data. But if we train the model for too long, its
performance may decrease due to overfitting, as the model also learns the
noise present in the dataset. The errors on the test dataset then start increasing, so the point just
before the errors start to rise is the good point, and we can stop there to obtain a good
model. There are two other methods by which we can find a good point for our model:
the resampling method to estimate model accuracy, and a validation dataset.
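
The contrast between an underfitted, a well-fitted and an overfitted model can be illustrated with three simple stand-in models. This is a sketch under assumptions: a constant (mean) predictor plays the underfitted model, a least-squares line the good fit, and a 1-nearest-neighbour predictor that memorises the training points plays the overfitted model. The data (signal y = 2x + 1 with alternating +1/−1 noise on the training targets) is hypothetical.

```python
def mse(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


def mean_model(train):
    """Underfits: ignores x and always predicts the average training target."""
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg


def line_model(train):
    """Least-squares straight line through the training points."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    slope = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
    return lambda x: slope * x + (my - slope * mx)


def nearest_neighbour_model(train):
    """Overfits: memorises every training point, noise included."""
    return lambda x: min(train, key=lambda point: abs(point[0] - x))[1]


# Training targets carry noise; test targets are the clean signal at unseen x values
train = [(i, 2 * i + 1 + (1 if i % 2 == 0 else -1)) for i in range(10)]
test = [(i + 0.4, 2 * (i + 0.4) + 1) for i in range(9)]

for name, build in [("mean", mean_model), ("line", line_model), ("1-NN", nearest_neighbour_model)]:
    model = build(train)
    print(name, round(mse(model, train), 3), round(mse(model, test), 3))
```

The memorising model has zero training error but a larger test error than the line, while the constant model does poorly on both sets.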

2.5 Cross validation:


Cross-validation is a technique for validating model efficiency by training the model on a
subset of the input data and testing it on a previously unseen subset of the input data. We can also
say that it is a technique to check how a statistical model generalizes to an independent
dataset.
In machine learning, there is always a need to test the stability of the model. This
means we cannot judge the model based only on the dataset it was trained on. For
this purpose, we reserve a particular sample of the dataset that is not part of the training
dataset. After that, we test our model on that sample before deployment, and this complete
process comes under cross-validation. This is something different from the general train-test
split.
Hence the basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Train the model using the training dataset.
o Evaluate model performance using the validation set. If the model performs well
on the validation set, proceed to the next step; otherwise check for issues.

Methods used for Cross-Validation


There are some common methods that are used for cross-validation. These methods
are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation


Validation Set Approach


In the validation set approach, we divide our input dataset into a training set and a test or
validation set. Each subset is given 50% of the dataset.
Its big disadvantage is that we use only 50% of the dataset to train
our model, so the model may fail to capture important information in the dataset. It
also tends to give an underfitted model.

Leave-P-out cross-validation
In this approach, p data points are left out of the training data. It means that if there are
n data points in total in the original input dataset, then n-p data points are used as the training
dataset and the p data points as the validation set. This complete process is repeated for every
possible choice of p points, and the average error is calculated to know the effectiveness of the model.
The disadvantage of this technique is that it can be computationally expensive:
the number of possible splits, C(n, p), grows very quickly as p increases beyond 1.
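
The cost of the leave-p-out approach can be seen by enumerating the splits. This is a small sketch; the generator name is an assumption, and `math.comb` requires Python 3.8 or later.

```python
from itertools import combinations
from math import comb


def leave_p_out_splits(n, p):
    """Yield (train_indices, validation_indices) for every way of holding out p of n points."""
    indices = range(n)
    for held_out in combinations(indices, p):
        train = [i for i in indices if i not in held_out]
        yield train, list(held_out)


splits = list(leave_p_out_splits(n=5, p=2))
print(len(splits), comb(5, 2))       # 10 splits, matching C(5, 2)
print(comb(100, 2), comb(100, 5))    # how fast the number of splits grows for n = 100
```

Each split requires fitting the model once, so the count of splits is the number of training runs.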

Leave one out cross-validation


This method is similar to leave-p-out cross-validation, but instead of p data points we
take only 1 data point out of training. It means that, in this approach, for each learning set only one
data point is reserved, and the remaining dataset is used to train the model. This process
repeats for each data point. Hence for n samples, we get n different training sets and n test sets.
It has the following features:
o In this approach, the bias is minimal, as almost all the data points are used for training.
o The process is executed n times; hence execution time is high.
o This approach leads to high variation in testing the effectiveness of the model, as we
iteratively check against one data point.

K-Fold Cross-Validation
The K-fold cross-validation approach divides the input dataset into K groups of samples of
equal size. These groups are called folds. For each learning set, the prediction function uses
K-1 folds for training, and the remaining fold is used as the test set. This is a very popular CV
approach because it is easy to understand, and the output is less biased than with other methods.

The steps for k-fold cross-validation are:

o Split the input dataset into K groups
o For each group:
    o Take one group as the reserve or test data set.
    o Use the remaining groups as the training dataset.
    o Fit the model on the training set and evaluate the performance of the model
    using the test set.
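
The steps above can be sketched as follows. The function names and the `fit`/`error` callables are assumptions made for illustration; in practice the data is usually shuffled before splitting, which is omitted here to keep the folds easy to read.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n samples."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_score(points, k, fit, error):
    """Fit on k-1 folds, evaluate on the held-out fold, and average over the k folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(points), k):
        model = fit([points[i] for i in train_idx])
        scores.append(error(model, [points[i] for i in test_idx]))
    return sum(scores) / k


for train_idx, test_idx in k_fold_splits(n=10, k=5):
    print(test_idx, train_idx)
```

Setting k equal to the number of samples turns this into leave-one-out cross-validation.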


Let's take an example of 5-fold cross-validation. The dataset is grouped into 5
folds. On the 1st iteration, the first fold is reserved for testing the model, and the rest are used to train
the model. On the 2nd iteration, the second fold is used to test the model, and the rest are used to
train the model. This process continues until each fold has been used as the test fold.

Consider the below diagram:

Stratified k-fold cross-validation


This technique is similar to k-fold cross-validation with a few small changes. It
works on the concept of stratification, which is the process of rearranging the data to ensure that
each fold or group is a good representative of the complete dataset. It is one of the best
approaches for dealing with bias and variance.
It can be understood with an example of housing prices, where the price of some
houses can be much higher than that of other houses. To tackle such situations, a stratified
k-fold cross-validation technique is useful.
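
One simple way to stratify is to deal the samples of each group out to the folds in turn. The sketch below is an illustration under assumptions: the function name is hypothetical, and the continuous house prices are assumed to have already been binned into "low" and "high" labels.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds so each fold keeps roughly the overall label mix."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)   # deal each group out round-robin
    return folds


# Hypothetical labels: 8 ordinary houses ("low") and 4 expensive ones ("high")
labels = ["low"] * 8 + ["high"] * 4
for fold in stratified_folds(labels, k=4):
    print(fold, [labels[i] for i in fold])
```

Every fold ends up with two "low" houses and one "high" house, the same 2:1 mix as the full dataset.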

Holdout Method
This method is the simplest cross-validation technique of all. In this method, we
remove a subset of the training data and use it to get prediction results from a model
trained on the rest of the dataset.

The error that occurs in this process tells how well our model will perform on an
unknown dataset. Although this approach is simple to perform, it still faces the issue of high
variance, and it sometimes produces misleading results.

Comparison of Cross-validation to train/test split in Machine Learning


o Train/test split: The input data is divided into two parts, a training set and a test
set, in a ratio such as 70:30 or 80:20. The resulting estimate has high variance, which is one of
its biggest disadvantages.
    o Training Data: The training data is used to train the model, and the dependent
    variable is known.
    o Test Data: The test data is used to make predictions from the model that
    is already trained on the training data. It has the same features as the training
    data but is not part of it.


o Cross-Validation dataset: It is used to overcome the disadvantage of the train/test split
by splitting the dataset into groups of train/test splits and averaging the result. It can
be used if we want to optimize a model that has been trained on the training dataset
for the best performance. It makes more efficient use of the data than a train/test split, as every
observation is used for both training and testing.

Limitations of Cross-Validation

There are some limitations of the cross-validation technique, which are given below:
o Under ideal conditions, it provides the optimum output. But for inconsistent data,
it may produce drastically different results. This is one of the big disadvantages of
cross-validation, as there is no certainty about the type of data in machine learning.
o In predictive modeling, the data evolves over time, which can lead to
differences between the training set and the validation sets. For example, if we create a model
to predict stock market values and train it on the previous 5
years of stock values, the realistic future values for the next 5 years may be drastically
different, so it is difficult to expect correct output in such situations.

2.6 Lasso regression


Whenever we hear the term "regression," two things that come to mind are linear
regression and logistic regression. Even though logistic regression falls under the
category of classification algorithms, it still comes to mind.

These two topics are quite famous and are the basic introduction topics in Machine
Learning. There are other types of regression, like

• Lasso regression,
• Ridge regression,
• Polynomial regression,
• Stepwise regression,
• ElasticNet regression

The above-mentioned techniques are mainly used for regression problems.

When we increase the degrees of freedom of a regression model (for example, by adding
polynomial terms to the equation), it tends to overfit. Using regularization techniques, we
can overcome the overfitting issue.

Two popular methods for that are lasso and ridge regression. In our ridge regression
article, we explained the theory behind ridge regression and also covered its
implementation in Python.

What Is Regression?

Regression is a statistical technique used to determine the relationship between one
dependent variable and one or more independent variables. In simple words, a regression
analysis will tell you how your result varies with different factors.

For example,
What determines a person's salary?
Many factors, like educational qualification, experience, skills, job role, company, etc.,
play a role in salary.

You can use regression analysis to predict the dependent variable – salary using the
mentioned factors.

Y = mX + c

Do you remember this equation from our school days?

It is nothing but a linear regression equation. In the above equation, the independent
variable is used to estimate the dependent variable.

In mathematical terms,

• Y is the dependent value,
• X is the independent value,
• m is the slope of the line,
• c is the constant value.

The same terms are named slightly differently in machine learning and the
statistical world.

• Y is the predicted value,
• X is the feature value,
• m is the coefficient or weight,
• c is the bias value.

The line in the above graph represents the linear regression model. You can see how
well the model fits the data. It looks like a good model, but sometimes a model fits the
training data too closely, resulting in overfitting.

To create the line (red) from the actual values, the regression model iteratively
recalculates the m (coefficient) and c (bias) values, trying to reduce the value of a suitable
loss function.
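
As a rough sketch of this iterate-and-recalculate loop, the snippet below fits m and c by gradient descent on the mean squared error. The function name fit_line, the learning rate, the number of epochs, and the toy data are all assumptions chosen for illustration; the notes do not prescribe a particular optimizer or loss.

```python
# Gradient-descent sketch for Y = mX + c (standard library only).
# Assumption: mean squared error is used as the loss function.

def fit_line(xs, ys, learning_rate=0.01, epochs=5000):
    m, c = 0.0, 0.0                        # start from an arbitrary line
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to m and c.
        grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
        # Move m and c a small step in the direction that lowers the loss.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]            # generated from y = 2x + 1
m, c = fit_line(xs, ys)
print(round(m, 3), round(c, 3))
```

On this noise-free toy data the loop recovers a slope close to 2 and a bias close to 1.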

An overfitted model has low bias and high variance. The model fits the training data
well, but it will not give good predictions on test data. Regularization comes into play to
tackle this issue.

What Is Regularization?
Regularization addresses the problem of overfitting. Overfitting causes low model
accuracy on unseen data. It happens when the model learns the noise in the training set
along with the underlying pattern. Noise is random variation in the training set that
doesn't represent the actual properties of the data.

Y ≈ C_0 + C_1 X_1 + C_2 X_2 + … + C_p X_p

In the above linear regression equation, Y represents the dependent variable, X represents
the independent variables, and C represents the coefficient estimates for the different
variables.
The model fitting involves a loss function known as the residual sum of squares (RSS). The
coefficients in the equation are chosen so as to reduce the loss function to a minimum
value. Poor coefficients get selected if there is a lot of irrelevant data in the training set.
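
For reference, the sum-of-squares loss mentioned above is usually written as follows. The indices are assumptions added here for clarity: n is the number of training observations, y_i is the actual value of observation i, and x_{ij} is the value of feature j for observation i.

```latex
\mathrm{RSS} = \sum_{i=1}^{n} \Bigl( y_i - C_0 - \sum_{j=1}^{p} C_j x_{ij} \Bigr)^2
```

Ordinary linear regression picks C_0, …, C_p to make this quantity as small as possible.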

Definition Of Lasso Regression


Lasso regression is like linear regression, but it uses a technique called "shrinkage", in
which the regression coefficients are shrunk towards zero. Linear regression gives you
regression coefficients as observed in the dataset. Lasso regression allows you to shrink
or regularize these coefficients to avoid overfitting and make them work better on different
datasets.

This type of regression is used when the dataset shows high multicollinearity or when
you want to automate variable elimination and feature selection.

When To Use Lasso Regression?

Choosing a model depends on the dataset and the problem statement you are dealing
with. It is essential to understand the dataset and how features interact with each other.

Lasso regression penalizes less important features of your dataset and makes their
respective coefficients zero, thereby eliminating them. Thus it provides you with the benefit
of feature selection and simple model creation. So, if the dataset has high dimensionality and
high correlation, lasso regression can be used.

The Statistics Of Lasso Regression

Fig: Statistics of lasso regression

d1, d2, d3, etc., represent the distances between the actual data points and the model
line in the above graph. Least-squares is the sum of the squares of these distances
between the points and the plotted curve.
In linear regression, the best model is the one that minimizes the least-squares.
While performing lasso regression, we add a penalizing factor to the least-squares.
That is, the model is chosen so as to reduce the loss function below to a minimal value.

D = \sum_{i} d_i^2 + \lambda \sum_{j=1}^{p} |C_j|

That is, D = least-squares + lambda * summation (absolute values of the coefficients).

The lasso penalty involves all the estimated coefficients. Lambda can take any value from
zero to infinity. This value decides how aggressively regularization is performed. It is
usually chosen using cross-validation.
Lasso penalizes the sum of the absolute values of the coefficients. As the lambda value
increases, the coefficients shrink and eventually become zero. This way, lasso regression
eliminates insignificant variables from our model. Our regularized model may have a slightly
higher bias than linear regression but less variance for future predictions.
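
The shrinking effect of lambda can be seen in the simplest possible case: one feature and no intercept, where the lasso loss (sum of squared errors plus lambda times the absolute coefficient) has a closed-form minimizer. The sketch below assumes exactly that simplified setting; the function name lasso_one_feature and the toy data are hypothetical, and real implementations handle many features iteratively.

```python
# Lasso with a single feature and no intercept (standard library only).
# Loss assumed: sum((y - b*x)^2) + lam * |b|, minimized in closed form
# by "soft-thresholding" the ordinary least-squares solution.

def lasso_one_feature(xs, ys, lam):
    rho = sum(x * y for x, y in zip(xs, ys))      # correlation of feature and target
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(rho) - lam / 2.0, 0.0)       # shrink towards zero, never past it
    return (shrunk if rho >= 0 else -shrunk) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                              # exactly y = 2x
for lam in [0.0, 14.0, 28.0, 56.0]:
    print(lam, lasso_one_feature(xs, ys, lam))
```

With lambda = 0 the result is the ordinary least-squares coefficient 2.0; as lambda grows the coefficient drops to 1.5, then 1.0, and is exactly 0.0 from lambda = 56 onwards, which is how lasso removes a variable from the model.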

2.7 – Classification
As the name suggests, Classification is the task of “classifying things” into sub-categories.
But, by a machine! If that doesn’t sound like much, imagine your computer being
able to differentiate between you and a stranger. Between a potato and a tomato. Between
an A grade and an F. Now it sounds interesting. In Machine Learning and Statistics,
Classification is the problem of identifying to which of a set of categories (subpopulations)
a new observation belongs, on the basis of a training set of data containing observations
whose category membership is known.

Types of Classification
Classification is of two types:
1. Binary Classification: We have to categorize the given data into 2 distinct
classes. Example – On the basis of the given health conditions of a person, we have
to determine whether the person has a certain disease or not.
2. Multiclass Classification: The number of classes is more than 2. Example –
On the basis of data about different species of flowers, we have to determine
which species our observation belongs to.

Fig: Binary and Multiclass Classification. Here x1 and x2 are the variables upon which the
class is predicted.

How does classification work?

Suppose we have to predict whether a given patient has a certain disease or not, on
the basis of 3 variables, called features.
This means there are two possible outcomes:
1. The patient has the said disease. Basically, a result labeled “Yes” or “True”.
2. The patient is disease-free. A result labeled “No” or “False”.

This is a binary classification problem. We have a set of observations called the
training data set, which comprises sample data with actual classification results. We train a
model, called a Classifier, on this data set, and use that model to predict whether a certain
patient will have the disease or not.
The outcome thus depends upon:
1. How well these features are able to “map” to the outcome.
2. The quality of our data set. By quality, I refer to statistical and mathematical
qualities.
3. How well our Classifier generalizes this relationship between the features and
the outcome.
4. The values of the features x1 and x2.
Following is the generalized block diagram of the classification task.

Generalized Classification Block Diagram.


1. X: Pre-classified data, in the form of an N*M matrix. N is the number of observations
and M is the number of features.
2. y: An N-d vector corresponding to the known classes for each of the N
observations.
3. Feature Extraction: Extracting valuable information from input X using a series of
transforms.
4. ML Model: The “Classifier” we’ll train.
5. y’: Labels predicted by the Classifier.
6. Quality Metric: Metric used for measuring the performance of the model.
7. ML Algorithm: The algorithm that is used to update weights w’, which updates
the model and “learns” iteratively.

Types of Classifiers (algorithms)


There are various types of classifiers. Some of them are :
• Linear Classifiers: Logistic Regression
• Tree-Based Classifiers: Decision Tree Classifier
• Support Vector Machines
• Artificial Neural Networks
• Bayesian Regression
• Gaussian Naive Bayes Classifiers
• Stochastic Gradient Descent (SGD) Classifier
• Ensemble Methods: Random Forests, AdaBoost, Bagging Classifier, Voting
Classifier, ExtraTrees Classifier

Learners in Classification Problems:

In the classification problems, there are two types of learners:


1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives
the test dataset. Classification is then done on the basis of the most closely
related data stored in the training dataset. It takes less time in training but more time
for predictions.
Example: K-NN algorithm, Case-based reasoning
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. In contrast to lazy learners, eager learners take
more time in learning and less time in prediction. Example: Decision Trees, Naïve
Bayes, ANN.

Evaluating a Classification model:


Once our model is completed, it is necessary to evaluate its performance, whether it is a
Classification or a Regression model. For evaluating a Classification model, we have the
following ways:

1. Log Loss or Cross-Entropy Loss:


o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.
o For a good binary Classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted value deviates from the actual value.
o A lower log loss indicates more accurate predicted probabilities.
o For Binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

Where y = actual output, p = predicted output.
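
The formula above can be checked with a short sketch. The function name binary_log_loss, the averaging over samples, and the small clipping constant eps (which avoids log(0)) are assumptions added for illustration.

```python
import math

# Average binary cross-entropy over a list of samples (standard library only).
# Assumption: probabilities are clipped to [eps, 1 - eps] to avoid log(0).

def binary_log_loss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

confident = binary_log_loss([1, 0], [0.9, 0.1])   # predictions close to the labels
unsure = binary_log_loss([1, 0], [0.6, 0.4])      # predictions nearer 0.5
print(confident, unsure)
```

The confident, correct predictions give the smaller loss, in line with the bullet points above.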

2. Confusion Matrix:
o The confusion matrix gives us a matrix/table as output and describes the
performance of the model.
o It is also known as the error matrix.
o The matrix summarizes the prediction results, giving the total
number of correct predictions and incorrect predictions. The matrix looks like the
table below:

                     Actual Positive    Actual Negative
Predicted Positive   True Positive      False Positive
Predicted Negative   False Negative     True Negative
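
A minimal sketch of how the four cells of this table are counted is given below. The function name confusion_counts, the 0/1 label encoding, and the sample labels are assumptions made for illustration.

```python
# Count the four confusion-matrix cells for binary labels (1 = positive, 0 = negative).

def confusion_counts(y_true, y_pred):
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # True Positive
        elif predicted == 1 and actual == 0:
            fp += 1          # False Positive
        elif predicted == 0 and actual == 1:
            fn += 1          # False Negative
        else:
            tn += 1          # True Negative
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cells = confusion_counts(y_true, y_pred)
accuracy = (cells["TP"] + cells["TN"]) / len(y_true)
print(cells, accuracy)
```

Here the correct predictions are the TP and TN cells, and the incorrect ones are the FP and FN cells.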

3. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristic curve and AUC stands
for Area Under the Curve.
o It is a graph that shows the performance of the classification model at different
thresholds.
o The ROC curve is defined for binary classification; to visualize the performance of a
multi-class classification model, one curve is usually drawn per class (one-vs-rest).
o The ROC curve is plotted with TPR (True Positive Rate) on the Y-axis
and FPR (False Positive Rate) on the X-axis.
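
Each point on the ROC curve comes from choosing one threshold and computing the two rates. The sketch below assumes a binary problem with predicted scores between 0 and 1; the function name roc_point and the sample data are hypothetical.

```python
# One ROC point (FPR, TPR) for a given decision threshold (standard library only).

def roc_point(y_true, scores, threshold):
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)     # True Positive Rate, plotted on the Y-axis
    fpr = fp / (fp + tn)     # False Positive Rate, plotted on the X-axis
    return fpr, tpr

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
for threshold in [0.9, 0.5, 0.3, 0.0]:
    print(threshold, roc_point(y_true, scores, threshold))
```

Lowering the threshold moves the point from (0, 0) towards (1, 1); joining the points for all thresholds gives the curve, and the area under it is the AUC.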

Use cases of Classification Algorithms


Classification algorithms can be used in many areas. Below are some popular use
cases of Classification Algorithms:
o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drug Classification
o Biometric Identification, etc.

2.8 Support vector machine


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
it is primarily used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put a new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model that
can accurately identify whether it is a cat or a dog. Such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn about the different features of cats and dogs, and then we test it with this strange creature.

The support vector machine creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors), so it will look at the extreme cases of
cat and dog. On the basis of the support vectors, it will classify the creature as a cat.
Consider the below diagram:

SVM algorithm can be used for Face detection, image classification, text
categorization, etc.

Types of SVM

SVM can be of two types:


o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes
in n-dimensional space, but we need to find the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.
The dimensions of the hyperplane depend on the number of features present in the dataset,
which means if there are 2 features (as shown in the image), then the hyperplane will be a
straight line. And if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the
maximum distance between the hyperplane and the nearest data points of each class.

Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed Support Vectors. Since these vectors support the
hyperplane, they are called support vectors.

How does SVM work?

Linear SVM:
The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two
features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in
either green or blue. Consider the below image:

Since this is a 2-d space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the below
image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the points from both
classes that lie closest to the line. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin. And the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
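
To make the terms margin and support vector concrete, the sketch below measures the distance of each point to a given line w1*x1 + w2*x2 + b = 0. The line is hand-picked rather than learned, and the points, the names signed_distance, green and blue are all assumptions for illustration; an actual SVM would search for the w and b that maximize the margin.

```python
import math

# Distance of points to a separating line and the resulting margin.
# Assumption: the line x1 + x2 - 5 = 0 is hand-picked, not learned by an SVM.

def signed_distance(w, b, point):
    dot = sum(wi * xi for wi, xi in zip(w, point))
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (dot + b) / norm      # the sign tells on which side of the line the point lies

w, b = [1.0, 1.0], -5.0
green = [(1.0, 2.0), (2.0, 1.0), (1.0, 1.0)]
blue = [(4.0, 3.0), (3.0, 4.0), (5.0, 5.0)]

distances = {p: signed_distance(w, b, p) for p in green + blue}
margin = min(abs(d) for d in distances.values())
# Support vectors: the points whose distance equals the margin.
support_vectors = [p for p, d in distances.items() if abs(abs(d) - margin) < 1e-9]
print(margin, support_vectors)
```

All green points fall on the negative side and all blue points on the positive side, and the closest points of each class are the ones reported as support vectors.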

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third dimension z.
It can be calculated as:
z = x^2 + y^2

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the
below image:

Since we are in 3-d space, the boundary looks like a plane parallel to the x-axis. If we
convert it to 2-d space with z = 1, then it will become:

Hence we get a circle of radius 1 in the case of non-linear data.
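
The effect of the extra dimension can be checked numerically. In the sketch below the two classes (named inner and outer here) and the helper lift are made-up illustrations: one class lies close to the origin and the other lies further out, so no straight line in the x-y plane separates them.

```python
# Lifting 2-d points with z = x^2 + y^2 so that a flat threshold separates them.
# Assumption: the sample points are invented for illustration.

def lift(point):
    x, y = point
    return x * x + y * y          # the third dimension z

inner = [(0.5, 0.0), (0.0, -0.5), (-0.3, 0.4)]    # class near the origin
outer = [(2.0, 0.0), (0.0, -2.0), (-1.5, 1.5)]    # class far from the origin

z_inner = [lift(p) for p in inner]
z_outer = [lift(p) for p in outer]
print(z_inner, z_outer)

# In the lifted space the plane z = 1 separates the classes; back in 2-d
# that plane corresponds to the circle x^2 + y^2 = 1.
separable = all(z < 1 for z in z_inner) and all(z > 1 for z in z_outer)
print(separable)
```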

2.9 Kernel Methods in SVM


A Kernel Function is a method that takes data as input and transforms it into the
form required for processing. The term “Kernel” refers to the set of mathematical functions
used in a Support Vector Machine that provide a window through which to manipulate the
data. A Kernel Function generally transforms the training data so that a non-linear decision
surface becomes a linear one in a higher-dimensional space. Basically,
it returns the inner product between two points in that feature space.

Major Kernel Functions:

To implement Kernel Functions, we first have to install the “scikit-learn”
library using the command prompt terminal:

pip install scikit-learn

• Gaussian Kernel: It is used to perform the transformation when there is no prior
knowledge about the data.
• Gaussian Kernel Radial Basis Function (RBF): Same as the above kernel function,
with the radial basis method added to improve the transformation.
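
The notes do not give the kernel formulas, so the sketch below uses the standard textbook definition of the RBF kernel, K(x, y) = exp(-gamma * ||x - y||^2), as an assumption; the Gaussian kernel is the same expression with gamma = 1 / (2 * sigma^2). It is written in plain Python, without scikit-learn, just to show what the kernel value measures; the function name rbf_kernel and the sample points are hypothetical.

```python
import math

# RBF (Gaussian) kernel between two points (standard library only).
# Assumption: K(x, y) = exp(-gamma * squared Euclidean distance).

def rbf_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

a, b, c = [1.0, 2.0], [1.5, 2.0], [4.0, 6.0]
print(rbf_kernel(a, a))   # identical points give the maximum value 1.0
print(rbf_kernel(a, b))   # nearby points give a value close to 1
print(rbf_kernel(a, c))   # distant points give a value close to 0
```

The kernel therefore behaves like a similarity score, which is what lets the SVM work in a higher-dimensional space without computing the transformed coordinates explicitly.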

2.10 Instance based methods


Machine Learning systems categorized as instance-based
learning are systems that learn the training examples by heart and then generalize to
new instances based on some similarity measure. It is called instance-based because it
builds the hypotheses from the training instances. It is also known as memory-based
learning or lazy learning. The time complexity of this approach depends on the size of the
training data. In the worst case, classifying a single new instance takes O(n) time, where
n is the number of training instances.
For example, if we were to create a spam filter with an instance-based learning
algorithm, instead of just flagging emails that are already marked as spam, our spam
filter would be programmed to also flag emails that are very similar to them. This requires
a measure of resemblance between two emails. A similarity measure between two emails
could be the same sender, the repeated use of the same keywords, or something else.
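
One possible similarity measure for the spam example is the share of words two emails have in common (the Jaccard similarity). This particular measure, the threshold of 0.5, and the names jaccard_similarity, known_spam and looks_like_spam are assumptions chosen for illustration; the notes leave the choice of measure open.

```python
# Instance-based spam check using word overlap (standard library only).
# Assumption: Jaccard similarity on lower-cased words, threshold 0.5.

def jaccard_similarity(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

known_spam = ["win a free prize now", "cheap loans approved today"]

def looks_like_spam(email, threshold=0.5):
    # Compare the new email with every stored spam example (O(n) comparisons).
    return any(jaccard_similarity(email, spam) >= threshold for spam in known_spam)

print(looks_like_spam("win a free prize today"))     # shares most words with a spam email
print(looks_like_spam("meeting moved to friday"))    # shares no words with any spam email
```

Nothing is learned ahead of time here: the stored examples themselves act as the model, which is why the cost of each check grows with the number of stored instances.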
Advantages:
1. Instead of estimating the target function for the entire instance set, local
approximations can be made to it.
2. The algorithm can easily adapt to new data that is collected as we go.
Some of the instance-based learning algorithms are :
1. K Nearest Neighbor (KNN)
2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)

2.11 K Nearest Neighbors


o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the
Supervised Learning technique.
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to it among the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can be easily classified into
a well-suited category by using the K-NN algorithm.
o The K-NN algorithm can be used for Regression as well as for Classification, but mostly it is
used for Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead, it stores the dataset and, at the time of classification, performs
an action on the dataset.

o At the training phase, the KNN algorithm just stores the dataset, and when it gets new data,
it classifies that data into the category that is most similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog,
and we want to know whether it is a cat or a dog. For this identification, we can use the
KNN algorithm, as it works on a similarity measure. Our KNN model will find the features
of the new image that are similar to the cat and dog images, and based on the most similar
features, it will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?


Suppose there are two categories, i.e., Category A and Category B, and we have a new
data point x1. In which of these categories will this data point lie? To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular data point. Consider the below diagram:

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of neighbors.
o Step-2: Calculate the Euclidean distance from the new data point to each training data point.
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these K neighbors, count the number of data points in each
category.
o Step-5: Assign the new data point to the category for which the number of
neighbors is maximum.
o Step-6: Our model is ready.

Suppose we have a new data point and we need to put it in the required category.
Consider the below image:

o Firstly, we will choose the number of neighbors, so we will choose k=5.
o Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in
geometry. For two points (X1, Y1) and (X2, Y2), it can be calculated as:

d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2}

o By calculating the Euclidean distance, we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B. Consider the below
image:

o As we can see, 3 of the 5 nearest neighbors are from category A, hence this new data point
is assigned to category A.
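
The steps of this worked example can be sketched in plain Python as follows. The function name knn_predict and the training points are invented for illustration; they are arranged so that, with k=5, three of the nearest neighbors belong to category A and two to category B, as in the example above.

```python
import math
from collections import Counter

# K-NN classification following Steps 1 to 5 (standard library only).
# Assumption: the training points below are made up for illustration.

def knn_predict(train_points, train_labels, query, k=5):
    distances = []
    for point, label in zip(train_points, train_labels):
        # Step 2: Euclidean distance from the new point to each training point.
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]    # Step 3
    votes = Counter(nearest_labels)                            # Step 4
    return votes.most_common(1)[0][0]                          # Step 5

points = [(1, 1), (2, 2), (3, 3), (2, 3), (5, 5), (5, 4), (6, 6), (7, 7)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
new_point = (3.6, 3.4)

print(knn_predict(points, labels, new_point, k=5))
```

Note that there is no separate training step: the stored points are used directly at prediction time, which is the lazy-learner behaviour described earlier.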

How to select the value of K in the K-NN Algorithm?

Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try some
values to find the best among them. A commonly used value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and make the model sensitive to
outliers.
o Large values for K reduce the effect of noise, but they increase the computation and can
blur the boundaries between categories.

Advantages of KNN Algorithm:


o It is simple to implement.
o It is robust to noisy training data when K is sufficiently large.
o It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

o The value of K always needs to be determined, which may sometimes be complex.
o The computation cost is high because the distance from the new data point to every
training sample has to be calculated.

2.12 Tree based methods – Decision Trees


Tree-based machine learning methods are among the most commonly used supervised
learning methods. They are built from two kinds of entities: branches and nodes. Tree-based
ML methods are built by recursively splitting a training sample, using at each node the
feature from the dataset that splits the data most effectively. The splitting is based on
learning simple decision rules inferred from the training data. Generally, tree-based ML
methods are simple and intuitive; to predict a class label or value, we start from the top of
the tree, the root, and follow the branches down through the nodes by comparing feature
values against the decision rule at each node.
Tree-based methods predict using the mean (for continuous targets) or the mode (for
categorical targets) of the training observations in the region to which an observation
belongs. Since the set of rules used to segment the predictor space can be summarized in a
visual representation with branches that show all the possible outcomes, these approaches
are commonly referred to as decision tree methods. The methods are flexible and can be
applied to either classification or regression problems. Classification and Regression
Trees (CART) is a term introduced by Leo Breiman, referring to the flexibility of the
methods in solving both linear and non-linear predictive modeling problems.

Types of Decision Trees

Decision trees can be classified based on the type of target or response variable.
i. Classification Trees
The default type of decision tree, used when the response variable is categorical, e.g.
predicting whether a team will win or lose a game.

ii. Regression Trees
Used when the target variable is continuous or numerical in nature, e.g. predicting
house prices based on year of construction, number of rooms, etc.

Advantages of Tree-based Machine Learning Methods

1. Interpretability: Decision tree methods are easy to understand even for non-
technical people.
2. The data type isn’t a constraint, as the methods can handle both categorical and
numerical variables.
3. Data exploration — Decision trees help us easily identify the most significant
variables and their correlation.

Disadvantages of Tree-based Machine Learning Methods

1. Large decision trees are complex, time-consuming and less accurate in predicting
outcomes.
2. Decision trees don’t fit well for continuous variables, as they lose important
information when segmenting the data into different regions.

i) Root node — this represents the entire population or the sample, which gets divided into
two or more homogenous subsets.
ii) Splitting — subdividing a node into two or more sub-nodes.
iii) Decision node — this is when a sub-node is divided into further sub-nodes.
iv) Leaf/Terminal node — this is the final/last node that we consider for our model output. It
cannot be split further.
v) Pruning — removing unnecessary sub-nodes of a decision node to combat overfitting.
vi) Branch/Sub-tree — the sub-section of the entire tree.
vii) Parent and Child node — a node that’s subdivided into a sub-node is a parent, while the
sub-node is the child node.

2.13 Classification and Regression Tree (CART)


CART (Classification And Regression Tree) is a variation of the decision tree
algorithm. It can handle both classification and regression tasks. Scikit-Learn uses the
Classification And Regression Tree (CART) algorithm to train Decision Trees (also called
“growing” trees). CART was first introduced by Leo Breiman, Jerome Friedman, Richard
Olshen, and Charles Stone in 1984.

CART Algorithm
CART is a predictive algorithm used in Machine learning, and it explains how the
target variable’s values can be predicted based on other variables. It is a decision tree where
each fork is a split on a predictor variable and each leaf node holds a prediction for the
target variable.
In the decision tree, nodes are split into sub-nodes on the basis of a threshold value
of an attribute. The root node is taken as the training set and is split into two by considering
the best attribute and threshold value. Further, the subsets are also split using the same
logic. This continues until every sub-set in the tree is pure or the tree reaches the maximum
number of leaves allowed.

The CART algorithm works via the following process:


• The best split point of each input is obtained.
• Based on the best split points of each input in Step 1, the new “best” split point
is identified.
• The chosen input is split according to the “best” split point.
• Splitting continues until a stopping rule is satisfied or no further desirable splitting
is available.

The CART algorithm uses Gini impurity to split the dataset into a decision tree. It does
that by searching for the best homogeneity of the sub-nodes, with the help of the Gini
index criterion.
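
The search for the best split point of a single numeric input can be sketched as below. The sketch assumes that candidate thresholds are the midpoints between consecutive sorted values and that a split is scored by the size-weighted Gini impurity of the two sub-nodes; the function names and the toy data (ages, bought) are hypothetical.

```python
# Best binary split of one numeric input by weighted Gini impurity.
# Assumptions: midpoints as candidate thresholds, made-up toy data.

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the most homogeneous split."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (threshold, score)
    return best

ages = [22, 25, 30, 35, 40, 50]
bought = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split(ages, bought))
```

On this toy data the split at 32.5 produces two pure sub-nodes, so its weighted impurity is 0. A full CART implementation would repeat this search over every input and then recurse on the two sub-nodes.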

Gini index/Gini impurity


The Gini index is a metric for the classification tasks in CART. It is computed as 1 minus
the sum of the squared probabilities of each class. It measures the probability that a
randomly chosen element would be wrongly classified if it were labeled at random according
to the class distribution, and it is a variation of the Gini coefficient. It works on
categorical target variables, treats each outcome as either “success” or “failure”, and
hence CART conducts binary splitting only.
The degree of the Gini index lies between 0 and 1,
• Where 0 depicts that all the elements belong to a certain class, or only one
class exists there.
• A value approaching 1 signifies that the elements are randomly distributed
across a large number of classes, and
• A value of 0.5 denotes that the elements are uniformly distributed over two classes.

Mathematically, we can write Gini Impurity as follows:

Gini = 1 - \sum_{i=1}^{n} (p_i)^2

where p_i is the probability of an object being classified to a particular class and n is the
number of classes.
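
A few concrete values may help in reading this measure. The sketch below estimates each class probability as its share of the labels in a node; the function name gini_impurity and the sample labels are assumptions for illustration.

```python
from collections import Counter

# Gini impurity of the labels in one node: 1 minus the sum of squared class shares.

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["yes", "yes", "yes", "yes"]))   # one class only
print(gini_impurity(["yes", "yes", "no", "no"]))     # two classes, evenly mixed
print(gini_impurity(["a", "b", "c"]))                # three classes, evenly mixed
```

A pure node scores 0, an even two-class mix scores 0.5, and an even three-class mix scores about 0.667, so the value rises towards 1 only as the number of evenly mixed classes grows.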

Classification tree
A classification tree is an algorithm where the target variable is categorical. The
algorithm is then used to identify the “Class” within which the target variable is most likely
to fall. Classification trees are used when the dataset needs to be split into the classes of
the response variable (like yes or no).

Regression tree
A Regression tree is an algorithm where the target variable is continuous and the
tree is used to predict its value. Regression trees are used when the response variable is
continuous, for example, when the response variable is the temperature of the day.

CART model representation


CART models are formed by picking input variables and evaluating split points on
those variables until an appropriate tree is produced.
Steps to create a Decision Tree using the CART algorithm:
• Greedy algorithm: The input space is divided using a greedy method
known as recursive binary splitting. This is a numerical procedure in
which all of the values are lined up and several candidate split points are tried and
assessed using a cost function; the split with the lowest cost is chosen.
• Stopping Criterion: As it works its way down the tree with the training data, the
recursive binary splitting method described above must know when to stop
splitting. The most common stopping rule is to require a minimum number of
training instances in every leaf node. If the count is smaller than the
specified threshold, the split is rejected and the node is taken as a final
leaf node.
• Tree pruning: A decision tree’s complexity is defined as the number of splits in the
tree. Trees with fewer branches are recommended, as they are simple to grasp
and less prone to overfit the data. Working through each leaf node in the tree
and evaluating the effect of deleting it using a hold-out test set is the quickest
and simplest pruning approach.
• Data preparation for the CART: No special data preparation is required for the
CART algorithm.
Advantages of CART
• Results are simple to interpret.
• Classification and regression trees are nonparametric and nonlinear.
• Classification and regression trees implicitly perform feature selection.
• Outliers have no meaningful effect on CART.
• It requires minimal supervision and produces easy-to-understand models.
Limitations of CART
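• Overfitting.
• High variance.
• Low bias on the training data, which in a deep tree comes together with the high variance above.
• The tree structure may be unstable: small changes in the data can produce a different tree.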
Applications of the CART algorithm
• For quick Data insights.
• In Blood Donors Classification.
• For environmental and ecological data.
• In the financial sectors.

2.14 Ensemble Methods


Ensemble learning helps improve machine learning results by combining several
models. This approach can give better predictive performance than
a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to
vote.
Advantage: Improvement in predictive accuracy.
Disadvantage: It is difficult to understand an ensemble of classifiers.

Why do ensembles work?

Dietterich (2002) showed that ensembles overcome three problems –

• Statistical Problem – The Statistical Problem arises when the hypothesis space is
too large for the amount of available data. Hence, there are many hypotheses
with the same accuracy on the data and the learning algorithm chooses only one
of them. There is a risk that the accuracy of the chosen hypothesis is low on
unseen data.


• Computational Problem – The Computational Problem arises when the learning
algorithm cannot guarantee finding the best hypothesis.
• Representational Problem – The Representational Problem arises when the
hypothesis space does not contain any good approximation of the target
class(es).

Main Challenge for Developing Ensemble Models?


The main challenge is not to obtain highly accurate base models, but rather to obtain
base models which make different kinds of errors. For example, if ensembles are used for
classification, high accuracies can be accomplished if different base models misclassify
different training examples, even if each base classifier is only modestly better than chance.
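A small calculation makes this point concrete. The sketch below assumes an idealized setting that real ensembles only approximate: a binary task, an odd number n of base classifiers, each correct with the same probability p, and errors that are fully independent of one another. The function name `majority_vote_accuracy` is chosen here for illustration.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are right.

    Assumes an odd n, a binary task, and that every classifier is correct
    with the same probability p, independently of the others.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Accuracy of the vote for a growing ensemble of 60%-accurate classifiers.
for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under these assumptions the vote improves as classifiers are added when p is above 0.5 and gets worse when p is below 0.5, which is why base models must be better than chance and must err on different examples.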

Methods for Independently Constructing Ensembles –


• Majority Vote
• Bagging and Random Forest
• Randomness Injection
• Feature-Selection Ensembles
• Error-Correcting Output Coding
Methods for Coordinated Construction of Ensembles –
• Boosting
• Stacking

Types of Ensemble Classifier –


Ensemble techniques are classified into three types:

1. Bagging
2. Boosting
3. Stacking

Bagging
Consider a scenario where you are looking at the users’ ratings for a product. Instead
of relying on one user’s good/bad rating, we consider the average rating given to the product.
With the average rating, we can be considerably more sure of the quality of the product. Bagging makes
use of this principle. Instead of depending on one model, it runs the data through multiple
models in parallel and averages their outputs to form the final output.

What is Bagging? How does it work?

• Bagging is an acronym for Bootstrap Aggregation. Bootstrapping means
random selection of records with replacement from the training dataset. ‘Random
selection with replacement’ can be explained as follows:

a. Consider that there are 8 samples in the training dataset. Out of these 8 samples,
every weak learner gets 5 samples as training data for the model. These 5 samples
need not be unique or non-repetitive.
b. The model (weak learner) is allowed to get a sample multiple times. For example,
as shown in the figure, Rec5 is selected 2 times by the model. Therefore, weak
learner 1 gets Rec2, Rec5, Rec8, Rec5, Rec4 as training data.
c. All the samples remain available for selection by the next weak learners. Thus all 8 samples
will be available for the next weak learner, and any sample can be selected multiple
times by the next weak learners.
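Selection with replacement can be tried out directly. The sketch below mirrors the example of 8 records and 5 draws per weak learner; the helper name `bootstrap_sample` and the fixed seed are assumptions made here for repeatability, and the records actually drawn will differ from those in the figure.

```python
import random

def bootstrap_sample(records, size, rng):
    """Draw `size` records with replacement, so repeats are allowed."""
    return [rng.choice(records) for _ in range(size)]

records = [f"Rec{i}" for i in range(1, 9)]   # the 8 training samples
rng = random.Random(0)                        # fixed seed, only for repeatability
for learner in range(1, 4):
    # Every learner draws from the full set of 8 records again.
    print(f"weak learner {learner}:", bootstrap_sample(records, 5, rng))
```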

• Bagging is a parallel method, which means several weak learners learn the data
pattern independently and simultaneously. This can be best shown in the below
diagram:

1. The outputs of the weak learners are averaged to generate the final output of the model.
2. Since the weak learners’ outputs are averaged, this mechanism helps to reduce
variance or variability in the predictions. However, it does not help to reduce the bias
of the model.
3. Since the final prediction is an average of the output of each weak learner,
each weak learner has an equal say or weight in the final output.


Boosting
We saw that in bagging every model is given equal preference, but if one model
predicts data more correctly than the other, then higher weightage should be given to this
model over the other. Also, the model should attempt to reduce bias. These concepts are
applied in the second ensemble method that we are going to learn, that is Boosting.

What is Boosting?

1. To start with, boosting assigns equal weights to all data points as all points are
equally important in the beginning. For example, if a training dataset has N
samples, it assigns weight = 1/N to each sample.
2. The weak learner classifies the data. It classifies some samples
correctly while making mistakes in classifying others.
3. After classification, sample weights are changed. The weight of each correctly classified
sample is reduced, and the weight of each incorrectly classified sample is increased. Then
the next weak classifier is run on the reweighted data.
4. This process continues until the model as a whole gives strong predictions.
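One round of the reweighting in steps 1 to 3 can be sketched as follows. The notes describe boosting generically; the specific update used here is the AdaBoost rule, which is an assumption of this sketch, and the names `reweight` and `alpha` are illustrative.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style round: returns (learner_weight, new_sample_weights).

    weights -- current sample weights, summing to 1
    correct -- correct[i] is True if the weak learner got sample i right
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 1:
        raise ValueError("sketch assumes at least one hit and one miss")
    alpha = 0.5 * math.log((1 - error) / error)   # the learner's say in the final vote
    # Shrink the weight of correct samples, grow the weight of mistakes.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]     # renormalise so weights sum to 1

n = 5
weights = [1 / n] * n                              # step 1: equal weights 1/N
correct = [True, True, False, True, True]          # step 2: hypothetical outcome
alpha, weights = reweight(weights, correct)        # step 3: update the weights
print(alpha, weights)
```

The returned `alpha` is what gives a more accurate weak learner a higher weight in the final prediction, in contrast to the equal say each learner has in bagging.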

Random Forest:
Random Forest is an extension over bagging. Each classifier in the ensemble
is a decision tree classifier and is generated using a random selection of attributes
at each node to determine the split. During classification, each tree votes and the
most popular class is returned.
Implementation steps of Random Forest –
1. Multiple subsets are created from the original data set, selecting
observations with replacement.
2. A subset of features is selected randomly and whichever feature gives the
best split is used to split the node iteratively.
3. The tree is grown to its largest extent, without pruning.
4. Repeat the above steps; the prediction is given based on the aggregation
of predictions from n trees.

2.14 Random forest


A Random Forest Algorithm is a supervised machine learning algorithm which is
extremely popular and is used for Classification and Regression problems in Machine
Learning. We know that a forest comprises numerous trees, and the more trees it has, the more
robust it is. Similarly, adding trees to a Random Forest generally improves its accuracy
and stability, although the gain levels off beyond a certain number of trees. Random Forest is a classifier that contains several
decision trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset. It is based on the concept of ensemble learning, which is
a process of combining multiple classifiers to solve a complex problem and improve the
performance of the model.


Working of Random Forest Algorithm


The following steps explain the working of the Random Forest Algorithm:
Step 1: Select random samples from a given data or training set.
Step 2: Construct a decision tree for every sample.
Step 3: Obtain a prediction from every decision tree; each prediction counts as one vote.
Step 4: Finally, select the most voted prediction result as the final prediction result (for regression, the average of the predictions).
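The four steps can be sketched in plain Python. To stay short, this sketch makes simplifications that a real random forest does not: each "tree" is a one-split decision stump instead of a fully grown tree, and the random feature subset is drawn once per tree rather than at every node. All function names are illustrative.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Pick the (feature, threshold) among `features` with the fewest training errors."""
    best = None
    for f in features:
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            l_cls = Counter(left).most_common(1)[0][0]
            r_cls = Counter(right).most_common(1)[0][0]
            errors = sum(y != l_cls for y in left) + sum(y != r_cls for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_cls, r_cls)
    if best is None:  # sample has no two distinct values: predict its majority class
        cls = Counter(labels).most_common(1)[0][0]
        return (features[0], 0.0, cls, cls)
    return best[1:]

def train_forest(rows, labels, n_trees, n_features, rng):
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]        # Step 1: bootstrap sample
        feats = rng.sample(range(len(rows[0])), n_features)   # random feature subset
        forest.append(train_stump([rows[i] for i in idx],     # Step 2: one tree per sample
                                  [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    votes = [l if row[f] <= t else r for f, t, l, r in forest]  # Step 3: each tree votes
    return Counter(votes).most_common(1)[0][0]                  # Step 4: majority wins

# Hypothetical toy data: two features, two classes.
rows = [[1, 10], [2, 11], [3, 12], [7, 20], [8, 21], [9, 22]]
labels = ["a", "a", "a", "b", "b", "b"]
forest = train_forest(rows, labels, n_trees=25, n_features=1, rng=random.Random(0))
print(predict(forest, [2, 11]), predict(forest, [8, 21]))
```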

This combination of multiple models is called Ensemble. Ensemble uses two methods:

1. Bagging: Creating different training subsets from the sample training data with
replacement is called Bagging. The final output is based on majority voting.
2. Boosting: Combining weak learners into strong learners by creating sequential
models such that the final model has the highest accuracy is called Boosting.
Example: AdaBoost, XGBoost.

Bagging: From the principle mentioned above, we can understand that Random Forest uses the
Bagging technique. Now, let us understand this concept in detail. Bagging is also known as
Bootstrap Aggregation. The process begins with the original data, from which random
samples are drawn with replacement; each one is known as a Bootstrap Sample. This process is
known as Bootstrapping. Further, the models are trained individually on these samples, yielding different results.
In the last step, all the results are combined, and the generated output
is based on majority voting. This combining step is known as Aggregation, and the whole
procedure is Bagging, carried out by an Ensemble Classifier.


Types of Machine Learning


To better understand the Random Forest algorithm and how it works, it's helpful to review
the three main types of machine learning -

• Reinforcement Learning - The process of teaching a machine to make specific decisions
using trial and error.
• Unsupervised Learning - The algorithm looks at unlabeled data and divides it into groups
on its own, without any labeled training examples. There is no target or outcome
variable to predict or estimate.
• Supervised Learning - Users have labeled data and use it to train their models. Supervised
learning further falls into two groups: classification and regression.

With supervised learning, the training data contains the input and target values.
The algorithm picks up a pattern that maps the input values to the output and uses this
pattern to predict values in the future. Unsupervised learning, on the other hand, uses
training data that does not contain the output values. The algorithm discovers structure in
the data by itself over multiple iterations of training. Finally, we have reinforcement learning. Here, the
algorithm is rewarded for every right decision made and, using this as feedback, the
algorithm can build stronger strategies.


Essential Features of Random Forest


• Diversity: Each tree has its own attributes and features, distinct from those of the other
trees. Not all trees are the same.
• Immune to the curse of dimensionality: Since each tree does not consider all the
features, the feature space it works in is reduced.
• Parallelization: We can fully use the CPU to build random forests since each tree is
created independently from different data and features.
• Train-Test split: In a Random Forest, a separate test split is less essential, because
each decision tree never sees roughly a third of the data (the samples left out of its
bootstrap sample), and these unseen samples can be used to estimate its error.
• Stability: The final result is based on Bagging, meaning the result is based on
majority voting or averaging.
Difference between Decision Tree and Random Forest

| Decision Trees | Random Forest |
| --- | --- |
| They usually suffer from the problem of overfitting if allowed to grow without any control. | Since the trees are created from subsets of data and the final output is based on average or majority ranking, the problem of overfitting is greatly reduced. |
| A single decision tree is comparatively faster in computation. | It is slower. |
| They use a particular set of rules when a data set with features is taken as input. | Random Forest randomly selects observations, builds a decision tree for each selection, and then the result is obtained based on majority voting. No formulas are required here. |

Why use a Random Forest Algorithm?


There are a lot of benefits to using the Random Forest Algorithm, but one of the main
advantages is that it reduces the risk of overfitting compared with a single decision tree and,
because the trees can be built in parallel, keeps the required training time manageable.
Additionally, it offers a high level of accuracy. The Random Forest algorithm runs efficiently on
large databases and can maintain accurate predictions by estimating missing data.


2.15 Iterative Dichotomiser 3 (ID3)


The ID3 algorithm (Iterative Dichotomiser 3) is a classification algorithm that
follows a greedy approach of building a decision tree by selecting, at each step, the attribute that
yields maximum Information Gain (IG) or, equivalently, minimum Entropy (H) after the split.
Entropy is a measure of the amount of uncertainty in the dataset S. Mathematical
Representation of Entropy is shown here -
$$H(S) = \sum_{c \in C} -p(c)\,\log_2 p(c)$$
Where,
• S - The current dataset for which entropy is being calculated(changes every iteration
of the ID3 algorithm).
• C - Set of classes in S {example - C ={yes, no}}
• p(c) - The proportion of the number of elements in class c to the number of elements
in set S.
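The entropy definition above translates almost literally into code. In this sketch S is represented simply as a list of class labels, which is an assumption about the data layout; p(c) is the share of each label in that list.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = sum over classes c of -p(c) * log2 p(c)."""
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))   # mixed set: entropy between 0 and 1
print(entropy(["yes"] * 4))                # pure set: no uncertainty
```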

In ID3, the weighted entropy of the resulting subsets is calculated for each remaining attribute. The attribute with the
smallest entropy is used to split the set S on that particular iteration.
Entropy = 0 implies the set is pure, that is, all its elements belong to the same class.
Information Gain IG(A, S) tells us how much uncertainty in S was reduced after splitting set S on attribute A.
Mathematical representation of Information Gain is shown here -
$$IG(A, S) = H(S) - \sum_{t \in T} p(t)\,H(t)$$
Where,
• H(S) - Entropy of set S.

• T - The subsets created from splitting set S by attribute A such that
$S = \bigcup_{t \in T} t$
• p(t) - The proportion of the number of elements in t to the number of elements in set
S.
• H(t) - Entropy of subset t.
In ID3, information gain can be calculated (instead of entropy) for each remaining
attribute. The attribute with the largest information gain is used to split the set S on that
particular iteration.
What are the steps in ID3 algorithm?
The steps in ID3 algorithm are as follows:
1. Calculate the entropy of the dataset.
2. For each attribute/feature:
2.1. Calculate the entropy for all its categorical values.
2.2. Calculate the information gain for the feature.
3. Find the feature with maximum information gain and split the dataset on it.
4. Repeat on each resulting subset until we get the desired tree.
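Steps 1 to 3 can be sketched as below. The sketch assumes each row is a dictionary mapping attribute names to categorical values; the toy data and the names `information_gain` and `best_attribute` are made up for illustration. Step 4 would apply `best_attribute` again to each subset produced by the split.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-(k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)      # split S by the value of A
    remainder = sum(len(t) / len(labels) * entropy(t) for t in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Step 3: the attribute with maximum information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Hypothetical toy data set.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]
print(best_attribute(rows, labels, ["outlook", "windy"]))
```

Here "outlook" separates the classes perfectly while "windy" leaves each subset evenly mixed, so the split is made on "outlook".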


UNIT – III
Unsupervised learning and Reinforcement learning
3.1 Introduction – Clustering Algorithm
Clustering is an unsupervised machine learning task. You might also hear this referred
to as cluster analysis because of the way this method works. Using a clustering algorithm
means you're going to give the algorithm a lot of input data with no labels and let it find any
groupings in the data it can. Those groupings are called clusters. A cluster is a group of data
points that are similar to each other based on their relation to surrounding data points.
Clustering is used for things like feature engineering or pattern discovery. When you're
starting with data you know nothing about, clustering might be a good place to get some
insight.
Types of clustering algorithms
There are different types of clustering algorithms that handle all kinds of unique data.
Density-based
In density-based clustering, data is grouped by areas of high concentrations of data
points surrounded by areas of low concentrations of data points. Basically the algorithm finds
the places that are dense with data points and calls those clusters.
The great thing about this is that the clusters can be any shape. You aren't constrained to
expected conditions.
The clustering algorithms under this type don't try to assign outliers to clusters, so they get
ignored.
Distribution-based
With a distribution-based clustering approach, all of the data points are considered
parts of a cluster based on the probability that they belong to a given cluster.
It works like this: there is a center-point, and as the distance of a data point from the center
increases, the probability of it being a part of that cluster decreases.
If you aren't sure what kind of distribution your data follows, you should consider a different
type of algorithm.
Centroid-based
Centroid-based clustering is the one you probably hear about the most. It's a little
sensitive to the initial parameters you give it, but it's fast and efficient.
These types of algorithms separate data points based on multiple centroids in the data. Each
data point is assigned to a cluster based on its squared distance from the centroid. This is the
most commonly used type of clustering.
Hierarchical-based
Hierarchical-based clustering is typically used on hierarchical data, like you would get
from a company database or taxonomies. It builds a tree of clusters so everything is organized
from the top-down.


This is more restrictive than the other clustering types, but it's perfect for specific kinds of
data sets.

When to use clustering


When you have a set of unlabeled data, it's very likely that you'll be using some kind
of unsupervised learning algorithm. There are a lot of different unsupervised learning
techniques, like clustering, dimensionality reduction, and some kinds of neural networks. The specific type of
algorithm you want to use is going to depend on what your data looks like. You might want
to use clustering when you're trying to do anomaly detection to find outliers in your
data. It helps by finding the clusters and showing the boundaries that would
determine whether a data point is an outlier or not.
If you aren't sure of what features to use for your machine learning model, clustering
discovers patterns you can use to figure out what stands out in the data.

3.2 K – Means Algorithm


Every Machine Learning engineer wants to achieve accurate predictions with their
algorithms. Such learning algorithms are generally broken down into two types - supervised
and unsupervised. K-Means clustering is one of the unsupervised algorithms where the
available input data does not have a labeled response.
K-means is a centroid-based clustering algorithm, where we calculate the distance
between each data point and a centroid to assign it to a cluster. The goal is to identify the K
number of groups in the dataset.
“K-means clustering is a method of vector quantization, originally from signal
processing, that aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, serving as a prototype of the cluster.”
It is an iterative process of assigning each data point to a group, so that data
points gradually get clustered based on similar features. The objective is to minimize the sum of
squared distances between the data points and their cluster centroids, to identify the correct group each
data point should belong to.
Here, we divide a data space into K clusters and assign a mean value to each. The data
points are placed in the clusters closest to the mean value of that cluster. There are several
distance metrics available that can be used to calculate the distance.

How does K-means work?


Let’s take an example to understand how K-means works step by step. The algorithm
can be broken down into five steps.


1. Choosing the number of clusters


The first step is to define the K number of clusters in which we will group the data.
Let’s select K=3.

2. Initializing centroids
A centroid is the center of a cluster, but initially the exact center of the data points is
unknown, so we select random data points and define them as the centroids for each cluster. We
will initialize 3 centroids in the dataset.

K-means clustering – centroid

3. Assign data points to the nearest cluster


Now that centroids are initialized, the next step is to assign data points Xn to their
closest cluster centroid Ck

K-means clustering – assign data points

In this step, we will first calculate the distance between data point X and centroid C using
the Euclidean distance metric, and then choose for each data point the cluster where the
distance between the data point and the centroid is minimum.

K-means clustering

4. Re-compute centroids
Next, we will re-compute the centroids by calculating the average of all data points of
each cluster.

K-means clustering

5. Repeat steps 3 and 4


We will keep repeating steps 3 and 4 until the centroids stop moving and the
assignments of data points to clusters are not changing anymore. Figure - K-means
clustering.
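The five steps fit in a short function. This is a minimal sketch under a few assumptions: points are tuples of numbers, squared Euclidean distance is used for the assignment, the initial centroids are random data points, and a cluster that ends up empty simply keeps its old centroid. The names and the toy data are illustrative.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def k_means(points, k, rng, max_iter=100):
    centroids = rng.sample(points, k)                      # step 2: random data points
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:                   # step 5: nothing moved, stop
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                                    # keep the old centroid if empty
                centroids[j] = mean(members)
    return centroids, assignment

# Hypothetical data: two well separated groups, so step 1 picks K = 2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = k_means(points, k=2, rng=random.Random(0))
print(centroids, assignment)
```

On less clearly separated data the result depends on the initial centroids, which is the motivation for the initialization methods discussed next.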


Does this iterative process sound familiar? K-means follows the same approach
as Expectation-Maximization (EM). EM is an iterative method to find maximum-likelihood
estimates of parameters when the machine learning model depends on unobserved (latent) variables. This
approach consists of two steps, Expectation (E) and Maximization (M), and iterates between
these two.
For K-means, the Expectation (E) step is where each data point is assigned to the most
likely cluster and the Maximization (M) step is where the centroids are recomputed using the
least-squares optimization technique.

Centroid initialization methods


Positioning the initial centroids can be challenging and the aim is to initialize centroids
as close as possible to the optimal values of the actual centroids. It is recommended to use a
deliberate strategy for defining initial centroids, as it directly impacts the overall runtime. The
traditional way is to select the centroids randomly but there are other methods as well, which
we will cover in this section.

• Random Data Points


This is the traditional approach of initializing centroids where K random data points
are selected and defined as centroids. As we saw in the above example, in this method each
data instance in the dataset has to be enumerated, and a record has to be kept of the
minimum/maximum value of each attribute. This is a time-consuming process; with increased
dataset complexity the number of steps needed to reach the correct centroids or clusters will
also increase.

• Naive Sharding
The sharding centroid initialization algorithm primarily depends on the composite
summation value of all the attributes for a particular instance or row in a dataset. The idea is
to calculate the composite value and then use it to sort the instances of the data. Once the
data set is sorted, it is then divided horizontally into k shards.
Finally, within each shard the values of every attribute are summed and their mean is
calculated. The resulting collection of per-shard attribute means is the set of centroids
that can be used for initialization.
Centroid initialization using sharding takes roughly linear time apart from the sort, and the
resulting execution time is reported to be much better than with random centroid initialization.
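A minimal sketch of naive sharding follows. It assumes the rows are lists of numeric attributes and that there are at least k rows; the function name and the sample data are illustrative.

```python
def naive_sharding(rows, k):
    """Return k initial centroids from k roughly equal slices of the sum-sorted rows."""
    ordered = sorted(rows, key=sum)                 # composite value = sum of attributes
    n = len(ordered)
    centroids = []
    for i in range(k):
        shard = ordered[i * n // k:(i + 1) * n // k]
        # The column-wise mean of the shard becomes one initial centroid.
        centroids.append([sum(col) / len(shard) for col in zip(*shard)])
    return centroids

rows = [[1, 1], [9, 9], [2, 2], [8, 8]]
print(naive_sharding(rows, 2))   # [[1.5, 1.5], [8.5, 8.5]]
```

The centroids returned here would replace the random data points chosen in step 2 of K-means; the remaining steps stay the same.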

Applications of K-Means Clustering


K-Means clustering is used in a variety of examples or business cases in real life, like:
o Academic performance
o Diagnostic systems
o Search engines
o Wireless sensor networks


Types of Clustering
Clustering is a type of unsupervised learning wherein data points are grouped into
different sets based on their degree of similarity.
The various types of clustering are:

• Hierarchical clustering
• Partitioning clustering
Hierarchical clustering is further subdivided into:

• Agglomerative clustering
• Divisive clustering
Partitioning clustering is further subdivided into:

• K-Means clustering
• Fuzzy C-Means clustering

Distance Measure
Distance measure determines the similarity between two elements and influences the
shape of clusters.
K-Means clustering supports various kinds of distance measures, such as:

• Euclidean distance measure
• Manhattan distance measure
• Squared Euclidean distance measure
• Cosine distance measure

Euclidean Distance Measure


The most common case is determining the distance between two points. If we have a
point P and point Q, the Euclidean distance is the length of the ordinary straight line between them. It is the distance
between the two points in Euclidean space.
The formula for the distance between two points P = (p_1, ..., p_n) and Q = (q_1, ..., q_n) is shown below:
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

Squared Euclidean Distance Measure


This is identical to the Euclidean distance measurement but does not take the square
root at the end. With p_i and q_i the coordinates of P and Q, the formula is shown below:
$$d(P, Q) = \sum_{i=1}^{n} (q_i - p_i)^2$$

Manhattan Distance Measure


The Manhattan distance is the simple sum of the horizontal and vertical components,
that is, the distance between two points measured along axes at right angles. Note that we
take absolute values so that negative differences don't cancel out.
With p_i and q_i the coordinates of P and Q, the formula is shown below:
$$d(P, Q) = \sum_{i=1}^{n} |q_i - p_i|$$
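The three distance measures described in this section can be written as small functions. The sketch assumes points are given as equal-length tuples of coordinates; the function names are chosen here for illustration.

```python
from math import sqrt

def squared_euclidean(p, q):
    """Sum of squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance between p and q."""
    return sqrt(squared_euclidean(p, q))

def manhattan(p, q):
    """Sum of absolute coordinate differences, measured along the axes."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
print(euclidean(p, q), squared_euclidean(p, q), manhattan(p, q))  # 5.0 25 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, since it follows the axes instead of the straight line.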

3.3 Hierarchical Clustering


Hierarchical clustering, also known as hierarchical cluster analysis or HCA, is another
unsupervised machine learning approach for grouping unlabeled datasets into clusters. The
hierarchy of clusters is developed in the form of a tree in this technique, and this tree-shaped
structure is known as the dendrogram.

In the agglomerative form of hierarchical clustering, each observation is first treated as a
separate cluster. After that, the algorithm repeats the next two steps:

1. Find the two clusters that are the closest together.
2. Combine these two most similar clusters.

This iterative process is repeated until all of the clusters have been merged.

Hierarchical clustering method functions in two approaches-


• Agglomerative
• Divisive


1. Agglomerative clustering:

Agglomerative Clustering is a bottom-up strategy in which each data point is
originally a cluster of its own, and as one travels up the hierarchy, more pairs of clusters
are combined. In it, the two nearest clusters are taken and joined to form one single cluster.

The algorithm for Agglomerative Hierarchical Clustering is:


1. Consider every data point as an individual cluster.
2. Calculate the similarity of each cluster with all the other clusters (calculate the
proximity matrix).
3. Merge the clusters which are highly similar or close to each other.
4. Recalculate the proximity matrix for each cluster.
5. Repeat Steps 3 and 4 until only a single cluster remains.
Note: This is just a demonstration of how the actual algorithm works. No calculation has
been performed below; all the proximities among the clusters are assumed.

Let’s say we have six data points A, B, C, D, E, and F.
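The merge loop can be sketched for six such points. Everything specific here is an assumption made for illustration: the points are placed on a line at made-up positions, single linkage is used as the proximity between clusters, and the proximities are recomputed from scratch on every pass rather than stored in a matrix.

```python
def single_linkage(a, b, points):
    """Distance between clusters a and b = distance of their closest pair of members."""
    return min(abs(points[i] - points[j]) for i in a for j in b)

def agglomerative(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [frozenset([name]) for name in points]   # every point is its own cluster
    history = []
    while len(clusters) > 1:
        # Proximity of every pair of current clusters, recomputed after each merge.
        pairs = [(single_linkage(a, b, points), i, j)
                 for i, a in enumerate(clusters)
                 for j, b in enumerate(clusters) if i < j]
        _, i, j = min(pairs)                            # the closest pair of clusters
        merged = clusters[i] | clusters[j]
        history.append(sorted(merged))
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return history

# Hypothetical 1-D positions for the six points A..F.
points = {"A": 1.0, "B": 1.5, "C": 5.0, "D": 5.25, "E": 9.0, "F": 12.0}
history = agglomerative(points)
for step in history:
    print(step)
```

The printed merge history is the information a dendrogram displays: which clusters were joined, and in what order.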

2. Divisive clustering:

The divisive clustering algorithm is a top-down clustering strategy in which all
points in the dataset are initially assigned to one cluster and then divided iteratively as
one progresses down the hierarchy. At each step it splits a cluster into smaller clusters,
separating the data points that differ most from one another. This process continues till the desired
number of clusters is obtained.


3.4 Cluster validity


Different performance metrics are used to evaluate different Machine Learning
algorithms. In the case of a classification problem, we have a variety of performance measures to
evaluate how good our model is. For cluster analysis, the analogous question is how to
evaluate the “goodness” of the resulting clusters.

Why do we need cluster validity indices ?

• To compare clustering algorithms.
• To compare two sets of clusters.
• To compare two clusters, i.e. which one is better in terms of compactness and
connectedness.
• To determine whether the structure found reflects the data or merely random noise.

Generally, cluster validity measures are categorized into 3 classes, they are –

1. Internal cluster validation : The clustering result is evaluated based on the data
clustered itself (internal information) without reference to external information.
2. External cluster validation : Clustering results are evaluated based on some
externally known result, such as externally provided class labels.
3. Relative cluster validation : The clustering results are evaluated by varying
different parameters for the same algorithm (e.g. changing the number of
clusters).

Besides the term cluster validity index, we need to know about the inter-cluster
distance d(a, b) between two clusters a and b, and the intra-cluster distance D(a) of a cluster a.


Inter-cluster distance d(a, b) between two clusters a and b can be –

• Single linkage distance: Closest distance between two objects, one belonging
to a and the other to b.
• Complete linkage distance: Distance between the two most remote objects, one
belonging to a and the other to b.
• Average linkage distance: Average distance between all pairs of objects, one belonging
to a and the other to b.
• Centroid linkage distance: Distance between the centroids of the two
clusters a and b.

Intra-cluster distance D(a) of a cluster a can be –


• Complete diameter linkage distance: Distance between two farthest objects
belonging to cluster a.
• Average diameter linkage distance: Average distance between all the objects
belonging to cluster a.
• Centroid diameter linkage distance: Twice the average distance between all the
objects and the centroid of the cluster a.

Now, let’s discuss 2 internal cluster validity indices namely Dunn index and DB index.

Dunn index :
The Dunn index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating
clustering algorithms. It is an internal evaluation scheme, where the result is based on the
clustered data itself. Like all other such indices, the aim of the Dunn index is to identify sets
of clusters that are compact, with a small variance between members of the cluster, and
well separated, where the means of different clusters are sufficiently far apart compared
to the within-cluster variance.

The higher the Dunn index value, the better the clustering. The number of clusters that
maximizes the Dunn index is taken as the optimal number of clusters k. It also has some
drawbacks: as the number of clusters and the dimensionality of the data increase, the
computational cost also increases.
The Dunn index for c clusters X_1, ..., X_c is defined as:
$$DI(c) = \frac{\min_{i \neq j} d(X_i, X_j)}{\max_{1 \le m \le c} D(X_m)}$$

DB index :
The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W.
Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation
scheme, where the validation of how well the clustering has been done is made using
quantities and features inherent to the dataset.
The lower the DB index value, the better the clustering. It also has a drawback: a good value
reported by this method does not imply the best information retrieval.
The DB index for k clusters X_1, ..., X_k is defined as:
$$DB(k) = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{D(X_i) + D(X_j)}{d(X_i, X_j)}$$
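Both indices can be computed with a few lines of Python. The choice of distances is an assumption of this sketch, since each index can be paired with any of the inter-cluster and intra-cluster distances listed earlier: the Dunn index here uses single linkage between clusters and the complete diameter within a cluster, while the DB index uses the distance between centroids and the average distance of a cluster's points to its centroid. At least one cluster is assumed to contain two or more points.

```python
from itertools import combinations
from math import dist

def centroid(cluster):
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def dunn_index(clusters):
    """Smallest inter-cluster distance / largest intra-cluster diameter (higher is better)."""
    # Inter-cluster distance: single linkage (closest pair across two clusters).
    min_between = min(dist(p, q)
                      for a, b in combinations(clusters, 2) for p in a for q in b)
    # Intra-cluster distance: complete diameter (farthest pair inside one cluster).
    max_within = max(dist(p, q) for c in clusters for p, q in combinations(c, 2))
    return min_between / max_within

def davies_bouldin_index(clusters):
    """Average over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    centers = [centroid(c) for c in clusters]
    # Scatter: average distance of a cluster's points to its centroid.
    scatter = [sum(dist(p, m) for p in c) / len(c) for c, m in zip(clusters, centers)]
    k = len(clusters)
    worst = [max((scatter[i] + scatter[j]) / dist(centers[i], centers[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k

# Hypothetical clusterings: one compact and well separated, one loose.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]
print(dunn_index(tight), dunn_index(loose))
print(davies_bouldin_index(tight), davies_bouldin_index(loose))
```

The compact, well separated clustering scores higher on the Dunn index and lower on the DB index, matching the "higher is better" and "lower is better" readings above.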

3.5 Dimensionality Reduction


The number of input features, variables, or columns present in a given dataset is
known as dimensionality, and the process to reduce these features is called dimensionality
reduction.
In various cases a dataset contains a huge number of input features, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features,
dimensionality reduction techniques are required in such cases.


The Curse of Dimensionality


Handling high-dimensional data is very difficult in practice, a problem commonly known as
the curse of dimensionality. If the dimensionality of the input dataset increases, any machine
learning algorithm and model becomes more complex. As the number of features increases,
the number of samples needed to generalize well also grows rapidly, and the chance of overfitting also
increases. If the machine learning model is trained on high-dimensional data without enough
samples, it tends to overfit, which results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction to a given dataset are given
below:
o By reducing the dimensions of the features, the space required to store the dataset
is also reduced.
o Less computation and training time is required with fewer feature dimensions.
o Reduced feature dimensions help in visualizing the data quickly.
o It removes redundant features (if present) by taking care of multicollinearity.

Approaches of Dimension Reduction


1. Feature Selection
Feature selection is the process of selecting a subset of the relevant features and
leaving out the irrelevant features present in a dataset in order to build a model of high
accuracy. In other words, it is a way of selecting the optimal features from the input dataset.

Three methods are used for feature selection:

1.1. Filter Methods

In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of the filter method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
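
As an illustration of the first technique in the list, the sketch below ranks features by the absolute Pearson correlation with the target and keeps those above a threshold. The feature names, the toy values and the 0.5 threshold are all hypothetical choices made for this example, not part of the notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_by_correlation(features, target, threshold=0.5):
    """Keep the features whose |correlation| with the target meets the threshold."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    selected = [name for name, r in scores.items() if abs(r) >= threshold]
    return selected, scores

# Hypothetical toy data: three candidate features and a numeric target.
features = {
    "size":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [2.0, -1.0, 2.0, -1.0, 2.0],
    "age":   [5.0, 4.0, 3.0, 2.0, 1.0],
}
target = [2.1, 3.9, 6.2, 8.1, 9.8]

selected, scores = filter_by_correlation(features, target)
print(selected)  # features that pass the filter
```

Note that the filter never trains a model; it only looks at a statistic of each feature, which is what distinguishes it from the wrapper methods.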


1.2. Wrapper Methods


The wrapper method has the same goal as the filter method, but it uses a machine
learning model for its evaluation. In this method, a subset of features is fed to the ML model
and the performance is evaluated. The performance decides whether to add or remove
features in order to increase the accuracy of the model. This method is more accurate than
the filter method but more complex to apply. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination

1.3. Embedded Methods:


Embedded methods perform feature selection as part of model training: they check the
different training iterations of the machine learning model and evaluate the importance of
each feature. Some common techniques of embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.

2. Feature Extraction:
Feature extraction is the process of transforming a space containing many
dimensions into a space with fewer dimensions. This approach is useful when we want to keep
as much of the information as possible but use fewer resources while processing it.

Some common feature extraction techniques are:


a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Dimension reduction by feature extraction (as in PCA) involves the following steps:


• Construct the covariance matrix of the data.
• Compute the eigenvectors of this matrix.
• Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a
large fraction of the variance of the original data.

Hence, we are left with fewer eigenvectors, and some information may have been
lost in the process. However, the remaining eigenvectors should retain the most important
variance.


3.6 Principal Component Analysis


Principal Component Analysis (PCA) is an unsupervised learning algorithm used for
dimensionality reduction in machine learning. It is a statistical procedure that converts
observations of correlated features into a set of linearly uncorrelated features by means of
an orthogonal transformation. These new transformed features are called the Principal
Components. PCA is one of the most popular tools for exploratory data analysis and
predictive modeling. It draws out the strong patterns in a dataset by reducing its
dimensions while retaining as much of the variance as possible.

PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data. It works by considering the variance of each attribute, because an
attribute with high variance tends to show a good split between the classes, and hence it
reduces the dimensionality. Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing the power allocation in various communication
channels. It is a feature extraction technique, so it keeps the most important components
and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:


o Variance and Covariance
o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:


o Dimensionality: The number of features or variables present in the given dataset.
More simply, it is the number of columns in the dataset.
o Correlation: How strongly two variables are related to each other, that is, how much
one changes when the other changes. The correlation value ranges from -1 to +1.
Here, -1 indicates a perfect negative (inverse) linear relationship, and +1 indicates a
perfect positive linear relationship.
o Orthogonal: The variables are not correlated with each other, and hence the
correlation between each pair of variables is zero.
o Eigenvectors: Given a square matrix M and a non-zero vector v, v is an eigenvector
of M if Mv is a scalar multiple of v. That scalar is the corresponding eigenvalue.
o Covariance Matrix: A matrix containing the covariance between each pair of variables
is called the Covariance Matrix.

Principal Components in PCA


As described above, the transformed new features, or the output of PCA, are the
Principal Components. The number of these PCs is either equal to or less than the number
of original features present in the dataset. Some properties of these principal components
are given below:


o Each principal component must be a linear combination of the original features.
o These components are orthogonal, i.e., the correlation between any pair of components
is zero.
o The importance of each component decreases when going from 1 to n: the 1st PC
has the most importance, and the n-th PC has the least importance.

Steps for PCA algorithm


1. Getting the dataset - First, we take the input dataset and divide it into two
subparts X and Y, where X is the training set and Y is the validation set.
2. Representing data into a structure - Next, we represent the dataset in a structure,
namely a two-dimensional matrix of the independent variable X. Each row
corresponds to a data item, and each column corresponds to a feature. The number
of columns is the dimensionality of the dataset.
3. Standardizing the data - In this step, we standardize the dataset. Without
standardization, the features in columns with high variance would count as more
important than the features with lower variance.
If the importance of a feature should be independent of its variance, we subtract
the column mean from each data item and divide the result by the standard deviation
of the column. We name the resulting matrix Z.
4. Calculating the Covariance of Z - To calculate the covariance of Z, we take the
matrix Z, transpose it, and multiply the transpose by Z. The output matrix (up to a
constant factor of 1/(n - 1) for n data items) is the covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors - Now we calculate the eigenvalues
and eigenvectors of the resulting covariance matrix. The eigenvectors of the
covariance matrix are the directions of the axes with high information, and the
corresponding eigenvalues measure how much variance lies along each direction.
6. Sorting the Eigen Vectors - In this step, we take all the eigenvalues and sort them
in decreasing order, from largest to smallest, and simultaneously sort the
corresponding eigenvectors (the columns of the eigenvector matrix P) in the same
order. The resulting matrix is named P*.
7. Calculating the new features Or Principal Components - Here we calculate the
new features. To do this, we multiply Z by the P* matrix. In the resulting matrix
Z*, each observation is a linear combination of the original features, and the columns
of Z* are uncorrelated with each other.
8. Remove less or unimportant features from the new dataset - Now that the new
feature set has been obtained, we decide what to keep and what to remove. We keep
only the relevant or important features in the new dataset and remove the
unimportant ones.
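
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the toy data is hypothetical, the function names are invented for this example, and the eigenvectors are found by power iteration with deflation (an assumption made to stay within the standard library; a numerical library would normally be used instead).

```python
import math

def standardize(rows):
    """Step 3: subtract each column's mean and divide by its standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Step 4: covariance matrix Z^T Z / (n - 1) of the centered data."""
    n, d = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_eigenpairs(matrix, k, iterations=500):
    """Steps 5-6: the k largest eigenpairs of a symmetric matrix.

    Power iteration with deflation returns the pairs largest first.
    """
    d = len(matrix)
    m = [row[:] for row in matrix]
    values, vectors = [], []
    for idx in range(k):
        v = [1.0 / (i + 1 + idx) for i in range(d)]  # arbitrary non-zero start
        for _ in range(iterations):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * m[a][b] * v[b] for a in range(d) for b in range(d))
        values.append(lam)
        vectors.append(v)
        # Deflate: remove the component just found before looking for the next.
        for a in range(d):
            for b in range(d):
                m[a][b] -= lam * v[a] * v[b]
    return values, vectors

def pca(rows, k):
    """Steps 3-8: project standardized data onto the top-k principal components."""
    z = standardize(rows)
    values, vectors = top_eigenpairs(covariance(z), k)
    projected = [[sum(r[j] * vec[j] for j in range(len(vec))) for vec in vectors]
                 for r in z]
    # After standardization every column has variance 1, so total variance = d.
    explained = sum(values) / len(rows[0])
    return projected, values, explained

# Hypothetical data: the second column is roughly twice the first.
data = [
    [1.0, 2.1, 3.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 1.0],
    [5.0, 9.8, 5.0],
    [6.0, 12.1, 9.0],
]
projected, eigenvalues, explained = pca(data, k=2)
print(eigenvalues, round(explained, 3))
```

Because two of the three columns are almost collinear, two components should already capture nearly all of the variance, which is the situation in which dropping the remaining component (step 8) costs little information.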


Applications of Principal Component Analysis


o PCA is mainly used as a dimensionality reduction technique in various AI applications
such as computer vision and image compression.
o It can also be used for finding hidden patterns in high-dimensional data. Some fields
where PCA is used are finance, data mining, and psychology.

3.7 Recommendation System


A recommendation system is an artificial intelligence or AI algorithm, usually
associated with machine learning, that uses Big Data to suggest or recommend additional
products to consumers. These can be based on various criteria, including past purchases,
search history, demographic information, and other factors. Recommender systems are
highly useful as they help users discover products and services they might not otherwise
have found on their own.
Recommender systems are trained to understand the preferences, previous decisions,
and characteristics of people and products using data gathered about their interactions.
These include impressions, clicks, likes, and purchases. Because of their capability to predict
consumer interests and desires on a highly personalized level, recommender systems are a
favorite with content and product providers. They can drive consumers to just about any
product or service that interests them, from books to videos to health classes to clothing.

Types of Recommendation Systems


While there are a vast number of recommender algorithms and techniques, most fall
into these broad categories: collaborative filtering, content filtering and context filtering.

Collaborative filtering algorithms recommend items (this is the filtering part) based on
preference information from many users (this is the collaborative part). This approach uses
similarity of user preference behavior: given previous interactions between users and items,
recommender algorithms learn to predict future interactions. These recommender systems
build a model from a user’s past behavior, such as items purchased previously or ratings given
to those items, and from similar decisions by other users. The idea is that if some people have
made similar decisions and purchases in the past, like a movie choice, then there is a high
probability they will agree on additional future selections. For example, if a collaborative
filtering recommender knows you and another user share similar tastes in movies, it might
recommend a movie to you that it knows this other user already likes.

Content filtering, by contrast, uses the attributes or features of an item (this is the
content part) to recommend other items similar to the user’s preferences. This approach is
based on similarity of item and user features: given information about a user and the items
they have interacted with (e.g. a user’s age, the category of a restaurant’s cuisine, the
average review for a movie), it models the likelihood of a new interaction. For example, if a
content filtering recommender sees you liked the movies You’ve Got Mail and Sleepless in
Seattle, it might recommend another movie to you with the same genres and/or cast, such
as Joe Versus the Volcano.


Hybrid recommender systems combine the advantages of the types above to create a more
comprehensive recommending system.

Context filtering includes users’ contextual information in the recommendation process.


Netflix spoke at NVIDIA GTC about making better recommendations by framing a
recommendation as a contextual sequence prediction. This approach uses a sequence of
contextual user actions, plus the current context, to predict the probability of the next action.
In the Netflix example, given one sequence for each user—the country, device, date, and time
when they watched a movie—they trained a model to predict what to watch next.

Use Cases and Applications


• E-Commerce & Retail: Personalized Merchandising
• Media & Entertainment: Personalized Content
• Personalized Banking

Benefits of Recommendation Systems

▪ Improving retention.
▪ Increasing sales.
▪ Helping to form customer habits and trends.
▪ Speeding up the pace of work.
▪ Boosting cart value.


How Recommenders Work


How a recommender model makes recommendations will depend on the type of data
you have. If you only have data about which interactions have occurred in the past, you’ll
probably be interested in collaborative filtering. If you have data describing the user and items
they have interacted with (e.g. a user’s age, the category of a restaurant’s cuisine, the average
review for a movie), you can model the likelihood of a new interaction given these properties
at the current moment by adding content and context filtering.

Matrix Factorization for Recommendation


Matrix factorization (MF) techniques are the core of many popular algorithms,
including word embedding and topic modeling, and have become a dominant methodology
within collaborative-filtering-based recommendation. MF can be used to calculate the
similarity in users’ ratings or interactions to provide recommendations. In the simple
user-item matrix below, Ted and Carol like movies B and C. Bob likes movie B. To recommend a
movie to Bob, matrix factorization calculates that users who liked B also liked C, so C is a
possible recommendation for Bob.

Matrix factorization using the alternating least squares (ALS) algorithm approximates the
sparse u-by-i user-item rating matrix as the product of two dense matrices, the user and item
factor matrices of size u × f and f × i (where u is the number of users, i the number of items
and f the number of latent features). The factor matrices represent latent or hidden features
which the algorithm tries to discover. One matrix tries to describe the latent or hidden
features of each user, and one tries to describe latent properties of each movie. For each
user and for each item, the ALS algorithm iteratively learns f numeric “factors” that represent
the user or item. In each iteration, the algorithm alternately fixes one factor matrix and
optimizes for the other, and this process continues until it converges.
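
To make the alternation concrete, here is a minimal sketch of ALS with a single latent feature (f = 1), in which case each least-squares update has a simple closed form. The ratings, the regularization value `reg` and the function name are hypothetical choices for this illustration; real libraries solve a small linear system per user and per item for larger f.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=50):
    """Alternating least squares with one latent factor per user and per item.

    ratings maps (user, item) -> observed rating; unobserved pairs are absent.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Fix the item factors and solve the least-squares problem for each user.
        for u in range(n_users):
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(item_f[i] ** 2 for (uu, i), _ in ratings.items() if uu == u)
            user_f[u] = num / (den + reg)
        # Fix the user factors and solve the least-squares problem for each item.
        for i in range(n_items):
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(user_f[u] ** 2 for (u, ii), _ in ratings.items() if ii == i)
            item_f[i] = num / (den + reg)
    return user_f, item_f

# Hypothetical 3-user x 2-item rating matrix with one missing entry, (2, 1).
ratings = {(0, 0): 1.0, (0, 1): 2.0,
           (1, 0): 2.0, (1, 1): 4.0,
           (2, 0): 3.0}

user_f, item_f = als_rank1(ratings, n_users=3, n_items=2)
prediction = user_f[2] * item_f[1]  # estimated rating of user 2 for item 1
print(round(prediction, 2))
```

The observed ratings were chosen so that item 1 is always rated twice as high as item 0; the factors pick up that pattern and the missing rating is predicted to be close to twice user 2's rating of item 0.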


CuMF is an NVIDIA® CUDA®-based matrix factorization library that optimizes the
alternating least squares (ALS) method to solve very large-scale MF. CuMF uses a set of
techniques to maximize the performance on single and multiple GPUs. These techniques
include smart access of sparse data leveraging the GPU memory hierarchy, using data
parallelism in conjunction with model parallelism to minimize the communication overhead
among GPUs, and a novel topology-aware parallel reduction scheme.

3.8 EM – Algorithm
The Expectation-Maximization (EM) algorithm is an iterative method, used within various
unsupervised machine learning algorithms, for finding local maximum likelihood estimates
(MLE) or maximum a posteriori (MAP) estimates of the parameters of statistical models that
contain unobservable variables. In other words, it is a technique for performing maximum
likelihood estimation when latent variables are present. It is therefore closely tied to the
idea of a latent variable model.

A latent variable model consists of both observable and unobservable variables: the
observable variables can be measured directly, while the unobservable ones are inferred from
the observed variables. These unobservable variables are known as latent variables.

Key Points:
o It is used with latent variable models to determine MLE or MAP estimates of the
parameters when latent variables are present.
o It is used to estimate the values of parameters in instances where data is missing or
unobservable, and the estimation is repeated until the values converge.

EM Algorithm
The EM algorithm is used within various unsupervised ML algorithms and is closely
related to the k-means clustering algorithm. Being an iterative approach, it alternates
between two steps. In the first step, we estimate the missing or latent variables; hence it is
referred to as the Expectation/estimation step (E-step). The second step optimizes the
parameters of the model so that it explains the data better. It is known as the
maximization step or M-step.


o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the
dataset to estimate the missing data of the latent variables and then use that data to update
the values of the parameters in the M-step.

What is Convergence in the EM algorithm?

Intuitively, convergence means that the estimates stop changing from one iteration to the
next. If the values (for example, the parameters or the log-likelihood) obtained in two
consecutive iterations differ by less than a small tolerance, the algorithm is said to have
converged. In other words, once successive values effectively match each other, convergence
has been reached.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps: the Initialization Step, Expectation Step,
Maximization Step, and Convergence Step. These steps are explained as follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system
is provided with incomplete observed data, with the assumption that the data is
obtained from a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. The
E-step primarily updates the latent variables.


o 3rd Step: This step is known as Maximization or M-step, where we use the complete
data obtained from the 2nd step to update the parameter values. The M-step primarily
updates the hypothesis (the model parameters).
o 4th Step: The last step is to check whether the values of the latent variables are
converging. If they are, stop the process; otherwise, repeat the process from step 2
until convergence occurs.

Gaussian Mixture Model (GMM)


The Gaussian Mixture Model or GMM is defined as a mixture model that combines
several Gaussian probability distributions whose parameters are not specified in advance.
A GMM therefore requires estimated statistics, such as the mean and standard deviation of
each component. The EM algorithm is used to estimate these parameters so that the mixture
best fits the density of a given training dataset. Although there are plenty of techniques
available to estimate the parameters of the Gaussian Mixture Model (GMM), Maximum
Likelihood Estimation is one of the most popular techniques among them.
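
The four EM steps can be sketched for a one-dimensional mixture of two Gaussians as shown below. The data points, the starting values and the tolerance are hypothetical, and the function names are invented for this illustration; the code is a minimal sketch rather than a robust implementation (for instance, it only guards against zero variance with a small floor).

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, max_iter=100, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns (means, variances, weights)."""
    # Step 1 (initialization): start from the supplied parameter guesses.
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    previous_ll = None
    for _ in range(max_iter):
        # Step 2 (E-step): responsibility of each component for each point.
        responsibilities = []
        log_likelihood = 0.0
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                     for j in range(k)]
            total = sum(joint)
            log_likelihood += math.log(total)
            responsibilities.append([p / total for p in joint])
        # Step 3 (M-step): re-estimate parameters from the soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in responsibilities)
            means[j] = sum(r[j] * x for r, x in zip(responsibilities, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2
                      for r, x in zip(responsibilities, data)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a zero variance
            weights[j] = nj / len(data)
        # Step 4 (convergence): stop once the log-likelihood stops changing.
        if previous_ll is not None and abs(log_likelihood - previous_ll) < tol:
            break
        previous_ll = log_likelihood
    return means, variances, weights

# Hypothetical data drawn around two centres, 1.0 and 5.0.
data = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, variances, weights = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

Here the latent variable is the unknown component that generated each point; the responsibilities computed in the E-step are the soft "guesses" for it.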

Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent
variables through observed data in datasets. The EM algorithm or latent variable model has a
broad range of real-life applications in machine learning. These are as follows:
o The EM algorithm is applicable in data clustering in machine learning.
o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the values of the parameters in mixed models, such as the
Gaussian Mixture Model, and in quantitative genetics.


o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
o It is also applicable in the medical and healthcare industry, such as in image
reconstruction, and in structural engineering.
o It is used to determine the Gaussian density of a function.

3.9 Reinforcement Algorithm


Reinforcement Learning is defined as a Machine Learning method that is concerned
with how software agents should take actions in an environment. Reinforcement Learning is
a branch of machine learning (often combined with deep learning) in which the agent learns
to maximize some portion of the cumulative reward. This learning method helps the agent
learn how to attain a complex objective or maximize a specific dimension over many steps.

Here are some important terms used in Reinforcement AI:


• Agent: An assumed entity which performs actions in an environment to gain some
reward.
• Environment (e): A scenario that an agent has to face.
• Reward (R): An immediate return given to an agent when it performs a specific
action or task.
• State (s): The current situation returned by the environment.
• Policy (π): A strategy applied by the agent to decide the next action based
on the current state.
• Value (V): The expected long-term return with discount, as compared to the
short-term reward.
• Value Function: It specifies the value of a state, that is, the total amount of reward
an agent should expect to accumulate starting from that state.
• Model of the environment: This mimics the behavior of the environment. It helps you
to make inferences and to determine how the environment will behave.
• Model based methods: Methods for solving reinforcement learning problems that
make use of a model of the environment.
• Q value or action value (Q): Q value is quite similar to value. The only difference
between the two is that it takes the current action as an additional parameter.

Approaches to implement Reinforcement Learning

There are mainly three ways to implement reinforcement learning in ML, which are:
1. Value-based:
The value-based approach aims to find the optimal value function, which is the
maximum value achievable at a state under any policy. The agent therefore expects
the long-term return at any state s under policy π.


2. Policy-based:
The policy-based approach aims to find the optimal policy for the maximum future
rewards without using the value function. In this approach, the agent tries to apply a
policy such that the action performed in each step helps to maximize the future reward.
The policy-based approach has mainly two types of policy:
o Deterministic: The policy (π) always produces the same action for a given state.
o Stochastic: In this policy, probability determines the produced action.
3. Model-based: In the model-based approach, a virtual model is created for the
environment, and the agent explores that model to learn the environment. There is no
particular solution or algorithm for this approach because the model representation
is different for each environment.

Elements of Reinforcement Learning


There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model of the environment

How does Reinforcement Learning Work?


To understand the working process of RL, we need to consider two main things:
o Environment: It can be anything such as a room, a maze, a football ground, etc.
o Agent: An intelligent agent such as an AI robot.

Let's take an example of a maze environment that the agent needs to explore. Consider
the below image:


In the above image, the agent is at the very first block of the maze. The maze
contains an S6 block, which is a wall, S8, which is a fire pit, and S4, which is a diamond block.

The agent cannot cross the S6 block, as it is a solid wall. If the agent reaches the
S4 block, it gets a +1 reward; if it reaches the fire pit, it gets a -1 reward. It can
take four actions: move up, move down, move left, and move right.

The agent can take any path to reach the final point, but it needs to do so in as few
steps as possible. Suppose the agent follows the path S9-S5-S1-S2-S3; it will then get
the +1 reward.

The agent will try to remember the preceding steps that it has taken to reach the final
step. To memorize the steps, it assigns a value of 1 to each previous step. Consider the below
step:

Now, the agent has successfully stored the previous steps by assigning a value of 1 to each
previous block. But what will the agent do if it starts moving from a block that has a block
with value 1 on both sides? Consider the below diagram:


It will be difficult for the agent to decide whether it should go up or down, as each
block has the same value. So, the above approach is not suitable for the agent to reach
the destination. Hence, to solve the problem, we will use the Bellman equation, which is
the main concept behind reinforcement learning.

The Bellman Equation

It is a way of calculating value functions that originates in dynamic programming and
leads to modern reinforcement learning.

The key-elements used in Bellman equations are:


o The action performed by the agent is referred to as "a".
o The state reached by performing the action is "s".
o The reward/feedback obtained for each good and bad action is "R".
o The discount factor is gamma, "γ".

The Bellman equation can be written as:


V(s) = max_a [R(s,a) + γV(s')]

Where,
V(s) = Value calculated at a particular state s.
R(s,a) = Reward at state s for performing action a.
γ = Discount factor.
V(s') = Value of the next state s' reached by performing action a in state s.
The maximum is taken over all actions a available in state s.
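
The equation can be applied repeatedly to every block of the maze until the values stop changing (value iteration). The sketch below does this under several assumptions that are not stated in the notes: the maze is a 3 x 4 grid numbered S1 to S12 row by row from the top-left, moves are deterministic, bumping into the wall or the border leaves the agent in place, the reward is received on entering S4 or S8, and γ = 0.9.

```python
GAMMA = 0.9                      # assumed discount factor
ROWS, COLS = 3, 4                # assumed grid layout, S1..S12 row by row
WALL, FIRE, DIAMOND = "S6", "S8", "S4"
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def name(r, c):
    """Label of the block in row r, column c (both 0-based)."""
    return f"S{r * COLS + c + 1}"

def step(r, c, action):
    """Return (next_state, reward) for taking an action in block (r, c)."""
    dr, dc = ACTIONS[action]
    nr, nc = r + dr, c + dc
    if not (0 <= nr < ROWS and 0 <= nc < COLS) or name(nr, nc) == WALL:
        nr, nc = r, c            # blocked: the agent stays where it is
    target = name(nr, nc)
    reward = 1.0 if target == DIAMOND else -1.0 if target == FIRE else 0.0
    return target, reward

def value_iteration(sweeps=50):
    """Apply V(s) = max_a [R(s,a) + gamma * V(s')] until the values settle."""
    values = {name(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
    for _ in range(sweeps):
        for r in range(ROWS):
            for c in range(COLS):
                s = name(r, c)
                if s in (WALL, FIRE, DIAMOND):
                    continue     # wall and terminal blocks keep value 0
                values[s] = max(reward + GAMMA * values[nxt]
                                for nxt, reward in
                                (step(r, c, a) for a in ACTIONS))
    return values

values = value_iteration()
for s in ("S3", "S2", "S1", "S5", "S9"):
    print(s, round(values[s], 4))
```

Under these assumptions the values decrease by a factor of γ with each step away from the diamond, so the agent no longer faces equally valued neighbours and can simply move toward the block with the highest value.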

Characteristics of Reinforcement Learning


Here are important characteristics of reinforcement learning:

• There is no supervisor, only a real number or reward signal
• Sequential decision making
• Time plays a crucial role in Reinforcement problems
• Feedback is always delayed, not instantaneous
• Agent’s actions determine the subsequent data it receives

Types of Reinforcement Learning


Two types of reinforcement learning methods are:

Positive:
Positive reinforcement is defined as an event that occurs because of a specific behavior.
It increases the strength and the frequency of the behavior and impacts positively on the
action taken by the agent.


This type of reinforcement helps you to maximize performance and sustain change
for a more extended period. However, too much reinforcement may lead to
over-optimization of state, which can affect the results.

Negative:
Negative reinforcement is defined as the strengthening of a behavior that occurs because
a negative condition is stopped or avoided. It helps you to define the minimum standard of
performance. However, the drawback of this method is that it provides only enough to meet
the minimum behavior.

3.9 Elements

The 6 elements of Machine Learning are:


1. Data
2. Task
3. Model
4. Loss Function
5. Learning Algorithm
6. Evaluation

3.10 Temporal difference algorithm


One of the problems with the environment is that rewards usually are not immediately
observable. For example, in tic-tac-toe and similar games, we only know the reward on the
final move (terminal state). All other moves have zero immediate reward.
TD learning is an unsupervised technique to predict a variable's expected value in a
sequence of states. TD uses a mathematical trick to replace complex reasoning about the
future with a simple learning procedure that can produce the same results. Instead of
calculating the total future reward, TD tries to predict the combination of immediate reward
and its own reward prediction at the next moment in time.
Mathematically, the key concept of TD learning is the discounted return:
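
In the usual notation it can be written as below. The symbols are assumed for this sketch: $R_t$ is the discounted return from time $t$, $r_{t+1}, r_{t+2}, \dots$ are the individual rewards received afterwards, and $\gamma$ is the discount rate described under Parameters.

```latex
R_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1}
```

Since $0 \le \gamma \le 1$, each additional step into the future multiplies a reward by another factor of $\gamma$.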

Here the return at time t is the sum of the discounted rewards in the future, which
implies that rewards further in the future are valued less. The TD Error is the difference
between the ultimate correct reward (V*_t) and our current prediction (V_t), that is,
error = V*_t - V_t.


And similar to other optimization methods, the current value will be updated by its
value + learning_rate * error:
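
A minimal sketch of this update is given below. It assumes the common TD(0) form, in which the unknown correct value is approximated by the immediate reward plus the discounted prediction for the next state; the three-state episode, the state names and the default values of `alpha` and `gamma` are hypothetical.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """One TD(0) step: move V(state) toward reward + gamma * V(next_state)."""
    target = reward + gamma * values.get(next_state, 0.0)
    error = target - values[state]          # the TD error
    values[state] += alpha * error          # value + learning_rate * error
    return error

# Hypothetical episode A -> B -> C -> end; only the last move is rewarded.
episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "end")]
values = {"A": 0.0, "B": 0.0, "C": 0.0, "end": 0.0}

for _ in range(500):                        # replay the same episode many times
    for state, reward, next_state in episode:
        td0_update(values, state, reward, next_state)

print({s: round(v, 3) for s, v in values.items()})
```

Although only the final move carries a reward, the estimates of the earlier states gradually pick it up, discounted once per step, which is how TD learning copes with delayed rewards.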

Parameters

Alpha (α): the learning rate. This parameter shows how much we should adjust our
estimates based on the error. The learning rate is between 0 and 1. A large learning rate
adjusts aggressively and might lead to fluctuating training results that do not converge.
A small learning rate adjusts slowly, so training takes more time to converge.

Gamma (γ): the discount rate. It shows how much we value future rewards. The discount
rate is between 0 and 1. The bigger the discount rate, the more we value future rewards.

e (coming up in the next section on “e-greedy” policy): the ratio that reflects
exploration vs. exploitation. We explore new options with probability e and stay at the
current max with probability 1-e. A larger e implies more exploration while training.
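
The role of e can be illustrated with a small sketch of an e-greedy choice. The action names and their estimated values are hypothetical, and the optional `rng` argument is an addition made here so that the example can be run reproducibly.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore with probability epsilon, otherwise pick the best-known action.

    q_values maps each action to its current estimated value.
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))    # explore: any action, uniformly
    return max(q_values, key=q_values.get)   # exploit: current maximum

# Hypothetical action-value estimates.
q = {"left": 0.2, "right": 0.7, "up": 0.1}

greedy_choice = epsilon_greedy(q, epsilon=0.0)   # never explores
rng = random.Random(0)
picks = [epsilon_greedy(q, epsilon=0.3, rng=rng) for _ in range(10000)]
share_right = picks.count("right") / len(picks)
print(greedy_choice, round(share_right, 2))
```

With e = 0.3 and three actions, the best action is still chosen most of the time (with probability 0.7 by exploitation plus 0.1 by chance during exploration), while the other actions keep being tried occasionally.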


UNIT – IV
Probabilistic method for learning
4.1 Introduction – Naïve Bayes Algorithm
o The Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes'
theorem and used for solving classification problems.
o It is mainly used in text classification, which involves high-dimensional training datasets.
o The Naïve Bayes Classifier is one of the simplest and most effective classification
algorithms; it helps in building fast machine learning models that can make quick
predictions.
o It is a probabilistic classifier, which means it predicts on the basis of the probability
of an object belonging to each class.
o Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment
analysis, and classifying articles.

Why is it called Naïve Bayes?


The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be
described as:
o Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on
the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple. Each feature individually contributes to identifying it as an apple,
without depending on the others.
o Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine
the probability of a hypothesis with prior knowledge. It depends on the conditional
probability.
o The formula for Bayes' theorem is given as:

P(A|B) = P(B|A)*P(A)/P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A given the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the hypothesis
is true.
P(A) is Prior Probability: Probability of the hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of the evidence.


Working of Naïve Bayes' Classifier:


The working of the Naïve Bayes Classifier can be understood with the help of the
example below:
Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not on a
particular day according to the weather conditions. To solve this problem, we follow
the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.

Problem: If the weather is sunny, should the Player play or not?
Solution: To solve this, first consider the dataset below:

      Outlook     Play
0     Rainy       Yes
1     Sunny       Yes
2     Overcast    Yes
3     Overcast    Yes
4     Sunny       No
5     Rainy       Yes
6     Sunny       Yes
7     Overcast    Yes
8     Rainy       No
9     Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes


Frequency table for the weather conditions:

Weather     Yes    No
Overcast    5      0
Rainy       2      2
Sunny       3      2
Total       10     4

Likelihood table for the weather conditions:

Weather     No             Yes              P(Weather)
Overcast    0              5                5/14 = 0.36
Rainy       2              2                4/14 = 0.29
Sunny       2              3                5/14 = 0.36
All         4/14 = 0.29    10/14 = 0.71

Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = (3/10)*(10/14)/(5/14) = 3/5 = 0.60

P(No|Sunny) = P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = (2/4)*(4/14)/(5/14) = 2/5 = 0.40
The calculation shows that P(Yes|Sunny) > P(No|Sunny).

Hence, on a sunny day, the player can play the game.
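
The calculation above can be checked with a short Python sketch. It counts directly from the 14 rows of the dataset, so no rounded intermediate values are involved. The function name posterior is only illustrative, and with a single feature (Outlook) the classifier reduces to a direct application of Bayes' theorem.

```python
from collections import Counter

# (Outlook, Play) pairs copied from the 14-row dataset above.
DATA = [
    ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
    ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
    ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
    ("Overcast", "Yes"), ("Overcast", "Yes"),
]


def posterior(outlook, label, data=DATA):
    """Return P(label | outlook) using Bayes' theorem on raw counts."""
    n = len(data)
    label_count = Counter(play for _, play in data)
    outlook_count = Counter(weather for weather, _ in data)
    joint_count = Counter(data)

    likelihood = joint_count[(outlook, label)] / label_count[label]  # P(outlook | label)
    prior = label_count[label] / n                                   # P(label)
    evidence = outlook_count[outlook] / n                            # P(outlook)
    return likelihood * prior / evidence


p_yes = posterior("Sunny", "Yes")
p_no = posterior("Sunny", "No")
print("P(Yes|Sunny) =", round(p_yes, 2))
print("P(No|Sunny)  =", round(p_no, 2))
print("Decision:", "Play" if p_yes > p_no else "Do not play")
```

The two posteriors sum to 1, as they must, because Yes and No are the only possible values of Play.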

4.2 Maximum likelihood


1. What is Maximum Likelihood Estimation? The likelihood of a given set of observations is
the probability of obtaining that particular set of data, given the chosen probability distribution
model.
MLE is carried out by writing an expression known as the likelihood function for a set of
observations. This expression contains an unknown parameter of the model, say θ. We find
the value of this parameter that maximizes the likelihood of the observations. This value is
called the maximum likelihood estimate.
Think of likelihood as the reverse view of probability. A probability function gives
the probability of observing a sample when the parameters are fixed, whereas the likelihood
treats the observed sample as fixed and measures how well each candidate parameter value
explains it.
2. Properties of Maximum Likelihood Estimates. MLEs have several desirable properties,
especially for very large sample sizes. Some of these are:
o Likelihood functions are very efficient for testing hypotheses about models and parameters.
o The estimates approach unbiased, minimum-variance estimators as the sample size increases.
o They have approximately normal distributions.
3. Deriving the Likelihood Function. Assume a random sample x1, x2, x3, … ,xn with
joint probability density denoted by:
f(x1, x2, x3, … ,xn|θ)

where θ is a parameter of the distribution with unknown value.

We need to find the most likely value of the parameter θ given the set of observations.
To do this, we use the likelihood function, which is defined as:
L(θ) = f(x1, x2, x3, … ,xn|θ)
and is considered as a function of θ.

The maximum likelihood estimate of θ is the value of θ that maximizes L(θ), that is, the
value of θ that makes the observed data set most likely.
If we assume that the observations are independent and identically distributed, we can
split f(x1, x2, x3, … ,xn|θ) into a product of univariate densities:
L(θ) = f(x1, x2, x3, … ,xn|θ) = f(x1|θ) × f(x2|θ) × f(x3|θ) × … × f(xn|θ)

4. Log Likelihood. Maximizing the likelihood function derived above can be a complex
operation because it is a product of many terms. To work around this, we use the fact that
the logarithm is a monotonically increasing function, so maximizing the logarithm of the
likelihood function is equivalent to maximizing the likelihood function itself. The logarithm
also turns the product into a sum.
This is given as:

$$\log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$$


The value of θ obtained by maximizing this function is known as the
'maximum likelihood estimate' of the parameter.
5. Applications of Maximum Likelihood Estimation. MLE can be applied in different statistical
models, including linear and generalized linear models, exploratory and confirmatory analysis,
communication systems, econometrics, and signal detection.
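
As an illustration of the log-likelihood idea, the sketch below fits a normal distribution to a small sample. The sample values are hypothetical, and the choice of a normal model is an assumption made only for this example. For a normal model the maximizing values have a closed form: the sample mean, and the standard deviation computed with a divisor of n.

```python
import math


def normal_log_likelihood(xs, mu, sigma):
    """Log-likelihood of i.i.d. observations xs under a Normal(mu, sigma^2) model."""
    n = len(xs)
    return (-0.5 * n * math.log(2 * math.pi * sigma ** 2)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))


def normal_mle(xs):
    """Closed-form maximum likelihood estimates (mu_hat, sigma_hat)."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # The MLE of the variance divides by n, not by n - 1.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in xs) / n)
    return mu_hat, sigma_hat


samples = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]  # hypothetical measurements
mu_hat, sigma_hat = normal_mle(samples)
best = normal_log_likelihood(samples, mu_hat, sigma_hat)

print("mu_hat =", round(mu_hat, 4), "sigma_hat =", round(sigma_hat, 4))
# Any other parameter value gives a log-likelihood that is no higher.
print(best >= normal_log_likelihood(samples, mu_hat + 0.1, sigma_hat))
```

When no closed form exists, the same log-likelihood function would be handed to a numerical optimizer instead.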

4.3 Maximum apriori algorithm


The Apriori algorithm is used to calculate the association
rules between objects, that is, how two or more objects are related to one another. In other
words, the Apriori algorithm is an association rule learning method that analyzes, for
example, whether people who bought product A also bought product B.
The primary objective of the Apriori algorithm is to create association rules between
different objects. It is a frequent pattern mining algorithm.
The Apriori algorithm uses frequent itemsets to generate association rules, and
it is designed to work on databases that contain transactions. With the help of these
association rules, it determines how strongly or how weakly two objects are connected. The
algorithm uses a breadth-first search and a hash tree to calculate the itemset associations
efficiently. It is an iterative process for finding the frequent itemsets in a large dataset.
It is mainly used for market basket analysis, where it helps to find products that are likely
to be bought together. It can also be used in the healthcare field to find drug reactions for
patients.

What is Frequent Itemset?


Frequent itemsets are those itemsets whose support is at least the threshold value,
that is, the user-specified minimum support. If {A, B} is a frequent itemset, then
A and B must also be frequent individually.
Suppose there are two transactions, A = {1,2,3,4,5} and B = {2,3,7}. Items 2 and 3
appear in both transactions, so with a minimum support of 2, {2, 3} is a frequent itemset.
Steps for Apriori Algorithm
Below are the steps for the Apriori algorithm:
Step-1: Determine the support of the itemsets in the transactional database, and select
the minimum support and confidence.
Step-2: Keep all itemsets whose support value is at least the
minimum or selected support value.
Step-3: Find all the rules over these itemsets that have a confidence value of at least the
threshold or minimum confidence.
Step-4: Sort the rules in decreasing order of lift.


Apriori Algorithm Working


We will understand the apriori algorithm using an example and mathematical
calculation:
Example: Suppose we have the following dataset that has various transactions, and from this
dataset, we need to find the frequent itemsets and generate the association rules using the
Apriori algorithm:

Solution:
Step-1: Calculating C1 and L1:
o In the first step, we create a table that contains the support count (the frequency of
each itemset individually in the dataset) of each itemset in the given dataset. This table
is called the candidate set, or C1.

o Now, we keep all the itemsets whose support count is at least the
minimum support (2). This gives us the table for the frequent itemset L1.
All the itemsets have a support count greater than or equal to the minimum support
except E, so the itemset E is removed.


Step-2: Candidate Generation C2, and L2:


o In this step, we generate C2 with the help of L1. In C2, we form all pairs of
the itemsets in L1.
o After creating the pairs, we again find the support count from the main
transaction table of the dataset, i.e., how many times these pairs occur together
in the given dataset. This gives the table below for C2:

o Again, we compare the C2 support count with the minimum support count.
Itemsets with a lower support count are eliminated from
table C2, which gives the table below for L2.

Step-3: Candidate generation C3, and L3:


o For C3, we repeat the same two processes, but now we form the C3 table with
subsets of three items together, and calculate the support count from the
dataset. This gives the table below:

o Now we create the L3 table. As the C3 table shows, there is only
one combination of items whose support count equals the minimum support
count. So L3 has only one combination, i.e., {A, B, C}.
Step-4: Finding the association rules for the subsets:
To generate the association rules, we first create a new table with the possible
rules from the combination {A, B, C}. For each rule A → B, we calculate the
confidence using the formula sup(A ^ B)/sup(A). After calculating the confidence value for
all rules, we exclude the rules whose confidence is below the minimum threshold (50%).


Consider the table below:

Rules        Support    Confidence
A ^ B → C    2          sup(A ^ B ^ C)/sup(A ^ B) = 2/4 = 0.5 = 50%
B ^ C → A    2          sup(A ^ B ^ C)/sup(B ^ C) = 2/4 = 0.5 = 50%
A ^ C → B    2          sup(A ^ B ^ C)/sup(A ^ C) = 2/4 = 0.5 = 50%
C → A ^ B    2          sup(A ^ B ^ C)/sup(C) = 2/5 = 0.4 = 40%
A → B ^ C    2          sup(A ^ B ^ C)/sup(A) = 2/6 = 0.33 = 33.33%
B → A ^ C    2          sup(A ^ B ^ C)/sup(B) = 2/7 = 0.29 = 28.57%

As the given threshold or minimum confidence is 50%, the first three rules, A ^ B →
C, B ^ C → A, and A ^ C → B, can be considered strong association rules for the given
problem.
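
The level-wise procedure (C1, L1, C2, L2, and so on, followed by rule generation) can be sketched in Python as follows. This is a minimal illustration rather than an optimized implementation: it scans the transaction list directly instead of using a hash tree, and the grocery transactions are a hypothetical dataset, not the one from the example above. The function names apriori and rules are also just illustrative.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support_count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate in C_k and keep those reaching min_support (L_k).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Build C_(k+1): join frequent k-itemsets, then prune any candidate
        # that has an infrequent k-item subset.
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in unions
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) for rules meeting min_confidence."""
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence


# Hypothetical transactions, used only for illustration.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "butter"},
]

frequent = apriori(TRANSACTIONS, min_support=2)
strong = list(rules(frequent, min_confidence=0.5))
for antecedent, consequent, confidence in strong:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 2))
```

The pruning step is where the Apriori property is used: a candidate is discarded without counting if any of its smaller subsets was already found to be infrequent.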

4.3 Bayesian belief networks


A Bayesian belief network is a key computer technology for dealing with probabilistic
events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and
their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian
model. Bayesian networks are probabilistic because they are built from
a probability distribution and use probability theory for prediction and anomaly
detection. Real-world applications are probabilistic in nature, and a Bayesian network lets
us represent the relationships between multiple events. It can be used in
various tasks including prediction, anomaly detection, diagnostics, automated insight,
reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinion,
and it consists of two parts:
o Directed acyclic graph
o Table of conditional probabilities

The generalized form of a Bayesian network that represents and solves decision
problems under uncertain knowledge is known as an influence diagram.


A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can
be continuous or discrete.
o Arcs, or directed arrows, represent the causal relationships or conditional probabilities
between random variables. These directed links connect pairs of nodes
in the graph.
A link indicates that one node directly influences the other. If
there is no directed link between two nodes, neither node directly influences the other.
o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we consider node B, which is connected to node A by a directed
arrow, then node A is called the parent of node B.
o Node C is independent of node A.

The Bayesian network has two main components:

o Causal component
o Actual numbers
Each node in the Bayesian network has a conditional probability
distribution P(Xi |Parent(Xi) ), which determines the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional
probability, so let's first understand the joint probability distribution:


Joint probability distribution:
If we have variables x1, x2, x3,....., xn, then the probabilities of the different combinations
of x1, x2, x3,....., xn are known as the joint probability distribution.
Using the chain rule, the joint probability P[x1, x2, x3,....., xn] can be written as a product
of conditional probabilities:
P[x1, x2, x3,....., xn] = P[x1| x2, x3,....., xn]P[x2, x3,....., xn]
= P[x1| x2, x3,....., xn]P[x2|x3,....., xn]....P[xn-1|xn]P[xn].


In general for each variable Xi, we can write the equation as:
P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor
an earthquake has occurred, and David and Sophia have both called Harry.

Solution:
o The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend
only on the alarm.
o The network encodes our assumptions that David and Sophia do not directly perceive the
burglary, do not notice a minor earthquake, and do not confer before
calling.
o The conditional distributions for each node are given as a conditional probability table,
or CPT.
o Each row in the CPT must sum to 1 because the entries in a row represent an
exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if
there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:


o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the events of the problem statement in the form of a probability, P[D, S, A, B, E],
and rewrite it using the joint probability distribution:
P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P [D| A]. P [ S| A, B, E]. P[ A, B, E]
= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]
= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]
= P[D | A ]. P[S | A]. P[A| B, E]. P[B]. P[E]
The last step holds because burglary and earthquake have no parents in the network, so B is
independent of E.

Let's take the observed probabilities for the burglary and earthquake components:
P(B= True) = 0.002, which is the probability of a burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake.
P(E= False)= 0.999, which is the probability that no earthquake occurred.
We can provide the conditional probabilities as per the tables below:

Conditional probability table for Alarm A:


The conditional probability of Alarm A depends on Burglary and Earthquake:

B        E        P(A= True)    P(A= False)
True     True     0.94          0.06
True     False    0.95          0.05
False    True     0.31          0.69
False    False    0.001         0.999

Conditional probability table for David Calls:


The conditional probability that David calls depends on the state of the Alarm.

A        P(D= True)    P(D= False)
True     0.91          0.09
False    0.05          0.95


Conditional probability table for Sophia Calls:


The conditional probability that Sophia calls depends on its parent node, "Alarm".

A        P(S= True)    P(S= False)
True     0.75          0.25
False    0.02          0.98

From the formula of the joint distribution, we can write the problem statement in the
form of a probability distribution:

P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E)
= 0.75* 0.91* 0.001* 0.998*0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
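
The same product can be evaluated for any combination of values with a few lines of Python. The sketch below stores only P(node = True | parents) for each node, taken from the tables above, and derives the False case as one minus that value. The helper names p and joint are illustrative.

```python
# P(node = True | parents), taken from the tables above.
P_B = 0.002                       # burglary
P_E = 0.001                       # earthquake
P_A = {                           # alarm, keyed by (burglary, earthquake)
    (True, True): 0.94,
    (True, False): 0.95,
    (False, True): 0.31,
    (False, False): 0.001,
}
P_D = {True: 0.91, False: 0.05}   # David calls, keyed by alarm
P_S = {True: 0.75, False: 0.02}   # Sophia calls, keyed by alarm


def p(prob_true, value):
    """Probability that a Boolean variable takes `value`, given P(True)."""
    return prob_true if value else 1.0 - prob_true


def joint(d, s, a, b, e):
    """P(D=d, S=s, A=a, B=b, E=e) = P(d|a) P(s|a) P(a|b,e) P(b) P(e)."""
    return (p(P_D[a], d) * p(P_S[a], s) * p(P_A[(b, e)], a)
            * p(P_B, b) * p(P_E, e))


answer = joint(d=True, s=True, a=True, b=False, e=False)
print(round(answer, 8))
```

Summing joint over all 32 combinations of the five Boolean variables gives 1, which is a quick sanity check on the factorization.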

The semantics of Bayesian Network:


There are two ways to understand the semantics of a Bayesian network:
1. To understand the network as a representation of the joint probability distribution.
This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional independence
statements.

4.4 Probabilistic models of problems


Mathematics is the foundation of Machine Learning, and its branches such as
Linear Algebra, Probability, and Statistics can be considered as integral parts of ML. In machine
learning, there are probabilistic models as well as non-probabilistic models. In order to have a
better understanding of probabilistic models, the knowledge about basic concepts of
probability such as random variables and probability distributions will be beneficial.

What are Probabilistic Machine Learning Models?


To understand what a probabilistic machine learning model is, let's consider a
classification problem with N classes. If the classification model (classifier) is probabilistic,
then for a given input it provides a probability for each of the N classes as the output. In other
words, a probabilistic classifier provides a probability distribution over the N classes.
Usually, the class with the highest probability is then selected as the class to which the input
data instance belongs.
Some examples of probabilistic models are Logistic Regression, Bayesian
Classifiers, Hidden Markov Models, and Neural Networks (with a Softmax output layer).


If the model is non-probabilistic (deterministic), it will usually output only the most likely class
that the input data instance belongs to. The vanilla Support Vector Machine is a popular
non-probabilistic classifier.
Let's discuss an example to better understand probabilistic classifiers. Take the
task of classifying an image of an animal into five classes, {Dog, Cat, Deer, Lion, Rabbit}, as
the problem. As input, we have an image of a dog. For this example, let's assume that the
classifier works well and provides correct or acceptable results for this particular input.
When the image is provided as the input to the probabilistic classifier, it will provide
an output such as (Dog (0.6), Cat (0.2), Deer (0.1), Lion (0.04), Rabbit (0.06)). If the classifier
is non-probabilistic, it will only output "Dog".
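
The difference between the two kinds of output can be sketched with a softmax function, which is one common way (mentioned above for neural networks) of turning raw class scores into a probability distribution. The raw scores below are hypothetical values chosen for illustration; they are not meant to reproduce the probabilities quoted in the example.

```python
import math


def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    m = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


CLASSES = ["Dog", "Cat", "Deer", "Lion", "Rabbit"]
scores = [2.0, 0.9, 0.2, -0.7, -0.3]       # hypothetical raw scores for one image

# Probabilistic output: one probability per class, summing to 1.
probabilities = dict(zip(CLASSES, softmax(scores)))
# Non-probabilistic style output: only the most likely class.
label = max(probabilities, key=probabilities.get)

for name, prob in probabilities.items():
    print(name, round(prob, 3))
print("Predicted class:", label)
```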

Objective functions:
To identify whether a particular model is probabilistic or not, we can look
at its objective function. In machine learning, we aim to optimize a model to excel at a
particular task. The objective function provides a value based on the
model's outputs, so that optimization can be done by either maximizing or minimizing that
value.
In machine learning, the goal is usually to minimize prediction error. So we
define what is called a loss function as the objective function and try to minimize it
in the training phase of an ML model.
If we take a basic machine learning model such as Linear Regression, the objective
function is based on the squared error. The objective of the training is to minimize the Mean
Squared Error (MSE) or Root Mean Squared Error (RMSE). The intuition behind the Mean
Squared Error is that the loss created by a prediction for a particular data point is based
on the difference between the actual value and the predicted value.
The loss created by a particular data point will be higher if the prediction given by the
model is significantly higher or lower than the actual value. The loss will be lower when the
predicted value is very close to the actual value. As you can see, the objective function here is
not based on probabilities, but on the squared difference between the actual
value and the predicted value.

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{\text{true},i} - y_{\text{predict},i}\right)^2$$

Here, n indicates the number of data instances in the data set, y_true is the correct/ true value
and y_predict is the predicted value (by the linear regression model).

The intuition behind Cross-Entropy Loss is that if the probabilistic model is able to predict
the correct class of a data point with high confidence, the loss will be low. In the image
classification example discussed above, if the model gives a probability of 1.0 to the class ‘Dog’
(which is the correct class), the loss due to that prediction = -log(P(‘Dog’)) = -log(1.0) = 0.
If instead the predicted probability for the ‘Dog’ class is 0.8, the loss = -log(0.8) = 0.097. However,
if the model gives a low probability to the correct class, such as 0.3, the loss = -log(0.3)
= 0.523, which can be considered a significant loss. (These values use base-10 logarithms;
with natural logarithms they are 0.223 and 1.204.)

In a binary classification model based on Logistic Regression, the loss function is usually
defined using the Binary Cross Entropy loss (BCE loss).

$$\text{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i \log p(s_i) + (1 - y_i)\log\big(1 - p(s_i)\big)\Big]$$

Here y_i is the class label (1 for class 1, 0 for class 0) and p(s_i) is the predicted
probability of a point being class 1 for each point ‘i’ in the dataset. N is the number of data
points. Note that as this is a binary classification problem, there are only two classes, class 1
and class 0.
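
The three loss functions discussed in this section can be written out in a few lines of Python. This is a sketch for checking the numbers by hand; the function names and the small input lists are illustrative. The cross-entropy helper takes the logarithm base as a parameter, because the worked values 0.097 and 0.523 correspond to base-10 logarithms.

```python
import math


def mean_squared_error(y_true, y_predict):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_predict)) / len(y_true)


def cross_entropy(p_correct, base=math.e):
    """Loss for one example: -log of the probability given to the correct class."""
    return -math.log(p_correct, base)


def binary_cross_entropy(y, p):
    """Mean BCE over labels y (0 or 1) and predicted probabilities p of class 1."""
    n = len(y)
    return -sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                for yi, pi in zip(y, p)) / n


print(round(cross_entropy(0.8, base=10), 3))   # confident, correct prediction
print(round(cross_entropy(0.3, base=10), 3))   # low probability for the correct class
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))
```

Note that binary_cross_entropy is undefined when a predicted probability is exactly 0 or 1; practical implementations usually clip the probabilities slightly away from those values.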

4.5 Probability density estimation


Probability Density: Assume a random variable x that has a probability distribution
p(x). The relationship between the outcomes of a random variable and their probabilities is
referred to as the probability density.
The problem is that we don't always know the full probability distribution of a
random variable, because we usually observe only a small sample of its outcomes.
Estimating the density of the whole population from
a random sample of observations is referred to as probability density estimation.

Probability Density Function (PDF)


A PDF is a function that gives the probability of a continuous random variable
falling within a particular range of values rather than taking a single value. The
probability of a range is the area under the PDF over that range.
By definition, if X is any continuous random variable, then the function f(x) is called
a probability density function if f(x) ≥ 0 for all x, the total area under f(x) is 1, and:

$$P(a \le X \le b) = \int_a^b f(x)\,dx$$

where,
a -> lower limit
b -> upper limit
X -> continuous random variable
f(x) -> probability density function


Steps Involved:
Step 1 - Create a histogram for the random set of observations to understand the
density of the random sample.

Step 2 - Create the probability density function and fit it to the random sample.
Observe how it fits the histogram plot.

Step 3 - Now iterate steps 1 and 2 in the following manner:

3.1 - Calculate the distribution parameters.
3.2 - Calculate the PDF for the random sample distribution.
3.3 - Observe the resulting PDF against the data.
3.4 - Transform the data until it best fits the distribution.
After fitting, the histograms of different random samples should largely match the
histogram plot of the whole population.

Density Estimation: It is the process of finding out the density of the whole population by
examining a random sample of data from that population. One of the best ways to achieve
a density estimate is by using a histogram plot.

Parametric Density Estimation


A normal distribution has two parameters, the mean and the standard deviation. We
calculate the sample mean and standard deviation of a random sample taken from the
population and use them to estimate the density. The approach is termed
‘parametric’ because the relation between the observations and their
probability is determined by the values of these two parameters.
It is important to understand that the mean and standard deviation of the
random sample will generally not be exactly the same as those of the whole population,
because the sample is small. A sample plot for parametric density estimation is shown below.
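
A minimal sketch of parametric density estimation with Python's standard library is given here. It assumes the data are roughly normal and uses statistics.NormalDist, whose from_samples method estimates the mean and the sample standard deviation. The sample values are hypothetical.

```python
from statistics import NormalDist

# Hypothetical random sample drawn from some population.
sample = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0, 5.2, 4.7]

# Estimate the two parameters (mean and standard deviation) from the sample.
fitted = NormalDist.from_samples(sample)

density_at_5 = fitted.pdf(5.0)                       # estimated density at x = 5.0
prob_between = fitted.cdf(5.5) - fitted.cdf(4.5)     # P(4.5 <= X <= 5.5)

print("mean =", round(fitted.mean, 3), "stdev =", round(fitted.stdev, 3))
print("density at 5.0 =", round(density_at_5, 3))
print("P(4.5 <= X <= 5.5) =", round(prob_between, 3))
```

The probability of a range is obtained from the difference of two cumulative values, which is the area under the fitted PDF between the limits.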

Nonparametric Density Estimation


In some cases, the PDF may not fit the random sample because the sample doesn't follow a
normal distribution (e.g., instead of one peak there are multiple peaks in the graph). Here,
instead of using distribution parameters like the mean and standard deviation, an algorithm is
used to estimate the probability distribution directly from the data. This is known as
‘nonparametric density estimation’.
One of the most common nonparametric approaches is Kernel Density
Estimation. Its objective is to calculate the unknown density fh(x) using the equation
given below:

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where,
K -> kernel (non-negative function)
h -> bandwidth (smoothing parameter, h > 0)
Kh -> scaled kernel, Kh(x) = (1/h) K(x/h)
fh(x) -> density (to calculate)
n -> number of observations in the random sample
x_i -> the i-th observation in the random sample
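
The kernel density estimate can be computed directly from its definition. The sketch below assumes a Gaussian kernel, which is one common choice but not the only one, and uses a hypothetical sample with two clusters so that the estimate has two peaks. The bandwidth h = 0.5 is likewise an arbitrary value chosen for illustration.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


# Hypothetical sample with two clusters (around 1 and around 4).
sample = [1.0, 1.2, 0.9, 4.0, 4.3, 3.8]
h = 0.5

for x in (1.0, 2.5, 4.0):
    print(x, round(kde(x, sample, h), 3))
```

The estimate is high near each cluster and low in the gap between them, which a single fitted normal distribution could not represent. A smaller h gives a spikier estimate and a larger h a smoother one.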

4.6 Sequence model


Traditional machine learning assumes that data points are independent
and identically distributed. However, in many cases, such as with language, voice, and
time-series data, one data item depends on those that come before or after it.
This type of information is called sequence data. Sequence models are machine learning
models designed to learn from such data.

What is The Sequential Learning?


Machine learning models that input or output data sequences are known as
sequence models. Text streams, audio clips, video clips, and time-series data
are examples of sequential data. Recurrent Neural Networks (RNNs) are
a well-known method in sequence models.
The analysis of sequential data such as text sentences, time series, and other
discrete sequence data prompted the development of sequence models. These models are
better suited to handling sequential data, whereas Convolutional Neural Networks are better
suited to spatial data.
The crucial point to remember about sequence models is that the data we're
working with are no longer independently and identically distributed (i.i.d.) samples;
the data depend on one another because of their sequential order. Sequence
models are particularly popular for speech recognition,
voice recognition, time series prediction, and natural language processing.

What is Sequential Data?


When the points in a dataset depend on the other points in the dataset,
the data is termed sequential. A time series is a common example of this, with each point
reflecting an observation at a certain point in time, such as a stock price or sensor data.
Text sequences, DNA sequences, and meteorological data are further examples of sequential data.
Video data, audio data, and, to some
extent, images can also be treated as sequential data. Below are a few basic examples of
sequential data.

The items listed below are some popular machine learning applications that are based on
sequential data:
• Time series: predicting time series, such as stock market projections.
• Natural language processing: text mining and sentiment analysis
(e.g., learning word vectors for sentiment analysis).
• Machine translation: given an input in one language, sequence models are used to
translate it into one or more other languages.
• Image captioning: assessing the current action in an image and creating a caption
for it.
• Speech recognition with deep recurrent neural networks.
• Music generation: recurrent neural networks are being used to create classical music.
• Predicting transcription factor binding sites from DNA sequences with recurrent
neural networks.
The different types of tasks that can be handled using RNNs are:

One-to-one
With one input and one output, this is the classic feed-forward neural network
architecture.


One-to-many
This is used for image captioning. We have one fixed-size image as input, and
the output can be words or phrases of varying lengths.
Many-to-one
This is used for sentiment classification. A sequence of words or even paragraphs of
words is expected as input. The result can be a continuous-valued regression output that
represents the likelihood of a favourable sentiment.
Many-to-many
This paradigm is suitable for machine translation, such as that seen on Google
Translate. The input could be a variable-length English sentence, and the output could be
a variable-length sentence in a different language. A second kind of many-to-many
model, which produces an output for every input step, can be used for frame-by-frame
video classification.
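
The many-to-one and many-to-many patterns can be illustrated with a toy recurrent cell. The sketch below is a deliberately simplified scalar version of the standard recurrence h_t = tanh(w_x * x_t + w_h * h_(t-1) + b); the weights are fixed, hypothetical values rather than learned ones, and a real RNN would use vectors and matrices.

```python
import math


def rnn_states(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a scalar recurrent cell over a sequence and return every hidden state.

    Each state depends on the current input and on the previous state,
    which is how information from earlier steps is carried forward.
    """
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


sequence = [1.0, 0.5, -0.2, 0.3]       # hypothetical input sequence
states = rnn_states(sequence)

many_to_many = states                  # one output per input step
many_to_one = states[-1]               # a single output after reading the whole sequence

print([round(s, 3) for s in many_to_many])
print(round(many_to_one, 3))
```

Because each state feeds into the next, the final state depends on the order of the inputs, unlike a model that treats the data points as independent.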

4.6 Markov model


A Markov model is a stochastic method for randomly changing systems that possess
the Markov property. This means that, at any given time, the next state is only dependent on
the current state and is independent of anything in the past. Two commonly applied types of
Markov model are used when the system being represented is autonomous -- that is, when
the system isn't influenced by an external agent. These are as follows:
1. Markov chains. These are the simplest type of Markov model and are used to
represent systems where all states are observable. Markov chains show all
possible states, and between states, they show the transition rate, which is
the probability of moving from one state to another per unit of time. Applications
of this type of model include prediction of market crashes, speech recognition and
search engine algorithms.
2. Hidden Markov models. These are used to represent systems with some
unobservable states. In addition to showing states and transition rates, hidden
Markov models also represent observations and observation likelihoods for each
state. Hidden Markov models are used for a range of applications, including
thermodynamics, finance and pattern recognition.

Another two commonly applied types of Markov model are used when the system
being represented is controlled -- that is, when the system is influenced by a decision-making
agent. These are as follows:
1. Markov decision processes. These are used to model decision-making in discrete,
stochastic, sequential environments. In these processes, an agent makes decisions
based on reliable information. These models are applied to problems in artificial
intelligence (AI), economics and behavioral sciences.
2. Partially observable Markov decision processes. These are used in cases like
Markov decision processes but with the assumption that the agent doesn't always
have reliable information. Applications of these models include robotics, where it
isn't always possible to know the location. Another application is machine
maintenance, where reliable information on machine parts can't be obtained
because it's too costly to shut down the machine to get the information.

How is Markov analysis applied?


Markov analysis is a probabilistic technique that uses Markov models to predict the
future behavior of some variable based on the current state. Markov analysis is used in many
domains, including the following:
• Markov chains are used for several business applications, including predicting
customer brand switching for marketing, predicting how long people will remain
in their jobs for human resources, predicting time to failure of a machine in
manufacturing, and forecasting the future price of a stock in finance.
• Markov analysis is also used in natural language processing (NLP) and in machine
learning. For NLP, a Markov chain can be used to generate a sequence of words
that form a complete sentence, or a hidden Markov model can be used for named-
entity recognition and tagging parts of speech. For machine learning, Markov
decision processes are used to represent reward in reinforcement learning.
• A recent example of the use of Markov analysis in healthcare was in Kuwait.
A continuous-time Markov chain model was used to determine the optimal timing
and duration of a full COVID-19 lockdown in the country, minimizing both new
infections and hospitalizations. The model suggested that a 90-day lockdown
beginning 10 days before the epidemic peak was optimal.

How are Markov models represented?


The simplest Markov model is a Markov chain, which can be expressed in equations,
as a transition matrix or as a graph. A transition matrix is used to indicate the probability of
moving from each state to each other state. Generally, the current states are listed in rows,
and the next states are represented as columns. Each cell then contains the probability of
moving from the current state to the next state. For any given row, all the cell values must
then add up to one.
A graph consists of circles, each of which represents a state, and directional arrows to
indicate possible transitions between states. The directional arrows are labeled with the
transition probability. The transition probabilities on the directional arrows coming out of any
given circle must add up to one.
Other Markov models are based on the chain representations but with added
information, such as observations and observation likelihoods.
The transition matrix below represents shifting gears in a car with a manual
transmission. Six states are possible, and a transition from any given state to any other state
depends only on the current state -- that is, where the car goes from second gear isn't
influenced by where it was before second gear. Such a transition matrix might be built from
empirical observations that show, for example, that the most probable transitions from first
gear are to second or neutral.
The image below represents the toss of a coin. Two states are possible: heads
and tails. The transition from heads to heads or heads to tails is equally probable (.5) and is
independent of all preceding coin tosses.
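
The coin-toss chain above can also be written as a transition matrix. The following is a minimal sketch, not part of the original notes; the helper names `is_valid_transition_matrix` and `step` are illustrative assumptions. It checks the rule that every row must add up to one and advances a distribution over states by one transition.

```python
# Transition matrix for the coin-toss chain: rows are the current state,
# columns are the next state, in the order [heads, tails].
STATES = ["heads", "tails"]
P = [
    [0.5, 0.5],  # from heads
    [0.5, 0.5],  # from tails
]


def is_valid_transition_matrix(matrix, tol=1e-9):
    """Every entry is a probability and every row sums to one."""
    return all(
        all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) < tol
        for row in matrix
    )


def step(distribution, matrix):
    """Advance a distribution over states by one transition."""
    n = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(n))
        for j in range(n)
    ]


start = [1.0, 0.0]          # the chain starts in "heads"
after_one = step(start, P)  # distribution over states after one toss
```

Because both rows are identical, the distribution after one toss does not depend on the starting state, which is what independence of preceding tosses means here.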

4.7 Hidden Markov model


A Hidden Markov Model (HMM) is a statistical model that is widely used in machine
learning. It describes the evolution of observable events that depend on internal
factors which are not directly observable. HMMs are a class of probabilistic graphical
models that allow us to predict a sequence of unknown (hidden) variables from a set of
observed variables.
In an HMM, an observed event does not reveal the underlying state directly. Instead, each
hidden state is associated with a probability distribution over the possible observations.
The system being modelled is assumed to be a Markov chain whose states are hidden, and the
observations form a second process that depends on this hidden chain.
The main goal of an HMM is to learn about the hidden Markov chain by observing its
outputs. Consider a hidden Markov process X and an observable process Y. The HMM requires
that, at each time step, the probability distribution of Y depends only on the state of X
at that time and not on the earlier history of X.
In many ML problems, we assume the sampled data is i.i.d. This simplifies the maximum
likelihood estimation (MLE) and makes the math much simpler to solve. But for the time
sequence model, states are not completely independent. If I am happy now, I will be more
likely to stay happy tomorrow.
In many ML problems, the states of a system may not be observable or fully observable.
But we can get insights about this internal state through the observables. For example, if I am
happy, there is a 40% chance that I will go to a party. But there is a 10% chance that I will be
found at a party when I am sad too. With HMM, we determine the internal state (happy or
sad) by making observations — where I was found.
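
As a small worked example of inferring the internal state from one observation, the sketch below applies Bayes' theorem to the party example. The 40% and 10% figures come from the text; the prior probability of being happy (0.5) is an assumption added for illustration, and `posterior_happy` is a hypothetical helper name.

```python
def posterior_happy(p_party_given_happy, p_party_given_sad, prior_happy):
    """P(happy | found at a party) by Bayes' theorem."""
    prior_sad = 1.0 - prior_happy
    evidence = p_party_given_happy * prior_happy + p_party_given_sad * prior_sad
    return p_party_given_happy * prior_happy / evidence


# 0.4 and 0.1 are from the text; the prior of 0.5 is assumed.
p_happy = posterior_happy(0.4, 0.1, prior_happy=0.5)
```

With these numbers, being found at a party raises the probability of "happy" from the assumed 0.5 to 0.8.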


An HMM models a process whose hidden state follows a Markov process.

• It includes the initial state distribution π (the probability distribution of the initial
state).
• It includes the transition probabilities A from one state (xt) to another.
• It also contains the likelihood B of the observation (yt) given a hidden state.
Matrix B is called the emission probabilities. It gives the probability of
an observation given a specific internal state.
To explain it further, take the example of two friends, Rahul and Ashok. Rahul plans his
daily activities according to the weather. His three main activities are going
jogging, going to the office, and cleaning his residence. What Rahul does today depends
on the weather, and whatever he does, he tells Ashok. Ashok has no direct information
about the weather, but he can infer the weather conditions from Rahul's activity.

Ashok believes that the weather operates as a discrete Markov chain with only two
states: Rainy and Sunny. Ashok cannot observe the weather directly, so the weather
states are hidden from him. On each day, there is a certain chance that Rahul will
perform one activity from the set {"jog", "work", "clean"}, depending on the weather.
Since Rahul tells Ashok what he has done, those activities are the observations. The
entire system is a hidden Markov model (HMM).

Here the parameters of the HMM are known to Ashok, because he has general information
about the weather and he also knows what Rahul likes to do on average.

Suppose Rahul calls Ashok and says that he has cleaned his residence. Ashok will then
believe that a rainy day is more likely. Ashok's belief about the weather on the first
day, before any observation, is the start probability of the HMM, for example as follows.


The states and observations are:

states = ('Rainy', 'Sunny')

observations = ('jog', 'work', 'clean')

And the start probability is:

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}

The start distribution puts more weight on the Rainy state. The weather on the next day
depends on the weather today, and the probabilities of the next day's weather states are
as follows:

transition_probability = {
    'Rainy' : {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny' : {'Rainy': 0.4, 'Sunny': 0.6},
}

These are the transition probabilities: they describe how the weather changes from one
day to the next. Given the weather on a day, the probabilities of the activity that
Rahul will perform are:

emission_probability = {
    'Rainy' : {'jog': 0.1, 'work': 0.4, 'clean': 0.5},
    'Sunny' : {'jog': 0.6, 'work': 0.3, 'clean': 0.1},
}

These are the emission probabilities. Using the emission probabilities, Ashok can infer
the likely state of the weather from Rahul's activity, and using the transition
probabilities together with the emission probabilities he can predict the activity that
Rahul is likely to perform the next day.


The image below shows the HMM for this example.

From the above intuition and example we can understand how this probabilistic model is
used to make predictions. Depending on the situation, we usually ask three different
types of questions regarding an HMM problem.
• Likelihood: How likely are the observations under the current model, or what is the
probability of being in a given state at a specific time step?
• Decoding: Find the most likely internal state sequence given the current model and
the observations.
• Learning: Learn the parameters of the HMM from data.
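
The first two questions can be sketched for the Rahul and Ashok example. This is a minimal illustration, not part of the original notes: it assumes the standard forward algorithm for the likelihood question and the Viterbi algorithm for the decoding question, neither of which the notes name, and the function names are illustrative. The probability tables are the ones given above. The learning question (usually handled with the Baum-Welch algorithm) is not shown.

```python
states = ("Rainy", "Sunny")
start_probability = {"Rainy": 0.6, "Sunny": 0.4}
transition_probability = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emission_probability = {
    "Rainy": {"jog": 0.1, "work": 0.4, "clean": 0.5},
    "Sunny": {"jog": 0.6, "work": 0.3, "clean": 0.1},
}


def forward_likelihood(observations):
    """Likelihood: P(observations), summed over all hidden state paths."""
    alpha = {
        s: start_probability[s] * emission_probability[s][observations[0]]
        for s in states
    }
    for obs in observations[1:]:
        alpha = {
            s: emission_probability[s][obs]
            * sum(alpha[prev] * transition_probability[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(observations):
    """Decoding: probability and path of the most likely hidden sequence."""
    best = {
        s: (start_probability[s] * emission_probability[s][observations[0]], [s])
        for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Best predecessor for state s at this time step.
            prob, path = max(
                (
                    (best[prev][0] * transition_probability[prev][s], best[prev][1])
                    for prev in states
                ),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emission_probability[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


likelihood = forward_likelihood(["jog", "work", "clean"])
best_prob, best_path = viterbi(["jog", "work", "clean"])
```

For the single observation "clean", the most likely state is Rainy (0.6 × 0.5 = 0.30 against 0.4 × 0.1 = 0.04), which matches Ashok's intuition in the example.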

Application of Hidden Markov Model


HMMs are used in applications that aim to recover a data sequence that cannot be
observed directly, when other data that depend on that sequence are available. Taking the
above intuition into account, HMMs can be used in the following applications:
• Computational finance
• speed analysis
• Speech recognition
• Speech synthesis
• Document separation in scanning solutions
• Machine translation
• Handwriting recognition
• Time series analysis
• Activity recognition
• Sequence classification
• Transportation forecasting


UNIT – V
Neural Networks and Deep Learning
5.1 Neural networks
Neural networks are among the most influential techniques in machine learning. They can
handle problems that are hard to solve with explicitly programmed rules, such as:
• Medical Diagnosis
• Face Detection
• Voice Recognition
Neural networks are the foundation of Deep Learning.

The Deep Learning Revolution


The deep learning revolution started around 2010. Since then, Deep Learning has
solved many "unsolvable" problems. The deep learning revolution was not started by a single
discovery. It more or less happened when several needed factors were ready:
• Computers were fast enough
• Computer storage was big enough
• Better training methods were invented
• Better tuning methods were invented

Neurons
The human brain is estimated to contain roughly 86 to 100 billion neurons, with
trillions of connections (synapses) between them.

Neurons (aka Nerve Cells) are the fundamental units of our brain and nervous system.
The neurons are responsible for receiving input from the external world, for sending output
(commands to our muscles), and for transforming the electrical signals in between.


Neural Networks
Artificial Neural Networks are normally called Neural Networks (NN). Neural
networks are in fact multi-layer Perceptrons. The perceptron defines the first step into multi-
layered neural networks.

The Neural Network Model


Input data (Yellow) are processed against a hidden layer (Blue) and modified against
another hidden layer (Green) to produce the final output (Red).

"A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E."
E: Experience (the number of times).
T: The Task (driving a car).
P: The Performance (good or bad).

The term "Artificial Neural Network" is derived from the biological neural networks that
make up the structure of the human brain. Just as the human brain has neurons
interconnected with one another, artificial neural networks have neurons that are
interconnected in the various layers of the network. These neurons are known
as nodes.

The typical Artificial Neural Network looks something like the given figure.


Dendrites from a biological neural network represent inputs in artificial neural
networks, the cell nucleus represents nodes, synapses represent weights, and the axon
represents the output.

Relationship between Biological neural network and artificial neural network:

Biological Neural Network    Artificial Neural Network
Dendrites                    Inputs
Cell nucleus                 Nodes
Synapse                      Weights
Axon                         Output

5.2 Biological motivations


In many ways the study of ANNs has been inspired by the observation that biological
learning systems (for example, a human brain) are built of very complex networks of
interconnected neurons.

Signals are transmitted between neurons by electrical pulses (spikes) travelling along
a long thin strand called the axon. These pulses are received by the receiving neuron at
terminals called synapses, which are found on a set of branches emerging from the cell body
(soma) known as dendrites. The pulses lead to chemical activity in the dendrites which may
inhibit or excite the generation of pulses in the receiving neuron, depending on the
geometry of the synapse and the type of chemical activity. The neuron sums up or integrates
the effects of thousands of impulses over its dendritic tree and over time. If the integrated
potential exceeds a threshold, the cell 'fires' and generates a spike which starts to travel
along its axon. This then initiates the whole sequence of events further in the connected
neurons.

Learning is a complex process of changing the effectiveness of the synapses so that
the influence of one neuron on another changes. Research in ANNs was inspired by
neuroscience but did not attempt to be biologically realistic in detail, because that
appeared too difficult to achieve. As a result, most ANNs are connectionist models combining
simple processing elements (also called neurons, units, or nodes). Learning in an ANN
typically consists of changing the strength of the connections (weights) between neurons. In
some types of ANNs, neurons may have local memory.

The motivation behind neural networks is the human brain. The human brain is often
described as the best processor, even though its individual neurons work much more slowly
than computer hardware. Many researchers therefore set out to build a machine that works
along the lines of the human brain.
The human brain contains billions of neurons, each connected to many other neurons to
form a network, so that when it sees an image it recognizes the image and produces an output.

• Dendrites receive signals from other neurons.
• The cell body sums the incoming signals to generate the input.
• When the sum reaches a threshold value, the neuron fires and the signal travels down
the axon to the other neurons.
• The amount of signal transmitted depends upon the strength of the connections.
• Connections can be inhibitory (decreasing strength) or excitatory (increasing
strength) in nature.


Structure of Neural Network

Artificial Neuron
Artificial neurons are also called perceptrons. An artificial neuron consists of the
following basic components:
• Input
• Weight
• Bias
• Activation Function
• Output

How does a perceptron work?

A. All the inputs X1, X2, X3, …, Xn are multiplied by their respective weights.

B. All the multiplied values are added.

C. The sum of the values is passed to the activation function.


Input layer, Hidden layer and Output layer

Input Layer - The input layer contains the inputs and their weights. Example: X1, W1, etc.
Hidden Layer - In a neural network, there can be more than one hidden layer. A hidden layer
performs the summation and applies the activation function.
Output Layer - The output layer consists of the set of results generated by the previous
layer. During training these results are compared with the desired values, i.e. the known
target outputs, and the difference can be used to improve the end results.

5.3 Perceptron
The perceptron is a machine learning algorithm for supervised learning of binary
classification tasks. A perceptron can also be understood as an artificial neuron, or neural
network unit, that computes a simple function of its input data.
The perceptron model is one of the simplest types of artificial neural networks. It is a
supervised learning algorithm for binary classifiers, and we can consider it a single-layer
neural network with four main parameters: input values, weights and bias, net sum, and an
activation function.

What is a binary classifier in Machine Learning?

In Machine Learning, a binary classifier is a function that decides whether an input,
represented as a vector of numbers, belongs to a specific class.
The perceptron is a linear binary classifier. In simple words, it is a classification
algorithm that makes its predictions with a linear predictor function that combines a set
of weights with the feature vector.


Basic Components of Perceptron


Frank Rosenblatt invented the perceptron model as a binary classifier. It contains
three main components. These are as follows:

o Input Nodes or Input Layer: - This is the primary component of the perceptron, which
accepts the initial data into the system for further processing. Each input node
contains a real numerical value.
o Weight and Bias: - The weight parameter represents the strength of the connection
between units. This is another important parameter of the perceptron.
The larger the magnitude of a weight, the more influence the associated input has in
deciding the output. The bias can be considered as the intercept term in a linear
equation.
o Activation Function: - This is the final component, which determines
whether the neuron will fire or not. In the perceptron the activation function is
primarily a step function.

Types of Activation functions:


o Sign function
o Step function, and
o Sigmoid function

The data scientist chooses the activation function based on the problem
statement and the desired outputs. The activation function may differ
(e.g., Sign, Step, and Sigmoid) between perceptron models, depending on whether the
learning process is slow or suffers from vanishing or exploding gradients.


How does Perceptron work?


In Machine Learning, the perceptron is considered a single-layer neural network that
consists of four main parameters: input values (input nodes), weights and bias, net
sum, and an activation function. The perceptron model begins by multiplying all
input values by their weights, then adds these values together to create the weighted sum.
This weighted sum is then passed to the activation function 'f' to obtain the desired output.

In the perceptron this activation function is usually a step function and is represented by 'f'.

The activation function plays a vital role in ensuring that the output is
mapped to the required values, such as (0,1) or (-1,1). It is important to note that the
weight of an input indicates how strongly that input influences the node. Similarly, the
bias value shifts the activation function, which moves the threshold at which the
neuron fires.

Perceptron model works in two important steps as follows:


Step-1 - First, multiply all input values by their corresponding weight values and
then add them to determine the weighted sum. Mathematically, we can calculate the
weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn

Add a special term called the bias 'b' to this weighted sum so that the decision
boundary can be shifted away from the origin.
∑wi*xi + b

Step-2 - In the second step, an activation function is applied to the above-mentioned
weighted sum, which gives us an output either in binary form or as a continuous value:
Y = f(∑wi*xi + b)
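
The two steps can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names, the example weights and the choice of a 0/1 step function for f are assumptions.

```python
def step_function(z):
    """Step activation: 1 when the net input is positive, otherwise 0."""
    return 1 if z > 0 else 0


def perceptron_output(inputs, weights, bias, activation=step_function):
    # Step 1: weighted sum of the inputs, plus the bias term b.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Step 2: apply the activation function f to the weighted sum.
    return activation(weighted_sum)


# Hypothetical weights and bias chosen only for illustration.
y_on = perceptron_output([1.0, 0.0], weights=[0.6, 0.6], bias=-0.5)
y_off = perceptron_output([0.0, 0.0], weights=[0.6, 0.6], bias=-0.5)
```

With these values the first input gives a net sum of 0.1, so the unit fires, and the second gives -0.5, so it does not.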

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as
follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model


Single Layer Perceptron Model:


This is one of the simplest types of artificial neural network (ANN). A single-layer
perceptron model consists of a feed-forward network and includes a threshold transfer
function inside the model. The main objective of the single-layer perceptron model is to
classify linearly separable objects with binary outcomes.
A single-layer perceptron does not start with any recorded data, so it
begins with randomly initialized weight parameters. It then sums up all the weighted
inputs. If the total sum is more than a pre-determined threshold
value, the model is activated and shows the output value as +1.
If the output matches the desired value, the performance
of the model is considered satisfactory and the weights do not change. However, when
the output differs from the desired value for some inputs, the weights must be
adjusted to reach the desired output and minimize the error.
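
The training procedure described above can be sketched with the classic perceptron learning rule, w ← w + η·(target − output)·x. The notes do not spell this rule out, so it is an assumption here, as are the learning rate, the AND data set and the 0/1 outputs. The weights start at zero instead of random values only to keep the run reproducible.

```python
def predict(inputs, weights, bias):
    """Threshold unit: output 1 when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=1.0, max_epochs=20):
    """Adjust the weights only when the output differs from the desired value."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for inputs, target in samples:
            error = target - predict(inputs, weights, bias)
            if error != 0:
                errors += 1
                weights = [
                    w + learning_rate * error * x for w, x in zip(weights, inputs)
                ]
                bias += learning_rate * error
        if errors == 0:  # every sample is classified correctly
            break
    return weights, bias


# Logical AND is linearly separable, so the rule converges on it.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
```

The same loop would never reach zero errors on XOR, because no single line separates its two classes.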
"Single-layer perceptron can learn only linearly separable patterns."

Multi-Layered Perceptron Model:


A multi-layer perceptron model has the same basic structure as a single-layer
perceptron model but has one or more hidden layers. The multi-layer
perceptron is trained with the backpropagation algorithm, which executes in two
stages as follows:
o Forward Stage: Activations are computed starting from the input layer
and ending at the output layer.
o Backward Stage: Weight and bias values are modified as per
the model's requirement. In this stage, the error between the actual output and
the desired output is propagated backward from the output layer to the input layer.
Hence, a multi-layer perceptron can be considered as an artificial neural
network with several layers in which the activation function is not restricted to a
linear or step function. Instead, the activation function can be
sigmoid, TanH, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. It can also implement logic gates such as AND, OR, XOR,
NAND, NOT, XNOR, NOR.
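
As an illustration of the extra power of a hidden layer, the sketch below builds XOR from three threshold units: an OR unit and a NAND unit in the hidden layer, and an AND unit at the output. The weights are set by hand for illustration, not learned, and all names and values are assumptions.

```python
def step(z):
    return 1 if z > 0 else 0


def unit(inputs, weights, bias):
    """A single threshold unit: step(w . x + b)."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    # Hidden layer: OR and NAND of the two inputs.
    h_or = unit([x1, x2], weights=[1.0, 1.0], bias=-0.5)
    h_nand = unit([x1, x2], weights=[-1.0, -1.0], bias=1.5)
    # Output layer: AND of the two hidden units gives XOR.
    return unit([h_or, h_nand], weights=[1.0, 1.0], bias=-1.5)


truth_table = {(a, b): xor_network(a, b) for a in (0, 1) for b in (0, 1)}
```

A single threshold unit cannot produce this truth table, because XOR is not linearly separable.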

Perceptron Function
The perceptron function 'f(x)' is obtained by combining the input 'x' with
the learned weight vector 'w' and the bias 'b'.
Mathematically, we can express it as follows:
f(x)=1; if w.x+b>0
otherwise, f(x)=0
o 'w' represents the real-valued weight vector
o 'b' represents the bias
o 'x' represents the vector of input values.

5.4 Feed forward network


A feedforward neural network is a biologically inspired classification algorithm. It
consists of a (possibly large) number of simple neuron-like processing units, organized
in layers. Every unit in a layer is connected with all the units in the previous layer. These
connections are not all equal: each connection may have a different strength or weight. The
weights on these connections encode the knowledge of a network. Often the units in a neural
network are also called nodes.
Data enters at the inputs and passes through the network, layer by layer, until it
arrives at the outputs. During normal operation, that is when it acts as a classifier, there is no
feedback between layers. This is why they are called feedforward neural networks.
A feed forward neural network is an artificial neural network in which the
connections between nodes do not form a cycle. The opposite of a feed forward neural
network is a recurrent neural network, in which certain pathways are cycled. The feed
forward model is the simplest form of neural network as information is only processed in one
direction. While the data may pass through multiple hidden nodes, it always moves in one
direction and never backwards.
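
The layer-by-layer flow can be sketched as follows. This is a minimal pure-Python illustration: the 2-3-1 layout, the weight values and the sigmoid activation are assumptions chosen for the example, not values from the notes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each unit sees every input from the previous layer."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def feed_forward(inputs, layers):
    """Pass the data through the layers in order; nothing flows backwards."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 3 hidden units -> 1 output unit.
layers = [
    ([[0.2, -0.4], [0.7, 0.1], [-0.5, 0.3]], [0.0, 0.1, -0.1]),
    ([[0.6, -0.3, 0.8]], [0.05]),
]
output = feed_forward([1.0, 0.5], layers)
```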

How does a Feed Forward Neural Network work?


A feed forward neural network is commonly seen in its simplest form as a single-layer
perceptron. In this model, a series of inputs enter the layer and are multiplied by the
weights. The values are then added together to get a sum of the weighted input values. If
the sum is above a specific threshold, usually set at zero, the value produced is
typically 1, whereas if the sum falls below the threshold, the output value is -1. The
single-layer perceptron is an important model of feed forward neural networks and is often
used in classification tasks. Furthermore, single-layer perceptrons can incorporate aspects
of machine learning. Using a rule known as the delta rule, the neural network can compare
the outputs of its nodes with the intended values, allowing the network to adjust its
weights through training in order to produce more accurate output values. This process of
training and learning is a form of gradient descent. In multi-layer perceptrons, the
process of updating weights is nearly analogous, but it is defined more specifically as
back-propagation. In such cases, the weights of each hidden layer within the network are
adjusted according to the error in the output values produced by the final layer.
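
A single delta-rule update can be sketched for one linear unit. The update w ← w + η·(target − output)·x is the usual statement of the delta rule; the learning rate, the inputs and the target below are assumptions for illustration.

```python
def linear_output(inputs, weights, bias):
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def delta_rule_update(inputs, target, weights, bias, learning_rate=0.1):
    """Move each weight in proportion to the output error and its input."""
    error = target - linear_output(inputs, weights, bias)
    new_weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * error
    return new_weights, new_bias


inputs, target = [1.0, 2.0], 1.0
weights, bias = [0.0, 0.0], 0.0

error_before = target - linear_output(inputs, weights, bias)
weights, bias = delta_rule_update(inputs, target, weights, bias)
error_after = target - linear_output(inputs, weights, bias)
```

One update reduces the error on this sample from 1.0 to 0.4; repeating the update keeps shrinking it, which is the gradient-descent behaviour described above.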

Applications of Feed Forward Neural Networks


While Feed Forward Neural Networks are fairly straightforward, their simplified
architecture can be used as an advantage in particular machine learning applications. For
example, one may set up a series of feed forward neural networks with the intention of
running them independently from each other, but with a mild intermediary for moderation.
Like the human brain, this process relies on many individual neurons in order to handle and
process larger tasks. As the individual networks perform their tasks independently, the results
can be combined at the end to produce a synthesized, and cohesive output.

5.5 Back Propagation:


Backpropagation is the essence of neural network training. It is the method of fine-
tuning the weights of a neural network based on the error rate obtained in the previous epoch
(i.e., iteration). Proper tuning of the weights allows you to reduce error rates and make the
model reliable by increasing its generalization.
Backpropagation in neural network is a short form for “backward propagation of
errors.” It is a standard method of training artificial neural networks. This method helps
calculate the gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works


The backpropagation algorithm computes the gradient of the loss
function with respect to each weight by the chain rule. It computes the gradient
efficiently, one layer at a time, unlike a naive direct computation. It computes the
gradient, but it does not define how the gradient is used. It generalizes the
computation in the delta rule.
Consider the following backpropagation neural network example to
understand:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually randomly selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers, to
the output layer.
4. Calculate the error in the outputs:
Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such that
the error is decreased.
6. Keep repeating the process until the desired output is achieved.
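
The steps above can be sketched for a very small network. This is an illustration under stated assumptions, not the notes' own method: a 2-2-1 network without biases, sigmoid activations, a squared-error loss 0.5·(output − target)², hand-picked starting weights and a learning rate of 0.5.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hidden, w_out):
    """Step 3: compute the hidden activations and the output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)))
    return hidden, output


def loss(x, target, w_hidden, w_out):
    _, output = forward(x, w_hidden, w_out)
    return 0.5 * (output - target) ** 2


def backprop(x, target, w_hidden, w_out):
    """Steps 4 and 5: propagate the output error back with the chain rule."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error = actual output - desired output, scaled by the sigmoid slope.
    delta_out = (output - target) * output * (1.0 - output)
    grad_out = [delta_out * h for h in hidden]
    delta_hidden = [
        delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
        for j in range(len(hidden))
    ]
    grad_hidden = [[d * xi for xi in x] for d in delta_hidden]
    return grad_hidden, grad_out


def gradient_step(x, target, w_hidden, w_out, learning_rate=0.5):
    """Adjust every weight against its gradient so that the error decreases."""
    grad_hidden, grad_out = backprop(x, target, w_hidden, w_out)
    new_hidden = [
        [w - learning_rate * g for w, g in zip(row, grad_row)]
        for row, grad_row in zip(w_hidden, grad_hidden)
    ]
    new_out = [w - learning_rate * g for w, g in zip(w_out, grad_out)]
    return new_hidden, new_out


x, target = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]  # hypothetical starting weights
w_out = [0.5, -0.6]

loss_before = loss(x, target, w_hidden, w_out)
new_hidden, new_out = gradient_step(x, target, w_hidden, w_out)
loss_after = loss(x, target, new_hidden, new_out)
```

Step 6 corresponds to calling `gradient_step` repeatedly, usually over many training examples.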

Why We Need Backpropagation?


The most prominent advantages of backpropagation are:
• Backpropagation is fast, simple and easy to program
• It has few parameters to tune apart from the number of inputs
• It is a flexible method as it does not require prior knowledge about the network
• It is a standard method that generally works well
• It does not need any special mention of the features of the function to be learned.

Types of Backpropagation Networks


Two Types of Backpropagation Networks are:
• Static Back-propagation
• Recurrent Backpropagation

Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input
to a static output. It is useful for solving static classification problems like optical
character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, activations are fed forward until a fixed value is
achieved. After that, the error is computed and propagated backward. The main difference
between these two methods is that the mapping is rapid in static back-propagation while
it is nonstatic in recurrent backpropagation.


5.6 Activation and Loss function


The activation function decides whether a neuron should be activated or not by
calculating the weighted sum and further adding bias to it. The purpose of the activation
function is to introduce non-linearity into the output of a neuron.

Explanation: We know, the neural network has neurons that work in correspondence
with weight, bias, and their respective activation function. In a neural network, we would
update the weights and biases of the neurons on the basis of the error at the output. This
process is known as back-propagation. Activation functions make the back-propagation
possible since the gradients are supplied along with the error to update the weights and
biases.

Why do we need Non-linear activation function?


A neural network without an activation function is essentially just a linear regression
model. The activation function does the non-linear transformation to the input making it
capable to learn and perform more complex tasks.

Variants of Activation Function


Linear Function
• Equation : A linear function has the equation of a straight line,
i.e. y = x
• No matter how many layers we have, if all are linear in nature, the final activation
of the last layer is nothing but a linear function of the input of the first layer.
• Range : -inf to +inf
• Uses : The linear activation function is used in just one place, i.e. the output layer.
• Issues : The derivative of a linear function is a constant that does not
depend on the input "x", so the gradient carries no information about the input and
the function adds no non-linear behavior to our algorithm.

For example : Calculating the price of a house is a regression problem. The house price
may have any big/small value, so we can apply a linear activation at the output layer. Even
in this case the neural net must have a non-linear function in the hidden layers.
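
The claim that stacked linear layers collapse into one linear function can be checked numerically. The sketch below uses scalar layers with made-up weights and biases (all values are assumptions): applying y = 2x + 1 and then y = 3h − 4 is the same as the single layer y = 6x − 1.

```python
def linear_layer(x, weight, bias):
    return weight * x + bias


def two_linear_layers(x):
    hidden = linear_layer(x, weight=2.0, bias=1.0)
    return linear_layer(hidden, weight=3.0, bias=-4.0)


def collapsed_layer(x):
    # 3 * (2x + 1) - 4 = 6x - 1: one linear layer with combined parameters.
    return linear_layer(x, weight=3.0 * 2.0, bias=3.0 * 1.0 - 4.0)


samples = [-2.0, 0.0, 0.5, 5.0]
same = all(abs(two_linear_layers(x) - collapsed_layer(x)) < 1e-12 for x in samples)
```

However many linear layers are stacked, the same algebra applies, which is why the hidden layers need a non-linear activation.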

1. Sigmoid Function:
Description: Takes a real-valued number and scales it between 0 and 1. Large
negative numbers become close to 0 and large positive numbers become close to 1.
Formula: 1 / (1 + e^(-x))
Range: (0,1)
Pros: As its range is between 0 and 1, it is ideal for situations where we need to
predict the probability of an event as an output.
Cons: The gradient values are significant for inputs between -3 and 3 but become much closer
to zero beyond this range, which almost kills the impact of the neuron on the final
output. Also, sigmoid outputs are not zero-centered (they are centred around 0.5), which
leads to undesirable zig-zagging dynamics in the gradient updates for the weights.

Plot:

2. Tanh Function:
Description: Similar to sigmoid but takes a real-valued number and scales it between
-1 and 1. It is better than sigmoid as it is centred around 0, which leads to better
convergence.
Formula: (e^x - e^(-x)) / (e^x + e^(-x))
Range: (-1,1)
Pros: The derivatives of tanh are larger than the derivatives of the sigmoid, which
helps us minimize the cost function faster.
Cons: Similar to sigmoid, the gradient values become close to zero for a wide range of
values (this is known as the vanishing gradient problem). Thus, the network refuses to
learn or keeps learning at a very small rate.

Plot:
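
The sigmoid and tanh functions and their derivatives can be compared directly. This is a small sketch using Python's standard `math` module; the function names are illustrative. It shows that the tanh derivative is larger near zero and that both derivatives shrink towards zero for large inputs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_derivative(x):
    return 1.0 - math.tanh(x) ** 2


# Derivatives at the centre and far from it.
at_zero = (sigmoid_derivative(0.0), tanh_derivative(0.0))
far_out = (sigmoid_derivative(10.0), tanh_derivative(10.0))
```

At x = 0 the sigmoid derivative is 0.25 while the tanh derivative is 1.0; at x = 10 both are close to zero, which is the vanishing gradient behaviour noted above.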


3. Softmax Function:
Description: The softmax function can be imagined as a combination of multiple sigmoids
which returns the probability of a datapoint belonging to each individual class in
a multiclass classification problem.
Formula: softmax(x_i) = e^(x_i) / ∑_j e^(x_j)
Range: (0,1), sum of output = 1
Pros: Can handle multiple classes and gives the probability of belonging to each class.
Cons: Should not be used in hidden layers as we want the neurons to be independent.
If we apply it then they will be linearly dependent.

Plot: Not Applicable
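
A softmax over a list of class scores can be sketched as follows. Subtracting the largest score before exponentiating is a common numerical-stability step that does not change the result; it is an addition here, not something the notes describe, and the example scores are made up.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to one."""
    shift = max(scores)  # stability: avoids overflow in exp for large scores
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
```

The largest score receives the largest probability, and the outputs always add up to one.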

4. ReLU Function:
Description: The rectified linear activation function or ReLU for short is a piecewise
linear function that will output the input directly if it is positive, otherwise, it will output
zero. This is the default behaviour, but variants of the function allow a non-zero
threshold, or a small non-zero multiple of the input for values below the
threshold (called Leaky ReLU).
Formula: max(0,x)
Range: [0,inf)
Pros: Although ReLU looks and acts like a linear function, it is a nonlinear function
allowing complex relationships to be learned. Its derivative is 1 for every positive input,
so gradients do not shrink as they pass back through the hidden layers of a deep network.
Cons: It should not be used as the final output layer for either classification/regression
tasks

Plot:
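
ReLU and its leaky variant can be sketched in a few lines. The slope `alpha=0.01` used for negative inputs is an assumed, commonly used value, not one given in these notes.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs are scaled by alpha instead of zeroed."""
    return x if x > 0 else alpha * x

for x in (-2.0, 0.0, 3.5):
    print(x, relu(x), leaky_relu(x))
```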

Loss Functions
The other key aspect in setting up the neural network infrastructure is selecting the
right loss functions. With neural networks, we seek to minimize the error (difference between
actual and predicted value) which is calculated by the loss function. We will be discussing 3
popular loss functions:

1. Mean Squared Error, L2 Loss

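Description: MSE loss is used for regression tasks. As the name suggests, this loss is
calculated by taking the mean of squared differences between actual (target) and predicted
values.

Formula: MSE = (1/m) * sum over i = 1..m of (y_i - ŷ_i)^2, where y_i is the actual value, ŷ_i is the predicted value and m is the number of records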

Range: [0,inf)
Pros: Preferred loss function if the distribution of the target variable is Gaussian as it has good
derivatives and helps the model converge quickly
Cons: Is not robust to outliers in the data (unlike loss functions like Mean Absolute Error) and
penalizes both over- and under-predictions quadratically (unlike loss functions like Mean Squared
Logarithmic Error Loss)
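
The sensitivity of MSE to outliers can be seen by comparing it with Mean Absolute Error on the same predictions. The numbers below are made up for illustration only.

```python
def mse(y_true, y_pred):
    """Mean of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0, 9.0]
y_good = [2.5, 5.5, 7.0, 9.0]        # small errors only
y_outlier = [2.5, 5.5, 7.0, 19.0]    # one prediction is off by 10

print(mse(y_true, y_good), mae(y_true, y_good))
print(mse(y_true, y_outlier), mae(y_true, y_outlier))
```

A single bad prediction raises MAE from 0.25 to 2.75 but raises MSE from 0.125 to 25.125, because the error of 10 is squared.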

2. Binary Cross Entropy

Description: BCE loss is the default loss function used for binary classification tasks.
It requires a single output node to classify the data into two classes, and the output must
lie in the range (0–1), i.e. the output layer should use the sigmoid function.

Formula: BCE = -(1/m) * sum over i = 1..m of [ y_i * log(ŷ_i) + (1 - y_i) * log(1 - ŷ_i) ]

where y_i is the actual label (0 or 1) of record i, ŷ_i is the classifier’s predicted probability
of the positive class for that record and m is the number of records.

Range: (0,inf)

Pros: The continuous nature of the loss function helps the training process converge well

Cons: Requires predictions in the range (0–1), so it is used with the sigmoid activation
function. Other loss functions like Hinge or Squared Hinge Loss can work with the tanh
activation function
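
A minimal sketch of binary cross entropy follows, assuming labels of 0 or 1 and predicted probabilities for the positive class. The clipping constant `eps` is an assumption added so that `log(0)` is never evaluated.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood of binary labels under the predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)     # keep p strictly inside (0, 1)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

labels = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.8, 0.2]
confident_wrong = [0.1, 0.9, 0.2, 0.8]
print(binary_cross_entropy(labels, confident_right))
print(binary_cross_entropy(labels, confident_wrong))
```

Predictions that are confident and wrong are penalized far more heavily than predictions that are confident and right.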

3. Categorical Cross Entropy

Description: It is the default loss function when we have a multi-class classification
task. It requires the same number of output nodes as there are classes, with the final layer
going through a softmax activation so that each output node has a probability value between (0–1).

Formula: CCE = -(y_1 * log(p_1) + y_2 * log(p_2) + ... + y_C * log(p_C))

where y_j is 1 if class j is the actual label and 0 otherwise, p_j is the classifier’s predicted
probability for the class j and C is the number of classes. The loss for a dataset is the mean of
this value over all records.
Range: (0,inf)
Pros: Similar to Binary Cross Entropy, the continuous nature of the loss function helps the
training process converge well.

Cons: May require a one-hot encoded vector with many zero values if there are many classes,
requiring significant memory (should use Sparse Categorical Crossentropy in this case)
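
For a single record, categorical cross entropy reduces to the negative log of the probability assigned to the true class. The sketch below compares a one-hot label with a sparse (integer index) label; the function names and the example probabilities are illustrative assumptions, not a library API.

```python
import math

def categorical_cross_entropy(one_hot, probs):
    """Loss for one record whose label is a one-hot vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)

def sparse_categorical_cross_entropy(class_index, probs):
    """Same loss, but the label is stored as a single class index."""
    return -math.log(probs[class_index])

probs = [0.1, 0.2, 0.7]                 # hypothetical softmax output for 3 classes
dense_loss = categorical_cross_entropy([0, 0, 1], probs)
sparse_loss = sparse_categorical_cross_entropy(2, probs)
print(dense_loss, sparse_loss)
```

Both forms give the same value; the sparse form simply avoids storing a vector of mostly zeros for every record.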

5.7 Limitations of Machine learning


• Each narrow application needs to be specially trained
• Require large amounts of hand-crafted, structured training data
• Learning must generally be supervised: Training data must be tagged
• Require lengthy offline/ batch training
• Do not learn incrementally or interactively, in real-time
• Poor transfer learning ability, reusability of modules, and integration
• Systems are opaque, making them very hard to debug
• Performance cannot be audited or guaranteed at the ‘long tail’
• They encode correlation, not causation or ontological relationships
• Do not encode entities or spatial relationships between entities
• Only handle very narrow aspects of natural language
• Not well suited for high-level, symbolic reasoning or planning
• Understanding which processes need automation
• Lack of quality of data
• Inadequate infrastructure
• Implementation
• Lack of skilled resources
• Ethics
• Deterministic problems
• Misapplication
• Interpretability

5.8 Deep Learning


Deep learning is a machine learning technique that teaches computers to do what
comes naturally to humans: learn by example. Deep learning is a key technology behind
driverless cars, enabling them to recognize a stop sign, or to distinguish a pedestrian from a
lamppost. It is the key to voice control in consumer devices like phones, tablets, TVs, and
hands-free speakers. Deep learning is getting lots of attention lately and for good reason. It’s
achieving results that were not possible before.
In deep learning, a computer model learns to perform classification tasks directly from
images, text, or sound. Deep learning models can achieve state-of-the-art accuracy,
sometimes exceeding human-level performance. Models are trained by using a large set of
labeled data and neural network architectures that contain many layers.

There are two main reasons it has only recently become useful:
1. Deep learning requires large amounts of labeled data. For example, driverless car
development requires millions of images and thousands of hours of video.

2. Deep learning requires substantial computing power. High-performance GPUs
have a parallel architecture that is efficient for deep learning. When combined
with clusters or cloud computing, this enables development teams to reduce
training time for a deep learning network from weeks to hours or less.

Examples of Deep Learning at Work


• Automated Driving
• Aerospace and Defense
• Medical Research
• Industrial Automation
• Electronics

How Deep Learning Works


Most deep learning methods use neural network architectures, which is why deep
learning models are often referred to as deep neural networks. The term “deep” usually
refers to the number of hidden layers in the neural network. Traditional neural networks only
contain 2-3 hidden layers, while deep networks can have as many as 150. Deep learning
models are trained by using large sets of labeled data and neural network architectures that
learn features directly from the data without the need for manual feature extraction.
One of the most popular types of deep neural networks is known as convolutional
neural networks (CNN or ConvNet). A CNN convolves learned features with input data, and
uses 2D convolutional layers, making this architecture well suited to processing 2D data, such
as images.

CNNs eliminate the need for manual feature extraction, so you do not need to identify
features used to classify images. The CNN works by extracting features directly from images.
The relevant features are not pretrained; they are learned while the network trains on a
collection of images. This automated feature extraction makes deep learning models highly
accurate for computer vision tasks such as object classification.

CNNs learn to detect different features of an image using tens or hundreds of hidden
layers. Every hidden layer increases the complexity of the learned image features. For
example, the first hidden layer could learn how to detect edges, and the last learns how to
detect more complex shapes specifically catered to the shape of the object we are trying to
recognize.

What’s the Difference Between Machine Learning and Deep Learning?


Deep learning is a specialized form of machine learning. A machine learning workflow
starts with relevant features being manually extracted from images. The features are then
used to create a model that categorizes the objects in the image. With a deep learning
workflow, relevant features are automatically extracted from images. In addition, deep
learning performs “end-to-end learning” – where a network is given raw data and a task to
perform, such as classification, and it learns how to do this automatically.
Another key difference is deep learning algorithms scale with data, whereas shallow
learning converges. Shallow learning refers to machine learning methods that plateau at a
certain level of performance when you add more examples and training data to the network.

A key advantage of deep learning networks is that they often continue to improve as
the size of your data increases.

Comparing a machine learning approach to categorizing vehicles (left) with deep learning
(right).

How to Create and Train Deep Learning Models


• Training from Scratch
• Transfer Learning
• Feature Extraction

5.8 Convolutional neural networks


A convolutional neural network is a feed-forward neural network that is generally
used to analyze visual images by processing data with grid-like topology. It’s also known as
a ConvNet. A convolutional neural network is used to detect and classify objects in an image.
Convolutional neural networks are one of the main approaches to image classification and
image recognition in neural networks. Scene labeling, object detection, and face
recognition are some of the areas where convolutional neural networks are widely used.
A CNN takes an image as input, processes it, and classifies it under a certain category such as
dog, cat, lion, tiger, etc. The computer sees an image as an array of pixels whose size depends on
the resolution of the image. Based on image resolution, it will see h * w * d, where h = height,
w = width and d = depth (the number of colour channels). For example, an RGB image is a
6 * 6 * 3 array (three channels), and a grayscale image is a 4 * 4 * 1 array (one channel).

In a CNN, each input image passes through a sequence of convolution layers with
filters (also known as kernels), along with pooling and fully connected layers. After that, we
apply the softmax function to classify an object with probabilistic values between 0 and 1.

Convolution Layer
The convolution layer is the first layer to extract features from an input image. By learning
image features using small squares of input data, the convolutional layer preserves the
spatial relationship between pixels. Convolution is a mathematical operation which takes two
inputs: an image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1 (for one filter, a stride of 1 and no padding).

Let's consider a 5*5 image whose pixel values are 0 or 1, and a 3*3 filter matrix:

Sliding the 3*3 filter matrix over the 5*5 image matrix, and at each position multiplying the
overlapping values element by element and summing them, produces an output called the
"Feature Map".
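
The sliding-window computation can be sketched with plain Python lists. The 5*5 image and 3*3 filter values below are illustrative assumptions (the original figure is not reproduced here), and, as in most CNN libraries, the filter is applied without flipping.

```python
def conv2d_valid(image, kernel):
    """Slide kernel over image (stride 1, no padding) and return the feature map."""
    h, w = len(image), len(image[0])
    fh, fw = len(kernel), len(kernel[0])
    feature_map = []
    for i in range(h - fh + 1):
        row = []
        for j in range(w - fw + 1):
            # Multiply the overlapping values element by element and sum them.
            acc = 0
            for a in range(fh):
                for b in range(fw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        feature_map.append(row)
    return feature_map

image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
feature_map = conv2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

The output has size (5-3+1) × (5-3+1) = 3 × 3, matching the output dimension given for the convolution layer.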

Convolving an image with different filters can perform operations such as blurring,
sharpening, and edge detection.

Strides
Stride is the number of pixels by which the filter shifts over the input matrix. When the
stride is 1, we move the filter 1 pixel at a time; similarly, when the stride is
2, we move the filter 2 pixels at a time. The following figure shows how the
convolution works with a stride of 2.

Padding
Padding plays a crucial role in building a convolutional neural network. Every
convolution shrinks the image, so in a neural network with hundreds of layers only a very
small image would be left after the final filter.
What happens if we take a three-by-three filter and convolve it over a grayscale
image?

It is clear from the above picture that a pixel in the corner gets covered only one
time, but a middle pixel gets covered more than once. This means that we keep more
information about the middle pixels, so there are two downsides:
o Shrinking outputs
o Losing information at the corners of the image.
To overcome this, we introduce padding to an image. "Padding is an additional
layer which can be added to the border of an image."
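
The effect of stride and padding on the output size can be sketched as below. The formula floor((n + 2p - f) / s) + 1 is the standard one for an input of size n, filter size f, padding p and stride s; the helper names are assumptions made for this example.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output length along one dimension for input size n and filter size f."""
    return (n + 2 * padding - f) // stride + 1

def zero_pad(image, p):
    """Surround a 2D list with p rows and columns of zeros on every side."""
    width = len(image[0]) + 2 * p
    border = [[0] * width for _ in range(p)]
    middle = [[0] * p + list(row) + [0] * p for row in image]
    return border + middle + [row[:] for row in border]

print(conv_output_size(5, 3))                # no padding, stride 1: shrinks to 3
print(conv_output_size(5, 3, stride=2))      # stride 2: shrinks to 2
print(conv_output_size(5, 3, padding=1))     # one layer of padding keeps size 5
padded = zero_pad([[1, 2], [3, 4]], 1)
for row in padded:
    print(row)
```

With a 3*3 filter, one layer of zero padding keeps the output the same size as the input and lets the corner pixels be covered by more than one filter position.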

Pooling Layer
Pooling layer plays an important role in pre-processing of an image. Pooling layer
reduces the number of parameters when the images are too large. Pooling is "downscaling"
of the image obtained from the previous layers. It can be compared to shrinking an image to
reduce its pixel density. Spatial pooling is also called downsampling or subsampling, which
reduces the dimensionality of each map but retains the important information.
There are the following types of spatial pooling:
Max Pooling
Max pooling is a sample-based discretization process. Its main objective is to
downscale an input representation, reducing its dimensionality and allowing
assumptions to be made about the features contained in the binned sub-regions. Max pooling is
done by applying a max filter to non-overlapping sub-regions of the initial representation.

Average Pooling
Average pooling downscales the input by dividing it into
rectangular pooling regions and computing the average value of each region.
Syntax (MATLAB)
layer = averagePooling2dLayer(poolSize)
layer = averagePooling2dLayer(poolSize,Name,Value)

Sum Pooling
The sub-region for sum pooling or mean pooling are set exactly the same as for max-
pooling but instead of using the max function we use sum or mean.
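
The three pooling types differ only in the function applied to each sub-region, which the sketch below makes explicit. It assumes square, non-overlapping regions and an illustrative 4*4 feature map; `pool2d` is a hypothetical helper, not a library function.

```python
def pool2d(feature_map, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size region."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, h - size + 1, size):
        row = []
        for j in range(0, w - size + 1, size):
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_fn(window))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

fmap = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
max_pooled = pool2d(fmap, 2, max)
avg_pooled = pool2d(fmap, 2, mean)
sum_pooled = pool2d(fmap, 2, sum)
print(max_pooled, avg_pooled, sum_pooled)
```

Each 2*2 region of the 4*4 map collapses to a single value, so the output is 2*2 for all three pooling types.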

Fully Connected Layer


The fully connected layer is a layer in which the output of the previous layers is
flattened into a vector and passed on. It transforms that vector into one output per class,
for the number of classes the network must predict.

In the above diagram, the feature map matrix will be converted into the vector such
as x1, x2, x3... xn with the help of fully connected layers. We will combine features to create
a model and apply the activation function such as softmax or sigmoid to classify the outputs
as a car, dog, truck, etc.
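
A minimal sketch of the flatten, fully connected and softmax steps is shown below. The feature map, weights and biases are made-up values for a hypothetical three-class problem; in a trained network they would be learned.

```python
import math

def flatten(feature_map):
    """Convert a 2D feature map into the vector x1, x2, ..., xn."""
    return [value for row in feature_map for value in row]

def dense(x, weights, biases):
    """One fully connected layer: a weighted sum of all inputs per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

feature_map = [[1.0, 0.0], [0.0, 1.0]]
weights = [                      # one row of weights per class (assumed values)
    [0.5, -0.2, 0.1, 0.4],
    [-0.3, 0.8, 0.2, -0.1],
    [0.1, 0.1, 0.1, 0.1],
]
biases = [0.0, 0.1, -0.1]

x = flatten(feature_map)
probs = softmax(dense(x, weights, biases))
print(x, probs)
```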

5.9 Recurrent neural networks:


A Recurrent Neural Network (RNN) is a type of neural network where the output
from the previous step is fed as input to the current step. In traditional neural networks, all
the inputs and outputs are independent of each other, but in cases such as predicting the
next word of a sentence, the previous words are required, and hence there is
a need to remember them. RNNs came into existence to solve
this issue with the help of a hidden layer. The main and most important feature of an RNN
is the hidden state, which remembers some information about a sequence.

An RNN has a “memory” which retains information about what has been calculated so far.
It uses the same parameters for each input, as it performs the same task
on all the inputs or hidden layers to produce the output. This reduces the number of
parameters, unlike other neural networks.
How RNN works
The working of an RNN can be understood with the help of the example below:
Example:
Suppose there is a deeper network with one input layer, three
hidden layers and one output layer. Then, like other neural networks,
each hidden layer will have its own set of weights and biases: say
(w1, b1) for the first hidden layer, (w2, b2) for the second hidden
layer and (w3, b3) for the third hidden layer. This
means that these layers are independent of each other, i.e.
they do not memorize the previous outputs.

Now the RNN will do the following:


• RNN converts the independent activations into
dependent activations by providing the same weights and
biases to all the layers, thus reducing the number of
parameters, and memorizes each previous
output by giving each output as input to the next hidden
layer.
• Hence these three layers can be joined together into a
single recurrent layer, such that the weights and biases of
all the hidden layers are the same.

Formula for calculating current state:
ht = f(ht-1, xt)
where:
ht -> current state
ht-1 -> previous state
xt -> input state

Formula for applying Activation function(tanh):
ht = tanh(whh * ht-1 + wxh * xt)
where:
whh -> weight at recurrent neuron
wxh -> weight at input neuron

Formula for calculating output:
Yt = Why * ht
Yt -> output
Why -> weight at output layer

Training through RNN


1. A single time step of the input is provided to the network.
2. The network then calculates its current state from the current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go through as many time steps as the problem requires and combine the
information from all the previous states.
5. Once all the time steps are completed, the final current state is used to calculate
the output.
6. The output is then compared to the actual output, i.e. the target output, and the
error is generated.
7. The error is then back-propagated through the network to update the weights, and
hence the network (RNN) is trained.

The recurrent neural network will perform the following.

The recurrent network first converts the independent activations into
dependent ones. It assigns the same weights and biases to all the layers, which reduces the
number of parameters of the RNN. It also memorizes the previous outputs by providing
each previous output as an input to the next layer.
The three layers, having the same weights and biases, combine into a single recurrent unit.

For calculating the current state:

ht = f(ht-1, Xt)

Where
ht = current state
ht-1 = previous state
Xt = input state
To apply the activation function tanh, we have:
ht = tanh(Whh * ht-1 + Wxh * Xt)
Where:
Whh = weight of the recurrent neuron and,
Wxh = weight of the input neuron
The formula for calculating output:
Yt = Why * ht
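
The state update and output formulas above can be sketched for a single recurrent unit with scalar weights. This is a simplification made for readability (real networks use weight matrices and usually a bias term), and the weight values and input sequence are illustrative assumptions.

```python
import math

def rnn_forward(inputs, w_hh, w_xh, w_hy, h0=0.0):
    """Run one recurrent unit over a sequence; return all states and the output."""
    h = h0
    states = []
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = math.tanh(w_hh * h + w_xh * x_t)
        states.append(h)
    y = w_hy * h                 # output computed from the final state
    return states, y

sequence = [1.0, 0.5, -1.0]
states, y = rnn_forward(sequence, w_hh=0.5, w_xh=1.0, w_hy=2.0)
print(states, y)
```

Because each state depends on the previous one, feeding the same values in a different order gives a different final state, which is how the unit keeps a memory of the sequence.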

Prediction problems
RNNs are generally useful in working with sequence prediction problems. Sequence
prediction problems come in many forms and are best described by the types of inputs and
outputs they involve.

Sequence prediction problems include:


One-to-Many: In this type of problem, an observation as input is mapped to a sequence with
multiple steps as an output.
Many-to-One: Here a sequence of multiple steps as input is mapped to a class or quantity
prediction.
Many-to-Many: A sequence of multiple steps as input is mapped to a sequence with
multiple steps as output. The Many-to-Many problem is often referred to as
sequence-to-sequence, or seq2seq.

5.10 Use cases (or) Applications of RNN


• Prediction problems
• Machine Translation
• Speech Recognition
• Language Modelling and Generating Text
• Video Tagging
• Generating Image Descriptions
• Text Summarization
• Call Center Analysis
• Face detection
• OCR applications such as image recognition
• Other applications
