
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3501 DEEP LEARNING - NOTES


UNIT I DEEP NETWORKS BASICS
Linear Algebra: Scalars -- Vectors -- Matrices and tensors; Probability Distributions -- Gradient
based Optimization – Machine Learning Basics: Capacity -- Overfitting and underfitting --
Hyperparameters and validation sets -- Estimators -- Bias and variance -- Stochastic gradient
descent -- Challenges motivating deep learning; Deep Networks: Deep feed forward networks;
Regularization -- Optimization.

1. Introduction:
Today, artificial intelligence (AI) is a thriving field with many practical applications and active research
topics. We look to intelligent software to automate routine labor, understand speech or images, make
diagnoses in medicine and support basic scientific research. In the early days of artificial intelligence, the field
rapidly tackled and solved problems that are intellectually difficult for human beings but relatively
straightforward for computers—problems that can be described by a list of formal, mathematical rules. The
true challenge of AI lies in solving more intuitive problems. The solution is to allow computers to learn
from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in
terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the
need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy
of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If one
draws a graph showing how these concepts are built on top of each other, the graph is deep, with many layers.
For this reason, this approach is called deep learning.
A computer can reason automatically about statements in formal languages using logical inference rules.
This is known as the knowledge base approach to artificial intelligence. The difficulties faced by systems
relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge,
by extracting patterns from raw data. This capability is known as machine learning. The introduction of
machine learning allowed computers to tackle problems involving knowledge of the real world and make
decisions that appear subjective. A simple machine learning algorithm called logistic regression can
determine whether to recommend cesarean delivery. A simple machine learning algorithm called naive Bayes
can separate legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily on the representation of the
data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI
system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant
information, such as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each of these features of

the patient correlates with various outcomes. However, it cannot influence the way that the features are defined
in any way. If logistic regression were given an MRI scan of the patient, rather than the doctor's formalized
report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible
correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears throughout computer science and
even daily life. In computer science, operations such as searching a collection of data can proceed
exponentially faster if the collection is structured and indexed intelligently. People can easily perform
arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not
surprising that the choice of representation has an enormous effect on the performance of machine learning
algorithms. Many artificial intelligence tasks can be solved by designing the right set of features to extract for
that task, then providing these features to a simple machine learning algorithm. However, for many tasks, it
is difficult to know what features should be extracted. For example, suppose that we would like to write a
program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence
of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of
pixel values.
One solution to this problem is to use machine learning to discover not only the mapping from representation
to output but also the representation itself. This approach is known as representation learning. Learned
representations often result in much better performance than can be obtained with hand-designed
representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention.
A representation learning algorithm can discover a good set of features for a simple task in minutes, or a
complex task in hours to months.
The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the
combination of an encoder function that converts the input data into a different representation, and a decoder
function that converts the new representation back into the original format. Autoencoders are trained to
preserve as much information as possible when an input is run through the encoder and then the decoder, but
are also trained to make the new representation have various nice properties. Different kinds of autoencoders
aim to achieve different kinds of properties. When designing features or algorithms for learning features,
our goal is usually to separate the factors of variation that explain the observed data. A major source of
difficulty in many real-world artificial intelligence applications is that many of the factors of variation
influence every single piece of data we are able to observe. The individual pixels in an image of a red car
might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. It can
be very difficult to extract such high-level, abstract features from raw data. Deep learning solves this central
problem in representation learning by introducing representations that are expressed in terms of other, simpler
representations.
Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.1 shows how a
deep learning system can represent the concept of an image of a person by combining simpler concepts, such
as corners and contours, which are in turn defined in terms of edges. The quintessential example of a deep
learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is
just a mathematical function mapping some set of input values to output values. The function is formed by
composing many simpler functions. The idea of learning the right representation for the data provides one
perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn
a multi-step computer program. Each layer of the representation can be thought of as the state of the
computer's memory after executing another set of instructions in parallel. Networks with greater
depth can execute more instructions in sequence. Sequential instructions offer great power because later
instructions can refer back to the results of earlier instructions.
The input is presented at the visible layer, so named because it contains the variables that we are able to
observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers
are called "hidden" because their values are not given in the data; instead the model must determine which
concepts are useful for explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify
edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the
edges, the second hidden layer can easily search for corners and extended contours, which are recognizable
as collections of edges. Given the second hidden layer's description of the image in terms of corners and
contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts it contains can be used
to recognize the objects present in the image.

Fig. 1.1 Illustration of deep learning model

There are two main ways of measuring the depth of a model. The first view is based on the number of
sequential instructions that must be executed to evaluate the architecture. Another approach, used by deep
probabilistic models, regards the depth of a model as being not the depth of the computational graph but the
depth of the graph describing how concepts are related to each other. Machine learning is the only viable
approach to building AI systems that can operate in complicated, real-world environments. Deep learning is
a particular kind of machine learning that achieves great power and flexibility by learning to represent the
world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more
abstract representations computed in terms of less abstract ones. Fig. 1.2 illustrates the relationship between
these different AI disciplines. Fig. 1.3 gives a high-level schematic of how each works.

Fig. 1.2 Venn diagram representing relationship between AI disciplines

Fig. 1.3 High level schematics representing relationship between AI disciplines

AI is basically the study of training your machine (computers) to mimic a human brain and its thinking
capabilities. AI focuses on three major aspects (skills): learning, reasoning, and self-correction to obtain
the maximum efficiency possible. Machine Learning (ML) is an application or subset of AI. The major
aim of ML is to allow the systems to learn by themselves through experience without any kind of human
intervention or assistance. Deep Learning (DL) is basically a sub-part of the broader family of Machine
Learning which makes use of Neural Networks (similar to the neurons working in our brain) to mimic human
brain-like behavior. DL algorithms focus on information-processing mechanisms to identify patterns just
like the human brain does and classify the information accordingly. DL works on
larger sets of data when compared to ML and the prediction mechanism is self-administered by machines.
The differences between AI, ML and DL are presented in Table 1 below.

Table 1. Difference between Artificial Intelligence, Machine Learning & Deep Learning

Definition:
• AI stands for Artificial Intelligence, and is basically the study/process which enables machines to mimic human behaviour through a particular algorithm.
• ML stands for Machine Learning, and is the study that uses statistical methods enabling machines to improve with experience.
• DL stands for Deep Learning, and is the study that makes use of neural networks (similar to neurons present in the human brain) to imitate functionality just like a human brain.

Hierarchy:
• AI is the broader family consisting of ML and DL as its components.
• ML is the subset of AI.
• DL is the subset of ML.

Approach:
• AI is a computer algorithm which exhibits intelligence through decision making.
• ML is an AI algorithm which allows systems to learn from data.
• DL is an ML algorithm that uses deep (more than one layer) neural networks to analyze data and provide output accordingly.

Complexity:
• Search trees and much complex math are involved in AI.
• If one has a clear idea about the logic (math) involved and can visualize complex functionalities like K-Means, Support Vector Machines, etc., then it defines the ML aspect.
• If one is clear about the math involved but does not know the features, one breaks the complex functionalities into linear/lower-dimension features by adding more layers; this defines the DL aspect.

Aim:
• The aim of AI is to basically increase the chances of success, not accuracy.
• The aim of ML is to increase accuracy, not caring much about the success ratio.
• DL attains the highest rank in terms of accuracy when it is trained with a large amount of data.

Efficiency:
• The efficiency of AI is basically the efficiency provided by ML and DL respectively.
• ML is less efficient than DL as it cannot work with higher dimensions or larger amounts of data.
• DL is more powerful than ML as it can easily work with larger sets of data.

Categories:
• Three broad categories/types of AI are: Artificial Narrow Intelligence (ANI), Artificial General Intelligence (AGI) and Artificial Super Intelligence (ASI).
• Three broad categories/types of ML are: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
• DL can be considered as neural networks with a large number of parameters and layers lying in one of the four fundamental network architectures: Unsupervised Pre-trained Networks, Convolutional Neural Networks, Recurrent Neural Networks and Recursive Neural Networks.

Examples:
• Examples of AI applications include: Google's AI-powered predictions, ridesharing apps like Uber and Lyft, AI autopilots used in commercial flights, etc.
• Examples of ML applications include: virtual personal assistants (Siri, Alexa, Google Assistant, etc.), email spam and malware filtering.
• Examples of DL applications include: sentiment-based news aggregation, image analysis and caption generation, etc.

2. Linear Algebra:
A good understanding of linear algebra is essential for understanding and working with many machine
learning algorithms, especially deep learning algorithms.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
● Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which
are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower-case variable
names. When we introduce them, we specify what kind of number they are. For example, we might say "Let
s ∈ R be the slope of the line," while defining a real-valued scalar, or "Let n ∈ N be the number of units,"
while defining a natural number scalar.

● Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual
number by its index in that ordering. Typically we give vectors lower case names written in bold typeface,
such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript.
The first element of x is x1, the second element is x2 and so on. We also need to say what kinds of numbers
are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set
formed by taking the Cartesian product of R n times, denoted as R^n. When we need to explicitly identify the
elements of a vector, we write them as a column enclosed in square brackets:

    x = [ x1
          x2
          ⋮
          xn ]
We can think of vectors as identifying points in space, with each element giving the coordinate along a
different axis. Sometimes we need to index a set of elements of a vector. In this case, we define a set containing
the indices and write the set as a subscript. For example, to access x1, x3 and x6 , we define the set S = {1, 3,
6} and write xS . We use the − sign to index the complement of a set. For example x−1 is the vector containing
all elements of x except for x1, and x−S is the vector containing all of the elements of x except for x1, x3 and
x6.

● Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.
We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix
A has a height of m and a width of n, then we say that A ∈ R^(m×n). We usually identify the elements of a matrix
using its name in italic but not bold font, and the indices are listed with separating commas. For example, A1,1
is the upper left entry of A and Am,n is the bottom right entry. We can identify all of the numbers with vertical
coordinate i by writing a ":" for the horizontal coordinate. For example, Ai,: denotes the horizontal cross
section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A:,i is the i-th column of
A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square
brackets:

    A = [ A1,1  A1,2
          A2,1  A2,2 ]

Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we
use subscripts after the expression, but do not convert anything to lower case. For example, f(A)i,j gives
element (i, j) of the matrix computed by applying the function f to A.

● Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers
arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named
"A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing Ai,j,k. One important
operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a
diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. We
denote the transpose of a matrix A as A^T, and it is defined such that (A^T)i,j = Aj,i.

Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a
matrix with only one row. Sometimes we define a vector by writing out its elements in the text inline as a
row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x1, x2, x3]^T.

A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own
transpose: a = a^T. We can add matrices to each other, as long as they have the same shape, just by adding their
corresponding elements: C = A + B, where Ci,j = Ai,j + Bi,j. We can also add a scalar to a matrix or multiply a
matrix by a scalar, just by performing that operation on each element of the matrix: D = a · B + c, where
Di,j = a · Bi,j + c.

In the context of deep learning, we also use some less conventional notation. We allow the addition of a
matrix and a vector, yielding another matrix: C = A + b, where Ci,j = Ai,j + bj. In other words, the vector b is
added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each
row before doing the addition. This implicit copying of b to many locations is called broadcasting.
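As an illustrative sketch (our own, not part of the original notes), the NumPy snippet below shows these objects and the broadcasting convention; all variable names are arbitrary choices.

import numpy as np

s = 3.5                                   # scalar
x = np.array([1.0, 2.0, 3.0])             # vector in R^3
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # matrix in R^(2x3)
T = np.zeros((2, 3, 4))                   # 3-D tensor

print(A.T)                                # transpose: (A^T)i,j = Aj,i
b = np.array([10.0, 20.0, 30.0])
C = A + b                                 # broadcasting: b is added to each row of A
print(C)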

3.3 Probability Distributions


Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules
for determining what propositions are implied to be true or false given the assumption that some other set of
propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood
of a proposition being true given the likelihood of other propositions.
A random variable is a variable that can take on different values randomly. Random variables may be discrete
or continuous. A discrete random variable is one that has a finite or countably infinite number of states. Note
that these states are not necessarily the integers; they can also just be named states that are not considered to
have any numerical value. A continuous random variable is associated with a real value.
A probability distribution is a description of how likely a random variable or set of random variables is to
take on each of its possible states. The way we describe probability distributions depends on whether the
variables are discrete or continuous.
3.3.1 Discrete Variables and Probability Mass Functions
A probability distribution over discrete variables may be described using a probability mass function (PMF).
We typically denote probability mass functions with a capital P. Often we associate each random variable
with a different probability mass function and the reader must infer which probability mass function to use
based on the identity of the random variable, rather than the name of the function; P(x) is usually not the same
as P(y).
The probability mass function maps from a state of a random variable to the probability of that random
variable taking on that state. The probability that x = x is denoted as P (x), with a probability of 1 indicating
that x = x is certain and a probability of 0 indicating that x = x is impossible. Sometimes to disambiguate
which PMF to use, we write the name of the random variable explicitly: P(x = x). Sometimes we define a
variable first, then use ~ notation to specify which distribution it follows later: x ~ P(x).
Probability mass functions can act on many variables at the same time. Such a probability distribution over
many variables is known as a joint probability distribution. P (x = x, y = y) denotes the probability that x =
x and y = y simultaneously. We may also write P(x, y) for brevity. To be a probability mass function on a
random variable x, a function P must satisfy the following properties:

• The domain of P must be the set of all possible states of x.

• x  X , 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable
than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can
have a greater chance of occurring.

•  xX
P(x) = 1
.We refers to this property as being normalized. Without this property, we
could obtain probabilities greater than one by computing the probability of one of many events
occurring.
For example, consider a single discrete random variable x with k different states. We can place a uniform
distribution on x, that is, make each of its states equally likely, by setting its probability mass function to
P(x = xi) = 1/k for all i. We can see that this fits the requirements for a probability mass function. The value
1/k is positive because k is a positive integer. We also see that Σi P(x = xi) = Σi 1/k = k/k = 1, so the
distribution is properly normalized. Let's discuss a few discrete probability distributions as follows:

3.3.1.1 Binomial Distribution


The binomial distribution is a discrete distribution with a finite number of possibilities. When observing a
series of what are known as Bernoulli trials, the binomial distribution emerges. A Bernoulli trial is a scientific
experiment with only two outcomes: success or failure.
Consider a random experiment in which you toss a biased coin six times with a 0.4 chance of getting a head.
If 'getting a head' is considered a 'success', the binomial distribution will show the probability of r successes
for each value of r.
The binomial random variable represents the number of successes (r) in n consecutive independent Bernoulli
trials.
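A minimal sketch of the coin-toss example above, assuming SciPy is available; n = 6 tosses with success probability p = 0.4, as in the text:

from scipy.stats import binom

n, p = 6, 0.4  # six tosses of a biased coin with P(head) = 0.4
for r in range(n + 1):
    # pmf gives the probability of exactly r successes in n Bernoulli trials
    print(f"P(r = {r}) = {binom.pmf(r, n, p):.4f}")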

3.3.1.2 Bernoulli Distribution


The Bernoulli distribution is a variant of the binomial distribution in which only one experiment is conducted,
resulting in a single observation. As a result, the Bernoulli distribution describes events that have exactly two
outcomes.
The Bernoulli random variable's expected value is p, which is also known as the parameter of the Bernoulli
distribution. The experiment's outcome can be a value of 0 or 1, so Bernoulli random variables can take the
values 0 or 1. The pmf function is used to calculate the probability of the various random variable values.
Here's a Python code sketch to show the Bernoulli distribution:
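The snippet below is a minimal reconstruction of that example (the original code was lost in extraction), assuming SciPy; the parameter p = 0.6 is our own illustrative choice.

from scipy.stats import bernoulli

p = 0.6                        # probability of success, the distribution's parameter
rv = bernoulli(p)
print(rv.pmf(0), rv.pmf(1))    # P(x = 0) = 0.4, P(x = 1) = 0.6
print(rv.mean())               # the expected value equals p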

3.3.1.3 Poisson Distribution


A Poisson distribution is a probability distribution used in statistics to show how many times an event is likely
to happen over a given period of time. To put it another way, it's a count distribution. Poisson distributions
are frequently used to comprehend independent events at a constant rate over a given time interval. Siméon
Denis Poisson, a French mathematician, was the inspiration for the name.

The Python code below shows a simple example of the Poisson distribution. It has two parameters:
1. lam: the known average number of occurrences
2. size: the shape of the returned array
The Python code below generates a 1x100 distribution for an occurrence rate of 5.
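A minimal reconstruction of that snippet, assuming NumPy (the original code was lost in extraction):

import numpy as np

# lam: known average number of occurrences; size: shape of the returned array
samples = np.random.poisson(lam=5, size=100)
print(samples)   # a 1x100 array of Poisson-distributed counts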
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability
density function (PDF) rather than a probability mass function. To be a probability density function, a function
p must satisfy the following properties:
• The domain of p must be the set of all possible states of x.
• ∀x ∈ X, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly, instead the
probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of points. Specifically, the
probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example,
the probability that x lies in the interval [a, b] is given by ∫[a,b] p(x) dx.
For an example of a probability density function corresponding to a specific probability density over a
continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this
with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The ";" notation means
"parametrized by"; we consider x to be the argument of the function, while a and b are parameters
that define the function. To ensure that there is no probability mass outside the interval, we say
u(x; a, b) = 0 for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative
everywhere. Additionally, it integrates to 1. We often denote that x follows the uniform distribution on
[a, b] by writing x ~ U(a, b).
3.3.2.1 Normal Distribution
Normal Distribution is one of the most basic continuous distribution types. Gaussian distribution is another
name for it. Around its mean value, this probability distribution is symmetrical. It also demonstrates that data
close to the mean occurs more frequently than data far from it. Here, the mean is 0, and the variance is a
finite value.
In the example below, we generate 100 evenly spaced values ranging from 1 to 50, create a function
implementing the normal distribution formula to calculate the probability density, and then plot the data
points against the probability density function.
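A minimal sketch of that example, assuming NumPy and Matplotlib; the helper name normal_pdf and the mean/standard-deviation values are our own illustrative choices.

import numpy as np
import matplotlib.pyplot as plt

def normal_pdf(x, mean, sd):
    # probability density function of the normal distribution
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = np.linspace(1, 50, 100)              # 100 values ranging from 1 to 50
y = normal_pdf(x, mean=25.0, sd=10.0)
plt.plot(x, y)                           # data points vs. probability density
plt.xlabel("x")
plt.ylabel("p(x)")
plt.show()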

3.3.2.2 Continuous Uniform Distribution


In a continuous uniform distribution, all outcomes are equally possible, so each value has the same chance of
occurring. Random variables are spaced evenly in this symmetric probability distribution, with a probability
density of 1/(b − a).
The Python code below is a simple example of a continuous uniform distribution, drawing 1000 random
samples.
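A minimal sketch, assuming NumPy; the interval endpoints a = 0 and b = 10 are illustrative choices.

import numpy as np

a, b = 0.0, 10.0
# 1000 samples, each with probability density 1/(b - a) inside [a, b]
samples = np.random.uniform(low=a, high=b, size=1000)
print(samples.min(), samples.max())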

3.3.2.3 Log-Normal Distribution
This distribution is used for random variables whose logarithm values follow a normal distribution. Take
random variables X and Y. The variable represented in this distribution is Y = ln(X), where ln denotes the
natural logarithm of the X values.
The size distribution of rain droplets can be plotted using log normal distribution.
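A minimal sketch, assuming NumPy; the mean and sigma of the underlying normal distribution are illustrative choices.

import numpy as np

# ln(X) follows a normal distribution with the given mean and standard deviation
samples = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
print(samples[:10])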

3.3.2.4 Exponential Distribution
In a Poisson process, an exponential distribution is a continuous probability distribution that describes the
time between events (success, failure, arrival, etc.).
The example below shows how to draw random samples from an exponential distribution, returned as a
NumPy array, using the numpy.random.exponential() method.
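A minimal sketch of that usage, assuming NumPy; the scale value (the mean time between events, 1/λ) is an illustrative choice.

import numpy as np

# scale = 1/lambda, the average time between events in a Poisson process
samples = np.random.exponential(scale=2.0, size=1000)
print(samples.mean())   # should be close to the scale, 2.0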

4. Gradient based Optimization


Optimization means minimizing or maximizing any mathematical expression. Optimizers are algorithms or
methods used to update the parameters of the network such as weights, biases, etc. to minimize the losses.
Therefore, optimizers are used to solve optimization problems by minimizing the function, i.e., the loss
function in the case of neural networks.
Here, we are going to explore the world of optimizers for deep learning models. We will also discuss the
foundational mathematics behind these optimizers and discuss their advantages and disadvantages.

4.1 Role of an Optimizer
As discussed above, optimizers update the parameters of neural networks such as weights and learning rate
to minimize the loss function. Here, the loss function acts as a guide to the terrain telling optimizer if it is
moving in the right direction to reach the bottom of the valley, the global minimum.
4.2 The Intuition behind Optimization
Let us imagine a climber hiking down a hill with no sense of direction. He doesn't know the right way to
reach the valley, but he can tell whether he is moving closer (going downhill) or farther away (going uphill)
from his final destination. If he keeps taking steps in the correct direction, he will reach his aim, i.e., the valley.
This is exactly the intuition behind optimization: to reach a global minimum of the loss function.
4.3 Instances of Gradient-Based Optimizers
Different instances of Gradient descent based Optimizers are as follows:
• Batch Gradient Descent or Vanilla Gradient Descent or Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Mini batch Gradient Descent (MB-GD)
4.3.1 Batch Gradient Descent
The gradient descent algorithm is an optimization algorithm used to minimize a function. The function to be
minimized is called the objective function. In machine learning, the objective function is also termed the cost
function or loss function. It is the loss function that is optimized (minimized), and gradient descent is used to
find the most optimal values of the parameters/weights that minimize the loss function. The loss function,
simply speaking, is a measure of the squared difference between actual values and predictions. In order to
minimize the objective function, the most optimal values of the parameters of the function are found from a
large or infinite parameter space.
The gradient of a function at any point is the direction of steepest increase or ascent of the function at
that point.
Based on the above, the gradient descent direction of a function at any point represents the direction of
steepest decrease or descent of the function at that point.
In order to find the gradient of the function with respect to the x dimension, take the derivative of the
function with respect to x, then substitute the x-coordinate of the point of interest for the x values in the
derivative. Once the gradient of the function at any point is calculated, the descent direction can be obtained
by multiplying the gradient by −1. Here are the steps for finding the minimum of a function using gradient descent:
• Calculate the gradient by taking the derivative of the function with respect to the specific
parameter. In case, there are multiple parameters, take the partial derivatives with respect to
different parameters.
• Calculate the descent value for different parameters by multiplying the value of derivatives with
learning or descent rate (step size) and -1.
• Update the value of the parameter by adding the descent value to the existing value of the parameter.
In this way the parameter θ is updated with the value of the gradient in the opposite direction, taking
small steps.
Gradient descent is an optimization algorithm that's used when training deep learning models. It's based on
a convex function and updates its parameters iteratively to minimize a given function to its local minimum.

The update formula is:

    θj := θj − α · ∂J(θ)/∂θj

In the above formula,
• α is the learning rate,
• J is the cost function, and
• θ is the parameter to be updated.
As we see, the gradient represents the partial derivative of J (the cost function) with respect to θj.
Note that as we get closer to the global minimum, the slope (gradient) of the curve becomes less and less
steep, which results in a smaller value of the derivative, which in turn reduces the step size automatically.
It is the most basic but most used optimizer that directly uses the derivative of the loss function and learning
rate to reduce the loss function and tries to reach the global minimum.
Thus, the Gradient Descent optimization algorithm has many applications, including:
• Linear Regression,
• Classification Algorithms,
• Back-propagation in Neural Networks, etc.
The above-described update calculates the gradient of the cost function J(θ) with respect to the network
parameters θ for the entire training dataset.

Our aim is to reach the bottom of the graph (cost vs. weight), or a point where we can no longer move
downhill: a local minimum.
➢ Role of Gradient
In general, the gradient represents the slope of the loss surface. Gradients are partial derivatives that
describe the change in the loss function with respect to a small change in the parameters of the
function. This slight change in the loss function tells us the next step to take to reduce the output of the
loss function.

➢ Role of Learning Rate
Learning rate represents the size of the steps our optimization algorithm takes to reach the global minima.
To ensure that the gradient descent algorithm reaches the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high.
Taking very large steps, i.e., a large value of the learning rate, may skip the global minimum, and the model
will never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., a small
value of the learning rate, will take forever to converge.
Thus, the size of the step also depends on the gradient value.

As we discussed, the gradient represents the direction of increase. But our aim is to find the minimum point
in the valley so we have to go in the opposite direction of the gradient. Therefore, we update parameters in
the negative gradient direction to minimize the loss.

Algorithm: θ = θ − α·∇J(θ)
In code, batch gradient descent looks something like this (find_gradient is a placeholder for a routine that
evaluates the gradient of the loss over the full dataset):

for epoch in range(epochs):
    # one parameter update per full pass over the dataset
    params_gradient = find_gradient(loss_function, data, parameters)
    parameters = parameters - learning_rate * params_gradient
➢ Advantages of Batch Gradient Descent
• Easy computation
• Easy to implement
• Easy to understand
➢ Disadvantages of Batch Gradient Descent
• May get trapped at local minima
• Weights are changed after calculating the gradient on the whole dataset. So, if the dataset is too
large then this may take years to converge to the minima
• Requires large memory to calculate gradient on the whole dataset
4.3.2 Stochastic Gradient Descent
To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as
an extension of the Gradient Descent. One of the disadvantages of the Gradient Descent algorithm is that it
requires a lot of memory to load the entire dataset at a time to compute the derivative of the loss function.

So, in the SGD algorithm, we compute the derivative by taking one data point at a time, i.e., SGD tries to
update the model's parameters more frequently. Therefore, the model parameters are updated after the
computation of the loss on each training example.
So, if a dataset contains 1000 rows, SGD will update the model parameters 1000 times in one complete cycle
through the dataset, instead of once as in gradient descent.
Algorithm: θ = θ − α·∇J(θ; x(i), y(i)), where {x(i), y(i)} are the training examples
We want training to be even faster, so we take a gradient descent step for each training example. Let's see
the implications in the diagram below:

Let's try to find some insights from the above diagram:


• In the left diagram of the above picture, we have SGD (one example per step): we take a gradient
descent step for each example. The right diagram shows GD (one step per pass over the entire training set).
• SGD seems to be quite noisy, but at the same time it is much faster than the others, and it is also
possible that it does not converge to a minimum.
It is observed that in SGD the updates take more iterations than GD to reach the minimum. GD takes fewer
steps to reach the minimum, but the SGD algorithm is noisier and takes more iterations, because the
frequently updated model parameters have high variance, causing fluctuations in the loss function of
varying intensity.
Its code snippet simply adds a loop over the training examples and finds the gradient with respect to each of
the training examples.
import numpy as np

for epoch in range(epochs):
    np.random.shuffle(data)  # visit the examples in a new random order each epoch
    for example in data:
        # one parameter update per training example
        params_gradient = find_gradient(loss_function, example, parameters)
        parameters = parameters - learning_rate * params_gradient
➢ Advantages of Stochastic Gradient Descent
• Convergence takes less time as compared to others since there are frequent updates in model
parameters
• Requires less memory as no need to store values of loss functions
• May discover new minima
➢ Disadvantages of Stochastic Gradient Descent
• High variance in model parameters
• Even after reaching the global minimum, it may overshoot
• To reach the same convergence as that of gradient descent, we need to slowly reduce the value
of the learning rate
4.3.3 Mini-Batch Gradient Descent
To overcome the problem of the large time complexity of the SGD algorithm, the MB-GD algorithm comes
into the picture as an extension of the SGD algorithm. Not only that, it also overcomes the drawbacks of
gradient descent. It is therefore considered the best among all the variations of gradient descent algorithms.
The MB-GD algorithm takes a batch, i.e., a subset of points from the dataset, to compute the derivative.

It is observed that the derivative of the loss function for MB-GD is almost the same as the derivative of the
loss function for GD after some number of iterations. But the number of iterations to achieve the minimum is
large for MB-GD compared to GD, and the cost of computation is also large.
Therefore, the weight update depends on the derivative of the loss for a batch of points. The updates in the
case of MB-GD are noisier because the derivative does not always point towards the minimum.
It updates the model parameters after every batch. So, this algorithm divides the dataset into various batches
and after every batch, it updates the parameters.
Algorithm: θ = θ − α·∇J(θ; B(i)), where {B(i)} are the batches of training examples
In the code snippet, instead of iterating over examples, we now iterate over mini-batches of size 30
(get_batches is a placeholder that yields successive batches of the shuffled data):

import numpy as np

for epoch in range(epochs):
    np.random.shuffle(data)
    for batch in get_batches(data, batch_size=30):
        # one parameter update per mini-batch
        params_gradient = find_gradient(loss_function, batch, parameters)
        parameters = parameters - learning_rate * params_gradient
➢ Advantages of Mini Batch Gradient Descent
• Updates the model parameters frequently and also has less variance
• Requires neither a small nor a large amount of memory, i.e., a medium amount of memory
➢ Disadvantages of Mini Batch Gradient Descent
• The parameter updates in MB-GD are noisier compared to the weight updates in the GD
algorithm
• Compared to the GD algorithm, it takes a longer time to converge
• May get stuck at local minima

4.3.4 Challenges with all types of Gradient-based Optimizers


Optimum learning rate: If we choose too small a value for the learning rate, then gradient descent may take
a very long time to converge. For more about this challenge, refer to the section on the learning rate
discussed above under the gradient descent algorithm.
Constant learning rate: All the parameters share a constant learning rate, but there may be some
parameters that we do not want to change at the same rate.
Local minimum: The optimizer may get stuck at a local minimum, i.e., never reach the global minimum.
5. Basics in Machine Learning
5.1 Need for machine learning:
Machine learning is important because it allows computers to learn from data and improve their performance
on specific tasks without being explicitly programmed. This ability to learn from data and adapt to new
situations makes machine learning particularly useful for tasks that involve large amounts of data, complex
decision-making, and dynamic environments.
Here are some specific areas where machine learning is being used:
• Predictive modeling: Machine learning can be used to build predictive models that can help businesses
make better decisions. For example, machine learning can be used to predict which customers are
most likely to buy a particular product, or which patients are most likely to develop a certain disease.
• Natural language processing: Machine learning is used to build systems that can understand and
interpret human language. This is important for applications such as voice recognition, chatbots,
and language translation.
• Computer vision: Machine learning is used to build systems that can recognize and interpret images
and videos. This is important for applications such as self-driving cars, surveillance systems, and
medical imaging.
• Fraud detection: Machine learning can be used to detect fraudulent behavior in financial transactions,
online advertising, and other areas.
• Recommendation systems: Machine learning can be used to build recommendation systems that
suggest products, services, or content to users based on their past behavior and preferences.
Overall, machine learning has become an essential tool for many businesses and industries, as it enables them
to make better use of data, improve their decision-making processes, and deliver more personalized
experiences to their customers.
5.2 Definition and Workflow:
Machine Learning is a branch of artificial intelligence that develops algorithms by learning the hidden patterns
of datasets and uses them to make predictions on new data of a similar type, without being explicitly
programmed for each task.
Machine Learning works in the following manner.
• Forward Pass: In the forward pass, the machine learning algorithm takes in input data and produces
an output. Depending on the model's algorithm, it computes the predictions.
• Loss Function: The loss function, also known as the error or cost function, is used to evaluate the
accuracy of the predictions made by the model. The function compares the predicted output of the
model to the actual output and calculates the difference between them. This difference is known
as error or loss. The goal of the model is to minimize the error or loss function by adjusting its
internal parameters.
• Model Optimization Process: The model optimization process is the iterative process of adjusting the
internal parameters of the model to minimize the error or loss function. This is done using an
optimization algorithm, such as gradient descent. The optimization algorithm calculates the gradient
of the error function with respect to the model's parameters and uses this information to adjust the
parameters to reduce the error. The algorithm repeats this process until the error is minimized to a
satisfactory level.
Once the model has been trained and optimized on the training data, it can be used to make predictions
on new, unseen data. The accuracy of the model's predictions can be evaluated using various performance
metrics, such as accuracy, precision, recall, and F1-score.
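As an illustrative sketch (our own, not from the original notes), the loop below shows all three stages for a one-parameter linear model trained with gradient descent; all names and values are arbitrary.

import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * X                              # ground truth: y = 2x
w = 0.0                                  # internal parameter to be learned

for step in range(100):
    y_pred = w * X                       # forward pass: compute predictions
    loss = np.mean((y_pred - y) ** 2)    # loss function: mean squared error
    grad = np.mean(2 * (y_pred - y) * X) # gradient of the loss w.r.t. w
    w -= 0.01 * grad                     # optimization: gradient descent step

print(w)   # close to 2.0 after training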
5.3 Machine Learning lifecycle:
The lifecycle of a machine learning project involves a series of steps that include:
1. Study the Problem: The first step is to study the problem. This step involves understanding the
business problem and defining the objectives of the model.
2. Data Collection: When the problem is well-defined, we can collect the relevant data required for
the model. The data could come from various sources such as databases, APIs, or web scraping.
3. Data Preparation: Once the problem-related data is collected, it is a good idea to check the
data properly and put it in the desired format so that the model can use it to find the
hidden patterns. This can be done in the following steps:
• Data cleaning
• Data Transformation
• Explanatory Data Analysis and Feature Engineering
• Split the dataset for training and testing.
4. Model Selection: The next step is to select the appropriate machine learning algorithm that is suitable
for our problem. This step requires knowledge of the strengths and weaknesses of different
algorithms. Sometimes we use multiple models, compare their results, and select the best model
as per our requirements.
5. Model building and Training: After selecting the algorithm, we have to build the model.
a. In the case of traditional machine learning, building the model is easy; it requires just a little
hyperparameter tuning.
b. In the case of deep learning, we have to define the layer-wise architecture along with the input
and output sizes, the number of nodes in each layer, the loss function, the gradient descent
optimizer, etc.
c. After that, the model is trained using the preprocessed dataset.
6. Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to determine its
accuracy and performance using different techniques like the classification report, F1 score, precision,
recall, ROC curve, mean squared error, mean absolute error, etc.
7. Model Tuning: Based on the evaluation results, the model may need to be tuned or optimized to
improve its performance. This involves tweaking the hyperparameters of the model.
8. Deployment: Once the model is trained and tuned, it can be deployed in a production environment
to make predictions on new data. This step requires integrating the model into an existing software
system or creating a new system for the model.

9. Monitoring and Maintenance: Finally, it is essential to monitor the model's performance in the
production environment and perform maintenance tasks as required. This involves monitoring for data
drift, retraining the model as needed, and updating the model as new data becomes available.
5.4 Types of Machine Learning
The types are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Machine Learning

5.4.1 Supervised Machine Learning:


Supervised learning is a type of machine learning in which the algorithm is trained on the labeled dataset.
It learns to map input features to targets based on labeled training data. In supervised learning, the algorithm
is provided with input features and corresponding output labels, and it learns to generalize from this data to
make predictions on new, unseen data.
There are two main types of supervised learning:
• Regression: Regression is a type of supervised learning where the algorithm learns to predict
continuous values based on input features. The output labels in regression are continuous values, such
as stock prices and housing prices. The different regression algorithms in machine learning are:
Linear Regression, Polynomial Regression, Ridge Regression, Decision Tree Regression, Random
Forest Regression, Support Vector Regression, etc.
• Classification: Classification is a type of supervised learning where the algorithm learns to assign input
data to a specific category or class based on input features. The output labels in classification are
discrete values. Classification algorithms can be binary, where the output is one of two possible
classes, or multiclass, where the output can be one of several classes. The different Classification
algorithms in machine learning are: Logistic Regression, Naive Bayes, Decision Tree, Support Vector
Machine (SVM), K-Nearest Neighbors (KNN), etc.
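As an illustrative sketch (our own, not part of the original notes), the scikit-learn snippet below shows both flavours of supervised learning on tiny synthetic data.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous value
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))              # about 10

# Classification: predict a discrete class label
Xc = np.array([[0.0], [1.0], [2.0], [3.0]])
yc = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict([[2.5]]))              # class 1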

5.4.2 Unsupervised Machine Learning:


Unsupervised learning is a type of machine learning where the algorithm learns to recognize patterns in
data without being explicitly trained using labeled examples. The goal of unsupervised learning is to discover
the underlying structure or distribution in the data.
There are two main types of unsupervised learning:
• Clustering: Clustering algorithms group similar data points together based on their characteristics. The
goal is to identify groups, or clusters, of data points that are similar to each other, while being
distinct from other groups. Some popular clustering algorithms include K-means, Hierarchical
clustering, and DBSCAN.
• Dimensionality reduction: Dimensionality reduction algorithms reduce the number of input variables
in a dataset while preserving as much of the original information as possible. This is useful for
reducing the complexity of a dataset and making it easier to visualize and analyze. Some popular
dimensionality reduction algorithms include Principal Component Analysis (PCA), t-SNE, and
Autoencoders.
5.4.3 Reinforcement Machine Learning
Reinforcement learning is a type of machine learning where an agent learns to interact with an environment
by performing actions and receiving rewards or penalties based on its actions. The goal of reinforcement
learning is to learn a policy, which is a mapping from states to actions, that maximizes the expected
cumulative reward over time.
There are two main types of reinforcement learning:
• Model-based reinforcement learning: In model-based reinforcement learning, the agent learns a
model of the environment, including the transition probabilities between states and the rewards
associated with each state-action pair. The agent then uses this model to plan its actions in order
to maximize its expected reward. Some popular model-based reinforcement learning algorithms
include Value Iteration and Policy Iteration.
• Model-free reinforcement learning: In model-free reinforcement learning, the agent learns a policy
directly from experience without explicitly building a model of the environment. The agent
interacts with the environment and updates its policy based on the rewards it receives. Some
popular model-free reinforcement learning algorithms include Q-Learning, SARSA, and Deep
Reinforcement Learning.
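As a minimal illustration of model-free reinforcement learning (our own sketch, not from the original notes), tabular Q-learning on a trivial three-state chain where moving right eventually earns a reward:

import numpy as np

n_states, n_actions = 3, 2               # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))      # the table of state-action values
alpha, gamma = 0.1, 0.9                  # learning rate and discount factor

for episode in range(200):
    s = 0
    while s != n_states - 1:
        a = np.random.randint(n_actions)                       # explore randomly
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0             # reward at the goal
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # action 1 (right) ends up with higher values in each state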
5.5 Capacity
The capacity of a network refers to the range of the types of functions that the model can approximate.
Informally, a model‘s capacity is its ability to fit a wide variety of functions. A model with less capacity
may not be able to sufficiently learn the training dataset.
A model with more capacity can model more different types of functions and may be able to learn a function
to sufficiently map inputs to outputs in the training dataset. Whereas a model with too much capacity may
memorize the training dataset and fail to generalize or get lost or stuck in the search for a suitable mapping
function. Generally, we can think of model capacity as a control over whether the model is likely to underfit
or overfit a training dataset.
The capacity of a neural network can be controlled by two aspects of the model:
• Number of Nodes
• Number of Layers
A model with more nodes or more layers has a greater capacity and, in turn, is potentially capable of learning
a larger set of mapping functions. A model with more layers and more hidden units per layer has higher
representational capacity; it is capable of representing more complicated functions.

The number of nodes in a layer is referred to as the width and the number of layers in a model is referred to
as its depth. Increasing the depth increases the capacity of the model. Training deep models, e.g. those with
many hidden layers, can be computationally more efficient than training a single layer network with a vast
number of nodes.
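As an illustrative sketch (our own, assuming PyTorch is available), the snippet below builds two MLPs of different width and depth and counts their learnable parameters; the layer sizes are arbitrary choices.

import torch.nn as nn

def count_params(model):
    # total number of learnable parameters (weights and biases)
    return sum(p.numel() for p in model.parameters())

# A narrow, shallow model: lower capacity
small = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))

# A wider and deeper model: higher capacity
large = nn.Sequential(nn.Linear(10, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))

print(count_params(small))   # far fewer parameters
print(count_params(large))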
5.6 Over-fitting and under-fitting
Over-fitting and under-fitting are two crucial concepts in machine learning and are the prevalent causes of the
poor performance of a machine learning model. In this topic we will explore over-fitting and under-fitting
in machine learning.
➢ Over-fitting
When a model performs very well for training data but has poor performance with test data (new data), it is
known as over-fitting. In this case, the machine learning model learns the details and noise in the training data
such that it negatively affects the performance of the model on test data. Over-fitting can happen due to low
bias and high variance.

➢ Reasons for over-fitting


• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high variance
• The size of the training dataset used is not enough
• The model is too complex

➢ Methods to tackle over-fitting


• Using K-fold cross-validation
• Using Regularization techniques such as Lasso and Ridge
• Training model with sufficient data
• Adopting ensembling techniques
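Two of these remedies, K-fold cross-validation and Ridge regularization, can be sketched together with scikit-learn (our own illustration, not from the original notes; the data and the alpha value are arbitrary):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, 0] * 3.0 + rng.normal(scale=0.5, size=100)

# Ridge adds an L2 penalty (alpha) that discourages large weights,
# and 5-fold cross-validation estimates generalization performance.
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())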

➢ Under-fitting
When a model has not learned the patterns in the training data well and is unable to generalize well on the
new data, it is known as under-fitting. An under-fit model has poor performance on the training data and
will result in unreliable predictions. Under-fitting occurs due to high bias and low variance.

➢ Reasons for under-fitting
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high bias
• The size of the training dataset used is not enough
• The model is too simple

➢ Methods to tackle under-fitting


• Increase the number of features in the dataset
• Increase model complexity
• Reduce noise in the data
• Increase the duration of training
Now that we have understood what over-fitting and under-fitting are, let's see what a good fit model is in
this tutorial on over-fitting and under-fitting in machine learning.

➢ Good fit in machine learning


To find the good fit model, we need to look at the performance of a machine learning model over time with
the training data. As the algorithm learns over time, the error for the model on the training data reduces, as
well as the error on the test dataset. If we train the model for too long, the model may learn the unnecessary
details and the noise in the training set and hence lead to over-fitting. In order to achieve a good fit, we need
to stop training at a point where the error starts to increase.

5.7 Hyper-parameters
Hyper-parameters are defined as the parameters that are explicitly defined by the user to control the
learning process. The value of the Hyper-parameter is selected and set by the machine learning engineer
before the learning algorithm begins training the model. These parameters are tunable and can directly
affect how well a model trains. Hence, these are external to the model, and their values cannot be changed
during the training process. Some examples of hyper-parameters in machine learning:
• Learning Rate
• Number of Epochs
• Momentum
• Regularization constant
• Number of branches in a decision tree
• Number of clusters in a clustering algorithm (like k-means)
5.7.1 Model Parameters:
Model parameters are configuration variables that are internal to the model, and a model learns them on its
own. Examples include the weights or coefficients of independent variables in a linear regression model, the
weights or coefficients of independent variables in an SVM, the weights and biases of a neural network, and
cluster centroids in clustering. Some key points for model parameters are as follows:
• They are used by the model for making predictions
• They are learned by the model from the data itself
• These are usually not set manually
• These are the part of the model and key to a machine learning algorithm
5.7.2 Model Hyper-parameters:
• Hyper-parameters are those parameters that are explicitly defined by the user to control the learning
process. Some key points for model hyper-parameters are as follows:
• These are usually defined manually by the machine learning engineer.
• One cannot know the exact best value for hyper-parameters for the given problem. The best value can
be determined either by the rule of thumb or by trial and error.
• Some examples of Hyper-parameters are the learning rate for training a neural network, K in the
KNN algorithm
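As a small illustration (our own sketch, not from the original notes), the scikit-learn snippet below makes the distinction concrete: C is a hyper-parameter set before training, while coef_ and intercept_ are parameters learned from the data.

from sklearn.linear_model import LogisticRegression

X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]

# C (inverse regularization strength) is a hyper-parameter: set by the user before training
clf = LogisticRegression(C=1.0)
clf.fit(X, y)

# coef_ and intercept_ are model parameters: learned from the data during training
print(clf.coef_, clf.intercept_)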
5.7.3 Difference between Model and Hyper parameters
The difference is as tabulated below.
• Model parameters are required for making predictions; hyper-parameters are required for estimating the model parameters.
• Model parameters are estimated by optimization algorithms (Gradient Descent, Adam, Adagrad); hyper-parameters are estimated by hyper-parameter tuning.
• Model parameters are not set manually; hyper-parameters are set manually.
• The final parameters found after training decide how the model will perform on unseen data; the choice of hyper-parameters decides how efficient the training is (in gradient descent, the learning rate decides how efficient and accurate the optimization process is in estimating the parameters).
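To make the distinction concrete, here is a small sketch assuming scikit-learn is available: alpha is a hyper-parameter chosen by the user before training, while coef_ and intercept_ are model parameters estimated from the data.

```python
# alpha is a hyper-parameter (set before training); coef_ and intercept_
# are model parameters (learned from the data). Ridge is just an example.
import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])

model = Ridge(alpha=0.5)              # hyper-parameter chosen by the user
model.fit(X, y)                       # training estimates the model parameters
print(model.coef_, model.intercept_)  # learned weight and bias
```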

5.7.4 Categories of Hyper-parameters


Broadly hyper-parameters can be divided into two categories, which are given below:
• Hyper-parameter for Optimization
• Hyper-parameter for Specific Models
5.7.4.1 Hyper-parameter for optimization
The process of selecting the best hyper-parameters to use is known as hyper-parameter tuning, and the tuning
process is also known as hyper-parameter optimization. Optimization parameters are used for optimizing the
model.

Some of the popular optimization parameters are given below:


• Learning Rate: The learning rate is the hyper-parameter in optimization algorithms that controls how much the model needs to change in response to the estimated error each time the model's weights are updated. It is one of the crucial parameters when building a neural network. Selecting an optimal learning rate is a challenging task: if the learning rate is very small, it may slow down the training process, while if it is too large, the optimizer may overshoot and fail to optimize the model properly.
• Batch Size: To enhance the speed of the learning process, the training set is divided into different subsets, which are known as batches; the batch size is the number of training examples processed in one update.
• Number of Epochs: An epoch can be defined as one complete cycle through the training data, so training is an iterative learning process spanning multiple epochs. The right number of epochs varies from model to model and is determined using the validation error: the number of epochs is increased as long as the validation error keeps decreasing, and when there is no improvement for consecutive epochs, it is a signal to stop increasing the number of epochs. (A short sketch after this list shows how the learning rate, batch size and number of epochs fit together in a typical training loop.)
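The sketch below shows how the three optimization hyper-parameters interact in a typical mini-batch training loop; the linear model and mean-squared-error gradient are assumptions made purely for illustration.

```python
# Schematic mini-batch training loop for a linear least-squares model,
# showing where the three hyper-parameters enter (synthetic data).
import numpy as np

learning_rate = 0.01   # step size of each weight update
batch_size = 32        # number of examples per gradient estimate
num_epochs = 10        # full passes over the training set

rng = np.random.default_rng(0)
X, y = rng.normal(size=(320, 5)), rng.normal(size=320)
w = np.zeros(5)

for epoch in range(num_epochs):
    order = rng.permutation(len(X))                # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # mean-squared-error gradient
        w -= learning_rate * grad                  # one update per batch
```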
5.7.4.2 Hyper-parameter for Specific Models
Hyper-parameters that are involved in the structure of the model are known as hyper-parameters for specific
models. These are given below:
• A number of Hidden Units: Hidden units are part of neural networks, which refer to the components
comprising the layers of processors between input and output units in a neural network.
• Number of Layers: A neural network is made up of vertically arranged components, which are called layers: mainly input layers, hidden layers, and output layers. A 3-layered neural network often gives better performance than a 2-layered network, and for a Convolutional Neural Network, a greater number of layers can make a better model.
5.8 Validation Sets
A validation set is a set of data used to train artificial intelligence (AI) with the goal of finding and optimizing
the best model to solve a given problem. Validation sets are also known as dev sets. A supervised AI is trained
on a corpus of training data.
Training, tuning, model selection and testing are performed with three different datasets: the training set, the
validation set and the testing set. Validation sets are used to select and tune the final AI model.
Training sets make up the majority of the total data, typically around 60%. Most training data sets are collected from several resources and then pre-processed and organized to ensure proper performance of the model. The quality of the training data determines the ability of the model to generalize, i.e. the better the quality and diversity of the training data, the better the performance of the model will be.
The validation set makes up about 20% of the data. It contrasts with the training and test sets in that it is used in an intermediate phase for choosing the best model and optimizing it; validation is sometimes considered a part of the training phase. It is in this phase that parameter tuning occurs to optimize the selected model. Over-fitting is checked and avoided using the validation set, eliminating errors that would otherwise carry over into future predictions and observations on a specific dataset.
Testing sets make up 20% of the bulk of the data. These sets are ideal data and results with which to verify
correct operation of an AI. The test set is ensured to be the input data grouped together with verified correct
outputs, generally by human verification. This ideal set is used to test results and assess the performance of
the final model.
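A minimal sketch of the 60/20/20 split described above, assuming scikit-learn's train_test_split is available (the data here is a dummy array):

```python
# Splitting a dataset into roughly 60% training, 20% validation, 20% testing.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(-1, 1), np.arange(100)

# First carve out 20% for the test set...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# ...then split the remainder 75/25, i.e. 60/20 of the original data.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```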
5.8.1 Cross Validation
Cross-validation is a technique for validating model efficiency by training the model on a subset of the input data and testing it on a previously unseen subset of the input data. The basic steps of cross-validation are:
o Reserve a subset of the dataset as a validation set.
o Provide the training to the model using the training dataset.
o Now, evaluate model performance using the validation set. If the model performs well with the
validation set, perform the further step, else check for the issues.
5.8.2 Methods used for Cross-Validation
There are some common methods that are used for cross-validation. These methods are given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
5.8.2.1 Validation Set Approach
In the validation set approach, we divide our input dataset into a training set and a test or validation set, each receiving 50% of the dataset.
This has one big disadvantage: since only 50% of the data is used to train the model, the model may fail to capture important information in the dataset, and the approach also tends to give an under-fitted model.
5.8.2.2 Leave-P-out cross-validation
In this approach, p data points are left out of the training data. If there are a total of n data points in the original input dataset, then n−p data points are used as the training set and the p data points as the validation set. This complete process is repeated for all possible combinations, and the average error is calculated to know the effectiveness of the model.

The disadvantage of this technique is that it can be computationally expensive for large p.
5.8.2.3 Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p data points, only one data point is left out. For each learning set, a single data point is reserved, and the remaining dataset is used to train the model. This process repeats for each data point; hence for n samples, we get n different training sets and n test sets. It has the following features:
• In this approach, the bias is minimum as all the data points are used.
• The process is executed for n times; hence execution time is high.
• This approach leads to high variation in testing the effectiveness of the model as we iteratively
check against one data point.
5.8.2.4 K-Fold Cross-Validation
The k-fold cross-validation approach divides the input dataset into K groups of samples of equal size, called folds. For each learning set, the prediction function uses K−1 folds, and the remaining fold is used as the test set. This approach is a very popular CV approach because it is easy to understand, and the output is less biased than with other methods.
The steps for k-fold cross-validation are:
• Split the input dataset into K groups
• For each group:
• Take one group as the reserve or test data set.
• Use remaining groups as the training dataset
• Fit the model on the training set and evaluate the performance of the model using the test set.

Let's take an example of 5-fold cross-validation, where the dataset is grouped into 5 folds. In the 1st iteration, the first fold is reserved for testing the model, and the rest are used for training. In the 2nd iteration, the second fold is used to test the model, and the rest are used for training. This process continues until each fold has been used as the test fold.
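A short sketch of 5-fold cross-validation using scikit-learn's KFold is given below; the actual model fitting and scoring calls are left as a comment, since any estimator could be plugged in.

```python
# 5-fold cross-validation: each fold serves as the test set exactly once
# while the remaining four folds are used for training.
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 samples, 2 features
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for i, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    print(f"Iteration {i}: train on {len(train_idx)}, test on {len(test_idx)}")
    # model.fit(X[train_idx], y[train_idx]); model.score(X[test_idx], y[test_idx])
```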

5.8.2.5 Stratified k-fold cross-validation


This technique is similar to k-fold cross-validation with a few small changes. It works on the concept of stratification: the data is rearranged to ensure that each fold or group is a good representative of the complete dataset. It is one of the best approaches for dealing with bias and variance.
It can be understood with an example of housing prices, such that the price of some houses can be much high
than other houses. To tackle such situations, a stratified k-fold cross-validation technique is useful.

5.8.2.6 Holdout Method


This method is the simplest cross-validation technique of all. In this method, we hold out a subset of the dataset and obtain predictions for it from a model trained on the rest of the dataset.
The error that occurs in this process tells how well our model will perform with the unknown dataset. Although
this approach is simple to perform, it still faces the issue of high variance, and it also produces misleading
results sometimes.
5.8.3 Comparison of Cross-validation to train/test split in Machine Learning
• Train/test split: The input data is divided into two parts, a training set and a test set, in a ratio of 70:30, 80:20, etc. It provides high variance, which is one of its biggest disadvantages.
• Training Data: The training data is used to train the model, and the dependent variable is known.
• Test Data: The test data is used to make predictions from the model that is already trained on the training data. It has the same features as the training data but is not part of it.
• Cross-Validation dataset: It is used to overcome the disadvantage of train/test split by splitting
the dataset into groups of train/test splits, and averaging the result. It can be used if we want to
optimize our model that has been trained on the training dataset for the best performance. It is
more efficient as compared to train/test split as every observation is used for the training and
testing both.
5.8.4 Limitations of Cross-Validation
There are some limitations of the cross-validation technique, which are given below:
• Under ideal conditions, it provides the optimum output, but for inconsistent data it may produce drastically different results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
• In predictive modeling, the data evolves over a period of time, due to which differences may arise between the training set and validation sets. For example, if we create a model for predicting stock market values and train it on the previous 5 years' stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect correct output in such situations.
5.8.5 Applications of Cross-Validation
• This technique can be used to compare the performance of different predictive modeling methods.
• It has great scope in the medical research field.
• It can also be used for the meta-analysis, as it is already being used by the data scientists in the
field of medical statistics.
5.9 Estimators
In machine learning, an estimator is an equation for picking the "best", or most likely accurate, data model based upon observations in reality. The estimator is the formula that evaluates a given quantity and generates an estimate. This estimate is then inserted into the deep learning classifier system to determine what action to take. Estimation is a statistical term for finding some estimate of an unknown parameter, given some data. Point estimation is the attempt to provide the single best prediction of some quantity of interest. The quantity of interest can be:
• A single parameter
• A vector of parameters — e.g., weights in linear regression
• A whole function

➢ Point estimator
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented by θˆ. Let {x(1), x(2), ..., x(m)} be m independent and identically distributed data points. Then a point estimator is any function of the data:

θˆm = g(x(1), x(2), ..., x(m))
This definition of a point estimator is very general and allows the designer of an estimator great flexibility.

While almost any function thus qualifies as an estimator, a good estimator is a function whose output is close
to the true underlying θ that generated the training data.
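As a concrete illustration, the sample mean g(x(1), ..., x(m)) = (1/m) Σ x(i) is a classic point estimator of a distribution's unknown mean. The small numpy sketch below (with synthetic data) shows the estimate landing close to the true value:

```python
# The sample mean as a point estimator: a function g(x(1), ..., x(m))
# of m i.i.d. data points estimating the unknown true mean.
import numpy as np

rng = np.random.default_rng(42)
true_mean = 5.0
samples = rng.normal(loc=true_mean, scale=2.0, size=1000)  # m = 1000 data points

theta_hat = samples.mean()  # the point estimate
print(theta_hat)            # close to 5.0 for large m
```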
Point estimation can also refer to estimation of the relationship between input and target variables, referred to as function estimation.
➢ Function Estimator
Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x)
that describes the approximate relationship between y and x. For example,
we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function
estimation, we are interested in approximating f with a model or estimate fˆ. Function estimation is really
just the same as estimating a parameter θ; the function estimator fˆ is simply a point estimator in function
space. Ex: in polynomial regression we are either estimating a parameter w or estimating a function mapping
from x to y.
5.9.1 Uses of Estimators
By quantifying guesses, estimators are how machine learning theory is implemented in practice. Without the ability to estimate the parameters of a dataset (such as the layers in a neural network or the bandwidth in a kernel), there would be no way for an AI system to "learn".
A simple example of estimators and estimation in practice is the so-called "German Tank Problem" from World War Two. The Allies had no way to know for sure how many tanks the Germans were building every month. By counting the serial numbers of captured or destroyed tanks, Allied statisticians created an estimator rule. This equation calculated the maximum possible number of tanks based upon the sequential serial numbers, and applied minimum-variance analysis to generate the most likely estimate of how many new tanks Germany was building.
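A small sketch of the commonly cited minimum-variance unbiased estimator for this problem, N̂ = m + m/k − 1, where m is the largest observed serial number and k is the number of tanks observed; the serial numbers below are invented for illustration.

```python
# Minimum-variance unbiased estimator for the German Tank Problem:
# N_hat = m + m/k - 1, with m the largest observed serial number and
# k the number of observations. Serial numbers here are invented.
def estimate_total_tanks(serial_numbers):
    k = len(serial_numbers)
    m = max(serial_numbers)
    return m + m / k - 1

print(estimate_total_tanks([19, 40, 68, 94, 120]))  # -> 143.0
```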
5.9.2 Types of Estimators
Estimators come in two broad categories: point and interval. Point equations generate single-value results, such as a standard deviation, that can be plugged into a deep learning algorithm's classifier functions. Interval equations generate a range of likely values, such as a confidence interval, for analysis.
In addition, each estimator rule can be tailored to generate different types of estimates:
• Biased: Either an overestimate or an underestimate.
• Efficient: Smallest-variance analysis; the smallest possible variance is referred to as the "best" estimate.
• Invariant: Less flexible estimates that aren't easily changed by data transformations.
• Shrinkage: An unprocessed estimate that's combined with other variables to create complex estimates.
• Sufficient: Estimating the total population's parameter from a limited dataset.
• Unbiased: An exact-match estimate value that neither underestimates nor overestimates.

The difference between a classifier, a model and an estimator is as follows:


• An estimator is a predictor found from a regression algorithm
• A classifier is a predictor found from a classification algorithm
• A model can be either an estimator or a classifier
5.9.3 Bias and Variance
5.9.3.1 Errors in Machine Learning
In machine learning, an error is a measure of how accurately an algorithm can make predictions for the
previously unknown dataset. On the basis of these errors, the machine learning model is selected that can
perform best on the particular dataset. There are mainly two types of errors in machine learning, which are:
Reducible errors: These errors can be reduced to improve the model accuracy. Such errors can further be
classified into bias and Variance.
Irreducible errors: These errors will always be present in the model regardless of which algorithm has been
used. The cause of these errors is unknown variables whose value can't be reduced.

5.9.3.2 Bias
In general, a machine learning model analyses the data, finds patterns in it and makes predictions. While training, the model learns these patterns in the dataset and applies them to test data for prediction. While making predictions, a difference occurs between the prediction values made by the model and the actual/expected values, and this difference is known as bias error or error due to bias. It can be defined as the inability of machine learning algorithms such as Linear Regression to capture the true relationship between the data points. Each algorithm begins with some amount of bias, because bias arises from assumptions in the model that make the target function simpler to learn. A model has either:
• Low Bias: A low bias model will make fewer assumptions about the form of the target function.
• High Bias: A model with a high bias makes more assumptions, and the model becomes unable
to capture the important features of our dataset. A high bias model also cannot perform well
on new data.
Generally, a linear algorithm has a high bias, as it makes the model learn fast. The simpler the algorithm, the higher the bias it is likely to introduce, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours
and Support Vector Machines. At the same time, an algorithm with high bias is Linear Regression, Linear
Discriminant Analysis and Logistic Regression.

➢ Ways to reduce High Bias:


High bias mainly occurs due to a much simple model. Below are some ways to reduce the high bias:
• Increase the input features as the model is under-fitted.
• Decrease the regularization term.
• Use more complex models, such as including some polynomial features.
5.9.3.3 Variance
The variance specifies the amount of variation in the prediction if different training data were used. In simple words, variance tells how much a random variable differs from its expected value. Ideally, a model should not vary too much from one training dataset to another, which means the algorithm should be good at understanding the hidden mapping between input and output variables. Variance errors are either of low variance or high variance.
• Low variance means there is a small variation in the prediction of the target function with
changes in the training data set.
• High variance shows a large variation in the prediction of the target function with changes in
the training dataset.
A model that shows high variance learns a lot and performs well with the training dataset, and does not
generalize well with the unseen dataset. As a result, such a model gives good results with the training
dataset but shows high error rates on the test dataset.
Since, with high variance, the model learns too much from the dataset, it leads to over-fitting of the model.
A model with high variance has the below problems:
• A high variance model leads to over-fitting.
• Increase model complexities.
Usually, nonlinear algorithms, which have a lot of flexibility to fit the model, have high variance.

Some examples of machine learning algorithms with low variance are Linear Regression, Logistic Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are Decision Tree, Support Vector Machine, and K-Nearest Neighbours.
➢ Ways to Reduce High Variance:
• Reduce the input features or number of parameters as a model is overfitted.
• Do not use a much complex model.
• Increase the training data.
• Increase the Regularization term.

5.9.3.4 Different Combinations of Bias-Variance


There are four possible combinations of bias and variance:
1. Low-Bias, Low-Variance: The combination of low bias and low variance shows an ideal
machine learning model. However, it is not possible practically.
2. Low-Bias, High-Variance: With low bias and high variance, model predictions are inconsistent but accurate on average. This case occurs when the model learns with a large number of parameters and hence leads to over-fitting.
3. High-Bias, Low-Variance: With high bias and low variance, predictions are consistent but inaccurate on average. This case occurs when a model does not learn well with the training dataset or uses only a small number of parameters. It leads to under-fitting problems in the model.
4. High-Bias, High-Variance: With high bias and high variance, predictions are inconsistent and
also inaccurate on average.
High variance can be identified if the model has:
• Low training error and high test error.

High Bias can be identified if the model has:


• High training error and the test error is almost similar to training error.
5.9.3.5 Bias-Variance Trade-Off
While building the machine learning model, it is really important to take care of bias and variance in order
to avoid over-fitting and under-fitting in the model. If the model is very simple with fewer parameters, it may
have low variance and high bias. Whereas, if the model has a large number of parameters, it will have high variance and low bias. So, it is required to strike a balance between bias and variance errors, and this balance between the bias error and variance error is known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance and low bias. But this is not possible
because bias and variance are related to each other:
• If we decrease the variance, it will increase the bias
• If we decrease the bias, it will increase the variance
Bias-Variance trade-off is a central issue in supervised learning. Ideally, we need a model that accurately
captures the regularities in training data and simultaneously generalizes well with the unseen dataset.
Unfortunately, doing both simultaneously is not possible, because a high variance algorithm may perform well with training data but may over-fit to noisy data, whereas a high bias algorithm generates a much simpler model that may not even capture important regularities in the data. So, we need to find a sweet spot
between bias and variance to make an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to make a balance between bias and
variance errors.
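The trade-off can be seen numerically in a small numpy sketch: fitting polynomials of increasing degree to noisy synthetic data, a degree-1 fit under-fits (high bias) and a degree-15 fit over-fits (high variance), with the sweet spot in between.

```python
# Under-fitting vs over-fitting with polynomial degree (synthetic data):
# low degree -> high bias, very high degree -> high variance.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=30)  # noisy training data
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                         # noise-free truth

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)                       # fit on training data
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```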

5.10. Challenges Motivating Deep Learning


The challenges are as listed below.
1. Learning without Supervision
Deep learning models are one of, if not the most data-hungry models of the Machine Learning world. They
need huge amounts of data to reach their optimal performance and serve us with the excellence we expect
from them.
However, having this much data is not always easy. Additionally, while we can have large amounts of data
on some topic, many times it is not labeled so we cannot use it to train any kind of supervised learning
algorithm.

One of the main challenges of Deep Learning derived from this is being able to deliver great performances
with a lot less training data. As we will see later, recent advances like transfer learning or semi-supervised
learning are already taking steps in this direction, but still it is not enough.
2. Coping with data from outside the training distribution
Data is dynamic, it changes through different drivers like time, location, and many other conditions.
However, Machine Learning models, including Deep Learning ones, are built using a defined set of data
(the training set) and perform well as long as the data that is later used to make predictions once the system
is built comes from the same distribution as the data the system was built with.
This makes them perform poorly when they are fed data that is not entirely different from the training data but does have some variations from it. Another challenge of Deep Learning in the future will be to overcome this problem and still perform reasonably well when data that does not exactly match the training data is fed to the model.
3. Incorporating Logic
Incorporating some sort of rule based knowledge, so that logical procedures can be implemented and
sequential reasoning used to formalize knowledge.
While these cases can be covered in code, Machine Learning algorithms don‘t usually incorporate sets or rules
into their knowledge. Kind of like a prior data distribution used in Bayesian learning, sets of pre- defined rules
could assist Deep Learning systems in their reasoning and live side by side with the ‗learning from data‘
based approach.
4. The Need for less data and higher efficiency
Although we kind of covered this in our first two sections, this point is really worth highlighting.
The success of Deep Learning comes from the possibility to incorporate many layers into our models, allowing
them to try an insane number of linear and non-linear parameter combinations. However, with more layers
comes more model complexity and we need more data for this model to function correctly.
When the amount of data that we have is effectively smaller than the complexity of the neural network then
we need to resort to a different approach like the aforementioned Transfer Learning.
Also, very large Deep Learning models, aside from needing huge amounts of data to be trained on, use a lot of computational resources and can take a very long time to train. Advances in the field should also be oriented towards making the training process more efficient and cost-effective.
6. Deep Neural Network
Deep neural networks (DNNs) are a class of machine learning algorithms similar to the artificial neural network and aim to mimic the information processing of the brain. Deep neural networks, or deep learning networks,
have several hidden layers with millions of artificial neurons linked together. A number, called weight,
represents the connections between one node and another. The weight is a positive number if one node excites
another, or negative if one node suppresses the other.
6.1 Feed-Forward Neural Network
In its most basic form, a Feed-Forward Neural Network is a single layer perceptron. A sequence of inputs
enter the layer and are multiplied by the weights in this model. The weighted input values are then summed
together to form a total. If the sum of the values is more than a predetermined threshold, which is normally
set at zero, the output value is usually 1, and if the sum is less than the threshold, the output value is usually -
1. The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification. Single-layer perceptrons can also contain machine learning features.
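A minimal sketch of this forward pass, with invented weights and inputs (numpy assumed):

```python
# Forward pass of a single-layer perceptron with a threshold at zero:
# output 1 if the weighted sum exceeds the threshold, otherwise -1.
import numpy as np

def perceptron(x, w, b):
    total = np.dot(w, x) + b       # weighted sum of the inputs
    return 1 if total > 0 else -1  # threshold activation

x = np.array([1.0, -2.0, 0.5])    # inputs
w = np.array([0.4, 0.3, -0.2])    # weights
print(perceptron(x, w, b=0.1))    # -> -1, since 0.4 - 0.6 - 0.1 + 0.1 = -0.2
```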

The neural network can compare the outputs of its nodes with the desired values using a property known as the delta rule, allowing the network to alter its weights through training to create more accurate output values. This process of training and learning produces a form of gradient descent. In multi-layered perceptrons the technique of updating weights is virtually the same; however, the process is referred to as back-propagation. In such circumstances, the output values provided by the final layer are used to alter each hidden layer inside the network.
6.1.1 Work Strategy
The function of each neuron in the network is similar to that of linear regression. The neuron also has
an activation function at the end, and each neuron has its weight vector.

6.1.2 Importance of the Non-Linearity


When two or more linear objects, such as a line, plane, or hyperplane, are combined, the outcome is also a linear object: a line, plane, or hyperplane. No matter how many of these linear objects we add, we'll still end up with a linear object.
However, this is not the case when adding non-linear objects. When two separate curves are combined, the result is likely to be a more complex curve.
We're introducing non-linearity at every layer using these activation functions, in addition to just adding non-linear objects or hyper-curves like hyperplanes. In other words, we're applying a non-linear function on an already non-linear object.
What if activation functions were not used in neural networks?
Suppose neural networks didn't have an activation function: they'd just be one huge linear unit that a single linear regression model could easily replace.
a = m*x + d
Z = k*a + t = k*(m*x + d) + t = k*m*x + k*d + t = (k*m)*x + (k*d + t)
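The collapse shown by this algebra can be checked numerically; the short numpy sketch below also shows where a non-linearity such as ReLU would break it.

```python
# Without an activation, two stacked linear layers collapse into one
# linear layer, exactly as the algebra above shows.
import numpy as np

x = np.array([1.0, 2.0])
W1, b1 = np.array([[0.5, -1.0], [2.0, 0.3]]), np.array([0.1, -0.2])
W2, b2 = np.array([[1.5, 0.4], [-0.7, 1.0]]), np.array([0.0, 0.5])

two_layers = W2 @ (W1 @ x + b1) + b2        # layer after layer, no activation
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)  # one equivalent linear layer
print(np.allclose(two_layers, collapsed))   # True

relu = lambda z: np.maximum(z, 0.0)
with_relu = W2 @ relu(W1 @ x + b1) + b2     # a non-linearity breaks the collapse
```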
6.1.3 Applications of the Feed Forward Neural Networks
A Feed Forward Neural Network is an artificial neural network in which the connections between nodes do not form a cycle. It is the polar opposite of a recurrent neural network, in which some routes are cycled. The feed-forward model is the simplest type of neural network because the input is only processed in one direction. The data always flows forward and never backwards, regardless of how many hidden nodes it passes through.
6.2 Regularization in Machine Learning
Regularization is one of the most important concepts of machine learning. It is a technique to prevent the
model from overfitting by adding extra information to it.
Sometimes the machine learning model performs well with the training data but does not perform well with the test data. It means the model is not able to predict the output when dealing with unseen data, because noise has been introduced into the output; such a model is said to be overfitted. This problem can be dealt with using a regularization technique.
This technique can be used in such a way that it allows us to maintain all variables or features in the model by reducing the magnitude of the variables. Hence, it maintains accuracy as well as the generalization of the model.
It mainly regularizes or reduces the coefficient of features toward zero. In simple words, "In regularization
technique, we reduce the magnitude of the features by keeping the same number of features."
How does Regularization Work?
Regularization works by adding a penalty or complexity term to the complex model. Let's consider the simple
linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
In the above equation, y represents the value to be predicted, x1, x2, ..., xn are the features for y, and β0, β1, ..., βn are the weights or magnitudes attached to the features, respectively. Here β0 represents the bias of the model, and b represents the intercept.
Linear regression models try to optimize β0 and b to minimize the cost function. The loss function for linear regression, called the RSS or Residual Sum of Squares, is:

RSS = Σi (yi − β0 − Σj βj xij)²

Regularization then adds a penalty term to this loss and optimizes the parameters so that the model can predict the accurate value of y.
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
6.2.1 Ridge Regression

Ridge regression is one of the types of linear regression in which a small amount of bias is introduced so
that we can get better long-term predictions.
Ridge regression is a regularization technique used to reduce the complexity of the model. It is also called L2 regularization.
In this technique, the cost function is altered by adding the penalty term to it. The amount of bias added to the
model is called Ridge Regression penalty. We can calculate it by multiplying with the lambda to the squared
weight of each individual feature.
The equation for the cost function in ridge regression will be:

Cost = Σi (yi − ŷi)² + λ Σj βj²

In the above equation, the penalty term λ Σj βj² regularizes the coefficients of the model; hence ridge regression reduces the amplitudes of the coefficients, which decreases the complexity of the model.
As we can see from the above equation, if the values of λ tend to zero, the equation becomes the cost
function of the linear regression model. Hence, for the minimum value of λ, the model will resemble the
linear regression model.
A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
It helps to solve the problems if we have more parameters than samples.
6.2.2 Lasso Regression
Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
It is similar to the Ridge Regression except that the penalty term contains only the absolute weights instead
of a square of weights.
Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only shrink
it near to 0.
It is also called L1 regularization. The equation for the cost function of Lasso regression will be:

Cost = Σi (yi − ŷi)² + λ Σj |βj|
Some of the features in this technique are completely neglected for model evaluation.
Hence, the Lasso regression can help us to reduce the overfitting in the model as well as the feature
selection.
Key Difference between Ridge Regression and Lasso Regression
Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
Lasso regression helps to reduce the overfitting in the model as well as feature selection.
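A short sketch contrasting the two penalties, assuming scikit-learn is available (the data is synthetic, and alpha plays the role of λ); note how Lasso drives the irrelevant coefficients exactly to zero while Ridge only shrinks them.

```python
# Ridge (L2) shrinks all coefficients; Lasso (L1) can zero some out,
# performing feature selection. Only the first two features matter here.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # alpha plays the role of lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print(ridge.coef_)  # all five coefficients small but non-zero
print(lasso.coef_)  # irrelevant coefficients driven exactly to zero
```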

6.3 Optimization in Machine Learning
In machine learning, optimization is the procedure of identifying the ideal set of model parameters that
minimize a loss function. For a particular set of inputs, the loss function calculates the discrepancy between
the predicted and actual outputs. For the model to successfully forecast the output for fresh inputs,
optimization seeks to minimize the loss function.
A method for finding a function's minimum or maximum is called an optimization algorithm, which is used
in optimization. Up until the minimum or maximum of the loss function is reached, the optimization algorithm
iteratively modifies the model parameters. Gradient descent, stochastic gradient descent, Adam, Adagrad,
and RMSProp are a few optimization methods that can be utilised in machine learning.
• Gradient Descent
In machine learning, gradient descent is a popular optimization approach. It is a first-order optimization algorithm that works by repeatedly changing the model's parameters in the direction of the negative gradient of the loss function. The loss function decreases most quickly in that direction, because the negative gradient points in the direction of steepest descent. (A minimal sketch comparing gradient descent and stochastic gradient descent appears after this list.)
The gradient descent algorithm operates by starting with an initial set of parameters and computing the gradient of the loss function with respect to each parameter. The gradient is a vector containing the partial derivatives of the loss function with respect to each parameter. The algorithm then updates the parameters by subtracting a small multiple of the gradient from their current values.
• Stochastic Gradient Descent
A part of the training data is randomly chosen for each iteration of the stochastic gradient descent
process, which is a variant on the gradient descent technique. This makes the algorithm's computations
simpler and speeds up its convergence. For big datasets when it is not practical to compute the gradient
of the loss function for all of the training data, stochastic gradient descent is especially helpful.
The primary distinction between stochastic gradient descent and gradient descent is that stochastic
gradient descent changes the parameters based on the gradient obtained for a single example rather
than the full dataset. Due to the stochasticity introduced by this, each iteration of the algorithm may
result in a different local minimum.
• Adam
Adam is an optimization algorithm that combines the advantages of momentum-based techniques and
stochastic gradient descent. The learning rate during training is adaptively adjusted using the first and
second moments of the gradient. Adam is frequently used in deep learning since it is known to
converge more quickly than other optimization techniques.
• Adagrad
An optimization algorithm called Adagrad adjusts the learning rate for each parameter based on
previous gradient data. It is especially beneficial for sparse datasets with sporadic occurrences of
specific attributes. Adagrad can converge more quickly than other optimization methods because it
uses separate learning rates for each parameter.
• RMSProp
An optimization method called RMSProp deals with the issue of vanishing and exploding gradients in deep neural networks. It employs a moving average of the squared gradient to normalize the learning rate for each parameter. RMSProp is a popular deep learning optimization algorithm, well known for converging more quickly than some other optimization algorithms.
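As referenced above, here is a minimal numpy sketch contrasting full-batch gradient descent with stochastic gradient descent on a synthetic least-squares problem; it is an illustration, not a production implementation.

```python
# Full-batch gradient descent vs stochastic gradient descent on a
# synthetic least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)
lr = 0.05

w_gd = np.zeros(3)
for _ in range(500):                           # gradient over the full dataset
    grad = 2 * X.T @ (X @ w_gd - y) / len(X)
    w_gd -= lr * grad                          # step along the negative gradient

w_sgd = np.zeros(3)
for _ in range(500):                           # gradient from one random example
    i = rng.integers(len(X))
    grad = 2 * X[i] * (X[i] @ w_sgd - y[i])
    w_sgd -= lr * grad

print(w_gd, w_sgd)                             # both approximate true_w
```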
6.3.1 Importance of Optimization in Machine Learning
Machine learning depends heavily on optimization since it gives the model the ability to learn from data
and generate precise predictions. Model parameters are estimated using machine learning techniques using
the observed data. Finding the parameters' ideal values to minimize the discrepancy between the predicted
and actual results for a given set of inputs is the process of optimization. Without optimization, the model's
parameters would be chosen at random, making it impossible to correctly forecast the outcome for brand-
new inputs.
Optimization is highly valued in deep learning models, which have multiple levels of layers and millions of
parameters. Deep neural networks need a lot of data to be trained, and optimizing the parameters of the model
in which they are used requires a lot of processing power. The optimization algorithm chosen can have a big
impact on the training process's accuracy and speed.
New machine learning algorithms are also implemented solely through optimization. Researchers are
constantly looking for novel optimization techniques to boost the accuracy and speed of machine learning
systems. These techniques include normalization, optimization strategies that account for knowledge of the
underlying structure of the data, and adaptive learning rates.
6.3.2Challenges in Optimization
There are difficulties with machine learning optimization. One of the most difficult issues is overfitting, which
happens when the model learns the training data too well and is unable to generalize to new data. When the
model is overly intricate or the training set is insufficient, overfitting might happen.
When the optimization process converges to a local minimum rather than the global optimum, it poses the
problem of local minima, which is another obstacle in optimization. Deep neural networks, which contain
many parameters and may have multiple local minima, are highly prone to local minima.
