AD3501-DL-Unit 1 Notes
1. Introduction:
Today, artificial intelligence (AI) is a thriving field with many practical applications and active research
topics. We look to intelligent software to automate routine labor, understand speech or images, make
diagnoses in medicine and support basic scientific research. In the early days of artificial intelligence, the field
rapidly tackled and solved problems that are intellectually difficult for human beings but relatively
straightforward for computers—problems that can be described by a list of formal, mathematical rules. The
true challenge of AI lies in solving more intuitive problems. The solution is to allow computers to learn
from experience and understand the world in terms of a hierarchy of concepts, with each concept defined in
terms of its relation to simpler concepts. By gathering knowledge from experience, this approach avoids the
need for human operators to formally specify all of the knowledge that the computer needs. The hierarchy
of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If one
draws a graph showing how these concepts are built on top of each other, the graph is deep, with many layers.
For this reason, this approach is called deep learning.
A computer can reason about statements in these formal languages automatically using logical inference rules.
This is known as the knowledge base approach to artificial intelligence. The difficulties faced by systems
relying on hard-coded knowledge suggest that AI systems need the ability to acquire their own knowledge,
by extracting patterns from raw data. This capability is known as machine learning. The introduction of
machine learning allowed computers to tackle problems involving knowledge of the real world and make
decisions that appear subjective. A simple machine learning algorithm called logistic regression can
determine whether to recommend cesarean delivery. A simple machine learning algorithm called naive Bayes
can separate legitimate e-mail from spam e-mail.
The performance of these simple machine learning algorithms depends heavily on the representation of the
data they are given. For example, when logistic regression is used to recommend cesarean delivery, the AI
system does not examine the patient directly. Instead, the doctor tells the system several pieces of relevant
information, such as the presence or absence of a uterine scar. Each piece of information included in the
representation of the patient is known as a feature. Logistic regression learns how each of these features of
the patient correlates with various outcomes. However, it cannot influence the way that the features are defined
in any way. If logistic regression were given an MRI scan of the patient, rather than the doctor's formalized
report, it would not be able to make useful predictions. Individual pixels in an MRI scan have negligible
correlation with any complications that might occur during delivery.
This dependence on representations is a general phenomenon that appears throughout computer science and
even daily life. In computer science, operations such as searching a collection of data can proceed
exponentially faster if the collection is structured and indexed intelligently. People can easily perform
arithmetic on Arabic numerals, but find arithmetic on Roman numerals much more time-consuming. It is not
surprising that the choice of representation has an enormous effect on the performance of machine learning
algorithms. Many artificial intelligence tasks can be solved by designing the right set of features to extract for
that task, then providing these features to a simple machine learning algorithm. However, for many tasks, it
is difficult to know what features should be extracted. For example, suppose that we would like to write a
program to detect cars in photographs. We know that cars have wheels, so we might like to use the presence
of a wheel as a feature. Unfortunately, it is difficult to describe exactly what a wheel looks like in terms of
pixel values.
One solution to this problem is to use machine learning to discover not only the mapping from representation
to output but also the representation itself. This approach is known as representation learning. Learned
representations often result in much better performance than can be obtained with hand-designed
representations. They also allow AI systems to rapidly adapt to new tasks, with minimal human intervention.
A representation learning algorithm can discover a good set of features for a simple task in minutes, or a
complex task in hours to months.
The quintessential example of a representation learning algorithm is the autoencoder. An autoencoder is the
combination of an encoder function that converts the input data into a different representation, and a decoder
function that converts the new representation back into the original format. Autoencoders are trained to
preserve as much information as possible when an input is run through the encoder and then the decoder, but
are also trained to make the new representation have various nice properties. Different kinds of autoencoders
aim to achieve different kinds of properties. When designing features or algorithms for learning features,
our goal is usually to separate the factors of variation that explain the observed data. A major source of
difficulty in many real-world artificial intelligence applications is that many of the factors of variation
influence every single piece of data we are able to observe. The individual pixels in an image of a red car
might be very close to black at night. The shape of the car's silhouette depends on the viewing angle. It can
be very difficult to extract such high-level, abstract features from raw data. Deep learning solves this central
problem in representation learning by introducing representations that are expressed in terms of other, simpler
representations.
Deep learning allows the computer to build complex concepts out of simpler concepts. Fig. 1.1 shows how a
deep learning system can represent the concept of an image of a person by combining simpler concepts, such
as corners and contours, which are in turn defined in terms of edges. The quintessential example of a deep
learning model is the feedforward deep network or multilayer perceptron (MLP). A multilayer perceptron is
just a mathematical function mapping some set of input values to output values. The function is formed by
composing many simpler functions. The idea of learning the right representation for the data provides one
perspective on deep learning. Another perspective on deep learning is that depth allows the computer to learn
a multi-step computer program. Each layer of the representation can be thought of as the state of the
computer's memory after executing another set of instructions in parallel. Networks with greater
depth can execute more instructions in sequence. Sequential instructions offer great power because later
instructions can refer back to the results of earlier instructions.
The input is presented at the visible layer, so named because it contains the variables that we are able to
observe. Then a series of hidden layers extracts increasingly abstract features from the image. These layers
are called "hidden" because their values are not given in the data; instead the model must determine which
concepts are useful for explaining the relationships in the observed data. The images here are visualizations
of the kind of feature represented by each hidden unit. Given the pixels, the first layer can easily identify
edges, by comparing the brightness of neighboring pixels. Given the first hidden layer's description of the
edges, the second hidden layer can easily search for corners and extended contours, which are recognizable
as collections of edges. Given the second hidden layer‘s description of the image in terms of corners and
contours, the third hidden layer can detect entire parts of specific objects, by finding specific collections of
contours and corners. Finally, this description of the image in terms of the object parts it contains can be used
to recognize the objects present in the image.
There are two main ways of measuring the depth of a model. The first view is based on the number of
sequential instructions that must be executed to evaluate the architecture. Another approach, used by deep
probabilistic models, regards the depth of a model as being not the depth of the computational graph but the
depth of the graph describing how concepts are related to each other. Machine learning is the only viable
approach to building AI systems that can operate in complicated, real-world environments. Deep learning is
a particular kind of machine learning that achieves great power and flexibility by learning to represent the
world as a nested hierarchy of concepts, with each concept defined in relation to simpler concepts, and more
abstract representations computed in terms of less abstract ones. Fig. 1.2 illustrates the relationship between
these different AI disciplines. Fig. 1.3 gives a high-level schematic of how each works.
Fig. 1.2 Venn diagram representing relationship between AI disciplines
AI is basically the study of training your machine (computers) to mimic a human brain and its thinking
capabilities. AI focuses on three major aspects (skills): learning, reasoning, and self-correction to obtain
the maximum efficiency possible. Machine Learning (ML) is an application or subset of AI. The major
aim of ML is to allow the systems to learn by themselves through experience without any kind of human
intervention or assistance. Deep Learning (DL) is basically a sub-part of the broader family of Machine
Learning which makes use of Neural Networks (similar to the neurons working in our brain) to mimic human
brain-like behavior. DL algorithms focus on information-processing patterns to identify patterns in much
the same way the human brain does, and to classify the information accordingly. DL works on
larger sets of data when compared to ML and the prediction mechanism is self-administered by machines.
The differences between AI, ML and DL are presented as Table 1 as below.
Table 1. Difference between Artificial Intelligence, Machine Learning & Deep Learning
2. Linear Algebra:
A good understanding of linear algebra is essential for understanding and working with many machine
learning algorithms, especially deep learning algorithms.
2.1 Scalars, Vectors, Matrices and Tensors
The study of linear algebra involves several types of mathematical objects:
● Scalars: A scalar is just a single number, in contrast to most of the other objects studied in linear algebra, which
are usually arrays of multiple numbers. We write scalars in italics. We usually give scalars lower- case variable
names. When we introduce them, we specify what kind of number they are. For example, we might say "Let
s ∈ R be the slope of the line," while defining a real-valued scalar, or "Let n ∈ N be the number of units,"
while defining a natural number scalar.
● Vectors: A vector is an array of numbers. The numbers are arranged in order. We can identify each individual
number by its index in that ordering. Typically we give vectors lower case names written in bold typeface,
such as x. The elements of the vector are identified by writing its name in italic typeface, with a subscript.
The first element of x is x1, the second element is x2 and so on. We also need to say what kinds of numbers
are stored in the vector. If each element is in R, and the vector has n elements, then the vector lies in the set
formed by taking the Cartesian product of R n times, denoted as R^n. When we need to explicitly identify the
elements of a vector, we write them as a column enclosed in square brackets:
x = [x1, x2, ..., xn]^T
We can think of vectors as identifying points in space, with each element giving the coordinate along a
different axis. Sometimes we need to index a set of elements of a vector. In this case, we define a set containing
the indices and write the set as a subscript. For example, to access x1, x3 and x6 , we define the set S = {1, 3,
6} and write xS . We use the − sign to index the complement of a set. For example x−1 is the vector containing
all elements of x except for x1, and x−S is the vector containing all of the elements of x except for x1, x3 and
x6.
● Matrices: A matrix is a 2-D array of numbers, so each element is identified by two indices instead of just one.
We usually give matrices upper-case variable names with bold typeface, such as A. If a real-valued matrix
A has a height of m and a width of n, then we say that A ∈ Rm×n. We usually identify the elements of a matrix
using its name in italic but not bold font, and the indices are listed with separating commas. For example, A1,1
is the upper left entry of A and Am,n is the bottom right entry. We can identify all of the numbers with vertical
coordinate i by writing a ":" for the horizontal coordinate. For example, Ai,: denotes the horizontal cross
section of A with vertical coordinate i. This is known as the i-th row of A. Likewise, A:,i is the i-th column of
A. When we need to explicitly identify the elements of a matrix, we write them as an array enclosed in square
brackets:
A = [ A1,1  A1,2 ; A2,1  A2,2 ]
Sometimes we may need to index matrix-valued expressions that are not just a single letter. In this case, we
use subscripts after the expression, but do not convert anything to lower case. For example, f(A)i,j gives
element (i, j) of the matrix computed by applying the function f to A.
● Tensors: In some cases we will need an array with more than two axes. In the general case, an array of numbers
arranged on a regular grid with a variable number of axes is known as a tensor. We denote a tensor named
"A" with this typeface: A. We identify the element of A at coordinates (i, j, k) by writing Ai,j,k. One important
operation on matrices is the transpose. The transpose of a matrix is the mirror image of the matrix across a
diagonal line, called the main diagonal, running down and to the right, starting from its upper left corner. See
Fig. 2.1 for a graphical depiction of this operation. We denote the transpose of a matrix A as A^T, and it is
defined such that (A^T)i,j = Aj,i.
Vectors can be thought of as matrices that contain only one column. The transpose of a vector is therefore a
matrix with only one row. Sometimes we define a vector by writing out its elements in the text inline as a
row matrix, then using the transpose operator to turn it into a standard column vector, e.g., x = [x1, x2, x3]^T.
A scalar can be thought of as a matrix with only a single entry. From this, we can see that a scalar is its own
transpose: a = aT. We can add matrices to each other, as long as they have the same shape, just by adding their
corresponding elements: C = A + B where Ci,j = Ai,j + Bi,j. We can also add a scalar to a matrix or multiply a
matrix by a scalar, just by performing that operation on each element of a matrix: D = a · B + c where Di,j = a
· Bi,j + c.
In the context of deep learning, we also use some less conventional notation. We allow the addition of a
matrix and a vector, yielding another matrix: C = A + b, where Ci,j = Ai,j + bj. In other words, the vector b is
added to each row of the matrix. This shorthand eliminates the need to define a matrix with b copied into each
row before doing the addition. This implicit copying of b to many locations is called broadcasting.
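As a small illustration of these conventions, here is a minimal NumPy sketch (the array values are arbitrary examples) showing element-wise matrix addition, the scalar operation D = a · B + c, and broadcasting of a vector b across the rows of a matrix A:

import numpy as np

# Matrix addition requires equal shapes: C = A + B, with C[i, j] = A[i, j] + B[i, j]
A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])
C = A + B                      # [[11, 22], [33, 44]]

# Scalar multiply-and-add is applied element-wise: D[i, j] = a * B[i, j] + c
a, c = 2.0, 1.0
D = a * B + c                  # [[21, 41], [61, 81]]

# Broadcasting: the vector b is implicitly copied and added to each row of A
b = np.array([100.0, 200.0])
E = A + b                      # [[101, 202], [103, 204]]
print(C, D, E, sep="\n")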
• ∀x ∈ X, 0 ≤ P(x) ≤ 1. An impossible event has probability 0 and no state can be less probable
than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can
have a greater chance of occurring.
• Σx∈X P(x) = 1. We refer to this property as being normalized. Without this property, we
could obtain probabilities greater than one by computing the probability of one of many events
occurring.
For example, consider a single discrete random variable x with k different states. We can place a uniform
distribution on x—that is, make each of its states equally likely—by setting its probability mass function to
P(x = xi) = 1/k for all i. We can see that this fits the requirements for a probability mass function. The value
1/k is positive because k is a positive integer. We also see that Σi P(x = xi) = Σi 1/k = k/k = 1,
so the distribution is properly normalized. Let's discuss a few discrete probability distributions:
The Python code below shows a simple example of the Poisson distribution. It has two parameters:
1. Lam: Known number of occurrences
2. Size: The shape of the returned array
The Python code below generates a 1x100 distribution for an occurrence rate of 5.
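A minimal sketch of such a snippet, assuming numpy.random.poisson with lam=5 and size=100:

import numpy as np

# Draw a 1x100 sample from a Poisson distribution with lam (rate of occurrence) = 5
sample = np.random.poisson(lam=5, size=100)
print(sample)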
3.3.2 Continuous Variables and Probability Density Functions
When working with continuous random variables, we describe probability distributions using a probability
density function (PDF) rather than a probability mass function. To be a probability density function, a function
p must satisfy the following properties:
• The domain of p must be the set of all possible states of x.
• ∀x ∈ X, p(x) ≥ 0. Note that we do not require p(x) ≤ 1.
• ∫ p(x) dx = 1.
A probability density function p(x) does not give the probability of a specific state directly, instead the
probability of landing inside an infinitesimal region with volume δx is given by p(x)δx.
We can integrate the density function to find the actual probability mass of a set of points. Specifically, the
probability that x lies in some set S is given by the integral of p(x) over that set. In the univariate example,
the probability that x lies in the interval [a, b] is given by ∫[a,b] p(x) dx.
For an example of a probability density function corresponding to a specific probability density over a
continuous random variable, consider a uniform distribution on an interval of the real numbers. We can do this
with a function u(x; a, b), where a and b are the endpoints of the interval, with b > a. The ";" notation means
"parametrized by"; we consider x to be the argument of the function, while a and b are parameters
that define the function. To ensure that there is no probability mass outside the interval, we say u(x; a, b) = 0
for all x ∉ [a, b]. Within [a, b], u(x; a, b) = 1/(b − a). We can see that this is nonnegative everywhere.
Additionally, it integrates to 1. We often denote that x follows the uniform distribution on [a, b] by writing x
~ U(a, b).
3.3.2.1 Normal Distribution
Normal Distribution is one of the most basic continuous distribution types. Gaussian distribution is another
name for it. Around its mean value, this probability distribution is symmetrical. It also demonstrates that data
close to the mean occurs more frequently than data far from it. Here, the mean is 0, and the variance is a finite
value.
In the example below, you generate 100 random variables ranging from 1 to 50. After that, you create a function
to define the normal distribution formula and calculate the probability density function. Then, you plot
the data points and probability density function against the X-axis and Y-axis, respectively.
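A minimal sketch of the example described above; the mean and standard deviation of the generated data are assumed as the distribution parameters:

import numpy as np
import matplotlib.pyplot as plt

# 100 random variables ranging from 1 to 50
x = np.sort(np.random.uniform(1, 50, 100))

def normal_pdf(x, mean, sd):
    # Normal (Gaussian) probability density function
    return (1.0 / (sd * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mean) / sd) ** 2)

# Plot the data points against their probability density
y = normal_pdf(x, mean=np.mean(x), sd=np.std(x))
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Probability density")
plt.show()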
3.3.2.3 Log-Normal Distribution
The random variables whose logarithm values follow a normal distribution are plotted using this
distribution. Take a look at the random variables X and Y. The variable represented in this distribution is Y
= ln(X), where ln denotes the natural logarithm of X values.
The size distribution of rain droplets can be plotted using log normal distribution.
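A minimal sketch, assuming numpy.random.lognormal with mean 0 and sigma 1 and a simple histogram plot:

import numpy as np
import matplotlib.pyplot as plt

# Samples whose logarithm follows a normal distribution (mean=0, sigma=1 assumed)
sample = np.random.lognormal(mean=0.0, sigma=1.0, size=1000)
plt.hist(sample, bins=50)
plt.xlabel("X")
plt.ylabel("Frequency")
plt.show()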
3.3.2.4 Exponential Distribution
In a Poisson process, an exponential distribution is a continuous probability distribution that describes the
time between events (success, failure, arrival, etc.).
The example below shows how to get random samples from an exponential distribution and return them as a
NumPy array using the numpy.random.exponential() method.
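A minimal sketch, with the scale parameter and sample size chosen arbitrarily:

import numpy as np

# Random samples from an exponential distribution (scale is the inverse of the rate)
sample = np.random.exponential(scale=2.0, size=10)
print(sample)   # a NumPy array of 10 samples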
4.1 Role of an Optimization
As discussed above, optimizers update the parameters of neural networks such as weights and learning rate
to minimize the loss function. Here, the loss function acts as a guide to the terrain, telling the optimizer if it is
moving in the right direction to reach the bottom of the valley, the global minimum.
4.2 The Intuition behind Optimization
Let us imagine a climber hiking down a hill with no sense of direction. He doesn't know the right way to
reach the valley in the hills, but he can tell whether he is moving closer (going downhill) or further
away (uphill) from his final destination. If he keeps taking steps in the correct direction, he will reach his
aim, i.e., the valley.
This is exactly the intuition behind optimization: to reach a global minimum of the loss function.
4.3 Instances of Gradient-Based Optimizers
Different instances of Gradient descent based Optimizers are as follows:
• Batch Gradient Descent or Vanilla Gradient Descent or Gradient Descent (GD)
• Stochastic Gradient Descent (SGD)
• Mini batch Gradient Descent (MB-GD)
4.3.1 Batch Gradient Descent
The gradient descent algorithm is an optimization algorithm used to minimize a function. The
function to be minimized is called the objective function. For machine learning, the
objective function is also termed the cost function or loss function. It is the loss function that is
optimized (minimized), and gradient descent is used to find the most optimal values of the parameters / weights
that minimize the loss function. The loss function, simply speaking, is a measure of the squared
difference between actual values and predictions. In order to minimize the objective function, the most
optimal values of the parameters of the function are found from a large or infinite parameter space.
The gradient of a function at any point is the direction of steepest increase or ascent of the function at
that point.
Based on the above, the negative gradient of a function at any point thus represents the direction of steepest
decrease or descent of the function at that point.
In order to find the gradient of the function with respect to the x dimension, take the derivative of the
function with respect to x, then substitute the x-coordinate of the point of interest for x in the
derivative. Once the gradient of the function at a point is calculated, the descent direction can be calculated by
multiplying the gradient by -1. Here are the steps for finding the minimum of a function using gradient descent:
• Calculate the gradient by taking the derivative of the function with respect to the specific
parameter. In case, there are multiple parameters, take the partial derivatives with respect to
different parameters.
• Calculate the descent value for different parameters by multiplying the value of derivatives with
learning or descent rate (step size) and -1.
• Update the value of the parameter by adding the descent value to its existing value.
The diagram below represents the update of the parameter θ with the value of the
gradient in the opposite direction while taking small steps.
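A minimal numeric sketch of these steps on f(x) = x^2, whose gradient is 2x (the starting point and learning rate are arbitrary):

# Minimize f(x) = x**2 with plain gradient descent
x = 5.0                  # arbitrary starting point
learning_rate = 0.1      # step size
for step in range(50):
    gradient = 2 * x                       # derivative of f at the current point
    descent = -learning_rate * gradient    # multiply the gradient by the learning rate and -1
    x = x + descent                        # add the descent value to the existing parameter
print(x)   # close to 0, the minimum of f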
Gradient descent is an optimization algorithm that‘s used when training deep learning models. It‘s based on
a convex function and updates its parameters iteratively to minimize a given function to its local minimum.
The gradient descent update rule can be written as θj = θj − α · ∂J(θ)/∂θj, where:
• α is the learning rate,
• J is the cost function, and
• θ is the parameter to be updated.
As we see, the gradient represents the partial derivative of J(cost function) with respect to ϴj
Note that, as we get closer to the global minimum, the slope (gradient) of the curve becomes less and less
steep, which results in a smaller value of the derivative, which in turn reduces the step size
automatically.
It is the most basic but most used optimizer that directly uses the derivative of the loss function and learning
rate to reduce the loss function and tries to reach the global minimum.
Thus, the Gradient Descent Optimization algorithm has many applications including-
• Linear Regression,
• Classification Algorithms,
• Back-propagation in Neural Networks, etc.
The above-described equation calculates the gradient of the cost function J(θ) with respect to the network
parameters θ for the entire training dataset.
Our aim is to reach the bottom of the graph (cost vs. weight), or a point where we can no longer move
downhill, i.e., a local minimum.
➢ Role of Gradient
In general, the gradient represents the slope of the equation; gradients are partial derivatives that
describe the change in the loss function with respect to a small change in the parameters of the
function. This slight change in the loss function tells us about the next step to take to reduce the output of the
loss function.
➢ Role of Learning Rate
Learning rate represents the size of the steps our optimization algorithm takes to reach the global minimum.
To ensure that the gradient descent algorithm reaches the local minimum we must set the learning rate to an
appropriate value, which is neither too low nor too high.
Taking very large steps, i.e., a large value of the learning rate, may skip the global minimum, and the model will
never reach the optimal value of the loss function. On the contrary, taking very small steps, i.e., a small value
of the learning rate, will take forever to converge.
Thus, the size of the step is also dependent on the gradient value.
As we discussed, the gradient represents the direction of increase. But our aim is to find the minimum point
in the valley so we have to go in the opposite direction of the gradient. Therefore, we update parameters in
the negative gradient direction to minimize the loss.
Algorithm: θ=θ−α⋅∇J(θ)
In code, Batch Gradient Descent looks something like this:
for x in range(epochs):
    # Gradient of the loss over the entire training dataset (find_gradient is a placeholder)
    params_gradient = find_gradient(loss_function, data, parameters)
    parameters = parameters - learning_rate * params_gradient
➢ Advantages of Batch Gradient Descent
• Easy computation
• Easy to implement
• Easy to understand
➢ Disadvantages of Batch Gradient Descent
• May get trapped at a local minimum
• Weights are changed only after calculating the gradient on the whole dataset. So, if the dataset is too
large, it may take a very long time to converge to the minimum
• Requires large memory to calculate gradient on the whole dataset
4.3.2 Stochastic Gradient Descent
To overcome some of the disadvantages of the GD algorithm, the SGD algorithm comes into the picture as
an extension of the Gradient Descent. One of the disadvantages of the Gradient Descent algorithm is that it
requires a lot of memory to load the entire dataset at a time to compute the derivative of the loss function.
So, in the SGD algorithm, we compute the derivative by taking one data point at a time, i.e., we update the
model's parameters more frequently. Therefore, the model parameters are updated after the computation of
the loss on each training example.
So, suppose we have a dataset that contains 1000 rows; when we apply SGD, it will update the model parameters
1000 times in one complete cycle of the dataset instead of one time as in Gradient Descent.
Algorithm: θ=θ−α⋅∇J (θ;x(i);y(i)) , where {x(i),y(i)} are the training examples
We want the training to be even faster, so we take a Gradient Descent step for each training example.
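A minimal pseudocode sketch of this, written in the same placeholder style as the other snippets in this section (epochs, data, parameters, learning_rate, loss_function and find_gradient are assumed to be defined elsewhere):

for x in range(epochs):
    np.random.shuffle(data)                      # visit the examples in a random order
    for example in data:                         # one training example at a time
        params_gradient = find_gradient(loss_function, example, parameters)
        parameters = parameters - learning_rate * params_gradient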
It is observed that the derivative of the loss function for MB-GD is almost the same as the derivative of the loss
function for GD after some number of iterations. But the number of iterations needed to achieve the minimum is
larger for MB-GD compared to GD and the cost of computation is also larger.
Therefore, the weight update depends on the derivative of the loss for a batch of points. The updates in the
case of MB-GD are noisier because the derivative is not always towards the minimum.
It updates the model parameters after every batch. So, this algorithm divides the dataset into various batches
and after every batch, it updates the parameters.
Algorithm: θ=θ−α⋅∇J (θ; B(i)), where {B(i)} are the batches of training examples
In the code snippet below, instead of iterating over examples, we now iterate over mini-batches of size 30:
for x in range(epochs):
    np.random.shuffle(data)
    # Iterate over mini-batches of 30 examples (get_batches and find_gradient are placeholders)
    for batch in get_batches(data, batch_size=30):
        params_gradient = find_gradient(loss_function, batch, parameters)
        parameters = parameters - learning_rate * params_gradient
➢ Advantages of Mini Batch Gradient Descent
• Updates the model parameters frequently and also has less variance
• Requires a medium amount of memory, i.e., neither too little nor too much
➢ Disadvantages of Mini Batch Gradient Descent
• Parameter updates in MB-GD are noisier compared to the weight updates in the GD
algorithm
• Compared to the GD algorithm, it takes a longer time to converge
• May get stuck at local minima
as the error or loss. The goal of the model is to minimize the error or loss function by adjusting its internal
parameters.
• Model Optimization Process: The model optimization process is the iterative process of adjusting the
internal parameters of the model to minimize the error or loss function. This is done using an
optimization algorithm, such as gradient descent. The optimization algorithm calculates the gradient
of the error function with respect to the model's parameters and uses this information to adjust the
parameters to reduce the error. The algorithm repeats this process until the error is minimized to a
satisfactory level.
Once the model has been trained and optimized on the training data, it can be used to make predictions
on new, unseen data. The accuracy of the model's predictions can be evaluated using various performance
metrics, such as accuracy, precision, recall, and F1-score.
5.3 Machine Learning lifecycle:
The lifecycle of a machine learning project involves a series of steps that include:
1. Study the Problems: The first step is to study the problem. This step involves understanding the
business problem and defining the objectives of the model.
2. Data Collection: When the problem is well-defined, we can collect the relevant data required for
the model. The data could come from various sources such as databases, APIs, or web scraping.
3. Data Preparation: Once the problem-related data is collected, it is a good idea to check the
data properly and put it into the desired format so that it can be used by the model to find the
hidden patterns. This can be done in the following steps:
• Data cleaning
• Data Transformation
• Exploratory Data Analysis and Feature Engineering
• Split the dataset for training and testing.
4. Model Selection: The next step is to select the appropriate machine learning algorithm that is suitable
for our problem. This step requires knowledge of the strengths and weaknesses of different
algorithms. Sometimes we use multiple models and compare their results and select the best model
as per our requirements.
5. Model building and Training: After selecting the algorithm, we have to build the model.
a. In the case of traditional machine learning, building the model is easy; it just requires a little
hyperparameter tuning.
b. In the case of deep learning, we have to define layer-wise architecture along with input and
output size, number of nodes in each layer, loss function, gradient descent optimizer, etc.
c. After that model is trained using the preprocessed dataset.
6. Model Evaluation: Once the model is trained, it can be evaluated on the test dataset to determine its
accuracy and performance using different techniques like classification report, F1 score, precision,
recall, ROC Curve, Mean Square error, absolute error, etc.
7. Model Tuning: Based on the evaluation results, the model may need to be tuned or optimized to
improve its performance. This involves tweaking the hyperparameters of the model.
8. Deployment: Once the model is trained and tuned, it can be deployed in a production environment
to make predictions on new data. This step requires integrating the model into an existing software
system or creating a new system for the model.
9. Monitoring and Maintenance: Finally, it is essential to monitor the model's performance in the
production environment and perform maintenance tasks as required. This involves monitoring for data
drift, retraining the model as needed, and updating the model as new data becomes available.
5.4 Types of Machine Learning
The types are as follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Machine Learning
The number of nodes in a layer is referred to as the width and the number of layers in a model is referred to
as its depth. Increasing the depth increases the capacity of the model. Training deep models, e.g. those with
many hidden layers, can be computationally more efficient than training a single layer network with a vast
number of nodes.
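A rough sketch of this trade-off, counting the weights and biases of fully connected networks with arbitrary example layer widths:

def parameter_count(layer_sizes):
    # Weights plus biases for a fully connected network with the given layer widths
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

wide_shallow = [100, 1000, 10]           # one very wide hidden layer
deep_narrow = [100, 64, 64, 64, 10]      # several narrower hidden layers
print(parameter_count(wide_shallow))     # 111010 parameters
print(parameter_count(deep_narrow))      # 15434 parameters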
5.6 Over-fitting and under-fitting
Over-fitting and under-fitting are two crucial concepts in machine learning and are the prevalent causes for
the poor performance of a machine learning model. In this topic we will explore over-fitting and under-fitting
in machine learning.
➢ Over-fitting
When a model performs very well for training data but has poor performance with test data (new data), it is
known as over-fitting. In this case, the machine learning model learns the details and noise in the training data
such that it negatively affects the performance of the model on test data. Over-fitting can happen due to low
bias and high variance.
➢ Under-fitting
When a model has not learned the patterns in the training data well and is unable to generalize well on the
new data, it is known as under-fitting. An under-fit model has poor performance on the training data and
will result in unreliable predictions. Under-fitting occurs due to high bias and low variance.
➢ Reasons for under-fitting
• Data used for training is not cleaned and contains noise (garbage values) in it
• The model has a high bias
• The size of the training dataset used is not enough
• The model is too simple
5.7 Hyper-parameters
Hyper-parameters are defined as the parameters that are explicitly defined by the user to control the
learning process. The value of the Hyper-parameter is selected and set by the machine learning engineer
before the learning algorithm begins training the model. These parameters are tunable and can directly
affect how well a model trains. Hence, these are external to the model, and their values cannot be changed
during the training process. Some examples of hyper-parameters in machine learning:
• Learning Rate
• Number of Epochs
• Momentum
• Regularization constant
• Number of branches in a decision tree
• Number of clusters in a clustering algorithm (like k-means)
5.7.1 Model Parameters:
Model parameters are configuration variables that are internal to the model, and a model learns them on its
own. For example: weights or coefficients of independent variables in the linear regression model,
weights or coefficients of independent variables in SVM, weights and biases of a neural network, and
cluster centroid in clustering. Some key points for model parameters are as follows:
• They are used by the model for making predictions
• They are learned by the model from the data itself
• These are usually not set manually
• These are the part of the model and key to a machine learning algorithm
5.7.2 Model Hyper-parameters:
• Hyper-parameters are those parameters that are explicitly defined by the user to control the learning
process. Some key points for model hyper-parameters are as follows:
• These are usually defined manually by the machine learning engineer.
• One cannot know the exact best value for hyper-parameters for the given problem. The best value can
be determined either by the rule of thumb or by trial and error.
• Some examples of Hyper-parameters are the learning rate for training a neural network, K in the
KNN algorithm
5.7.3 Difference between Model and Hyper parameters
The difference is as tabulated below.
MODEL PARAMETERS | HYPER-PARAMETERS
They are required for making predictions | They are required for estimating the model parameters
They are estimated by optimization algorithms (Gradient Descent, Adam, Adagrad) | They are estimated by hyperparameter tuning
They are not set manually | They are set manually
The final parameters found after training decide how the model will perform on unseen data | The choice of hyperparameters decides how efficient the training is; in gradient descent, the learning rate decides how efficient and accurate the optimization process is in estimating the parameters
There is a disadvantage of this technique; that is, it can be computationally difficult for large p.
5.8.2.3 Leave one out cross-validation
This method is similar to leave-p-out cross-validation, but instead of p, we take 1 data point out of the
training set. It means, in this approach, for each learning set, only one data point is reserved, and the remaining
dataset is used to train the model. This process repeats for each data point. Hence for n samples, we get n
different training sets and n test sets. It has the following features:
• In this approach, the bias is minimum as all the data points are used.
• The process is executed for n times; hence execution time is high.
• This approach leads to high variation in testing the effectiveness of the model as we iteratively
check against one data point.
5.8.2.4 K-Fold Cross-Validation
K-fold cross-validation approach divides the input dataset into K groups of samples of equal sizes. These
samples are called folds. For each learning set, the prediction function uses k-1 folds, and the remaining fold
is used as the test set. This approach is a very popular CV approach because it is easy to understand, and
the output is less biased than other methods.
The steps for k-fold cross-validation are:
• Split the input dataset into K groups
• For each group:
• Take one group as the reserve or test data set.
• Use remaining groups as the training dataset
• Fit the model on the training set and evaluate the performance of the model using the test set.
Let's take an example of 5-fold cross-validation. So, the dataset is grouped into 5 folds. In the 1st iteration, the
first fold is reserved for testing the model, and the rest are used to train the model. In the 2nd iteration, the second
fold is used to test the model, and the rest are used to train the model. This process continues until each fold has
been used as the test fold.
Consider the below diagram:
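A minimal scikit-learn sketch of the 5-fold procedure (the random dataset and the LogisticRegression model are placeholders):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = np.random.rand(100, 4), np.random.randint(0, 2, 100)   # placeholder dataset

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_index, test_index in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])                    # fit on k-1 folds
    scores.append(model.score(X[test_index], y[test_index]))     # evaluate on the held-out fold
print(np.mean(scores))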
➢ Point estimator
To distinguish estimates of parameters from their true value, a point estimate of a parameter θ is represented
by θ̂. Let {x(1), x(2), ..., x(m)} be m independent and identically distributed data points. Then a point
estimator is any function of the data: θ̂m = g(x(1), x(2), ..., x(m)).
This definition of a point estimator is very general and allows the designer of an estimator great flexibility.
While almost any function thus qualifies as an estimator, a good estimator is a function whose output is close
to the true underlying θ that generated the training data.
Point estimation can also refer to estimation of the relationship between input and target variables, referred to
as function estimation.
➢ Function Estimator
Here we are trying to predict a variable y given an input vector x. We assume that there is a function f(x)
that describes the approximate relationship between y and x. For example,
we may assume that y = f(x) + ε, where ε stands for the part of y that is not predictable from x. In function
estimation, we are interested in approximating f with a model or estimate fˆ. Function estimation is really
just the same as estimating a parameter θ; the function estimator fˆ is simply a point estimator in function
space. Ex: in polynomial regression we are either estimating a parameter w or estimating a function mapping
from x to y.
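A tiny sketch of a point estimator in practice: the sample mean g(x(1), ..., x(m)) as an estimate of the mean of the data-generating distribution (the true parameter value is chosen arbitrarily):

import numpy as np

# m i.i.d. data points drawn from a distribution whose true mean is 3.0 (assumed for illustration)
m = 1000
data = np.random.normal(loc=3.0, scale=1.0, size=m)

# A point estimator is any function of the data; here it is the sample mean
theta_hat = np.mean(data)
print(theta_hat)   # close to the true parameter 3.0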
5.9.1 Uses of Estimators
By quantifying guesses, estimators are how machine learning in theory is implemented in practice. Without
the ability to estimate the parameters of a dataset (such as the layers in a neural network or the bandwidth in
a kernel), there would be no way for an AI system to "learn."
A simple example of estimators and estimation in practice is the so-called "German Tank Problem" from
World War Two. The Allies had no way to know for sure how many tanks the Germans were building every
month. By counting the serial numbers of captured or destroyed tanks, Allied statisticians created an estimator
rule. This equation calculated the maximum possible number of tanks based upon the sequential serial
numbers, and applied minimum variance analysis to generate the most likely estimate of how many new tanks
Germany was building.
5.9.2 Types of Estimators
Estimators come in two broad categories, point and interval. Point equations generate single-value results,
such as a standard deviation, that can be plugged into a deep learning algorithm's classifier functions. Interval
equations generate a range of likely values, such as a confidence interval, for analysis.
In addition, each estimator rule can be tailored to generate different types of estimates:
• Biased: Either an overestimate or an underestimate.
• Efficient: Smallest variance analysis. The smallest possible variance is referred to as the "best"
estimate.
• Invariant: Less flexible estimates that aren't easily changed by data transformations.
• Shrinkage: An unprocessed estimate that's combined with other variables to create complex
estimates.
• Sufficient: Estimating the total population‘s parameter from a limited dataset.
• Unbiased: An exact-match estimate value that neither underestimates nor overestimates.
5.9.3.2 Bias
In general, a machine learning model analyses the data, finds patterns in it and makes predictions. While
training, the model learns these patterns in the dataset and applies them to test data for prediction. While
making predictions, a difference occurs between prediction values made by the model and actual
values/expected values, and this difference is known as bias errors or Errors due to bias. It can be defined
as an inability of machine learning algorithms such as Linear Regression to capture the true relationship
between the data points. Each algorithm begins with some amount of bias because bias occurs from
assumptions in the model, which makes the target function simple to learn. A model has either:
• Low Bias: A low bias model will make fewer assumptions about the form of the target function.
• High Bias: A model with a high bias makes more assumptions, and the model becomes unable
to capture the important features of our dataset. A high bias model also cannot perform well
on new data.
Generally, a linear algorithm has a high bias, as this makes it learn fast. The simpler the algorithm, the higher
the bias that is likely to be introduced, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are Decision Trees, k-Nearest Neighbours
and Support Vector Machines. At the same time, algorithms with high bias are Linear Regression, Linear
Discriminant Analysis and Logistic Regression.
Some examples of machine learning algorithms with low variance are Linear Regression, Logistic
Regression, and Linear Discriminant Analysis. At the same time, algorithms with high variance are
Decision Trees, Support Vector Machines, and K-Nearest Neighbours.
➢ Ways to Reduce High Variance:
• Reduce the input features or number of parameters if the model is overfitted.
• Do not use an overly complex model.
• Increase the training data.
• Increase the Regularization term.
One of the main challenges of Deep Learning derived from this is being able to deliver great performance
with a lot less training data. As we will see later, recent advances like transfer learning or semi-supervised
learning are already taking steps in this direction, but they are still not enough.
2. Coping with data from outside the training distribution
Data is dynamic, it changes through different drivers like time, location, and many other conditions.
However, Machine Learning models, including Deep Learning ones, are built using a defined set of data
(the training set) and perform well as long as the data that is later used to make predictions once the system
is built comes from the same distribution as the data the system was built with.
This makes them perform poorly when they are fed data that is not entirely different from the training data,
but that does have some variations from it. Another challenge of Deep Learning in the future will be to
overcome this problem, and still perform reasonably well when data that does not exactly match the training
data is fed to them.
3. Incorporating Logic
Another challenge is incorporating some sort of rule-based knowledge, so that logical procedures can be
implemented and sequential reasoning used to formalize knowledge.
While these cases can be covered in code, Machine Learning algorithms don't usually incorporate sets of rules
into their knowledge. Much like a prior data distribution used in Bayesian learning, sets of pre-defined rules
could assist Deep Learning systems in their reasoning and live side by side with the 'learning from data'
based approach.
4. The Need for less data and higher efficiency
Although we kind of covered this in our first two sections, this point is really worth highlighting.
The success of Deep Learning comes from the possibility to incorporate many layers into our models, allowing
them to try an insane number of linear and non-linear parameter combinations. However, with more layers
comes more model complexity and we need more data for this model to function correctly.
When the amount of data that we have is effectively smaller than the complexity of the neural network then
we need to resort to a different approach like the aforementioned Transfer Learning.
Also, very large Deep Learning models, aside from needing huge amounts of data to be trained on, use a lot of
computational resources and can take a very long time to train. Advances in the field should also be oriented
towards making the training process more efficient and cost-effective.
6. Deep Neural Network
Deep neural networks (DNNs) are a class of machine learning algorithms similar to artificial neural networks
that aim to mimic the information processing of the brain. Deep neural networks, or deep learning networks,
have several hidden layers with millions of artificial neurons linked together. A number, called weight,
represents the connections between one node and another. The weight is a positive number if one node excites
another, or negative if one node suppresses the other.
6.1 Feed-Forward Neural Network
In its most basic form, a Feed-Forward Neural Network is a single layer perceptron. A sequence of inputs
enter the layer and are multiplied by the weights in this model. The weighted input values are then summed
together to form a total. If the sum of the values is more than a predetermined threshold, which is normally
set at zero, the output value is usually 1, and if the sum is less than the threshold, the output value is usually -
1. The single-layer perceptron is a popular feed-forward neural network model that is frequently used for
classification. Single-layer perceptrons can also contain machine learning features.
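A minimal sketch of the thresholding rule described above, with arbitrary example weights and a threshold of zero:

import numpy as np

def perceptron_output(x, w, threshold=0.0):
    # Multiply the inputs by the weights, sum them, and threshold the total
    total = np.dot(w, x)
    return 1 if total > threshold else -1

x = np.array([0.5, -1.0, 2.0])    # example inputs
w = np.array([0.4, 0.3, 0.8])     # example weights
print(perceptron_output(x, w))    # 1, since 0.2 - 0.3 + 1.6 = 1.5 > 0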
The neural network can compare the outputs of its nodes with the desired values using a property known
as the delta rule, allowing the network to alter its weights through training to create more accurate
output values. This training and learning procedure results in gradient descent. The technique of updating
weights in multi-layered perceptrons is virtually the same; however, the process is referred to as back-
propagation. In such circumstances, the output values provided by the final layer are used to alter each hidden
layer inside the network.
6.1.1 Work Strategy
The function of each neuron in the network is similar to that of linear regression. The neuron also has
an activation function at the end, and each neuron has its weight vector.
Now, we will add a loss function and optimize the parameters to make a model that can predict accurate
values of Y. The loss function for linear regression is called RSS, or the Residual Sum of Squares.
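A short sketch of computing RSS for a set of predictions, with arbitrary example values:

import numpy as np

y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.5, 5.5, 6.0])

# Residual sum of squares: the sum of squared differences between actual and predicted values
rss = np.sum((y_actual - y_predicted) ** 2)
print(rss)   # 0.25 + 0.25 + 1.0 = 1.5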
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
o Ridge Regression
o Lasso Regression
6.2.1 Ridge Regression
Ridge regression is one of the types of linear regression in which a small amount of bias is introduced so
that we can get better long-term predictions.
Ridge regression is a regularization technique, which is used to reduce the complexity of the model. It is
also called L2 regularization.
In this technique, the cost function is altered by adding the penalty term to it. The amount of bias added to the
model is called the Ridge Regression penalty. We can calculate it by multiplying lambda by the squared
weight of each individual feature.
The equation for the cost function in ridge regression will be:
Cost = Σ(yi − ŷi)^2 + λ Σ wj^2
In the above equation, the penalty term regularizes the coefficients of the model, and hence ridge regression
reduces the amplitudes of the coefficients, which decreases the complexity of the model.
As we can see from the above equation, if the values of λ tend to zero, the equation becomes the cost
function of the linear regression model. Hence, for the minimum value of λ, the model will resemble the
linear regression model.
A general linear or polynomial regression will fail if there is high collinearity between the independent
variables, so to solve such problems, Ridge regression can be used.
It helps to solve the problems if we have more parameters than samples.
6.2.2 Lasso Regression
Lasso regression is another regularization technique to reduce the complexity of the model. It stands
for Least Absolute Shrinkage and Selection Operator.
It is similar to the Ridge Regression except that the penalty term contains only the absolute weights instead
of a square of weights.
Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge Regression can only shrink
it near to 0.
It is also called L1 regularization. The equation for the cost function of Lasso regression will be:
Cost = Σ(yi − ŷi)^2 + λ Σ |wj|
Some of the features in this technique are completely neglected for model evaluation.
Hence, Lasso regression can help us to reduce the overfitting in the model as well as perform feature
selection.
Key Difference between Ridge Regression and Lasso Regression
Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features
present in the model. It reduces the complexity of the model by shrinking the coefficients.
Lasso regression helps to reduce the overfitting in the model as well as feature selection.
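A minimal scikit-learn sketch contrasting the two (the data is a random placeholder; the alpha argument plays the role of lambda):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

X, y = np.random.rand(50, 5), np.random.rand(50)   # placeholder data

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients towards 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can shrink some coefficients exactly to 0

print(ridge.coef_)
print(lasso.coef_)   # some entries may be exactly 0, i.e. those features are dropped

With the L1 penalty, some coefficients can be driven exactly to zero, which is the feature-selection behaviour described above.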
6.3 Optimization in Machine Learning
In machine learning, optimization is the procedure of identifying the ideal set of model parameters that
minimize a loss function. For a particular set of inputs, the loss function calculates the discrepancy between
the predicted and actual outputs. For the model to successfully forecast the output for fresh inputs,
optimization seeks to minimize the loss function.
A method for finding a function's minimum or maximum is called an optimization algorithm. Until the
minimum or maximum of the loss function is reached, the optimization algorithm
iteratively modifies the model parameters. Gradient descent, stochastic gradient descent, Adam, Adagrad,
and RMSProp are a few optimization methods that can be utilised in machine learning.
• Gradient Descent
In machine learning, gradient descent is a popular optimization approach. It is a first-order
optimization algorithm that works by repeatedly changing the model's parameters in the
direction of the loss function's negative gradient. The loss function decreases most quickly in that
direction because the negative gradient points in the direction of steepest descent.
The gradient descent algorithm operates by computing the gradient of the loss function with respect
to each parameter starting with an initial set of parameters. The partial derivatives of the loss function
with respect to each parameter are contained in a vector known as the gradient. After that, the
algorithm modifies the parameters by deducting a small multiple of the gradient from their existing
values.
• Stochastic Gradient Descent
A part of the training data is randomly chosen for each iteration of the stochastic gradient descent
process, which is a variant on the gradient descent technique. This makes the algorithm's computations
simpler and speeds up its convergence. For big datasets when it is not practical to compute the gradient
of the loss function for all of the training data, stochastic gradient descent is especially helpful.
The primary distinction between stochastic gradient descent and gradient descent is that stochastic
gradient descent changes the parameters based on the gradient obtained for a single example rather
than the full dataset. Due to the stochasticity introduced by this, each iteration of the algorithm may
result in a different local minimum.
• Adam
Adam is an optimization algorithm that combines the advantages of momentum-based techniques and
stochastic gradient descent. The learning rate during training is adaptively adjusted using the first and
second moments of the gradient. Adam is frequently used in deep learning since it is known to
converge more quickly than other optimization techniques.
• Adagrad
An optimization algorithm called Adagrad adjusts the learning rate for each parameter based on
previous gradient data. It is especially beneficial for sparse datasets with sporadic occurrences of
specific attributes. Adagrad can converge more quickly than other optimization methods because it
uses separate learning rates for each parameter.
• RMSProp
An optimization method called RMSProp deals with the issue of deep neural network gradients that
vanish and explode. It employs the moving average of the squared gradient to normalize the learning
rate for each parameter. Popular deep learning optimization algorithm RMSProp is well known for
converging more quickly than some other optimization algorithms.
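As an illustration of how Adam combines these ideas, here is a compact sketch of a single Adam parameter update using the standard first- and second-moment estimates (the hyperparameter values shown are the commonly used defaults):

def adam_update(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update the biased first- and second-moment estimates of the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction (t is the 1-based step number), then the adaptive parameter step
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (v_hat ** 0.5 + eps)
    return theta, m, v

The moving averages m and v are what let the effective step size adapt during training, which is why Adam often converges faster than plain gradient descent.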
6.3.1 Importance of Optimization in Machine Learning
Machine learning depends heavily on optimization since it gives the model the ability to learn from data
and generate precise predictions. Model parameters are estimated using machine learning techniques using
the observed data. Finding the parameters' ideal values to minimize the discrepancy between the predicted
and actual results for a given set of inputs is the process of optimization. Without optimization, the model's
parameters would be chosen at random, making it impossible to correctly forecast the outcome for brand-
new inputs.
Optimization is highly valued in deep learning models, which have multiple levels of layers and millions of
parameters. Deep neural networks need a lot of data to be trained, and optimizing the parameters of the model
in which they are used requires a lot of processing power. The optimization algorithm chosen can have a big
impact on the training process's accuracy and speed.
New machine learning algorithms are also implemented solely through optimization. Researchers are
constantly looking for novel optimization techniques to boost the accuracy and speed of machine learning
systems. These techniques include normalization, optimization strategies that account for knowledge of the
underlying structure of the data, and adaptive learning rates.
6.3.2 Challenges in Optimization
There are difficulties with machine learning optimization. One of the most difficult issues is overfitting, which
happens when the model learns the training data too well and is unable to generalize to new data. When the
model is overly intricate or the training set is insufficient, overfitting might happen.
When the optimization process converges to a local minimum rather than the global optimum, it poses the
problem of local minima, which is another obstacle in optimization. Deep neural networks, which contain
many parameters and may have multiple local minima, are highly prone to local minima.