Machine Learning Techniques
MACHINE LEARNING
TECHNIQUES
Dr. Rajeev Kapoor,
Dimple
© 2022 Authors
ISBN – 978-93-95936-62-0
Published by: AGPH Books
Contact: +91-7089366889
Preface
Machine learning (ML) is a kind of AI that allows software programmes to improve their predictive ability without being explicitly programmed to do so. ML algorithms take previously collected data as input and use it to predict future output values. Recommendation engines are one popular application of machine learning; fraud detection, malware threat detection, spam filtering, business process automation (BPA), and predictive maintenance are other very common uses.
About the Book
Machine learning (ML) techniques allow computers to gain knowledge through observation and practice. Machine learning is the process by which a system learns new information without being explicitly programmed to do so. It allows a system to acquire and integrate knowledge through large-scale observation, and to grow and adapt to its environment.
Contents
CHAPTER-1: INTRODUCTION
CHAPTER-2: DECISION TREE LEARNING
CHAPTER-3: PROBABILITY AND BAYES LEARNING
CHAPTER-4: ARTIFICIAL NEURAL NETWORKS
CHAPTER-5: ENSEMBLES
CHAPTER-1: Introduction
Machine Learning allows systems to learn from data and respond following what they have learned from previous experiences.
Machine Learning is employed everywhere, from automating monotonous work to delivering useful insights, and organisations in every industry aim to benefit from it. You may already own a gadget that makes use of it: wearable fitness trackers such as Fitbit and smart speakers such as Google Home are two examples. Beyond these, there are several other applications of ML.
• The financial industry and trading — Businesses
use ML to analyse applicants' credit histories and
investigate possible fraud.
*https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08
The first computer program claimed to be able to defeat the world checkers champion was developed in the 1950s, and it helped checkers players improve their play. Around the same time, Frank Rosenblatt developed the Perceptron, a basic classifier that can be very effective when large numbers of them are networked together. For its time, it was a major technological advance. The field of neural networks then seemed to stall for several years as researchers struggled to find solutions to their most pressing concerns.
Training Data: The training data is used to construct the Machine Learning model. The model learns to recognise important trends and patterns in the training data that allow it to predict the output accurately.
*https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08
Machine Learning Process
*https://towardsdatascience.com/introduction-to-machine-learning-for-beginners-eed6024fdb08
At this point, it is also crucial to note the types of data that may be used to address the issue, as well as the strategies that should be applied.
The data must be checked for discrepancies, since they may lead to inaccurate calculations and forecasts. Now is the time to go through the whole dataset, find discrepancies, and address them as they are found.
The following paragraphs will go into detail about the many issues that may be addressed using Machine Learning.
Step 7: Predictions
*https://www.edureka.co/blog/introduction-to-machine-learning/
If you look at the diagram above, you'll see that there are primarily three kinds of challenges that may be addressed by Machine Learning:
• Classification
• Regression
• Clustering
Without even realising it, we use machine learning every day in the form of Google Maps, Alexa, Google Assistant, etc. The following are some of the most well-known current uses of machine learning in the real world:
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
2. Product Recommendations
3. Image Recognition
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
Image recognition is the process of recognising a feature or item in a digital picture. Face detection, pattern recognition, and face identification are only a few of the applications that have embraced this method for deeper examination.
4. Sentiment Analysis
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
Sentiment analysis determines a speaker's or writer's attitude or mood. Review-based websites, decision-making software, and similar applications might all benefit from this kind of sentiment analysis tool.
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
Figure 1.9 Access Control*
*https://www.hulkautomation.com/access-control
By using machine learning algorithms, researchers can better monitor and manage the populations of marine animals such as critically endangered cetaceans.
9. Banking Domain
*https://www.nature.com/articles/s41467-022-27980-y
In the banking domain, fraud detection is one of the technologies made possible by machine learning. Algorithms decide which criteria should be used to build a filter that prevents harm; sites determined to be fraudulent are blocked from processing payments immediately.
Machine learning approaches fall into three broad categories:
• Supervised Learning.
• Unsupervised Learning.
• Reinforcement Learning.
Supervised Learning
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
computer "this is how Tom appears," for example. You
may use labelled data to train the computer in this way.
An organized training phase using labelled data is at the
heart of Supervised Learning.
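A minimal sketch of this idea, assuming Python and scikit-learn: a classifier is trained on a handful of labelled examples, where the features and labels are made up purely for illustration.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled data: each row is [height_cm, ear_pointiness],
# and each label says whether the picture shows Tom (1) or not (0).
X_train = [[30, 0.9], [32, 0.8], [10, 0.2], [12, 0.1]]
y_train = [1, 1, 0, 0]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)          # supervised learning: features plus labels

print(model.predict([[31, 0.85]]))   # predicted label for a new, unseen example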
Unsupervised Learning
*https://www.simplilearn.com/tutorials/machine-learning-tutorial/machine-learning-applications
For instance, it recognizes that this is a type 1 picture of Tom by picking up his characteristic pointed ears, larger stature, and so on. It recognizes different characteristics in Jerry and concludes that those are type 2 pictures. As a result, it divides the pictures into two groups without ever learning who Tom or Jerry is.
Reinforcement Learning
In Reinforcement Learning, an agent learns by interacting with an environment, receiving rewards or penalties for its actions and adjusting its behaviour to maximise the reward.
1.3. Evaluation
When we evaluate a model, we try to put a number on how accurately it makes predictions. We do this by testing the freshly trained model on a separate data set that was not used for training, checking its predictions against the labelled data.
• Do I have an under-fitting or over-fitting model?
• Accuracy = # correct predictions / # total data
points.
A confusion matrix lists the predicted label on one axis as well as the actual label on the other. For this purpose, let N be the total number of categories. To simplify things, N = 2 in the problem of binary classification.
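A minimal sketch of this evaluation step, assuming scikit-learn and two made-up label vectors; for N = 2 the confusion matrix is a 2 × 2 table with predicted labels on one axis and actual labels on the other.

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]   # actual labels (hypothetical)
y_pred = [1, 0, 0, 1, 0, 1]   # model predictions (hypothetical)

print(accuracy_score(y_true, y_pred))    # correct predictions / total data points = 5/6
print(confusion_matrix(y_true, y_pred))  # 2 x 2 matrix for the binary case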
1.4.2. Test data set
Test data sets are data sets that are not part of a training
data set, but that follow the same probability distribution.
Overfitting is not severe if a model that performs well on
the training set also performs well on the test set.
Overfitting occurs when the model fits the training data
better than the test data.
*https://en.wikipedia.org/wiki/Training,_validation,_and_test_data_sets
(training & test datasets). Keep in mind that several
publications caution against using this approach. When
utilising a technique like cross-validation, however, where
results are averaged after several rounds of the model
training & testing to help decrease bias and variability, two
divisions may be enough and successful.
1.5. Cross-validation
If you want a rough idea of how well your machine learning model is doing, you can use the statistical technique of cross-validation. It is employed to prevent a predictive model from being too sensitive to its inputs, which may happen when data are scarce. For cross-validation, the data is split into a number of divisions (called "folds"), the analysis is performed on each fold, and the error estimates are then averaged.
Cross-validation estimates how the model will perform in the real world. Even though the training dataset is also real-world data, it only represents a subset of all the available data points (instances) out there, and our primary goal is for the model to perform well on unseen real-world data.
Non-exhaustive Methods
Holdout method
"testing" it on another. As a rule of thumb, 70:30 or 80:20 is
a common ratio for how training and testing datasets are
divided.
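A sketch of the holdout method, assuming scikit-learn; the Iris dataset stands in for whatever data you actually have, and the 80:20 split mirrors the rule of thumb above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)            # stand-in dataset

# Hold out 20% of the data for testing (an 80:20 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))           # accuracy on the held-out 20%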
For every fold in your dataset, build the model on the remaining k − 1 folds of data. Then assess the model's performance on the held-out k-th fold.
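A sketch of k-fold cross-validation under the same assumptions (Python, scikit-learn, the Iris dataset as a stand-in): with cv=5, each round trains on four folds and scores on the remaining one.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                 # stand-in dataset

# 5-fold cross-validation: train on k - 1 = 4 folds, test on the held-out fold.
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=5)
print(scores)          # per-fold accuracy
print(scores.mean())   # averaged estimate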
Using a technique called "stratification", the data is rearranged to guarantee that each "fold" accurately represents the whole dataset. In a binary classification problem, for example, the data should be folded so that in each subset the proportion of instances belonging to each class is approximately equal to its proportion in the full dataset.
Leave-P-Out cross-validation
Leave-one-out cross-validation
1.6. Linear Regression: Introduction
Linear regression is a statistical method used to describe the association between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables). The case with just a single explanatory variable is called "Simple Linear Regression".
y = b₀ + b₁·x₁,
where b₀ is the intercept (constant term) and b₁ is the coefficient of x₁.
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
After running the linear regression, we get the equation of the line (the orange line in the figure) that best fits the points. Using this line, a rough approximation of the result (the value on the vertical axis) can be calculated for any value on the horizontal axis.
y = b₀ + b₁x₁
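A small sketch of fitting this line, assuming scikit-learn; the five data points are invented so that y is roughly 2x.

import numpy as np
from sklearn.linear_model import LinearRegression

x1 = np.array([[1], [2], [3], [4], [5]])       # explanatory variable
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # response, roughly y = 2x

reg = LinearRegression().fit(x1, y)
b0, b1 = reg.intercept_, reg.coef_[0]          # the fitted line y = b0 + b1*x1
print(b0, b1)
print(reg.predict([[6]]))                      # rough y read off the line at x1 = 6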
Multiple Linear Regression
y = b₀ + b₁x₁ + ⋯ + bₙxₙ
Linearity
Figure 1.15 Multiple Linear Regression*
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
2. No endogeneity
Since our intuition tells us that a larger price tag means more lavish digs, the finding is rather paradoxical: the predicted value declines as x (the size of the apartment) grows. This indicates that the independent variables and the errors have non-zero covariance. You may probe your thoughts by questioning:
3. Normality and Homoscedasticity
To begin, it's crucial to plot the data and understand how
it truly appears:
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
Figure 1.17 There seems to be a spreading of the data as one
moves from left to right along the x-axis.*
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
➢ Fixing heteroscedasticity
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
Given this equation, we have a novel model that we call a semi-log model:
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
The interpretation is: when x increases by 1%, y also increases by b₁%.
4. No autocorrelation
Now the question is, how can autocorrelation be
identified? Plotting all the residuals on the graph and
inspecting them for patterns is a typical method. Not
finding any means you are probably okay.
*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49
A Durbin–Watson score of 2 implies no autocorrelation; the statistic ranges from 0 to 4, and values below 1 or above 3 are a red flag.
1. Autoregressive model.
5. No Multicollinearity.
Take y = 3 + 2x:
The connection between y and x is a perfectly linear one: you can stand in for y with x and vice versa. A model in which both y and x appear as predictors therefore has complete multicollinearity (ρ = 1). This is a significant challenge for our model, since the calculated coefficients would be unreliable. If y can be represented by x, there is no need to use both of them.
• y = β₀ + β₁x + ε is the formula employed for simple linear regression.
Linearity
Independence of Errors
Normal Distribution
Variance Equality
y = a₀ + a₁x + ε
where a₀ is the intercept, a₁ is the slope coefficient, and ε is the random error term.
Assuming that there are k independent variables x₁, x₂, …, x_k that can be controlled, this technique uses these variables to predict the value of a certain result Y.
For the i-th observation, the independent variables take the values xi1, xi2, …, xik, and the value of the random variable Yi is recorded.
Assumptions of Multiple Linear Regression
When the data cannot be fitted well by a straight line, we can use polynomial regression. Here is the formula for polynomial regression:
y = b₀ + b₁x + b₂x² + … + bₙxⁿ
It is used when the data points are organised in a non-linear form. The accompanying comparison between a linear dataset and a non-linear dataset will help us grasp the concept.
*https://www.javatpoint.com/machine-learning-polynomial-regression
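A sketch of polynomial regression under the usual scikit-learn approach (the data here is a made-up quadratic): the inputs are expanded into polynomial terms and an ordinary linear model is fitted on those terms.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * x.ravel() ** 2 + x.ravel() + 2          # non-linear (quadratic) data

# Expand x into the columns [1, x, x^2], then fit a linear model on them.
X_poly = PolynomialFeatures(degree=2).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)              # recovers the quadratic's coefficients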
CHAPTER-2: Decision Tree Learning
2.1. Introduction
In machine learning, classification entails two phases: a learning phase and a prediction phase. During the learning phase, the model is fitted using the training data it receives. In the prediction phase, the trained model is used to forecast the outcome for new data. When it comes to classifying data, the Decision Tree is often regarded as one of the most intuitive and widely used approaches.
Decision trees, which are supervised learning algorithms, may be used for both regression and classification problems.
For example, if income has been established as a significant factor, a decision tree may be developed to estimate a customer's annual income based on their employment, the kind of product they purchase, and other factors. Here, we are making forecasts about the values of continuous variables.
• Parent and Child Node: Sub-nodes are the children
of a parent node, and the node that divides into
them is termed the parent node.
*https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
Every node in the tree represents a test case, and the edges radiating out from a node represent the possible values of the attribute being tested. This procedure is repeated recursively for each subtree rooted at a new node.
disjoint branches terminating in the same class form
the disjunction (sum).
C4.5 (successor of ID3).
Attribute Selection Measures
• Entropy.
• Reduction in Variance.
• Gini index.
• Information gain.
• Gain Ratio.
• Chi-Square.
Entropy
Figure 2.2 A graph showing Entropy*
*https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
Entropy(S) = − Σᵢ pᵢ log₂(pᵢ),
where S represents the current state and pᵢ represents the probability of class i in state S (the fraction of examples in S belonging to class i).
Information Gain
Information gain measures the reduction in entropy. Given an attribute and its values, it is the difference between the entropy of the original dataset and the weighted entropy of the subsets produced by the split. Information gain is used by the ID3 (Iterative Dichotomiser 3) decision tree method.
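A small sketch of these two quantities in plain Python/NumPy (the labels and the split are invented for illustration): entropy of a set of labels, and the information gain of a candidate split.

import numpy as np

def entropy(labels):
    # H(S) = -sum(p_i * log2(p_i)) over the classes present in S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    # Entropy of the parent minus the weighted entropy of the split subsets.
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

labels = ["yes", "yes", "yes", "no", "no", "no"]
split = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfect split
print(entropy(labels))                    # 1.0 for a 50/50 set
print(information_gain(labels, split))    # 1.0: the split removes all uncertainty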
Information Gain
Gini Index
Gain ratio
Information gain is biased from the information-acquisition perspective: it gives more weight to the attribute that has a wider variety of possible values. The gain ratio corrects for this bias.
Reduction in Variance
Steps to calculate Variance:
Chi-Square
We may express Chi-squared mathematically as χ² = Σ (O − E)² / E, where O is the observed frequency and E is the expected frequency.
2. Random Forest.
Pruning Decision Trees
*https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
In the following figure, the overfitting problem is addressed by pruning the 'Age' attribute from the left side of the tree, since it carries more importance on the right side of the tree.
Random Forest
*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
Each leaf node provides the classification. Using the above diagram as a guide, we may classify an instance by first starting at its root node, checking the attribute indicated by this node, and then proceeding along the tree branch corresponding to the value of that attribute. The same steps are then repeated for the subtree rooted at the new node.
Let's say we're interested in testing our machine learning model's ability to generalise to new information. Overfitting and underfitting are two key causes of poor results from machine learning algorithms in this area.
Underfitting
Overfitting
These problems can be addressed by adjusting the parameters of a supervised learning algorithm, such as the maximum depth of a decision tree, or by using a linear method if the data is linear.
*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
The effect of overfitting in a typical application of decision tree learning is shown in the diagram below. In this instance, the ID3 algorithm is used to determine whether individuals in a healthcare setting have diabetes.
Suppose the learned tree h is more elaborate than the tree shown as h'. Obviously, h will be a great match for the set of training instances, while h' will not fit them as closely.
*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
Random noise or coincidental regularities in the training data may allow an attribute to divide the cases extremely effectively. There is a danger of overfitting if such accidental regularities are present.
Avoiding Overfitting —
Use all the data you can get your hands on for training, but apply a statistical test to estimate whether expanding (or pruning) a specific node is likely to produce an improvement beyond the training set.
1. Reduced Error Pruning
A drawback of this approach is that, when data is scarce, it further limits the number of training instances by reserving some of them for a validation set.
Figure 2.7 Accuracy versus tree size, showing the tree's accuracy assessed over both training and test cases.*
2. Rule Post-Pruning
*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
• Allow overfitting to occur by inferring a decision
tree from a training set and expanding the tree
until training data is fit optimally.
(Outlook = Sunny) and (Humidity = High).
2. Incorporating Continuous-Valued Attributes
Candidate thresholds can be placed at the midpoints between the corresponding values of a continuous attribute A, by sorting the instances according to A and then selecting neighbouring examples that differ in their target classification. The value of the threshold c that is optimal in terms of information gain can be shown to always fall at such a boundary. The information gain associated with every candidate threshold may then be calculated and used to make a decision.
An attribute with very many values (such as Date) will divide the training instances into a large number of small groups. Such an attribute may fit the training samples perfectly and yet generalise poorly beyond them.
Alternate measure-1
Attributes with many uniformly distributed values (e.g., Date) are penalised by the SplitInformation term.
Alternate measure-2
Compared with the GainRatio measure, this distance-based metric has the advantage of producing much smaller trees when dealing with data sets whose attributes vary widely in the number of values they may take.
Method-1
Method-2
the cost, and the physical or emotional toll that measuring these attributes takes on the patient, is very variable. For these kinds of tasks, we prefer decision trees that make efficient use of cheap attributes, turning to pricey ones only when they are necessary for accurate classification.
Method-1
Method-2
A common sender, frequent usage of the same terms, or any other such factor might serve as a measure of similarity between two emails.
Advantages:
Disadvantages:
Among those K points, the algorithm chooses the class that best represents the data under scrutiny: the KNN algorithm determines which class is most common among the 'K' nearest training points and favours that class. In regression, the predicted value is the average of the 'K' chosen points in the training set.
Let's say we have a photo of an animal that may be either a cat or a dog, and we need to know which it is. KNN may be used for this identification, since it is based on a measure of similarity. Our KNN model would compare the new image with previously seen cat and dog photographs and assign it to the category it most resembles.
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
find a solution. K-NN is useful for quickly and accurately determining a dataset's class. Think about the diagram below:
Figure 2.9 A graph showing results before K-NN and after K-NN*
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
Step 5: Assign the new data point to the class that has the largest number of neighbours among the K nearest points.
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
We previously examined the concept of the Euclidean distance, which is the straight-line separation of any two points. The formula is:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
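A sketch of KNN classification with scikit-learn, using the Iris dataset as a stand-in; Euclidean distance is the default metric, and K = 5 here is an arbitrary choice.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = 5 neighbours, ranked by Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy of the majority vote among the 5 nearest points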
Figure 2.12 Categorizing the data*
Choosing a K value
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
Figure 2.13 A new example to classify data is shown on a graph*
*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
• Just choose a number for K at random and begin
calculating.
CHAPTER-3: Probability and
Bayes Learning
Bayes' theorem is a fundamental result of probability theory and is widely employed in machine learning. When making class predictions, Bayes' theorem is often used: machine learning applications such as classification tasks use the Bayesian approach derived from it to compute conditional probabilities. To cut down on computation time and cost, a simplified form of Bayes' theorem (Naïve Bayes classification) is also used.
It is applied in many fields, such as health and medicine, research and surveying, aviation, etc.
Bayes Theorem
Bayes' Rule, often known as Bayes' Theorem, is the equation
P(A | B) = P(B | A) · P(A) / P(B).
1. Experiment
2. Sample Space
S1 = {1, 2, 3, 4, 5, 6}.
S2 = {Head, Tail}.
3. Event
A = event that an even number is obtained = {2, 4, 6}, so P(A) = 3/6 = 1/2.
Similarly, for an event B containing two outcomes (for example, a number greater than 4, B = {5, 6}),
P(B) = 2/6 = 1/3 ≈ 0.333, and A ∩ B = {6}.
Disjoint Event: Disjoint events, also referred to
as mutually exclusive events, are those whose intersection
is "empty" or "null".
4. Random Variable
A random variable is really a kind of function, which may be discrete, continuous, or a hybrid of the two; despite its name, it is neither random nor a variable.
5. Exhaustive Event
6. Independent Event
When one event does not have any bearing on the other,
we say that the two occurrences are independent of each
other. To put it another way, the chances of either
happening do not rely on the other.
7. Conditional Probability
8. Marginal Probability
If we make the simplifying (naïve) assumption of conditional independence, P(xᵢ | y, x₁, …, xᵢ₋₁, xᵢ₊₁, …, xₙ) = P(xᵢ | y), the posterior simplifies to
P(y | x₁, …, xₙ) ∝ P(y) Πᵢ P(xᵢ | y).
Since P(x₁, …, xₙ) is constant given the input, we can use the classification rule ŷ = arg maxᵧ P(y) Πᵢ P(xᵢ | y).
Naïve Bayes classifiers often work well in practice despite these simplifying assumptions. To estimate the required parameters, only a minimal quantity of training data is needed.
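A minimal naive Bayes sketch, assuming scikit-learn and the Iris dataset as a stand-in; the Gaussian variant is used here simply because the features are continuous.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)                  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gaussian naive Bayes: continuous features, conditional independence assumed.
nb = GaussianNB().fit(X_train, y_train)
print(nb.predict(X_test[:5]))                      # argmax_y P(y) * prod_i P(x_i | y)
print(nb.score(X_test, y_test))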
The multinomial distribution for each class y is parameterized by a vector θ_y = (θ_y1, …, θ_yn), where n is the number of features (or the vocabulary size in the case of text classification) and θ_yi is the probability P(xᵢ | y) of feature i occurring in a sample from class y.
The smoothed maximum-likelihood estimate is θ̂_yi = (N_yi + α) / (N_y + α·n), where N_yi is the number of times feature i appears in samples of class y in the training set and N_y = Σᵢ N_yi is the total count of all features for class y.
The complement naive Bayes (CNB) method is an adaptation of the standard multinomial naive Bayes (MNB) method. To calculate the model's weights, CNB uses statistics from the complement of each class. The developers of CNB provide empirical evidence that their method yields more consistent parameter estimates than MNB. In addition, CNB regularly beats MNB on text categorization tasks, often by a large margin. Here is the formula for determining the relative importance of each weight:
Bernoulli Naive Bayes
Each feature has its own set of categories. Each feature, denoted by index i, is assumed to follow its own categorical distribution.
3.3. Logistic Regression
In statistical analysis, logistic regression is used to make
predictions about a binary result from a series of
observations, like yes or no.
Logistic regression models can better anticipate classes within data sets as more relevant data is added.
Because the response variable has two values, pass and fail, the same kind of model may be used to predict whether or not a student will pass.
Common applications include determining whether an e-mail is spam and whether a tumour is cancerous. Binary logistic regression is the most widely used form of logistic regression and the most popular choice for binary classification.
The computation itself is quite involved, but much of the drudgery may be eliminated with the use of contemporary statistical software. This greatly reduces the complexity of controlling for confounding variables and assessing the influence of various variables.
Logistic regression assumes that the explanatory variables are separate (independent of one another). This means that although a model may utilise the zip code and a person's gender, it should not use both the zip code and the person's state of residence, since the two overlap.
Logistic regression also presumes that all variables may be
represented by a pair of discrete labels, like "male" or
"female" or "click" or "no-click." Categories having more
than two classes need a specific method to properly
represent them. It's possible, for instance, to split up a
single category containing three age bands into three
distinct variables, each of which would indicate whether
or not a given person falls inside that age band.
• Using a borrower's yearly income, default history, and outstanding obligations, banks may calculate the likelihood that a customer will default on a loan.
Numerous implementations of logistic regression, and ways of integrating its results into other techniques, may be found in R- and Python-based data science languages and frameworks. Logistic regression analysis may also be done in Excel and several other tools.
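As a sketch of such an implementation in Python (the hours-studied data below is invented to echo the pass/fail example above), scikit-learn's LogisticRegression can be used as follows.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) or fail (0).
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)
print(clf.predict([[4.5]]))          # predicted class for 4.5 hours of study
print(clf.predict_proba([[4.5]]))    # estimated probabilities of fail / pass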
When creating the hyperplane, SVM selects the extreme points and vectors. These extreme cases are called support vectors, which is why the technique is named a Support Vector Machine. Take a look at the picture below, in which a decision boundary (or hyperplane) is used to classify items into two groups:
*https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm
Our model will first be trained on thousands of photographs of cats and dogs so that it learns to recognise those species, and then we'll put it to the test with this outlandish animal. Because the support vector machine draws a line of demarcation between the two sets of data (in this instance, cats and dogs) and picks out the extreme examples (support vectors), it will focus on these extreme cases. On the basis of the support vectors, it will be labelled a cat. Think about the diagram below:
o Types of SVM
o Non-linear SVM: If the dataset cannot be
categorised using the straight line, we call it non-
linear data, and the classifier we use to categorise it
is called a Non-linear Support Vector Machine
(SVM).
Hyperplane:
Support Vectors:
The name comes from the fact that these vectors support, and so determine, the position of the hyperplane.
Linear SVM
Since this is a 2-dimensional space, a simple straight line
will do to divide these categories. However, these
categories may also be split along more than one line. Take
a look at the picture below:
The margin is the distance between the hyperplane and the nearest data vectors. Ultimately, SVM is used to maximise this margin; a good definition of the ideal hyperplane is the one with the largest margin.
Non-Linear SVM:
Figure 3.4 Non-Linear SVM*
z = x² + y²
*https://www.edureka.co/blog/support-vector-machine-in-python/
Figure 3.5 Results of the third dimension of the sample space*
*https://www.edureka.co/blog/support-vector-machine-in-python/
It seems to be a flat plane aligned with the x-axis, yet we are really in 3-D space. Thus, in the case of non-linear data, when we transform it back into 2-D space with z = 1, we get a circle of radius 1.
*https://www.edureka.co/blog/support-vector-machine-in-python/
Therefore, in the case of the non-linear data, we get a circle
with a radius of 1.
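A sketch of the same idea in scikit-learn, with a synthetic two-circles dataset standing in for the non-linear data: either add the extra feature z = x² + y² by hand and use a linear SVM, or let an RBF kernel do the lifting implicitly.

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Points on two concentric circles cannot be separated by a straight line.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Option 1: add the feature z = x^2 + y^2 and use a linear SVM on (x, y, z).
z = (X ** 2).sum(axis=1).reshape(-1, 1)
linear_svm = SVC(kernel="linear").fit(np.hstack([X, z]), y)

# Option 2: use an RBF kernel, which handles the non-linearity implicitly.
rbf_svm = SVC(kernel="rbf").fit(X, y)

print(linear_svm.score(np.hstack([X, z]), y), rbf_svm.score(X, y))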
Random Variables
The experimental outcomes, represented by X, are mapped to real values by the random variable ξ. As an example, let X represent all the possible patients seen by a doctor, and let ξ be the mapping from X to the patients' actual body mass index and height.
Distributions
When it is clear what we are talking about, we will use this somewhat casual notation.
*https://alex.smola.org/drafts/thebook.pdf
In practice, a PDF is often used together with its indefinite integral: F(x) = ∫₋∞ˣ p(t) dt, called the "cumulative distribution function" (CDF), is a statistical tool used very often.
figure out that this represents a mean of (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.
Note that the standard deviation, which is the square root of the variance, is often used when talking about the characteristics of random variables.
The result of a subsequent voltage measurement won't be affected by previous measurements. We'll refer to such random variables as iid (short for "independent and identically distributed"), since they behave in the same way in every possible situation. For an illustration of a dependent and an independent pair of random variables, see the figure.
*https://alex.smola.org/drafts/thebook.pdf
the lights on his way would be green because the other lights would be red. Similarly, when shown an image of a digit (x), we expect that x is somehow linked to the label of that digit (y).
CHAPTER-4: Artificial Neural
Networks
4.1. Introduction
To model complex patterns and tackle prediction problems, experts often turn to Artificial Neural Networks (ANNs), algorithms inspired by the way the brain operates. Inspired by how the human brain learns through its biological neural networks, the "Artificial Neural Network" (ANN) underpins deep learning. ANNs came about as a consequence of efforts to simulate brain activity. Biological neural networks and ANNs share many functional similarities but also have some key differences: as input, an ANN can only deal with numerical, structured information.
Because of their complex structure, MLPs (Multi-Layer Perceptrons) have earned a nickname: "many, many layers".
*https://www.analyticsvidhya.com/blog/2021/09/introduction-to-artificial-neural-networks/
4.1.2. Basic Structure of ANNs
The concept of ANNs is predicated on the premise that the functioning of the human brain may be mimicked using silicon and wires in place of biological neurons and dendrites.
*https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm
Multi-node ANNs mimic the actual neurons seen in the
human brain. The neurons have linkages between them
and communicate with one another. The nodes may
process basic information with the help of inputs. It is then
the job of other neurons to receive and process the results
of these calculations. Activation is another name for the
value produced by a node.
*https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm
4.1.3. Types of Artificial Neural Networks
ANN may have either a FeedForward or a Feedback
topology.
FeedForward ANN
*https://www.javatpoint.com/artificial-neural-network
FeedBack ANN
*https://www.javatpoint.com/artificial-neural-network
An ANN can be viewed as a directed graph in which each connection between a neuron's output and another neuron's input is a directed edge with a weight. The input signal for the Artificial Neural Network is often a vector representing a pattern or an image from an outside source. Each of the n inputs is then assigned a value mathematically, using the notation x(n).
*https://www.javatpoint.com/artificial-neural-network
The weights typically stand for the strength of the connections between individual neurons in the ANN. Internally, the network computes a weighted sum of all the inputs.
Binary:
Sigmoidal Hyperbolic
members of the set into groups according to an
undiscovered pattern.
4.1.7. Disadvantages of Artificial Neural Networks
1. Hardware Dependence:
5. The network’s lifetime is unknown
1. Social Media
Recommendation systems are prevalent in many areas of modern marketing, including but not limited to book sites, hospitality sites, movie services, etc. Artificial neural networks are used to learn a customer's preferences and behaviour based on their past purchases and other data.
3. Healthcare
4. Personal Assistants
4.2. Biological motivation
The discovery that the biological learning systems (like the
human brain) consist of very complex networks of linked
neurons has served as inspiration for the research of
artificial neural networks.
*http://data-machine.net/nmtutorial/biologicalmotivation.htm
A biological neuron receives input from other neurons in the form of impulses. When the total voltage within the cell goes over a certain point, it "fires", producing a spike that travels down the cell's axon. This sets off a cascade of activity in the linked neurons.
• It's OK to have lengthy training periods
4.4. Perceptron
In the field of Machine Learning, the perceptron model refers to a specific supervised learning approach for binary classifiers. The perceptron model, which is analogous to a single neuron, takes an input and can determine which of two categories it belongs to.
• Input values.
• Net sum.
• Activation function.
One key advantage is the capacity to use layers: this is a multi-layer classification technique, which means that machines may analyse inputs in parallel using several different layers.
Here is a quick rundown of the perceptron method using the Heaviside (step) activation function:
f(x) = 1 if w·x + b > 0, and 0 otherwise.
The machine's inputs are weighted using the weights previously learnt by the perceptron algorithm (the dimension or strength of the connection between the data units). After applying these weights to the input data, the weighted sum (total value) may be calculated.
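A sketch of this update rule in plain Python/NumPy (the learning rate, epoch count and the AND-gate data are arbitrary illustrative choices):

import numpy as np

def heaviside(v):
    return 1 if v > 0 else 0                 # step activation described above

def train_perceptron(X, y, lr=0.1, epochs=20):
    w = np.zeros(X.shape[1])                 # one weight per input
    b = 0.0                                  # bias
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = heaviside(np.dot(w, xi) + b)
            # Nudge the weights in proportion to the prediction error.
            w += lr * (target - pred) * xi
            b += lr * (target - pred)
    return w, b

# Hypothetical linearly separable data: the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([heaviside(np.dot(w, x) + b) for x in X])   # -> [0, 0, 0, 1]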
4.4.2. Components of a Perceptron
*https://www.analytixlabs.co.in/blog/what-is-perceptron/
• Activation Function: It makes the perceptron model
non-linear.
However, there were technological limitations to the
perceptron method. Considering its single layer, a
perceptron model could only be used for classes that could
be separated linearly. The problem was fixed, however,
when multi-layered perceptron algorithms were
developed. The many perceptron models are described in
depth below:
*https://www.analytixlabs.co.in/blog/what-is-perceptron/
The output of a single-layer perceptron is 1 if and only if the weighted sum of the inputs is greater than a threshold (a specified value).
Figure 4.10 Multilayer Perceptron Model*
*https://www.analytixlabs.co.in/blog/what-is-perceptron/
Advantages of Multi-Layer Perceptron
• A hard limit transfer function ensures that the
perceptron can only produce a binary output
(either 0 or 1).
A multi-layer perceptron allows the model to do input
classification with the assistance of several layers, making
it well-suited to more complicated inputs than a "single-
layer perceptron".
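A sketch of this point with scikit-learn's MLPClassifier: XOR is the classic case a single-layer perceptron cannot separate, while a small multi-layer network usually can. The hidden-layer size and solver below are arbitrary illustrative choices.

import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR is not linearly separable, so a single-layer perceptron fails on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

mlp = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=1000, random_state=1)
mlp.fit(X, y)
print(mlp.predict(X))   # typically recovers [0, 1, 1, 0]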
Future perceptron technology would continue to assist and
encourage analytical behaviour in machines, which will
increase the efficiency of computers as Artificial
Intelligence advances.
broken down as follows: 3 inputs (excluding the bias unit),
1 output, and 4 unknowns (1 bias unit is not involved)
*https://www.geeksforgeeks.org/multi-layered-neural-networks-in-r-programming/
Careful tuning is necessary for determining the best settings for the hyperparameters. Training with varying weights is performed using the Back-Propagation method.
Backpropagation
Backpropagation algorithm
Working of Backpropagation
Backpropagation Algorithm
the number of inputs. Since backpropagation doesn't need
any previous information about the network, it's a very
adaptable technique.
Types of Backpropagation
Advantages
Disadvantages
CHAPTER-5: Ensembles
5.1. Introduction
• Statistical Problem –
probability that the selected hypothesis is true with respect
to the unseen data is low.
• Computational Problem –
*https://www.geeksforgeeks.org/ensemble-classifier-data-mining/
The Main Challenge in Developing Ensemble Models
In contrast, it is well-known that averaging tends to
smooth out (reduce) fluctuations. Ensemble systems aim to
decrease variance by combining the results of the several
classifiers that have been trained to have the same or
similar bias and then using some method, such as
averaging, to combine the results.
*https://doc.lagout.org/science/Artificial%20Intelligence/Machine%20learning/Ensemble%20Machine%20Learning_%20Methods%20and%20Applications%20%5BZhang%20%26%20Ma%202012-02-17%5D.pdf
unrelated, averaging will remove the noise component
while leaving the common information content of the
signal unchanged. Assuming that classifiers produce
various mistakes on each sample but usually agree on their
right classifications, averaging a classifier output
minimises the error by averaging out error components,
and this is precisely how an ensemble of the classifiers
improves classifier accuracy.
result. So, diversity in the ensemble members' decisions is required, especially on the cases where they err. It is generally agreed that ensemble systems benefit greatly from increased diversity: ideally, the classifier outputs are independent or, better still, negatively correlated.
Bagging (and related techniques such as arc-x4 and random forests), boosting (and its numerous variants), stacked generalisation, and the hierarchical mixture of experts (MoE) remain the most often used methodologies, despite the proliferation of alternative algorithms.
5.2. Bagging and boosting
Both Bagging and Boosting are predicated on the idea of breaking a larger job into smaller, manageable chunks, working on each chunk individually, and then merging the results to form the whole.
Bagging
Boosting
The first model is trained on the original data; a second model is then trained with heavier weights on the observations the first model got wrong, to correct its mistakes. The process is repeated until either a reliable forecast is produced or a set maximum number of models has been tried.
Working of Bagging
The best way to reach a conclusion is to use a strategy that takes into account the predictions of all the models and averages them.
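A sketch of bagging with scikit-learn (the Iris dataset and 25 trees are stand-in choices): each tree is trained on its own bootstrap sample, and the ensemble combines their predictions.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # stand-in dataset

# 25 trees, each fitted on a bootstrap sample; predictions are combined
# by majority vote (or by averaging, for regression).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25, random_state=0)
print(cross_val_score(bag, X, y, cv=5).mean())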
Advantages of Bagging
Disadvantages of Bagging
Working of Boosting
A sample of the data is used to train the first model M1 before being fully trained using the whole data. M2 and M3 are then trained on the same sample data, using M1's prediction errors to guide them.
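A boosting sketch under the same assumptions (scikit-learn, Iris as a stand-in), here with AdaBoost: each new weak learner concentrates on the samples the previous ones got wrong.

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)                  # stand-in dataset

# Sequentially trained weak learners, re-weighted towards earlier mistakes;
# the final prediction is a weighted combination of all of them.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boost, X, y, cv=5).mean())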
Advantages of Boosting
Disadvantages of Boosting
• Variance reduction: Overfitting & increased
variance may be corrected using both methods.
Boosting vs. Bagging
Bagging is used when there is high variation inside the classifier and it is unstable.
The random forest's prediction is made not by a single decision tree, but by aggregating the forecasts of many individual trees.
*https://www.javatpoint.com/machine-learning-random-forest-algorithm
5.3.1. Assumptions for Random Forest
5.3.2. How does the Random Forest algorithm work?
The first step in using a Random Forest is to generate the forest by combining N decision trees; the second step is to use the generated trees to make predictions. Here's how it all works; just follow the steps and refer to the diagram:
The forest makes a prediction for every new data point based on the majority of the trees' outcomes. Take a look at the picture below:
*https://www.javatpoint.com/machine-learning-random-forest-algorithm
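A sketch of the two steps with scikit-learn (the Iris dataset and N = 100 trees are stand-in choices): the forest is built from many randomised trees and predicts by majority vote.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)                  # stand-in dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# N = 100 decision trees, each grown on a bootstrap sample with random
# feature subsets; the forest predicts by majority vote over the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))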
3. Land Use: Using this method, we can pinpoint
locations with a comparable land use pattern.
• Parallelization: Since each tree in a random forest is
generated independently using various data and
attributes, we can make full use of the central
processing unit to construct these forests.
Figure 5.5 An example of Clustering*
*https://www.geeksforgeeks.org/clustering-in-machine-learning/
The clustering of such data points is based on the simple idea that each point falls within a certain distance of its cluster's centre. Outliers are identified using several different distance measures and approaches.
Clustering Methods:
"Ordering Points to Identify Clustering Structure"
(OPTICS), etc.
such grids are quick to build, and their construction is independent of the number of data items.
Clustering Algorithms
*https://www.geeksforgeeks.org/clustering-in-machine-learning/
5.4.2. Types of Clustering Methods/ Algorithms
Connectivity Models
These models start by treating all the data as a single cluster and then split it into subgroups based on the distance between them. The model is easy to interpret, but it cannot handle large datasets.
Distribution Models
Density Models
Centroid Models
5.4.3. Applications of Clustering
• Clustering is an excellent method for dealing with a
wide range of machine learning issues, and it finds
use in a wide range of sectors.
• Useful in analysing earthquake-affected regions to
identify the high-risk zones (applicable for
the other natural hazards too).
Step 3: Distribute the data points across the K clusters by
assigning them to the centroid that is geographically
closest to them.
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
To illustrate, let's say that we have two variables, M1 and M2. Here is a scatter diagram of these two variables on the x–y axes:
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.10 Locating the median point midway between the two
centres*
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
If we want to find the nearest cluster for each point, we must repeat the procedure by choosing new centroids. The new centroids are computed as the centre of gravity of the points in each cluster, as follows:
One yellow dot can be seen to the left of a line, and two
blue ones can be seen to the right of a line in the picture
above. Therefore, new centroids will be calculated using
these three positions.
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.13 Using the median line method*
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.15 The structure of the new centroid*
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.17 Come to the conclusion that a model is complete *
*https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
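The whole procedure can be sketched in a few lines with scikit-learn's KMeans; the six two-dimensional points below are invented stand-ins for the M1/M2 example.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points standing in for the two variables M1 and M2.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.5], [8.3, 8.0], [7.8, 8.2]])

# K = 2: choose centroids, assign each point to its nearest centroid,
# recompute each centroid as the mean of its points, repeat until stable.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster index for every point
print(km.cluster_centers_)   # final centroid positions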