

About The Author

Dr. Rajeev Kapoor is an Assistant Professor at Punjabi University Neighbourhood Campus,
Jaito, of Punjabi University, Patiala. He completed his Ph.D. in Computer Science and
Engineering from Chitkara University, Punjab, his Master of Engineering in Computer
Science and Engineering from Thapar Institute of Engineering and Technology, Patiala, and
his M.C.A. from IGNOU, New Delhi. The author has fourteen years of rich experience in
teaching computer science subjects and has attended a number of workshops and short-term
courses on Research Methodology conducted by HRDC, UGC and other agencies. His
research domains include Artificial Intelligence, Machine Learning, and Cloud Computing.
He has published eight research papers in peer-reviewed indexed journals, one book chapter,
one book, and six research papers in international conferences. This book is intended to
provide in-depth knowledge of machine learning algorithms to M.Tech., M.Sc. (Data
Science), and M.C.A. students of different universities in India.

Dimple is an Assistant Professor at Punjabi University Neighbourhood Campus, Jaito, of
Punjabi University, Patiala, and a Research Scholar at Panjab University, Chandigarh. She
has qualified UGC-NET and has also been awarded a JRF in the subject of Computer
Science. She secured the 2nd university position at Panjab University, Chandigarh, and
completed her Master's in Computer Science and Applications from Kurukshetra
University, Kurukshetra. The author has twelve years of rich experience in teaching
computer science subjects. This book is intended to provide in-depth knowledge of machine
learning algorithms to M.Tech., M.Sc. (Data Science), and M.C.A. students of different
universities in India.

Price: 495 INR


AGPH Books
MACHINE LEARNING
TECHNIQUES

By

Dr. Rajeev Kapoor


&
Dimple

AGPH Books

2022

MACHINE LEARNING
TECHNIQUES
Dr. Rajeev Kapoor,
Dimple

© 2022 @ Authors

All rights reserved. No part of this Publication may be


reproduced or transmitted in any form or by any means,
without permission of the author. Any person who does
any unauthorised act in relation to this Publication may be
liable to criminal prosecution and civil claims for damage.
[The responsibility for the facts stated, conclusion reached,
etc. is entirely that of the author. The publisher is not
responsible for them, whatsoever.]

ISBN – 978-93-95936-62-0

Published by:

AGPH Books (Academic Guru Publishing House)


Bhopal, M.P. India

Contact: +91-7089366889

Preface
Machine learning (ML) is a kind of AI that allows software
programmes to improve their predictive abilities without
being expressly designed to do so. Machine learning
algorithms take previously collected data as input in order
to predict future output values. One popular use of
machine learning is recommendation engines; fraud
detection, malware threat detection, spam filtering,
business process automation (BPA), and predictive
maintenance are other very common uses.

In addition to aiding in product creation, machine learning


helps businesses keep tabs on shifting client preferences and
organisational tendencies. Some of today's most successful
businesses rely heavily on machine learning, including
Google, Facebook, and Uber. Many businesses now use
machine learning as a key differentiator in the marketplace.

This book will educate you on the wide variety of machine


learning algorithms available, each with its own area of
application. It is crucial to choose an appropriate algorithm
for each application. Because of the impressive precision and
adaptability they provide, neural networks are a popular class
of algorithms. When dealing with sparse data, though, it is
usually best to go with a less complex model. The higher the
quality of a machine learning model, the more precisely it can
identify characteristics and patterns in the data, and the more
accurate its judgments and forecasts will be.

About the Book
Techniques in machine learning (ML) allow computers to gain
knowledge via observation and practice. Machine learning (ML)
is the process by which a system learns new information without
being explicitly programmed to do so. This allows a system to
acquire and integrate knowledge through large-scale observations
and to grow and adapt to its environment.

Machine learning (ML) is a broad field that has yielded
fundamental statistical-computational theories of learning
processes, designed learning algorithms routinely used in
commercial systems such as speech recognition and computer
vision, and spawned a data-mining industry that discovers hidden
regularities in the ever-increasing volume of online data. Such
methods intelligently record and reason about data, allowing them
to organise previously acquired information and gain new
knowledge. Self-improving learning systems have the ability to
make themselves more and more efficient and successful over
time, and they have already accomplished a wide range of
successes, from simple memorization to the development of whole
new scientific ideas. Intelligent tutoring systems employ ML
methods to learn about their pupils, categorize their abilities, and
develop their own methods of instruction. By keeping track of
students' responses over time and extrapolating rules about the
class or the individual, they find ways to enhance instruction. They
draw on prior knowledge to guide current action, make it easier to
adjust to novel settings, and infer or deduce information not
directly known to the instructor.

Contents

CHAPTER-1: INTRODUCTION 1

CHAPTER-2: DECISION TREE LEARNING 55

CHAPTER-3:PROBABILITY AND BAYES LEARNING 100

CHAPTER-4: ARTIFICIAL NEURAL NETWORKS 141

CHAPTER-5: ENSEMBLES 174

CHAPTER-1: INTRODUCTION

1.1. Introduction to Machine Learning

Machine learning (ML) is a form of artificial intelligence


(AI) that allows computers to “self-learn” from the training
data & improve over time, without even being explicitly
programmed. It is possible to train a machine to recognise
patterns in data and use that knowledge to generate
predictions. In a nutshell, machine learning relies on
iterative algorithms and models that improve with
practice.

Conventional programming is the creation of a set of


instructions by a computer expert that tells a computer
how to take some kind of input and produce some sort of
intended result. The majority of instructions follow an IF-
THEN format: the programme operates only if the
specified criteria are satisfied.

The opposite is true with machine learning, which is


the automated process that allows computers to solve issues
with little or even no human intervention and to respond
according to what they have learned from previous
experiences.

While artificial intelligence & machine learning are


sometimes used interchangeably, they are two separate
ideas. AI is the wider notion – machines making choices,
acquiring new abilities, and solving problems in ways
comparable to humans – while machine learning is the
subset of AI that allows intelligent systems to independently
learn new things from data.

Machine learning algorithms may be taught to accomplish


tasks without being explicitly programmed to do so by
providing them with examples of labelled data (referred to
as training data).

According to Google's Chief Decision Scientist, machine


learning is just a sophisticated labelling machine. Once
machines are trained to label items, such as apples and
pears, by giving them examples of these items, the
machines will be able to identify apples & pears without
further instruction, given that they were trained using
accurate and relevant training examples.

Large datasets are no problem for machine learning, which


can process them more precisely than humans. It may help
you save time and money on tasks and analyses, such as
fixing customer pain points to boost customer happiness,
automating support tickets, and mining data from internal
sources and from all over the internet.

Machine Learning is employed everywhere, from
automating monotonous work to delivering useful
insights, and organisations in every industry aim to gain from it. A
gadget that makes use of it may already be in your
possession. Wearable fitness trackers like Fitbit, or smart
speakers such as Google Home, are two such examples. In
contrast, there are several other applications of ML.

• Prediction — Prediction systems may also benefit


from the use of machine learning. In a loan
scenario, for example, the system would need to categorise the
available data to calculate the likelihood of default.

• Image recognition — Face recognition in images is


another use of machine learning. If you have many
persons in your database, each of them will have their
own section.

• Speech Recognition — It’s the process through


which spoken language is converted into written
form. It has several applications, including voice-
based search, but is not limited to them. In addition
to making and receiving calls, voice user interfaces
may also be used to manage home appliances. It's
also handy for making neat and orderly papers and
entering basic data.

• Medical diagnoses — Machine learning models are
trained to identify malignant tissue.

• The financial industry and trading — Businesses
use ML to analyse applicants' credit histories and
investigate possible fraud.

1.1.1. A Quick History of Machine Learning

Figure 1.1 A Quick History of Machine Learning *

The Electronic Numerical Integrator and Computer, or


ENIAC, was built in the 1940s and was among the first
electronic general-purpose computers to be operated by
humans. Because the word "computer" at the time usually
referred to a human who performed calculations, ENIAC
was given the moniker of a "numerical computing machine."
You could argue that it had nothing to do with learning, but
that is not quite true: the goal was always to create a
machine with the cognitive abilities of a human being.

*https://towardsdatascience.com/introduction-to-machine-
learning-for-beginners-eed6024fdb08

The very first computer program to boast that it could defeat
the world checkers champion was developed in the 1950s,
and it proved invaluable to players wanting to improve their
checkers game. Around the same time, Frank Rosenblatt
developed the Perceptron, a basic classifier that, when
networked together in vast numbers, can be very effective.
"Powerful" is of course relative to its era, and back then this
was a major technological advance. The field of neural
networks then seemed to stall for a few years as researchers
struggled to find solutions to their most pressing concerns.

1.1.2. Machine Learning Definitions

Algorithm: An algorithm for machine learning is a


collection of guidelines and statistical methods for
discovering insights in data. The reasoning behind an ML
model. A Linear Regression algorithm is an instance of
the Machine Learning algorithm.

Model: One of Machine Learning's core building blocks is


a model. Models are trained using machine learning
algorithms. Algorithms are blueprints that lay out the
sequence of steps that must be taken by a model in
response to input to get the desired result.

Predictor Variable: One or more characteristics of the


input data that may be utilised to make a forecast about
the result.

Response Variable: The feature or output variable(s) that the
predictor variables are used to predict.

Training Data: The training data is used to construct the
Machine Learning model. The model learns to recognise
important trends & patterns in training data that allow for
accurate prediction of the output.

Testing Data: The predictive ability of a model is


determined by its testing once training is complete. The
testing dataset is used for this purpose.

Figure 1.2 Testing Data*

Consider the above diagram as a summary. The first step


in the Machine Learning process is to provide the machine
with a large quantity of data, which it will use to learn
how to identify patterns and outliers. The data is then sent
into an algorithm, which uses the information to construct
the Machine Learning Model.
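
As a concrete illustration of this flow, the minimal sketch below splits a labelled dataset into training and testing portions and fits a model with scikit-learn; the choice of the Iris dataset and of a decision tree is an assumption made purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load a small labelled dataset (assumed here only for illustration).
X, y = load_iris(return_X_y=True)

# Hold back a portion of the data for testing the trained model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feed the training data into an algorithm to construct the model.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Check the learned patterns against data the model has not seen.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))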

*https://towardsdatascience.com/introduction-to-machine-
learning-for-beginners-eed6024fdb08

Machine Learning Process

Machine learning entails constructing a Predictive model


which can be used for the Problem Statement to determine
an appropriate course of action. Assume you have indeed
been given an issue to tackle utilizing:

Figure 1.3 Machine Learning Process*

Predicting when and where rain will fall using machine


learning is the challenge at hand.

The stages of a Machine Learning procedure are as


follows:

Step 1: Define the objective of the Problem Statement

The first step is to define the scope of the prediction


problem. The purpose of this exercise is to use meteorological
analysis to foretell whether or not it will rain. At

*https://towardsdatascience.com/introduction-to-machine-
learning-for-beginners-eed6024fdb08

this point, it's also crucial to make mental notes on the
types of data that may be utilized to address the issue, as
well as the strategies that should be used.

Step 2: Data Gathering

The questions you should be asking now are,

• What information is lacking to find a solution to


this issue?

• Do we have access to the information needed?

• Where can I get this data?

After identifying the necessary data types, you must learn


to extract those values. Both manual and automated web
scraping methods are available for data acquisition. If you
are just starting with Machine Learning and don't care
about the data, this isn't an issue. There are many online
data sources; all you need to do is choose the one you
want, download the data set, and start working with it.

To return to the issue at hand, factors like humidity,


pressure, temperature, location, altitude (if you reside in a
hill station), and so on are required for accurate weather
forecasting. Gathering and archiving this information is
necessary for future use.

Step 3: Data Preparation

The information you've gathered is nearly never in the


proper format. Many problems, like missing data,
duplicate records, and variables with the same name, may
be found in the dataset. It is crucial to get rid of these

discrepancies since they may create inaccurate calculations
and forecasts. Now is the time to go through the whole
dataset for discrepancies and address them when they are
found.

Step 4: Exploratory Data Analysis

Exploratory data analysis (EDA) is the preliminary step of


a Machine Learning process. Knowing the trends and
patterns in the data is an essential part of data exploration;
this is the stage where relevant conclusions are drawn and the
connections between the variables are figured out.

For instance, if the temperature drops we know that rain is


more likely to occur, which helps us make rain forecasts.
These connections need to be analyzed and laid out now.

Step 5: Building a Machine Learning Model

A Machine Learning Model is constructed using all the


discoveries and patterns discovered in the Data
Exploration phase. The first step in this process is to divide
the dataset into the training set as well as the test set. The
model will be developed and analysed based on the
training data. The model's reasoning is grounded on the
actual Machine Learning Algorithm being used.

We may use a Classification Algorithm, like Logistic


Regression, to make rain forecasts since the result will be
either True or False.
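
A minimal sketch of this step is shown below; the feature names (humidity, pressure, temperature) and the tiny made-up dataset are assumptions used only to illustrate fitting a Logistic Regression classifier for a True/False rain outcome.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical weather observations: [humidity %, pressure hPa, temperature °C]
X_train = np.array([
    [85, 1002, 22], [40, 1020, 30], [90, 998, 21],
    [35, 1018, 33], [80, 1005, 24], [30, 1022, 31],
])
# Labels: 1 = it rained, 0 = it did not rain
y_train = np.array([1, 0, 1, 0, 1, 0])

# Fit the classification model on the training data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Predict whether it will rain for a new observation.
print(clf.predict([[88, 1000, 23]]))   # e.g., array([1]) -> rain expected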

The data collection, the issue's complexity, and the sort of


problem you're attempting to answer all play a role in
determining the best method to use. The next few

paragraphs will go into detail about the many issues that
may be addressed using Machine Learning.

Step 6: Model Evaluation & Optimization

Once the model has been constructed using the training
data set, it must be put to the test. The model's efficacy and
predictive accuracy are evaluated using the testing data
set. Once the accuracy has been determined, the model
may be tweaked to increase its performance; techniques
such as parameter tuning and cross-validation may be used
to enhance it.
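
The minimal sketch below illustrates this kind of tuning; the parameter grid is a hypothetical choice, the training data repeats the made-up weather example above, and GridSearchCV carries out the cross-validated parameter tweaking.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Same hypothetical weather data as in the previous sketch.
X_train = np.array([[85, 1002, 22], [40, 1020, 30], [90, 998, 21],
                    [35, 1018, 33], [80, 1005, 24], [30, 1022, 31]])
y_train = np.array([1, 0, 1, 0, 1, 0])

# Hypothetical grid of regularisation strengths to try.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 3-fold cross-validation over the grid picks the best-performing setting.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3)
search.fit(X_train, y_train)

print("Best parameter:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)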

Step 7: Predictions

After the model has been tested, tweaked, and refined, it


may be utilized to generate predictions. The result may be
a continuous quantity or a Boolean value (such as True or
False).

1.1.3. Type Of Problems In Machine Learning

Figure 1.4 Type of Problems In Machine Learning *

https://www.edureka.co/blog/introduction-to-machine-learning/
*

If you look at the diagram above, you'll see that there are
primarily three kinds of challenges that may be addressed
by Machine Learning:

• Regression: The solution to this sort of issue


always takes the form of a continuous number. So,
if you want to guess how fast a vehicle will go
based on how far it has to go, you have a
Regression issue on your hands. Regression issues
may be handled by employing Supervised
Learning methods like Linear Regression.

• Classification: There is a category value produced


by this sort of system. Supervised Learning
classification methods like Naive Bayes, Support
Vector Machines, K Nearest Neighbor, Logistic
Regression, etc. may be used to address the
classification issue of sorting emails into spam as
well as non-spam categories.

• Clustering: Input is divided into two or more


groups according to shared characteristics.
For example, grouping viewers into
similar groups depending on their hobbies, age,
region, etc. may be done by employing
Unsupervised Learning techniques such as K-
Means Clustering, as shown in the sketch below.
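
A minimal sketch of such clustering follows; the two made-up features (age and hours watched per week) and the choice of two clusters are assumptions for illustration only.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical viewer data: [age, hours watched per week]
X = np.array([
    [16, 20], [18, 22], [17, 25],    # younger, heavy viewers
    [45, 4],  [50, 3],  [48, 5],     # older, light viewers
])

# Group the unlabelled data into two clusters based on similarity.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Cluster assignments:", labels)        # e.g., [0 0 0 1 1 1]
print("Cluster centres:\n", kmeans.cluster_centers_)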

1.1.4. Application of Machine Learning


The field of machine learning is a hot topic in today's
technology, and its popularity is only increasing. Without

even realising it, we use machine learning every day in the
form of Google Maps, Alexa, Google Assistant, etc. The
following are some of the most well-known current uses
of machine learning in the real world:

1. Social Media Features

Some of the best and most interesting additions to social


networking sites are the results of machine learning
algorithms and techniques. Facebook, for example, keeps
track of your conversations, likes, comments, and the
amount of time you spend on various categories of
postings. Machine learning uses data about your activities
to recommend people you may like and content you
would want to see on your profile.

Figure 1.5 Social Media Features*

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

2. Product Recommendations

Figure 1.6 Product Recommendations*

One of the most well-known and widespread uses


of machine learning nowadays is in the realm of product
recommendation. Almost all e-commerce sites now use a
sophisticated application of machine learning techniques:
product recommendations. Sites use AI and machine
learning to analyse your browsing habits, transactions, and
cart contents to offer goods and services.

3. Image Recognition

One of the most important and well-known applications of


machine learning and artificial intelligence is image
recognition, which is the method for categorising and

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

recognising a feature or item in a digital picture. Face
detection, Pattern recognition, and face identification are
only a few of the applications that have embraced this
method for deeper examination.

Figure 1.7 Image Recognition*

4. Sentiment Analysis

An essential use of machine learning is sentiment analysis.


To ascertain the speaker's or writer's emotional state,
sentiment analysis employs real-time machine learning.
When someone writes an email, review, or any other kind
of document, the sentiment analyzer can quickly and
accurately determine the author's true intentions and

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

mood. The review-based website, the decision-making
software, etc., might all benefit from this sentiment
analysis tool.

Figure 1.8 Sentiment Analysis*

5. Automating Employee Access Control

Machine learning algorithms are increasingly being used


by businesses to ascertain what levels of access workers
need in different sectors of the company. One of the most
interesting uses of machine learning is in this.

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

Figure 1.9 Access Control*

6. Marine Wildlife Preservation

Figure 1.10 Marine Wildlife Preservation*

https://www.hulkautomation.com/access-control
*

By using machine learning algorithms, researchers can
better monitor and manage the populations of marine
animals, such as critically endangered cetaceans.

7. Regulating Healthcare Efficiency and Medical Services

Major medical fields are investigating the use of machine


learning algorithms for improved administration. They
estimate how long patients will have to wait in the
hospital’s emergency wards. Data from staff shift
schedules, patient histories, departmental conversations,
and ER floor plans are all used to inform the models.
Disease detection, therapeutic preparation, and prognosis
all include the use of machine learning algorithms. This is
among the most important uses for machine learning.

8. Predict Potential Heart Failure

Medical experts are excited about the potential of an


algorithm that can read the doctor's free-form electronic
notes and extract relevant information about the patient's
cardiovascular history. Computers now make the analysis
based on available information, reducing duplication that
would otherwise need a doctor to go through many health
records to arrive at a correct diagnosis.

9. Banking Domain

To combat fraud and safeguard customer accounts from


hackers, financial institutions have used cutting-edge

https://www.nature.com/articles/s41467-022-27980-y
*

technologies made possible by machine learning.
Algorithms decide which criteria should be used to design
a filter to prevent damage. Sites that are determined to be
fake would be blocked from processing payments
immediately.

10. Language Translation

Language translation is among the most popular uses

for machine learning. A lot of work goes into translating
one language into another, and machine learning plays a
big part in that. We are awed by how smoothly and
accurately websites can translate across languages while
maintaining the original content's context. Machine
translation describes the underlying technology that
allows for this translation tool to function. It's made it
possible for individuals all over the globe to communicate
with one another; without it, modern life would be much
more challenging. It has given tourists and businesspeople
the assurance they need to go to distant countries without
worrying about communication barriers caused by a lack
of a common language.

1.2. Different types of learning


All three of these methods may be used to teach a
computer how to solve a problem. A machine may learn in
these ways:

• Supervised Learning.

• Unsupervised Learning.

• Reinforcement Learning.

Supervised Learning

The term "supervised learning" refers to a method of


machine training in which data is labelled for instruction.

To better grasp Supervised Learning, this example will


help. We all required help as youngsters to figure out how
to do arithmetic. Our educators enlightened us on the
nature of addition and its procedures. Comparatively,
supervised learning is a kind of Machine Learning that
relies on human oversight. To learn how to recognise
patterns in data, the labelled data set serves as a mentor. A
training data set is just a labelled data set.

Figure 1.11 Training the computer to distinguish the Tom &


jerry*

Consider the above diagram into account. Here, we're


training the computer to distinguish between Tom and
Jerry's pictures and place them in one of two categories.
During model training, the data set is labelled to show the

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

computer "this is how Tom appears," for example. You
may use labelled data to train the computer in this way.
An organized training phase using labelled data is at the
heart of Supervised Learning.

Unsupervised Learning

Unsupervised learning is a kind of machine learning in


which the model is trained with data that has not been
labelled and then given the freedom to make decisions
independently.

Unsupervised learning is like a bright child that discovers


new things on their own. In unlabeled data Machine
Learning, the model is given as much data as possible
without being told anything about the data, such as "this
picture is Tom & this is Jerry." The model then uses this
data to discover patterns and distinguish between Tom &
Jerry.

Figure 1.12 Recognition of type 1 picture of Tom*

*https://www.simplilearn.com/tutorials/machine-learning-
tutorial/machine-learning-applications

For instance, it recognizes that this is a type 1 picture of
Tom by picking up his characteristic pointed ears, larger
stature, etc. The programme recognizes a different set of
characteristics in Jerry and concludes that those are type 2
pictures. As a result, it divides the pictures into two groups
(cat and mouse) without ever learning who Tom or Jerry is.

Reinforcement Learning

The field of machine learning known as Reinforcement


Learning involves placing a virtual agent in a real-world
setting and teaching it how to act appropriately by
watching the consequences of its choices and adjusting its
behaviour accordingly.

1.3. Evaluation
When we evaluate a model, we try to put a number on
how accurate the model is in making predictions. We
achieve this by testing the accuracy of the freshly trained
model on a separate, unseen data set, checking its
predictions against the labelled ground truth.

Key lessons from the use of performance indicators for


evaluating models include:

• How effective is our model, exactly?

• Can we trust our model enough to put it into


production?

• Will adding more data to the training set make my


model better?

• Do I have an under-fitting or over-fitting model?

When your model makes categorization predictions, it


might result in one of four states:

• When your system correctly identifies an


observation as belonging to a certain class, you
have a true positive.

• For an observation to be considered the true


negative, your system must correctly anticipate that
it doesn't belong to a certain class.

• The false positive is a prediction that an
observation belongs to the class when, in fact, the
observation does not fit that class. Also referred to
as a Type I error.

• False negatives arise when you anticipate an
observation doesn't belong to a class but in reality,
it does. A Type II error is another name for this.

Using the aforementioned results, we may assess a model's


efficacy using several measures.

Metrics for classification models

When assessing categorization models, the following


metrics are often reported:

• Accuracy is the ratio of correct outcomes to the


total number of instances. The rate of accuracy
should be high.

• Accuracy = # correct predictions / # total data
points.

• The log loss is a scalar value that quantifies the


classifier's performance relative to the random
prediction. As a comparison of the probability of
your model's outputs to known values, the log loss
quantifies the degree of uncertainty in the model
(ground truth). Overall, you should try to keep the
model's log loss as little as possible.

• Precision is the proportion of predicted positives
that are actually positive, i.e. true positives divided
by all positive predictions.

• Recall is the proportion of actual positives that the
model successfully identifies, i.e. true positives
divided by all actual positives.

• Precision and recall are used to calculate the F1-
score, which should ideally be 1.

• AUC measures the area under the ROC curve,
which plots the true positive rate (y-axis) against
the false positive rate (x-axis). When comparing
models of various sorts, having a single statistic
like this may be quite helpful.

• The confusion matrix summarises how the model's
predicted classes correspond to the actual labels. A
confusion matrix has the model's predicted label
on one axis and the actual label on the other. For
this purpose, let's refer to N as the total number of
categories. To simplify things, let's say N = 2 in
the problem of binary classification.
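
A minimal sketch of computing these metrics with scikit-learn appears below; the short lists of true and predicted labels are made up purely to illustrate the calls.

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical ground-truth labels and model predictions (N = 2 classes).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
# Rows are actual labels, columns are predicted labels.
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))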

1.4. Training and test sets

1.4.1. Training data set


During a learning process, the parameters (e.g., weights) of
a model, like a classifier, are fitted using the training data
set of instances.

A supervised learning algorithm examines the training


data set to discover the most effective combinations of
the variables to produce a reliable prediction model,
allowing it to be used for classification tasks. Ultimately,
we want to end up with a trained (fitted) model that can
successfully apply to brand-new, unanticipated data. To
measure the fitted model's accuracy in categorising new
data, we use "new" instances from the withheld datasets
(validation as well as test datasets) to evaluate a model.
Over-fitting and other problems may be avoided if
validation & test dataset instances are not utilized during
model training.

Overfitting the data is a common problem for methods


that sift through the training data in search of empirical
correlations; this occurs when the methods discover and
capitalize on associations that exist only in training data
but do not hold in the real world.

1.4.2. Test data set
Test data sets are data sets that are not part of a training
data set, but that follow the same probability distribution.
Overfitting is not severe if a model that performs well on
the training set also performs well on the test set.
Overfitting occurs when the model fits the training data
better than the test data.

Accordingly, the only purpose of the test set is to evaluate


the generalization capabilities of a fully trained and
defined classifier. Classifications of test set instances are
predicted using the final model. The accuracy of the model
is determined by comparing its predictions to the actual
labels assigned to the instances.

Figure 1.13 Test data set*

Where both validation and test datasets are used, the test set is
typically used to evaluate the final model selected during
validation.
To evaluate a model, the test data set may only be used
once an original data set has been split into two parts

*https://en.wikipedia.org/wiki/Training,_validation,_and_test_da
ta_sets

(training & test datasets). Keep in mind that several
publications caution against using this approach. When
utilising a technique like cross-validation, however, where
results are averaged after several rounds of the model
training & testing to help decrease bias and variability, two
divisions may be enough and successful.

1.5. Cross-validation
If you want to get a rough idea of how well your machine
learning model is doing, you may utilise the statistical
technique of cross-validation to do so. It is employed to
prevent a predictive model from being too sensitive to its
inputs, which may happen when data are scarce. For cross-
validation, a certain number of data divisions (called
"folds") are created, the analysis is performed on every
fold, and an average error estimate is then used.

Understanding the nature of the issue at hand is essential


when tasked with the Machine Learning project since this
will allow you to choose the algorithm most likely to
provide optimal results. However, evaluating the various
models is problematic.

Let's say you're curious about the model's capabilities after


training it using the provided dataset. Using the same data
set that a model was trained on for evaluation purposes is
one option, but this isn't always the best course of action.

So why can't we just use the training data to evaluate the


model? By doing so, we are making the absurd
assumption that training data covers every potential event

in the actual world. Even though the training dataset is
also real-world data, it only represents a subset of all the
available data points (instances) out there, and our
primary goal is for the model to perform well on real-
world data.

The model must be evaluated on data it has never seen


before (the "testing set") to get an accurate rating.
However, won't we be missing out on valuable insights
included in a test dataset if we separate our data into
the training data and the test data? Let's look at the various
forms of cross-validation to see what works.

1.5.1. Types of Cross-Validation

Non-exhaustive techniques, as well as exhaustive


methods, are two main categories into which the many
cross-validation approaches may be sorted. Some instances
from each group will be examined.

Non-exhaustive Methods

As the name implies, the non-exhaustive cross-validation


approaches do not calculate every possible partition
of original data. Let's go through the processes so you can
see how they work.

Holdout method

In this strategy, we split the total dataset in half, creating


separate sets of data for training and testing. According to
its moniker, we "train" the model on one set of data before

"testing" it on another. As a rule of thumb, 70:30 or 80:20 is
a common ratio for how training and testing datasets are
divided.

In this method, the data is initially randomly shuffled


before being divided. Because the model is trained on a
different random subset of data points each time, it may not
produce consistent results from one run to the next, which
introduces instability. In addition, there is no guarantee that the
sample we used for the training set is representative of the
whole dataset.

Even if our dataset isn't huge, we may be missing out on


some crucial details by not training the model using data
from the testing set.

The hold-out approach is useful in data science when you


have a large dataset, are short on time, or are just getting
started with your first model.

K fold cross validation

One option to enhance the holdout technique is to use k-


fold cross-validation. This approach ensures that the
model's performance is independent of the criteria used to
choose the training and evaluation data. After splitting the
dataset into k equal parts, the holdout technique is
executed on each of the subsets independently. Proceed
with me here step-by-step:

Divide your data set into k equal folds (subsets) at random.

For every fold in your dataset, build your model on the other
k − 1 folds of data. Then, assess the model's performance on
the held-out kth fold.

Iterate until all k-folds have been used as a test sample.

As a measure of your model's efficacy, you should use the


cross-validation accuracy, which is just the mean of your k
recorded accuracy values.

This strategy often yields a less biased model compared to


others since it guarantees that every observation from
an original dataset has the chance of appearing in
the training and the test set. If we have a little amount of
data to work with, this is a good method to try.

The training process must be redone from scratch k times,


which increases the amount of time and effort required to
evaluate the results by a factor of k.
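
A minimal sketch of k-fold cross-validation with scikit-learn is shown below; the Iris data and logistic-regression model are assumptions chosen only to keep the example self-contained.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: train on 4 folds, test on the 5th, repeat 5 times.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print("Accuracy per fold:", scores)
# The cross-validation accuracy is the mean of the k recorded accuracies.
print("Mean CV accuracy :", scores.mean())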

Stratified K Fold Cross Validation

One must use caution while using K Fold on


the classification task. We are introducing potential bias
into our training process by randomly shuffling the data
before splitting it into folds. Take, as an illustration, a fold
where the majority are members of one class (let's say the
positive) and the minority are members of the other class
(the negative). To prevent this from happening and keep
our training sound, we use stratification when creating the
folds.

Using a technique called "stratification," the data is
reorganised to guarantee that each "fold" accurately
represents the whole dataset. In a binary classification problem with
equal amounts of data within every class, for example, the
data should be folded such that in each subset the
proportion of occurrences in each class is approximately
equal to its overall proportion.

Leave-P-Out cross-validation

As part of this thorough procedure, we choose p data


points at random from the whole dataset (say n). We use
(n - p) data points to train the model plus p data points to
test it. For the whole range of p values included in the
initial data set, we carry out the same procedure. Once we
have completed many iterations, we use an average of the
resulting accuracies to get a final accuracy.

To train a model as thoroughly as possible, we try every


conceivable permutation of the data. Keep in mind that if
we choose a larger value for p, there will be a larger total
number of alternatives, and therefore we may conclude
that the approach becomes much more exhaustive.

Leave-one-out cross-validation

The value of p is kept at one for this form of "Leave-P-Out


cross-validation". Due to this change, the approach is
much less exhaustive, as there are now n rather than pn
possible choices for n data points & p=1.
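
The minimal sketch below runs leave-one-out cross-validation with scikit-learn; the tiny dataset and the linear-regression model are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical data: one feature, six observations.
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1.2, 1.9, 3.1, 4.2, 4.8, 6.1])

# Each iteration trains on n - 1 points and tests on the single point left out.
loo = LeaveOneOut()
scores = cross_val_score(LinearRegression(), X, y, cv=loo,
                         scoring="neg_mean_squared_error")

print("Number of splits:", loo.get_n_splits(X))   # equals n
print("Mean squared error:", -scores.mean())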

1.6. Linear Regression: Introduction
Linear regression is a method used in statistics to describe
the association between the scalar response (or dependent
variable) and one or even more explanatory factors (or
the independent variables). For this reason, the term
"Simple Linear Regression" is used to describe a situation
in which there is just a single explanatory variable.

To discuss the efficacy and assessment of models, we will


first define linear regression, explain its underlying
assumptions, and introduce some fundamental statistical
concepts. We'll wrap up by putting Python to use to
implement two instances. Thus, it is highly recommended
that you familiarize yourself with elementary inferential
statistics, descriptive statistics, and hypothesis testing.
Have no fear if you lack these abilities; you will still be
able to grasp the material.

Simple and Multiple Linear Regression

When there is just one variable in the linear model, we call


it simple linear regression. It's put to use whenever a
connection between two variables has to be simulated.
Predicting Brazilian college students' IRA (Academic
Performance Index) scores using their ENEM (Brazilian
standardised college entrance exam) scores is an example
of Simple Linear Regression. Or maybe just utilising the
home's square footage as an input to determine its value.
The standard equation for representing such a connection
looks like this:

y = b₀ + b₁·x₁,

where b₀ is called the intercept and b₁ is the coefficient of x₁,
the input (independent) variable. For our IRA vs. ENEM
example, we can write IRA = b₀ + b₁·ENEM.

As a result, once we know both parameters (b1 and b0), we


may use them in the linear equation that can look like this:

Figure 1.14 Simple Linear Regression*

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

After running the linear regression, we get the equation of the
orange line drawn through the points. Using this line, a rough
approximation of the outcome (the value on the vertical axis)
can be calculated for any value on the horizontal axis.
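
A minimal sketch of fitting such a line in Python is given below; the handful of ENEM/IRA score pairs are invented purely to illustrate recovering b₀ and b₁ with scikit-learn.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical ENEM scores (input) and IRA scores (output).
enem = np.array([[450], [520], [600], [680], [750]])
ira = np.array([5.1, 5.9, 6.8, 7.9, 8.7])

reg = LinearRegression().fit(enem, ira)

print("Intercept b0:", reg.intercept_)
print("Slope b1    :", reg.coef_[0])
# Predict the IRA for a new ENEM score using y = b0 + b1 * x.
print("Predicted IRA for ENEM = 640:", reg.predict([[640]])[0])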

To make reliable forecasts in practical situations, we often


require information from several sources. It is possible to
use a student's ENEM score alone to make educated
guesses about their IRA.

However, many other elements might affect the student's


performance, such as how far they have to travel to go to
school, the amount of money their family makes, how long
it's been since they graduated high school, the student's
gender, his or her marital status, and so forth.

An extension of the simpler linear regression is the


multiple linear regression. The following equation
describes it well:

y = b₀ + b₁⋅ x₁ + b₂⋅ x₂ +⋅⋅⋅ + bn⋅ xn.

Where b₀ is an intercept and “b₁, b₂, …, bn” are the


coefficients of independent variables.

To explain a linear connection between a single dependent


variable (y) and several independent variables S = {x₁, x₂,
…, xₙ}, multiple linear regression may be used.

Simple linear Regression

𝑦 = 𝑏0 + 𝑏1 𝑥1

Multiple Linear Regression

𝑦 = 𝑏0 + 𝑏1 𝑥1 + ⋯ + 𝑏𝑛 𝑥𝑛

Assumptions for Linear Regression

Before doing the regression analysis, it is crucial to be


familiar with and take into account all of the relevant
assumptions in linear regression. If you find that your
regression model fails to meet at least one of the
assumptions, you should reconsider using it to fit your
data.

These are the presumptions behind the regression:

Linearity

Linear regression models the simplest non-trivial
relationship between variables: the regression
equation is linear, hence the name "linear." How, then,
can we test whether a linear relationship exists between
two variables?

It's simplest to use a scatter plot, in which one variable is


plotted against another.

Now have a look at the figure below. The data points cannot
be adequately represented by a straight line; their
variability is better captured by the curved line in this
example. For this reason, linear regression should be avoided
on these data.

Figure 1.15 Multiple Linear Regression*

For example, non-linearity may be "fixed" in a few simple


ways, such as,

1. Run the “non-linear regression”.

2. Apply the “exponential transformation”.

3. Apply the “logarithmical transformation”.

Since we are just interested in elaborating on linear


regression assumptions, we would not go into these
methods now. Just know that you should check whether the
assumption holds before you run a linear regression.

It is recommended to first attempt the transformation and


afterwards the linear regression if data cannot be fitted by
the straight line.

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-regression-13527642f49

2. No endogeneity

Second, "no endogeneity of regressors" is assumed. A ban


on any association between the independent factors and
the mistakes is intended. The following equation gives its
mathematical expression:

According to this equation, the error (the gap between


observed and anticipated values) cannot be a function of
another independent variable. An example of missing
variable bias is this situation. The issue occurs when a
necessary variable is left out of the model. Let's say you're
in this situation:

Both the explanatory variable x and an additional


explanatory variable, x*, is necessary to fully explain the
dependent variable, y. In every probability, x is connected
to x*. On the other hand, we did not include it as
the repressor. The mistake sums together everything that
can't be accounted for by the model. Therefore, the mistake
starts to be associated with everything.

Now, let's look at a different illustration. Let's pretend we're
interested in predicting the value of apartments in midtown
Manhattan based just on their size (x). Suppose the fitted
model gives an equation in which the coefficient on size is
negative.

Considering that our intuition tells us that a larger apartment
should carry a larger price tag, the finding is rather
paradoxical: the predicted value declines as x (the size of the
apartment) grows. This indicates that the independent variable
and the errors have non-zero covariance.

• From where do I get the sample?

• Where can I get a better sample?

• Why do larger properties tend to be less expensive?

Consider the following: apartments in Manhattan make up a
large enough portion of the sample to rule out sampling error,
and Manhattan is the site of some of the world's most
expensive properties. To keep things simple, however, we did
not include location information in the model, even though
million-dollar suites in New York City are, in our example,
the turning point. If we include this determinant in our
analysis, the size coefficient becomes positive, as expected.

Although these are fictitious calculations, the outcome is
telling: it stands to reason that the larger the property, the
more it will cost.

The takeaway here is that expertise in the field is useful.


The endogeneity trap occurs when one fails to fully
appreciate the nature of the situation at hand.

3. Normality and Homoscedasticity

Think of this as the most intuitive assumption of them all. The
third assumption is described by the equation ε ∼ N(0, σ²) and
consists of three parts:

• Normality: the error term is assumed to follow a normal
distribution. The data themselves do not need to be normally
distributed, but normally distributed errors are needed for
valid inference.

• Zero Mean: a line would not provide a decent fit if the
expected mean of the errors were not zero. Including the
intercept, however, eliminates the issue.

• Homoscedasticity: the errors need to have equal (constant)
variance across observations; in other words, their spread
should not depend on the values of the predictors.

➢ Homoscedasticity vs Heteroscedasticity: How to


detect and visualize

To better understand Homoscedasticity &


Heteroscedasticity, we shall go through an instance. We'll
simulate the incidence of car crashes in relation to urban
populations using the cross-sectional study as an example.
The information is all made up, yet it accurately depicts
the issue.

This illustration was inspired by Statistics by Jim's work on
heteroscedasticity in regression analysis.

in Statistics inspired this illustration.

To begin, it's crucial to plot the data and understand how
it truly appears:

Figure 1.16Results of linear regression shown on a graph*

The results of the linear regression on the data would look


like this:

If you look at the orange regression line, you'll see that
the errors grow along with the population. That is a strong
indicator of heteroscedasticity.

Heteroscedasticity may also be examined by plotting the
residuals against the fitted values. In such residual plots,
heteroscedasticity shows up as a fan or cone shape. The
necessary data are plotted below; once again, the error
variance is clearly not constant.

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

Figure 1.17 There seems to be a spreading of the data as one
moves from left to right along the x-axis.*

Figure 1.18 Examining Heteroscedasticity†

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49
†https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

➢ Fixing heteroscedasticity

To obtain Homoscedasticity and improve regressions, data


may be transformed in several ways.

You may, for instance, take the natural log of every
observation of the dependent variable and then regress
log(y) on the independent x's. Alternatively, you may apply
the same transformation to the problematic x variable. What
does this look like?

First, let's return to the previous regression. A major
contributor to heteroscedasticity was found to be the
population data (represented on the x-axis). Since this is
the case, let's apply the logarithmic transformation to the
population data and see what we get:

Population data is represented on the Y-axis*.

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

This gives us a new model, known as a semi-log model:
y = b₀ + b₁·log(x₁). The y variable is another potential
candidate for transformation to see whether our model
improves.

Figure 1.19 Population data showing linear regression*

While the changes did not eliminate heteroscedasticity,


they did get us very close to homoscedasticity, as you
might have guessed. The equation for this model, known
as the log-log model, is log(y) = b₀ + b₁·log(x₁).

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

The interpretation is: when x increases by 1%, y also increases
by b₁%.
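
The minimal sketch below shows how such transformations might be applied in Python; the synthetic crash/population data and the use of numpy's log are assumptions made only to demonstrate the semi-log and log-log variants.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: error spread grows with population (heteroscedasticity).
population = rng.uniform(10_000, 1_000_000, size=200)
crashes = 0.002 * population + rng.normal(0, 0.0003 * population)

X = population.reshape(-1, 1)

# Semi-log model: regress y on log(x).
semi_log = LinearRegression().fit(np.log(X), crashes)

# Log-log model: regress log(y) on log(x).
log_log = LinearRegression().fit(np.log(X), np.log(crashes))

print("Semi-log slope b1:", semi_log.coef_[0])
print("Log-log slope b1 :", log_log.coef_[0])   # ~ elasticity: 1% in x -> b1% in y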

4. No autocorrelation

The assumption of no autocorrelation comes up next, and


it is the most important one. Since it cannot be loosened,
this property is among the most troublesome. The
mathematical definition of "no autocorrelation" is that the
covariance of any two different error terms is zero:
Cov(εᵢ, εⱼ) = 0 for i ≠ j.

Errors are thought to be unrelated to one another.


Autocorrelation is very unlikely to be present in cross-
sectional data (data collected at a single point in time). The
opposite is true, however: time series is rife with this
phenomenon.

Having autocorrelation is a good indicator that the data


has some kind of pattern to it, something that linear
regression does not take into account. As an example, let's
have a look at a dataset of daily low temperatures in
Melbourne, Australia between 1891 and 1991.

Notice the regular, seasonal pattern in the data. Linear
regression assumes that the errors are randomly scattered
around the regression line, which will most definitely not be
the case here.

Now the question is, how can autocorrelation be
identified? Plotting all the residuals on the graph and
inspecting them for patterns is a typical method. Not
finding any means you are probably okay.

Figure 1.20 An example of Melbourne showing temperature in


the different years on the graph*

There's also the option of doing a Durbin-Watson test on


the regression. Its values range between 0 and 4. A

*https://medium.com/@alexandre.hsd/an-introduction-to-linear-
regression-13527642f49

score of 2 implies no autocorrelation, while values below 1
or above 3 should raise red flags.
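
A minimal sketch of running this test in Python with statsmodels appears below; the synthetic series with autocorrelated noise is an assumption used only to produce residuals worth testing.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)

# Synthetic time series: a trend plus autocorrelated noise.
n = 200
x = np.arange(n, dtype=float)
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.8 * noise[t - 1] + rng.normal(0, 1.0)   # AR(1) errors
y = 0.5 * x + noise

# Fit an ordinary least squares regression of y on x.
model = sm.OLS(y, sm.add_constant(x)).fit()

# Values near 2 suggest no autocorrelation; below 1 or above 3 are red flags.
print("Durbin-Watson statistic:", durbin_watson(model.resid))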

Avoiding a linear regression model is your only option


when autocorrelation occurs. Possible substitutes include:

1. Autoregressive model.

2. The autoregressive integrated moving average


model.

3. The autoregressive moving average model.

4. Moving average model.

5. No Multicollinearity.

Finally, we assume there is no multicollinearity. The


regression coefficient may be understood as the average
change in a dependent variable for every one-unit change
in the independent variable while all the other
independent variables are held constant.

To keep the statement true, it is necessary that when one


variable is changed, the others remain unchanged. The
concept of multicollinearity is opposed to this. For any
given set of circumstances, if two variables are connected,
modifying one of them will inherently modify the other.

Multicollinearity occurs when three or more variables are
highly correlated with one another. To see how this works in
practice, let's use an equation. Take y = 3 + 2x:
The connection between y and x is a perfectly linear one:
y can be expressed exactly in terms of x and vice versa. A
model that contains both y and x as regressors would
therefore exhibit perfect multicollinearity (ρ = 1). This is a
significant problem for our model, since the calculated
coefficients would be unreliable. If y can be represented by x,
there is no need to use both of them.

Take z and t, two variables, and assume a correlation of "𝜌


= 0.93" between them. Similarly, imperfect
multicollinearity would exist in the regression model in
which z and t were used. The continued violation of the
assumption suggests that there may be an issue with our
model.

1.7. Simple and Multiple Linear regression

1.7.1. Simple linear regression


Establishing the straight-line connection between two
variables is the goal of the statistical technique known
as simple linear regression. Finding the slope and intercept
that define the line and minimise the regression errors allows
one to draw it.

Simple linear regression involves exactly one x-variable and
one y-variable. The x variable is the independent variable,
because it is used to forecast the other; the y variable, whose
value depends on your predictions, is the dependent variable.

• "y = β0 + β1x + ε" is the formula employed for
simple linear regression.

• Given a certain value of the independent
variable (x), we may predict the value of
the dependent variable, denoted by y.

• The intercept β0 is the projected value of y when
x is zero.

• The regression coefficient β1 indicates the
predicted slope of the line connecting the two
variables.

• The independent variable is denoted by "x."

• The error of an estimate, or margin of error, is
denoted by the symbol "ε."

The line that is produced by simple linear regression is a


decent approximation of the data, but this is not
a guarantee of accuracy; if the data points are widely
scattered, for instance, the fitted line may describe the
overall trend poorly.

1.7.2. Assumptions of Simple Linear Regression

Linearity

Linearity between x & y is expected. To put it another way,


if you raise one number, the other will rise proportionally.
Indeed, this linearity ought to be shown in a scatterplot.

Independence of Errors

It is crucial to verify that the errors (residuals) are
independent. Residuals can be problematic for your model if
they correlate with the fitted values or with one another. The
scatterplot of "residuals vs fits" may be used to test for
independence of errors; the plot must not suggest any kind of
relationship between the two.

Normal Distribution

In addition, you should make sure the residuals follow a
normal distribution. This may be checked by looking at a
histogram of the residuals, which should have a roughly
bell-shaped form with most values clustered around zero.
This helps ensure your model is trustworthy and correct.

Variance Equality

The last step is to make sure your data have similar


variances across the range of fitted values. For this, you'll
need to take a close look at the scatterplot for points whose
spread changes noticeably from one region to another (you
can also use statistical software such as Minitab or Excel).
If there are points that deviate far from the rest, the
equal-variance assumption may be violated.

1.7.3. Simple Linear Regression Model


The following equation is a representation of the Simple
Linear Regression model:

y= a0+a1x+ ε

Where,

a0= it is represented as an intercept of a Regression line


(which could be obtained by putting x=0).

a1= It represents a slope of a regression line, that tells


whether a line is increasing or decreasing.

ε = The error term. (For a good model it will be negligible).
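
To ground these symbols, the minimal sketch below computes a0 and a1 directly from the usual least-squares formulas, a1 = cov(x, y) / var(x) and a0 = ȳ − a1·x̄; the small data arrays are invented for illustration.

import numpy as np

# Hypothetical observations of x and y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares estimates: slope a1 and intercept a0.
a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()

print("Intercept a0:", a0)
print("Slope a1    :", a1)

# Residuals (the error term ε) should be small for a good model.
residuals = y - (a0 + a1 * x)
print("Residuals   :", residuals)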

1.7.4. Multiple Linear Regression

Multiple linear regression is a popular technique for


making predictions. This kind of analysis helps you grasp
the connection between a continuous dependent variable
and two or more independent variables.

Both continuous (like age and height) and categorical (like
gender and occupation) independent variables are
acceptable. To avoid misleading results, it is recommended
to dummy code any independent variable that is
categorical.

1.7.5. Formula and Calculation of Multiple Linear


Regression

Multiple regression analysis allows many factors that jointly influence the dependent variable to be taken into account. Regression analysis may then be used to examine the relationship between the dependent variable and this set of predictors.

Let k stand for the number of independent variables, represented by x1, x2, x3, ……, xk.

Assuming that there are k independent variables x1, x2,...,
xk that can be controlled, this technique uses these
variables to calculate the probability of a certain result Y.

In addition, we assume that the relationship between Y


and the variables is linear based on:

“Y = β0 + β1x1 + β2x2 + · · · + βkxk + ε”

• Y is the dependent (predicted) variable.

• β0 is the y-intercept: the value of Y when all the independent variables are zero.

• The coefficients β1 and β2 express the amount by which Y shifts for each one-unit change in x1 and x2, respectively.

• In general, βp is the slope coefficient of the p-th independent variable.

• The ε term describes the random error (residual) in the model.

As in simple linear regression, ε denotes the error term; the difference is that k is no longer restricted to 1.

Here, we have n observations, where n is often


considerably larger than k.

All independent variables are given the values "xi1, xi2,...,
xik" for the ith observation, and the value of a random
variable Yi is recorded.

Therefore, the equations may be used to describe the


model.

“Yi = β0 + β1xi1 + β2xi2 + · · · + βkxik + εi for i = 1, 2, . . . , n,”

in which the errors εi are independent random variables with zero mean and the same unknown variance σ2.

The total number of unknown parameters in the multiple linear regression model is k + 2:

“β0, β1, . . . , βk, and σ2.”

When k = 1, the “least squares line” y = β̂0 + β̂1x was found.

It was a line in the plane R2.

Now, with k ≥ 1, we have the “least squares hyperplane”

“y = β̂0 + β̂1x1 + β̂2x2 + · · · + β̂kxk in Rk+1.”

The way to find the estimators β̂0, β̂1, . . ., β̂k is the same: minimise the sum of squared errors

“Q = Σ i=1..n (yi − (β0 + β1xi1 + β2xi2 + · · · + βkxik))2”

by taking partial derivatives and setting them to zero. Solving the resulting system gives the fitted values

“ŷi = β̂0 + β̂1xi1 + β̂2xi2 + · · · + β̂kxik for i = 1, . . . , n”

which must be close to the actual values yi.
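A small sketch of this least-squares computation, using NumPy on an invented dataset (the numbers are assumptions chosen only for illustration):

# Least-squares fit for multiple linear regression on toy data.
import numpy as np

# n = 5 observations, k = 2 independent variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.1, 5.8, 12.2, 11.9, 16.0])

# Add a column of ones so the intercept beta_0 is estimated too.
X_design = np.column_stack([np.ones(len(X)), X])

# Minimise Q = ||y - X_design @ beta||^2 (the quantity Q above).
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("beta_0, beta_1, beta_2:", beta_hat)

# Fitted values y_hat_i should be close to the observed y_i.
print("fitted values:", X_design @ beta_hat)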

Assumptions of Multiple Linear Regression

The dependent variable in multiple linear regression is the value of interest, and the independent variables help explain it. Using these, you may construct a model that reliably predicts the dependent variable from the explanatory factors.

Several conditions must be met for your model to be trusted and credible:

• The relationship between the independent and dependent variables is linear.

• The independent variables are not highly correlated with one another.

• The variance of the residuals is constant (homoscedasticity).

• The observations are independent of one another.

• Multivariate normality holds for the variables.

1.8. Polynomial regression


The connection between a dependent (y) and an in-
dependent (x) variable is modelled as the polynomial of
degree n in a regression procedure known as the

52
polynomial regression. Here is the formula for polynomial
regression:

“y= b0+b1x1+ b2x12+ b2x13+...... bnx1n.”

• It is often referred to as a variant of Multiple Linear


Regression in ML. To make the transition
from Multiple Linear to Polynomial Regression, we
add a few polynomial terms to the original
regression equation.

• It is a tweaked version of a linear model meant to


improve precision.

• Polynomial regression employs the non-linear


dataset for training purposes.

• To accommodate non-linear and intricate functions


and data sets, it employs the linear regression
model.

Need for Polynomial Regression

A few examples are provided below to illustrate why ML


makes use of Polynomial Regression:

• Applying a linear model to a linear dataset yields a good result, as we saw in Simple Linear Regression, but applying the same model to a non-linear dataset yields a very different outcome: the loss function increases, the error rate rises, and accuracy drops sharply.

• Therefore, the Polynomial Regression model is required in situations where the data points are organised in a non-linear form. The accompanying comparison graphic between a linear model and a polynomial model helps illustrate the concept.

Figure 1.21 Simple and polynomial model*

Take a look at the figure above; it depicts a dataset arranged in a non-linear fashion, so the data is clearly not well captured by a linear model. In contrast, the polynomial model's curve can adequately describe the majority of the data points.

Consequently, we should employ the Polynomial


Regression model rather than the Simple Linear
Regression model if datasets are organized in a non-linear
form.
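A minimal sketch of polynomial regression with scikit-learn, on invented non-linear data; PolynomialFeatures expands x into polynomial terms and an ordinary linear model is then fitted on them:

# Polynomial regression sketch on a toy non-linear dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(0, 5, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() ** 2 + np.random.default_rng(0).normal(0, 1, 20)

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print("prediction at x = 6:", model.predict([[6.0]])[0])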

* https://www.javatpoint.com/machine-learning-polynomial-regression

CHAPTER-2: Decision tree
learning

2.1. Introduction
In machine learning, classification entails two phases: the
learning phase and the prediction phase. During the
training phase, the model is refined using the information
it has received. Following data input, the model is utilized
to provide a forecast of the outcome. When it comes to
classifying data, the Decision Tree is often regarded as one
of the most intuitive and widely used approaches.

2.1.1. Decision Tree Algorithm


The decision tree method is a supervised learning technique. Unlike many other supervised learning algorithms, the decision tree approach may be used for both regression and classification problems.

With the use of the Decision Tree, we can build a model


that could predict the target variable's class or value based
on a few basic rules derived from training data.

Predicting a record's class label using a decision tree involves traversing the tree from the top down. The record's attribute value is compared with the attribute tested at the root; based on the comparison, we follow the corresponding branch to the next node.
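As a hedged illustration of this idea (not part of the original text), a decision tree can be built and used for prediction with scikit-learn's DecisionTreeClassifier; the iris dataset is only a convenient stand-in:

# Train a decision tree and predict class labels on a toy dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Each prediction traverses the tree from the root, comparing attribute
# values at every node, until a leaf assigns the class label.
print("predicted classes:", tree.predict(X_test[:5]))
print("test accuracy    :", tree.score(X_test, y_test))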

2.1.2. Types of Decision Trees

For each goal variable, there exists a corresponding


decision tree type. There are two distinct categories:

• Categorical Variable Decision Tree: A decision tree whose target (dependent) variable is categorical is called a categorical variable decision tree.

• Continuous Variable Decision Tree: Whenever the target variable of the decision tree is continuous, we refer to it as a continuous variable decision tree.

Example:- Consider the challenge of determining the


likelihood (yes/ no) that a client would pay his insurance
renewal payment. The insurance business knows that
customers' income is a key factor in this context, but it does
not have such information for all consumers. Once this is

established as a significant factor, a decision tree may be
developed to estimate a customer's annual income based
on their employment, the kind of product they purchase,
and other factors. Here, we are making forecasts about the
values of continuous variables.

2.1.3. Important Terminology related to Decision


Trees

• Root Node: It is a representation of the whole


population or sample that is then split into two or
more similar groups.

• Splitting: To perform this operation, a node is first


divided into two or even more sub-nodes.

• Decision Node: It is a decision node when the sub-


node divides into two or more child nodes.

• Leaf / Terminal Node: A leaf node, also known as a


terminal node, is a non-splitting node.

• Pruning: Pruning refers to the process of eliminating sub-nodes from a decision node. It is the reverse of splitting.

• Branch / Sub-Tree: The term "branch" is also used


to refer to a smaller "sub-tree" inside the larger
"parent" tree.

• Parent and Child Node: Sub-nodes are the children
of a parent node, and the node that divides into
them is termed the parent node.

Figure 2.1 An example of a Decision tree*

Instances are sorted from the tree's root node down to a leaf (terminal) node, and the classification associated with that leaf node is assigned to the instance.

* https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html

Every node in a tree represents a test case, and the edges
radiating out from each node represent alternative values
for the corresponding property in a test case. This iterative
procedure is carried out for each child tree that now has its
root at a new node.

2.1.4. Assumptions while creating Decision Tree


The following are some of the presumptions we make
while using the Decision tree:

• In the beginning, the whole training set is treated as the root.

• Feature values are preferred to be categorical.

• If the values used are continuous, they are discretized before the model is built.

• Records are distributed recursively on the basis of attribute values.

• A statistical method is used to decide the order in which attributes are placed as the tree's root or internal nodes.

In a decision tree, data is represented using the SOP (Sum


of Products) format. Another name for the Sum of the
product (SOP) is the Disjunctive Normal Form. Every
branch in a tree that ends in the node of the same class is
the conjunction (product) of values for that class, whereas

disjoint branches terminating in the same class form
the disjunction (sum).

The most difficult part of implementing a decision tree is figuring out which attributes should be placed at the tree's root and at the subsequent nodes. This procedure is called "attribute selection." An attribute selection measure is used to identify the attribute that should serve as the root node at each level.

2.1.5. Decision Trees working


The correctness of a tree is significantly affected by the
selection of where to make strategic splits. Classification
trees & regression trees use distinct sets of criteria for
making decisions.

When determining whether or not to divide a node into


two or even more sub-nodes, the decision trees employ a
combination of methods. The homogeneity of the resulting
sub-nodes improves as more of them are created. In other
words, the node's purity improves as measured against the
criterion variable. The nodes are partitioned based on all of
the relevant criteria, and the decision tree ultimately
chooses the partition that yields the most similar child
nodes.

The nature of the desired variables also plays a role in the


algorithm selection process. Consider a few Decision Tree
Algorithms:

ID3 → (extension of D3).

C4.5 → (successor of ID3).

CART → (Classification And Regression Tree).

CHAID → (Chi-square automatic interaction detection


Performs multi-level splits when computing classification
trees).

MARS → (multivariate adaptive regression splines).

ID3 builds decision trees using a top-down greedy search through the space of possible branches, with no backtracking. As its name implies, a greedy algorithm always chooses the option that provides the most immediate benefit.

Steps in ID3 algorithm:

• The algorithm begins with the original set S as the root node.

• On every iteration, it goes through the unused attributes of the set S and calculates the entropy (H) and information gain (IG) of each attribute (a small computational sketch follows this list).

• It then selects the attribute with the largest information gain and lowest entropy.

• The set S is split on the chosen attribute to produce subsets of the data.

• The algorithm continues to recurse on each subset, considering only attributes that have never been selected before.
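A small computational sketch (not from the book) of the entropy and information-gain calculations used in these steps; the toy attribute values and labels are invented for illustration:

import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i * log2(p_i) over the classes present in `labels`."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """IG = H(S) - sum |S_v|/|S| * H(S_v) over values v of the attribute."""
    total = len(labels)
    gain = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr_index], []).append(label)
    for subset in by_value.values():
        gain -= (len(subset) / total) * entropy(subset)
    return gain

# Toy data: each row is (Outlook, Windy) and the label is PlayTennis.
rows = [("Sunny", "No"), ("Sunny", "Yes"), ("Rain", "No"), ("Rain", "Yes")]
labels = ["No", "No", "Yes", "No"]
print(information_gain(rows, labels, 0))  # gain of the Outlook attribute
print(information_gain(rows, labels, 1))  # gain of the Windy attribute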

Attribute Selection Measures

Choosing which attribute to place at the root, or at the various internal nodes of the tree, is a hard problem when the dataset has N attributes. Simply selecting a node at random as the root does not solve it; random approaches usually give poor results with low accuracy.

Researchers have worked on this attribute selection problem and suggested using criteria such as:

• Entropy.

• Reduction in Variance.

• Gini index.

• Information gain.

• Gain Ratio.

• Chi-Square.

Entropy

The entropy of information being processed is a


quantitative measure of its inherent unpredictability. As
entropy increases, it becomes more difficult to infer
meaning from data. An example of an activity that gives
random information is flipping a coin.

Figure 2.2 A graph showing Entropy*

This graph shows that when a probability is either "0 or 1",


an entropy H(X) is 0. At a probability of 0.5, the Entropy is
at its highest, suggesting that the data is completely
random and that there is no way of accurately predicting
the result.

The entropy of a single attribute is expressed mathematically as:

“E(S) = − Σi pi log2(pi)”

* https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html
Where S represents the current state and pi represents the probability of event i in state S, i.e. the fraction of examples in S belonging to class i.

Entropy with respect to a chosen attribute may be expressed mathematically as:

“E(T, X) = Σc∈X P(c) E(c)”

Where T → “current state” and X → “selected attribute”, and the sum runs over the values c of X.

Information Gain

Information gain (IG) is a metric used in statistics to assess


how effectively a feature distinguishes across classes of
training data. Finding the characteristic with the greatest
information gain and also the lowest entropy is the key to
building a good decision tree.

Information gain is a decrease in entropy: given the attribute values, it measures the difference between the entropy of the dataset before the split and the weighted entropy after the split. The ID3 (Iterative Dichotomiser) decision tree method uses information gain.

Mathematically, IG can be represented as:

“IG(T, X) = Entropy(T) − Entropy(T, X)”

A more elementary way of stating the same thing is:

“Gain = Entropy(before) − Σ j=1..K Entropy(j, after)”

Where "before" is the original dataset, "K" is the number of subsets created by the split, and "(j, after)" is subset j after the split.

Gini Index

The Gini index may be thought of as a cost function for assessing splits of the dataset. It is computed by subtracting the sum of the squared probabilities of each class from one. It is simple and effective, and it favours larger partitions, whereas information gain tends to prefer smaller partitions with more distinct values.

The Gini index works with categorical "success"/"failure" target variables, and it performs only binary splits.

How to Determine the Gini Index for a split:

• Compute the Gini for each sub-node using the probabilities of success (p) and failure (q): p² + q².

• Compute the Gini index of the split as the weighted average of the sub-nodes' Gini scores.
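A minimal sketch of these two steps, following the p² + q² convention used above (the example labels are invented):

def gini_score(labels):
    """Gini score as used above: p^2 + q^2 over the classes in `labels`."""
    total = len(labels)
    return sum((labels.count(c) / total) ** 2 for c in set(labels))

def gini_of_split(groups):
    """Weighted Gini score of a split: sum |group|/n * gini_score(group)."""
    n = sum(len(g) for g in groups)
    return sum((len(g) / n) * gini_score(g) for g in groups)

left = ["success", "success", "failure"]               # sub-node 1
right = ["failure", "failure", "success", "failure"]   # sub-node 2
print(gini_of_split([left, right]))  # higher is purer under this convention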

Gain ratio

From an information-gain perspective, attributes with a large number of distinct values tend to be favoured as root nodes. In other words, information gain is biased towards attributes with a wider variety of possible values.

Gain ratio, a variant of information gain that reduces this bias, is used in C4.5, a refinement of ID3. The gain ratio corrects information gain by taking into account the number of branches that would result before making the split, i.e. by accounting for the intrinsic information of the split.

Where "before" is the original dataset, "K" is the total


number of new datasets created by a split, and "(j, after)" is
the newly created dataset.

Reduction in Variance

One such approach for continuous dependent variables is


the reduction in the variance (regression problems). For
making the optimal split, this technique uses the well-
known variance formula. The criterion used to divide the
population is the one that results in the lowest variance:

Variance = Σ (X − X̄)² / n, where X̄ (the bar above X) is the mean of the values, X represents the actual values, and n is the number of values.

Steps to calculate Variance:

1. For each node, compute its variance.

2. Using a weighted average of the variances at each


node, get the variance for each partition.

Chi-Square

CHAID is an abbreviation for "Chi-squared Automatic


Interaction Detector." It dates back to antiquity and is still
used today to categorise trees. It determines whether or
not the differences between child and parent nodes are
statistically significant. We quantify this by tallying the
total of the standard deviations between the actual and
predicted frequencies of the dependent variable.

It may be used with a binary outcome variable such as


"Success" or "Failure." More than one split is possible. A
larger Chi-Square value indicates a more significant
statistical difference between the sub-node and the Parent
node.

The resulting structure is a CHAID tree (Chi-square


Automatic Interaction Detector).

Methods for Determining Chi-Square in a Split:

1. Calculate the Chi-square for an individual node by computing the deviation between the actual and expected counts of both Success and Failure.

2. Calculate the Chi-square of the split as the sum of the Chi-square values of the Success and Failure outcomes of all its child nodes.

We may express Chi-square mathematically as:

“Chi-Square = √((Actual − Expected)² / Expected)”

2.1.6. Avoiding/countering Overfitting in Decision


Trees
A typical issue with decision trees is overfitting, especially on datasets with many features (columns). Sometimes it looks as though the tree has memorised its training data: with no constraints, a decision tree can produce at least one leaf for every observation, achieving perfect accuracy on the training set. In turn, this reduces the precision with which we can forecast samples that are not included in the training set.

We can eliminate overfitting in two ways:

1. Pruning Decision Trees.

2. Random Forest.

Pruning Decision Trees

When the stopping requirements are met, the process


stops and mature trees are produced. However, the fully
mature tree often overfits the data, which results in low
accuracy on new data.

Pruning is the process of removing branches (decision nodes), working backwards from the leaf nodes, in a way that does not compromise the overall accuracy of the tree. The actual training set is split into two parts: a training set D and a validation set V. A decision tree is built from the training set D, and the tree is then pruned as needed to maximise its accuracy on the validation set V.

Figure 2.3 An example of an Original and pruned tree *

* https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html

In the figure above, the overfitting problem is addressed by pruning the 'Age' attribute on the left-hand side of the tree, since it has more importance on the right-hand side of the tree.

Random Forest

Random Forest is an instance of ensemble learning, in which many decision trees are combined so that the overall prediction accuracy is better than that of any single tree.
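As a brief, hedged illustration (toy data generated on the fly; scikit-learn's estimators are assumed), an ensemble of trees can be compared against a single tree like this:

# Compare a single decision tree with a random forest on a toy dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("single tree accuracy  :", tree.score(X_test, y_test))
print("random forest accuracy:", forest.score(X_test, y_test))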

2.2. Decision tree representation

Figure 2.4 An example of the Decision tree representation*

*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm

An instance is classified using a decision tree by being moved from the root node to a leaf node, which contains
the classification. Using the above diagram as a guide, we
may classify an instance by first locating its root node, then
checking the attribute indicated by this node, and then
proceeding along the tree branch according to the value of
an attribute. After then, the new node's subtree undergoes
the same steps.

2.3. Issues in decision tree learning


Decision trees may be useful for learning, but they can also
provide some practical challenges.

• Figuring out how deep to grow the decision tree.

• Handling continuous-valued attributes.

• Choosing an appropriate attribute selection measure.

• Handling training data with missing attribute values.

• Handling attributes with differing costs.

• Improving computational efficiency.

1. Avoiding Overfitting the Data

If the model can correctly generalise to fresh input data


from the issue domain, we may call it a good machine
learning model. As a result, we can make predictions
about data that a data model has never seen before. Let's

say we're interested in testing our machine learning
model's ability to adapt to new information. Overfitting &
underfitting are two key causes of the machine learning
algorithms' poor results in this area.

Underfitting

Underfitting occurs in machine learning algorithms when


they fail to recognise an important pattern in the data. The
precision of our ML model is ruined by underfitting. If this
happens, it only signifies that our model or algorithm isn't
a good match for the data. This often occurs when there is
insufficient data to construct a reliable model, or when a
linear model is attempted on data that exhibit non-linear
behaviour. When there is insufficient information, the
machine learning model is likely to produce inaccurate
predictions because its rules are overly forgiving and
general. More data & feature reduction through feature
selection are also good ways to prevent underfitting.

Overfitting

When we train the machine learning algorithm with an


abundance of data, we say that the algorithm is overfitted.
When a model is fed so much information, it begins to pick
up on the inaccuracies and noise that exist within our data.
If there are too many outliers and irrelevant details in the
data, the model will misclassify the information.
Overfitting occurs when the machine learning algorithm is
given too much leeway in creating a model from the
dataset, as is the case with non-parametric & non-linear
approaches. Overfitting may be prevented by adjusting the

parameters of a supervised learning algorithm, such as the
maximum depth of a decision tree, or by using a linear
method if the data is linear.

Our ID3 method may run into trouble when there is noise in the data, or when there are too few training examples to provide a representative sample of the true target function. In such cases this approach can produce trees that overfit the training examples.

Definition — Overfit: A hypothesis h in the hypothesis space H is said to overfit the training data if there exists some alternative hypothesis h' in H such that h has a smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.

Figure 2.5 Accuracy shown on training and test data as a function of tree size*

*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm

The impact of overfitting in a typical application of decision tree learning is shown in the figure above. In this example, the ID3 algorithm is applied to the task of determining whether individuals in a healthcare setting have diabetes.

The horizontal axis of this diagram shows the total number of nodes in the decision tree as it is being constructed, while the vertical axis shows the accuracy of the tree's predictions.

The decision tree's performance on a training set is shown


by the solid line, while performance on the test set is
depicted by the broken line (not involved in a training set).

Here, we'll attempt to figure out what happens when a


positive training sample that was mislabeled as a negative
is added to the dataset.

<Sunny, Hot, Normal, Strong, −>; the example is noisy because its correct label is +.

Given the original, error-free data, ID3 produces the decision tree shown in Figure 2.6 below.

Now that this erroneous example has been included, ID3


will generate a more intricate tree. Specifically, the new
instance would be placed in the same leaf node as D9 and
D11 from the prior positive instances in the learnt tree
shown in the preceding Figure. ID3 will look for additional
refinements to a tree below this node since the new
instance is marked as the negative example. This causes
ID3 to produce a decision tree h that is more complex than the original tree h'. Of course, h will fit the collection of training examples perfectly, whereas the simpler h' will not.

Figure 2.6 Decision tree generated by ID3*

*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm

Because the new decision node is simply a consequence of fitting the noisy training example, we expect h' to outperform h on subsequent data drawn from the same distribution of instances.

Even if the training data are devoid of noise, overfitting


may still occur, particularly if just a few samples are linked
to leaf nodes. Specifically, it is conceivable for accidental
regularities to arise, in which an unrelated feature occurs

to divide the cases extremely effectively. There is a danger
of overfitting if there are such accidental regularities.

Avoiding Overfitting —

Overfitting may be avoided in decision tree learning in a


few different ways. We may divide them into two
categories:

• Pre-pruning (avoidance): Prematurely stopping


tree growth before it can classify the training set
completely.

• Post-pruning (recovery): Post-prune the tree after it has been allowed to overfit the data.

Post-pruning overfit trees have indeed been proven to be


more effective than the first method, which may appear
more straightforward at first. This is because it is hard to
determine exactly when to cease growing the tree using
the first method. The dilemma of what criteria should be
employed to identify the proper end tree size arises
whether the right size is obtained by halting early or
by post-pruning.

The correct final tree size can be determined using criteria such as the following.

One approach is to use a separate set of examples, distinct from those used for training, to evaluate the utility of post-pruning nodes from the tree.

Another approach is to use all the available data for training, but to apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.

A third approach is to use an explicit measure of the complexity of encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimised. This strategy is called the "Minimum Description Length" principle.

MDL — Minimise: size(tree) + size(misclassifications(tree)).

The first method, which is typically called a training &


validation set strategy, has proven to be the most popular.
Here, we'll go through the two most common variations of
this strategy. The available data are split into two
collections: a training set, from which the learnt hypothesis
is derived, and the validation set, from which the
correctness of the hypothesis is assessed in light of the
following data and, in particular, the influence of trimming
this hypothesis.

The rationale is that the validation set is less likely to


display random oscillations that might mislead the learner
than the training set. Because of this, the validation set is
relied upon to serve as a safeguard against overfitting the
spurious features of a training set. Of course, the
validation set has to be sufficiently big to offer a
representative sample of instances. A popular approach is
to train using two-thirds of the data and then use the
remaining one-third as a validation set.

1. Reduced Error Pruning

If we utilize a validation set, how can we ensure that we


are not overfitting the data? One method, reduced-error
pruning (Quinlan, 1987), takes into account all of the tree’s
decision nodes as potential trims.

• When doing reduced-error pruning, all of the tree's


decision nodes are considered potential cuts.

• When a decision node is pruned, the subtree


originating from it is removed, the node becomes a
leaf, & the most frequent label from the training
instances associated with it is assigned to it.

• Nodes are trimmed off the tree only if the resultant


structure performs as well as, or better than, the
original across the validation set.

• This has the effect that any leaf node added because of coincidental regularities in the training set is likely to be pruned, because those same coincidences are unlikely to occur in the validation set.

The accompanying diagram depicts how reduced-error


pruning improves the decision tree's precision.

Using a separate data set to guide pruning is an effective approach when a substantial quantity of data is available. One common heuristic is to split the data so that 60% goes to the training set, 20% to the validation set, and 20% to the test set. The major drawback of this approach is that when data is scarce, withholding part of it for the validation set further reduces the number of training examples.

Figure 2.7 Accuracy versus tree size, showing the tree's accuracy assessed over both training and test cases.*

*https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm

In the following paragraphs, we will discuss another


method of pruning that has proven effective in many real-
world scenarios with sparse data. Various other methods
have also been suggested, most of which involve
repeatedly and arbitrarily slicing the available data into
smaller subsets before averaging the findings.

2. Rule Post-Pruning

Rule post-pruning entails the following procedure:
• Allow overfitting to occur by inferring a decision
tree from a training set and expanding the tree
until training data is fit optimally.

• Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.

• Prune (generalise) each rule by removing any preconditions whose removal improves its estimated accuracy.

• Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.

To illustrate, consider again the decision tree in the earlier diagram. One rule is generated for each leaf node of the tree: the classification at the leaf node becomes the rule consequent (postcondition), and each attribute test along the path from the root to that leaf becomes a rule antecedent (precondition). For instance, the leftmost branch of the tree in the illustration yields the rule:

IF (Outlook = Sunny) ∧ (Humidity = High)

THEN PlayTennis = No

Following this, each rule is "pruned" by eliminating any


prerequisites that do not contribute to the estimated
correctness of the rule. The aforementioned rule suggests
that rule post-pruning should take into account the
possibility of eliminating the prerequisites.

81
(Outlook = Sunny) and (Humidity = High).

Whichever of such pruning processes improved estimated


rule accuracy the most would be chosen, and then
a second precondition may be pruned as the further
pruning step.

No pruning is done if it will lower the predicted rule


accuracy.

There are three major benefits to trimming the decision


tree by first turning it into rules:

• Converting to rules allows you to distinguish the different contexts in which a decision node is used. Because each distinct path through the node produces a distinct rule, the pruning decision regarding that attribute test can be made differently for each path.

• If you switch to using rules, you no longer need to


differentiate between attribute checks that are
performed closer to a tree's root and those that are
performed closer to its leaves. This eliminates the
need to rearrange the tree or deal with other
complicated accounting concerns if a root node is
trimmed while some of a subtree below this test is
kept.

• Converting to rules improves readability. Rules are often easier for people to understand than the tree itself.

2. Incorporating Continuous-Valued Attributes

Our initial definition of ID3 is restricted to attributes that take on a discrete set of values.

1. The target attribute whose value is predicted by the learnt tree must be discrete valued.

2. The attributes tested in the decision nodes of the tree must also be discrete valued.

This second restriction can easily be removed so that continuous-valued decision attributes can be incorporated into the learned tree. For a continuous-valued attribute A, the algorithm can dynamically create a new boolean attribute Ac that is true if A < c and false otherwise. The only open issue is how to determine the best threshold value c.

Suppose we wish to include the continuous-valued attribute Temperature. Suppose further that the Temperature values and the target attribute (PlayTennis) values of the training examples associated with a particular decision tree node are as follows:

Which threshold-based boolean attribute should be defined based on Temperature?

Clearly, we would like to pick a threshold c that produces the greatest information gain. By sorting the examples according to the continuous attribute A and then identifying adjacent examples that differ in their target classification, we can generate a set of candidate thresholds midway between the corresponding values of A. It can be shown that the value of c that maximises information gain must always lie at such a boundary. These candidate thresholds can then be evaluated by computing the information gain associated with each.

The values of (48 + 60)/2 and (80 + 90)/2 are potential


cutoffs here since they correspond to the points at which
a value of PlayTennis shifts.

The best candidate attribute (Temperature >54) may then


be chosen by calculating the information gain for every
candidate attribute (Temperature >54 and Temperature
>85).

When the decision tree is being built, this newly generated


boolean attribute may compete with other candidate
attributes that have discrete values.
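A small sketch of this candidate-threshold search; the Temperature values below are assumptions chosen so that the boundaries 54 and 85 discussed above appear (the sketch tests attributes of the form A < c, which define the same boundaries as A > c):

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between adjacent sorted values whose labels differ;
    return the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (None, -1.0)
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue
        c = (v1 + v2) / 2.0
        below = [l for v, l in pairs if v < c]
        above = [l for v, l in pairs if v >= c]
        gain = base - (len(below) / len(pairs)) * entropy(below) \
                    - (len(above) / len(pairs)) * entropy(above)
        if gain > best[1]:
            best = (c, gain)
    return best

temps = [40, 48, 60, 72, 80, 90]                      # assumed toy values
play = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, play))  # candidate thresholds are 54 and 85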

3. Alternative Measures for Selecting Attributes

The information gain measure has a natural bias: it favours attributes with many values over those with few values.

To see why, consider the attribute Date, which can take on an enormous number of possible values. What is wrong with the Date attribute? Simply put, because it takes on so many distinct values, it is bound to split the training examples into very small subsets, and it will therefore have very high information gain relative to the training examples.

Despite its substantial information gain, it is a poor


predictor of a target function for new data.

Alternate measure-1

One alternative measure that has been used successfully is the gain ratio. The gain ratio measure penalises attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data:

“SplitInformation(S, A) = − Σ i=1..c (|Si| / |S|) log2(|Si| / |S|)”

where S1 through Sc are the c subsets of examples resulting from partitioning S by the c-valued attribute A. Note that SplitInformation is the entropy of S with respect to the values of attribute A. This is in contrast to our earlier uses of entropy, in which we considered only the entropy of S with respect to the target attribute whose value is to be predicted by the learned tree.

With the original Gain measure and this new SplitInformation in mind, we can define the GainRatio as follows:

“GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)”

The SplitInformation term discourages the selection of attributes with many uniformly distributed values (e.g., Date).

The denominator may be zero or extremely tiny when |Si|


≈ |S| for one of the Si, which is a problem when using
GainRatio instead of Gain to select the attributes. For
qualities that have the same value for almost all
participants in S, this renders the GainRatio either
undefined or very high. We may use a heuristic, such as
computing the Gain of each attribute and then using the
GainRatio test to only consider qualities with "above-
average Gain", to avoid picking attributes only on this
basis.

Alternate measure-2

To overcome this problem, Lopez de Mantaras (1991) proposed a distance-based alternative to GainRatio.
The basis of this metric is the establishment of the distance
measure between the various data subsets. The value of an
attribute is determined by how much the data partition it
produces deviates from an ideal partition (i.e., the partition
which perfectly classifies training data). Attributes are
selected based on which division comes closest to being
optimal. The predictive accuracy of induced trees is not
noticeably different from that achieved with Gain & Gain
Ratio measures, and it is not skewed toward characteristics
with a high number of values. However, in contrast to the

GainRatio measure, this distance metric has the advantage
of producing much smaller trees when dealing with data
sets whose characteristics have a large variation in
the number of values they may take.

4. Handling Missing Attribute Values

Some properties may not have a corresponding value in


the provided data. In the medical field, for instance, the
Blood-Test-Result might only be accessible for certain
individuals, despite our desire to make prognoses based
on the results of a battery of laboratory tests. It is common practice to estimate the missing attribute value based on other examples for which this attribute has a known value.

Take the case where, at node n in a decision tree, you need


to determine whether attribute A is the best one to test. In
this case, you'll be calculating Gain(S, A). Let's pretend
that A(x) isn't known, but that the training instances in S
include the pair (x, c(x)).

Method-1

When an attribute value is absent, one solution is to use


the value most often seen in the training samples at node
n. Alternatively, we may give it the value that occurs most
often among instances at node n that belong to class c(x).
Once this estimated value for A(x) has been included in a
more detailed training example, the current decision tree
learning method may utilize it without any more
modification.

Method-2

A second, more involved method involves giving each


potential value of A certain probability. The frequencies of
the different values for A among the instances at node n
may be used to recalculate these probabilities.

For the boolean attribute A, the likelihood that A(x) = 1 is


0.6 and the probability that A(x) = 0 is 0.4 if node n has six
known cases with A = 1 and 4 with A = 0.

A fractional 0.6 of instance x is distributed down the branch for A = 1, and a fractional 0.4 down the other branch of the tree. These fractional examples are used to compute information gain, and they can be further subdivided at subsequent branches of the tree if a second missing attribute value must be tested. After learning, the same fractioning technique can be applied to classify new instances whose attribute values are unknown: the classification of the new instance is simply the most probable classification, computed by summing the weights of the instance fragments classified in different ways at the leaf nodes of the tree.

C4.5 makes use of this technique for dealing with missing


attribute values.

5. Handling Attributes with Differing Costs

In certain cases of data mining, the instance characteristics


may incur charges. A patient's biopsy result, temperature,
pulse rate, blood test results, etc. might all be used in a
disease classification training set. The financial and

emotional toll that these features have on the patient is
very variable. For these kinds of jobs, we choose decision
trees that make efficient use of cheap features, turning to
pricey ones only when they're necessary for accurate
categorization.

Adding a cost term to attribute selection measure is one


way to make ID3 more cost-aware. By dividing the Gain
by the attribute's cost, for instance, we may favour
qualities with lower costs. These cost-sensitive techniques
do not ensure that the best cost-sensitive decision tree will
be discovered, but they do weigh the search more heavily
towards cheap characteristics.

Method-1

One such strategy is described and implemented in the


context of a robot perception challenge by Tan and
Schlimmer (1990) and Tan (1993), where the robot should
learn to categorise objects based on how they may be
handled by the robot’s manipulator. Here, the qualities
map to data collected by the robot's moveable sonar. To
calculate the cost of the attribute, we look at how long it
takes to position and operate the sonar to collect the
attribute value. They show that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure with the measure

“Gain²(S, A) / Cost(A)”
Method-2

Similarly, Nunez (1988) describes a related approach for learning medical diagnosis rules, where the attributes are different symptoms and laboratory tests with differing costs. His system uses the attribute selection measure

“(2^Gain(S, A) − 1) / (Cost(A) + 1)^w”

where w ∈ [0, 1] is a constant that determines the relative weight given to cost versus information gain.

2.4. Instance-based Learning


Instance-based learning describes Machine Learning
systems that memorise their training examples before
applying what they've learned to novel cases using a
similarity metric. The name "instance-based" comes from
the fact that the system's hypotheses are constructed using
data collected from training examples. The term "memory-
based learning" (or "lazy learning") also applies. This
algorithm's time complexity is proportional to the amount
of training data. This approach has an O (n) "worst-case
time complexity", where n is the total number of training
cases.

If we were to build a spam filter using the instance-based


learning algorithm, for example, it would be able to
recognize messages that are highly similar to those that
have previously been flagged as spam. This calls for a way
to compare the similarity of the two emails. The same

sender, frequent usage of the same terms, or any other
factor might serve as a measure of correspondence
similarity between the two emails.

Advantages:

• Local approximations of a target function may be


produced rather than global estimations.

• This algorithm is very flexible and can quickly


learn from data that is gathered in real-time.

Disadvantages:

• It's expensive to classify data.

• In addition to the need for a lot of memory to hold


the data, every query also necessitates the
identification of the local model from the scratch.

The following are examples of algorithms used in instance-


based learning:

• K Nearest Neighbor (KNN).

• Learning Vector Quantization (LVQ).

• Self-Organizing Map (SOM).

• Locally Weighted Learning (LWL).

2.5. K nearest neighbour

The supervised learning method known as K-nearest


neighbours (KNN) may be utilised for both regression &
classification tasks. KNN predicts the correct class for a test point by computing the distance between the test point and all the training points and then selecting the K training points closest to the test data. For classification, it assigns the class that occurs most often among those K neighbours; for regression, the predicted value is the average of the values of the K chosen training points.

To help illustrate this point, please view the illustration


below.

Let's say we have a photo of an animal that may be either a
cat or a dog, but we need to know which it is. KNN may
be used for this identification since it is based on
the measure of similarity. Our KNN model would
compare a new data set to cat and dog photographs and
classify the images into the appropriate categories
depending on the similarities between the two sets of data.

Figure 2.8 An example depicting the KNN classifier*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

Why do we need a K-NN Algorithm?

Assuming we have two groups, let's call them A and B,


and a new data item x1, we want to know where it
belongs. In this case, a K-NN method is what is required to

find a solution. K-NN is useful for quickly and accurately
determining a dataset's class. Think about the diagram
below:

Figure 2.9 A graph showing results before K-NN and after K-NN*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

How does K-NN work?

K-Nearest Neighbors' operation may be understood in


light of the following method:

Step 1: Select the number K of neighbours.

Step 2: Calculate the Euclidean distance between the new point and the training points.

Step 3: Take the K nearest neighbours according to the calculated Euclidean distance.

Step 4: Among these K neighbours, count the number of data points in each category.

Step 5: Assign the new data point to the category for which the number of neighbours is greatest.

Step 6: Our model is ready.

Let's pretend we've discovered some fresh information


that has to be filed away. Take a look at the picture below:

Figure 2.10 Result obtained for nearest neighbour*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

• As a first step, let's determine how many neigh-


bours we want, therefore we'll set k=5.

• Following this, we will determine the Euclidean


distance between the spots. In geometry, we have

previously examined the concept of the Euclidean
distance, which is the physical separation of any
two points. The formula for this is:

Figure 2.11 Obtaining Euclidean distance with the help of a graph*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

We found the nearest neighbours using the Euclidean distance formula: three neighbours in category A and two neighbours in category B. Take a look at the figure below:

Figure 2.12 Categorizing the data*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

Hence this new data point must belong to category A, since the majority (three) of its five nearest neighbours are from that category.

Choosing a K value

A large K value means that many neighbours are consulted. Classifying a test point requires calculating the distances between that point and all the labelled training points. Because KNN is a lazy learning method, this computation is deferred until a query is made, and it can be expensive for large training sets.

Figure 2.13 A new example to classify data is shown on a graph*

*https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

As can be seen in the graphic above, if we continue with


"K=3", we forecast that a test input belongs to "class B", and
if we continue with "K=7", we anticipate that a test input
belongs to "class A".

Thus, it is easy to foresee that the K value has a significant


impact on KNN efficiency.

The question then becomes, how should K be chosen?

• Unfortunately, optimal values of K cannot be


determined using any of the established statistical
approaches.

• Just choose a number for K at random and begin
calculating.

• Choosing a small value for K makes the decision boundaries unstable.

• A larger K value gives smoother decision boundaries with fewer abrupt transitions between classes, which is better for classification.

• Create a scatter plot showing the error rate vs K for


a given set of values. Then, choose K such that the
error rate is minimised.
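A minimal sketch of K-NN classification and of choosing K by plotting the error rate against K, using scikit-learn and matplotlib (the iris dataset is used only as a convenient example):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

error_rates = []
k_values = range(1, 26)
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rates.append(1 - knn.score(X_test, y_test))   # error rate for this K

plt.plot(k_values, error_rates, marker="o")
plt.xlabel("K")
plt.ylabel("error rate")
plt.title("Choosing K: error rate versus K")
plt.show()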

CHAPTER-3: Probability and
Bayes Learning

3.1. Bayesian Learning


One of the most promising areas of AI development is
machine learning. In the 21st century, everything is
powered by cutting-edge technology and devices, many of
which have yet to be fully explored or used. Like many
other emerging technologies, Machine Learning is in its
infancy. Machine learning is an improved technology
because of many ideas, such as supervised learning,
reinforcement learning, unsupervised learning, neural
networks, perceptron models, etc. The Bayes Theorem is
another essential principle of machine learning, and it is
the topic of this article. However, you must have a basic
familiarity with Bayes' theorem before delving into this
subject, including knowledge of its definition, its use
in Machine Learning, some instances of its application, and
so on. Let's go into a quick explanation of Bayes' theorem.

The Bayes theorem is named after Thomas Bayes, an English statistician, philosopher, and Presbyterian minister of the 18th century. Bayes' contributions to decision
theory are widely employed in fundamental mathematical
ideas like probability. When making class predictions
in Machine Learning, Bayes' theorem is also often utilised.
Machine learning applications, such as classification tasks,
make use of a notion from Bayes' theorem known as the
Bayesian technique to compute the conditional probability.
To cut down on computation time and project costs, a simplified form of Bayes' theorem (Naïve Bayes classification) is also used.

Bayes' theorem goes under a few other names, including


the Bayes rule and Bayes Law. Bayes's theorem is useful
for estimating the likelihood of an occurrence when only a
little information is available. It may be used to figure out
how likely it is that one thing will happen if another thing
has previously happened. The relationship between
condition probability & marginal probability may be
determined using this technique.

To put it another way, Bayes' theorem aids in the


contribution of more precise findings.

Bayes' Theorem gives a formula for determining


conditional probability, which may be used to make value
estimates. Ironically, this seemingly complex method is
utilised to simply determine the conditional likelihood of
occurrences when common sense falls short. However,
contrary to the beliefs of some statisticians, Bayes' theorem
is not only used by the banking and insurance sectors.
Bayes's theorem is used widely in fields outside of finance,

such as health and medicine, research and surveying,
aviation, etc.

Bayes Theorem

One of the most well-known ideas in machine learning is


Bayes' theorem, which provides a method for estimating
the likelihood of one event given only partial evidence of
its occurrence.

Bayes' theorem may be derived using the product rule and the conditional probability of event X given event Y.

According to the product rule, the probability of both X and Y occurring can be written in terms of event Y:

“P(X ∩ Y) = P(X|Y) P(Y)” {equation 1}

Likewise, it can be written in terms of event X:

“P(X ∩ Y) = P(Y|X) P(X)” {equation 2}

Equating the right-hand sides of the two equations and dividing by P(Y) gives the mathematical expression of Bayes' theorem:

“P(X|Y) = P(Y|X) P(X) / P(Y)”

Here X and Y are any two events with P(Y) > 0.

Bayes Rule, often known as Bayes Theorem, is the above
equation:

• P(X|Y) is the posterior, which we need to compute: the updated probability of the hypothesis after the evidence has been taken into account.

• P(Y|X) is the likelihood: the probability of observing the evidence given that the hypothesis is true.

• P(X) is the prior probability: the probability of the hypothesis before the evidence is considered.

• P(Y) is the marginal probability: the overall probability of the evidence, regardless of the hypothesis.

Thus, Bayes Theorem can be represented as:

Posterior = likelihood * prior / evidence.
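A small numeric sketch of this formula; the prior, likelihood, and false-positive values below are invented purely for illustration:

# Bayes' theorem with made-up numbers: a test for a condition.
prior = 0.01            # P(X): prior probability of the condition (assumption)
likelihood = 0.95       # P(Y|X): probability of a positive test given the condition
false_positive = 0.05   # P(Y|not X): probability of a positive test otherwise

# Marginal probability of the evidence, P(Y), via the law of total probability.
evidence = likelihood * prior + false_positive * (1 - prior)

# Posterior = likelihood * prior / evidence.
posterior = likelihood * prior / evidence
print("P(condition | positive test) =", round(posterior, 4))  # about 0.16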

Prerequisites for the Bayes Theorem

There are just a few fundamental ideas that must be


grasped before delving into the Bayes theorem. Following
is a list of them:

1. Experiment

Tossing a coin, picking a card at random, rolling a die, etc.,


are all examples of experiments since they are
premeditated actions performed in a regulated setting.

2. Sample Space

The results of the experiment are termed outcomes, and


the collection of all potential results of the event is termed
the sample space. The sample space for the dice throw, for
instance, would be:

S1 = {1, 2, 3, 4, 5, 6}.

In a similar vein, if our experiment involves tossing a coin


and noting the results, our sample space would be:

S2 = {Head, Tail}.

3. Event

An experiment's event may be thought of as a specific


subset of its whole sample space. In addition, it might be
thought of as a collection of results.

Let's pretend that our dice-rolling experiment involves two


events, A and B, where;

A = Event when the even number is achieved = {2, 4, 6}.

B = Event when the number is larger than 4 = {5, 6}.

Probability of event A, P(A) = number of favourable outcomes / total number of possible outcomes

P(A) = 3/6 = 1/2 = 0.5

Similarly,

Probability of event B, P(B) = number of favourable outcomes / total number of possible outcomes

P(B) = 2/6 = 1/3 = 0.333

Union of an event A and B:

A∪B = {2, 4, 5, 6}.

The intersection of an event A and B:

A∩B= {6}

Disjoint Event: Disjoint events, also referred to
as mutually exclusive events, are those whose intersection
is "empty" or "null".

4. Random Variable

A random variable is a real-valued function that maps the sample space of an experiment to the real line. A random variable takes on random values, each with some probability. Despite its name it is neither random nor a variable: it behaves as a function, which may be discrete, continuous, or a combination of both.

5. Exhaustive Event

As the name suggests, a set of events is exhaustive if at least one of the events must occur at any given time. Thus, two events A and B are exhaustive if either A or B definitely occurs, and they are mutually exclusive if they cannot both occur at once.

6. Independent Event

When one event does not have any bearing on the other,
we say that the two occurrences are independent of each
other. To put it another way, the chances of either
happening do not rely on the other.

Two occurrences, A and B, are said to be independent in


mathematics if and only if they cannot be determined to
have any effect on one another:

P(A ∩ B) = P(AB) = P(A)*P(B).

7. Conditional Probability

The conditional probability of event A is the probability of A given that event B has occurred (i.e. A conditional on B). We write it P(A|B), "the probability of A given B", defined as:

P(A|B) = P(A ∩ B) / P(B).

8. Marginal Probability

The marginal probability of event A is the probability of A occurring irrespective of whether event B occurs. It can be computed as the probability of A weighted over all the scenarios for B:

P(A) = P(A|B)*P(B) + P(A|~B)*P(~B).

3.2. Naïve Bayes


Naive Bayes techniques are a family of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Given a class variable y and a dependent feature vector x1 through xn, Bayes' theorem states the following relationship:

“P(y | x1, …, xn) = P(y) P(x1, …, xn | y) / P(x1, …, xn)”

If we make the naive assumption of conditional independence,

“P(xi | y, x1, …, xi−1, xi+1, …, xn) = P(xi | y)”

for all i, the relationship simplifies to:

“P(y | x1, …, xn) = P(y) Πi P(xi | y) / P(x1, …, xn)”

Since P(x1, …, xn) remains constant for a given input, we may use the following classification rule:

“ŷ = argmax_y P(y) Πi P(xi | y)”

Maximum A Posteriori (MAP) estimation may be used to estimate P(y) and P(xi | y); the former is then the relative frequency of class y in the training set.

The different naive Bayes classifiers differ mainly in the assumptions they make regarding the distribution of P(xi | y).

Document categorization and spam filtering are two well-


known applications where "naive Bayes classifiers" have
proven effective, despite their seemingly simplified

assumptions. To estimate the required parameters, just a
minimal quantity of training data is needed.

Compared to more sophisticated methods, naive Bayes learners and classifiers can be extremely fast. The decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. This, in turn, helps alleviate problems stemming from the curse of dimensionality.

However, the probability outputs from predict_proba


should not be taken too seriously since, despite naive
Bayes's reputation as a good classifier, it is infamous for
being a poor estimator.

Gaussian Naive Bayes

The Gaussian Naive Bayes classification technique is implemented in GaussianNB. The likelihood of each feature is assumed to be Gaussian:

P(xi | y) = 1 / sqrt(2πσy²) · exp(−(xi − μy)² / (2σy²))

The parameters σy and μy are estimated using maximum likelihood.
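As a brief, hedged illustration (not part of the original text), the sketch below fits scikit-learn's GaussianNB on the Iris dataset; the dataset choice and the train/test split are assumptions made purely for the example.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()                  # estimates mu_y and sigma_y per class and feature
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))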

Multinomial Naive Bayes

MultinomialNB implements the naive Bayes algorithm for multinomially distributed data and is one of the two classic naive Bayes variants used in text classification. For every class y, the distribution is parameterized by a vector θy = (θy1, …, θyn), where n is the number of features (the vocabulary size in the case of text classification) and θyi is the probability P(xi | y) of feature i occurring in a sample belonging to class y.

The parameters θy are estimated by a smoothed version of maximum likelihood, i.e. relative frequency counting:

θ̂yi = (Nyi + α) / (Ny + αn)

where Nyi is the number of times feature i appears in the samples of class y in the training set T, and Ny = Σi Nyi is the total count of all features for class y.

To avoid future calculations resulting in zero probability,


the smoothing priors α≥0 are used to adjust for
characteristics that were not present in the training
samples. Laplace smoothing is achieved by setting α=1,
whereas Lidstone smoothing is achieved by setting α<1.
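For concreteness, here is a small hedged sketch (the toy corpus and its labels are invented for illustration) that pairs a word-count vectorizer with MultinomialNB and sets α = 1, i.e. Laplace smoothing.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free prize money", "meeting at noon", "win money now", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(docs)           # word-count feature matrix

clf = MultinomialNB(alpha=1.0)        # alpha = 1.0 corresponds to Laplace smoothing
clf.fit(X, labels)
print(clf.predict(vec.transform(["free money meeting"])))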

Complement Naive Bayes

ComplementNB implements the Complement Naive Bayes (CNB) algorithm. CNB is an adaptation of the standard multinomial naive Bayes (MNB) method that is particularly suited to imbalanced datasets. To calculate the model's weights, CNB uses statistics from the complement of each class. The developers of CNB show empirically that its parameter estimates are more stable than those of MNB, and that CNB regularly beats MNB on text categorization tasks, frequently by a large margin. The weights are computed as follows:

θ̂ci = (αi + Σj:yj≠c dij) / (α + Σj:yj≠c Σk dkj)

wci = log θ̂ci
wci = wci / Σj |wcj|

where the summations run over all documents j not in class c, dij is the count or tf-idf value of term i in document j, αi is a smoothing hyperparameter similar to the one in MNB, and α = Σi αi. The second normalisation corrects the tendency of longer documents to skew the MNB parameter estimates. The classification rule is:

ĉ = argminc Σi ti wci

that is, a document is assigned to the class that is the poorest complement match.

Bernoulli Naive Bayes

BernoulliNB implements naive Bayes training and classification for data distributed according to multivariate Bernoulli distributions; that is, there may be numerous features, but each one is assumed to be a binary-valued (Bernoulli, Boolean) variable. Samples must therefore be provided as binary-valued feature vectors; if any other kind of data is handed to it, a BernoulliNB instance may binarize its input (depending on its binarize parameter).

The decision rule for Bernoulli naive Bayes is based on:

P(xi | y) = P(i | y) · xi + (1 − P(i | y)) · (1 − xi)

which differs from the multinomial NB rule in that it explicitly penalises the non-occurrence of a feature i that is an indicator of class y, whereas the multinomial variant would simply ignore a non-occurring feature.

For text classification, it is possible to train and employ this


classifier using word occurrence vectors (instead of word
count vectors).

Some datasets, notably those with shorter texts, may


benefit more from Bernoulli’s implementation. If possible,
it is recommended to compare both models.
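A minimal hedged sketch (the random binary data is invented purely for illustration) of BernoulliNB on binary feature vectors:

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
X = rng.randint(2, size=(100, 10))    # 100 samples with 10 binary (0/1) features
y = rng.randint(2, size=100)          # binary class labels

clf = BernoulliNB()                   # default binarize=0.0 keeps 0/1 inputs unchanged
clf.fit(X, y)
print(clf.predict(X[:3]))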

Categorical Naive Bayes

CategoricalNB implements the categorical naive Bayes algorithm for categorically distributed data. It assumes that each feature, indexed by i, follows its own categorical distribution.

For every feature i in the training set X, CategoricalNB estimates a categorical distribution for that feature conditioned on the class y. The index set of the samples is defined as J = {1, …, m}, where m is the total number of samples.

The probability of category t in feature i, given class c, is estimated as:

P(xi = t | y = c; α) = (Ntic + α) / (Nc + α ni)

where Ntic = |{j ∈ J ∣ xij = t, yj = c}| is the number of times category t appears in feature i among the samples of class c, Nc = |{j ∈ J ∣ yj = c}| is the total number of samples in class c, α is the smoothing parameter, and ni is the number of available categories of feature i.

CategoricalNB assumes that the sample matrix X is encoded (for example with an OrdinalEncoder) so that all categories of each feature i are represented by the numbers 0, …, ni − 1, where ni is the number of categories of that feature.
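As a hedged sketch (the toy data is invented for illustration), the snippet below ordinally encodes two categorical features and fits CategoricalNB:

from sklearn.preprocessing import OrdinalEncoder
from sklearn.naive_bayes import CategoricalNB

X_raw = [["red", "small"], ["blue", "large"], ["red", "large"], ["blue", "small"]]
y = [0, 1, 1, 0]

enc = OrdinalEncoder()                # maps each category to 0 .. ni - 1
X = enc.fit_transform(X_raw)

clf = CategoricalNB(alpha=1.0)
clf.fit(X, y)
print(clf.predict(enc.transform([["red", "large"]])))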

3.3. Logistic Regression
In statistical analysis, logistic regression is used to make
predictions about a binary result from a series of
observations, like yes or no.

Using the connection between one or more independent


variables, the logistic regression model may make
predictions about a dependent data variable. In politics,
logistic regression may be used to forecast an electoral
victory or defeat, while in higher education, it could be
used to determine if a high school senior would be
accepted to a certain university. These simple yes/no
choices are made possible by these binary results.

Multiple criteria may be used as inputs to the logistic


regression model. A logistic model can weigh a wide range of variables, for example a student's SAT score, grade point average, and number of extracurricular activities, when deciding whether to admit them to college. The system assigns a score between
0 and 1 to new instances based on the likelihood of their
falling into one of two result categories, determined by
analysing data from previous cases with the same input
criteria.

In recent years, logistic regression has emerged as a


powerful technique in the field of machine learning. It
enables the classification of new data based on previously
collected data using the algorithms used in machine
learning applications. Algorithms improve their ability to

anticipate classes within data sets when more relevant data
is added.

Data preparation tasks may benefit from logistic regression


by facilitating the classification of data sets into
predetermined categories before an extract, transform, and
load (ETL) phase, which sets the scene for further analysis.

3.3.1. Purpose of logistic regression

Logistic regression simplifies the arithmetic involved in


determining the correlation between several factors (such
as age, gender, and ad placement) and a single result (e.g.,
click-through or ignore). The generated models may be
used to determine how beneficial various treatments are
for various age groups, sex groups, and other
demographics.

Furthermore, logistic models may be used to develop


features for use in various forms of artificial intelligence
and machine learning by transforming raw data streams.
Logistic regression is a popular machine learning approach
for two-class issues like "this or that" predictions (which
have two possible answers: "yes" or "no"), "A" or "B," etc.

In addition to estimating the likelihood of occurrences,


logistic regression may also identify associations between
attributes and estimated probabilities. Specifically, it may
be put to use in a classification setting by building a model
that links the number of hours spent studying to the
probability of success or failure. When the number of study hours is supplied as a feature and the response variable takes two values, pass and fail, the same model may be used to predict whether or not a student will pass.
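As a hedged sketch of this pass/fail setting (the study-hours data below is made up for illustration), scikit-learn's LogisticRegression can be fitted and queried as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])    # 1 = pass, 0 = fail (invented labels)

model = LogisticRegression()
model.fit(hours, passed)

print(model.predict([[2.75]]))         # predicted class for 2.75 hours of study
print(model.predict_proba([[2.75]]))   # [P(fail), P(pass)]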

Logistic regression applications in business

Logistic regression results are used by businesses to


improve their strategies for reaching their objectives, such
as cutting costs and improving profits and marketing
return on investment.

An online retailer that sends out pricey direct mail


promotions to clients, for instance, would benefit from
knowing which customers are most likely to take
advantage of the offers. Propensity-to-react modelling is a
technique used in marketing to predict how likely a
consumer is to make a purchase.

Similarly, a credit card business would create a model


based on factors like a customer's yearly income, the
customer's monthly credit card payments, and the client's
history of defaults to determine the likelihood that the
customer would not pay its credit card balance in full. The
term "default propensity modelling" is often used to
describe this practice in the banking industry.

3.3.2. Types of logistic regression


Categorical response definitions lead to the three distinct
logistic regression models.

• Binary logistic regression: With this method, there


are just two potential values for the dependent
variable (or response) being studied (e.g. 0 or 1).

Common applications include determining if an e-
mail is spam or not, and determining whether a
tumour is cancerous. This method is the most
popular choice for binary classification, and it is
also the most used method for logistic regression.

• Multinomial logistic regression: The dependent


variable in this kind of logistic regression model
may take on three or even more values, but these
values are not ranked in any particular order.
Studios, for the sake of better-targeted advertising,
want to know, for instance, what kind of film a
certain audience member is most likely to attend at
the theatre. Studio executives may learn how much
age, gender, and relationship status impact movie
preferences by using the multinomial logistic
regression model. The company may then tailor its
marketing to a certain demographic it knows will
be interested in seeing the film.

• Ordinal logistic regression: When a response


variable may take on three or more distinct
values—with a clear hierarchy among them—an
ordered logistic regression model is used. Grading
schemes from A to F, as well as rating scales from 1
to 5, are both instances of ordinal replies.

Importance of logistic regression

As a result of logistic regression, the complex probability


calculations may be simplified into an elementary
arithmetic problem, making it a powerful tool. The

computation itself is quite involved, but much of the
drudgery may be eliminated with the use of contemporary
statistical software. This greatly reduces the complexity of
controlling for confounding variables and assessing the
influence of various variables.

Therefore, statisticians can rapidly predict and investigate


the contribution of several variables to a particular
outcome.

The medical researcher, for instance, would be interested


in how a new medicine affects treatment results for people
of varying ages. For this, you'll need to do a lot of nested
multiplication & division to compare the results for young
and older people who never received treatment, for
the younger people who received treatment, for the older
people who received a treatment, and for the overall
spontaneous healing rate of the entire group. In logistic regression, the relative likelihood of each subgroup is expressed as a logarithmic quantity, the regression coefficient, and these log-odds values can simply be added to or subtracted from one another to obtain the final result.
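To make this arithmetic concrete, here is a small hedged sketch with made-up coefficient values (none of these numbers come from the text): the log-odds contributions are simply added, and the logistic function converts the total back into a probability.

import math

intercept = -3.0          # assumed baseline log-odds of spontaneous healing
coef_treatment = 1.2      # assumed log-odds added by receiving the treatment
coef_older = -0.4         # assumed log-odds added by being in the older group

log_odds = intercept + coef_treatment + coef_older   # an older, treated patient
probability = 1.0 / (1.0 + math.exp(-log_odds))      # logistic (sigmoid) transform
print(round(probability, 3))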

The simpler regression coefficients they provide may also


be used to streamline the development of additional
machine learning & data science.

The key assumptions of logistic regression

Logistic regression requires a few assumptions to be made


by statisticians as well as citizen data scientists. It is
essential, to begin with, that the variables be treated as

separate. This means that although a model may utilise
the zip code and a person's gender, it would not be
possible to use the zip code and a person's state of
residence.

When complicated machine learning, as well as data


science applications, begin using logistic regression, other
less obvious associations between variables might well be
lost in the shuffle. Data scientists may take great pains, for
instance, to exclude potential discriminatory factors like a
person's gender or race from the model's equations.
However, they may be unwittingly weaved into
an algorithm via other factors, such as a person's zip code,
educational background, or interests.

The raw data is also assumed to be representative of


unique or singular events. When conducting a survey,
such as one to gauge consumer satisfaction, it is important
to get responses from a variety of individuals. However,
these findings might be distorted if the same person
conducted the poll numerous times from various email
addresses to accumulate points toward a prize.

Logarithmic odds, which provide a little more leeway than


linear chances, are also useful for describing the
connection between the factors and the result.

Also, a high sample size is needed for logistic regression.


The minimum number of samples for every model
variable is 10. However, this criterion increases as the
likelihood of each result decreases.

Logistic regression also presumes that all variables may be
represented by a pair of discrete labels, like "male" or
"female" or "click" or "no-click." Categories having more
than two classes need a specific method to properly
represent them. It's possible, for instance, to split up a
single category containing three age bands into three
distinct variables, each of which would indicate whether
or not a given person falls inside that age band.

Logistic regression use cases

Since it allows advertisers to estimate the proportion of


website visitors who would click on certain ads, logistic
regression has found ready use in online marketing.

The following applications of logistic regression are also


possible:

• In healthcare, to determine predisposition to illness and develop preventative strategies.

• In drug research, to disentangle the effects of a treatment across demographic variables such as age, gender, and race.

• In snowfall prediction apps and other weather-forecasting methods.

• To gauge support for a candidate in an election.

• Using a person's gender, age, and physical


condition to calculate the likelihood that one would
die before the policy's term ends.

• Using a borrower's yearly income, default history,
and outstanding obligations, banks may calculate
the likelihood that a customer would fail on a loan.

Advantages and disadvantages of logistic regression

Logistic regression's key benefit is how simple it is to


implement and train in comparison to other machine learning and AI programmes.

Furthermore, it is a very effective technique when the


data's varying results can be separated into linearly
distinct categories. Therefore, a straight line may be drawn
between the outcomes of the logistic regression analysis.

Statisticians are drawn to logistic regression in large part


because of its ability to shed light on the connections
between variables but also their effects on outcomes. More
studying is often connected with better test results,
therefore this might be used to rapidly establish whether
two variables are "positively or negatively correlated", as
in the example given above. Nonetheless, please be aware
that to go from correlation to causation, further methods,
such as causal AI, are needed.

Logistic regression tools

Before the invention of computers, calculating logistic


regression was a tedious and time-consuming process.
Logistic regression features have now become standard
fare in contemporary statistical analytics software like
SPSS and SAS.

Numerous implementations of the logistic regression and
integration of its findings into other techniques may be
found in R and Python-based "data science programming
languages & frameworks". Logistic regression analysis
may also be done using several other tools and methods in
addition to Excel.

To fully realise the benefits of data science democratisation, managers should also think about additional
data preparation and administration technologies. Data
warehouses & data lakes, for instance, may assist organise
massive data collections for study. Any problems with
the logistic regression's quality or usability may be
uncovered with the use of data catalogue tools. Analytics
executives may benefit from data science platforms by
establishing proper boundaries for the widespread use of
logistic regression.

3.4. Support Vector Machine: Introduction


The “Support Vector Machine”, or “SVM”, is a well-
known Supervised Learning technique that may be used
for both classification and regression tasks. However, its
most common use is in Machine Learning for
Classification.

For future convenience in classifying new data points, the


SVM algorithm seeks to find the optimal line or decision
boundary which divides n-dimensional space into
the classes. A hyperplane describes the optimal boundaries
for making decisions.

When creating a hyperplane, SVM selects the extremal
points and vectors. Support vectors are used to represent
these extreme circumstances, which is why the
corresponding technique is named a Support Vector
Machine. Take a look at the picture below, in which
the decision boundary (or hyperplane) is used to classify
items into two groups:

Example: Using the KNN classifier as an instance of SVM's


utility is a good place to start learning about it. Assuming
we encounter a peculiar cat that shares characteristics with
dogs, we may like to develop a model that could reliably
tell us whether we're looking at a cat or a dog. The SVM
method allows us to achieve just that. Our model would be

* https://www.javatpoint.com/machine-learning-support-vector-machine-algorithm

trained on thousands of photographs of cats & dogs to
teach it to recognise those species, and then we'll put it to
the test with this outlandish animal. As a result of the
support vector's tendency to draw a line of demarcation
between two sets of data (in this instance, cats and dogs)
and pick out extreme examples, this would focus on the
latter. It will be labelled a cat on the basis of the support vectors.
Think about the diagram below:

The SVM algorithm has several potential applications,


including facial recognition, text classification, image
classification, etc.

o Types of SVM

SVM can be of two types:

o Linear SVM: For data that can be neatly divided


into two categories by a straight line, a classifier
known as a Linear Support Vector Machine (SVM)
is used.

o Non-linear SVM: If the dataset cannot be
categorised using the straight line, we call it non-
linear data, and the classifier we use to categorise it
is called a Non-linear Support Vector Machine
(SVM).

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane:

In n-dimensional space, there may be several lines or


decision borders that may be used to separate the classes;
nevertheless, it is necessary to choose the most effective
decision boundary for classifying the data. The optimal
boundary, or hyperplane of SVM, is so named because of
its smoothness.

If there are just two characteristics (as seen in the


illustration), then the hyperplane would be a straight line
since its dimensions are determined by those features.
Moreover, if there are three characteristics, the hyperplane
would be a two-dimensional plane.

When making a hyperplane, we always aim for the largest


possible margin or the greatest possible separation
between the data points.

Support Vectors:

The term "Support Vector" is used to describe the set


of data points or vectors that have the most influence on
the hyperplane's location. The name "Support vector"

comes from the fact that these vectors help keep the
hyperplane stable.

3.4.1. SVM working

Linear SVM

The SVM algorithm's operation may be grasped with the


aid of a case study. Let's pretend we have a dataset with
two labels (green and blue) and two features (x1 and x2) to
work with. The coordinates (x1, x2) need to be classified as
either green or blue, thus a classifier that could accomplish
that would be ideal. Take a look at the picture below:

Figure 3.1 Classification of coordinates in 2-D space*

Since this is a 2-dimensional space, a simple straight line
will do to divide these categories. However, these
categories may also be split along more than one line. Take
a look at the picture below:

Figure 3.2 Optimal line or decision border is shown in 2-D space*

As a result, the SVM method helps locate the optimal line or decision boundary; this optimal boundary or region is called the hyperplane. The SVM method picks out the extreme points from both classes that lie closest to this boundary; these points are called the support vectors. The distance between the support vectors and the hyperplane is called the margin, and the goal of SVM is to maximise this margin. The hyperplane with the maximum margin is the optimal hyperplane.

Figure 3.3 Maximized margin shown in 2-D space*

Non-Linear SVM:

In the case of linearly organised data, a straight line may


be used to demarcate the various segments, but this is not
possible with non-linear data. Take a look at the picture
below:

Figure 3.4 Non-Linear SVM*

Therefore, we need to add another dimension to


distinguish these specks of information. As we have only
employed two dimensions (x, y) so far, we will need to
include a third (z) when dealing with non-linear data. The
formula for determining it is:

z = x² + y²

Adding the third dimension to the sample space results in


the following diagram:

* https://www.edureka.co/blog/support-vector-machine-in-python/

Figure 3.5 Results of the third dimension of the sample space*

Therefore, SVM would categorise the datasets as follows.


Take a look at the picture below:

Figure 3.6 Categorization of the dataset†

* https://www.edureka.co/blog/support-vector-machine-in-python/
† https://www.edureka.co/blog/support-vector-machine-in-python/

Because we are working in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-D space with z = 1, the boundary becomes a circle of radius 1.

Python Implementation of Support Vector Machine

We would now use Python to create a working version of


the SVM algorithm. As in our previous work with Logistic
regression & KNN classification, we would be using
a user_data dataset here.

Figure 3.7 Python Implementation of Support Vector Machine *

* https://www.edureka.co/blog/support-vector-machine-in-python/
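The implementation itself appears above only as a screenshot (Figure 3.7). As a runnable stand-in, here is a hedged sketch that uses a synthetic two-class dataset in place of the user_data file; the dataset, the split, and the kernel choice are assumptions made for illustration only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", C=1.0)     # linear SVM; kernel="rbf" handles non-linear data
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)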


3.5. Probability Theory


To address the situations where machine learning can be put to use, we need a vocabulary that can precisely express the problems we are facing.

Random Variables

Suppose we are interested in the probability of rolling a 1 with a die. Since each of the six possible values X = {1, . . . , 6} is equally likely to appear, we expect to see a 1 about once in every six rolls. Probability theory lets us model the uncertainty in the outcomes of such experiments. Formally, we say that the probability of rolling a 1 is 1/6.

The results of many tests, such as the toss of a dice, are


purely numerical and hence easy to manipulate.
Sometimes, the results are not numbers, as when we flip a
coin and see whether it lands on heads or tails. Putting a
monetary value on the results helps in these situations.
The random variable is used for this purpose. Let's say we
want to simulate a coin toss by assigning the value 1 to the
random variable X when it comes up heads and also the
value -1 when it doesn't. For the sake of clarity, we would
use capital letters (e.g., X, Y, etc.) to signify random
variables as well as lowercase letters (e.g., x, y, etc.) to
indicate the values they represent.

The random variable ξ maps the experimental outcomes in X to real values. As an example, let X represent all the possible patients seen by a doctor, and let ξ be the mapping from X to a patient's height or body mass index.

Distributions

Perhaps the most useful way to characterise a random variable is to assign probabilities to the values it can take. If the random variable is discrete, taking on a finite number of values, this assignment of probabilities is called the probability mass function (PMF). By definition, the entries of a PMF are non-negative and must sum to one. For example, if a coin is fair, so that heads and tails are equally likely, the random variable X defined above takes the values +1 and −1 each with probability 0.5. This may also be expressed as:

Pr(X = +1) = Pr(X = −1) = 0.5

When the context is clear, we will use the somewhat casual notation p(x) := Pr(X = x).

When probabilities are assigned to a continuous random variable, the result is a probability density function (PDF). The terms density and distribution are commonly used in place of probability density function; this is a slight abuse of language but an accepted practice. Like a PMF, a PDF must be non-negative and must integrate to 1. Two distributions are shown in the Figure.

Figure 3.8 Two common densities: the uniform distribution over [−1, 1] (left) and the Gaussian (normal) distribution with zero mean and unit variance (right)*

The uniform distribution over the range [−1, 1] is shown on the left; the Gaussian distribution with zero mean and unit variance (also called the normal distribution) is shown on the right.

* https://alex.smola.org/drafts/thebook.pdf

In practice we often work with the integral of the density p up to a given point; this is the cumulative distribution function (CDF), a statistical tool that is used frequently in practice.

Mean and Variance

The issue of a random variable's anticipated value is a


prevalent one. Taking a voltage metre as an example, it's
reasonable to wonder what normal readings may look like.
A doctor considering giving a kid growth hormone may
want to know what an appropriate height range is.
Expectations and associated distributional quantities must
be defined for this purpose.

(Mean) The mean (or expected value) of a random variable X is defined as:

E[X] = ∫ x dp(x)

More generally, if f: R → R is a function, then f(X) is again a random variable, and its mean is defined as:

E[f(X)] = ∫ f(x) dp(x)

Whenever X is a discrete random variable, the integral may be replaced by a summation:

E[f(X)] = Σx f(x) p(x)

For instance, in the case of a die, each of the six possible outcomes has a 1 in 6 chance of occurring, and it is easy to check that the mean is (1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.

The anticipated value of the random variable may be


calculated using its mean. As stock brokers, we can care
about things like our investment's predicted worth in a
year. We should also look at the potential downsides of
our investment. Specifically, how probable it is that the
investment's value will diverge from its expectation, since
this may be more important for our selections.

This necessitates the existence of a metric for measuring


the degree of uncertainty around a given outcome. The
variance of the random variable may be used as such a
metric.

(Variance) The variance of a random variable X is defined as:

Var[X] = E[(X − E[X])²]

As before, if f: R → R is a function, then the variance of f(X) is given by:

Var[f(X)] = E[(f(X) − E[f(X)])²]

The variance quantifies the typical dispersion of f(X) around its mean. An upper bound on the variance can be used to guarantee that f(X) stays close to its expected value, which is one reason why the variance of a random variable is commonly used as a measure of the risk associated with it. Keep in mind that the standard deviation, which is the square root of the variance, is often used instead when describing the characteristics of random variables.
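As a small numeric illustration (a sketch that is not taken from the text), the following computes the mean, variance, and standard deviation of a fair six-sided die directly from the definitions above.

import math

outcomes = [1, 2, 3, 4, 5, 6]
p = 1.0 / len(outcomes)                                # each outcome has probability 1/6

mean = sum(x * p for x in outcomes)                    # E[X] = 3.5
variance = sum((x - mean) ** 2 * p for x in outcomes)  # E[(X - E[X])^2] ≈ 2.9167
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)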

Marginalization, Independence, Conditioning, and Bayes Rule

Two random variables X and Y may have a joint density p(x, y). To recover p(x) from the joint density, we integrate out y; this process is called marginalization:

p(x) = ∫ p(x, y) dy

If Y is a discrete random variable, a summation may be used instead of the integral:

p(x) = Σy p(x, y)

If the values taken by X are never affected by the values of Y, we say that X and Y are independent, i.e. p(x, y) = p(x) p(y).

When estimating the behaviour of a large number


of random variables simultaneously, independence is
helpful. Repeated measurements of the quantity, like
the voltage of the device, are usually treated as if they
came from the same distribution and were performed
independently of one another.

The result of a subsequent voltage measurement won't be
affected by previous measurements. We'll refer to such
random variables as iids (short for "independent as well as
identically distributed") since they behave in the same way
in every possible situation. For an illustration of a
dependent & independent random variable pair, see
Figure.

Figure 3.9 A graph showing dependent & independent random


variable pair*

Left: a sample drawn from two dependent random variables; knowing the first coordinate gives us useful information for estimating the second. Right: a sample drawn from two independent random variables, obtained by randomly permuting the second coordinate of the dependent sample.

On the other side, dependencies may be quite helpful in


classification & regression issues. As an illustration,
the traffic lights at an intersection are interdependent. Because of this, a motorist may safely assume that while the lights are
"green" in his direction, there would be no traffic crossing

* https://alex.smola.org/drafts/thebook.pdf

his way because the other lights would be red. Similarly,
when shown with a numeric image (x), we expect that x is
somehow linked to the word for that digit (y).

When dealing with dependent random variables, we are particularly interested in conditional probabilities, i.e. the likelihood that X takes on a certain value given the value of Y. Clearly, Pr(X = rain | Y = cloudy) is higher than Pr(X = rain | Y = sunny). That is to say, the distribution of X is heavily influenced by knowledge of the value of Y. This is captured by the conditional probability:

p(x | y) = p(x, y) / p(y)

CHAPTER-4: Artificial Neural
Networks

4.1. Introduction
To model complex patterns and foresee problems, experts
often turn to Artificial Neural Networks (ANN), which are
algorithms inspired by the way the brain operates.
Inspired by how the human brain uses Biological Neural
Networks to learn, the "Artificial Neural Network" (ANN)
is a kind of deep learning. The creation of ANN came
about as a consequence of efforts to simulate brain activity.
Biological neural networks and ANNs have many
functional similarities but have some key differences.
When it comes to input, the ANN algorithm can only deal
with numbers and structured information.

Unstructured and non-numerical data types such as images, text, and speech can be handled by Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).

4.1.1. Artificial Neural Networks Architecture


1. The network's architecture consists of three distinct kinds of layers: an input layer, one or more hidden layers, and an output layer. Because of this layered structure, such networks are known as Multi-Layer Perceptrons (MLPs).

Figure 4.1 Artificial Neural Networks Architecture*

2. The hidden layer may be seen as the "distillation layer,"


which selects the most important input patterns and
passes them on to the next layer for processing. By
prioritising and filtering inputs to extract just the most
relevant data, it improves the network's speed and
performance.

3. The output layer takes the processed information from the hidden layers and produces the network's final result, such as a predicted class or value.

* https://www.analyticsvidhya.com/blog/2021/09/introduction-to-artificial-neural-networks/

4.1.2. Basic Structure of ANNs
The concept of ANNs is predicated on the premise that the
functioning of the human brain may be mimicked by
employing silicon and wires as biological neurons &
dendrites.

There are about "86 billion neurons" in the human brain.


Axons link them to several other cells. Dendrites can take
in information from the sense organs and the outside
world. The electric impulses generated by these inputs
move swiftly across the brain network. After that, the
neuron either relays the information to another neuron to
deal with the problem or discards the information.

Figure 4.2 Basic Structure of ANNs*

* https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm

Multi-node ANNs mimic the actual neurons seen in the
human brain. The neurons have linkages between them
and communicate with one another. The nodes may
process basic information with the help of inputs. It is then
the job of other neurons to receive and process the results
of these calculations. Activation is another name for the
value produced by a node.

In this system, the value of each connection is quantified.


ANNs can learn, which happens when the weight values
are adjusted. Below is an example of a very basic ANN:–

Figure 4.3 An example of a basic ANN*

* https://www.tutorialspoint.com/artificial_intelligence/artificial_intelligence_neural_networks.htm

4.1.3. Types of Artificial Neural Networks
ANN may have either a FeedForward or a Feedback
topology.

FeedForward ANN

This ANN only allows for one-way transmission of data. A


unit communicates with another unit but receives no
feedback. All feedback mechanisms have been eliminated.
One common use is in the creation, identification, and
categorization of patterns. Their inputs and outputs are
predetermined.

Figure 4.4 FeedForward ANN*

* https://www.javatpoint.com/artificial-neural-network

FeedBack ANN

Here, feedback loops are permitted. Such networks are used in content-addressable memories (CAMs).

Figure 4.5 FeedBack ANN*

4.1.4. Working of ANNs


The ideal representation of the artificial neural network is
a directed graph with weights assigned to the artificial
neurons that make up the graph. We may think of the

* https://www.javatpoint.com/artificial-neural-network

connection between the neuron's output and the input as
directed edges with the weights. The input signal for
the Artificial Neural Network is often a vector
representing a pattern or an image from an outside source.
After that, for every n number of the inputs, the x(n)
notation is used to mathematically assign values to them.

Figure 4.6 Working of ANNs*

After that, we multiply each input by its associated weight


( these weights are the details utilised by the "artificial
neural networks" to solve the specific problem ). These

* https://www.javatpoint.com/artificial-neural-network

weights often stand for the robustness of the connections
between individual neurons in the ANN. Internally, the
computer compiles a summary of all the inputs' relative
weights.

If the weighted sum is equal to zero, a bias is added to make the output non-zero, or to otherwise scale up the system's response. The bias is treated as an extra input fixed at 1 with its own weight.

The sum of the weighted inputs can, in principle, be any value up to positive infinity. To keep the response within an acceptable range, a maximum value is used as a benchmark, and the activation function is applied to the sum of the weighted inputs.

We call the collection of the transfer functions that provide


the desired result the "activation function." The activation
function may come in a few distinct flavours but often falls
into one of two categories: linear or non-linear. The linear
set, Binary set, and the Tan hyperbolic sigmoidal set are
just a few of the more popular families of activation
functions. Let's break down each of them into their
constituent parts:

Binary:

The binary activation function always outputs either a 1 or a 0. A cutoff (threshold) value is defined for this purpose: the final output of the activation function is one if the net weighted input of the neuron is greater than 1, and zero otherwise.

Sigmoidal Hyperbolic

The "S" shaped curve is the most common visual re-


presentation of the Sigmoidal Hyperbola function. In this
case, a tan hyperbolic function is employed to
approximatively derive output from the real net input. We
may define the function as:

F(x) = (1/1 + exp(-????x))

Where “ ????” represents the “Steepness parameter”.
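To make the computation above concrete, here is a minimal sketch (all numbers are invented for illustration) of a single artificial neuron: a weighted sum plus bias, followed by a binary activation and a sigmoidal activation with β = 1.

import math

inputs = [0.5, 0.3, 0.9]           # x1, x2, x3 (assumed input signals)
weights = [0.4, -0.7, 0.2]         # one assumed weight per input
bias = 0.1

z = sum(x * w for x, w in zip(inputs, weights)) + bias   # weighted sum plus bias

binary_output = 1 if z > 0 else 0                        # binary (threshold) activation
sigmoid_output = 1.0 / (1.0 + math.exp(-z))              # sigmoidal activation, beta = 1

print(z, binary_output, sigmoid_output)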

4.1.5. Machine Learning in ANNs


To maximise their learning potential, ANNs need training.
Various methods of instruction exist, including:−

Supervised Learning − This requires a teacher that knows more than the ANN itself. For example, the teacher feeds in sample data for which the correct answers are already known.

Take pattern recognition as an example. While


recognising, the ANN makes educated estimates. After
that, the educator feeds the ANN the correct responses.
The network then evaluates its predictions against the
instructor's "right" replies and modifies its predictions
accordingly.

Unsupervised Learning − This is required when no labelled data set with known answers is available. Finding a hidden pattern is one such task: clustering, i.e. dividing the members of a data set into groups according to some undiscovered pattern, is performed using only the existing data.

Reinforcement Learning − This strategy is built on observation. The ANN makes a decision based on what it observes from its surroundings; whenever an observation is negative, the network readjusts its weights so that it can make a better judgement the next time.

4.1.6. Benefits of Artificial Neural Networks


ANNs have several advantages that make them useful in
certain contexts:

1. Given that many connections between inputs & outputs


in the actual world are non-linear & complex, ANNs'
ability to learn and represent such interactions is crucial.

2. Learning from the inputs and their correlations, an ANN


might also deduce unknown relationships from
the anonymous data, enabling it to generalise and forecast
unknown data.

3. Unlike several other methods of prediction, ANN does


not restrict the variables that may be used as inputs. Many
studies have shown that ANNs' ability to uncover hidden
correlations in data without imposing any predetermined
connections allows them to more accurately imitate
heteroskedasticity or data with significant volatility & non-
constant variance.

This is especially useful in projecting financial time series


(like stock prices) when there is substantial volatility in the
underlying data.

4.1.7. Disadvantages of Artificial Neural Networks
1. Hardware Dependence:

• Parallel processing units are required for usage in


the creation of Artificial Neural Networks.

• The realisation of an ANN therefore depends on suitable parallel hardware being available.

2. Understanding the network’s operation:

• If there is one major problem with ANN, this is it.

• When ANN delivers a response to a question, it


does not elaborate on how or why it arrived at that
solution.

• The trust of the network suffers as a consequence.

3. Assured network structure:

• The structure of artificial neural networks is not


determined by any universal law.

• A workable network architecture is found via trial


& error and experience.

4. Difficulty in presenting the issue to the network:

• ANNs can process numerical information.

• Problems need to be quantified in numerical terms


before ANN can be used.

• Whichever manner of presentation is used will


affect how well the network functions overall.

• It all depends on how skilled the user is.

5. The network’s lifetime is unknown

• Training is considered complete once the network's error on a sample has been reduced to a target value.

• Reaching that target value does not guarantee optimal results.

4.1.8. Applications of Artificial Neural Networks

1. Social Media

Social media makes extensive use of artificial neural


networks. Take Facebook's "People you might know"
function, which recommends users they may already
know in real life and encourages them to add them as
friends. In reality, this almost miraculous effect is
accomplished by using Artificial Neural Networks to go
through your profile, hobbies, existing friends, and
also their friends to determine who else you could be
connected to. Face recognition is another popular use
of Machine Learning in the social media realm. To do
this, convolutional neural networks are used to identify
around 100 landmarks on the subject's face and then
compare them to those in the database.

2. Marketing and Sales

Online stores like Amazon & Flipkart use your browsing


history to make suggestions for items you would want to
purchase when you log in. If you use a food delivery
service like Swiggy, Zomato, etc. and specify a preference
for Pasta, the app will propose restaurants to you. This is
achieved via the use of individualised marketing and is

prevalent in many areas of modern marketing, including
but not limited to book sites, hospitality sites, movie
services, etc. Artificial neural networks are used to learn
about a customer's preferences and behaviour based on
their past purchases and other data.

3. Healthcare

The field of oncology has made use of artificial neural


networks to develop algorithms that can detect malignant
tissue at the microscopic level with the same accuracy as
human pathologists. The use of Facial Analysis on patient
photographs has the potential to detect the early stages of
several uncommon illnesses that express themselves via
visible symptoms. Therefore, the widespread use of ANNs
in healthcare settings could only improve the diagnostic
skills of medical professionals and, in turn, contribute to
an increase in the standard of healthcare provided
throughout the globe.

4. Personal Assistants

You've probably used voice assistants like Siri, Cortana,


Alexa, etc., on the phones you own. Personal digital
assistants are a kind of voice recognition that implements
Natural Language Processing to enable conversation with
and provide answers for its users. To manage the
language's grammar, semantics, accurate speech,
communication, etc., Natural Language Processing
employs artificial neural networks designed for this
purpose.

4.2. Biological motivation
The discovery that the biological learning systems (like the
human brain) consist of very complex networks of linked
neurons has served as inspiration for the research of
artificial neural networks.

Figure 4.7 Biological motivation*

Electrical pulses (spikes) travelling along a long, thin strand called an axon are how neurons communicate with one another. Such pulses are received at terminals called synapses, which sit on dendrites, protrusions from the soma of the cell. Depending on the type of synapse, the chemical activity that these pulses trigger in the dendrites may either suppress or stimulate the creation of pulses in the receiving neuron. Over the course of its dendritic tree
& over time, the neuron adds up the impacts of thousands

* http://data-machine.net/nmtutorial/biologicalmotivation.htm

of impulses. When the total voltage within the cell goes
over a certain point, it "fires," producing a spike that
travels down the cell's axon. This sets off the cascade of
actions in the linked neurons.

Learning is a complicated process that involves altering


the efficacy of synapses to modify the impact of one
neuron on another. While neuroscience motivated ANN
research, researchers avoided the temptation to be
excessively precise in their attempts to mimic the
biological world. Therefore, most ANNs are connectionist
models, which combine basic computational building
blocks (called also units, neurons, or nodes). Changing the
weights of the connections between neurons is a common
way that ANNs learn. Neurons in certain ANNs may have
a kind of local memory.

4.3. The appropriate problem for ANN learning


• The training data may be noisy and complex, such as sensor data.

• On problems where symbolic algorithms such as decision tree learning (DTL) are used, ANNs give similarly accurate results.

• Instances are represented by attribute-value pairs; the attributes may be highly correlated or independent, and the values can be any real values.

• The target function may take discrete or continuous (real) values, or may be a vector of such values.

• Long training times are acceptable.

• Fast evaluation of the learned target function may be required.

• It is not important for humans to be able to understand the learned target function.

4.4. Perceptron
In the field of Machine Learning, the perceptron model refers to a supervised learning approach for binary classifiers. Analogous to a single neuron, a perceptron takes an input, applies a function to it, and assigns it to one of two categories.

A perceptron model, or simply a perceptron, mimics the behaviour of a real neuron in the brain by simulating the activity of a network of neurons. The perceptron is a linear ML technique that allows a neuron to learn from and register the information obtained from its inputs, enabling binary, i.e. two-class, classification.

Therefore, the perceptron model is the linear classification


model since it employs a hyperplane line to categorise two
inputs based on the two classes that the machine learns.
The perceptron model, created by "Frank Rosenblatt" in
1957, is a crucial component of Machine Learning (ML),
which is well known for its categorization goals and
method.

A perceptron model is made up of four separate parts.


Following is a list of them:-

• Input values.

• Net sum.

• Weights and bias.

• Activation function.

With the perceptron model, robots may automatically


learn weighting factors that aid in the categorization of
inputs. The perceptron model, also known as the "Linear
Binary Classifier", is a powerful tool for organizing and
categorizing large amounts of data.

4.4.1. Understanding the Perceptron


We would now have a more nuanced understanding of the
model due to our prior exposure to the perceptron model
in Machine Learning.

The perceptron model is the most basic form of artificial


neural networks, and it is based on the idea that a neuron
in the brain serves just to aid in a binary categorization of
data along a hyperplane.

Two distinct kinds of perceptron models exist:-

• Single Layer Perceptron- Specifically, the Single


Layer perceptron is characterised by its capacity to
linearly categorise inputs. This implies that the
inputs are classified along a single hyperplane line
using the previously learnt weights in this kind of
model.

• Multi-Layer Perceptron- To categorise inputs, the


Multi-Layer Perceptron is distinguished by its

capacity to use layers. This is a multi-layer
classification technique, which means that
machines may analyse inputs in parallel utilising
several different layers.

Model operation is grounded on the Perceptron Learning


Rule, which allows the programme to automatically learn
coefficients of the weights that indicate multiple inputs.

Each input is recorded by the computer and given a score


based on the coefficients that help place it in one of many
classes. The net total and activation function calculated
after the process will determine the outcome.

Let's go through an example to see how the perceptron


model works in practice.

1. Pieces of information are fed into the first layer as input values.

2. Each input value is multiplied by its weight (a previously learned coefficient), and the weighted values are summed.

3. The bias value is then added to this weighted sum.

4. The activation function is evaluated on the biased, weighted sum.

5. The resulting value is the output, which determines whether or not the neuron fires (i.e. whether the output is released).

Here is a quick rundown of the perceptron method using
the Heaviside activation function:-

f(z) = 1 if xᵀw + b > 0
f(z) = 0 otherwise
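A minimal sketch of this decision rule (the input, weights, and bias are made-up numbers used only for illustration):

import numpy as np

def perceptron_output(x, w, b):
    # Heaviside-style activation: 1 if x.w + b > 0, else 0
    return 1 if np.dot(x, w) + b > 0 else 0

x = np.array([1.0, 0.5])      # input values
w = np.array([0.6, -0.2])     # assumed learned weights
b = -0.3                      # assumed bias

print(perceptron_output(x, w, b))   # prints 1, since 0.6 - 0.1 - 0.3 = 0.2 > 0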

Artificial neurons from the field of artificial intelligence


serve as the model's input value, allowing information to
be fed into the system.

The machine's inputs are largely given the weight that has
been previously learnt by a perceptron algorithm
(dimension or strength of the connection between the data
units). After applying these weights to the input data, the
final tally may be calculated (total value).

At last, the input value is sent to an activation function,


which either produces an output or discards it. To decide
whether or not an input is higher than 0, a final stage's
activation function ("weighted total added with bias") is
crucial.

Training is the procedure that teaches a perceptron model the mathematical operations needed to transform inputs into outputs. Through this training procedure, a perceptron model enables computers to calculate output values for inputs whose answers were never explicitly programmed.

To prepare for the future and inculcate predictive patterns,


robots must be fed historical data throughout the training
phase. The perceptron model is a kind of machine learning
that evaluates input on the fly and generates qualitative
patterns in the same way that the human brain does.

4.4.2. Components of a Perceptron

Figure 4.8 Components of a Perceptron*

Four components make up a perceptron:

• Input Values: A group of input values used to


calculate a predicted output value. They are also
known as features & datasets, respectively.

• Weights: Each feature's "weight" is its actual,


quantifiable worth. It indicates how much weight
that particular attribute has in determining the final
value.

• Bias: Using bias, we may move the activation


function to the left or right. The y-intercept in a line
equation is a common way to think about it.

• Summation Function: A connection between the


weights & inputs is established by the summing
function. An algorithm is used to calculate their
total.

* https://www.analytixlabs.co.in/blog/what-is-perceptron/

• Activation Function: It makes the perceptron model
non-linear.

4.4.3. Characteristics of Perceptron


The following are features of the perceptron model.

1. The Perceptron method is used in machine learning


for the supervised learning of binary classifiers.

2. A weight coefficient is learnt automatically in


Perceptron.

3. At first, the decision of whether or not to activate a


neuron is determined by multiplying weights by
input characteristics.

4. To determine whether a weight function is larger


than zero, the activation function uses a step rule.

5. A linear decision boundary is designed to separate


the two classes +1 and -1 along a linear axis.

6. An output signal is generated if and only if the


total input value is greater than a predetermined
threshold.

4.4.4. Perceptron Model


The "perceptron model" was initially developed in 1957 at
"Cornell Aeronautical Laboratory" in the United States and
is used in computer-based image recognition. It was said
to be the most significant AI-based breakthrough since it
was the first "artificial neural network".

However, there were technological limitations to the
perceptron method. Considering its single layer, a
perceptron model could only be used for classes that could
be separated linearly. The problem was fixed, however,
when multi-layered perceptron algorithms were
developed. The many perceptron models are described in
depth below:

Single Layer Perceptron Model

The bare minimum for an artificial neural network is


a single-layer perceptron model. The threshold transfer
function is used with the feed-forward network which can
only assess linearly separable items. There are just two
possible results (goal) that the model can produce: 1 and 0.

Figure 4.9 Single Layer Perceptron Model*

* https://www.analytixlabs.co.in/blog/what-is-perceptron/

In the single-layer perceptron model, the algorithm begins with no prior knowledge: the weights are initialised randomly, and the model simply sums the weighted input values. The output of a single-layer perceptron is 1 if and only if this sum is greater than a threshold, i.e. a specified value.

We may say the perceptron has performed well when its outputs are close to the desired target values. Whenever there is a discrepancy between the expected and actual results, the weights must be adjusted to mitigate the impact of the error on subsequent predictions made with the same parameters.

Multilayer Perceptron Model

The backpropagation technique is implemented in a multi-


layer perceptron model. Despite seeming like a standard
perceptron, this model conceals one or more additional
layers.

Figure 4.10 Multilayer Perceptron Model*

There are two stages to running the backpropagation


algorithm:

• Forward phase − Activations are propagated from the input layer to the output layer. Once all the weighted inputs have been accumulated, the outputs are calculated using the sigmoid threshold function.

• Backward phase − The difference between the output layer's actual value and the required nominal value is propagated backwards. The weights and biases are then adjusted to obtain the desired result, with each component changed in proportion to its influence on the error.

* https://www.analytixlabs.co.in/blog/what-is-perceptron/

Advantages of Multi-Layer Perceptron

• Complex non-linear issues may be tackled with the


help of the multi-layered perceptron model.

• It performs well with both tiny and big input data.

• After the training, we can acquire accurate


predictions more quickly.

• The same ratio of accuracy may be achieved with


big and small data sets using this method.

Disadvantages of Multi-Layer Perceptron

• Processing data with Multi-layer perceptron is


laborious and time-consuming.

• Predicting the relative impact of the dependent


variable on every independent variable in
the multi-layer Perceptron is notoriously difficult.

• The performance of the model depends heavily on the quality of its training.

4.4.5. Limitations of the Perceptron Model


The following are some of the restrictions that come with
using a perceptron model:

• For the network to make adjustments depending


on the outcomes of every presentation, the input
vectors should be given to a network one at a time
or in batches.

• A hard limit transfer function ensures that the
perceptron can only produce a binary output
(either 0 or 1).

• It excels at classifying sets of inputs that can be


neatly divided along linear boundaries, but it
struggles with non-linear input vectors.

4.4.6. Significance of the Perceptron model

In Machine Learning, the Perceptron Model is


the supervised learning technique that prioritises input
categorization along a linear binary axis. The original goal
of developing this method was to help computers
recognise images.

To improve upon preexisting Machine Learning


algorithms and potentially produce new, more
sophisticated ones, the model was heralded as a watershed
advancement in the field of artificial intelligence.

Many people had high hopes for this breakthrough since it


seemed to work in most cases and gained popularity
rapidly at the outset. However, the infrastructure limits
were quickly revealed, and it was thought that it would be
a while before perceptron can be deployed without much
trouble.

The perceptron model is important because it is a learning


method that autonomously organises a network
of artificial neurons to integrate desirable characteristics,
allowing robots to perform well in binary classification
tasks.

A multi-layer perceptron allows the model to do input
classification with the assistance of several layers, making
it well-suited to more complicated inputs than a "single-
layer perceptron".

When it comes to Machine Learning, the most popular and


sought-after algorithm is the supervised learning
algorithm, such as a perceptron model. A perceptron
model is widely used in data analytics because it facilitates
binary categorization and leads to issue solutions
concerning data points.

4.4.7. Future of the Perceptron


Machine learning is an AI approach to gaining insight
from data via the development of intuitive patterns and
the subsequent application of these developed patterns.
Artificial intelligence, unlike traditional computer
programmes, does not need substantial programming to
achieve. AI researchers are always on the lookout for new
forms of computational intelligence which can be recreated
in computers.

In machine learning, the artificial neural network (ANN) is


a kind of deep learning architecture that employs artificial
intelligence to recognise and improve complicated patterns
and learning skills.

In this setting, the perceptron model has a promising and


potentially huge future. Why? The linear approach of
segregation used by the perceptron model for binary
classification is the primary reason for its popularity.

Future perceptron technology would continue to assist and
encourage analytical behaviour in machines, which will
increase the efficiency of computers as Artificial
Intelligence advances.

Moreover, the current emphasis on neural networks


predicts that the perceptron model would grow in
prominence in the technical sphere since it helps machines
forecast the future in terms of data as well as perform
quickly without requiring considerable programming.

In any case, the perceptron model would improve and lead


to significant changes in how robots mimic the brain's
operations.

4.5. Multilayer networks and the back-


propagation algorithm.
Multi-Layered Neural Network

A fully connected multi-layer neural network is properly called a Multi-Layer Perceptron (MLP). Put simply, a multi-layer neural network has several layers of artificial neurons, also called nodes. Multi-layer neural networks have replaced single-layer neural networks as the standard in recent years. A multi-layer neural network is shown in the next diagram.

Explanation: The nodes marked "1" are bias units. Layer 1, on the left, is the input layer; Layer 2, in the centre, is the hidden layer; and Layer 3, on the right, is the output layer. The units in the figure may be broken down as follows: 3 input units (excluding the bias unit), 4 hidden units (excluding the bias unit), and 1 output unit.

Figure 4.11 Multi-Layered Neural Network*

In most cases, a multi-layered neural network is implemented as a "Feed Forward Neural Network". The number of neurons and the number of layers are hyperparameters of the network, and cross-validation methods are needed to determine their best settings. Weight-adjustment training is performed using the back-propagation method.

* https://www.geeksforgeeks.org/multi-layered-neural-networks-in-r-programming/

Backpropagation

One such process is called backpropagation, and it works


by sending mistakes from the network's output nodes back
to their respective input nodes. Errors propagate in
reverse, hence this phenomenon is known by that name.
For example, it is used in many neural network data
mining applications, such as character recognition and
signature verification.

Backpropagation algorithm

When training feedforward neural networks, backpropagation


is a common algorithm. It is far more efficient than the
basic approach of calculating the gradient of a loss
function with respect to every individual weight in
a network.

Due to their effectiveness, gradient techniques (including


variations like gradient descent & stochastic gradient
descent) are often used to train the "multi-layer networks
and update weights" to minimise loss.

The backpropagation method uses the chain rule to iteratively calculate the gradient of the loss function with respect to each weight, layer by layer, starting at the last layer, to avoid recomputing intermediate terms.

Working of Backpropagation

Via supervised learning, neural networks produce output vectors from input vectors. If the produced output vector does not correspond to the desired output vector, an error is reported. Based on this error, the weights are adjusted until the expected results are achieved.

Backpropagation Algorithm

Step 1: A stream of X inputs arrives through the


established route.

Step 2: The input is modelled using actual weights W. The weights are usually selected at random.

Step 3: Determine the results of every neuron's input,


hidden, and output layers.

Step 4: Find the outputs' error rate.

Backpropagation Error= Actual Output – Desired Output.

Step 5: To decrease the error, we must return to a hidden


layer from the output layer and modify the weights there.

Step 6: Iterate the steps until the desired result is reached.
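The steps above can be sketched in code. The following is a minimal Python/NumPy illustration (not taken from any particular library) that trains a small 2-3-1 network on the XOR problem; the network size, learning rate, and number of epochs are arbitrary choices made only for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A minimal 2-3-1 network trained on XOR, mirroring Steps 1-6 above.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # Step 1: inputs
y = np.array([[0], [1], [1], [0]], dtype=float)

# Step 2: weights are selected at random.
W1, b1 = rng.normal(size=(2, 3)), np.zeros((1, 3))
W2, b2 = rng.normal(size=(3, 1)), np.zeros((1, 1))
lr = 0.5

for epoch in range(10000):          # Step 6: iterate until the result is good
    # Step 3: outputs of the hidden and output layers (forward pass).
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)

    # Step 4: error at the outputs.
    error = y - out

    # Step 5: propagate the error back and adjust the weights.
    delta_out = error * out * (1 - out)
    delta_hidden = (delta_out @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ delta_out
    b2 += lr * delta_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ delta_hidden
    b1 += lr * delta_hidden.sum(axis=0, keepdims=True)

print(np.round(out, 2))   # typically approaches [0, 1, 1, 0]
```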

Need for Backpropagation

Backpropagation, or "backpropagation of errors", is an effective method for training neural networks. It can be put into action quickly and with little effort. The only parameter that has to be specified for backpropagation is the number of inputs. Since backpropagation doesn't need any prior information about the network, it's a very adaptable technique.

Types of Backpropagation

Backpropagation networks may be split into two


categories:

1. Static backpropagation: The goal of a network


trained by static backpropagation is to provide the
same outputs regardless of the inputs. Static
classification issues, like "optical character
recognition" (OCR), are within the scope of these
networks.

2. Recurrent backpropagation: Recurrent backpropagation is another kind of network, used for fixed-point learning. Activations are fed forward until a fixed value (threshold) is reached. Static backpropagation offers an instant (real-time) mapping, whereas recurrent backpropagation does not.

Advantages

• It's quick, simple, and straightforward to set up.

• Apart from the number of inputs, there are no parameters to tune.

• It's effective and adaptable.

• No specialised skills are required of consumers.

Disadvantages

• It's very sensitive to anomalies and noisy input.


Results may be off if they are based on noisy data.

• The output is extremely sensitive to the


information that is provided.

• Training can take a very long time.

• A matrix-based method is chosen over the mini-


batch method.

CHAPTER-5: Ensembles

5.1. Introduction

By integrating many models, ensemble learning may aid in


producing more accurate machine learning outcomes. By
combining several models, superior predictive
performance may be achieved compared to that of a single
model. The fundamental concept is to study a group of
classifiers (specialists) and then ask them to cast votes.

Advantage: Enhanced ability to forecast outcomes.

Disadvantage: Grasping the meaning of many classifiers at


once is challenging.

Ensembles, as shown by Dietterich (2002), can solve three


issues–

• Statistical Problem –

The Statistical Problem arises when the hypothesis space is too large for the amount of available data. Several hypotheses are then equally well supported by the evidence, but only one of them will be selected by the learning process, and the probability that the selected hypothesis is accurate on unseen data is low.

Figure 5.1 An example of ensembles: why do ensembles work?*

• Computational Problem –

When a learning system can not reliably locate the optimal


hypothesis, we have a Computational Problem.

• Representational Problem – The Representational Problem arises when the hypothesis space does not contain a good approximation of the target class (or classes).

* https://www.geeksforgeeks.org/ensemble-classifier-data-mining/
Main Challenge for Developing Ensemble Models?

The difficulty is not in obtaining very accurate base models


but instead in obtaining basic models which make a
variety of errors. When using ensembles for classification,
for instance, high accuracies may be achieved even if the
basis classifier accuracy is poor since multiple base models
might misclassify different training samples.

Statistical and Computational Justifications for Ensemble Systems

Making use of the ensemble-based decision systems in CI


is built on similar premises to making use of them in real
life. Since each decision-maker has a unique track record
and level of accuracy, we often seek input from others
around us before moving forward. If there were such a
person—an oracle, perhaps—whose forecasts were always
accurate, we wouldn't need any other decision-maker, and
we certainly wouldn't want to rely on the ensemble-based
systems.

Unfortunately, there is no such thing as an oracle; all


decision-makers have a track record of varying degrees of
success. That is, there is a non-zero variance in the degree
to which each decision-maker is right. To begin, remember
that two components of any classification mistake that we
can influence are the classifier's bias and its variance
throughout training on various data sets. High-variance
classifiers tend to have low bias & vice versa.

In contrast, it is well-known that averaging tends to
smooth out (reduce) fluctuations. Ensemble systems aim to
decrease variance by combining the results of the several
classifiers that have been trained to have the same or
similar bias and then using some method, such as
averaging, to combine the results.
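A quick numerical sketch can make this variance-reduction argument concrete. Assuming, purely for illustration, that each classifier's output is the true value plus independent zero-mean noise (same bias, non-zero variance), averaging 25 such outputs should shrink the variance by roughly a factor of 25:

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 1.0
n_classifiers, n_trials = 25, 10_000

# Each "classifier" output is the true value plus independent noise.
outputs = true_value + rng.normal(scale=0.5, size=(n_trials, n_classifiers))

single = outputs[:, 0]             # one classifier on its own
ensemble = outputs.mean(axis=1)    # simple averaging combiner

print(single.var())    # roughly 0.25
print(ensemble.var())  # roughly 0.25 / 25 = 0.01
```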

Figure 5.2 Ensemble systems aim to decrease variance by


combining the results of the several classifiers*

To reduce the variation, a moving average filter is applied to the signal, averaging the samples around each sample. Since the noise within each sample is uncorrelated, averaging removes the noise component while leaving the common information content of the signal unchanged. Assuming that the classifiers make different mistakes on each sample but usually agree on their correct classifications, averaging the classifier outputs minimises the error by averaging out the error components, and this is precisely how an ensemble of classifiers improves classification accuracy.

* https://doc.lagout.org/science/Artificial%20Intelligence/Machine%20learning/Ensemble%20Machine%20Learning_%20Methods%20and%20Applications%20%5BZhang%20%26%20Ma%202012-02-17%5D.pdf

First, averaging classifier outputs is only one technique to


combine ensemble members; in the context of
the ensemble systems, there are many more methods to do
this. Second, it is not certain that classification performance
would improve by pooling the classifier outputs over the
best classifier in an ensemble. Instead, it lessens the
possibility that we'll choose a classifier that performs
poorly. After all, we wouldn't need an ensemble if we
could predict which classifier would provide the best
results; in that case, we could just select that one.

5.1.1. Building an Ensemble System


The development of a powerful ensemble system requires
the selection of three approaches, these are the three
mainstays of ensemble systems:

1 Data Sampling and Selection

In "ensemble-based systems", diversity, in the form of members making different mistakes on any given sample, is of the utmost importance. After all, there is no point in putting together an ensemble if each of its parts produces the same result. So, variety in the ensemble members' decisions is required, especially when things go wrong. It is generally agreed that ensemble systems benefit greatly from increased diversity: ideally, the classifier outputs are independent or even negatively correlated.

Several methods exist for increasing the diversity of


the ensembles, the most popular of which is to use various
subsets of training data. As the sampling method changes,
the resulting ensemble algorithms shift as well. Bagging,
on the other hand, results from utilising bootstrapped
clones of training data, whereas boosting methods are built
upon favouring samples that were previously
misclassified.

However, random subspace approaches also exist, whereby the features are partitioned and used to train individual classifiers. Other, less frequent methods include using a variety of base classifiers, varying the parameters of a base classifier (for example, training an ensemble of multilayer perceptrons, each with a unique number of hidden-layer nodes), or a combination of these techniques. A variety of diversity metrics have been defined in the literature. Although it has been shown that a lack of diversity leads to worse ensemble performance, it has not yet been determined whether there is a direct correlation between diversity and ensemble accuracy.
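For illustration, the two most common ways of injecting this diversity, bootstrapped training sets and random feature subspaces, can be sketched in a few lines of NumPy; the data here is randomly generated and the subset sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))        # hypothetical data: 100 samples, 8 features
y = rng.integers(0, 2, size=100)     # hypothetical binary labels

# Bootstrap sample: draw row indices with replacement (bagging-style diversity).
rows = rng.choice(len(X), size=len(X), replace=True)
X_boot, y_boot = X[rows], y[rows]

# Random subspace: draw a subset of feature columns without replacement.
cols = rng.choice(X.shape[1], size=4, replace=False)
X_sub = X[:, cols]

print(X_boot.shape, X_sub.shape)
```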

2 Training Member Classifiers

The method used to train the individual ensemble members is important to any ensemble-based system. Bagging (and related techniques such as arc-x4 and random forests), boosting (and its numerous variants), stacked generalisation, and the hierarchical mixture of experts (MoE) remain the most often used methodologies despite the proliferation of alternative algorithms.

3 Combining Ensemble Members

The last stage of any ensemble-based system is


a mechanism that brings together several classifiers. The
approach used at this stage is partly determined by the
classifiers that make up the ensemble. Some classifiers,
including support vector machines, exclusively provide
discrete-valued label outputs, for instance. For such
classifiers, majority voting (either simple or weighted) is
by far the most popular combination rule, followed
by Borda count.

Class-specific outputs from other classifiers, like the


multilayer perceptron or (naive) Bayes classifier,
are continuous-valued and may be understood as the
classifier's level of support for each class.

Classifiers of this kind may make use of a larger variety of


methods than only voting-based ones, including arithmetic
( product, sum, mean, etc.) combiners and more complex
decision templates. When training is complete, many such
combiners are ready to be employed, while more
sophisticated combination algorithms may need extra
training (as used in the stacked generalisation or
hierarchical MoE).
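As a small, hedged sketch of these combination rules, the following Python snippet applies simple majority voting to hypothetical label outputs and simple averaging to hypothetical continuous support values; all numbers are invented for the example.

```python
import numpy as np

# Hypothetical label outputs of five ensemble members for four test samples.
# Discrete label outputs (e.g. from SVMs) are combined by majority voting.
label_votes = np.array([
    [0, 1, 1, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1],
])
majority = (label_votes.sum(axis=1) > label_votes.shape[1] / 2).astype(int)

# Hypothetical per-class support from three soft classifiers (e.g. MLPs),
# combined with the simple mean (an arithmetic combiner).
support = np.array([
    [[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]],
    [[0.1, 0.9], [0.4, 0.6], [0.3, 0.7]],
])
mean_support = support.mean(axis=1)
soft_decision = mean_support.argmax(axis=1)

print(majority, soft_decision)
```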

5.2. Bagging and boosting
Both Bagging & Boosting are predicated on the idea of
breaking down a larger job into smaller manageable
chunks, and then working on each chunk individually
before merging their results to form the whole.

Bagging

It is the machine learning technique for reducing variance


and avoiding overfitting, leading to a more precise
prediction by learning models, and it goes by the name
Bootstrap Aggregating.

In Bagging, the training data is resampled many times using sampling with replacement. Models built on top of base algorithms such as Decision Trees are then used to make predictions on new data drawn from the same collection. Model-averaging strategies, such as Random Forest, take the output of several models and combine it into a single prediction.

Datasets are produced randomly with replacements in an


original input dataset, and several datasets are used to
train different models simultaneously.

Predicting a model's behaviour in Bagging is often done


using the Decision tree method.

The Random Forest model, which employs Bagging, is one such example; it builds several decision trees and combines them into a full random forest.

Boosting

It is an ensemble learning technique that builds models one after another, giving more weight to the observations (data points) that the previous models misclassified.

The first model is trained using the original data, and then
the second model is trained with a heavier weight of
the observations to correct the first model's mistakes. The
process is repeated until either a reliable forecast is made
or a large number of models have been explored.

The approach keeps track of each learner's mistakes and reduces them by giving more weight to the misclassified observations.

Example: In the case of AdaBoost, boosting is employed to lower the error rate.

Working of Bagging

Assume there are M models, N data points, D datasets, and F data features. In Bagging, all M models operate in parallel.

The data is first separated into test and training sets.

Using samples from the training data, a first model (m1) makes its predictions. The training data may then be resampled and a second model (m2) trained on the new sample.

Since we are only sampling the dataset, without adding or deleting any entries, we may end up with many copies of the same training record; this is sampling with replacement.

To reach a conclusion, a strategy is used that takes into account the predictions of all the models, for example by taking their mean.
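A minimal sketch of this procedure in Python, assuming scikit-learn is available and using its decision trees as the base models, might look as follows; the dataset is synthetic and the number of trees (25) is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Train M decision trees, each on its own bootstrap sample (with replacement).
models = []
for _ in range(25):
    idx = rng.choice(len(X), size=len(X), replace=True)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    models.append(tree)

# Combine: average the votes of all trees and round to the majority class.
votes = np.mean([m.predict(X) for m in models], axis=0)
bagged_pred = (votes >= 0.5).astype(int)

# Accuracy on the training data, just to show the mechanics.
print((bagged_pred == y).mean())
```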

Advantages of Bagging

• Convert weak learners: Parallel processing is a


powerful tool for transforming fragile models into
robust learners.

• Reduce variance: As a result, the resulting learning


model is more accurate and has less variation and
overfitting.

• Increase accuracy: In applications like regression &


statistical classification, it improves the machine
learning algorithms' precision.

Disadvantages of Bagging

• Underfitting: Underfitting occurs when a model


has not been adequately trained.

• Costly: High cost due to the need to use many


models.

Working of Boosting

To do Boosting, we need to start with a dataset and


assume that M learning models are processing the input in
order.

Let's suppose M models are available, and D datasets are


used for evaluation.

The first model, M1, is trained using a small subset of the dataset to check how well it learns, before being trained on the whole data. M2 and M3 are then trained on the same sample data, reweighted using M1's prediction errors.

The procedure terminates when all the weak learners have been trained, or when the best possible forecast has been reached.
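In practice the per-sample reweighting is usually handled by a library. The sketch below, which assumes scikit-learn is available, fits an AdaBoost ensemble on a synthetic dataset purely for illustration; the dataset and parameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each new learner focuses on the samples the previous ones got wrong;
# the library maintains the per-sample weights internally.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X_tr, y_tr)

print(booster.score(X_te, y_te))   # accuracy on the held-out test split
```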

Advantages of Boosting

• Highly efficient at reducing error and at handling two-class (binary) classification problems.

• Handle missing data: Useful for dealing with


missing data since many models are linked
sequentially to find and fill in gaps.

Disadvantages of Boosting

• Complex: It's difficult to keep track of how all the


models function and how each inaccuracy adds
more weight to the data. Putting an algorithm
through its paces in real-time is a challenging task.

• Dependency: This chain of dependency between


successive models introduces mistake potential.

Similarities between Bagging and Boosting

To compare and contrast Bagging with boosting, we might


list the following:

• Ensemble method: Bagging and boosting are both ensemble strategies that combine weak learners into stronger learners.

• Variance reduction: Overfitting & increased
variance may be corrected using both methods.

• Generate datasets: They both randomly produce


many datasets by sampling and making
adjustments.

• Average predicted result: They are similar in that


they both employ average model approaches to
forecast the result of N learners being converted
into a single learner.

Differences between Bagging and Boosting

• Dataset: In Boosting, the dataset is reweighted every time a new learner is trained. In Bagging, the models are trained on many datasets drawn by sampling with replacement.

• Issue addressed: Boosting reduces bias in the machine learning system. Bagging decreases both variance and overfitting.

• Order of working: Boosting is a homogeneous sequential model that trains one learner at a time. Bagging is a homogeneous parallel model.

• When to use: Boosting suits simple, high-bias classifiers. Bagging suits classifiers that have high variance and are unstable.

• Weights: In Boosting, observations that are predicted incorrectly are given more weight. In Bagging, every observation carries equal significance.

• Effect on weak models: In Boosting, each model is influenced by the models that precede it. In Bagging, each model may be built independently.

• Example: Boosting uses AdaBoost; Bagging uses Random Forest.

5.3. Random forest


Random Forest is a supervised learning technique and one of the most widely used algorithms in machine learning. It may be used for both classification and regression. It is based on ensemble learning, the practice of merging several classifiers to solve a complicated problem and increase the performance of the model.

Random Forest, as its name indicates, is a classifier which uses many decision trees trained on different subsets of a dataset and then averages the results to increase predictive accuracy. The output of a random forest is predicted not by a single decision tree, but by aggregating the forecasts of many individual trees.

A larger number of trees in the forest leads to higher accuracy and helps prevent overfitting.

The following graphic illustrates how the Random Forest


method works:

Figure 5.3 Random forest*

* https://www.javatpoint.com/machine-learning-random-forest-algorithm
5.3.1. Assumptions for Random Forest

Since the random forest uses a combination of trees to


determine the dataset's classification, certain decision trees
may likely provide the desired results while others will
not. However, by combining the results of all the trees, we
obtain an accurate prediction. Consequently, here are two
presumptions that can help you build a more accurate
Random forest classifier

• For the classifier to make a reliable prediction, the


feature variable in a dataset should have some
observed data.

• There can't be much overlap between the results


predicted by the various trees.

Why use Random Forest?

The following are some arguments in favour of using the


Random Forest method:

• It requires less time to train than competing


algorithms.

• The efficiency of its predictions even for the mas-


sive dataset is impressive.

• It can also preserve precision even when a signi-


ficant chunk of data is absent.

5.3.2. How does the Random Forest algorithm
work?
The first step of using a Random Forest is to generate the
random forest by mixing N decision trees, and the second
step is to use the generated trees to make the predictions.

Here's how it all works; just follow the steps and refer to
the diagram:

Step 1: Select K random data points from the training set.

Step 2: Build the decision trees associated with the selected data points (subsets).

Step 3: Choose the number N of decision trees you want to build.

Step 4: Repeat Steps 1 and 2.

Step 5: For new data points, find the predictions of each decision tree, and assign the new points to the category that wins the majority of votes.
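A brief, illustrative sketch of these steps, assuming scikit-learn is available and using its RandomForestClassifier, is shown below; the Iris dataset and the choice of 100 trees are assumptions made only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# N = 100 trees; each is grown on a bootstrap sample with a random subset
# of features, and new points receive the majority-vote class.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

print(forest.score(X_te, y_te))   # accuracy on the held-out test split
```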

As an illustration of the algorithm's operation, consider the


following scenario:

Example: Suppose there is a dataset containing several pictures of different types of fruit, and the Random Forest classifier is applied to it. Each decision tree is given a subset of the dataset and generates a prediction result during the training phase; the Random Forest classifier then makes the final prediction for every new data point based on the majority of outcomes. Take a look at the picture below:

Figure 5.4 An example of a random forest*

5.3.3. Applications of Random Forest


For the most part, Random Forest is used in the following
four fields:

1. Banking: This algorithm is widely used by the


banking industry to determine loan risk.

2. Medicine: This method may be used to forecast


illness occurrence and severity.

* https://www.javatpoint.com/machine-learning-random-forest-algorithm
3. Land Use: Using this method, we can pinpoint
locations with a comparable land use pattern.

4. Marketing: This algorithm may be used to detect


shifts in the marketing landscape.

5.3.4. Advantages of Random Forest

• Classification & regression are both within Random


Forest's capabilities.

• It works well with high-dimensional data sets and


can handle enormous datasets.

• It improves the reliability of the model and helps


avoid the problem of overfitting.

5.3.5. Disadvantages of Random Forest


• Although random forest can perform both tasks, it is less well suited to regression problems than to classification.

Essential Features of Random Forest

• Miscellany (diversity): not all features are considered when building an individual tree, so each tree is different from the others.

• Immune to the curse of dimensionality: since each tree does not consider all the features, the effective feature space is reduced.

• Parallelization: Since each tree in a random forest is
generated independently using various data and
attributes, we can make full use of the central
processing unit to construct these forests.

• Train-Test split: As the decision tree in a Random


Forest is never exposed to 30% of the data, there is
no need to split the training and testing sets with
this method.

• Stability: the final result is stable because it is based on majority voting or averaging across the trees.

5.4. Clustering: Introduction


It is an example of an unsupervised learning technique. An
unsupervised learning approach is one in which we
compile our reference data from inputs without the use of
labelled outputs. It is a method for discovering the
underlying structure, processes, generating properties, and
groups of a given collection of cases.

Clustering refers to the process of grouping a population


or set of data points into subsets where members of the
same subset have more similarities than differences with
members of other subsets. This is essentially an
assemblage of things that have been grouped based on
their similarities and differences.

To illustrate, the points in the graph below that are close to


each other may be thought of as belonging to the same
group. To our eyes, there are three distinct groupings in
the image below.

Figure 5.5 An example of Clustering*

Clusters need not take on a spherical shape. The following


are examples of this:

Figure 5.6 DBSCAN: Density-based Spatial Clustering of


Applications with Noise †

* https://www.geeksforgeeks.org/clustering-in-machine-learning/
† https://www.geeksforgeeks.org/clustering-in-machine-learning/
The clustering of such data points is based on the simple
idea that each point falls within a certain distance of the
cluster's centre. The outliers are determined using several
different distance measures and approaches.

5.4.1. Why Clustering?

Clustering is crucial because it reveals the intrinsic grouping hidden in unlabelled data. There are no universal criteria for a good clustering; the criteria used to determine whether a clustering meets a need depend entirely on the user.
For example, we could want to locate representatives for
homogenous groups (data reduction), discover "natural
clusters" and characterise their unknown qualities
("natural" data types), discover appropriate and helpful
classifications ("useful" data classes), or discover peculiar
data items (outlier detection). The assumptions used by
this algorithm about the similarities between data points
might vary widely while still producing meaningful
results.

Clustering Methods :

• Density-Based Methods: These techniques see the


clusters as dense areas sharing certain
characteristics with the less dense portion of the
universe while also displaying some distinctive
features. These techniques may effectively combine
two separate clusters into a single one. Instances of
such methods are "Density-Based Spatial
Clustering of Applications with Noise" (DBSCAN),

"Ordering Points to Identify Clustering Structure"
(OPTICS), etc.

• Hierarchical-Based Methods: The clusters


generated by this technique branch out like a tree
depending on the hierarchy. To create further
clusters, a single cluster is used as a building block
for the creation of still others. It's broken up into
two groups:

1. Agglomerative (bottom-up approach).

2. Divisive (top-down approach).

Examples are CURE (Clustering Using Representatives),


BIRCH (Balanced Iterative Reducing Clustering and using
Hierarchies), etc.

• Partitioning Methods: These strategies cluster items


into k distinct groups, with each resulting subset
being a single cluster. Whenever the distance plays
a significant role like in K-means, CLARANS
(Clustering Large Applications using Randomized
Search), etc., this technique is used to enhance the
similarity function.

• Grid-based Methods: This technique divides the data space into a finite number of cells that form a grid-like structure. Examples of clustering methods that operate on such grids are STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering in Quest), etc., and all the clustering operations performed on these grids are fast and independent of the number of data items.
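To make the methods described above concrete, the following illustrative snippet (assuming scikit-learn is installed) applies a density-based, a hierarchical, and a partitioning algorithm to the same synthetic "two moons" data; the parameter values are arbitrary choices for the example.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Density-based: clusters are dense regions, so the two moon shapes are found.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# Hierarchical (agglomerative, bottom-up) and partitioning (K-means) methods.
hc_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)

print(set(db_labels), set(hc_labels), set(km_labels))
```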

Clustering Algorithms

The K-means method is the quickest and easiest way to


solve the clustering issue using unsupervised learning. The
k-means method divides a dataset of n observations into k
groups, where each observation is assigned to the cluster
containing the mean closest to its own.

Figure 5.7 Solving the clustering problem using unsupervised learning (K-means)*

* https://www.geeksforgeeks.org/clustering-in-machine-learning/
5.4.2. Types of Clustering Methods/ Algorithms

Due to the subjective nature of the clustering, several


alternative methods have been developed to solve the
varying challenges that arise. Each issue has its guidelines
for what constitutes a "high degree of similarity" between
two data points, necessitating a unique technique for each
particular clustering goal. More than a hundred different
clustering methods based on machine learning are now in
use.

A few Types of Clustering Algorithms

Connectivity Models

Connectivity models, as their name suggests, tend to


categorise data points following how near they are to one
another. The idea behind it is that neighbouring data
points will have more in common than those that are
farther apart. The method can handle a complex
hierarchical structure in which clusters may and do
combine. It can work with several different ways of
splitting up the data.

Each clustering task may call for a different distance


function, therefore developers have some leeway in
making that decision. Two distinct connection model
strategies exist for dealing with clustering issues. First, the
data is clustered according to its similarity, and then the
clusters are combined as their distance from the centre
diminishes. The second strategy involves grouping all of

the data into a single cluster and then splitting it up into
subgroups based on their distance from one another. The
model is easy to interpret; however, it does not scale well to large
datasets.

Distribution Models

It is the assumption that all data points in the cluster


follow the same distribution (either the Normal or
Gaussian) that forms the basis of distribution models. The
model is somewhat flawed in that it is prone to overfitting.
The well-known expectation-maximization algorithm is a great illustration of this approach.

Density Models

These models look for distinct clusters of data points and


then focus on those clusters. After that, it clusters data
points that are geographically close together. Examples of
popular density models are DBSCAN & OPTICS.

Centroid Models

In iterative clustering methods known as centroid models,


proximity to the cluster centre is used to measure how
similar the data points are. The formation of the centroid
(the cluster's centre) is based on the assumption that the
data points should be as close as possible to the centroid.
Such clustering issues often need numerous iterations to
approach a solution. The K-means method is a kind of
centroid model.

5.4.3. Applications of Clustering
• Clustering is an excellent method for dealing with a
wide range of machine learning issues, and it finds
use in a wide range of sectors.

• Market researchers use this method to learn more


about potential target groups and identify existing
ones.

• Using image recognition methods for categorising


plant and animal species.

• It sorts genes with similar functions into groups,


which helps in deducing the taxonomies of plants
and animals, and provides insight into the
underlying structures of populations.

• It may be used in urban planning to classify


clusters of buildings based on their function,
market value, and location.

• It also locates places that have similar land use and


labels them as either agricultural, industrial,
commercial, residential, etc.

• A system that organises online content for easier


searching.

• Good for using as the data mining feature to learn


about how information is grouped and what
factors separate various clusters.

• Used in outlier identification applications, it may


help spot cases of credit & insurance fraud.

• Useful in analysing earthquake-affected regions to
identify the high-risk zones (applicable for
the other natural hazards too).

• Library book clustering by subject, genre, and other


criteria is a possible use case.

• Cancer cell detection relies on comparing them to


healthy cells for a classification system, hence this
is a crucial use.

• As a consequence of applying clustering


algorithms, search engines provide results that are
the most closely related to the user's search query.

• Clustering techniques are used in wireless


networks to reduce power consumption and
maximise data throughput.

• Clustering methods are also used by hashtags


on social media to group all postings with the same
hashtag into a single stream.

5.5. K-mean clustering


How does the K-Means Algorithm Work?

Following are some stages that outline how the K-Means


algorithm works:

Step 1: Choose the value K to determine how many groups


will be created.

Step 2: Pick K random points or centroids. (They may be points other than those in the input dataset.)

Step 3: Distribute the data points across the K clusters by
assigning them to the centroid that is geographically
closest to them.

Step 4: A new centroid for each cluster should be


determined after computing the variance.

Step 5: In other words, you need to redo Step 3 again, this


time assigning each data point to the cluster with the
nearest centroid.

Step 6: If any reassignment has occurred, go back to Step 4; otherwise, the process ends.

Step 7: The model is complete and ready for use.
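The steps above can be written out as a short, self-contained NumPy sketch of the K-means loop; the two-dimensional data here is randomly generated and stands in for the M1/M2 variables used in the walk-through that follows. The sketch also assumes that no cluster ends up empty.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose K and pick K random data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop when no reassignment changes the centroids.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical two-variable data (M1, M2), split into two loose groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centroids = k_means(X, k=2)
print(centroids)
```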

Consider the graphic representations to better grasp the


aforementioned procedures:

Figure 5.8 A scatter diagram of M1 and M2 factors along the x-


axis*

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
To illustrate, let us say that we have two variables, M1 and M2. The scatter diagram of these two variables is shown in Figure 5.8 above.

Let's divide the dataset into two groups (K=2) to make it


easier to analyse. This means we'll attempt to divide these
data sets into two categories here.

To create the cluster, we must choose the random set of k


points or the random centroid. These may be any points,
not only those in the dataset. To illustrate, we'll use the
two points outside of our data set (below) to determine the
value of k. Think about the picture below:

Figure 5.9 Using two points outside the data*

The next step is to assign each data point in the scatter plot to the nearest K-point or centroid. We compute this using the familiar mathematics for measuring the distance between two points; accordingly, we draw a median line midway between the two centroids. Take a look at the figure below:

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.10 Locating the median point midway between the two
centres*

From the picture above, it is clear that the points on the left side of the line are closer to the K1 or blue centroid, while the points on the right side are closer to the yellow centroid. Let us colour them blue and yellow accordingly.

Figure 5.11 The K1 or blue centroid is located on the left side of the line, while the yellow centroid is located on the right side of the line*

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
If we want to locate the nearest cluster, we must start the
procedure again by selecting a different centroid. New
centroids would be selected by computing their centre of
gravity in the following ways:

Figure 5.12 Computing their centre of gravity†

After that, we will reassign each data point to the new centroids. To do so, we will once again use the median line method; the median will look like the graphic below.

One yellow point can be seen to the left of the line, and two blue points to the right of the line in the picture above. Therefore, these three points will be assigned to the new centroids.

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
† https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.13 Using the median line method*

Figure 5.14 Calculating new centroids using three positions†

Now that we have to locate new centroids or K-points


because of the reassignment, we'll return to step 4.

• This procedure will be repeated until the new


centroids have the shape shown in the figure
below, which is the result of discovering the
centroids' centres of gravity:

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.15 The structure of the new centroid*

• Since we now have the new centroids, we will once again draw the median line and reassign the data points. The final result will look like this:

Figure 5.16 Final result obtained†

• As can be seen in the graphic above, there are no mismatched data points on either side of the line, which means our model is complete. Take a look at the diagram below:

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
† https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
Figure 5.17 Concluding that the model is complete*

Now that our model is complete, we can get rid of the


hypothetical centres, leaving us with two clusters that look
like the ones in the figure below:

Figure 5.18 Getting rid of hypothetical centres †

* https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
† https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning
