Unit 3
Machine Learning

What is Data Classification?


Data classification is broadly defined as the process of organizing data by relevant
categories so that it may be used and protected more efficiently. On a basic level, the
classification process makes data easier to locate and retrieve. Data classification is
of particular importance when it comes to risk management, compliance, and
data security.
Data classification involves tagging data to make it easily searchable and trackable.
It also eliminates multiple duplications of data, which can reduce storage and backup
costs while speeding up the search process.
Reasons for Data Classification
Data classification has improved significantly over time. Today, the technology is used
for a variety of purposes, often in support of data security initiatives. But data may be
classified for a number of reasons, including ease of access (while preventing unauthorized
access), regulatory compliance, and various other business or personal objectives. In some
cases, data classification is a regulatory requirement, as data
must be searchable and retrievable within specified timeframes. For the purposes of data
security, data classification is a useful tactic that facilitates proper security responses
based on the type of data being retrieved, transmitted, or copied.
Classification is a supervised machine learning method in which the model tries to predict the correct label for a given input. The model is first fully trained on the training data and then evaluated on test data before being used to make predictions on new, unseen data. For instance, an algorithm can learn to predict whether a given email is spam or ham (not spam).
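As an illustration, the spam-versus-ham task can be sketched with a standard text classifier. The snippet below is a minimal example; the tiny in-line dataset and the choice of a Naive Bayes model are illustrative assumptions, not part of the original slides.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set: 1 = spam, 0 = ham
emails = ["win a free prize now", "meeting agenda for monday",
          "cheap loans click here", "lunch at noon tomorrow"]
labels = [1, 0, 1, 0]

# Train the model on the training data ...
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X_train, labels)

# ... then predict the label of new, unseen data
X_new = vectorizer.transform(["free prize waiting for you"])
print(model.predict(X_new))   # e.g. [1] -> predicted spam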

What is the Perceptron model in Machine Learning?
• The Perceptron is a Machine Learning algorithm for supervised learning of binary classification tasks. A Perceptron can also be understood as an artificial neuron, or neural network unit, that performs certain computations on input data to detect features, for example in business intelligence applications.
• The Perceptron model is also regarded as one of the simplest types of Artificial Neural Networks: it is a supervised learning algorithm for binary classifiers. Hence, we can consider it a single-layer neural network with four main parameters, i.e., input values, weights and bias, net sum, and an activation function.
What is a Binary Classifier in Machine Learning?
In Machine Learning, a binary classifier is a function that decides whether input data, represented as a vector of numbers, belongs to a specific class or not.
Binary classifiers of this kind can be considered linear classifiers. In simple words, such a classifier makes its prediction using a linear predictor function expressed in terms of a weight vector and a feature vector.
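In other words, such a classifier scores an input by a weighted sum of its features plus a bias and thresholds the result. A minimal sketch (the feature values and weights below are made up for illustration):

import numpy as np

def predict(x, w, b):
    # Linear predictor function: weighted sum of features plus bias,
    # thresholded at zero to choose one of the two classes.
    return 1 if np.dot(w, x) + b >= 0 else 0

x = np.array([2.0, -1.0])   # feature vector (illustrative)
w = np.array([0.5, 1.5])    # weight vector (illustrative)
b = -0.2                    # bias term
print(predict(x, w, b))     # prints 0 or 1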

• Input Nodes or Input Layer:
This is the primary component of the Perceptron, which accepts the initial data into the system for further processing. Each input node holds a real numerical value.
• Weight and Bias:
The weight parameter represents the strength of the connection between units and is one of the most important parameters of the Perceptron: a weight is directly proportional to how strongly the associated input neuron influences the output. The bias can be considered the intercept term in a linear equation.
• Activation Function:
This is the final component; it determines whether the neuron will fire or not. The activation function can be considered primarily as a step function.
Types of Activation functions:
•Sign function
•Step function, and
•Sigmoid function
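As a quick sketch, the three activation functions listed above can be written as plain Python functions (a rough illustration, not tied to any particular library):

import math

def sign(z):       # sign function: returns -1 or +1
    return 1 if z >= 0 else -1

def step(z):       # step function: returns 0 or 1
    return 1 if z >= 0 else 0

def sigmoid(z):    # sigmoid function: smooth output between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))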
This step function or activation function plays a vital role in ensuring that the output is mapped to the required values, (0, 1) or (-1, 1). It is important to note that the weight of an input is indicative of the strength of a node. Similarly, an input's bias value gives the ability to shift the activation function curve up or down.
The Perceptron model works in two important steps:
1. Compute the net sum: multiply each input value by its weight, add the products together, and add the bias term.
2. Apply the activation function to this net sum to produce the output.
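A minimal sketch of these two steps, together with the classic perceptron weight-update rule, is shown below. The toy dataset (the logical AND function), the learning rate, and the number of epochs are illustrative assumptions.

import numpy as np

def perceptron_train(X, y, lr=0.1, epochs=10):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            # Step 1: weighted sum of the inputs plus bias (net sum)
            z = np.dot(w, xi) + b
            # Step 2: step activation function
            output = 1 if z >= 0 else 0
            # Update weights and bias when the prediction is wrong
            w += lr * (target - output) * xi
            b += lr * (target - output)
    return w, b

# Toy linearly separable data: the logical AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, y)
print(w, b)   # learned weights and bias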

Linear Maximum-Margin Classifier
A linear maximum-margin classifier is a concept related to support vector machines (SVMs), a type of supervised machine learning algorithm used for classification and regression. SVMs are particularly powerful for classification tasks; they work by finding a hyperplane that maximally separates the different classes in the feature space.
1. Linear Classifier:
   • The term "linear" indicates that the decision boundary (hyperplane) used for classification is a linear function of the input features.
   • In a two-dimensional space, the decision boundary is a straight line; in a higher-dimensional space, it is a hyperplane.
2. Margin:
   • The margin in an SVM is the distance between the decision boundary and the nearest data point from each class.
   • A larger margin is desirable because it provides greater separation between the classes and is associated with better generalization to unseen data.
3. Maximum Margin:
   • The goal of an SVM is to find the hyperplane that maximizes the margin between the different classes.
   • The hyperplane that achieves this maximum margin is called the "maximum margin hyperplane."
4. Support Vectors:
   • Support vectors are the data points that lie closest to the decision boundary; they determine the position and orientation of the hyperplane.
   • These are the crucial points for defining the margin.
5. Soft Margin:
   • In some cases, the data may not be perfectly separable by a hyperplane. In such situations, a soft-margin SVM allows some misclassification in order to balance maximizing the margin against minimizing errors.
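A small sketch using scikit-learn's SVC class illustrates these ideas; the regularization parameter C controls the soft margin (a smaller C tolerates more misclassification in exchange for a wider margin). The synthetic data below is an illustrative assumption.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly separable clusters of points (illustrative data)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

# Linear SVM; C trades margin width against training errors (soft margin)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_.shape)   # the support vectors that define the margin
print(clf.coef_, clf.intercept_)    # hyperplane weights w and intercept b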
For the learned weight vector of such a classifier, a larger magnitude indicates a stronger influence of the corresponding features on the model's predictions.

Probably Approximately Correct (PAC) Learning
The central idea behind PAC learning is to define what it means for a learning algorithm to be "probably approximately correct" in its predictions on unseen data.
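One common way to make this precise (a standard textbook statement, not taken from the original slides) is the sample-complexity bound for a finite hypothesis class H: with probability at least 1 − δ, a learner that outputs a hypothesis consistent with the training data achieves error at most ε, provided the number of training examples m satisfies

m ≥ (1/ε) · ( ln|H| + ln(1/δ) ).

Here ε is the accuracy parameter ("approximately correct") and δ is the confidence parameter ("probably").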

Types of SVM
1. Linear SVM: This is the basic form of SVM that separates classes using a
linear boundary in feature space.
2.Nonlinear SVM: When the data is not linearly separable, a nonlinear
SVM is used. It uses kernel functions to transform the input data into a
higher-dimensional feature space, where the data can be linearly separated.
3.One-Class SVM: This variant of SVM is used for outlier detection or
novelty detection. It learns the boundary of a set of observations that contain
no anomalies, and then detects new observations that fall outside of this
boundary.
4.Support Vector Regression (SVR): This variant of SVM is used for
regression problems, where the goal is to predict continuous output
variables. It works by minimizing the distance between the predicted output
and the actual output.

5. Nu-SVM: This variant of SVM introduces a parameter “nu” that controls
the number of support vectors and the margin width.
6. Weighted SVM: This variant of SVM allows for assigning different
weights to different classes in the training data, which can help in cases
where the classes are imbalanced.
7. Multiple Kernel Learning (MKL): This variant of SVM allows for
combining multiple kernel functions to achieve better performance on
complex classification problems.
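In scikit-learn, most of these variants correspond to different estimator classes or constructor arguments. The sketch below shows typical instantiations; the parameter values are illustrative assumptions, and MKL has no single built-in scikit-learn class, so it is omitted.

from sklearn.svm import SVC, NuSVC, OneClassSVM, SVR

linear_svm    = SVC(kernel="linear")                # 1. Linear SVM
nonlinear_svm = SVC(kernel="rbf", gamma="scale")    # 2. Nonlinear SVM via a kernel
one_class     = OneClassSVM(kernel="rbf", nu=0.05)  # 3. One-Class SVM for outlier detection
regressor     = SVR(kernel="rbf", C=1.0)            # 4. Support Vector Regression
nu_svm        = NuSVC(nu=0.5)                       # 5. Nu-SVM: nu bounds the support vectors
weighted_svm  = SVC(kernel="linear",
                    class_weight={0: 1, 1: 5})      # 6. Weighted SVM for imbalanced classes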

Examples of each in real time:
1.Linear SVM: Linear SVM is used when the data is linearly separable. For example, let’s
say we have a dataset of iris flowers with their sepal length and sepal width measurements.
We want to predict whether a flower belongs to the setosa or versicolor species. We can
use a linear SVM to separate the two species using a linear boundary in the feature space
of sepal length and sepal width.
2.Nonlinear SVM: Nonlinear SVM is used when the data is not linearly separable. For
example, let’s say we have a dataset of images of handwritten digits, and we want to
classify each image as one of the 10 digits (0–9). We can use a nonlinear SVM with a
polynomial kernel function to transform the input data into a higher-dimensional feature
space, where the data can be linearly separated.
3.One-Class SVM: One-Class SVM is used for outlier detection or novelty detection. For
example, let’s say we have a dataset of credit card transactions, and we want to identify
transactions that are likely to be fraudulent. We can use a one-class SVM to learn the
boundary of a set of transactions that contain no anomalies, and then detect new
transactions that fall outside of this boundary as potential fraudulent transactions.

4. Support Vector Regression (SVR): SVR is used for regression problems, where the goal is to predict
continuous output variables. For example, let’s say we have a dataset of housing prices with their features
such as square footage, number of bedrooms, and location. We can use an SVR to predict the price of a
new house based on its features.
5. Nu-SVM: Nu-SVM is a variant of SVM that introduces a parameter “nu” that controls the number of
support vectors and the margin width. For example, let’s say we have a dataset of email messages, and we
want to classify each message as spam or not spam. We can use a nu-SVM with a smaller value of “nu” to
allow for more support vectors and a wider margin, which can help to reduce the number of false positives
(i.e., classifying a non-spam message as spam).
6. Weighted SVM: Weighted SVM is used when the classes in the training data are imbalanced. For
example, let’s say we have a dataset of medical images, and we want to classify each image as benign or
malignant. If the dataset has more benign images than malignant images, we can use a weighted SVM to
assign a higher weight to the malignant class, which can help to improve the performance of the classifier
on the minority class.
7. Multiple Kernel Learning (MKL): MKL is a variant of SVM that allows for combining multiple
kernel functions to achieve better performance on complex classification problems. For example, let’s say
we have a dataset of protein sequences, and we want to classify each sequence as belonging to one of
several protein families. We can use MKL to combine multiple kernel functions that capture different
aspects of the protein sequences, such as their amino acid composition, secondary structure, and
evolutionary conservation.
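As one concrete, end-to-end sketch of the first case above (a linear SVM on iris measurements), the snippet below uses scikit-learn's built-in iris data. Restricting the data to two species and to the sepal length/width features mirrors the description; the exact split and parameters are illustrative assumptions.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
# Keep only setosa (class 0) and versicolor (class 1), and only sepal length/width
mask = iris.target < 2
X, y = iris.data[mask, :2], iris.target[mask]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.score(X_test, y_test))   # accuracy on the held-out flowers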

Decision Tree in Machine Learning

A decision tree is a type of supervised learning algorithm that is commonly used in machine learning to model and predict outcomes based on input data. It is a tree-like structure in which each internal node tests an attribute, each branch corresponds to an attribute value, and each leaf node represents the final decision or prediction. Decision trees can be used to solve both regression and classification problems.

Decision Tree Terminologies


There are specialized terms associated with decision trees that denote various components and facets of the tree structure and the decision-making procedure:
•Root Node: A decision tree’s root node, which represents the original choice or feature
from which the tree branches, is the highest node.
•Internal Nodes (Decision Nodes): Nodes in the tree whose choices are determined by
the values of particular attributes. There are branches on these nodes that go to other
nodes.

•Leaf Nodes (Terminal Nodes): The endpoints of branches, where final choices or predictions are made. Leaf nodes have no further branches.
•Branches (Edges): Links between nodes that show how decisions are made in response
to particular circumstances.
•Splitting: The process of dividing a node into two or more sub-nodes based on a
decision criterion. It involves selecting a feature and a threshold to create subsets of data.
•Parent Node: A node that is split into child nodes. The original node from which a split
originates.
•Child Node: Nodes created as a result of a split from a parent node.
•Decision Criterion: The rule or condition used to determine how the data should be split
at a decision node. It involves comparing feature values against a threshold.
•Pruning: The process of removing branches or nodes from a decision tree to improve its
generalization and prevent overfitting.

Example of Decision Tree
Let’s understand decision trees with the help of an example:
Decision trees are drawn upside down, which means the root is at the top and is then split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements: the tree checks whether a condition is true and, if it is, moves on to the next node attached to that decision.
In this example the tree first asks: what is the weather? Is it sunny, cloudy, or rainy? It then checks the next features, humidity and wind. It checks whether the wind is strong or weak; if the wind is weak and it is rainy, the person may go out and play.
The goal of machine learning is to decrease uncertainty or disorder in the dataset, and for this we use decision trees.
How Decision Tree is formed?
The process of forming a decision tree involves recursively partitioning the data based on the
values of different attributes. The algorithm selects the best attribute to split the data at
each internal node, based on certain criteria such as information gain or Gini impurity.
This splitting process continues until a stopping criterion is met, such as reaching a
maximum depth or having a minimum number of instances in a leaf node.
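These choices map directly onto the constructor arguments of a typical decision-tree implementation. A minimal scikit-learn sketch (the dataset and the parameter values are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(
    criterion="entropy",    # split by information gain ("gini" uses Gini impurity)
    max_depth=3,            # stopping criterion: maximum depth of the tree
    min_samples_leaf=5,     # stopping criterion: minimum instances in a leaf node
).fit(X, y)

print(tree.get_depth(), tree.get_n_leaves())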
Why Decision Tree?
Decision trees are widely used in machine learning for a number of reasons:
•Their interpretability and versatility make decision trees well suited to simulating intricate decision-making processes.
•Because they provide comprehensible insights into the decision logic, decision trees are especially helpful for classification and regression tasks.
•They handle both numerical and categorical data, and they adapt easily to a variety of datasets thanks to their built-in feature selection.
•Decision trees are also simple to visualize, which helps in understanding and explaining the underlying decision process of a model.

Decision Tree Classification Algorithm
•Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes represent
the features of a dataset, branches represent the decision rules and each leaf node
represents the outcome.
•In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make decisions and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
•The decisions or the test are performed on the basis of features of the given dataset.
•It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
•It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
•In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
•A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
•The diagram at the link below explains the general structure of a decision tree:
https://www.javatpoint.com/machine-learning-decision-tree-classification-algorithm
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best
algorithm for the given dataset and problem is the main point to remember
while creating a machine learning model. Below are the two reasons for
using the Decision tree:
•Decision Trees usually mimic human thinking ability while making a decision, so they are easy to understand.
•The logic behind the decision tree can be easily understood because it
shows a tree-like structure.
Decision Tree Terminologies
•Root Node: Root node is from where the decision tree starts. It represents
the entire dataset, which further gets divided into two or more homogeneous
sets.
•Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further once a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node
into sub-nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches
from the tree.
Parent/Child node: The root node of the tree is called the parent node,
and other nodes are called the child nodes.
How does the Decision Tree algorithm Work?
In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of
root attribute with the record (real dataset) attribute and, based on the
comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the algorithm below:
•Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
•Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
•Step-3: Divide S into subsets that contain the possible values of the best attribute.
•Step-4: Generate the decision tree node, which contains the best attribute.
•Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; each such final node is called a leaf node.
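These five steps amount to a recursive procedure. A rough sketch in Python is given below; the helper functions best_attribute (the Attribute Selection Measure, e.g. highest information gain) and majority_label are assumed rather than defined on the slides, and examples are taken to be (feature-dict, label) pairs.

def build_tree(examples, attributes):
    labels = [label for _, label in examples]
    # Stop when the node is pure or no attributes remain: make a leaf node
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": majority_label(labels)}

    # Step-2: pick the best attribute using an Attribute Selection Measure
    best = best_attribute(examples, attributes)
    node = {"attribute": best, "children": {}}

    # Steps 3-5: split on each value of the best attribute and recurse
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        node["children"][value] = build_tree(subset, remaining)
    return node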
Example: Suppose there is a candidate who
has a job offer and wants to decide whether
he should accept the offer or Not. So, to
solve this problem, the decision tree starts
with the root node (Salary attribute by
ASM). The root node splits further into the
next decision node (distance from the
office) and one leaf node based on the
corresponding labels. The next decision
node further gets split into one decision
node (Cab facility) and one leaf node.
Finally, the decision node splits into two
leaf nodes (Accepted offer and Declined
offer).
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree.
There are two popular techniques for ASM, which are:
•Information Gain
•Gini Index
1. Information Gain:
•Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
•It calculates how much information a feature provides us about a class.
•According to the value of information gain, we split the node and build the decision
tree.
•A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
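The formula from the original slide is not reproduced in this text; the standard definition is

Information Gain(S, A) = Entropy(S) − Σ_v ( |S_v| / |S| ) · Entropy(S_v),

where S is the set of examples at the parent node, A is the attribute used for the split, and S_v is the subset of S for which attribute A takes the value v.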
Entropy
Entropy is nothing but the uncertainty in our dataset or
measure of disorder. Let me try to explain this with the help
of an example.
Suppose you have a group of friends deciding which movie to watch together on Sunday. There are two choices, "Lucy" and "Titanic", and everyone has to state their preference. After everyone answers, "Lucy" gets 4 votes and "Titanic" gets 5 votes. Which movie do we watch now? It is hard to choose, because the votes for the two movies are almost equal.
This is exactly what we call disorder: there is a nearly equal number of votes for both movies, and we cannot really decide which movie to watch. It would have been much easier if "Lucy" had received 8 votes and "Titanic" only 2; then we could easily say that most votes are for "Lucy", so everyone will watch that movie.
Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells how random our data is. A pure sub-split means that you should be getting either all "yes" or all "no".
Suppose a node initially has 8 "yes" and 4 "no"; after the first split the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see that the split is not pure. Why? Because we can still see some negative classes in both nodes. To build the tree, we need to calculate the impurity of each split, and when the purity is 100% we make the node a leaf node. To check the impurity of the two child nodes we take the help of the entropy formula.
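The entropy formula is Entropy = −Σ_i p_i · log2(p_i), summed over the classes. A small sketch that applies it to the split described above (8 "yes"/4 "no" at the parent node, then 5/2 and 3/2 in the two child nodes):

import math

def entropy(pos, neg):
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # treat 0 * log(0) as 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(8, 4))   # parent node,  about 0.918
print(entropy(5, 2))   # left child,   about 0.863
print(entropy(3, 2))   # right child,  about 0.971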

We can clearly see from the tree itself that the left node has lower entropy, i.e. more purity, than the right node, since the left node has a greater number of "yes" answers and is easier to decide. Always remember: the higher the entropy, the lower the purity and the higher the impurity.
As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. Entropy gives us the impurity of a particular node, but it does not by itself tell us whether the impurity has decreased relative to the parent node.
For this, we introduce a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.

Now we have two features to predict
whether he/she will go to the gym or
not.
•Feature 1 is “Energy” which takes two
values “high” and “low”
•Feature 2 is “Motivation” which takes
3 values “No motivation”, “Neutral”
and “Highly motivated”.
Let’s see how our decision tree will be
made using these 2 features. We’ll use
information gain to decide which
feature should be the root node and
which feature should be placed after the
split.
We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than the "Motivation" feature. Hence, we select the feature with the highest information gain and split the node based on that feature.
In this example “Energy” will be our root node and we’ll do the same
for sub-nodes. Here we can see that when the energy is “high” the
entropy is low and hence we can say a person will definitely go to the
gym if he has high energy, but what if the energy is low? We will again
split the node based on the new feature which is “Motivation”.

Advantages of the Decision Tree
• It is simple to understand, as it follows the same process a human follows while making a decision in real life.
• It can be very useful for solving decision-related problems.
• It helps in thinking through all the possible outcomes of a problem.
• It requires less data cleaning compared to other algorithms.
Disadvantages of the Decision Tree
• The decision tree contains lots of layers, which makes it complex.
• It may have an overfitting issue, which can be resolved using the
Random Forest algorithm.
• For more class labels, the computational complexity of the decision
tree may increase.
https://medium.datadriveninvestor.com/decision-tree-algorithm-with-hands-on-example-e6c2afb40d38
ID3 [Iterative Dichotomiser 3]
(It is one of the most popular algorithms used for constructing trees.)

ID3 stands for Iterative Dichotomiser 3 and is named such because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. ID3 is an algorithm invented by Ross Quinlan for generating a decision tree from a dataset, and it is one of the most popular algorithms for constructing trees.

ID3 is the core algorithm for building a decision tree. It employs a top-down greedy search through the space of all possible branches with no backtracking. The algorithm uses information gain and entropy to construct a classification decision tree.
Characteristics of ID3 Algorithm
Major Characteristics of the ID3 Algorithm are listed below:
•ID3 can overfit the training data (to avoid overfitting, smaller
decision trees should be preferred over larger ones).
•This algorithm usually produces small trees, but it does not
always produce the smallest possible tree.
•ID3 is harder to use on continuous data (if the values of an attribute are continuous, there are many more places to split the data on that attribute, and searching for the best split value can be time-consuming).
C4.5
The C4.5 algorithm is a successor of ID3. The most significant difference between C4.5
and ID3 is that C4.5 efficiently allows for continuous features by partitioning
numerical values into distinct intervals.
•Instead of Information Gain, C4.5 uses Information Gain Ratio to determine the best
feature to split on.
•C4.5 uses post-pruning after growing an overly large tree.
•C4.5 has a bias towards features with a lesser number of distinct values.
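For reference (a standard definition, not shown on the original slides), the gain ratio divides information gain by the split information of the attribute, which is what penalizes attributes with many distinct values:

Gain Ratio(S, A) = Information Gain(S, A) / Split Info(S, A), where
Split Info(S, A) = − Σ_v ( |S_v| / |S| ) · log2( |S_v| / |S| ).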
CART (Classification and Regression Tree)
CART, similarly to C4.5, supports both categorical and numerical data. However, CART differs
from C4.5 in the way that it also supports regression.
One major difference between CART, ID3, and C4.5 is the feature selection criteria. While ID3
and C4.5 use Entropy/Information Gain/Information Gain Ratio, CART uses Gini Impurity.
Additionally, CART constructs a binary tree, which means that every node has exactly two
children, unlike the other algorithms we have discussed, which don't necessarily have two child
nodes per parent.
•When training a CART decision tree, the best split is chosen by minimizing the Gini Impurity.
•CART uses post-pruning after growing an overly large tree.
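For reference, the Gini impurity that CART minimizes at each split is Gini = 1 − Σ_i p_i², where p_i is the proportion of class i at the node (a pure node has Gini = 0). A tiny sketch:

def gini(counts):
    # counts: number of examples of each class at a node
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([8, 4]))   # about 0.444 for a node with 8 "yes" and 4 "no"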