
MODULE-3

Data Similarity and Dissimilarity; Classification

A) Measuring Data Similarity and Dissimilarity:


Similarity and Dissimilarity between Simple Attributes:
The proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes, and thus, we first discuss proximity between objects having a single attribute. For objects with a single nominal attribute, proximity is straightforward: similarity is 1 if the attribute values match and 0 otherwise, and dissimilarity is defined the opposite way.
For objects with a single ordinal attribute, the situation is more complicated because information about order must be taken into account. Consider an attribute that measures the quality of a product, e.g., a candy bar, on the scale {poor, fair, OK, good, wonderful}. It would seem reasonable that a product P1, which is rated wonderful, would be closer to a product P2, which is rated good, than to a product P3, which is rated OK. To make this observation quantitative, the values of the ordinal attribute are often mapped to successive integers, beginning at 0 or 1, e.g., {poor=0, fair=1, OK=2, good=3, wonderful=4}. Then d(P1, P2) = 4 − 3 = 1 or, if we want the dissimilarity to fall between 0 and 1, d(P1, P2) = (4 − 3)/4 = 0.25. A similarity for ordinal attributes can then be defined as s = 1 − d.
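To make the ordinal mapping concrete, here is a minimal Python sketch (the function names are illustrative, not from the text) of the normalized dissimilarity and the similarity s = 1 − d:

scale = ["poor", "fair", "OK", "good", "wonderful"]
rank = {value: i for i, value in enumerate(scale)}   # poor=0, ..., wonderful=4

def ordinal_dissimilarity(a, b):
    # Absolute rank difference, normalized by the range so d falls in [0, 1].
    return abs(rank[a] - rank[b]) / (len(scale) - 1)

def ordinal_similarity(a, b):
    return 1 - ordinal_dissimilarity(a, b)

print(ordinal_dissimilarity("wonderful", "good"))   # (4 - 3) / 4 = 0.25
print(ordinal_similarity("wonderful", "good"))      # 0.75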
This definition of similarity (dissimilarity) for an ordinal attribute should make the reader a bit uneasy, since it assumes equal intervals between successive values, which is not necessarily so; otherwise, we would have an interval or ratio attribute. Is the difference between the values fair and good really the same as that between the values OK and wonderful? Probably not, but in practice our options are limited, and in the absence of more information, this is the standard approach for defining proximity between ordinal attributes. For interval or ratio attributes, the natural measure of dissimilarity between two objects is the absolute difference of their values. For example, we might compare our current weight and our weight a year ago by saying "I am ten pounds heavier." In cases such as these, the dissimilarities typically range from 0 to ∞, rather than from 0 to 1. The similarity of interval or ratio attributes is typically expressed by transforming a dissimilarity into a similarity.
Dissimilarities and Similarities between Data Objects:

Common Properties of Dissimilarity Measures

Distance, such as the Euclidean distance, is a dissimilarity measure and has some well-known properties:
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q (positivity),
2. d(p, q) = d(q, p) for all p and q (symmetry),
3. d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r (triangle inequality),
where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.
A distance that satisfies these properties is called a metric. The following is a list of several common distance measures used to compare multivariate data. We will assume that the attributes are all continuous.

a) Euclidean Distance
Assume that we have measurements x_ik, i = 1, …, N, on variables k = 1, …, p (also called attributes).
The Euclidean distance between the ith and jth objects is

d_E(i, j) = ( Σ_k (x_ik − x_jk)² )^(1/2)

for every pair (i, j) of observations. The weighted Euclidean distance is

d_WE(i, j) = ( Σ_k w_k (x_ik − x_jk)² )^(1/2)

If the scales of the attributes differ substantially, standardization is necessary.

b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements x_ik, i = 1, …, N, k = 1, …, p, the Minkowski distance is

d_M(i, j) = ( Σ_k |x_ik − x_jk|^λ )^(1/λ)

where λ ≥ 1. It is also called the Lλ metric.
λ = 1: L1 metric, Manhattan or City-block distance.
λ = 2: L2 metric, Euclidean distance.
λ → ∞: L∞ metric, Supremum distance.
Note that λ and p are two different parameters; the dimension p of the data matrix remains finite.
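The following minimal Python sketch (function names assumed for illustration) shows how the L1, L2, and L∞ metrics all arise from the same Minkowski formula:

def minkowski(x, y, lam):
    # L-lambda metric: lam = 1 gives Manhattan, lam = 2 gives Euclidean.
    return sum(abs(a - b) ** lam for a, b in zip(x, y)) ** (1 / lam)

def supremum(x, y):
    # Limit lam -> infinity: the largest coordinate-wise difference.
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 2), (3, 6)
print(minkowski(x, y, 1))   # 7.0 (L1, City-block)
print(minkowski(x, y, 2))   # 5.0 (L2, Euclidean)
print(supremum(x, y))       # 4 (L-infinity)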

c) Mahalanobis Distance
Let X be an N×p data matrix. Then the ith row of X is

x_i^T = (x_i1, x_i2, …, x_ip)

The Mahalanobis distance between the ith and jth objects is

d_MH(i, j) = ( (x_i − x_j)^T Σ^(-1) (x_i − x_j) )^(1/2)

where Σ is the p×p sample covariance matrix.
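A minimal sketch of this computation, assuming NumPy is available (the toy data matrix below is made up for illustration):

import numpy as np

def mahalanobis(xi, xj, X):
    # Distance between rows xi and xj, using the sample covariance of X.
    cov = np.cov(X, rowvar=False)              # p x p sample covariance matrix
    diff = xi - xj
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Toy data matrix: N = 5 objects measured on p = 2 correlated variables.
X = np.array([[64.0, 580.0], [66.0, 570.0], [68.0, 590.0],
              [69.0, 660.0], [73.0, 600.0]])
print(mahalanobis(X[0], X[1], X))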


Common Properties of Similarity Measures

Similarities have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q,
2. s(p, q) = s(q, p) for all p and q, where s(p, q) is the similarity between data objects p and q.
Examples of Proximity Measures:
This section provides specific examples of some similarity and dissimilarity
measures.

Similarity Measures for Binary Data:

Similarity measures between objects that contain only binary attributes are called
similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates
that the two objects are completely similar, while a value of 0 indicates that the objects
are not at all similar.
Let x and y be two objects that consist of n binary attributes. The comparison of two such
objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as

SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)

Jaccard Coefficient: Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:

J = (number of matching presences) / (number of attributes not involved in 00 matches) = f11 / (f01 + f10 + f11)

Cosine Similarity: The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then

cos(x, y) = (x · y) / (||x|| ||y||)

where · indicates the vector dot product and ||x|| is the length of the vector.
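A minimal Python sketch of these three measures, computed from the f00/f01/f10/f11 counts defined above (the vectors and names are illustrative):

import math

def binary_counts(x, y):
    f00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    f01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    f10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    f11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    return f00, f01, f10, f11

def smc(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return (f11 + f00) / (f00 + f01 + f10 + f11)

def jaccard(x, y):
    f00, f01, f10, f11 = binary_counts(x, y)
    return f11 / (f01 + f10 + f11)

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = lambda v: math.sqrt(sum(a * a for a in v))
    return dot / (norm(x) * norm(y))

x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(smc(x, y), jaccard(x, y))   # 0.7 0.0: the many matching zeros inflate SMC

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 2))   # 0.31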

Extended Jaccard Coefficient:


The extended Jaccard coefficient can be used for document data and reduces to the Jaccard coefficient in the case of binary attributes. The extended Jaccard coefficient, also known as the Tanimoto coefficient, is defined as

EJ(x, y) = (x · y) / (||x||² + ||y||² − x · y)

Correlation: The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects. Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:

corr(x, y) = covariance(x, y) / (standard_deviation(x) · standard_deviation(y)) = s_xy / (s_x s_y)

where s_xy is the sample covariance of x and y, and s_x and s_y are their sample standard deviations.
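A minimal sketch of Pearson's correlation written out directly from the covariance and standard-deviation definitions (the n − 1 denominators cancel, but are kept for clarity):

import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return sxy / (sx * sy)

# y = -x/3, a perfect negative linear relationship:
print(pearson([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))   # -1.0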

Issues in Proximity Calculation:


This section discusses several important issues related to proximity measures:
(1) how to handle the case in which attributes have different scales and/or are correlated,
(2) how to calculate proximity between objects that are composed of different types of attributes, e.g., quantitative and qualitative, and
(3) how to handle proximity calculation when attributes have different weights, i.e., when not all attributes contribute equally to the proximity of objects.
A generalization of Euclidean distance, the Mahalanobis distance, is useful when attributes are correlated, have different ranges of values (different variances), and the distribution of the data is approximately Gaussian (normal). Specifically, the Mahalanobis distance between two objects (vectors) x and y is defined as

mahalanobis(x, y) = (x − y)^T Σ^(-1) (x − y)

where Σ^(-1) is the inverse of the covariance matrix of the data.

Combining Similarities for Heterogeneous Attributes :


A general approach is needed when the attributes are of different types. One straightforward approach is to compute the similarity for each attribute separately, and then combine these similarities using a method that results in a similarity between 0 and 1. Typically, the overall similarity is defined as the average of all the individual attribute similarities.
Using Weights:
The preceding definitions of proximity are not desirable when some attributes are more important than others. To address these situations, the formulas for proximity can be modified by weighting the contribution of each attribute.

If the weights w_k sum to 1, the overall similarity becomes

similarity(x, y) = Σ_k w_k s_k(x, y)

The definition of the Minkowski distance can also be modified as follows:

d(x, y) = ( Σ_k w_k |x_k − y_k|^r )^(1/r)

Some common issues associated with proximity measures:
1. Sensitivity to Scale:
 Problem: Many proximity measures are sensitive to the scale of the variables. If the
scales are not standardized, variables with larger magnitudes may dominate the distance
calculations.
 Solution: Standardize or normalize the variables before applying proximity measures to ensure that all variables contribute equally (see the sketch after this list).
2. Dimensionality:
 Problem: In high-dimensional spaces, the distance between points may become less
meaningful due to the "curse of dimensionality." This can lead to increased computational
complexity and decreased performance.
 Solution: Dimensionality reduction techniques or feature selection methods can be
applied to address this issue.
3. Assumption of Linearity:
 Problem: Some proximity measures, like Euclidean distance, assume linear
relationships between variables. In non-linear scenarios, these measures may not capture
the true underlying similarities.
 Solution: Consider using proximity measures that are more suitable for non-linear
relationships, or transform the data to make it more linear if appropriate.
4. Outliers:
 Problem: Proximity measures can be sensitive to outliers, which might
disproportionately influence the results. Outliers can distort distance calculations and lead
to inaccurate similarity assessments.
 Solution: Robust proximity measures or outlier detection/preprocessing techniques
can be employed to mitigate the impact of outliers.
5. Metric vs. Non-metric Measures:
 Problem: Some proximity measures may violate the triangle inequality, a key property for
metrics. Non-metric measures can lead to inconsistencies in clustering or classification
algorithms.
 Solution: Carefully choose measures that satisfy the metric properties when working with
algorithms that assume metric distances.
6. Subjectivity in Measure Selection:
 Problem: The choice of proximity measure may depend on the specific
characteristics of the data and the problem at hand. Different measures may yield different
results.
 Solution: Understand the characteristics of your data and the requirements of your
application, and choose a proximity measure accordingly. Sensitivity analysis can also be
performed to assess the impact of different measures.
7. Data Sparsity:
 Problem: In sparse datasets, where many entries are missing or zero, traditional proximity
measures may not provide accurate similarity assessments.
 Solution: Consider using specialized measures designed for sparse data, or impute missing values before applying proximity measures.
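As mentioned under issue 1, a common fix for scale sensitivity is z-score standardization. A minimal sketch, assuming NumPy is available (the height/income numbers are made up):

import numpy as np

def standardize(X):
    # Rescale each column to zero mean and unit standard deviation.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = [[170.0, 65000.0], [160.0, 48000.0], [180.0, 52000.0]]
print(standardize(X))   # both columns now contribute comparably to distances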
Selection of the Right Proximity Measure:
Selecting the right proximity measure, also known as a similarity or distance measure, is crucial in data mining tasks such as clustering, classification, and recommendation systems. The choice of proximity measure depends on the nature of your data and the specific goals of your analysis. Here are some commonly used proximity measures and factors to consider when selecting them:

1. Euclidean Distance:

 Formula: d(x, y) = √( Σ_k (x_k − y_k)² )
 Use Case: Suitable for continuous numerical data.
 Considerations: Sensitive to outliers.
2. Manhattan Distance (L1 Norm):
 Formula: d(x, y) = Σ_k |x_k − y_k|
 Use Case: Suitable for sparse data and less sensitive to outliers than Euclidean
distance.
3. Cosine Similarity:

 Formula: cos(x, y) = (x · y) / (||x|| ||y||)
 Use Case: Effective for text data, document similarity, and high-dimensional data.
 Considerations: Ignores magnitude and focuses on direction.
4. Jaccard Similarity:

 Formula: J(A, B) = |A ∩ B| / |A ∪ B|
 Use Case: Suitable for binary or categorical data; often used in set comparisons.
5. Hamming Distance:
 Formula: Number of positions at which the corresponding symbols differ.
 Use Case: Applicable to binary or categorical data of the same length.
6. Minkowski Distance:

 Formula: d(x, y) = ( Σ_k |x_k − y_k|^p )^(1/p)
 Use Case: Generalization of Euclidean and Manhattan distances; the parameter p
determines the norm.
7. Correlation-based Measures:
 Pearson Correlation Coefficient: Measures linear correlation.
 Spearman Rank Correlation Coefficient: Measures monotonic relationships.
 Use Case: Suitable for comparing the relationship between variables.
8. Mahalanobis Distance:

 Formula: d(x, y) = √( (x − y)^T Σ^(-1) (x − y) )
 Use Case: Effective when dealing with multivariate data with different scales.

When selecting a proximity measure, consider the following factors:


 Data Type: Choose a measure that is appropriate for the type of data you are working with
(e.g., numerical, categorical, binary).
 Scale Sensitivity: Some measures are sensitive to the scale of the variables, so standardize
or normalize your data if needed.
 Domain Knowledge: Consider the characteristics of your data and the problem domain.
 Computational Complexity: Some measures may be computationally expensive, especially
with large datasets.
 Noise and Outliers: Choose a measure that is robust to noise and outliers if your data
contains them.
 Interpretability: Consider the interpretability of the measure in the context of your analysis.

B) Classification: Classification, the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes.

Fig: Classification. A classification model maps an input attribute set (x) to an output class label (y).
The input data for a classification task is a collection of records. Each record, also known as an instance or
example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated
as the class label (also known as category or target attribute).
Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories:
mammal, bird, fish, reptile, or amphibian.
The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of
reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.1 are
mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand,
must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a
predictive modeling task in which y is a continuous attribute.

Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y.

The target function is also known informally as a classification model. A classification model is useful
for the following purposes
Descriptive Modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful, for both biologists and others, to have a descriptive model that summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.
Predictive Modeling: A classification model can also be used to predict the class label of unknown
records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically
assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the
following characteristics of a creature known as a gila monster:
Name          Body Temperature  Skin Cover  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates  Class Label
gila monster  cold-blooded      scales      no           no                no               yes       yes         ?
Figure : 4.2
Classification techniques are most suited for predicting or describing data sets with binary or nominal
categories. They are less effective for ordinal categories (e.g., to classify a person as a member of high-,
medium-, or low- income group) because they do not consider the implicit order among the categories.
General Approach to Solving a Classification Problem:

A classification technique (or classifier) is a systematic approach to building classification models from an
input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support
vector machines, and naïve Bayes classifiers. Each technique employs a learning algorithm to identify
a model that best fits the relationship between the attribute set and class label of the input data. The model
generated by a learning algorithm should both fit the input data well and correctly predict the class labels
of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models
with good generalization capability; i.e., models that accurately predict the class labels of previously
unknown records.
                      Predicted Class
                      Class = 1   Class = 0
Actual    Class = 1   f11         f10
Class     Class = 0   f01         f00

Table 4.2. Confusion matrix for a 2-class problem.

Figure 4.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of records with
unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test
records correctly and incorrectly predicted by the model. These counts are tabulated in a
table known as a confusion matrix. Table 4.2 depicts the confusion matrix for a binary
classification problem.

Each entry fij in this table denotes the number of records from class i predicted to be of class j. For
instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in
the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the
total number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number makes it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:

Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00) = Number of correct predictions / Total number of predictions


Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by
the following equation:

Error rate = (f10 + f01) / (f11 + f10 + f01 + f00) = Number of wrong predictions / Total number of predictions
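A minimal sketch of these two metrics computed from confusion-matrix counts (the counts below are made up, not from the text):

def accuracy(f11, f10, f01, f00):
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def error_rate(f11, f10, f01, f00):
    return (f10 + f01) / (f11 + f10 + f01 + f00)

f11, f10, f01, f00 = 40, 10, 5, 45      # hypothetical test-set counts
print(accuracy(f11, f10, f01, f00))     # 0.85
print(error_rate(f11, f10, f01, f00))   # 0.15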
Decision Tree Induction:

How a Decision Tree Works:

Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-
mammal? One approach is to pose a series of questions about the characteristics of the species. The first
question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is
definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a follow-up question: Do the females of the species give birth to their young? Those that do give birth are definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater). This series of questions and their possible answers can be organized into a decision tree. The tree has three types of nodes:

• A root node that has no incoming edges and zero or more outgoing edges.

• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.

• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include
the root and other internal nodes, contain attribute test conditions to separate records that have different
characteristics. For example, the root node shown in Figure uses the attribute Body Temperature to separate
warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals,
a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is
warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-
blooded creatures, which are mostly birds.

Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root
node, we apply the test condition to the record and follow the appropriate branch based on the outcome of
the test.

How to Build a Decision Tree:


In principle, there are exponentially many decision trees that can be constructed from a given set of
attributes. While some of the trees are more accurate than others, finding the optimal tree is computationally
infeasible because of the exponential size of the search space. Nevertheless, efficient algorithms have been
developed to induce a reasonably accurate, albeit suboptimal, decision tree in a reasonable amount of time.
These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally
optimum decisions about which attribute to use for partitioning the data. One such algorithm is Hunt's algorithm, which is the basis of many existing decision tree induction algorithms.
Hunt’s Algorithm

In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets. Let Dt be the set of training records that are associated with node t and
y = {y1, y2, . . . , yc} be the class labels. The following is a recursive definition of Hunt’s algorithm.

Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.

Step 2: If Dt contains records that belong to more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then recursively applied to each child node, as sketched in the code below.
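A simplified Python sketch of this recursion (all helper names are assumed; records are dicts with a "label" key, and the attribute chooser anticipates the entropy/information-gain criterion discussed later):

import math
from collections import Counter

def entropy(records):
    counts = Counter(r["label"] for r in records)
    n = len(records)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def majority_label(records):
    return Counter(r["label"] for r in records).most_common(1)[0][0]

def partition(records, attr):
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r)
    return groups

def choose_best_attribute(records, attributes):
    # Lowest weighted entropy after the split = highest information gain.
    def split_entropy(attr):
        return sum(len(s) / len(records) * entropy(s)
                   for s in partition(records, attr).values())
    return min(attributes, key=split_entropy)

def hunt(records, attributes):
    labels = {r["label"] for r in records}
    if len(labels) == 1:                  # Step 1: pure node becomes a leaf
        return {"leaf": labels.pop()}
    if not attributes:                    # identical attribute values: majority class
        return {"leaf": majority_label(records)}
    attr = choose_best_attribute(records, attributes)   # Step 2: split and recurse
    return {"test": attr,
            "children": {v: hunt(subset, [a for a in attributes if a != attr])
                         for v, subset in partition(records, attr).items()}}

Note that partition() only creates children for attribute values that actually occur in the data, so the empty-child case (condition 1 above) does not arise in this sketch.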

Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1    Yes         Single          125K           No
2    No          Married         100K           No
3    No          Single          70K            No
4    Yes         Married         120K           No
5    No          Divorced        95K            Yes
6    No          Married         60K            No
7    Yes         Divorced        220K           No
8    No          Single          85K            Yes
9    No          Married         75K            No
10   No          Single          90K            Yes

Figure 4.6. Training set for predicting borrowers who will default on loan payments.

The initial tree for the classification problem contains a single node with class label Defaulted = No (see
Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however,
needs to be refined since the root node contains records from both classes. The records are subsequently
divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure
4.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will
assume that this is the best criterion for splitting the data at this point. Hunt’s algorithm is then applied
recursively to each child of the root node. From the training set given in Figure 4.6, notice that all
borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf
node labeled Defaulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the
recursive step of Hunt’s algorithm until all the records belong to the same class. The trees resulting from
each recursive step are shown in Figures 4.7(c) and (d).
Hunt’s algorithm will work if every combination of attribute values is present in the training data and
each combination has a unique class label. These assumptions are too stringent for use in most practical
situations. Additional conditions are needed to handle the following cases:

1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no records
associated with these nodes. This can happen if none of the training records have the combination of attribute
values associated with such nodes. In this case the node is declared a leaf node with the same class label
as the majority class of training records associated with its parent node.

2. In Step 2, if all the records associated with Dt have identical attribute values (except for the class
label), then it is not possible to split these records any further. In this case, the node is declared a leaf node
with the same class label as the majority class of training records associated with this node.

Example of how to build Decision tree:

Step 1: Determine the Decision Column


Since decision trees are used for classification, you need to determine the classes which are the basis for the decision.
In this case, it is the last column, that is, the Play Golf column, with classes Yes and No.
To determine the root Node we need to compute the entropy.
To do this, we create a frequency table for the classes (the Yes/No column).

Table 2: Frequency Table (Play Golf: Yes = 9, No = 5)

Step 2: Calculating Entropy for the classes (Play Golf)


In this step, you need to calculate the entropy for the Play Golf column and the calculation step is given
below.

Entropy(Play Golf) = E(5, 9) = −(5/14)·log2(5/14) − (9/14)·log2(9/14) = 0.94
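A quick Python check of this and the other entropy values used below (the helper name E is chosen to match the notation):

import math

def E(*counts):
    # Entropy of a class distribution given raw counts, e.g. E(5, 9).
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

print(round(E(5, 9), 2))    # 0.94 -> Entropy(Play Golf)
print(round(E(3, 2), 3))    # 0.971
print(round(E(4, 0), 3))    # 0.0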

Step 3: Calculate Entropy for Other Attributes After Split


For the other four attributes, we need to calculate the entropy after each of the splits.

 E(Play Golf, Outlook)


 E(Play Golf, Temperature)
 E(Play Golf, Humidity)
 E(Play Golf,Windy)

The entropy for two variables is calculated using the formula:

E(T, X) = Σ_{c∈X} P(c) · E(c)

(Don't worry about this formula; doing the calculation is really easy.)
Thus, to calculate E(Play Golf, Outlook), we would use:

E(Play Golf, Outlook) = P(Sunny)·E(3,2) + P(Overcast)·E(4,0) + P(Rainy)·E(2,3)

This formula may look unfriendly, but it is quite clear. The easiest way to approach this calculation is to create a frequency table for the two variables, that is, Play Golf and Outlook.

This frequency table is given below:

Table 3: Frequency Table for Outlook (Sunny: 3 Yes, 2 No; Overcast: 4 Yes, 0 No; Rainy: 2 Yes, 3 No)


Using this table, we can then calculate E(Play Golf, Outlook) term by term.

Let's go ahead and calculate E(3,2):

E(3, 2) = −(3/5)·log2(3/5) − (2/5)·log2(2/5) = 0.971


We would not need to calculate the second and the third terms! This is because

E(4, 0) = 0
E(2, 3) = E(3, 2)

Isn't this interesting!
Just for clarification, let's show the calculation steps.
The calculation steps for E(4,0):

E(4, 0) = −(4/4)·log2(4/4) − (0/4)·log2(0/4) = 0  (taking 0·log2(0) = 0)

The calculation step for E(2,3) is given below:

E(2, 3) = −(2/5)·log2(2/5) − (3/5)·log2(3/5) = 0.971

Time to put it all together.

We go ahead and calculate E(Play Golf, Outlook) by substituting the values we calculated for E(Sunny), E(Overcast) and E(Rainy) into the equation:

E(PlayGolf, Outlook) = P(Sunny)·E(3,2) + P(Overcast)·E(4,0) + P(Rainy)·E(2,3)
                     = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) = 0.693

E(PlayGolf, Temperature) Calculation: Just like in the previous calculation, the calculation of E(Play Golf, Temperature) is given below. It is easier to do if you form the frequency table for the split on Temperature, as shown.

Table 4: Frequency Table for Temperature (Hot: 2 Yes, 2 No; Cold: 3 Yes, 1 No; Mild: 4 Yes, 2 No)

E(PlayGolf, Temperature) = P(Hot)·E(2,2) + P(Cold)·E(3,1) + P(Mild)·E(4,2) = 0.911

E(Play Golf, Humidity) Calculation

Just like in the previous calculation, the calculation of E(Play Golf, Humidity) is given below. It is easier to do if you form the frequency table for the split on Humidity, as shown.

Table 5: Frequency Table for Humidity (High: 3 Yes, 4 No; Normal: 6 Yes, 1 No)

E(PlayGolf, Humidity) = P(High)·E(3,4) + P(Normal)·E(6,1) = 0.788

E(PlayGolf, Windy) Calculation

Just like in the previous calculation, the calculation of E(PlayGolf, Windy) is given below. It is easier to do if you form the frequency table for the split on Windy, as shown.

Table 6: Frequency Table for Windy (True: 3 Yes, 3 No; False: 6 Yes, 2 No)

E(PlayGolf, Windy) = P(True)·E(3,3) + P(False)·E(6,2) = 0.892

Wow! That is so much work! So take a break, walk around a little, and have a glass of cold water.
Then we continue.
Now that we have the entropies for all four attributes, let's go ahead and summarize them as shown below:

1. E(Play Golf, Outlook) = 0.693
2. E(Play Golf, Temperature) = 0.911
3. E(Play Golf, Humidity) = 0.788
4. E(Play Golf, Windy) = 0.892

Step 4: Calculating Information Gain for Each Split


The next step is to calculate the information gain for each of the attributes. The information gain is calculated
from the split using each of the attributes. Then the attribute with the largest information gain is used for the
split.

The information gain is calculated using the formula:

Gain(S,T) = Entropy(S) – Entropy(S,T)

For example, the information gain after splitting using the Outlook attribute is given by:

Gain(Play Golf, Outlook) = Entropy(Play Golf) – Entropy(Play Golf, Outlook)

So let’s go ahead to do the calculation

Gain(Play Golf, Outlook) = Entropy(Play Golf) – Entropy(Play Golf, Outlook)


= 0.94 – 0.693 = 0.247
Gain(Play Golf, Temperature) = Entropy(Play Golf) – Entropy(Play Golf, Temperature)
= 0.94 – 0.911 = 0.029
Gain(Play Golf, Humidity) = Entropy(Play Golf) – Entropy(Play Golf, Humidity)
= 0.94 – 0.788 = 0.152
Gain(Play Golf, Windy) = Entropy(Play Golf) – Entropy(Play Golf, Windy)
= 0.94 – 0.892 = 0.048
Having calculated all the information gain, we now choose the attribute that gives the highest information
gain after the split.
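A quick check of this arithmetic using the entropy values summarized above:

entropy_play_golf = 0.94
split_entropy = {"Outlook": 0.693, "Temperature": 0.911,
                 "Humidity": 0.788, "Windy": 0.892}

gains = {attr: round(entropy_play_golf - e, 3)
         for attr, e in split_entropy.items()}
print(gains)                        # Outlook has the largest gain (0.247)
print(max(gains, key=gains.get))    # 'Outlook' -> attribute for the first split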
Step 5: Perform the First Split
Draw the First Split of the Decision Tree
Now that we have all the information gain, we then split the tree based on the attribute with the highest
information gain.

From our calculation, the highest information gain comes from Outlook. Therefore the split will look like
this:

Figure 2: Decision Tree after first split


Now that we have the first stage of the decision tree, we see that we have one leaf node. But we still need to split the tree further.

To do that, we also need to split the original table to create sub-tables.
These sub-tables are given below.

Table 7: Initial Split using Outlook


From Table 3, we can see that the Overcast outlook requires no further splitting because it is just one homogeneous group, so we have a leaf node.

Step 6: Perform Further Splits

The Sunny and the Rainy outlooks need to be split further.

The Rainy outlook can be split using either Temperature, Humidity or Windy.

What attribute would best be used for this split? Why?

Answer: Humidity. Because it produces homogenous groups.

Table 8: Split using Humidity


The Rainy branch can then be split using the High and Normal values of Humidity, and that would give us the tree below.

Figure 3: Split using the Humidity Attribute

Let's now go ahead and do the same thing for the Sunny outlook. The Sunny outlook can be split using either Temperature, Humidity or Windy.

What attribute would best be used for this split? Why?


Answer: Windy, because it produces homogeneous groups.
Table 9: Split using Windy Attribute
If we do the split using the Windy attribute, we would have the final tree, which requires no further splitting! This is shown in Figure 4.

Step 7: Complete the Decision Tree


The complete tree is shown in Figure 4.
Note that the same calculation that was used initially could also be used for the further splits. But that is not necessary, since you can just look at the sub-table and determine which attribute to use for the split.

Figure 4: Final Decision Tree

Final Notes:
Now we have successfully completed the decision tree.
I think we need to celebrate with a bottle of beer!
This is how easy it is to build a decision tree. Remember, the initial steps of calculating the entropy and the gain are the most difficult part. But after that, everything falls into place.

Algorithm for Decision tree induction:


Step 1:
Input:
Training dataset with features and corresponding labels.
Step 2:
Decision Tree Initialization:
Create a root node for the decision tree.
Step 3:
Select Best Attribute for Splitting:
Evaluate each attribute and measure impurity (e.g., Gini index, entropy, information gain) for
possible splits.
Choose the attribute that provides the best split, i.e., maximizes purity or minimizes
impurity.
Step 4:
Create a Decision Node:
Split the dataset based on the chosen attribute.
Create a decision node in the tree, representing the decision based on the selected attribute.
Step 5:
Recursive Splitting:
For each subset created by the split, repeat the process recursively:
If the subset is pure (contains only one class for classification), create a leaf node with the
corresponding class label.
If the subset is impure, go back to step 3 and repeat the process.
Step 6:
Stopping Criteria:
Define stopping criteria to halt the tree-building process. This helps avoid overfitting.
Examples of stopping criteria include a maximum depth limit, a minimum number of samples
in a node, or a minimum impurity threshold.
Step 7:
Tree Pruning (Optional):
Post-process the tree to reduce its size and complexity, aiming to improve generalization on
new data.
Pruning involves removing branches that do not contribute significantly to predictive
accuracy.
Step 8:
Output:
The resulting decision tree, with nodes representing decisions and leaves representing class labels or regression values. A small end-to-end sketch of these steps follows.
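The snippet below (assuming scikit-learn is available; the encoded data is a made-up PlayGolf-style sample, not the full table) grows a tree with entropy-based splits and a depth-based stopping criterion:

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: [outlook, temperature, humidity, windy] as integers.
X = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0], [2, 1, 0, 0],
     [2, 2, 1, 0], [2, 2, 1, 1], [1, 2, 1, 1]]
y = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes"]

# criterion="entropy" selects splits by information gain (Step 3);
# max_depth is one possible stopping criterion (Step 6).
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
clf.fit(X, y)
print(export_text(clf, feature_names=["outlook", "temperature",
                                      "humidity", "windy"]))
print(clf.predict([[0, 1, 1, 0]]))   # predicted class label for a new record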
