CS-DM MODULE- 3
a) Euclidean Distance
Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also called attributes).
The Euclidean distance between the ith and jth objects is
$$d(x_i, x_j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is
$$d(x_i, x_j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$$
where $\lambda \geq 1$. It is also called the $L_\lambda$ metric.
$\lambda = 1$: $L_1$ metric, Manhattan or City-block distance.
$\lambda = 2$: $L_2$ metric, Euclidean distance.
$\lambda \to \infty$: $L_\infty$ metric, Supremum distance.
Note that $\lambda$ and $p$ are two different parameters: $\lambda$ can grow to infinity, while the dimension $p$ of the data matrix remains finite.
c) Mahalanobis Distance
Let X be an N×p matrix. Then the ith row of X is
$$x_i^T = (x_{i1}, x_{i2}, \ldots, x_{ip})$$
The Mahalanobis distance between the ith and jth objects is
$$d(x_i, x_j) = \left( (x_i - x_j)^T S^{-1} (x_i - x_j) \right)^{1/2}$$
where S is the sample covariance matrix of the data.
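As a quick illustration, here is a minimal NumPy sketch of the three distances defined above; the toy data matrix and function names are made up for the example, not part of the notes.

```python
import numpy as np

def euclidean(xi, xj):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, lam=3):
    """L-lambda distance; lam=1 gives Manhattan, lam=2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** lam) ** (1.0 / lam)

def mahalanobis(xi, xj, S):
    """Distance that accounts for scale and correlation via the covariance matrix S."""
    diff = xi - xj
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

# Toy data matrix X with N = 4 objects and p = 2 variables (values are arbitrary).
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [5.0, 5.0]])
S = np.cov(X, rowvar=False)            # p x p sample covariance matrix
print(euclidean(X[0], X[1]))           # 2.236...
print(minkowski(X[0], X[1], lam=1))    # 3.0 (Manhattan)
print(mahalanobis(X[0], X[1], S))
```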
Common Properties of Similarity Measures
Similarity measures between objects that contain only binary attributes are called
similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates
that the two objects are completely similar, while a value of 0 indicates that the objects
are not at all similar.
Let x and y be two objects that consist of n binary attributes. The comparison of two such
objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as
$$SMC = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}$$
Jaccard Coefficient: Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:
$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
Cosine Similarity: The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
where $x \cdot y$ is the dot product of the two vectors and $\|x\|$ is the length (Euclidean norm) of vector x.
Correlation: The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects. Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:
$$corr(x, y) = \frac{covariance(x, y)}{standard\_deviation(x) \times standard\_deviation(y)} = \frac{s_{xy}}{s_x \, s_y}$$
where $s_{xy}$ is the sample covariance of x and y, and $s_x$, $s_y$ are their sample standard deviations.
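A minimal Python sketch of these four measures, assuming plain NumPy arrays as input; the function names and example vectors are illustrative, not taken from a particular library.

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient for binary vectors: matches / total attributes."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x == y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f01 + f10 + f11)."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    denom = np.sum(x | y)          # attributes not involved in 0-0 matches
    return np.sum(x & y) / denom if denom else 0.0

def cosine_similarity(x, y):
    """Cosine of the angle between two (document) vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    return np.corrcoef(x, y)[0, 1]

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x, y), jaccard(x, y))    # 0.7 and 0.0 for these binary vectors
```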
1. Euclidean Distance:
Formula: $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
Use Case: Suitable for continuous numerical data.
Considerations: Sensitive to outliers.
2. Manhattan Distance (L1 Norm):
Formula: $d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$
Use Case: Suitable for sparse data and less sensitive to outliers than Euclidean distance.
3. Cosine Similarity:
Formula: $\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$
Use Case: Effective for text data, document similarity, and high-dimensional data.
Considerations: Ignores magnitude and focuses on direction.
4. Jaccard Similarity:
Formula: $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
Use Case: Suitable for binary or categorical data; often used in set comparisons.
5. Hamming Distance:
Formula: Number of positions at which the corresponding symbols differ.
Use Case: Applicable to binary or categorical data of the same length.
6. Minkowski Distance:
Formula: $d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}$
Use Case: Generalization of Euclidean and Manhattan distances; the parameter p determines the norm.
7. Correlation-based Measures:
Pearson Correlation Coefficient: Measures linear correlation.
Spearman Rank Correlation Coefficient: Measures monotonic relationships.
Use Case: Suitable for comparing the relationship between variables.
8. Mahalanobis Distance:
Formula: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where S is the covariance matrix of the data.
Use Case: Effective when dealing with multivariate data with different scales.
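Most of these measures are available off the shelf. The sketch below uses SciPy's scipy.spatial.distance and scipy.stats modules; the example vectors, the thresholding used to binarize them for Jaccard/Hamming, and the random sample used for the covariance are arbitrary illustration choices.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 0.0, 3.0, 2.0])
y = np.array([2.0, 1.0, 0.0, 2.0])

print(distance.euclidean(x, y))        # 1. Euclidean
print(distance.cityblock(x, y))        # 2. Manhattan (L1)
print(1 - distance.cosine(x, y))       # 3. cosine similarity (SciPy returns the distance)
print(distance.jaccard(x > 0, y > 0))  # 4. Jaccard dissimilarity on binarized vectors
print(distance.hamming(x > 0, y > 0))  # 5. fraction of positions that differ
print(distance.minkowski(x, y, p=3))   # 6. Minkowski with order p = 3
print(pearsonr(x, y)[0], spearmanr(x, y)[0])  # 7. correlation-based measures

# 8. Mahalanobis distance needs the inverse covariance of a data sample.
X = np.random.default_rng(0).normal(size=(50, 4))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```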
B) Classification: Classification, the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes.
Fig: Classification (input → classification model → output)
The input data for a classification task is a collection of records. Each record, also known as an instance or
example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated
as the class label (also known as category or target attribute).
Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories:
mammal, bird, fish, reptile, or amphibian.
The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of
reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.1 are
mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand,
must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a
predictive modeling task in which y is a continuous attribute.
Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y.
The target function is also known informally as a classification model. A classification model is useful
for the following purposes
Descriptive Modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.
Predictive Modeling: A classification model can also be used to predict the class label of unknown
records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically
assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the
following characteristics of a creature known as a gila monster:
Name         | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
gila monster | cold-blooded     | scales     | no          | no               | no              | yes      | yes        | ?
Figure 4.2
Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of the high-, medium-, or low-income group) because they do not consider the implicit order among the categories.
General Approach to Solving a Classification Problem:
A classification technique (or classifier) is a systematic approach to building classification models from an
input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support
vector machines, and naïve Bayes classifiers. Each technique employs a learning algorithm to identify
a model that best fits the relationship between the attribute set and class label of the input data. The model
generated by a learning algorithm should both fit the input data well and correctly predict the class labels
of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models
with good generalization capability; i.e., models that accurately predict the class labels of previously
unknown records.
Table 4.2. Confusion matrix for a binary classification problem.

                         Predicted Class
                         Class = 1    Class = 0
Actual     Class = 1     f11          f10
Class      Class = 0     f01          f00
Figure 4.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of records with
unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test
records correctly and incorrectly predicted by the model. These counts are tabulated in a
table known as a confusion matrix. Table 4.2 depicts the confusion matrix for a binary
classification problem.
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For
instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in
the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the
total number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number would make it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
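A short sketch of these counts in Python, assuming the actual and predicted labels are given as plain lists; the label values here are made up for the example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return the four confusion-matrix entries f11, f10, f01, f00 for binary labels."""
    counts = Counter(zip(actual, predicted))
    return {f"f{a}{p}": counts[(a, p)] for a in (1, 0) for p in (1, 0)}

actual    = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
f = confusion_counts(actual, predicted)
accuracy = (f["f11"] + f["f00"]) / sum(f.values())
print(f, accuracy)   # 3 + 3 correct out of 8 -> accuracy = 0.75
```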
Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-
mammal? One approach is to pose a series of questions about the characteristics of the species. The first
question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is
definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a
follow-up question: Do the females of the species give birth to their young? Those that do give birth are
definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater). This series of questions and their possible answers can be organized in the form of a decision tree. The tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include
the root and other internal nodes, contain attribute test conditions to separate records that have different
characteristics. For example, the root node shown in Figure uses the attribute Body Temperature to separate
warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals,
a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is
warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-
blooded creatures, which are mostly birds.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root
node, we apply the test condition to the record and follow the appropriate branch based on the outcome of
the test.
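A minimal sketch of this traversal in Python, using a nested-dictionary tree for the vertebrate example; the tree structure and helper names are illustrative, not code from the textbook.

```python
# Each internal node tests one attribute; each leaf carries a class label.
vertebrate_tree = {
    "attribute": "Body Temperature",
    "branches": {
        "cold-blooded": {"label": "Non-mammal"},
        "warm-blooded": {
            "attribute": "Gives Birth",
            "branches": {"yes": {"label": "Mammal"}, "no": {"label": "Non-mammal"}},
        },
    },
}

def classify(record, node):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

flamingo = {"Body Temperature": "warm-blooded", "Gives Birth": "no"}
print(classify(flamingo, vertebrate_tree))   # Non-mammal
```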
In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets. Let Dt be the set of training records that are associated with node t and
y = {y1, y2, . . . , yc} be the class labels. The following is a recursive definition of Hunt’s algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the test
condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then
recursively applied to each child node.
Figure 4.6. Training set for predicting borrowers who will default on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see
Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however,
needs to be refined since the root node contains records from both classes. The records are subsequently
divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure
4.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will
assume that this is the best criterion for splitting the data at this point. Hunt’s algorithm is then applied
recursively to each child of the root node. From the training set given in Figure 4.6, notice that all
borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf
node labeled Defaulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the
recursive step of Hunt’s algorithm until all the records belong to the same class. The trees resulting from
each recursive step are shown in Figures 4.7(c) and (d).
Hunt’s algorithm will work if every combination of attribute values is present in the training data and
each combination has a unique class label. These assumptions are too stringent for use in most practical
situations. Additional conditions are needed to handle the following cases:
1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no records
associated with these nodes. This can happen if none of the training records have the combination of attribute
values associated with such nodes. In this case the node is declared a leaf node with the same class label
as the majority class of training records associated with its parent node.
2. In Step 2, if all the records associated with Dt have identical attribute values (except for the class
label), then it is not possible to split these records any further. In this case, the node is declared a leaf node
with the same class label as the majority class of training records associated with this node.
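A minimal Python sketch of this recursion, assuming records are dictionaries of attribute values. The split choice here (the first remaining attribute) is only a placeholder for the entropy-based criterion discussed below, and since child branches are created only for values observed in Dt, the empty-node case does not arise in this simplified version.

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Recursively grow a tree: return a class label (leaf) or a {attribute: {value: subtree}} node."""
    if len(set(labels)) == 1:                        # Step 1: all records in Dt share one class -> leaf
        return labels[0]
    if not attributes or all(r == records[0] for r in records):
        # Identical attribute values (or nothing left to split on) -> majority-class leaf.
        return Counter(labels).most_common(1)[0][0]

    attr = attributes[0]                             # placeholder split choice (see information gain below)
    children = {}
    for value in {r[attr] for r in records}:         # Step 2: one child per observed outcome
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_records, sub_labels = zip(*subset)
        children[value] = hunt(list(sub_records), list(sub_labels), attributes[1:])
    return {attr: children}

# Tiny illustrative training set (not the loan data from Figure 4.6).
records = [{"Home Owner": "yes", "Marital Status": "single"},
           {"Home Owner": "no",  "Marital Status": "married"},
           {"Home Owner": "no",  "Marital Status": "single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["Home Owner", "Marital Status"]))
# -> Home Owner = yes gives leaf 'No'; Home Owner = no splits further on Marital Status.
```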
(Don't worry about this formula; the calculation itself is easy.) The entropy of a set with class proportions $p_1, \ldots, p_c$ is
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
and the entropy of a split on an attribute X is the weighted average of the entropies of its subsets:
$$E(T, X) = \sum_{c \in X} P(c)\,E(c)$$
So, to calculate E(PlayGolf, Outlook), we would use the formula below:
E(PlayGolf, Outlook) = P(Sunny) E(Sunny) + P(Overcast) E(Overcast) + P(Rainy) E(Rainy)
which is the same as:
E(PlayGolf, Outlook) = P(Sunny) E(3,2) + P(Overcast) E(4,0) + P(Rainy) E(2,3)
This formula may look unfriendly, but it is quite clear. The easiest way to approach this calculation is to create a frequency table for the two variables, Play Golf and Outlook.
From the frequency table, E(4,0) = 0 and E(2,3) = E(3,2), since entropy depends only on the class proportions, not on which class is which.
Just for clarification, the calculation steps for E(4,0) are:
$$E(4,0) = -\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4} = 0$$
(taking 0 log 0 = 0; a subset in which all records belong to one class has zero entropy).
We then calculate E(PlayGolf, Outlook) by substituting the values computed for E(Sunny), E(Overcast) and E(Rainy) into the equation above.
E(PlayGolf, Temperature) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Temperature) is given below. It is easier to do if you form the frequency table for the split on Temperature as shown.
E(PlayGolf, Humidity) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Humidity) is given below. It is easier to do if you form the frequency table for the split on Humidity as shown.
E(PlayGolf, Windy) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Windy) is given below. It is easier to do if you form the frequency table for the split on Windy as shown.
Table 6: Frequency table for the split on Windy
Now that we have the entropies for all four attributes, let's summarize them as shown below:
The information gain of a split is the drop in entropy it produces: Gain(T, X) = E(T) − E(T, X). For example, the information gain after splitting using the Outlook attribute is given by:
Gain(PlayGolf, Outlook) = E(PlayGolf) − E(PlayGolf, Outlook)
From our calculation, the highest information gain comes from Outlook. Therefore Outlook is chosen as the attribute for the root node, and the data is split on its three values (Sunny, Overcast, Rainy).
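A compact Python sketch of the entropy and information-gain calculation for the Outlook split, using the counts given above (Sunny 3/2, Overcast 4/0, Rainy 2/3). The full weather table is not reproduced in these notes, so the data below is reconstructed only to that level of detail and the other attributes are omitted.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """E(T, X): weighted average entropy of the subsets produced by attribute attr."""
    n = len(labels)
    total = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        total += len(subset) / n * entropy(subset)
    return total

# Outlook split counts from the notes: Sunny (3,2), Overcast (4,0), Rainy (2,3).
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5
play = ["Yes", "Yes", "Yes", "No", "No"] + ["Yes"] * 4 + ["Yes", "Yes", "No", "No", "No"]
rows = [{"Outlook": o} for o in outlook]

base = entropy(play)                        # E(PlayGolf) = E(9,5) ~ 0.940
cond = split_entropy(rows, play, "Outlook") # E(PlayGolf, Outlook) ~ 0.693
print(base, cond, base - cond)              # Gain(PlayGolf, Outlook) ~ 0.247
```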
To continue building the tree under each Outlook value, we also need to split the original table to create sub-tables. These sub-tables are given below.
The Rainy outlook can be split using either Temperature, Humidity or Windy.
Let's now go ahead and do the same thing for the Sunny outlook, which can likewise be split using either Temperature, Humidity or Windy.
Final Notes:
We have now successfully completed the decision tree. This is how straightforward it is to build a decision tree. Remember, the initial steps of calculating the entropy and the gain are the most difficult part; after that, everything falls into place.