CS-DM MODULE- 3
a) Euclidean Distance
Assume that we have measurements $x_{ik}$, $i = 1, \ldots, N$, on variables $k = 1, \ldots, p$ (also called attributes).
The Euclidean distance between the ith and jth objects is
$$d(x_i, x_j) = \left( \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right)^{1/2}$$
b) Minkowski Distance
The Minkowski distance is a generalization of the Euclidean distance.
With the measurements $x_{ik}$, $i = 1, \ldots, N$, $k = 1, \ldots, p$, the Minkowski distance is
$$d(x_i, x_j) = \left( \sum_{k=1}^{p} |x_{ik} - x_{jk}|^{\lambda} \right)^{1/\lambda}$$
where $\lambda \geq 1$. It is also called the $L_\lambda$ metric.
$\lambda = 1$: $L_1$ metric, Manhattan or City-block distance.
$\lambda = 2$: $L_2$ metric, Euclidean distance.
$\lambda \to \infty$: $L_\infty$ metric, Supremum distance.
Note that $\lambda$ and $p$ are two different parameters: $\lambda$ can grow to infinity, while the dimension $p$ of the data matrix remains finite.
c) Mahalanobis Distance
Let X be an N×p matrix. Then the ith row of X is
$$x_i^T = (x_{i1}, x_{i2}, \ldots, x_{ip})$$
The Mahalanobis distance between the ith and jth objects is
$$d(x_i, x_j) = \left( (x_i - x_j)^T S^{-1} (x_i - x_j) \right)^{1/2}$$
where S is the sample covariance matrix of the data.
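As a quick illustration, here is a minimal NumPy sketch of the three distances defined above; the toy data matrix and function names are made up for the example, not part of the notes.

```python
import numpy as np

def euclidean(xi, xj):
    """L2 distance: square root of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, lam=3):
    """L-lambda distance; lam=1 gives Manhattan, lam=2 gives Euclidean."""
    return np.sum(np.abs(xi - xj) ** lam) ** (1.0 / lam)

def mahalanobis(xi, xj, S):
    """Distance that accounts for scale and correlation via the covariance matrix S."""
    diff = xi - xj
    return np.sqrt(diff @ np.linalg.inv(S) @ diff)

# Toy data matrix X with N = 4 objects and p = 2 variables (values are arbitrary).
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 1.0], [5.0, 5.0]])
S = np.cov(X, rowvar=False)            # p x p sample covariance matrix
print(euclidean(X[0], X[1]))           # 2.236...
print(minkowski(X[0], X[1], lam=1))    # 3.0 (Manhattan)
print(mahalanobis(X[0], X[1], S))
```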
Common Properties of Similarity Measures
Similarity measures between objects that contain only binary attributes are called
similarity coefficients, and typically have values between 0 and 1. A value of 1 indicates
that the two objects are completely similar, while a value of 0 indicates that the objects
are not at all similar.
Let x and y be two objects that consist of n binary attributes. The comparison of two such
objects, i.e., two binary vectors, leads to the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
Simple Matching Coefficient: One commonly used similarity coefficient is the simple matching coefficient (SMC), which is defined as
$$SMC = \frac{\text{number of matching attribute values}}{\text{number of attributes}} = \frac{f_{11} + f_{00}}{f_{01} + f_{10} + f_{11} + f_{00}}$$
Jaccard Coefficient: Suppose that x and y are data objects that represent two rows (two transactions) of a transaction matrix. The Jaccard coefficient, which is often symbolized by J, is given by the following equation:
$$J = \frac{\text{number of matching presences}}{\text{number of attributes not involved in 00 matches}} = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
Cosine Similarity: The cosine similarity, defined next, is one of the most common measures of document similarity. If x and y are two document vectors, then
$$\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
where $x \cdot y$ is the dot product of the two vectors and $\|x\|$ is the length (Euclidean norm) of vector x.
Correlation: The correlation between two data objects that have binary or continuous variables is a measure of the linear relationship between the attributes of the objects. Pearson's correlation coefficient between two data objects, x and y, is defined by the following equation:
$$corr(x, y) = \frac{covariance(x, y)}{standard\_deviation(x) \times standard\_deviation(y)} = \frac{s_{xy}}{s_x \, s_y}$$
where $s_{xy}$ is the sample covariance of x and y, and $s_x$, $s_y$ are their sample standard deviations.
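A minimal Python sketch of these four measures, assuming plain NumPy arrays as input; the function names and example vectors are illustrative, not taken from a particular library.

```python
import numpy as np

def smc(x, y):
    """Simple matching coefficient for binary vectors: matches / total attributes."""
    x, y = np.asarray(x), np.asarray(y)
    return np.mean(x == y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f01 + f10 + f11)."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    denom = np.sum(x | y)          # attributes not involved in 0-0 matches
    return np.sum(x & y) / denom if denom else 0.0

def cosine_similarity(x, y):
    """Cosine of the angle between two (document) vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Pearson's correlation: covariance divided by the product of standard deviations."""
    return np.corrcoef(x, y)[0, 1]

x = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])
print(smc(x, y), jaccard(x, y))    # 0.7 and 0.0 for these binary vectors
```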
1. Euclidean Distance:
Formula: $d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}$
Use Case: Suitable for continuous numerical data.
Considerations: Sensitive to outliers.
2. Manhattan Distance (L1 Norm):
Formula: $d(x, y) = \sum_{k=1}^{n} |x_k - y_k|$
Use Case: Suitable for sparse data and less sensitive to outliers than Euclidean distance.
3. Cosine Similarity:
Formula: $\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$
Use Case: Effective for text data, document similarity, and high-dimensional data.
Considerations: Ignores magnitude and focuses on direction.
4. Jaccard Similarity:
Formula: $J(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$
Use Case: Suitable for binary or categorical data; often used in set comparisons.
5. Hamming Distance:
Formula: Number of positions at which the corresponding symbols differ.
Use Case: Applicable to binary or categorical data of the same length.
6. Minkowski Distance:
Formula: $d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}$
Use Case: Generalization of Euclidean and Manhattan distances; the parameter p determines the norm.
7. Correlation-based Measures:
Pearson Correlation Coefficient: Measures linear correlation.
Spearman Rank Correlation Coefficient: Measures monotonic relationships.
Use Case: Suitable for comparing the relationship between variables.
8. Mahalanobis Distance:
Formula: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where S is the covariance matrix of the data.
Use Case: Effective when dealing with multivariate data with different scales.
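Most of these measures are available off the shelf. The sketch below uses SciPy's scipy.spatial.distance and scipy.stats modules; the example vectors, the thresholding used to binarize them for Jaccard/Hamming, and the random sample used for the covariance are arbitrary illustration choices.

```python
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 0.0, 3.0, 2.0])
y = np.array([2.0, 1.0, 0.0, 2.0])

print(distance.euclidean(x, y))        # 1. Euclidean
print(distance.cityblock(x, y))        # 2. Manhattan (L1)
print(1 - distance.cosine(x, y))       # 3. cosine similarity (SciPy returns the distance)
print(distance.jaccard(x > 0, y > 0))  # 4. Jaccard dissimilarity on binarized vectors
print(distance.hamming(x > 0, y > 0))  # 5. fraction of positions that differ
print(distance.minkowski(x, y, p=3))   # 6. Minkowski with order p = 3
print(pearsonr(x, y)[0], spearmanr(x, y)[0])  # 7. correlation-based measures

# 8. Mahalanobis distance needs the inverse covariance of a data sample.
X = np.random.default_rng(0).normal(size=(50, 4))
VI = np.linalg.inv(np.cov(X, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```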
B) Classification: Classification, the task of assigning objects to one of several predefined categories, is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content, categorizing cells as malignant or benign based upon the results of MRI scans, and classifying galaxies based upon their shapes.
Fig: Classification (input → classification model → output)
The input data for a classification task is a collection of records. Each record, also known as an instance or
example, is characterized by a tuple (x, y), where x is the attribute set and y is a special attribute, designated
as the class label (also known as category or target attribute).
Table 4.1 shows a sample data set used for classifying vertebrates into one of the following categories:
mammal, bird, fish, reptile, or amphibian.
The attribute set includes properties of a vertebrate such as its body temperature, skin cover, method of
reproduction, ability to fly, and ability to live in water. Although the attributes presented in Table 4.1 are
mostly discrete, the attribute set can also contain continuous features. The class label, on the other hand,
must be a discrete attribute. This is a key characteristic that distinguishes classification from regression, a
predictive modeling task in which y is a continuous attribute.
Classification is the task of learning a target function f that maps each attribute set x to one of
the predefined class labels y.
The target function is also known informally as a classification model. A classification model is useful
for the following purposes
Descriptive Modeling: A classification model can serve as an explanatory tool to distinguish between objects of different classes. For example, it would be useful for both biologists and others to have a descriptive model that summarizes the data shown in Table 4.1 and explains what features define a vertebrate as a mammal, reptile, bird, fish, or amphibian.
Predictive Modeling: A classification model can also be used to predict the class label of unknown
records. As shown in Figure 4.2, a classification model can be treated as a black box that automatically
assigns a class label when presented with the attribute set of an unknown record. Suppose we are given the
following characteristics of a creature known as a gila monster:
Name         | Body Temperature | Skin Cover | Gives Birth | Aquatic Creature | Aerial Creature | Has Legs | Hibernates | Class Label
gila monster | cold-blooded     | scales     | no          | no               | no              | yes      | yes        | ?
Figure 4.2
Classification techniques are most suited for predicting or describing data sets with binary or nominal categories. They are less effective for ordinal categories (e.g., to classify a person as a member of the high-, medium-, or low-income group) because they do not consider the implicit order among the categories.
General Approach to Solving a Classification Problem:
A classification technique (or classifier) is a systematic approach to building classification models from an
input data set. Examples include decision tree classifiers, rule-based classifiers, neural networks, support
vector machines, and naïve Bayes classifiers. Each technique employs a learning algorithm to identify
a model that best fits the relationship between the attribute set and class label of the input data. The model
generated by a learning algorithm should both fit the input data well and correctly predict the class labels
of records it has never seen before. Therefore, a key objective of the learning algorithm is to build models
with good generalization capability; i.e., models that accurately predict the class labels of previously
unknown records.
Table 4.2. Confusion matrix for a binary classification problem.

                         Predicted Class
                         Class = 1    Class = 0
Actual     Class = 1     f11          f10
Class      Class = 0     f01          f00
Figure 4.3 shows a general approach for solving classification problems. First, a training set
consisting of records whose class labels are known must be provided. The training set is used to
build a classification model, which is subsequently applied to the test set, which consists of records with
unknown class labels.
Evaluation of the performance of a classification model is based on the counts of test
records correctly and incorrectly predicted by the model. These counts are tabulated in a
table known as a confusion matrix. Table 4.2 depicts the confusion matrix for a binary
classification problem.
Each entry fij in this table denotes the number of records from class i predicted to be of class j. For
instance, f01 is the number of records from class 0 incorrectly predicted as class 1. Based on the entries in
the confusion matrix, the total number of correct predictions made by the model is (f11 + f00) and the
total number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a classification model performs, summarizing this information with a single number would make it more convenient to compare the performance of different models. This can be done using a performance metric such as accuracy, which is defined as follows:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{f_{11} + f_{00}}{f_{11} + f_{10} + f_{01} + f_{00}}$$
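A short sketch of these counts in Python, assuming the actual and predicted labels are given as plain lists; the label values here are made up for the example.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return the four confusion-matrix entries f11, f10, f01, f00 for binary labels."""
    counts = Counter(zip(actual, predicted))
    return {f"f{a}{p}": counts[(a, p)] for a in (1, 0) for p in (1, 0)}

actual    = [1, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
f = confusion_counts(actual, predicted)
accuracy = (f["f11"] + f["f00"]) / sum(f.values())
print(f, accuracy)   # 3 + 3 correct out of 8 -> accuracy = 0.75
```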
Suppose a new species is discovered by scientists. How can we tell whether it is a mammal or a non-
mammal? One approach is to pose a series of questions about the characteristics of the species. The first
question we may ask is whether the species is cold- or warm-blooded. If it is cold-blooded, then it is
definitely not a mammal. Otherwise, it is either a bird or a mammal. In the latter case, we need to ask a
follow-up question: Do the females of the species give birth to their young? Those that do give birth are
definitely mammals, while those that do not are likely to be non-mammals (with the exception of egg-laying mammals such as the platypus and spiny anteater). This series of questions and their possible answers can be organized in the form of a decision tree. The tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
In a decision tree, each leaf node is assigned a class label. The non-terminal nodes, which include
the root and other internal nodes, contain attribute test conditions to separate records that have different
characteristics. For example, the root node shown in Figure uses the attribute Body Temperature to separate
warm-blooded from cold-blooded vertebrates. Since all cold-blooded vertebrates are non-mammals,
a leaf node labeled Non-mammals is created as the right child of the root node. If the vertebrate is
warm-blooded, a subsequent attribute, Gives Birth, is used to distinguish mammals from other warm-
blooded creatures, which are mostly birds.
Classifying a test record is straightforward once a decision tree has been constructed. Starting from the root
node, we apply the test condition to the record and follow the appropriate branch based on the outcome of
the test.
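A minimal sketch of this traversal in Python, using a nested-dictionary tree for the vertebrate example; the tree structure and helper names are illustrative, not code from the textbook.

```python
# Each internal node tests one attribute; each leaf carries a class label.
vertebrate_tree = {
    "attribute": "Body Temperature",
    "branches": {
        "cold-blooded": {"label": "Non-mammal"},
        "warm-blooded": {
            "attribute": "Gives Birth",
            "branches": {"yes": {"label": "Mammal"}, "no": {"label": "Non-mammal"}},
        },
    },
}

def classify(record, node):
    """Follow the branch matching the record's attribute value until a leaf is reached."""
    while "label" not in node:
        node = node["branches"][record[node["attribute"]]]
    return node["label"]

flamingo = {"Body Temperature": "warm-blooded", "Gives Birth": "no"}
print(classify(flamingo, vertebrate_tree))   # Non-mammal
```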
In Hunt’s algorithm, a decision tree is grown in a recursive fashion by partitioning the training records
into successively purer subsets. Let Dt be the set of training records that are associated with node t and
y = {y1, y2, . . . , yc} be the class labels. The following is a recursive definition of Hunt’s algorithm.
Step 1: If all the records in Dt belong to the same class yt, then t is a leaf node labeled as yt.
Step 2: If Dt contains records that belong to more than one class, an attribute test condition is
selected to partition the records into smaller subsets. A child node is created for each outcome of the test
condition and the records in Dt are distributed to the children based on the outcomes. The algorithm is then
recursively applied to each child node.
Figure 4.6. Training set for predicting borrowers who will default on loan payments.
The initial tree for the classification problem contains a single node with class label Defaulted = No (see
Figure 4.7(a)), which means that most of the borrowers successfully repaid their loans. The tree, however,
needs to be refined since the root node contains records from both classes. The records are subsequently
divided into smaller subsets based on the outcomes of the Home Owner test condition, as shown in Figure
4.7(b). The justification for choosing this attribute test condition will be discussed later. For now, we will
assume that this is the best criterion for splitting the data at this point. Hunt’s algorithm is then applied
recursively to each child of the root node. From the training set given in Figure 4.6, notice that all
borrowers who are home owners successfully repaid their loans. The left child of the root is therefore a leaf
node labeled Defaulted = No (see Figure 4.7(b)). For the right child, we need to continue applying the
recursive step of Hunt’s algorithm until all the records belong to the same class. The trees resulting from
each recursive step are shown in Figures 4.7(c) and (d).
Hunt’s algorithm will work if every combination of attribute values is present in the training data and
each combination has a unique class label. These assumptions are too stringent for use in most practical
situations. Additional conditions are needed to handle the following cases:
1. It is possible for some of the child nodes created in Step 2 to be empty; i.e., there are no records
associated with these nodes. This can happen if none of the training records have the combination of attribute
values associated with such nodes. In this case the node is declared a leaf node with the same class label
as the majority class of training records associated with its parent node.
2. In Step 2, if all the records associated with Dt have identical attribute values (except for the class
label), then it is not possible to split these records any further. In this case, the node is declared a leaf node
with the same class label as the majority class of training records associated with this node.
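A minimal Python sketch of this recursion, assuming records are dictionaries of attribute values. The split choice here (the first remaining attribute) is only a placeholder for the entropy-based criterion discussed below, and since child branches are created only for values observed in Dt, the empty-node case does not arise in this simplified version.

```python
from collections import Counter

def hunt(records, labels, attributes):
    """Recursively grow a tree: return a class label (leaf) or a {attribute: {value: subtree}} node."""
    if len(set(labels)) == 1:                        # Step 1: all records in Dt share one class -> leaf
        return labels[0]
    if not attributes or all(r == records[0] for r in records):
        # Identical attribute values (or nothing left to split on) -> majority-class leaf.
        return Counter(labels).most_common(1)[0][0]

    attr = attributes[0]                             # placeholder split choice (see information gain below)
    children = {}
    for value in {r[attr] for r in records}:         # Step 2: one child per observed outcome
        subset = [(r, y) for r, y in zip(records, labels) if r[attr] == value]
        sub_records, sub_labels = zip(*subset)
        children[value] = hunt(list(sub_records), list(sub_labels), attributes[1:])
    return {attr: children}

# Tiny illustrative training set (not the loan data from Figure 4.6).
records = [{"Home Owner": "yes", "Marital Status": "single"},
           {"Home Owner": "no",  "Marital Status": "married"},
           {"Home Owner": "no",  "Marital Status": "single"}]
labels = ["No", "No", "Yes"]
print(hunt(records, labels, ["Home Owner", "Marital Status"]))
# -> Home Owner = yes gives leaf 'No'; Home Owner = no splits further on Marital Status.
```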
(Don't worry about this formula; the calculation itself is easy.) The entropy of a set with class proportions $p_1, \ldots, p_c$ is
$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$
and the entropy of a split on an attribute X is the weighted average of the entropies of its subsets:
$$E(T, X) = \sum_{c \in X} P(c)\,E(c)$$
So, to calculate E(PlayGolf, Outlook), we would use the formula below:
E(PlayGolf, Outlook) = P(Sunny) E(Sunny) + P(Overcast) E(Overcast) + P(Rainy) E(Rainy)
which is the same as:
E(PlayGolf, Outlook) = P(Sunny) E(3,2) + P(Overcast) E(4,0) + P(Rainy) E(2,3)
This formula may look unfriendly, but it is quite clear. The easiest way to approach this calculation is to create a frequency table for the two variables, Play Golf and Outlook.
From the frequency table, E(4,0) = 0 and E(2,3) = E(3,2), since entropy depends only on the class proportions, not on which class is which.
Just for clarification, the calculation steps for E(4,0) are:
$$E(4,0) = -\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4} = 0$$
(taking 0 log 0 = 0; a subset in which all records belong to one class has zero entropy).
We then calculate E(PlayGolf, Outlook) by substituting the values computed for E(Sunny), E(Overcast) and E(Rainy) into the equation above.
E(PlayGolf, Temperature) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Temperature) is given below. It is easier to do if you form the frequency table for the split on Temperature as shown.
E(PlayGolf, Humidity) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Humidity) is given below. It is easier to do if you form the frequency table for the split on Humidity as shown.
E(PlayGolf, Windy) Calculation: Just like in the previous calculation, the calculation of E(PlayGolf, Windy) is given below. It is easier to do if you form the frequency table for the split on Windy as shown.
Table 6: Frequency table for the split on Windy
Now that we have the entropies for all four attributes, let's summarize them as shown below:
The information gain of a split is the drop in entropy it produces: Gain(T, X) = E(T) − E(T, X). For example, the information gain after splitting using the Outlook attribute is given by:
Gain(PlayGolf, Outlook) = E(PlayGolf) − E(PlayGolf, Outlook)
From our calculation, the highest information gain comes from Outlook. Therefore Outlook is chosen as the attribute for the root node, and the data is split on its three values (Sunny, Overcast, Rainy).
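A compact Python sketch of the entropy and information-gain calculation for the Outlook split, using the counts given above (Sunny 3/2, Overcast 4/0, Rainy 2/3). The full weather table is not reproduced in these notes, so the data below is reconstructed only to that level of detail and the other attributes are omitted.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """E(T, X): weighted average entropy of the subsets produced by attribute attr."""
    n = len(labels)
    total = 0.0
    for value in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == value]
        total += len(subset) / n * entropy(subset)
    return total

# Outlook split counts from the notes: Sunny (3,2), Overcast (4,0), Rainy (2,3).
outlook = ["Sunny"] * 5 + ["Overcast"] * 4 + ["Rainy"] * 5
play = ["Yes", "Yes", "Yes", "No", "No"] + ["Yes"] * 4 + ["Yes", "Yes", "No", "No", "No"]
rows = [{"Outlook": o} for o in outlook]

base = entropy(play)                        # E(PlayGolf) = E(9,5) ~ 0.940
cond = split_entropy(rows, play, "Outlook") # E(PlayGolf, Outlook) ~ 0.693
print(base, cond, base - cond)              # Gain(PlayGolf, Outlook) ~ 0.247
```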
To continue building the tree under each Outlook value, we also need to split the original table to create sub-tables. These sub-tables are given below.
The Rainy outlook can be split using either Temperature, Humidity or Windy.
Let's now go ahead and do the same thing for the Sunny outlook, which can likewise be split using either Temperature, Humidity or Windy.
Final Notes:
We have now successfully completed the decision tree. This is how straightforward it is to build a decision tree. Remember, the initial steps of calculating the entropy and the gain are the most difficult part; after that, everything falls into place.