Data Mining: Similarity and Distance

This document discusses similarity and distance metrics in data mining. It defines distance as a numerical measure of how different two data objects are, with lower values indicating more similarity. A distance metric must be a non-negative function where the distance is 0 only when comparing an object to itself, and it must satisfy the triangle inequality. Common distance metrics for real vectors include Lp norms, while Hamming distance counts the number of differing positions between bit vectors or categorical attributes.

Uploaded by

Mohamed K Marah

DATA MINING

LECTURE 5
Similarity and Distance
Distance
• Numerical measure of how different two data
objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an object
with itself.
• Upper limit varies
Distance Metric
• A distance function d is a distance metric if it is a
function from pairs of objects to real numbers
such that:
1. d(x,y) ≥ 0. (non-negativity)
2. d(x,y) = 0 iff x = y. (identity)
3. d(x,y) = d(y,x). (symmetry)
4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
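The four properties can be spot-checked numerically on a small sample of points. The sketch below is illustrative only: the helper name `is_metric_on`, the tolerance `tol`, and the sample points are assumptions, not part of the lecture. Passing on a finite sample does not prove that a function is a metric, but a single failure is enough to show that it is not.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def is_metric_on(points, d, tol=1e-12):
    """Check the four metric properties of d on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # 1. non-negativity
            return False
        if (d(x, y) <= tol) != (x == y):       # 2. identity
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # 3. symmetry
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # 4. triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Squared Euclidean distance: symmetric and non-negative, but not a metric."""
    return dist(x, y) ** 2


# Hypothetical sample points chosen for illustration.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]

euclid_ok = is_metric_on(sample, dist)                # passes all four checks
squared_ok = is_metric_on(sample, squared_euclidean)  # fails the triangle inequality
```

Squared Euclidean distance fails here because going from (0,0) to (2,0) directly costs 4, while going through (1,0) costs 1 + 1 = 2.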
Triangle Inequality
• Triangle inequality guarantees that the distance
function is well-behaved.
• The direct connection is the shortest distance
• It is also useful for proving properties about
the data.
Distances for real vectors

Lp norms (for p ≥ 1) are known to induce distance metrics
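The slide's formula did not survive extraction, so the sketch below uses the standard textbook definition, Lp(x, y) = (|x1 - y1|^p + ... + |xn - yn|^p)^(1/p), with the p = ∞ case taken as the largest coordinate difference. The function name `lp_distance` and the example vectors are assumptions made for illustration.

```python
def lp_distance(x, y, p):
    """L_p distance between two equal-length real vectors, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if p < 1:
        # For p < 1 the triangle inequality can fail, so it is rejected here.
        raise ValueError("p must be >= 1")
    if p == float("inf"):
        # L_infinity: the largest absolute coordinate difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical example vectors.
x, y = (2.0, 8.0), (6.0, 5.0)

l1 = lp_distance(x, y, 1)                # |2-6| + |8-5| = 7 (Manhattan)
l2 = lp_distance(x, y, 2)                # sqrt(16 + 9) = 5 (Euclidean)
linf = lp_distance(x, y, float("inf"))   # max(4, 3) = 4
```

For the same pair of vectors the value never increases as p grows, which is why L1 gives the largest and L∞ the smallest of the three results.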



Hamming Distance
• Hamming distance is the number of positions in
which bit-vectors differ.
• Example: p1 = 10101 p2 = 10011.
• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and 4th
positions.
• Equal to the L1 distance between the binary vectors
• Hamming distance between two vectors of
categorical attributes is the number of positions in
which they differ.
• Example: x = (married, low income, cheat),
y = (single, low income, not cheat)
• d(x,y) = 2
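Both examples above can be reproduced with one short function, since Hamming distance only needs an equality test per position. This is a minimal sketch; the name `hamming` and the choice to represent bit-vectors as strings are assumptions made for illustration.

```python
def hamming(u, v):
    """Number of positions in which two equal-length sequences differ."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Bit-vector example from the slide: differs in the 3rd and 4th positions.
d_bits = hamming("10101", "10011")

# Categorical example from the slide: marital status and cheat label differ.
x = ("married", "low income", "cheat")
y = ("single", "low income", "not cheat")
d_cat = hamming(x, y)

# For binary vectors, the L1 distance gives the same count.
l1_bits = sum(abs(int(a) - int(b)) for a, b in zip("10101", "10011"))
```

The same function works for any sequence type whose elements can be compared for equality, which is what makes it suitable for categorical attributes.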
