Proximity Measures
Source: Books by Tan, Steinbach, Kumar; Han, Kamber & Pei; Evans; Dinesh Kumar + Experiential Knowledge
Measures of Similarity and Dissimilarity (Proximity Measures)
Correlation and Euclidean distance: useful for dense data such as time series or two-dimensional points.
Jaccard and cosine similarity measures: useful for sparse data such as documents.
Euclidean Distance
Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.
[Scatter plot of the four points p1-p4 in the x-y plane]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix (Euclidean):

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
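The distance matrix above can be reproduced with a short Python sketch (an illustration, not from the source; assumes NumPy is installed):

```python
import numpy as np

# The four points from the table above
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    # Euclidean (L2) distance: square root of the sum of squared differences
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

for r in points:
    print(r, [round(euclidean(points[r], points[c]), 3) for c in points])
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162]
# p3 [3.162, 1.414, 0.0, 2.0]
# p4 [5.099, 3.162, 2.0, 0.0]
```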
Standardized Euclidean Distance
Let X1k and X2k be two attributes of the data (where k stands for the kth observation in the data set). It is possible that the range of X1k is much smaller than that of X2k, which skews the Euclidean distance value. An easy way of handling this potential bias is to standardize the data using the following equation:

Standardized value of the attribute:

$Z_{ik} = \dfrac{X_{ik} - \bar{X}_i}{\sigma_i}$

where $\bar{X}_i$ and $\sigma_i$ are, respectively, the mean and standard deviation of attribute Xi.
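A minimal sketch of this standardization (an illustration with made-up numbers, not from the source; assumes NumPy):

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two attributes
# with very different ranges (e.g., height in cm and income in dollars)
X = np.array([[170.0, 45000.0],
              [160.0, 52000.0],
              [182.0, 61000.0],
              [175.0, 48000.0]])

# Standardize each attribute: subtract its mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Euclidean distance between the first two observations, before and after
print(np.linalg.norm(X[0] - X[1]))  # dominated by the large-range attribute
print(np.linalg.norm(Z[0] - Z[1]))  # both attributes contribute comparably
```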
Minkowski distance:

$D(X_1, X_2) = \left( \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|^{p} \right)^{1/p}$

where n is the number of dimensions (attributes).
When p = 1, the Minkowski distance is the city block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
When p = 2, the Minkowski distance is the same as the Euclidean (L2 norm) distance.
Minkowski Distance
L1 (Manhattan) Distance Matrix:

      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L2 (Euclidean) Distance Matrix:

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
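Both matrices can be checked with a small sketch (illustrative only; assumes NumPy):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4

def minkowski(a, b, p):
    # Minkowski distance; p = 1 gives the L1 (Manhattan) distance,
    # p = 2 gives the L2 (Euclidean) distance
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

for p in (1, 2):
    print(f"L{p} matrix:")
    for a in points:
        print([round(minkowski(a, b, p), 3) for b in points])
```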
Simulated bivariate normal data
Think of a multivariate normal distribution. The following graph shows simulated bivariate normal data overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses that are farther away, such as the 90% prediction ellipse.
In the graph, two observations are displayed by using red stars as markers. The first
observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which
marker is closer to the origin? (The origin is the multivariate center of this distribution.)
The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2,
respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for
this distribution, the variance in the Y direction is less than the variance in the X direction, so in
some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is.
Notice the position of the two observations relative to the ellipses. The point (0,2) is located at
the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction
ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the
sense that you are more likely to observe an observation near (4,0) than to observe one near
(0,2). The probability density is higher near (4,0) than it is near (0,2).
In this sense, prediction ellipses are a multivariate generalization of "units of standard
deviation." You can use the bivariate probability contours to compare distances to the bivariate
mean. A point p is closer than a point q if the contour that contains p is nested within the
contour that contains q.
Mahalanobis Distance
The Mahalanobis distance (MD) is a measure of the distance between a data vector and a set of data, or, in a variant form, between two vectors drawn from the same dataset.
Suppose you have data for five people, and each person vector has a Height,
Score on some test, and an Age:
Suppose we want to know how far another person, v = (66, 640, 44), is from this data.
Mahalanobis Distance
$\text{mahalanobis}(x, \mu) = (x - \mu)\, \Sigma^{-1} (x - \mu)^{T}$

where the entries of the covariance matrix $\Sigma$ are

$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$
Covariance Matrix:

$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$

[Scatter plot of the three points]
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
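A short sketch reproducing these two values (illustrative only; assumes NumPy, and uses the squared-form definition given above, i.e. no square root):

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])          # covariance matrix from the example
A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(x, y, cov):
    # (x - y) Sigma^-1 (x - y)^T, the difference of the two points
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

print(mahal(A, B, cov))  # ~ 5.0
print(mahal(A, C, cov))  # ~ 4.0
```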
Jaccard Similarity Coefficient (Jaccard Index)
Jaccard similarity coefficient (JSC) or Jaccard index (Real and
Vargas, 1996) is a measure used when the data is qualitative,
especially when attributes can be represented in binary form.
JSC for two n-dimensional data vectors (n attributes), X1 and X2, is given by

$\text{Jaccard}(X_1, X_2) = \dfrac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$

where $n(X_1 \cap X_2)$ is the number of attributes that belong to both X1 and X2 (that is, $X_1 \cap X_2$), and $n(X_1 \cup X_2)$ is the number of attributes that belong to either X1 or X2 (that is, $X_1 \cup X_2$).
Example: Compute Jaccard Coefficient
Consider movie DVD purchases made by two customers as given by
the following sets
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the
two customers are shown in Table
Movie         BS   BoS   C   FG   IM   JB   KFP   SL   TGF
Customer 1    1    1     0   1    1    1    1     0    0
Customer 2    0    0     1   1    1    1    1     1    1
The JSC is given by

$\text{JSC} = \dfrac{n(\text{Customer 1} \cap \text{Customer 2})}{n(\text{Customer 1} \cup \text{Customer 2})} = \dfrac{4}{9} = 0.44$
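The same calculation as a one-line sketch using Python sets (illustrative; abbreviations as in the table):

```python
customer_1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer_2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}

jsc = len(customer_1 & customer_2) / len(customer_1 | customer_2)
print(len(customer_1 & customer_2), len(customer_1 | customer_2), round(jsc, 2))
# 4 9 0.44
```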
Another example with binary vectors:
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
Here the two vectors have no attribute in common, so Jaccard(x, y) = 0.
Cosine Similarity

The cosine similarity of two vectors X1 and X2 is the cosine of the angle between them:

$\cos(X_1, X_2) = \dfrac{X_1 \cdot X_2}{\lVert X_1 \rVert\, \lVert X_2 \rVert}$

Example:
X1 = 3 2 0 5 0 0 0 2 0 0
X2 = 1 0 0 0 0 0 0 1 0 2
X1 · X2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||X1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||X2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(X1, X2) = 5 / (6.481 × 2.449) ≈ 0.315
If the cosine similarity is 1, the angle between X1 and X2 is 0 degrees, and X1 and X2 are the same except for magnitude.
If the cosine similarity is 0, the angle between X1 and X2 is 90 degrees, and they do not share any words (attributes).
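The worked example above translates directly into a short sketch (illustrative; assumes NumPy):

```python
import numpy as np

X1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
X2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# cosine similarity = dot product divided by the product of the vector norms
cos = X1 @ X2 / (np.linalg.norm(X1) * np.linalg.norm(X2))
print(round(cos, 3))  # 0.315
```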
Correlation
$\text{correlation}(p, q) = \dfrac{\text{covariance}(p, q)}{\sigma_p\, \sigma_q} = \dfrac{s_{pq}}{s_p\, s_q}$
Visually Evaluating Correlation

[Scatter plots showing correlation values ranging from –1 to 1]
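A minimal sketch of the correlation formula (illustrative, with made-up vectors; assumes NumPy):

```python
import numpy as np

# Hypothetical attribute vectors
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# correlation = covariance(p, q) / (std(p) * std(q))
corr = np.cov(p, q, ddof=1)[0, 1] / (np.std(p, ddof=1) * np.std(q, ddof=1))
print(round(corr, 3))                     # close to +1: strong positive correlation
print(round(np.corrcoef(p, q)[0, 1], 3))  # same value via NumPy's built-in
```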