
Lecture on Data:

Proximity Measures

Source: Books by Tan, Steinbach & Kumar; Han, Kamber & Pei; Evans; Dinesh Kumar + Experiential Knowledge
Measures of Similarity and Dissimilarity (Proximity Measures)

• Useful in data mining techniques such as clustering, nearest-neighbor classification, and anomaly detection.

• Similarity between two objects: a numerical measure of the degree to which two objects are alike. Range: [0, 1].

• Dissimilarity between two objects: a numerical measure of the degree to which two objects are different. Range: [0, 1] or [0, ∞).

• Distance: a special case of dissimilarity.

Measures of Similarity and Dissimilarity (Proximity Measures)

• One approach is to transform the data to similarity (dissimilarity) space and then perform the analysis.

• Proximity between two objects = f(proximity between corresponding attributes of the two objects).

• The simplest case is proximity between two objects having one simple attribute. This will be followed by proximity measures between objects with multiple attributes.
Measures of Similarity and Dissimilarity (Proximity Measures)

• Correlation and Euclidean distance: useful for dense data such as time series or two-dimensional points.

• Jaccard and cosine similarity: useful for sparse data such as documents.
Euclidean Distance

• Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.

• The Euclidean distance between two n-dimensional observations $X_1 = (x_{11}, x_{12}, \ldots, x_{1n})$ and $X_2 = (x_{21}, x_{22}, \ldots, x_{2n})$ is given by

$D(X_1, X_2) = \sqrt{(x_{11} - x_{21})^2 + (x_{12} - x_{22})^2 + \cdots + (x_{1n} - x_{2n})^2}$

• Standardization is necessary if the scales of the attributes differ.
Euclidean Distance

[Figure: scatter plot of the four points p1–p4 in the x–y plane]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix
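As a concrete check, here is a minimal NumPy sketch (an addition, not part of the original slides) that reproduces the distance matrix above:

```python
import numpy as np

# Coordinates of the four points from the table above.
points = np.array([
    [0, 2],  # p1
    [2, 0],  # p2
    [3, 1],  # p3
    [5, 1],  # p4
])

# Pairwise Euclidean distances via broadcasting:
# diff[i, j] = points[i] - points[j]
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist_matrix, 3))
# [[0.    2.828 3.162 5.099]
#  [2.828 0.    1.414 3.162]
#  [3.162 1.414 0.    2.   ]
#  [5.099 3.162 2.    0.   ]]
```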
Standardized Euclidean Distance

Let X1k and X2k be the values of two attributes for the kth observation in the data set. It is possible that the range of X1k is much smaller than that of X2k, resulting in a skewed Euclidean distance value. An easy way of handling this potential bias is to standardize the data using the following equation:

Standardized value of the attribute: $\dfrac{X_{ik} - \bar{X}_i}{\sigma_{X_i}}$

where $\bar{X}_i$ and $\sigma_{X_i}$ are, respectively, the mean and standard deviation of the ith attribute.

Example: comparison of two persons based on their age and income. (Income differences are larger and hence dominate the comparison.)
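A minimal sketch of this standardization, using hypothetical age/income values (the numbers are illustrative, not from the slides):

```python
import numpy as np

# Hypothetical data for illustration: one row per person, columns = (age, income).
X = np.array([
    [25,  30_000],
    [45,  80_000],
    [35,  50_000],
    [55, 120_000],
], dtype=float)

# Z-score standardization: (x - mean) / std, applied per attribute (column).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Euclidean distance between the first two persons,
# before and after standardization.
raw = np.linalg.norm(X[0] - X[1])
std = np.linalg.norm(Z[0] - Z[1])
print(f"raw: {raw:.1f}, standardized: {std:.3f}")
# The raw distance is dominated by income; the standardized
# distance weighs age and income comparably.
```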
Limitation

• Euclidean distance may not be useful for measuring the distance between two locations (say, two shops).

• We can use Minkowski distance for that.

Minkowski Distance

Minkowski distance is the generalized distance measure between two cases (data objects) X1 and X2 in the dataset and is given by

$D(X_1, X_2) = \left( \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|^{p} \right)^{1/p}$

where n is the number of dimensions.

• When p = 1, this is the city block (Manhattan, taxicab, L1 norm) distance.
  – A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• When p = 2, the Minkowski distance is the same as the Euclidean distance.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrices
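A minimal sketch (an addition, not from the slides) that reproduces both matrices from the Minkowski formula:

```python
import numpy as np

def minkowski(a: np.ndarray, b: np.ndarray, p: float) -> float:
    """Minkowski distance between two vectors: (sum |a_i - b_i|^p)^(1/p)."""
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

# p = 1 gives the L1 (city block) matrix; p = 2 gives the Euclidean matrix.
for p in (1, 2):
    matrix = np.array([[minkowski(a, b, p) for b in points] for a in points])
    print(f"L{p} distance matrix:\n{np.round(matrix, 3)}")
```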
Simulated bivariate normal data

Think of a multivariate normal distribution. The following graph shows simulated bivariate normal data overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, …, and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse, and low for ellipses that are farther away, such as the 90% prediction ellipse.

• In the graph, two observations are displayed using red stars as markers. The first observation is at the coordinates (4, 0), whereas the second is at (0, 2). The question is: which marker is closer to the origin? (The origin is the multivariate center of this distribution.)
• The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2, respectively, so you might conclude that the point at (0, 2) is closer to the origin. However, for this distribution, the variance in the Y direction is less than the variance in the X direction, so in some sense the point (0, 2) is "more standard deviations" away from the origin than (4, 0) is.
• Notice the position of the two observations relative to the ellipses. The point (0, 2) is located on the 90% prediction ellipse, whereas the point at (4, 0) is located at about the 75% prediction ellipse. What does this mean? It means that the point at (4, 0) is "closer" to the origin in the sense that you are more likely to observe an observation near (4, 0) than near (0, 2). The probability density is higher near (4, 0) than it is near (0, 2).
• In this sense, prediction ellipses are a multivariate generalization of "units of standard deviation." You can use the bivariate probability contours to compare distances to the bivariate mean. A point p is closer than a point q if the contour that contains p is nested within the contour that contains q.
Mahalanobis Distance

• The Mahalanobis distance (MD) is a measure of distance between a data vector and a set of data, or a variation that measures the distance between two vectors from the same dataset.

• Suppose you have data for five people, and each person's vector has a Height, a Score on some test, and an Age:

Height  Score   Age
64.0    580.0   29.0
66.0    570.0   33.0
68.0    590.0   37.0
69.0    660.0   46.0
73.0    600.0   55.0

mean m = (68.0, 600.0, 40.0), n = 5

• Suppose we want to know how far another person, v = (66, 640, 44), is from this data.
Mahalanobis Distance

$\text{mahalanobis}(x, \mu) = (x - \mu)\, \Sigma^{-1} (x - \mu)^{T}$

(Some texts define the Mahalanobis distance as the square root of this quantity; the examples below use the form shown.)

• $\Sigma$ is the covariance matrix of the input data X:

$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$

• For the red points in the accompanying figure, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.
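Applying this formula to the five-person data above, here is a minimal NumPy sketch (an addition, not from the slides, which pose the question for v but do not give the answer):

```python
import numpy as np

# Five-person data from the earlier table: Height, Score, Age.
X = np.array([
    [64.0, 580.0, 29.0],
    [66.0, 570.0, 33.0],
    [68.0, 590.0, 37.0],
    [69.0, 660.0, 46.0],
    [73.0, 600.0, 55.0],
])
v = np.array([66.0, 640.0, 44.0])

m = X.mean(axis=0)            # (68, 600, 40), as on the slide
S = np.cov(X, rowvar=False)   # sample covariance, 1/(n-1) convention

# (x - mu) Sigma^{-1} (x - mu)^T, the slide's squared form.
d = v - m
md = d @ np.linalg.inv(S) @ d
print(md)
```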
Mahalanobis Distance

Covariance matrix:

$\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}$

[Figure: scatter plot of points A, B, C]

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
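A quick sketch (an addition, not from the slides) verifying these two values with the same formula:

```python
import numpy as np

S = np.array([[0.3, 0.2],
              [0.2, 0.3]])
A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(p, q, S):
    """Squared-form Mahalanobis distance, matching the slide's formula."""
    d = p - q
    return d @ np.linalg.inv(S) @ d

print(mahal(A, B, S))  # ≈ 5.0
print(mahal(A, C, S))  # ≈ 4.0
```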
Jaccard Similarity Coefficient (Jaccard Index)

• The Jaccard similarity coefficient (JSC) or Jaccard index (Real and Vargas, 1996) is a measure used when the data are qualitative, especially when the attributes can be represented in binary form.

• The JSC for two n-dimensional data objects (n attributes), X1 and X2, is given by

$\text{Jaccard}(X_1, X_2) = \frac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$

where $n(X_1 \cap X_2)$ is the number of attributes that belong to both X1 and X2 (that is, $X_1 \cap X_2$), and $n(X_1 \cup X_2)$ is the number of attributes that belong to either X1 or X2 (that is, $X_1 \cup X_2$).
Example: Compute Jaccard Coefficient

Consider movie DVD purchases made by two customers, given by the following sets:

Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler's List (SL), The Godfather (TGF)}

In this case, each movie is an attribute. The purchases made by the two customers are shown in the table:

Movie Title   BS  BoS  C  FG  IM  JB  KFP  SL  TGF
Customer 1     1    1  0   1   1   1    1   0    0
Customer 2     0    0  1   1   1   1    1   1    1

• The JSC is given by

$\text{JSC} = \frac{n(\text{customer 1} \cap \text{customer 2})}{n(\text{customer 1} \cup \text{customer 2})} = \frac{4}{9} = 0.44$

• The higher the Jaccard coefficient, the higher the similarity between the two observations being compared. The value of the JSC lies between 0 and 1.
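A minimal sketch (an addition, not from the slides) computing this Jaccard index with Python sets:

```python
# Movie abbreviations as in the table above.
customer1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}

# |intersection| / |union|
jsc = len(customer1 & customer2) / len(customer1 | customer2)
print(jsc)  # 4/9 ≈ 0.444
```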
Similarity Measures for Binary Data

• Similarity measures between objects that contain only binary attributes are called similarity coefficients; they have values in [0, 1], where 1 indicates the objects are completely similar.

• A common situation is that the objects, x and y, have only binary attributes.

• Compute similarities using the following quantities:

f01 = the number of attributes where x was 0 and y was 1
f10 = the number of attributes where x was 1 and y was 0
f00 = the number of attributes where x was 0 and y was 0
f11 = the number of attributes where x was 1 and y was 1

• Simple Matching Coefficient (SMC) and Jaccard coefficient:

SMC = number of matches / number of attributes
    = (f11 + f00) / (f01 + f10 + f11 + f00)

J = number of matching presences / number of attributes not involved in 0-0 matches
  = f11 / (f01 + f10 + f11)
SMC versus Jaccard: Example

x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1

f01 = 2 (the number of attributes where x was 0 and y was 1)
f10 = 1 (the number of attributes where x was 1 and y was 0)
f00 = 7 (the number of attributes where x was 0 and y was 0)
f11 = 0 (the number of attributes where x was 1 and y was 1)

SMC = (f11 + f00) / (f01 + f10 + f11 + f00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = f11 / (f01 + f10 + f11) = 0 / (2 + 1 + 0) = 0
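A minimal sketch (an addition, not from the slides) computing SMC and Jaccard for these binary vectors:

```python
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four match/mismatch types across attributes.
f01 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 1))
f10 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 0))
f00 = sum(1 for a, b in zip(x, y) if (a, b) == (0, 0))
f11 = sum(1 for a, b in zip(x, y) if (a, b) == (1, 1))

smc = (f11 + f00) / (f01 + f10 + f11 + f00)
jaccard = f11 / (f01 + f10 + f11) if (f01 + f10 + f11) else 0.0
print(smc, jaccard)  # 0.7 0.0
```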


[Figure: cosine similarity for different values of θ]
Cosine Similarity (document vectors)

The cosine similarity between X1 and X2 is given by

$\text{Similarity}(X_1, X_2) = \cos(\theta) = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|} = \frac{\sum_{i=1}^{n} X_{1i} X_{2i}}{\sqrt{\sum_{i=1}^{n} X_{1i}^2} \; \sqrt{\sum_{i=1}^{n} X_{2i}^2}}$

• where · indicates the vector dot product and ||X|| is the length (Euclidean norm) of vector X.
• In cosine similarity, X1 and X2 are two n-dimensional vectors, and the measure is the cosine of the angle between the two vectors (hence the name vector space model).
Cosine Similarity

• Useful for document matching.

• Documents are represented as vectors, where each attribute represents the frequency with which a particular word occurs in the document.

• Each document vector is sparse, since it has relatively few non-zero attributes.

• Any two documents are likely to "not contain" many of the same words. Thus, if 0-0 matches are counted, most documents will appear similar. Therefore, a similarity measure for documents not only needs to ignore 0-0 matches, like the Jaccard measure, but must also be able to handle non-binary vectors, as in the example on the next slide.
Cosine Similarity

• Example:

X1 = 3 2 0 5 0 0 0 2 0 0
X2 = 1 0 0 0 0 0 0 1 0 2

X1 · X2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||X1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||X2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(X1, X2) = 5 / (6.481 × 2.449) = 0.315

• If the cosine similarity is 1, the angle between X1 and X2 is 0 degrees, and X1 and X2 are the same except for magnitude.
• If the cosine similarity is 0, the angle between X1 and X2 is 90 degrees, and they do not share any words.
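A minimal sketch (an addition, not from the slides) reproducing this computation:

```python
import numpy as np

X1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
X2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# Dot product divided by the product of the vector lengths.
cos_sim = X1 @ X2 / (np.linalg.norm(X1) * np.linalg.norm(X2))
print(round(cos_sim, 4))  # 0.315
```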
Correlation

• Correlation measures the linear relationship between objects.

• To compute correlation, we standardize the data objects, p and q, and then take their dot product:

$p'_k = (p_k - \text{mean}(p)) / \text{std}(p)$

$q'_k = (q_k - \text{mean}(q)) / \text{std}(q)$

$\text{correlation}(p, q) = \frac{p' \cdot q'}{n - 1}$

(the dot product is divided by n − 1 when std is the sample standard deviation, so that the result lies in [−1, 1]).
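A minimal sketch (an addition, with hypothetical data) checking this standardize-then-dot-product view against NumPy's Pearson correlation:

```python
import numpy as np

# Hypothetical vectors for illustration.
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

n = len(p)
p_std = (p - p.mean()) / p.std(ddof=1)  # sample std (n - 1 denominator)
q_std = (q - q.mean()) / q.std(ddof=1)

r = p_std @ q_std / (n - 1)
print(round(r, 4), round(np.corrcoef(p, q)[0, 1], 4))  # both values agree
```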
Visually Evaluating Correlation

[Figure: scatter plots showing correlation values from −1 to 1]
