Proximity Measures
Source: Books by Tan, Steinbach, Kumar; Han, Kamber & Pei; Evans; Dinesh Kumar + Experiential Knowledge
Measures of Similarity and Dissimilarity (Proximity Measures)
Correlation and Euclidean distance: useful for dense data such as time series or two-dimensional points.
Jaccard and cosine similarity measures: useful for sparse data such as documents.
Euclidean Distance
Euclidean distance is one of the most frequently used distance measures when the data are on an interval or ratio scale.
[Scatter plot of the four points p1-p4 in the x-y plane]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix (Euclidean):

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
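The distance matrix above can be reproduced with a short Python sketch (an illustration, not from the source; assumes NumPy is installed):

```python
import numpy as np

# The four points from the table above
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(a, b):
    # Euclidean (L2) distance: square root of the sum of squared differences
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

for r in points:
    print(r, [round(euclidean(points[r], points[c]), 3) for c in points])
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162]
# p3 [3.162, 1.414, 0.0, 2.0]
# p4 [5.099, 3.162, 2.0, 0.0]
```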
Standardized Euclidean Distance
Let X1k and X2k be two attributes of the data (where k stands for the kth observation in the data set). It is possible that the range of X1k is much smaller than that of X2k, which skews the Euclidean distance value. An easy way of handling this potential bias is to standardize the data using the following equation:

Standardized value of the attribute:

$Z_{ik} = \dfrac{X_{ik} - \bar{X}_i}{\sigma_i}$

where $\bar{X}_i$ and $\sigma_i$ are, respectively, the mean and standard deviation of attribute Xi.
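A minimal sketch of this standardization (an illustration with made-up numbers, not from the source; assumes NumPy):

```python
import numpy as np

# Hypothetical data: rows are observations, columns are two attributes
# with very different ranges (e.g., height in cm and income in dollars)
X = np.array([[170.0, 45000.0],
              [160.0, 52000.0],
              [182.0, 61000.0],
              [175.0, 48000.0]])

# Standardize each attribute: subtract its mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Euclidean distance between the first two observations, before and after
print(np.linalg.norm(X[0] - X[1]))  # dominated by the large-range attribute
print(np.linalg.norm(Z[0] - Z[1]))  # both attributes contribute comparably
```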
Minkowski distance:

$D(X_1, X_2) = \left( \sum_{i=1}^{n} \left| X_{1i} - X_{2i} \right|^{p} \right)^{1/p}$

where n is the number of dimensions (attributes).
When p = 1, the Minkowski distance is the city block (Manhattan, taxicab, L1 norm) distance.
– A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors.
When p = 2, the Minkowski distance is the same as the Euclidean (L2 norm) distance.
Minkowski Distance
L1 (Manhattan) Distance Matrix:

      p1   p2   p3   p4
p1    0    4    4    6
p2    4    0    2    4
p3    4    2    0    2
p4    6    4    2    0

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L2 (Euclidean) Distance Matrix:

      p1      p2      p3      p4
p1    0       2.828   3.162   5.099
p2    2.828   0       1.414   3.162
p3    3.162   1.414   0       2
p4    5.099   3.162   2       0
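Both matrices can be checked with a small sketch (illustrative only; assumes NumPy):

```python
import numpy as np

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)  # p1..p4

def minkowski(a, b, p):
    # Minkowski distance; p = 1 gives the L1 (Manhattan) distance,
    # p = 2 gives the L2 (Euclidean) distance
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

for p in (1, 2):
    print(f"L{p} matrix:")
    for a in points:
        print([round(minkowski(a, b, p), 3) for b in points])
```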
Simulated bivariate normal data
Think of a multivariate normal distribution. The following graph shows simulated bivariate normal data overlaid with prediction ellipses. The ellipses in the graph are the 10% (innermost), 20%, ..., and 90% (outermost) prediction ellipses for the bivariate normal distribution that generated the data. The prediction ellipses are contours of the bivariate normal density function. The probability density is high for ellipses near the origin, such as the 10% prediction ellipse. The density is low for ellipses that are farther away, such as the 90% prediction ellipse.
In the graph, two observations are displayed by using red stars as markers. The first
observation is at the coordinates (4,0), whereas the second is at (0,2). The question is: which
marker is closer to the origin? (The origin is the multivariate center of this distribution.)
The answer is, "It depends how you measure distance." The Euclidean distances are 4 and 2,
respectively, so you might conclude that the point at (0,2) is closer to the origin. However, for
this distribution, the variance in the Y direction is less than the variance in the X direction, so in
some sense the point (0,2) is "more standard deviations" away from the origin than (4,0) is.
Notice the position of the two observations relative to the ellipses. The point (0,2) is located at
the 90% prediction ellipse, whereas the point at (4,0) is located at about the 75% prediction
ellipse. What does this mean? It means that the point at (4,0) is "closer" to the origin in the
sense that you are more likely to observe an observation near (4,0) than to observe one near
(0,2). The probability density is higher near (4,0) than it is near (0,2).
In this sense, prediction ellipses are a multivariate generalization of "units of standard
deviation." You can use the bivariate probability contours to compare distances to the bivariate
mean. A point p is closer than a point q if the contour that contains p is nested within the
contour that contains q.
Mahalanobis Distance
The Mahalanobis distance (MD) is a measure of the distance between a data vector and a set of data, or, in a variant form, between two vectors drawn from the same dataset.
Suppose you have data for five people, and each person vector has a Height,
Score on some test, and an Age:
Suppose we want to know how far another person, v = (66, 640, 44), is from this data.
Mahalanobis Distance
$\text{mahalanobis}(x, \mu) = (x - \mu)\, \Sigma^{-1} (x - \mu)^{T}$

where the entries of the covariance matrix $\Sigma$ are

$\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)$
Covariance Matrix:

$\Sigma = \begin{bmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{bmatrix}$

[Scatter plot of the three points]
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A, B) = 5
Mahal(A, C) = 4
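A short sketch reproducing these two values (illustrative only; assumes NumPy, and uses the squared-form definition given above, i.e. no square root):

```python
import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])          # covariance matrix from the example
A = np.array([0.5, 0.5])
B = np.array([0.0, 1.0])
C = np.array([1.5, 1.5])

def mahal(x, y, cov):
    # (x - y) Sigma^-1 (x - y)^T, the difference of the two points
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

print(mahal(A, B, cov))  # ~ 5.0
print(mahal(A, C, cov))  # ~ 4.0
```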
Jaccard Similarity Coefficient (Jaccard Index)
Jaccard similarity coefficient (JSC) or Jaccard index (Real and
Vargas, 1996) is a measure used when the data is qualitative,
especially when attributes can be represented in binary form.
JSC for two n-dimensional data vectors (n attributes), X1 and X2, is given by

$\text{Jaccard}(X_1, X_2) = \dfrac{n(X_1 \cap X_2)}{n(X_1 \cup X_2)}$

where $n(X_1 \cap X_2)$ is the number of attributes that belong to both X1 and X2 (that is, $X_1 \cap X_2$), and $n(X_1 \cup X_2)$ is the number of attributes that belong to either X1 or X2 (that is, $X_1 \cup X_2$).
Example: Compute Jaccard Coefficient
Consider movie DVD purchases made by two customers as given by
the following sets
Customer 1 = {Jungle Book (JB), Iron Man (IM), Kung Fu Panda (KFP), Before Sunrise (BS), Bridge of Spies (BoS), Forrest Gump (FG)}
Customer 2 = {Casablanca (C), Jungle Book (JB), Forrest Gump (FG), Iron Man (IM), Kung Fu Panda (KFP), Schindler’s List (SL), The Godfather (TGF)}
In this case, each movie is an attribute. The purchases made by the
two customers are shown in Table
Movie         BS   BoS   C   FG   IM   JB   KFP   SL   TGF
Customer 1    1    1     0   1    1    1    1     0    0
Customer 2    0    0     1   1    1    1    1     1    1
The JSC is given by

$\text{JSC} = \dfrac{n(\text{Customer 1} \cap \text{Customer 2})}{n(\text{Customer 1} \cup \text{Customer 2})} = \dfrac{4}{9} = 0.44$
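The same calculation as a one-line sketch using Python sets (illustrative; abbreviations as in the table):

```python
customer_1 = {"JB", "IM", "KFP", "BS", "BoS", "FG"}
customer_2 = {"C", "JB", "FG", "IM", "KFP", "SL", "TGF"}

jsc = len(customer_1 & customer_2) / len(customer_1 | customer_2)
print(len(customer_1 & customer_2), len(customer_1 | customer_2), round(jsc, 2))
# 4 9 0.44
```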
Another example with binary vectors:
x = 1 0 0 0 0 0 0 0 0 0
y = 0 0 0 0 0 0 1 0 0 1
Here the two vectors have no attribute in common, so Jaccard(x, y) = 0.
Cosine Similarity

The cosine similarity of two vectors X1 and X2 is the cosine of the angle between them:

$\cos(X_1, X_2) = \dfrac{X_1 \cdot X_2}{\lVert X_1 \rVert\, \lVert X_2 \rVert}$

Example:
X1 = 3 2 0 5 0 0 0 2 0 0
X2 = 1 0 0 0 0 0 0 1 0 2
X1 · X2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||X1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||X2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(X1, X2) = 5 / (6.481 × 2.449) ≈ 0.315
If the cosine similarity is 1, the angle between X1 and X2 is 0 degrees, and X1 and X2 are the same except for magnitude.
If the cosine similarity is 0, the angle between X1 and X2 is 90 degrees, and they do not share any words (attributes).
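The worked example above translates directly into a short sketch (illustrative; assumes NumPy):

```python
import numpy as np

X1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
X2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

# cosine similarity = dot product divided by the product of the vector norms
cos = X1 @ X2 / (np.linalg.norm(X1) * np.linalg.norm(X2))
print(round(cos, 3))  # 0.315
```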
Correlation
$\text{correlation}(p, q) = \dfrac{\text{covariance}(p, q)}{\sigma_p\, \sigma_q} = \dfrac{s_{pq}}{s_p\, s_q}$
Visually Evaluating Correlation

[Scatter plots showing correlation values ranging from –1 to 1]
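A minimal sketch of the correlation formula (illustrative, with made-up vectors; assumes NumPy):

```python
import numpy as np

# Hypothetical attribute vectors
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
q = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# correlation = covariance(p, q) / (std(p) * std(q))
corr = np.cov(p, q, ddof=1)[0, 1] / (np.std(p, ddof=1) * np.std(q, ddof=1))
print(round(corr, 3))                     # close to +1: strong positive correlation
print(round(np.corrcoef(p, q)[0, 1], 3))  # same value via NumPy's built-in
```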