Class-Data Preprocessing-IV


CS F415: Data Mining

Yashvardhan Sharma



Today’s Outline
• Data Preprocessing
• Measures of Similarity and Dissimilarity



Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.

Euclidean Distance
• Euclidean Distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

• Standardization is necessary, if scales differ.

Euclidean Distance

(Figure: the four points plotted in the x-y plane.)

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

p1 p2 p3 p4
p1 0 2.828 3.162 5.099
p2 2.828 0 1.414 3.162
p3 3.162 1.414 0 2
p4 5.099 3.162 2 0

Distance Matrix
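
As a quick check (not part of the original slides), this distance matrix can be reproduced with a short sketch, assuming NumPy is available:

import numpy as np

# Points from the example above (one row per point: x, y)
points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

# Pairwise Euclidean distances via broadcasting
diff = points[:, None, :] - points[None, :, :]
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(dist_matrix, 3))
# Matches the table, e.g. dist(p1, p2) ~ 2.828, dist(p1, p4) ~ 5.099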

Minkowski Distance

• Minkowski Distance is a generalization of Euclidean Distance
dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
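
A minimal Python sketch of this formula (the function name minkowski_distance is illustrative, not from the slides); passing r = float('inf') gives the supremum distance discussed on the next slide:

import numpy as np

def minkowski_distance(p, q, r):
    """Minkowski distance between two equal-length vectors for a given r."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isinf(r):
        # Limiting case r -> infinity: maximum absolute component difference
        return np.abs(p - q).max()
    return (np.abs(p - q) ** r).sum() ** (1.0 / r)

print(minkowski_distance([0, 2], [2, 0], r=1))             # 4.0 (L1)
print(minkowski_distance([0, 2], [2, 0], r=2))             # ~2.828 (L2)
print(minkowski_distance([0, 2], [2, 0], r=float('inf')))  # 2.0 (L-infinity)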

Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this is the Hamming distance, which is
just the number of bits that are different between two binary
vectors

• r = 2. Euclidean distance

• r → ∞. "supremum" (L_max norm, L_∞ norm) distance, also known as Chebyshev distance.
• This is the maximum difference between any component of the vectors

• Do not confuse r with n, i.e., all these distances are defined for all
numbers of dimensions.
Minkowski Distance

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1    p1     p2     p3     p4
p1    0      4      4      6
p2    4      0      2      4
p3    4      2      0      2
p4    6      4      2      0

L2    p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞    p1     p2     p3     p4
p1    0      2      3      5
p2    2      0      1      3
p3    3      1      0      2
p4    5      3      2      0

Distance Matrix
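
Assuming SciPy is available, the three matrices above can also be obtained with scipy.spatial.distance.cdist (a sketch, not part of the original slides):

import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]], dtype=float)

L1   = cdist(points, points, metric='cityblock')  # r = 1
L2   = cdist(points, points, metric='euclidean')  # r = 2
Linf = cdist(points, points, metric='chebyshev')  # r -> infinity

print(np.round(L1, 3), np.round(L2, 3), np.round(Linf, 3), sep='\n\n')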
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known
properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects), p and q.

• A distance that satisfies these properties is a metric
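
A quick numerical spot-check of these properties for the Euclidean distance on random points (an illustrative sketch only; it does not prove the axioms):

import numpy as np

rng = np.random.default_rng(0)
p, q, r = rng.random((3, 5))  # three random 5-dimensional points

d = lambda a, b: np.sqrt(((a - b) ** 2).sum())  # Euclidean distance

assert d(p, p) == 0 and d(p, q) > 0          # positive definiteness
assert np.isclose(d(p, q), d(q, p))          # symmetry
assert d(p, r) <= d(p, q) + d(q, r) + 1e-12  # triangle inequality
print("All three metric properties hold for this sample.")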

Common Properties of a Similarity

• Similarities also have some well-known properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.

Similarity Between Binary Vectors
• A common situation is that objects, p and q, have only binary attributes

• Compute similarities using the following quantities


M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example

p= 1000000000
q= 0000001001

M01 = 2 (the number of attributes where p was 0 and q was 1)


M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7

J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
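
The counts and both coefficients above can be verified with a short sketch (assuming NumPy; the variable names are illustrative):

import numpy as np

p = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
q = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

M11 = int(((p == 1) & (q == 1)).sum())
M00 = int(((p == 0) & (q == 0)).sum())
M10 = int(((p == 1) & (q == 0)).sum())
M01 = int(((p == 0) & (q == 1)).sum())

smc = (M11 + M00) / (M01 + M10 + M11 + M00)
jaccard = M11 / (M01 + M10 + M11)
print(M01, M10, M00, M11, smc, jaccard)  # 2 1 7 0 0.7 0.0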

Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d.

• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos( d1, d2 ) = .3150
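
A minimal check of this example with NumPy (not from the slides):

import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2], dtype=float)

cos_sim = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos_sim, 4))  # 0.315, matching the slide's .3150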

Extended Jaccard Coefficient (Tanimoto)

• Variation of Jaccard for continuous or count attributes


• Reduces to Jaccard for binary attributes
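
The slide omits the formula; the usual definition in standard data-mining references is T(p, q) = p·q / (||p||² + ||q||² − p·q). A short sketch (illustrative helper name):

import numpy as np

def tanimoto(p, q):
    """Extended Jaccard (Tanimoto) coefficient for continuous or count vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    dot = p @ q
    return dot / (p @ p + q @ q - dot)

# For binary vectors it reduces to the Jaccard coefficient:
print(tanimoto([1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
               [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]))  # 0.0, same as J above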

Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product moment coefficient)

r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.

• If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
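
A small sketch (assuming NumPy) computing r_{A,B}, using the stock-price values from the covariance example later in the deck as sample data:

import numpy as np

a = np.array([2, 3, 5, 4, 6], dtype=float)     # attribute A (sample data)
b = np.array([5, 8, 10, 11, 14], dtype=float)  # attribute B (sample data)

n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(round(r, 3))
print(round(float(np.corrcoef(a, b)[0, 1]), 3))  # same value via NumPy's built-in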
Visually Evaluating Correlation

(Figure: scatter plots showing the similarity from –1 to 1.)

Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, p and q, and then
take their dot product

p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)

correlation(p, q) = p' • q'
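
A sketch of this standardize-then-dot-product view. Note one assumption made explicit below: with the sample standard deviation, the dot product of the standardized vectors must be divided by n − 1 to match the usual Pearson coefficient.

import numpy as np

def correlation(p, q):
    """Pearson correlation via standardized vectors (sketch)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p_std = (p - p.mean()) / p.std(ddof=1)
    q_std = (q - q.mean()) / q.std(ddof=1)
    return (p_std @ q_std) / (len(p) - 1)  # 1/(n-1) normalization

a = [2, 3, 5, 4, 6]
b = [5, 8, 10, 11, 14]
print(round(correlation(a, b), 3), round(float(np.corrcoef(a, b)[0, 1]), 3))  # identical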
Correlation analysis
• Can detect redundancies

r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B},  where  \bar{A} = \frac{\sum A}{n}  and  \sigma_A = \sqrt{\frac{\sum (A - \bar{A})^2}{n-1}}
Cont’d
• > 0 , A and B positively correlated
• values of A increase as values of B increase
• The higher the value, the more each attribute implies the other
• A high value indicates that A (or B) may be removed as a redundancy
• = 0, A and B independent (no correlation)
• < 0, A and B negatively correlated
• Values of one attribute increase as the values of the other attribute decrease
(discourages each other)
Covariance (Numeric Data)
• Covariance is similar to correlation

Cov(A, B) = E[(A - \bar{A})(B - \bar{B})] = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n}

Correlation coefficient:  r_{A,B} = \frac{Cov(A, B)}{\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.

• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their expected
values.
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is likely to be
smaller than its expected value.
• Independence: CovA,B = 0 but the converse is not true:
• Some pairs of random variables may have a covariance of 0 but are not independent. Only
under some additional assumptions (e.g., the data follow multivariate normal distributions)
does a covariance of 0 imply independence
Co-Variance: An Example

• It can be simplified in computation as Cov(A, B) = E(A·B) - \bar{A}\,\bar{B}

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4

• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6

• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
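
A quick check of this arithmetic (sketch assuming NumPy; np.cov with bias=True gives the population covariance used here):

import numpy as np

a = np.array([2, 3, 5, 4, 6], dtype=float)     # stock A prices
b = np.array([5, 8, 10, 11, 14], dtype=float)  # stock B prices

cov_ab = (a * b).mean() - a.mean() * b.mean()  # E(A.B) - E(A).E(B)
print(cov_ab)                                  # ~4.0
print(np.cov(a, b, bias=True)[0, 1])           # ~4.0 (population covariance)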


Correlation Analysis (Nominal Data)

• Χ2 (chi-square) test
\chi^2 = \sum \frac{(Observed - Expected)^2}{Expected}
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population

Chi-Square Calculation: An Example

                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

• Χ2 (chi-square) calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories)

\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93
• It shows that like_science_fiction and play_chess are correlated in the
group
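
A short numerical check of this calculation (a sketch assuming NumPy; the expected counts are derived from the row and column totals):

import numpy as np

observed = np.array([[250, 200],
                     [50, 1000]], dtype=float)

# Expected counts under independence: row_total * col_total / grand_total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)        # [[90, 360], [210, 840]]
print(round(chi2, 2))  # ~507.94; the slide's 507.93 differs only in rounding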

Mahalanobis Distance
mahalanobis(p, q) = (p - q)\,\Sigma^{-1}(p - q)^T

Σ is the covariance matrix of the input data X:

\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

(Figure: for the red points, the Euclidean distance is 14.7 and the Mahalanobis distance is 6.)


Mahalanobis Distance
Covariance Matrix:

\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)

Mahal(A,B) = 5
Mahal(A,C) = 4
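
These numbers can be reproduced with NumPy (a sketch; note that the slide's formula is the squared Mahalanobis distance, without a square root):

import numpy as np

cov = np.array([[0.3, 0.2],
                [0.2, 0.3]])
cov_inv = np.linalg.inv(cov)

def mahalanobis(p, q, cov_inv):
    """Squared Mahalanobis distance as used on the slide."""
    d = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(d @ cov_inv @ d)

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(mahalanobis(A, B, cov_inv))  # ~5.0
print(mahalanobis(A, C, cov_inv))  # ~4.0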

General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an
overall similarity is needed.

Using Weights to Combine Similarities
• May not want to treat all attributes the same.
• Use weights wk which are between 0 and 1 and sum to 1.
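
A minimal sketch of the weighted combination (illustrative only; the per-attribute similarities and weights below are made-up values):

import numpy as np

# Per-attribute similarities s_k between two objects (hypothetical values)
s = np.array([1.0, 0.4, 0.8])
# Non-negative weights w_k that sum to 1
w = np.array([0.5, 0.3, 0.2])

overall_similarity = float((w * s).sum())
print(overall_similarity)  # 0.5*1.0 + 0.3*0.4 + 0.2*0.8 = 0.78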

