Class-Data Preprocessing-IV
Yashvardhan Sharma
Similarity/Dissimilarity for Simple Attributes
p and q are the attribute values for two data objects.
Euclidean Distance
• Euclidean distance:

  dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes of data objects p and q.
Euclidean Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
Distance Matrix
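A minimal Python sketch (illustrative; the function and variable names are ours) that reproduces the distance matrix above:

import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # dist(p, q) = sqrt(sum over k of (p_k - q_k)^2)
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

for a in points:
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162]
# p3 [3.162, 1.414, 0.0, 2.0]
# p4 [5.099, 3.162, 2.0, 0.0]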
Minkowski Distance
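• The Minkowski distance is a generalization of the Euclidean distance:

  dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are the kth attributes (components) of data objects p and q.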
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
  • A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.
• r = 2. Euclidean (L2) distance.
• r → ∞. "Supremum" (Lmax or L∞ norm) distance: the maximum difference between any single attribute of the two objects.
• Do not confuse r with n: all of these distances are defined for any number of dimensions.
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0

L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0

L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
Distance Matrix
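As a check of one entry in each matrix: for p1 = (0, 2) and p4 = (5, 1), L1(p1, p4) = |0 - 5| + |2 - 1| = 6, L2(p1, p4) = \sqrt{5^2 + 1^2} \approx 5.099, and L∞(p1, p4) = max(|0 - 5|, |2 - 1|) = 5, matching the tables above.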
Common Properties of a Distance
• Distances, such as the Euclidean distance, have some well known
properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)
• A distance that satisfies these three properties is called a metric.
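For instance, with the Euclidean distances computed earlier, d(p1, p3) = 3.162 ≤ d(p1, p2) + d(p2, p3) = 2.828 + 1.414 = 4.242, as the triangle inequality requires.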
Common Properties of a Similarity
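Similarities have analogous well-known properties (listed here for completeness, mirroring the distance properties above):
1. s(p, q) = 1 (maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
There is typically no analogue of the triangle inequality for similarities.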
Similarity Between Binary Vectors
• A common situation is that objects p and q have only binary attributes. For example:
  p = 1 0 0 0 0 0 0 0 0 0
  q = 0 0 0 0 0 0 1 0 0 1
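The quantities normally used to compare such vectors (the standard definitions, stated here for completeness) are M01, M10, M00, and M11: the number of attribute positions where (p, q) equals (0, 1), (1, 0), (0, 0), and (1, 1), respectively. Two common measures are the Simple Matching Coefficient, SMC = (M11 + M00) / (M01 + M10 + M00 + M11), and the Jaccard coefficient, J = M11 / (M01 + M10 + M11). For the vectors above, M01 = 2, M10 = 1, M00 = 7, and M11 = 0, so SMC = 7/10 = 0.7 while J = 0/3 = 0.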
Cosine Similarity
• If d1 and d2 are two document vectors, then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length (Euclidean norm) of vector d.
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449
cos(d1, d2) = 5 / (6.481 × 2.449) ≈ 0.315
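A minimal Python sketch (illustrative; the function name is ours) that reproduces this calculation:

import math

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine_similarity(d1, d2), 3))   # 0.315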
Extended Jaccard Coefficient (Tanimoto)
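• For continuous or count attributes, the extended Jaccard (Tanimoto) coefficient is commonly defined as

  T(p, q) = \frac{p \bullet q}{\|p\|^2 + \|q\|^2 - p \bullet q}

  which reduces to the Jaccard coefficient when the attributes are binary.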
Correlation Analysis (Numeric Data)
  r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}

(Figure: scatter plots illustrating correlations ranging from -1 to +1.)
Correlation
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects p and q and then take their dot product:

  p'_k = (p_k - mean(p)) / std(p),   q'_k = (q_k - mean(q)) / std(q)

  correlation(p, q) = p' • q'
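A short sketch (illustrative; the normalization by n - 1, which the slide leaves implicit, is made explicit here) checks this against NumPy's built-in Pearson correlation, using the stock-price values from the covariance example later in this deck:

import numpy as np

p = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
q = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

# Standardize: subtract the mean, divide by the sample standard deviation
p_std = (p - p.mean()) / p.std(ddof=1)
q_std = (q - q.mean()) / q.std(ddof=1)

# Dot product of the standardized objects, scaled by 1/(n - 1),
# gives the Pearson correlation coefficient
corr = (p_std @ q_std) / (len(p) - 1)
print(round(corr, 3))                      # 0.941
print(round(np.corrcoef(p, q)[0, 1], 3))   # 0.941 (same value)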
Correlation analysis
• Can detect redundancies
  r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B}

  where \bar{A} = \frac{\sum a_i}{n} is the mean of A and \sigma_A = \sqrt{\frac{\sum (a_i - \bar{A})^2}{n-1}} is its standard deviation (and similarly for B).
Cont’d
• r_{A,B} > 0: A and B are positively correlated
  • values of A increase as values of B increase
  • the higher the value, the more strongly each attribute implies the other
  • a high value indicates that A (or B) may be removed as a redundancy
• r_{A,B} = 0: A and B are uncorrelated (no linear relationship)
• r_{A,B} < 0: A and B are negatively correlated
  • values of one attribute increase as values of the other attribute decrease (they discourage each other)
Covariance (Numeric Data)
• Covariance is similar to correlation:

  Cov(A, B) = E[(A - \bar{A})(B - \bar{B})] = \frac{1}{n}\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B}) = E(A \cdot B) - \bar{A}\,\bar{B}

  Correlation coefficient:

  r_{A,B} = \frac{Cov(A, B)}{\sigma_A\,\sigma_B}

  where n is the number of tuples, \bar{A} and \bar{B} are the respective means (expected values) of A and B, and \sigma_A and \sigma_B are the respective standard deviations of A and B.
• Positive covariance: if Cov_{A,B} > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: if Cov_{A,B} < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, then Cov_{A,B} = 0; but the converse is not true:
  • Some pairs of random variables may have a covariance of 0 and yet not be independent. Only under additional assumptions (e.g., the data follow a multivariate normal distribution) does a covariance of 0 imply independence.
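For example, if X takes the values -1, 0, and 1 with equal probability and Y = X^2, then Cov(X, Y) = 0 even though Y is completely determined by X.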
Co-Variance: An Example
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
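Completing the calculation: E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6, and
Cov(A, B) = E(A·B) - E(A)·E(B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 - 4 × 9.6 = 212/5 - 38.4 = 4.
Since Cov(A, B) > 0, the two stocks' prices do tend to rise (and fall) together.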
• Χ2 (chi-square) test
  \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}
• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count
• Correlation does not imply causality
  • The number of hospitals and the number of car thefts in a city are correlated
  • Both are causally linked to a third variable: population
Chi-Square Calculation: An Example
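A standard textbook illustration of this calculation (Han, Kamber & Pei): suppose 1500 people are cross-tabulated by whether they play chess and whether they like science fiction, with observed counts 250 (like SF, play chess), 200 (like SF, don't play), 50 (don't like SF, play chess), and 1000 (don't like SF, don't play). Expected counts are row total × column total / grand total, e.g. Expected(like SF, play chess) = 450 × 300 / 1500 = 90, and similarly 360, 210, and 840 for the other cells. Then

  χ² = (250 - 90)²/90 + (50 - 210)²/210 + (200 - 360)²/360 + (1000 - 840)²/840 ≈ 507.93

Since 507.93 far exceeds the critical value for 1 degree of freedom (10.828 at the 0.001 significance level), the two attributes are strongly correlated in this sample.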
Mahalanobis Distance
  mahalanobis(p, q) = (p - q)\,\Sigma^{-1}\,(p - q)^T

  where \Sigma is the covariance matrix of the input data X:

  \Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)

Example: for

  \Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}

and the points A = (0.5, 0.5), B = (0, 1), C = (1.5, 1.5):

  Mahal(A, B) = 5
  Mahal(A, C) = 4
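A minimal NumPy sketch (illustrative) that verifies the two values above:

import numpy as np

sigma = np.array([[0.3, 0.2],
                  [0.2, 0.3]])
sigma_inv = np.linalg.inv(sigma)

def mahalanobis_sq(p, q):
    # (p - q) Sigma^{-1} (p - q)^T  -- the squared form used on the slide
    d = np.array(p) - np.array(q)
    return float(d @ sigma_inv @ d)

A, B, C = (0.5, 0.5), (0.0, 1.0), (1.5, 1.5)
print(round(mahalanobis_sq(A, B), 3))   # 5.0
print(round(mahalanobis_sq(A, C), 3))   # 4.0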
General Approach for Combining Similarities
• Sometimes attributes are of many different types, but an
overall similarity is needed.
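One common procedure (following Tan, Steinbach & Kumar; the details are not spelled out on this slide) is: compute a similarity s_k in [0, 1] for each attribute k; set an indicator δ_k = 0 if the kth attribute is an asymmetric attribute with both values equal to 0, or if one of the values is missing, and δ_k = 1 otherwise; then combine:

  similarity(p, q) = \frac{\sum_{k=1}^{n} \delta_k s_k}{\sum_{k=1}^{n} \delta_k}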
Using Weights to Combine Similarities
• We may not want to treat all attributes the same.
• Use weights w_k that are between 0 and 1 and sum to 1.
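With such weights, one standard formulation (again following Tan, Steinbach & Kumar; stated here as an illustration) of the combined similarity and of the weighted Minkowski distance is:

  similarity(p, q) = \frac{\sum_{k=1}^{n} w_k \delta_k s_k}{\sum_{k=1}^{n} \delta_k}

  dist(p, q) = \left( \sum_{k=1}^{n} w_k\,|p_k - q_k|^r \right)^{1/r}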