Data Mining: Data. Lecture Notes for Chapter 2
Methodology for data collection
About 2.3 million were returned
Source: Peverill Squire, Why the 1936 Literary Digest Poll Failed.
What is Data?
- An attribute is a property or characteristic of an object.
  Examples: eye color of a person, temperature, etc.
  An attribute is also known as a feature, variable, or variate.
- A collection of attributes describes a data point.
  A data point is also known as an object, record, instance, or example.
Example data set (columns are attributes, rows are data points, Cheat is the class):

Tid  Home Owner  Marital Status  Taxable Income  Cheat
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes
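As a rough illustration of the terminology, the example data set can be held as a list of records, one dictionary per data point. The key names (`home_owner`, `taxable_income_k`, and so on) and the helper `attribute_values` are hypothetical choices made for this sketch, and the Tid values 1 to 9 are inferred from row order.

```python
# Each data point (record) is a dict mapping attribute names to values.
# Key names are illustrative; taxable income is stored in thousands (K).
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]
ATTRIBUTES = ("tid", "home_owner", "marital_status", "taxable_income_k", "cheat")

records = [dict(zip(ATTRIBUTES, row)) for row in ROWS]


def attribute_values(data, name):
    """Return the values of one attribute (one column) across all data points."""
    return [record[name] for record in data]


print(records[0])                         # one data point
print(attribute_values(records, "cheat")) # the class attribute
```

Reading across a dictionary gives one data point; reading one key across all dictionaries gives one attribute.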
Similarity
Numerical measure of how alike two data points are.
Is higher when objects are more alike.
Often falls in the range [0, 1]
Dissimilarity
Numerical measure of how different two data points are
Lower when objects are more alike
Minimum dissimilarity is often 0
Upper limit varies
Proximity refers to a similarity or dissimilarity
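One common way to relate the two notions is to turn a dissimilarity into a similarity with a decreasing function. The sketch below assumes the transform s = 1 / (1 + d), which is only one of several possible choices and is not prescribed by these notes; the function name is hypothetical.

```python
def similarity_from_dissimilarity(d: float) -> float:
    """Map a dissimilarity d >= 0 to a similarity in (0, 1].

    Assumption of this sketch: s = 1 / (1 + d). Identical points (d = 0)
    get similarity 1, and similarity shrinks toward 0 as d grows.
    """
    if d < 0:
        raise ValueError("dissimilarity must be non-negative")
    return 1.0 / (1.0 + d)


for d in (0.0, 1.0, 5.0):
    print(d, similarity_from_dissimilarity(d))
```

The result is higher when objects are more alike and stays within [0, 1], matching the properties listed above.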
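Euclidean distance in 2D
Pythagoras' theorem: $a^2 + b^2 = c^2$, with $c = \mathrm{dist}(p, q)$.
For p = (3, 5) and q = (6, 1):
$$\mathrm{dist}(p, q) = \sqrt{a^2 + b^2} = \sqrt{(6 - 3)^2 + (1 - 5)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5$$
[Figure: points p and q in the (x1, x2) plane; a and b are the legs of the right triangle and c is its hypotenuse.]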
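The 2D calculation can be checked step by step in a few lines. This is a minimal sketch using only the standard library; the variable names mirror the symbols a, b, and c from the slide.

```python
import math

p = (3, 5)
q = (6, 1)

a = q[0] - p[0]  # horizontal leg: 6 - 3
b = q[1] - p[1]  # vertical leg: 1 - 5 (the sign vanishes when squared)
c = math.sqrt(a ** 2 + b ** 2)  # Pythagoras: c^2 = a^2 + b^2

print(a, b, c)
```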
Euclidean Distance
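$$\mathrm{dist} = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are the k-th components of p and q.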
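The general formula translates directly into code. The following is a sketch for points given as equal-length sequences of numbers; the function name `euclidean_distance` is a hypothetical choice for this example.

```python
import math
from typing import Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Square root of the summed squared component differences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


print(euclidean_distance((3, 5), (6, 1)))
print(euclidean_distance((1, 2, 3), (4, 6, 3)))
```

The same function works for any number of dimensions, since the sum simply runs over however many components the points have.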
Euclidean Distance
[Figure: scatter plot of the points p1 to p4 in the (x, y) plane.]

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
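The distance matrix for the four example points can be reproduced with a short script. This sketch assumes Python 3.8 or later, where `math.dist` is available; the helper name `distance_matrix` is hypothetical.

```python
import math

# The four example points (x, y) from the table.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(pts):
    """Pairwise Euclidean distances, rows and columns in insertion order."""
    names = list(pts)
    return [[math.dist(pts[a], pts[b]) for b in names] for a in names]


matrix = distance_matrix(points)
for name, row in zip(points, matrix):
    print(name, [round(d, 3) for d in row])
```

The matrix is symmetric with zeros on the diagonal, so in practice only the entries above (or below) the diagonal need to be computed.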
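For p = (3, 5) and q = (0, 0), the distance to the origin is the length of vector p:
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{2} (p_k - q_k)^2} = \sqrt{\sum_{k=1}^{2} p_k^2} = \lVert p \rVert$$
[Figure: vector p from the origin q in the (x1, x2) plane.]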
Minkowski Distance
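$$\mathrm{dist} = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$$
- r = 1. City block (Manhattan, L1 norm) distance.
- r = 2. Euclidean distance.
- r → ∞. Supremum (Lmax norm, L∞ norm) distance.
  This is the maximum difference between any component of the vectors.
(Figure source: Wikipedia)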
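The Minkowski family can be sketched as a single function with r as a parameter. The function name and the restriction to r >= 1 are assumptions of this example; the supremum case is handled separately because the general formula cannot be evaluated with an infinite exponent.

```python
import math
from typing import Sequence


def minkowski_distance(p: Sequence[float], q: Sequence[float], r: float) -> float:
    """Minkowski distance of order r; pass math.inf for the supremum distance."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of dimensions")
    if r < 1:
        raise ValueError("this sketch assumes r >= 1")
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        # Limit r -> infinity: the largest component difference dominates.
        return max(diffs, default=0.0)
    return sum(d ** r for d in diffs) ** (1.0 / r)


p1, p2 = (0, 2), (2, 0)
for r in (1, 2, math.inf):
    print(r, minkowski_distance(p1, p2, r))
```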
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

Distance Matrix (L1)

      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

Distance Matrix (L2)

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Distance Matrix (L∞)

      p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0
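The three matrices can be regenerated from the point coordinates. The sketch below defines its own compact `minkowski` helper so that it runs on its own; the names `minkowski` and `matrix` are hypothetical, and values are rounded to three decimals to match the tables.

```python
import math

# The four example points (x, y).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def minkowski(p, q, r):
    """Minkowski distance of order r; math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)


def matrix(r):
    """Pairwise distances of order r, rounded to three decimals."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 3) for b in names]
            for a in names]


for label, r in (("L1", 1), ("L2", 2), ("Linf", math.inf)):
    print(label)
    for name, row in zip(points, matrix(r)):
        print(" ", name, row)
```

Comparing the three outputs shows how the choice of r changes the distances between the same points: for any pair, the L1 value is at least the L2 value, which is at least the L∞ value.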