
Data Mining: Data

Lecture Notes for Chapter 2


Introduction to Data Mining
by
Tan, Steinbach, Kumar
(Modified by P. Radivojac for I211)

What went wrong in 1936?

Literary Digest had conducted surveys in every presidential election since 1920 and correctly predicted the winner every time.

In 1936 they predicted 55% of the vote for Alf Landon and 41% for Franklin Roosevelt.

The actual election showed that Roosevelt won, 61% vs. 37%.

Methodology for data collection:

Literary Digest sent 10 million ballots to voters in the USA

about 2.3 million were returned

names were obtained from phone registries and automobile licensing departments

So, what was the problem?

What went wrong in 1936 (1)-(3)

[Figures omitted. Source: Peverill Squire, "Why the 1936 Literary Digest Poll Failed."]

What is Data?

A collection of data objects and their attributes.

An attribute is a property or characteristic of an object.

Examples: eye color of a person, temperature, etc.

An attribute is also known as a feature, variable, or variate.

A collection of attributes describes a data point.

A data point is also known as an object, record, instance, or example.

In the example table below, each row is a data point, each column is an attribute, and the Cheat column serves as the class.

Tid | Home Owner | Marital Status | Taxable Income | Cheat
 1  | Yes        | Single         | 125K           | No
 2  | No         | Married        | 100K           | No
 3  | No         | Single         | 70K            | No
 4  | Yes        | Married        | 120K           | No
 5  | No         | Divorced       | 95K            | Yes
 6  | No         | Married        | 60K            | No
 7  | Yes        | Divorced       | 220K           | No
 8  | No         | Single         | 85K            | Yes
 9  | No         | Married        | 75K            | No
 10 | No         | Single         | 90K            | Yes
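As a minimal sketch (not part of the original notes), the table above could be represented in Python as a list of records, one dictionary per data point, with attribute names as keys:

```python
# Each dictionary is one data point; each key is an attribute.
# "Cheat" plays the role of the class attribute.
records = [
    {"Tid": 1, "HomeOwner": "Yes", "MaritalStatus": "Single",  "TaxableIncome": "125K", "Cheat": "No"},
    {"Tid": 2, "HomeOwner": "No",  "MaritalStatus": "Married", "TaxableIncome": "100K", "Cheat": "No"},
    {"Tid": 3, "HomeOwner": "No",  "MaritalStatus": "Single",  "TaxableIncome": "70K",  "Cheat": "No"},
    # ... remaining rows follow the same pattern
]

# Accessing attributes of each data point:
for r in records:
    print(r["Tid"], r["MaritalStatus"], r["Cheat"])
```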

Similarity and Dissimilarity

Similarity

Numerical measure of how alike two data points are.
Is higher when objects are more alike.
Often falls in the range [0, 1].

Dissimilarity

Numerical measure of how different two data points are.
Lower when objects are more alike.
Minimum dissimilarity is often 0.
Upper limit varies.

Proximity refers to either a similarity or a dissimilarity.

Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Euclidean distance in 2D

For p = (3, 5) and q = (6, 1):

dist(p, q) = \sqrt{a^2 + b^2} = \sqrt{(6-3)^2 + (1-5)^2} = \sqrt{3^2 + (-4)^2} = \sqrt{25} = 5

Pythagoras' theorem: a^2 + b^2 = c^2, where c = dist(p, q).

[Figure: right triangle in the (x1, x2) plane with vertices p = (3, 5) and q = (6, 1); legs a = 3 and b = 4, hypotenuse c = dist(p, q).]
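A minimal Python sketch (not part of the original slides) verifying the worked example above:

```python
import math

p, q = (3, 5), (6, 1)
a = q[0] - p[0]             # leg along x1: 6 - 3 = 3
b = q[1] - p[1]             # leg along x2: 1 - 5 = -4
c = math.sqrt(a**2 + b**2)  # hypotenuse, by Pythagoras' theorem
print(c)                    # 5.0
```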

Euclidean Distance in n dimensions

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Standardization is necessary if scales differ.
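A short sketch of the n-dimensional formula. The z-score standardization shown is one common choice, assumed here for illustration; the notes only state that standardization is needed when scales differ.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two n-dimensional points."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

def zscore(values):
    """Standardize one attribute (column) to mean 0, std 1
    (one common convention, assumed here)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print(euclidean((3, 5), (6, 1)))  # 5.0, the 2D example again
```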

Euclidean Distance

point | x | y
p1    | 0 | 2
p2    | 2 | 0
p3    | 3 | 1
p4    | 5 | 1

Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0
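The matrix above can be reproduced with a few lines of Python (a sketch, not from the notes):

```python
import math

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Print one row of the distance matrix per point.
for name, p in points.items():
    print(name, [round(euclidean(p, q), 3) for q in points.values()])
# p1 [0.0, 2.828, 3.162, 5.099]
# p2 [2.828, 0.0, 1.414, 3.162]
# p3 [3.162, 1.414, 0.0, 2.0]
# p4 [5.099, 3.162, 2.0, 0.0]
```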

More about Euclidean distance

dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}

When q = (0, 0) is the origin, this reduces to

dist(p, q) = \sqrt{\sum_{k=1}^{n} p_k^2} = length of vector p

[Figure: vector p = (3, 5) drawn from the origin q = (0, 0) in the (x1, x2) plane.]
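As a one-off check (not in the original slides), the distance from the figure's p = (3, 5) to the origin, i.e., the length of p:

```python
import math

p = (3, 5)
length = math.sqrt(sum(pk ** 2 for pk in p))  # distance to the origin
print(length)  # sqrt(9 + 25) = sqrt(34) ≈ 5.831
```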

Minkowski Distance

Minkowski Distance is a generalization of Euclidean Distance:

dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples

r = 1. City block (Manhattan, taxicab, L1 norm) distance.
A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors.

r = 2. Euclidean distance.

r → ∞. Supremum (Lmax norm, L∞ norm) distance.
This is the maximum difference between any component of the vectors.

Do not confuse r with n; all these distances are defined for all numbers of dimensions.
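A small Python sketch (an assumed implementation, not from the slides) covering all three cases; applied pairwise to p1 through p4 it reproduces the L1, L2, and L∞ matrices on the next slide:

```python
import math

def minkowski(p, q, r):
    """Minkowski distance: r=1 city block, r=2 Euclidean,
    r=math.inf supremum (Lmax)."""
    if r == math.inf:
        return max(abs(pk - qk) for pk, qk in zip(p, q))
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

p, q = (0, 2), (5, 1)
print(minkowski(p, q, 1))         # 6          (city block)
print(minkowski(p, q, 2))         # 5.0990...  (Euclidean)
print(minkowski(p, q, math.inf))  # 5          (supremum)
```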

Minkowski Distance

point | x | y
p1    | 0 | 2
p2    | 2 | 0
p3    | 3 | 1
p4    | 5 | 1

L1 (city block) Distance Matrix

      p1  p2  p3  p4
p1    0   4   4   6
p2    4   0   2   4
p3    4   2   0   2
p4    6   4   2   0

L2 (Euclidean) Distance Matrix

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

L∞ (supremum) Distance Matrix

      p1  p2  p3  p4
p1    0   2   3   5
p2    2   0   1   3
p3    3   1   0   2
p4    5   3   2   0

Common Properties of a Distance

Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)

2. d(p, q) = d(q, p) for all p and q. (Symmetry)

3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality)

where d(p, q) is the distance (dissimilarity) between points p and q.

A distance that satisfies these properties is a metric.
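A quick numerical spot-check of these three properties on the sample points from the earlier slides (a sketch, not part of the notes):

```python
import math
from itertools import product

def d(p, q):  # Euclidean distance
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

points = [(0, 2), (2, 0), (3, 1), (5, 1)]

# Check every triple of points (with repetition).
for p, q, r in product(points, repeat=3):
    assert d(p, q) >= 0 and (d(p, q) == 0) == (p == q)  # positive definiteness
    assert d(p, q) == d(q, p)                           # symmetry
    assert d(p, r) <= d(p, q) + d(q, r) + 1e-12         # triangle inequality
print("all three metric properties hold on the sample points")
```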
