Lecture 3-Know Your Data - M
Data Mining
Lecture # 3
Know Your Data
(Ch # 2)
What is Data?
Collection of data objects and their attributes
An attribute is a property or characteristic of an object
Examples: eye color of a person, temperature, etc.
Attribute is also known as variable, field, characteristic, or feature
A collection of attributes describes an object
Object is also known as record, point, case, sample, entity, or instance

Example table (each row is an object, each column is an attribute):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Attribute Values
Attribute values are numbers or symbols assigned
to an attribute
Attribute Type: Nominal
Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
Operations: mode, entropy, contingency correlation, χ² test
Attribute Level: Interval
Transformation: new_value = a * old_value + b, where a and b are constants
Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).
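
The interval transformation can be illustrated with the Celsius-to-Fahrenheit conversion. This is a minimal sketch; the function names are illustrative and not from the lecture, and the constants a = 9/5 and b = 32 are the standard conversion values.

```python
def interval_transform(old_value, a, b):
    """Apply the interval-level transformation new_value = a * old_value + b."""
    return a * old_value + b


def celsius_to_fahrenheit(celsius):
    # a = 9/5 changes the size of a unit (degree); b = 32 moves the zero point.
    return interval_transform(celsius, a=9 / 5, b=32)


for c in (0, 50, 100):
    print(c, "C ->", celsius_to_fahrenheit(c), "F")
```

Both scales carry the same information: only the location of zero and the size of a degree change.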
Discrete and Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
Examples: zip codes, counts, or the set of words in a
collection of documents
Often represented as integer variables.
Note: binary attributes are a special case of discrete
attributes
Continuous Attribute
Has real numbers as attribute values
Examples: temperature, height, or weight.
Practically, real values can only be measured and
represented using a finite number of digits.
Continuous attributes are typically represented as floating-point variables.
Important Characteristics of Structured Data
Dimensionality
• Number of attributes each object is described with
• Challenge: high dimensionality (curse of
dimensionality)
Sparsity
• Sparse data: values of most attributes are zero
• Challenge: sparse data call for special handling
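
One common form of "special handling" for sparse data is to store only the non-zero entries. The sketch below uses a plain dictionary for this; the helper names and the sample vector are hypothetical, chosen only for illustration.

```python
def to_sparse(dense):
    """Keep only the non-zero entries as {attribute_index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}


def sparsity(dense):
    """Fraction of attribute values that are zero."""
    return sum(1 for v in dense if v == 0) / len(dense)


# Hypothetical object with 10 attributes, most of them zero.
dense = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]
sparse = to_sparse(dense)

print(sparse)           # only 2 of the 10 entries are stored
print(sparsity(dense))  # fraction of zeros
```

The dimensionality of the object is still 10; only the storage shrinks.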
Some Basic Statistical Measures
Measuring the Central Tendency
• Mean, Median, and Mode
Measuring the Dispersion of Data
• Range, Quartiles, Interquartile Range, Standard Deviation, and Variance
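
These measures can be computed with Python's standard `statistics` module. The sketch below uses the Taxable Income (in thousands) and Marital Status columns of the sample table from the "What is Data?" slide. Two choices here are assumptions rather than part of the lecture: the sample (n - 1) forms of variance and standard deviation, and the default quartile method of `statistics.quantiles`. Other conventions give slightly different values.

```python
import statistics

# Taxable Income in thousands, Tid 1..10 of the sample table.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
# Marital Status for the same ten objects (a nominal attribute).
marital = ["Single", "Married", "Single", "Married", "Divorced",
           "Married", "Divorced", "Single", "Married", "Single"]

# Central tendency
mean = statistics.mean(incomes)
median = statistics.median(incomes)
modes = statistics.multimode(marital)  # every most-frequent value

# Dispersion
value_range = max(incomes) - min(incomes)
q1, q2, q3 = statistics.quantiles(incomes, n=4)  # three quartile cut points
iqr = q3 - q1
variance = statistics.variance(incomes)  # sample variance (assumption)
std_dev = statistics.stdev(incomes)      # sample standard deviation

print("mean:", mean, "median:", median, "mode(s):", modes)
print("range:", value_range, "IQR:", iqr)
print("variance:", variance, "S.D.:", std_dev)
```

The mean (104) sits well above the median (92.5) because the single 220K income pulls it upward, whereas the median is unaffected by that outlier.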
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data
objects are.
Is higher when objects are more alike.
Often falls in the range [0,1]
Dissimilarity
Numerical measure of how different two
data objects are.
Lower when objects are more alike.
Minimum dissimilarity is often 0.
Upper limit varies
Proximity refers to a similarity or dissimilarity
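
Because similarity and dissimilarity move in opposite directions, one is often derived from the other. The sketch below shows two common conversions; they are assumptions for illustration, not formulas stated on this slide, and the function names are hypothetical.

```python
def similarity_from_bounded(d):
    """For a dissimilarity already scaled to [0, 1]: s = 1 - d."""
    if not 0 <= d <= 1:
        raise ValueError("d must lie in [0, 1]")
    return 1 - d


def similarity_from_unbounded(d):
    """For a dissimilarity with no fixed upper limit: s = 1 / (1 + d)."""
    if d < 0:
        raise ValueError("d must be non-negative")
    return 1 / (1 + d)


print(similarity_from_bounded(0.25))  # close objects give high similarity
print(similarity_from_unbounded(0))   # identical objects give similarity 1
print(similarity_from_unbounded(9))   # distant objects approach 0
```

Both conversions map a dissimilarity of 0 to a similarity of 1 and keep the result in the range [0, 1].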
Distance Measures
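Remember that the K-nearest neighbors are determined on the basis of some kind of “distance” between points.
Example with x = (5,5), where the two points differ by 4 in one dimension and by 3 in the other:
L2-norm:
$dist(x,y) = \sqrt{4^2 + 3^2} = 5$
L1-norm:
$dist(x,y) = 4 + 3 = 7$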
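
The two distances in this example can be reproduced with a short sketch. The second point y = (9, 8) is an assumption: the extracted slide only shows x = (5,5) and the coordinate differences 4 and 3, and y = (9, 8) is one point consistent with them.

```python
import math


def l2_distance(x, y):
    """L2-norm: square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def l1_distance(x, y):
    """L1-norm: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)  # assumed second point; differs from x by 4 and 3

print(l2_distance(x, y))  # 5.0
print(l1_distance(x, y))  # 7
```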
Another Euclidean Distance
L∞ norm (supremum distance, also called uniform distance): d(x,y) = the maximum of the absolute differences between x and y in any dimension.
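
A minimal sketch of the L∞ norm follows. The points x = (5, 5) and y = (9, 8) are assumed for illustration and the function name is hypothetical.

```python
def linf_distance(x, y):
    """L-infinity norm: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)  # assumed points; coordinate differences are 4 and 3

print(linf_distance(x, y))  # 4
```

For these points the L∞ distance (4) is no larger than the L2 distance (5), which in turn is no larger than the L1 distance (7). This ordering holds for any pair of points.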
Proximity Measure for Nominal Attributes
The dissimilarity between two objects i and j having nominal attributes can be computed based on the ratio of mismatches:
$d(i, j) = \frac{p - m}{p}$
where m is the number of matches and p is the total number of attributes.
The corresponding similarity is:
$sim(i, j) = 1 - d(i, j) = \frac{m}{p}$
Example:
Dissimilarity between Nominal Attributes
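
The ratio-of-mismatches measure can be sketched as follows. The two objects and their four nominal attributes are made up for illustration and are not the lecture's own example; the function names are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, with m matches out of p nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p


def nominal_similarity(obj_i, obj_j):
    """sim(i, j) = 1 - d(i, j), which equals m / p."""
    return 1 - nominal_dissimilarity(obj_i, obj_j)


# Hypothetical objects: (eye color, marital status, refund, cheat)
obj_i = ("blue", "Single", "Yes", "No")
obj_j = ("brown", "Single", "Yes", "No")

print(nominal_dissimilarity(obj_i, obj_j))  # 0.25 (1 mismatch out of 4)
print(nominal_similarity(obj_i, obj_j))     # 0.75 (3 matches out of 4)
```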