Lecture 3-Know Your Data - M

This document discusses key concepts related to data attributes and data characterization in data mining. It defines data as a collection of objects and their attributes. Attributes can be nominal, ordinal, interval, or ratio scaled. Nominal attributes distinguish objects but have no ordering, while ordinal attributes also have an ordering. Interval attributes additionally support addition and subtraction, and ratio attributes also support multiplication and division. Attributes can be discrete or continuous. The document also discusses similarity and distance measures used to characterize relationships between data objects, including Euclidean distances (L1, L2, and L∞ norms) and non-Euclidean distances.

Uploaded by

Khizar Shahid

CSC479

Data Mining
Lecture # 3
Know Your Data
(Ch # 2)
What is Data?
• Collection of data objects and their attributes
• An attribute is a property or characteristic of an object
  – Examples: eye color of a person, temperature, etc.
  – An attribute is also known as a variable, field, characteristic, or feature
• A collection of attributes describes an object
  – An object is also known as a record, point, case, sample, entity, or instance

Example (columns are attributes, rows are objects):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
2
Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
  – The same attribute can be mapped to different attribute values
    • Example: height can be measured in feet or meters
  – Different attributes can be mapped to the same set of values
    • Example: attribute values for ID and age are integers
    • But the properties of the attribute values can differ
      – ID has no limit, but age has a maximum and minimum value
3
Types of Attributes
• There are different types of attributes
  – Nominal (categorical)
    • Examples: ID numbers (Student-ID, Customer-ID), eye color, zip codes
    • A binary attribute is a nominal attribute with only two categories
  – Ordinal (values have a meaningful order or ranking)
    • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short}
  – Numeric
    • Interval (measured on a scale of equal-size units)
      – Examples: calendar dates, temperatures in Celsius or Fahrenheit
    • Ratio-scaled (with an inherent zero point)
      – Examples: temperature in Kelvin, length, time, counts
4
Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
  – Distinctness: = ≠
  – Order: < >
  – Addition: + -
  – Multiplication: * /
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties
5
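The relationship between attribute types and the properties they possess can be written down as a small lookup table. The following is a minimal sketch in Python; the names `PROPERTIES` and `supports` are illustrative and not part of the lecture.

```python
# Properties each attribute type possesses, following the list on the slide.
# The names PROPERTIES and supports() are illustrative assumptions.
PROPERTIES = {
    "nominal": {"distinctness"},
    "ordinal": {"distinctness", "order"},
    "interval": {"distinctness", "order", "addition"},
    "ratio": {"distinctness", "order", "addition", "multiplication"},
}


def supports(attribute_type, prop):
    """Return True if the given attribute type possesses the property."""
    return prop in PROPERTIES[attribute_type]


print(supports("ordinal", "order"))            # True
print(supports("interval", "multiplication"))  # False
```

Each type inherits every property of the type before it, which is why ratio attributes possess all four.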
Attribute Type: Nominal
  Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
  Examples: zip codes, employee ID numbers, eye color, gender: {male, female}
  Operations: mode, entropy, contingency correlation, χ² test

Attribute Type: Ordinal
  Description: The values of an ordinal attribute provide enough information to order objects. (<, >)
  Examples: hardness of minerals, {good, better, best}, grades, street numbers
  Operations: median, percentiles, rank correlation, run tests, sign tests

Attribute Type: Interval
  Description: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, -)
  Examples: calendar dates, temperature in Celsius or Fahrenheit
  Operations: mean, standard deviation, Pearson's correlation, t and F tests

Attribute Type: Ratio
  Description: For ratio variables, both differences and ratios are meaningful. (*, /)
  Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
  Operations: geometric mean, harmonic mean, percent variation

6
Attribute Level: Nominal
  Transformation: Any permutation of values
  Comments: If all employee ID numbers were reassigned, would it make any difference?

Attribute Level: Ordinal
  Transformation: An order-preserving change of values, i.e., new_value = f(old_value), where f is a monotonic function.
  Comments: An attribute encompassing the notion of good, better, best can be represented equally well by the values {1, 2, 3} or by {0.5, 1, 10}.

Attribute Level: Interval
  Transformation: new_value = a * old_value + b, where a and b are constants
  Comments: Thus, the Fahrenheit and Celsius temperature scales differ in terms of where their zero value is and the size of a unit (degree).

Attribute Level: Ratio
  Transformation: new_value = a * old_value
  Comments: Length can be measured in meters or feet.
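To see why ratios are meaningful on a ratio scale but not on an interval scale, it may help to apply the two transformations to a few numbers. The sketch below uses the Celsius-to-Fahrenheit conversion as the interval case and a meters-to-feet conversion as the ratio case; the function names and the approximate conversion factor are assumptions chosen for illustration.

```python
def celsius_to_fahrenheit(c):
    # Interval-scale transformation: new_value = a * old_value + b
    return c * 9 / 5 + 32


METERS_TO_FEET = 3.28084  # approximate conversion factor (assumption)


def meters_to_feet(m):
    # Ratio-scale transformation: new_value = a * old_value
    return m * METERS_TO_FEET


c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (celsius_to_fahrenheit(c) for c in (c1, c2, c3))

# Plain ratios are NOT preserved on an interval scale:
# 20 / 10 = 2.0 in Celsius, but 68 / 50 = 1.36 in Fahrenheit.
print(c2 / c1, f2 / f1)

# Ratios of differences ARE preserved, because the offset b cancels out.
print((c3 - c1) / (c2 - c1), (f3 - f1) / (f2 - f1))

# Plain ratios are preserved on a ratio scale.
m1, m2 = 2.0, 6.0
print(m2 / m1, meters_to_feet(m2) / meters_to_feet(m1))
```

This is why "20 °C is twice as hot as 10 °C" is not a meaningful statement, while "6 m is three times as long as 2 m" is.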

7
Discrete and Continuous Attributes
• Discrete Attribute
  – Has only a finite or countably infinite set of values
  – Examples: zip codes, counts, or the set of words in a collection of documents
  – Often represented as integer variables
  – Note: binary attributes are a special case of discrete attributes
• Continuous Attribute
  – Has real numbers as attribute values
  – Examples: temperature, height, or weight
  – Practically, real values can only be measured and represented using a finite number of digits
  – Continuous attributes are typically represented as floating-point variables
8
Important Characteristics of Structured Data
• Dimensionality
  – Number of attributes each object is described with
  – Challenge: high dimensionality (curse of dimensionality)
• Sparsity
  – Sparse data: values of most attributes are zero
  – Challenge: sparse data call for special handling

9
Some Basic Statistical Measures
• Measuring the Central Tendency
  – Mean, Median, and Mode
• Measuring the Dispersion of Data
  – Range, Quartiles, Interquartile Range, Standard Deviation, and Variance

Please review these measures on your own to refresh what you learned in your Statistics class.

10
Similarity and Dissimilarity
• Similarity
  – Numerical measure of how alike two data objects are
  – Is higher when objects are more alike
  – Often falls in the range [0,1]
• Dissimilarity
  – Numerical measure of how different two data objects are
  – Is lower when objects are more alike
  – Minimum dissimilarity is often 0
  – Upper limit varies
• Proximity refers to either a similarity or a dissimilarity
11
Distance Measures
• Recall that the k-nearest neighbors of a point are determined on the basis of some kind of "distance" between points.
• Two major classes of distance measure:
  1. Euclidean: based on the position of points in some k-dimensional space.
  2. Non-Euclidean: not related to position or space.
• The choice of distance measure largely depends on the type of input data
12
Axioms of a Distance Measure
• d is a distance measure if it is a function from pairs of points to reals such that:
  1. Non-negativity: d(i,j) ≥ 0.
  2. Identity of indiscernibles: d(i,i) = 0, i.e., the distance of an object to itself is 0.
  3. Symmetry: d(i,j) = d(j,i).
  4. Triangle inequality: d(i,j) ≤ d(i,k) + d(k,j).

Any measure that satisfies these conditions is known as a metric.
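The four axioms can be spot-checked numerically on a handful of points. The sketch below is only an illustration: the helper names `satisfies_axioms` and `squared_euclidean` and the sample points are assumptions, and passing the check on a finite sample does not prove that a function is a metric, although a single failure does show that it is not.

```python
import itertools
import math


def satisfies_axioms(points, d, tol=1e-9):
    """Check the four axioms for distance function d on a finite sample."""
    for i in points:
        if abs(d(i, i)) > tol:                  # distance to itself is 0
            return False
    for i, j in itertools.product(points, repeat=2):
        if d(i, j) < -tol:                      # non-negativity
            return False
        if abs(d(i, j) - d(j, i)) > tol:        # symmetry
            return False
    for i, j, k in itertools.product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j) + tol:   # triangle inequality
            return False
    return True


def squared_euclidean(a, b):
    """Squared straight-line distance; used here as a counterexample."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


sample = [(0, 0), (1, 0), (2, 0), (5, 5)]
print(satisfies_axioms(sample, math.dist))          # True
print(satisfies_axioms(sample, squared_euclidean))  # False
```

The squared distance fails the triangle inequality: going from (0, 0) to (2, 0) directly costs 4, while going through (1, 0) costs 1 + 1 = 2. `math.dist` requires Python 3.8 or later.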
13
Some Euclidean Distances
• L2 norm (also called the common or Euclidean distance):
  $d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
  – The most common notion of "distance".
• L1 norm (also called Manhattan distance):
  $d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
  – The distance if you had to travel along coordinates only.
14
Examples of L1 and L2 norms
Let x = (5,5) and y = (9,8). The points differ by 4 along the first coordinate and by 3 along the second.
L2-norm: dist(x,y) = $\sqrt{4^2 + 3^2}$ = 5
L1-norm: dist(x,y) = 4 + 3 = 7
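The same calculation can be reproduced in a few lines of Python. This is a minimal sketch using the points x = (5, 5) and y = (9, 8) from the example; the function names are illustrative.

```python
import math


def l1_distance(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = (5, 5)
y = (9, 8)
print(l1_distance(x, y))  # 7
print(l2_distance(x, y))  # 5.0
```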

15
Another Euclidean Distance
• L∞ norm (supremum distance, also called uniform distance): d(x,y) = the maximum absolute difference between x and y in any dimension.
• Example: Suppose we have the following two points:
  $x = (1, 2, 4, 8, 7)^T$ and $y = (3, 8, 6, 2, 5)^T$
  Find L1, L2 and L∞.
16

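As a minimal sketch, the L∞ norm can be computed by taking the largest absolute coordinate difference. The example reuses the points x = (5, 5) and y = (9, 8) from the earlier L1 and L2 example; the function name is illustrative.

```python
def linf_distance(x, y):
    """Supremum distance: largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


x = (5, 5)
y = (9, 8)
print(linf_distance(x, y))  # 4
```

For these two points the three distances are ordered as L∞ = 4 ≤ L2 = 5 ≤ L1 = 7, an ordering that holds for any pair of points.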
Non-Euclidean Distances
• Proximity measure for nominal attributes
• Jaccard measure for binary vectors
• Distance measure for ordinal variables
• Edit distance = number of inserts and deletes needed to change one string into another.
• Cosine measure = angle between the vectors from the origin to the points in question.
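Two of these measures are easy to sketch directly from their one-line definitions. The code below is an illustration under stated assumptions: the edit distance counts only single-character inserts and deletes (no substitutions), as defined on the slide, and the cosine measure returns the angle in radians. The function names and test strings are hypothetical.

```python
import math


def indel_edit_distance(s, t):
    """Edit distance counting only single-character inserts and deletes."""
    rows, cols = len(s) + 1, len(t) + 1
    # dp[i][j] = distance between the first i chars of s and first j of t
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i                      # delete all i characters
    for j in range(cols):
        dp[0][j] = j                      # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def cosine_angle(u, v):
    """Angle in radians between two non-zero vectors from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


print(indel_edit_distance("abcde", "acfdeg"))  # 3
print(cosine_angle((1, 0), (0, 1)))            # 1.5707963267948966 (pi / 2)
```

In the string example, "abcde" becomes "acfdeg" by deleting "b", inserting "f", and inserting "g", for a distance of 3.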

17
Proximity Measure for Nominal Attributes
• The dissimilarity between two objects i and j having nominal attributes can be computed based on the ratio of mismatches:
  $d(i,j) = \frac{p - m}{p}$
  where m is the number of matches and p is the total number of attributes.
• The corresponding similarity is:
  $sim(i,j) = 1 - d(i,j) = \frac{m}{p}$
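The mismatch ratio translates directly into code. The following is a minimal sketch in which each object is a tuple of nominal values; the function names are illustrative, and the two sample objects (Refund, Marital Status) are chosen only for demonstration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects with p nominal attributes."""
    p = len(obj_i)
    if p == 0 or p != len(obj_j):
        raise ValueError("objects must have the same, non-zero number of attributes")
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


def nominal_similarity(obj_i, obj_j):
    """sim(i, j) = 1 - d(i, j) = m / p."""
    return 1 - nominal_dissimilarity(obj_i, obj_j)


# Two illustrative objects described by (Refund, Marital Status).
a = ("No", "Married")
b = ("No", "Single")
print(nominal_dissimilarity(a, b))  # 0.5
print(nominal_similarity(a, b))     # 0.5
```

With a single nominal attribute (p = 1), the dissimilarity is 0 when the two values match and 1 otherwise.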
Example:
18
Dissimilarity between Nominal Attributes

Here, we have one nominal attribute, test-1, so p=1.


The dissimilarity matrix is as shown below:

19
