Module No 2 - Part 2 - Compressed - Compressed

The document discusses different types of attributes and statistical descriptions of data that are important for data exploration in data mining. It covers nominal, binary, ordinal and numeric attributes as well as statistical metrics like mean, median, mode, variance and standard deviation. Graphical representations like histograms and boxplots are also described.

Uploaded by

Abhishek Bapat

Module No : 02

Introduction to data mining

1
Data Exploration

2
Attributes

• An attribute is a data field representing a characteristic or feature of a data object.
• The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
• The term dimension is commonly used in data warehousing.
• Machine learning literature uses the term feature, while statisticians prefer the term variable.
• Data mining and database professionals commonly use the term attribute.
• Attributes describing a customer object can include customer ID, name, and address.
3
Types of Attributes UQ

1. Nominal attributes
2. Binary attributes
3. Ordinal attributes
4. Numeric attributes
a) Interval-scaled attributes
b) Ratio-scaled attributes

4
Types of Attributes 1. Nominal attributes
• Nominal means “relating to names.” The values of a nominal attribute are symbols or names of things.
• Each value represents some kind of category, code, or state.
• The values do not have any meaningful order.
• Example 1: hair color and marital status are two attributes describing person objects. Possible values for hair color are black, brown, blond, red, gray, and white.
• The attribute marital status can take on the values single, married, divorced, and widowed.
• Both hair color and marital status are nominal attributes.
• Example 2: occupation, with the values teacher, dentist, programmer, farmer, and so on.
5
Types of Attributes 2. Binary attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
• Example: The attribute medical test is binary, where a value of 1
means the result of the test for the patient is positive, while 0
means the result is negative.
• A binary attribute is symmetric if there is no preference on which outcome should be coded as 0 or 1. Example: the attribute gender, having the states male and female.

6
Types of Attributes 3. Ordinal attributes

• An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.
• Example 1: drink size corresponds to the size of drinks available at a fast-food restaurant. This attribute has three possible values: small, medium, and large.
• Example 2: grade (e.g., A++, A+, A, B++, B+, B, C++, and so on).

7
Types of Attributes 4. Numeric attributes
• A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
a) Interval-Scaled Attributes
• Interval-scaled attributes are measured on a scale of equal-size units. The values of interval-scaled attributes have order and can be positive, 0, or negative.
• Such attributes allow us to compare and quantify the difference between values.
• Example 1: temperature (20 °C is five degrees higher than 15 °C).
• Example 2: calendar dates (the years 2002 and 2010 are eight years apart).

8
Types of Attributes 4. Numeric attributes
b) Ratio-Scaled Attributes
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point; that is, if a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
• The values are ordered, and we can also compute the difference between values, as well as the mean, median, and mode.
• Example 1: years_of_experience
• Example 2: no-of-words (in a document)
• Example 3: weight, height, latitude and longitude
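
The difference between the two numeric scales can be checked with a little arithmetic. The sketch below is illustrative only: it assumes Celsius as the interval-scaled attribute and Kelvin as its ratio-scaled counterpart, and the variable names are made up for this example.

```python
# Interval scale (Celsius): differences are meaningful, ratios are not.
celsius_a, celsius_b = 20.0, 10.0
difference = celsius_a - celsius_b      # a valid statement: 10 degrees apart
naive_ratio = celsius_a / celsius_b     # 2.0, but "twice as hot" is not a valid claim

# Ratio scale (Kelvin): the scale has a true zero point, so ratios are meaningful.
kelvin_a = celsius_a + 273.15
kelvin_b = celsius_b + 273.15
true_ratio = kelvin_a / kelvin_b        # roughly 1.035

print(difference, naive_ratio, round(true_ratio, 3))
```

The same 10-degree gap gives a ratio of 2 on one scale and about 1.035 on the other, which is why ratios are only reported for attributes with an inherent zero-point.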

9
Statistical Description of data
• For data pre-processing to be successful, it is essential to have an
overall picture of your data.
• Basic statistical descriptions can be used to identify properties of
the data and highlight which data values should be treated as noise
or outliers.
• The following measures and plots can be used to describe data statistically:
1. Mean 2. Median
3. Mode 4. Midrange
4. Range 5. Quartiles
6. Interquartile Range 7. Five-Number Summary
8. Boxplots 9. Outliers
10. Variance 11. Standard Deviation
12. Histograms 13. Scatter Plots
14. Data Correlation
10
Statistical Description of data : 1. Mean (Average Value)
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X. The mean of this set of values is
$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$
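
As a quick check of the definition, the sketch below computes the mean as the sum of the values divided by their count. It borrows the 12-value data set that this deck uses for the mode, midrange, and range; the variable names are illustrative.

```python
from statistics import mean

# Data set taken from the mode/midrange/range slides of this deck.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean = sum of all observations divided by the number of observations.
manual_mean = sum(values) / len(values)

print(manual_mean)      # 58.0
print(mean(values))     # the standard library gives the same value
```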

11
Statistical Description of data : 2. Median (middle value)
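Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X, like salary. The median is the middle value of the set once the values are sorted: if N is odd, it is the single middle value; if N is even, it is taken as the average of the two middlemost values.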
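
A minimal sketch of the median, assuming the usual convention of averaging the two middle values when the count is even. The function name `median_of` is made up for this example, and the data set is the 12-value set used elsewhere in this deck.

```python
from statistics import median

def median_of(values):
    """Return the middle value of the sorted data (average of the two middle values if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(median_of(values))   # 54.0, the average of 52 and 56
print(median(values))      # the standard library agrees
```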

12
Statistical Description of data : 3. Mode (most common value)
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The mode of a set of values is the value that occurs most frequently.
This data set is bimodal, i.e., there are two modes: 52 and 70.
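
One way to find every mode of the data set above is `statistics.multimode` (available in Python 3.8 and later), which returns all values tied for the highest frequency. This is only a sketch; the variable names are illustrative.

```python
from statistics import multimode

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns every value that occurs with the maximum frequency.
modes = multimode(values)

print(modes)             # [52, 70]
print(len(modes) == 2)   # True: the data set is bimodal
```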

13
Statistical Description of data : 4. Midrange
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The midrange is the average of the largest and smallest values in the set.
The midrange of this data set is (30 + 110) / 2 = 70.

14
Statistical Description of data : 5. Range
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X.
The range of the set is the difference between the largest (max) and smallest (min) values.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The range of this data set is 110 − 30 = 80.
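
Both the range and the midrange depend only on the smallest and largest values, so they can be computed together. A short sketch on the same data set, with illustrative variable names:

```python
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

smallest, largest = min(values), max(values)

value_range = largest - smallest        # spread between the extremes
midrange = (smallest + largest) / 2     # average of the extremes

print(value_range)   # 80
print(midrange)      # 70.0
```

Because both measures use only the two extreme values, a single unusual observation such as 110 shifts them considerably.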

15
Statistical Description of data : 5. Quartiles
Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations of some numeric attribute X.
We can pick certain data points so as to split the data distribution into equal-size consecutive sets, as shown in the figure. Quartiles are the three points that split the distribution into four such sets.

16
Statistical Description of data : 5. Quartiles
Q1 = lower quartile
Q2 = median
Q3 = upper quartile
In this example:
Q1 = 52
Q2 = 54
Q3 = 58

– The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data.
– The second quartile, denoted by Q2, is the 50th percentile. As the median, it gives the centre of the data distribution.
– The third quartile, denoted by Q3, is the 75th percentile. It cuts off the lowest 75% (or highest 25%) of the data.
17
Statistical Description of data : 6. Interquartile Range(IQR)

• The distance between the first and third quartiles is a simple measure of spread that gives the range covered by the middle half of the data.
• This distance is called the interquartile range (IQR) and is defined as IQR = Q3 − Q1.

18
Statistical Description of data : 7. Five-Number Summary
• The five-number summary of a distribution consists of the median
(Q2), the quartiles Q1 and Q3, and the smallest and largest
individual observations, written in the order of Minimum, Q1,
Median, Q3, Maximum.
• E.g. 2,3,3,4,5,6,8,9
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9
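
The values above can be reproduced in code. The sketch below assumes the "median of halves" convention for quartiles (Q1 is the median of the lower half, Q3 the median of the upper half), which matches the numbers on this slide; other conventions, such as linear interpolation, can give slightly different quartiles. The function name is made up for this example.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, the middle value is left out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

print(five_number_summary([2, 3, 3, 4, 5, 6, 8, 9]))   # (2, 3.0, 4.5, 7.0, 9)
```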

21
Statistical Description of data : 8. Boxplots
• Boxplots are a popular way of visualizing a distribution. A boxplot
incorporates the five-number summary as follows:
• E.g. 2,3,3,4,5,6,8,9
The five-number summary
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9

22
Exercise

23
Exercise (answers)

24
Statistical Description of data : 9. Outliers
• Outliers are extremely high or extremely low values in the data set. A common rule flags a value as an outlier if it is:
• greater than Q3 + 1.5 × IQR, or
• less than Q1 − 1.5 × IQR
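
The 1.5 × IQR rule can be sketched as follows. The example reuses the 12-value data set from the earlier slides and assumes quartiles are computed as the medians of the lower and upper halves; a different quartile convention would move the fences slightly. The function name `iqr_outliers` is hypothetical.

```python
from statistics import median

def iqr_outliers(values):
    """Return (q1, q3, iqr, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    q3 = median(ordered[half:] if n % 2 == 0 else ordered[half + 1:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return q1, q3, iqr, outliers

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3, iqr, outliers = iqr_outliers(values)

print(q1, q3, iqr)   # 48.5 66.5 18.0
print(outliers)      # [110], since 110 > 66.5 + 1.5 * 18 = 93.5
```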

25
Statistical Description of data : 10. Variance and Standard deviation
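
A minimal sketch of both measures on the 12-value data set from the earlier slides. It assumes the population form, in which the variance is the average squared deviation from the mean (dividing by N) and the standard deviation is its square root; the slide text does not say whether the population or the sample form (dividing by N − 1) is intended.

```python
from math import sqrt
from statistics import pvariance

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(values)
mean_value = sum(values) / n

# Population variance: average of the squared deviations from the mean.
variance = sum((v - mean_value) ** 2 for v in values) / n

# Standard deviation: square root of the variance, in the same units as the data.
std_dev = sqrt(variance)

print(round(variance, 2))            # 379.17
print(round(std_dev, 2))             # 19.47
print(round(pvariance(values), 2))   # the standard library gives the same variance
```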

26
Statistical Description of data : 12. Histograms
• “Histos” means pole, and “gram” means chart, so a histogram is a chart of poles.
• Plotting histograms is a graphical method for summarizing the distribution of a given attribute, X.

27
Statistical Description of data : 12. Histograms
Draw a histogram for the following data set:

Transaction ID    Items Bought
T1                F, A, D, B
T2                D, A, C, E, B
T3                C, A, B, E
T4                B, A, D
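
One reading of this exercise, assumed here, is that the histogram should show how many transactions contain each item. The sketch below counts the items and prints a simple text chart with one bar per item; the variable names are illustrative.

```python
from collections import Counter

transactions = {
    "T1": ["F", "A", "D", "B"],
    "T2": ["D", "A", "C", "E", "B"],
    "T3": ["C", "A", "B", "E"],
    "T4": ["B", "A", "D"],
}

# Count how many transactions contain each item.
counts = Counter(item for items in transactions.values() for item in items)

# Print one bar ("pole") per item, in alphabetical order.
for item in sorted(counts):
    print(f"{item} | {'#' * counts[item]} ({counts[item]})")
```

Items A and B appear in all four transactions, D in three, C and E in two, and F in one.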

28
Statistical Description of data : 13. Scatter Plots
• A scatter plot is one of the most effective graphical methods for determining whether there appears to be a relationship, pattern, or trend between two numeric attributes.
• To construct a scatter plot, each pair of values is treated as a pair of coordinates in an algebraic sense and plotted as a point in the plane.

29
Statistical Description of data : 14. Data Correlations
• Two attributes, X and Y, are correlated if one attribute implies the other. Correlations can be positive, negative, or null (uncorrelated).
• The figure shows examples of positive and negative correlations between two attributes.
• If the pattern of plotted points slopes from lower left to upper right, the values of X increase as the values of Y increase, suggesting a positive correlation.
• If the pattern of plotted points slopes from upper left to lower right, the values of X increase as the values of Y decrease, suggesting a negative correlation.
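
The direction of a correlation can also be checked numerically. The slides do not name a specific measure; the sketch below uses Pearson's correlation coefficient as one common choice, and the small data sets are made up for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # points slope from lower left to upper right
falling = [10, 8, 6, 4, 2]    # points slope from upper left to lower right

print(pearson(x, rising))     # close to +1: positive correlation
print(pearson(x, falling))    # close to -1: negative correlation
```

A coefficient near 0 would correspond to the null (uncorrelated) case.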

30
Exercise

32
Exercise: Answers

33
Measuring data similarity and dissimilarity
• In data mining applications (such as clustering and classification), we are interested in comparing objects on the basis of their similarities and dissimilarities.
• Similarities and dissimilarities can be represented and measured in the following ways:
1. Data Matrix
2. Dissimilarity Matrix
3. Minkowski Distance
a) Manhattan(City block) Distance
b) Euclidean Distance
c) Supremum Distance
4. Cosine similarity

36
Measuring data similarity and dissimilarity
Data Matrix:
• This structure stores the n data objects in the form of a relational
table, or n-by-p matrix (n objects, p attributes)

37
Measuring data similarity and dissimilarity
Data Matrix:
• Data points: X1(1,2), X2(3,5), X3(2,0), X4(4,5)
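
For these four points, the data matrix has n = 4 rows (objects) and p = 2 columns (attributes). A minimal sketch using a plain list of lists; the variable names are illustrative.

```python
# Each row is one data object; each column is one attribute.
data_matrix = [
    [1, 2],   # X1
    [3, 5],   # X2
    [2, 0],   # X3
    [4, 5],   # X4
]

n = len(data_matrix)       # number of objects
p = len(data_matrix[0])    # number of attributes

print(f"{n}-by-{p} data matrix")
for label, row in zip(["X1", "X2", "X3", "X4"], data_matrix):
    print(label, row)
```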

38
Measuring data similarity and dissimilarity
Dissimilarity Matrix:
• This structure stores a collection of proximities that are available for
all pairs of n objects.
• It is often represented by an n-by-n table

39
Dissimilarity Matrix:
• Data points: X1(1,2), X2(3,5), X3(2,0), X4(4,5)
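
A sketch of the n-by-n dissimilarity matrix for these four points. It assumes Euclidean distance as the dissimilarity measure, which the slide text does not specify, and uses `math.dist` (Python 3.8 and later). Because the matrix is symmetric with zeros on the diagonal, only the lower triangle is printed.

```python
from math import dist

points = [(1, 2), (3, 5), (2, 0), (4, 5)]   # X1, X2, X3, X4

# Entry [i][j] holds the distance between object i and object j.
matrix = [[dist(a, b) for b in points] for a in points]

# Print the lower triangle, one row per object.
for i, row in enumerate(matrix):
    print(" ".join(f"{d:.3f}" for d in row[: i + 1]))
```

For example, the entry for X2 and X4 is 1.000, because the two points differ only by 1 in the first attribute.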

40
Minkowski Distance
• Data points: X1(1,2), X2(3,5), X3(2,0), X4(4,5)

a) Manhattan (city block) distance b) Euclidean distance c) Supremum distance
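
The three distances are special cases of the Minkowski distance, which sums the absolute attribute differences raised to a power h and takes the h-th root: h = 1 gives the Manhattan distance, h = 2 the Euclidean distance, and the limit as h grows gives the supremum distance (the largest single attribute difference). A minimal sketch for X1 and X2; the function names are illustrative.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length points."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def supremum(a, b):
    """Supremum distance: the largest absolute difference over all attributes."""
    return max(abs(x - y) for x, y in zip(a, b))

x1, x2 = (1, 2), (3, 5)

manhattan = minkowski(x1, x2, 1)   # |1-3| + |2-5| = 5
euclidean = minkowski(x1, x2, 2)   # sqrt(2**2 + 3**2) = sqrt(13)
sup = supremum(x1, x2)             # max(2, 3) = 3

print(manhattan, round(euclidean, 3), sup)   # 5.0 3.606 3
```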

41
Measuring data similarity and dissimilarity
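Cosine similarity
• Cosine similarity is the cosine of the angle between two vectors x and y: $\mathrm{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

• Example: x = (2,1,3,2,4,5,3); y = (4,3,4,3,6,5,5). How similar are x and y?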
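
The question can be answered with the standard definition of cosine similarity: the dot product of the two vectors divided by the product of their Euclidean lengths. A short sketch, with an illustrative function name:

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

x = (2, 1, 3, 2, 4, 5, 3)
y = (4, 3, 4, 3, 6, 5, 5)

# Dot product = 93, |x| = sqrt(68), |y| = sqrt(136).
print(round(cosine_similarity(x, y), 3))   # 0.967
```

A value this close to 1 indicates that the two vectors point in nearly the same direction, so x and y are highly similar.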

45
Exercise : Minkowski Distance
a) Manhattan (city block) distance b) Euclidean distance c) Supremum distance
