Module No 2 - Part 2 - Compressed - Compressed
Data Exploration
Attributes
1. Nominal attributes
2. Binary attributes
3. Ordinal attributes
4. Numeric attributes
a) Interval-scaled attributes
b) Ratio-scaled attributes
Types of Attributes 1. Nominal attributes
• Nominal means “relating to names.” The values of a nominal
attribute are symbols or names of things.
• Each value represents some kind of category, code, or state
• The values do not have any meaningful order.
• Example 1: suppose hair color and marital status are two attributes
describing person objects. Possible values for hair color are
black, brown, blond, red, gray, and white.
• The attribute marital status can take on the values single, married,
divorced, and widowed.
• Both hair color and marital status are nominal attributes.
• Example 2: occupation, with the values teacher, dentist,
programmer, farmer, and so on.
Types of Attributes 2. Binary attributes
• A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent,
and 1 means that it is present.
• Binary attributes are referred to as Boolean if the two states
correspond to true and false.
• Example: The attribute medical test is binary, where a value of 1
means the result of the test for the patient is positive, while 0
means the result is negative.
• A binary attribute is symmetric if both of its states are equally
valuable, so that there is no preference on which outcome should be
coded as 0 or 1. Example: the attribute gender, having the states male
and female.
Types of Attributes 3. Ordinal attributes
Types of Attributes 4. Numeric attributes
• A numeric attribute is quantitative; that is, it is a measurable
quantity, represented in integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
a) Interval-Scaled Attributes
• Interval-scaled attributes are measured on a scale of equal-size
units. The values of interval-scaled attributes have order and can be
positive, 0, or negative.
• Such attributes allow us to compare and quantify the difference
between values.
• Example 1: temperature (20 degrees Celsius is five degrees higher
than 15 degrees Celsius)
• Example 2: calendar dates (the years 2002 and 2010 are eight years
apart)
b) Ratio-Scaled Attributes
• A ratio-scaled attribute is a numeric attribute with an inherent
zero-point. That is, if a measurement is ratio-scaled, we can speak of a
value as being a multiple (or ratio) of another value.
• The values are ordered, and we can also compute the difference
between values, as well as the mean, median, and mode.
• Example 1: years_of_experience
• Example 2: no-of-words (in a document)
• Example 3: weight, height, latitude and longitude
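The difference between the two numeric scales can be sketched in Python. This is only an illustration: the temperature readings, the experience values, and the helper name `celsius_to_kelvin` are assumptions, not part of the slides. It shows that differences are meaningful on an interval scale, while ratios only become meaningful on a scale with a true zero.

```python
def celsius_to_kelvin(c):
    """Convert Celsius to Kelvin, a scale with a true zero-point."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0  # hypothetical readings in degrees Celsius

# Interval scale: the difference between two values is meaningful.
difference = t2_c - t1_c

# The Celsius ratio is computable but not meaningful,
# because 0 degrees Celsius does not mean "no temperature".
naive_ratio = t2_c / t1_c

# On the Kelvin scale (true zero) the ratio is meaningful.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

# Ratio scale: years of experience has a true zero,
# so "twice as much experience" is a valid statement.
years_a, years_b = 4, 8  # hypothetical values
experience_ratio = years_b / years_a

print(difference, naive_ratio, round(true_ratio, 4), experience_ratio)
```

The Celsius ratio of 2.0 is misleading; on the Kelvin scale the same two temperatures differ by only a few percent.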
Statistical Description of data
• For data pre-processing to be successful, it is essential to have an
overall picture of your data.
• Basic statistical descriptions can be used to identify properties of
the data and highlight which data values should be treated as noise
or outliers.
• Following are the different ways to describe data statistically:
1. Mean
2. Median
3. Mode
4. Midrange
5. Range
5. Quartiles
6. Interquartile Range
7. Five-Number Summary
8. Boxplots
9. Outliers
10. Variance
11. Standard Deviation
12. Histograms
13. Scatter Plots
14. Data Correlation
Statistical Description of data : 1. Mean (Average Value)
Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X. The mean of this set of values is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
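As a small sketch, the mean can be computed directly from its definition or with Python's standard `statistics` module. The values are the data set used on the later slides of this deck; the variable names are illustrative.

```python
from statistics import mean

# Data set used on the later slides (mode, midrange, range).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Mean = sum of all values divided by the number of values N.
manual_mean = sum(values) / len(values)

# The standard library gives the same result.
library_mean = mean(values)

print(manual_mean, library_mean)
```

Note that the single large value 110 pulls the mean upward, which is why the median is often reported alongside it.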
Statistical Description of data : 2. Median (middle value)
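Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X, like salary. The median of this set of values
is the middle value of the set when sorted in increasing order. If N is
odd, the median is the middle value; if N is even, the median is the
average of the two middle values.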
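A minimal sketch of the median, handling both the odd and the even case. The function name `median_of` is hypothetical; the values are the data set used on the later slides.

```python
from statistics import median

def median_of(values):
    """Return the middle value of the sorted data (average of two middles if N is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd N: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even N: average of two middle values

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

manual_median = median_of(values)   # N = 12, so average of the 6th and 7th values
library_median = median(values)     # same result from the standard library

print(manual_median, library_median)
```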
Statistical Description of data : 3. Mode (most common value)
Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The mode of a set of values is the value that occurs most frequently.
This data set is bimodal, i.e. there are two modes, 52 and 70.
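The bimodal result above can be checked with a short sketch. It counts occurrences with `collections.Counter` and compares against `statistics.multimode` (available from Python 3.8); the variable names are illustrative.

```python
from collections import Counter
from statistics import multimode

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Count how often each value occurs.
counts = Counter(values)
highest = max(counts.values())

# Every value that reaches the highest count is a mode.
manual_modes = sorted(v for v, c in counts.items() if c == highest)

# multimode returns all modes, in order of first appearance.
library_modes = multimode(values)

print(manual_modes, library_modes)
```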
Statistical Description of data : 4. Midrange
Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X.
Data set: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110
The midrange is the average of the largest and smallest values in the set.
The midrange of this data set is (30 + 110) / 2 = 70.
Statistical Description of data : 5. Range
Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X.
The range of the set is the difference between the largest (max) and
smallest (min) values.
Example: (30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110)
The range of this data set is 110 - 30 = 80.
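Both the midrange and the range depend only on the smallest and largest values, so they can be sketched together. The variable names are illustrative.

```python
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

smallest = min(values)
largest = max(values)

# Midrange: average of the largest and smallest values.
midrange = (smallest + largest) / 2

# Range: difference between the largest and smallest values.
value_range = largest - smallest

print(midrange, value_range)
```

Because both measures use only the two extreme values, a single outlier such as 110 changes them considerably.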
Statistical Description of data : 5. Quartiles
Let x1, x2, ..., xN be a set of N values or observations, such as for
some numeric attribute X.
We can pick certain data points so as to split the data distribution into
equal-size consecutive sets, as shown in the figure. The quartiles are the
three points that split the distribution into four such sets.
Q1 = Lower Quartile
Q2 = Median
Q3 = Upper Quartile
In this example:
Q1 = 52
Q2 = 54
Q3 = 58
– The first quartile, denoted by Q1, is the 25th percentile. It cuts off the
lowest 25% of the data.
– The third quartile, denoted by Q3, is the 75th percentile. It cuts off the
lowest 75% (or highest 25%) of the data.
– The second quartile is the 50th percentile. As the median, it gives the
centre of the data distribution.
Statistical Description of data : 6. Interquartile Range (IQR)
• The interquartile range is the distance between the first and third
quartiles: IQR = Q3 - Q1.
• It gives the range covered by the middle half of the data.
Statistical Description of data : 7. Five-Number Summary
• The five-number summary of a distribution consists of the median
(Q2), the quartiles Q1 and Q3, and the smallest and largest
individual observations, written in the order of Minimum, Q1,
Median, Q3, Maximum.
• E.g. 2,3,3,4,5,6,8,9
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9
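The numbers above can be reproduced with a short sketch. It assumes the "median of halves" convention for quartiles (Q1 is the median of the lower half, Q3 the median of the upper half, with the overall median left out of both halves when N is odd). Other quartile conventions exist and can give slightly different values. The function names are hypothetical.

```python
def median_of(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum); needs at least two values."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # values below the median position
    upper = ordered[half + (n % 2):]   # values above it (skip the median if N is odd)
    return (ordered[0], median_of(lower), median_of(ordered),
            median_of(upper), ordered[-1])

data = [2, 3, 3, 4, 5, 6, 8, 9]
summary = five_number_summary(data)
print(summary)
```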
Statistical Description of data : 8. Boxplots
• Boxplots are a popular way of visualizing a distribution. A boxplot
incorporates the five-number summary as follows:
• E.g. 2,3,3,4,5,6,8,9
The five-number summary
• Minimum = 2
• Q1= 3
• Median = 4.5
• Q3= 7
• Maximum= 9
Statistical Description of data : 9. Outliers
• Outliers are extremely high or extremely low values in the data
set. We can identify outliers as follows:
• Values greater than Q3 + 1.5 × IQR
• Values less than Q1 - 1.5 × IQR
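The 1.5 × IQR rule can be sketched as follows. The quartiles are computed with the "median of halves" convention, which is an assumption here (other conventions give slightly different fences), and the function names are hypothetical. The data is the set used on the earlier slides.

```python
def median_of(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def find_outliers(values):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 * IQR rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_of(ordered[:half])            # median of the lower half
    q3 = median_of(ordered[half + (n % 2):])  # median of the upper half
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in ordered if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
lower_fence, upper_fence, outliers = find_outliers(values)
print(lower_fence, upper_fence, outliers)
```

Under this convention Q1 = 48.5 and Q3 = 66.5, so IQR = 18 and only the value 110 falls outside the fences.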
Statistical Description of data : 10. Variance and Standard deviation
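The formulas for this slide are not included in the text, so the sketch below assumes the population form: the variance is $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$ and the standard deviation $\sigma$ is its square root. If the sample form (dividing by N - 1) is intended, `statistics.variance` and `statistics.stdev` would be the corresponding calls.

```python
from math import sqrt
from statistics import pvariance, pstdev

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

n = len(values)
mean_value = sum(values) / n

# Population variance: average squared deviation from the mean.
manual_variance = sum((x - mean_value) ** 2 for x in values) / n

# Standard deviation: square root of the variance.
manual_std = sqrt(manual_variance)

# Standard-library equivalents (population form).
library_variance = pvariance(values)
library_std = pstdev(values)

print(round(manual_variance, 2), round(manual_std, 2))
```

A low standard deviation means the values lie close to the mean; a high one means they are spread over a wide range.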
Statistical Description of data : 12. Histograms
• “Histos” means pole, and “gram” means chart, so a histogram is a
chart of poles.
• Plotting histograms is a graphical method for summarizing the
distribution of a given attribute, X
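As a rough sketch of the idea, the values of an attribute can be grouped into equal-width buckets and the count per bucket drawn as a bar. The bucket width of 20 and the function name `histogram_counts` are assumptions for illustration; the data is the set used on the earlier slides.

```python
from collections import Counter

def histogram_counts(values, width):
    """Count values per bucket [start, start + width), keyed by bucket start."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
bucket_width = 20  # hypothetical choice

counts = histogram_counts(values, bucket_width)

# Text rendering: one bar ("pole") of '#' characters per non-empty bucket.
for start, count in counts.items():
    print(f"{start:>3}-{start + bucket_width - 1:<3} {'#' * count}")
```

Empty buckets (here 80 to 99) are simply omitted by this sketch; a plotted histogram would show them as gaps.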
Draw a histogram for the following data set
Statistical Description of data : 13. Scatter Plots
• A scatter plot is one of the most effective graphical methods for determining if
there appears to be a relationship, pattern, or trend between two numeric
attributes.
• To construct a scatter plot, each pair of values is treated as a pair of coordinates in
an algebraic sense and plotted as a point in the plane.
Statistical Description of data : 14. Data Correlations
• Two attributes, X and Y, are correlated if one attribute implies the other.
Correlations can be positive, negative, or null (uncorrelated).
• The figure shows examples of positive and negative correlations between two
attributes.
• If the pattern of plotted points slopes from lower left to upper right, this
means that the values of X increase as the values of Y increase, suggesting
a positive correlation.
• If the pattern of plotted points slopes from upper left to lower right, the
values of X increase as the values of Y decrease, suggesting a negative
correlation.
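The slides describe correlation visually. One common way to put a number on it is the Pearson correlation coefficient, which is an assumption here since the slides do not name a specific measure. It ranges from -1 (perfect negative) through 0 (uncorrelated) to +1 (perfect positive). The data and the function name `pearson_r` are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # slopes from lower left to upper right
falling = [10, 8, 6, 4, 2]   # slopes from upper left to lower right

r_positive = pearson_r(x, rising)
r_negative = pearson_r(x, falling)
print(r_positive, r_negative)
```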
Measuring data similarity and dissimilarity
• In data mining applications (such as clustering and classification), we
are interested in comparing objects on the basis of their
similarities and dissimilarities.
• Similarities and dissimilarities can be measured in the following
ways:
1. Data Matrix
2. Dissimilarity Matrix
3. Minkowski Distance
a) Manhattan(City block) Distance
b) Euclidean Distance
c) Supremum Distance
4. Cosine similarity
Data Matrix:
• This structure stores the n data objects in the form of a relational
table, or n-by-p matrix (n objects, p attributes)
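• Example: the data points X1(1,2), X2(3,5), X3(2,0), X4(4,5) form a
4-by-2 data matrix with one row per object and one column per attribute:
$$\begin{bmatrix} 1 & 2 \\ 3 & 5 \\ 2 & 0 \\ 4 & 5 \end{bmatrix}$$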
Dissimilarity Matrix:
• This structure stores a collection of proximities that are available for
all pairs of n objects.
• It is often represented by an n-by-n table
• Data points X1(1,2), X2(3,5), X3(2,0), X4(4,5)
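A dissimilarity matrix for these four points can be sketched as below. The choice of Euclidean distance as the dissimilarity measure is an assumption, since the matrix from the slide is not included in the text. `math.dist` requires Python 3.8 or later.

```python
from math import dist

# The four data points from the slide: one row per object.
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

# n-by-n matrix where entry [i][j] is the distance between object i and object j.
matrix = [[dist(p, q) for q in points] for p in points]

# Print only the lower triangle, since d(i, j) = d(j, i) and d(i, i) = 0.
for i, row in enumerate(matrix):
    print("  ".join(f"{d:.3f}" for d in row[: i + 1]))
```

The diagonal is always 0 and the matrix is symmetric, so in practice only the lower (or upper) triangle needs to be stored.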
Minkowski Distance
• Data points X1(1,2), X2(3,5), X3(2,0), X4(4,5)
a) Manhattan (City block) Distance
b) Euclidean Distance
c) Supremum Distance
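The three distances are special cases of the Minkowski distance $d(i, j) = \left(\sum_{f=1}^{p} |x_{if} - x_{jf}|^{h}\right)^{1/h}$: h = 1 gives the Manhattan distance, h = 2 the Euclidean distance, and the limit h → ∞ the supremum distance (the largest difference in any single attribute). The sketch below applies them to the points X1 and X2 from the slides; the function names are hypothetical.

```python
def minkowski(p, q, h):
    """Minkowski distance of order h (h >= 1) between two points."""
    return sum(abs(a - b) ** h for a, b in zip(p, q)) ** (1 / h)

def manhattan(p, q):
    """Special case h = 1: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Special case h = 2: straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def supremum(p, q):
    """Limit h -> infinity: largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(p, q))

x1, x2 = (1, 2), (3, 5)

d_manhattan = manhattan(x1, x2)   # |1 - 3| + |2 - 5|
d_euclidean = euclidean(x1, x2)   # sqrt(2^2 + 3^2)
d_supremum = supremum(x1, x2)     # max(2, 3)

print(d_manhattan, round(d_euclidean, 3), d_supremum)
```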
Cosine similarity
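The formula for this slide is not included in the text. The usual definition, assumed here, is $\text{sim}(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$, the cosine of the angle between two vectors. A value close to 1 means the vectors point in nearly the same direction; 0 means they are orthogonal (nothing in common). The function name and the second pair of vectors are hypothetical.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Points X1 and X2 from the earlier slides, treated as vectors.
sim_x1_x2 = cosine_similarity((1, 2), (3, 5))

# Two orthogonal vectors share no direction at all.
sim_orthogonal = cosine_similarity((1, 0), (0, 1))

print(round(sim_x1_x2, 4), sim_orthogonal)
```

Cosine similarity ignores vector length, which makes it a common choice for comparing sparse vectors such as term-frequency vectors of documents.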
Exercise : Minkowski Distance
a) Manhattan (City block) Distance
b) Euclidean Distance
c) Supremum Distance