02Data
— Chapter 2 —
◼ Crosstabs
◼ Document data: text documents: term-frequency vector

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0

◼ Transaction data
◼ Graph and network
◼ Binary
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
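
The term-frequency vector listed above for document data can be illustrated with a minimal sketch. The vocabulary, the sample sentence, and the helper name `term_frequency_vector` are assumptions made for this illustration, and splitting on whitespace is only one simple tokenization choice.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text (case-insensitive)."""
    counts = Counter(text.lower().split())
    # A Counter returns 0 for terms that never occur, so absent terms map to 0.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play team score play play"

vector = term_frequency_vector(document, vocabulary)
print(vector)
```

Each position of the vector corresponds to one vocabulary term, so documents of any length are mapped to vectors of the same dimension.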
Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV
positive)
◼ Ordinal
◼ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
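
As a small sketch of what "meaningful order" means for an ordinal attribute, the Size values above can be mapped to ranks. The names `SIZE_ORDER` and `RANK` are hypothetical; the ranks support comparison and sorting, but the differences between ranks carry no magnitude.

```python
# Ordinal scale taken from the Size example; the list order encodes the ranking.
SIZE_ORDER = ["small", "medium", "large"]
RANK = {value: position for position, value in enumerate(SIZE_ORDER)}

def ordinal_less_than(a, b):
    """Return True if ordinal value a ranks below ordinal value b."""
    return RANK[a] < RANK[b]

sizes = ["large", "small", "medium", "small"]
sorted_sizes = sorted(sizes, key=RANK.__getitem__)

print(sorted_sizes)
print(ordinal_less_than("small", "large"))
```

A nominal attribute such as Hair_color would support only equality tests, since no ordering of its values is meaningful.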
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ No true zero-point
◼ Ratio
◼ We can speak of one value as being a multiple (or ratio) of
another (10 K is twice as high as 5 K).
◼ E.g., temperature in Kelvin, length, counts,
monetary quantities
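
The difference between interval and ratio scales can be checked with a short worked example. This is a sketch with assumed sample temperatures: dividing two Celsius readings gives a number with no physical meaning, while the same readings on the Kelvin scale give a meaningful ratio. Differences, by contrast, agree on both scales.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading to Kelvin (ratio scale with a true zero)."""
    return celsius + 273.15

# Assumed sample readings, chosen only for illustration.
warm_c, cool_c = 10.0, 5.0

# Interval scale: the ratio looks like "twice as hot" but is not meaningful.
naive_ratio = warm_c / cool_c

# Ratio scale: the ratio of the same two temperatures in Kelvin.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree (up to rounding).
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

print(naive_ratio, round(kelvin_ratio, 3), difference_c, round(difference_k, 2))
```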
Discrete vs. Continuous Attributes
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., the set of words in a collection of documents
◼ Sometimes represented as integer variables
Measuring the Central Tendency
◼ Mean (algebraic measure) (sample vs. population):
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ (sample), $\mu = \frac{\sum x}{N}$ (population)
Note: n is sample size and N is population size.
◼ Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
◼ Median:
◼ Middle value if odd number of values, or average of
the middle two values otherwise
◼ Estimated by interpolation (for grouped data):
$\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
◼ Mode
◼ Value that occurs most frequently in the data
◼ Unimodal, bimodal, trimodal
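
The measures above can be tried out with a minimal sketch using Python's standard `statistics` module. The data values, weights, and class intervals are made up for illustration, and the helpers `weighted_mean` and `grouped_median` are hypothetical names. For the grouped median, the sketch reads L1 as the lower boundary of the median interval, (Σ freq)l as the total frequency of all intervals below it, and width as the width of the median interval.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def grouped_median(intervals):
    """Interpolated median for grouped data given (lower, upper, freq) tuples."""
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")
    below = 0  # total frequency of intervals below the current one
    for lower, upper, freq in intervals:
        if below + freq >= n / 2:  # this is the median interval
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq

# Illustrative data, not taken from the slides.
values = [2, 3, 3, 5, 7, 10]
mean_value = statistics.mean(values)
median_value = statistics.median(values)
modes = statistics.multimode(values)
wmean = weighted_mean(values, [1, 1, 1, 1, 1, 5])
gmedian = grouped_median([(0, 10, 5), (10, 20, 8), (20, 30, 7)])

print(mean_value, median_value, modes, wmean, gmedian)
```

In the grouped example n = 20, so n/2 = 10 falls in the interval 10 to 20, giving 10 + (10 - 5) / 8 × 10 = 16.25.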
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively and
negatively skewed data
Boxplot Analysis
Properties of Normal Distribution Curve
Histogram Analysis
◼ Histogram: Graph display of tabulated frequencies, shown as bars
[Figure: example histogram; vertical axis ticks from 0 to 40, horizontal axis ticks from 10000 to 90000]
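
The tabulated frequencies behind a histogram can be computed with a short sketch. It assumes equal-width bins over a fixed range; the function name `histogram_counts` and the sample values are hypothetical.

```python
def histogram_counts(values, low, high, bins):
    """Count values per equal-width bin over [low, high]; out-of-range values are skipped."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        if low <= value <= high:
            # The top edge belongs to the last bin rather than a bin of its own.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts

# Illustrative values only.
prices = [12000, 18000, 25000, 31000, 33000, 47000, 52000, 58000, 61000, 88000]
counts = histogram_counts(prices, 0, 100000, 5)

# Render each bin as a text bar, one '#' per observation.
for i, count in enumerate(counts):
    print(f"{i * 20000:>6} to {(i + 1) * 20000:>6}: {'#' * count}")
```

Each bar's height is the count of observations falling in that bin, which is what the graphical display draws.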
Histograms Often Tell More than Boxplots
Scatter plot
◼ Provides a first look at bivariate data to see clusters of
points, outliers, etc.
◼ Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
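
As a numeric companion to the scatter plot, the following sketch pairs two attributes into coordinates and computes the Pearson correlation coefficient, one common way (an assumption here, not stated on the slide) to quantify whether the plotted points trend upward or downward. The data and the helper name `pearson_r` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)

# Illustrative bivariate data.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

# Each pair of values becomes one point in the plane.
points = list(zip(xs, rising))

r_rising = pearson_r(xs, rising)
r_falling = pearson_r(xs, falling)
print(points, r_rising, r_falling)
```

A coefficient near +1 corresponds to points rising from left to right, near -1 to points falling, and near 0 to no linear trend.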
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data onto graphical
primitives
◼ Provide qualitative overview of large data sets
◼ Search for patterns, trends, structure, irregularities, relationships among
data
◼ Help find interesting regions and suitable parameters for further
quantitative analysis
◼ Provide a visual proof of computer representations derived
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations
Similarity and Dissimilarity
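◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Higher when objects are more alike
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike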
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5
Dissimilarity Matrix
(with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
Example: Minkowski Distance
Dissimilarity Matrices
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0
Euclidean (L2)
L2    x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
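
The two matrices can be reproduced with a minimal sketch of the Minkowski distance, where order h = 1 gives the Manhattan distance and h = 2 the Euclidean distance. The function names are hypothetical; the four points are the ones from the data matrix, and results are rounded to two decimals to match the tables.

```python
def minkowski(a, b, h):
    """Minkowski distance of order h between two equal-length points."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1 / h)

def dissimilarity_matrix(points, h):
    """Lower-triangular dissimilarity matrix (rows include the diagonal)."""
    names = list(points)
    return [
        [round(minkowski(points[row], points[col], h), 2) for col in names[: i + 1]]
        for i, row in enumerate(names)
    ]

# The four points from the data matrix.
points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

manhattan = dissimilarity_matrix(points, 1)
euclidean = dissimilarity_matrix(points, 2)

for name, row in zip(points, euclidean):
    print(name, row)
```

Only the lower triangle is stored because the distance is symmetric and every object has distance 0 to itself.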
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Gain insight into the data by:
◼ Basic statistical data description: central tendency, dispersion,
graphical displays
◼ Data visualization: map data onto graphical primitives
◼ Measure data similarity
◼ The steps above are the beginning of data preprocessing.
◼ Many methods have been developed, but this is still an active area of research.