DWDM-LS2-Fall-24-25
— Chapter 2 —
Chapter 2: Getting to Know Your Data
◼ Data Visualization
◼ Summary
Types of Data Sets
◼ Record
◼ Relational records
◼ Data matrix, e.g., numerical matrix, crosstabs
◼ Document data: text documents, term-frequency vector
◼ Transaction data
◼ Graph and network
◼ World Wide Web
◼ Social or information networks
◼ Molecular structures
◼ Ordered
◼ Video data: sequence of images
◼ Temporal data: time-series
◼ Sequential data: transaction sequences
◼ Genetic sequence data
◼ Spatial, image and multimedia
◼ Spatial data: maps
◼ Image data
◼ Video data
Example: document data as term-frequency vectors
             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
Example: transaction data
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Data Objects
Attributes
◼ Types:
◼ Nominal
◼ Binary
◼ Ordinal
◼ Numeric: quantitative
◼ Interval-scaled
◼ Ratio-scaled
Attribute Types
◼ Nominal: categories, states, or “names of things”
◼ Hair_color = {auburn, black, blond, brown, grey, red, white}
◼ marital status, occupation, ID numbers, zip codes
◼ Binary
◼ Nominal attribute with only 2 states (0 and 1)
◼ Symmetric binary: both outcomes equally important
◼ e.g., gender
◼ Asymmetric binary: outcomes not equally important.
◼ e.g., medical test (positive vs. negative)
◼ Convention: assign 1 to most important outcome (e.g., HIV
positive)
◼ Ordinal
◼ Values have a meaningful order (ranking) but magnitude between
successive values is not known.
◼ Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
◼ Quantity (integer or real-valued)
◼ Interval
◼ Measured on a scale of equal-sized units
◼ Values have order
◼ E.g., temperature in °C or °F, calendar dates
◼ No true zero-point
◼ Ratio
◼ Inherent zero-point
◼ We can speak of one value as being a multiple of another (10 K is
twice as high as 5 K).
◼ e.g., temperature in Kelvin, length, counts, monetary
quantities
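The difference between interval and ratio scales can be illustrated with a small sketch. The temperatures below are illustrative values, and `celsius_to_kelvin` is a helper name introduced here, not something from the slides. Differences survive the change of scale; ratios only make sense on the scale with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Shift from Celsius (arbitrary zero) to Kelvin (true zero)."""
    return celsius + 273.15


warm_c, cool_c = 10.0, 5.0  # illustrative readings in degrees Celsius

# Differences are meaningful on an interval scale and agree across scales.
difference_c = warm_c - cool_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# A ratio computed on the interval scale is not physically meaningful;
# the ratio on the Kelvin (ratio) scale is.
naive_ratio_c = warm_c / cool_c
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(f"difference: {difference_c:.2f} C vs {difference_k:.2f} K")
print(f"ratio on Celsius: {naive_ratio_c:.3f}, ratio on Kelvin: {ratio_k:.3f}")
```

On the Celsius scale the ratio suggests "twice as warm", while on the Kelvin scale the two temperatures differ by less than 2%.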
Discrete vs. Continuous Attributes
◼ Discrete Attribute
◼ Has only a finite or countably infinite set of values
◼ E.g., the set of words in a collection of documents
◼ Sometimes, represented as integer variables
◼ Continuous Attribute
◼ Has real numbers as attribute values
Basic Statistical Descriptions of Data
◼ Motivation
◼ To better understand the data: central tendency, variation and spread
◼ Data dispersion characteristics
◼ median, max, min, quantiles, outliers, variance, etc.
◼ Numerical dimensions correspond to sorted intervals
◼ Data dispersion: analyzed with multiple granularities of precision
◼ Boxplot or quantile analysis on sorted intervals
◼ Dispersion analysis on computed measures
◼ Folding measures into numerical dimensions
◼ Boxplot or quantile analysis on the transformed cube
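As a minimal sketch of the basic measures named above, the following uses Python's standard `statistics` module. The `sample` list is a hypothetical data set chosen for illustration, and the population (rather than sample) variance is an arbitrary choice made here.

```python
from statistics import mean, median, multimode, pstdev, pvariance

# Hypothetical sample (e.g., salaries in thousands); not taken from the slides.
sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Central tendency
sample_mean = mean(sample)
sample_median = median(sample)
sample_modes = multimode(sample)  # every value tied for the highest count

# Dispersion (population versions; use variance/stdev for sample estimates)
sample_range = max(sample) - min(sample)
sample_variance = pvariance(sample)
sample_std = pstdev(sample)

print("mean:", sample_mean, "median:", sample_median, "modes:", sample_modes)
print("range:", sample_range, "variance:", round(sample_variance, 2),
      "std:", round(sample_std, 2))
```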
Symmetric vs. Skewed Data
◼ Median, mean and mode of symmetric, positively skewed, and
negatively skewed data
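A rough way to see the relationship between mean and median under skew is sketched below. Comparing the two is only a rule of thumb (it can fail for multimodal or unusual distributions), and `skew_direction` and the three small data sets are hypothetical names and values introduced for illustration.

```python
from statistics import mean, median


def skew_direction(values, tolerance=1e-9):
    """Heuristic: a mean above the median hints at positive (right) skew,
    a mean below the median hints at negative (left) skew."""
    gap = mean(values) - median(values)
    if abs(gap) <= tolerance:
        return "symmetric"  # by this heuristic only
    return "positive" if gap > 0 else "negative"


print(skew_direction([1, 2, 3, 4, 5]))    # mean equals median
print(skew_direction([1, 2, 2, 3, 10]))   # long tail to the right
print(skew_direction([1, 8, 9, 9, 10]))   # long tail to the left
```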
Boxplot Analysis
◼ Five-number summary of a distribution
◼ Minimum, Q1, Median, Q3, Maximum
◼ Boxplot
◼ Data is represented with a box
◼ The ends of the box are at the first and third quartiles,
i.e., the height of the box is IQR
◼ The median is marked by a line within the box
◼ Whiskers: two lines outside the box extended to
Minimum and Maximum
◼ Outliers: points beyond a specified outlier threshold,
plotted individually
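The five-number summary and an outlier check can be sketched as follows. Two assumptions are made here that the slide does not state: quartiles are computed with the "inclusive" interpolation method of `statistics.quantiles` (other conventions give slightly different values), and the outlier threshold is the common 1.5 × IQR rule. The `sample` data and function names are hypothetical.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, med, q3, ordered[-1]


def find_outliers(values, k=1.5):
    """Values farther than k * IQR below Q1 or above Q3 (assumed threshold)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


sample = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(sample))
print(find_outliers(sample))
```

With this sample, Q1 = 49.25 and Q3 = 64.75, so the IQR is 15.5 and only the value 110 lies beyond the upper threshold of 88.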
Properties of Normal Distribution Curve
Graphic Displays of Basic Statistical Descriptions
Histogram Analysis
◼ Graph display of tabulated frequencies, shown as bars
◼ It shows what proportion of cases fall into each of several categories
◼ Differs from a bar chart in that it is the area of the bar that denotes the value
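The tabulation behind a histogram can be sketched with equal-width bins. This is a minimal illustration under assumptions made here: bins have equal width (so bar height and bar area are proportional), the maximum value is counted in the last bin, and `histogram_counts` and the sample values are hypothetical.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values identical: put everything in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


values = [1, 2, 2, 3, 5, 7, 8, 9, 10]
counts = histogram_counts(values, num_bins=3)
for i, count in enumerate(counts):
    print(f"bin {i}: " + "#" * count)
```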
Histograms Often Tell More than Boxplots
Quantile Plot
◼ Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
◼ Plots quantile information
◼ For data x_i sorted in increasing order, f_i indicates that
approximately 100·f_i% of the data are below or equal to the
value x_i
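The points of a quantile plot can be computed as sketched below. The slide does not give the formula for the fraction; this sketch assumes the common convention f_i = (i - 0.5) / n for the i-th smallest of n values. The function name and data are hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with its fraction f_i = (i - 0.5) / n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


for f, x in quantile_plot_points([3, 1, 2, 5, 4]):
    print(f"about {100 * f:.0f}% of the data are <= {x}")
```

Plotting x against f then gives the quantile plot; every observation appears as one point.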
Scatter plot
Positively and Negatively Correlated Data
Uncorrelated Data
Data Visualization
◼ Why data visualization?
◼ Gain insight into an information space by mapping data onto graphical
primitives
◼ Provide qualitative overview of large data sets
◼ Search for patterns, trends, structure, irregularities, relationships among
data
◼ Help find interesting regions and suitable parameters for further
quantitative analysis
◼ Provide a visual proof of computer representations derived
◼ Categorization of visualization methods:
◼ Pixel-oriented visualization techniques
◼ Geometric projection visualization techniques
◼ Icon-based visualization techniques
◼ Hierarchical visualization techniques
◼ Visualizing complex data and relations
Pixel-Oriented Visualization Techniques
◼ For a data set of m dimensions, create m windows on the screen, one
for each dimension
◼ The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows
◼ The colors of the pixels reflect the corresponding values
Figure panels: (a) income, (b) credit limit, (c) transaction volume, (d) age
Laying Out Pixels in Circle Segments
◼ To save space and show the connections among multiple dimensions,
space filling is often done in a circle segment
Geometric Projection Visualization Techniques
Parallel Coordinates
◼ n equidistant axes which are parallel to one of the screen axes and
correspond to the attributes
◼ The axes are scaled to the [minimum, maximum] range of the
corresponding attribute
◼ Every data item corresponds to a polygonal line which intersects each
of the axes at the point which corresponds to the value for the
attribute
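The construction described above can be sketched by min-max scaling each attribute onto its axis. This is an illustrative sketch, not a plotting routine: `parallel_coordinate_polylines` is a hypothetical name, axis positions are simply 0, 1, 2, ..., heights are normalized to [0, 1], and placing a constant attribute at 0.5 is an arbitrary choice made here.

```python
def parallel_coordinate_polylines(records):
    """For each record, return the polyline vertices (axis index, height),
    where height is the value scaled to its attribute's [min, max] range."""
    columns = list(zip(*records))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    polylines = []
    for record in records:
        vertices = []
        for axis, value in enumerate(record):
            span = highs[axis] - lows[axis]
            height = 0.5 if span == 0 else (value - lows[axis]) / span
            vertices.append((axis, height))
        polylines.append(vertices)
    return polylines


records = [(1, 2), (3, 5), (2, 0), (4, 5)]  # illustrative two-attribute data
polylines = parallel_coordinate_polylines(records)
for record, line in zip(records, polylines):
    print(record, "->", line)
```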
Similarity and Dissimilarity
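◼ Similarity
◼ Numerical measure of how alike two data objects are
◼ Higher when objects are more alike
◼ Dissimilarity (e.g., distance)
◼ Numerical measure of how different two data objects are
◼ Lower when objects are more alike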
Data Matrix and Dissimilarity Matrix
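◼ Data matrix
◼ n data points with p dimensions
◼ Two modes
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
◼ Dissimilarity matrix
◼ n data points, but registers only the distance
◼ A triangular matrix
◼ Single mode
$$
\begin{bmatrix}
0      &        &        &        &   \\
d(2,1) & 0      &        &        &   \\
d(3,1) & d(3,2) & 0      &        &   \\
\vdots & \vdots & \vdots & \ddots &   \\
d(n,1) & d(n,2) & \cdots & \cdots & 0
\end{bmatrix}
$$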
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
point attribute1 attribute2
x1 1 2
x2 3 5
x3 2 0
x4 4 5
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
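The dissimilarity matrix for the four example points can be recomputed with a short sketch. It uses `math.dist` from the standard library (Python 3.8+) for the Euclidean distance. The function name and the choice to store only the lower triangle are assumptions made for illustration.

```python
import math


def euclidean_dissimilarity_matrix(points):
    """Lower-triangular matrix: row i holds d(i, 0), ..., d(i, i)."""
    return [[math.dist(points[i], points[j]) for j in range(i + 1)]
            for i in range(len(points))]


# x1..x4 from the example data matrix
points = [(1, 2), (3, 5), (2, 0), (4, 5)]
matrix = euclidean_dissimilarity_matrix(points)
for name, row in zip(["x1", "x2", "x3", "x4"], matrix):
    print(name, [round(d, 2) for d in row])
```

For instance, d(x3, x1) = sqrt((2 - 1)^2 + (0 - 2)^2) = sqrt(5) ≈ 2.24.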
Distance on Numeric Data: Minkowski Distance
◼ Minkowski distance: A popular distance measure
$$
d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}
$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two
p-dimensional data objects, and h is the order (the
distance so defined is also called L-h norm)
◼ Properties
◼ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
◼ d(i, j) = d(j, i) (Symmetry)
◼ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
◼ A distance that satisfies these properties is a metric
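A minimal sketch of the Minkowski distance, using the usual definition d(i, j) = (Σ_f |x_if − x_jf|^h)^(1/h), together with a brute-force check of the three metric properties on a small point set. The function names and the floating-point tolerance are assumptions made here, and checking a few points illustrates the properties rather than proving them.

```python
from itertools import product


def minkowski(a, b, h):
    """Minkowski distance of order h between equal-length vectors a and b."""
    return sum(abs(x - y) ** h for x, y in zip(a, b)) ** (1.0 / h)


def satisfies_metric_properties(points, h, tolerance=1e-12):
    """Check positive definiteness, symmetry, and the triangle inequality
    for every combination of the given points."""
    for i, j in product(points, repeat=2):
        d = minkowski(i, j, h)
        if (i == j and d != 0) or (i != j and d <= 0):
            return False
        if abs(d - minkowski(j, i, h)) > tolerance:
            return False
    for i, j, k in product(points, repeat=3):
        if minkowski(i, j, h) > minkowski(i, k, h) + minkowski(k, j, h) + tolerance:
            return False
    return True


points = [(1, 2), (3, 5), (2, 0), (4, 5)]
print(minkowski(points[0], points[1], h=1))            # Manhattan
print(round(minkowski(points[0], points[1], h=2), 2))  # Euclidean
print(all(satisfies_metric_properties(points, h) for h in (1, 2, 3)))
```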
Special Cases of Minkowski Distance
◼ h = 1: Manhattan (city block, L1 norm) distance
◼ E.g., the Hamming distance: the number of bits that are
different between two binary vectors
◼ h = 2: Euclidean (L2 norm) distance
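For binary vectors, the Manhattan distance counts the positions at which the two vectors differ, which is the Hamming distance. A small sketch with hypothetical bit vectors:

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


bits_a = [1, 0, 1, 1, 0]
bits_b = [0, 0, 1, 0, 0]
print(manhattan(bits_a, bits_b), hamming(bits_a, bits_b))
```

The two functions agree only because every coordinate is 0 or 1; on general numeric vectors they measure different things.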
Example: Minkowski Distance
point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5
Dissimilarity Matrices
Manhattan (L1)
L1   x1    x2    x3    x4
x1   0
x2   5     0
x3   3     6     0
x4   6     1     7     0
Euclidean (L2)
L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0
Cosine Similarity
◼ A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
Example: Cosine Similarity
◼ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the length of vector d
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5
= 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5
= 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
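The worked example can be reproduced with a short sketch. The function name is hypothetical, and the code assumes neither vector is all zeros (the measure is undefined in that case).

```python
import math


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))
```

Because term frequencies are non-negative, the result lies between 0 (no shared terms) and 1 (same direction, that is, proportional term counts).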
Summary
◼ Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-scaled
◼ Many types of data sets, e.g., numerical, text, graph, Web, image.
◼ Gain insight into the data by:
◼ Basic statistical data description: central tendency, dispersion,
graphical displays
◼ Data visualization: map data onto graphical primitives
◼ Measure data similarity
◼ The above steps are the beginning of data preprocessing.
◼ Many methods have been developed, but this is still an active area of research.