3 - Data Fundamentals for BI - Part 2
BI in a Business
Part 2
1
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
2
Data Reduction Strategies
◼ Dimensionality reduction
◼ Wavelet transforms
◼ Principal Component Analysis (PCA)
◼ Attribute subset selection, attribute creation
◼ Numerosity reduction
◼ Regression and log-linear models
◼ Histograms, clustering, sampling
◼ Data cube aggregation
◼ Data compression
3
Data reduction strategies:
Principal Component Analysis (PCA)
◼ Simplifying data by finding a projection that captures the largest
amount of variation in data
◼ The original data are projected onto a much smaller space,
resulting in dimensionality reduction (e.g., instead of two features,
PCA might leave just one).
◼ We find the eigenvectors of the covariance matrix, and these
eigenvectors define the new space
5
Principal Component Analysis (Example)
◼ You want to reduce two-dimensional customer data (two features, X1 and X2, per customer)
to a single "customer profile" score (k = 1) for easier segmentation.
6
Principal Component Analysis (Example)
◼ Normalize Data: standardize the data by subtracting the mean and dividing
by the standard deviation for each feature. This makes the features
comparable.
Customer X1 X2
A -1.5 1.5
B -0.5 0.5
C 0.5 -0.5
D 1.5 -1.5
7
Principal Component Analysis (Example)
◼ Let's assume the covariance matrix of the normalized data gives
◼ Eigenvalue 1: 2.5, with eigenvector (0.71, -0.71)
◼ Eigenvalue 2: 0 (this direction carries no variance, so it can be dropped)
◼ Reduce Data Size: Project each normalized data point onto the first principal
component (Eigenvector 1) by taking its dot product with the eigenvector.
This gives the "customer profile" score shown in the table below (a short code sketch follows the table).
(e.g. Customer A (-1.5, 1.5): Score = (-1.5 * 0.71) + (1.5 * -0.71) = -1.065 + (-1.065) ≈ -2.12)
Customer   X1     X2     Customer Profile Score
A         -1.5    1.5    -2.12
B         -0.5    0.5    -0.71
C          0.5   -0.5     0.71
D          1.5   -1.5     2.12
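A minimal NumPy sketch of this projection step, reusing the standardized table above (the eigenvector sign returned by the library is arbitrary, so the scores may come out with flipped signs):

import numpy as np

# Standardized customer data (rows = customers A-D, columns = X1, X2)
X = np.array([[-1.5,  1.5],
              [-0.5,  0.5],
              [ 0.5, -0.5],
              [ 1.5, -1.5]])

cov = np.cov(X, rowvar=False, bias=True)   # population covariance -> eigenvalues 2.5 and 0
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
pc1 = eigvecs[:, np.argmax(eigvals)]       # first principal component, ~(0.71, -0.71)
scores = X @ pc1                           # 1-D profile scores, ~[-2.12, -0.71, 0.71, 2.12]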
8
Data reduction strategies:
Attribute Subset Selection
◼ Reduce the data set size by removing redundant attributes (which duplicate information
contained in other attributes) and irrelevant attributes (which contain no information
useful for the task at hand)
10
Heuristic Search in Attribute Selection
11
Heuristic Search in Attribute Selection
12
Data reduction strategies:
Attribute Creation (Feature Generation)
◼ Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
◼ Three general methodologies
◼ Attribute extraction
◼ Domain-specific
• E.g. original features: pixel values of an image
• Extracted features: edges, corners, textures, shapes, …
◼ Mapping data to a new space
• E.g., Fourier or wavelet transformation
◼ Attribute construction
• Combining existing attributes, e.g., area = height × width (see the sketch below)
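A tiny pandas sketch of attribute construction; the housing table and column names are made up for illustration:

import pandas as pd

houses = pd.DataFrame({"price": [300_000, 450_000], "area_m2": [100, 120]})
houses["price_per_m2"] = houses["price"] / houses["area_m2"]   # new attribute built from existing ones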
13
Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of data
representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
• Ex.: data on house sizes (x) and prices (y). Assume a linear relationship: Y = w X + b
• Instead of storing every house's size and price, store only the slope (w) and
intercept (b) of the line; these two parameters can then be used to reconstruct
(approximately) the price of a house given its size (see the sketch after this list)
◼ Non-parametric methods
◼ Do not assume models
◼ Use techniques such as histograms, clustering, or sampling to represent the
data in a compressed form
◼ Major families: histograms, clustering, sampling, …
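A minimal NumPy sketch of the parametric idea: fit the line once, keep only w and b, and discard the raw pairs. The house sizes and prices below are made-up illustrative values:

import numpy as np

sizes  = np.array([50, 80, 100, 120, 150], dtype=float)               # house sizes (m^2)
prices = np.array([150_000, 240_000, 290_000, 360_000, 450_000], dtype=float)

w, b = np.polyfit(sizes, prices, deg=1)   # store only these two parameters
approx = w * sizes + b                    # reconstruct (approximately) the discarded prices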
15
Numerosity Reduction
Parametric Data Reduction
◼ Regression analysis: a collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable y
(also called response variable or measurement) and of one or more
independent variables x (aka explanatory variables or predictors)
◼ The parameters are estimated so as to give a "best fit" of the data
◼ Used for prediction (including forecasting of time-series data) and
modeling of causal relationships
[Figure: a fitted regression line y = x + 1; Y1' is the value the line predicts for a new point at X1]
16
Numerosity Reduction
Parametric Data Reduction
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Predicts a continuous value based on a single independent variable
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Allows a response variable Y to be modeled as a linear function of a
multidimensional feature vector (multiple independent variables)
◼ Fits a hyperplane (in higher dimensions) instead of a line (see the sketch after this list)
◼ Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Models the probabilities of different combinations of categorical variables.
• E.g. Dimension 1: Did the customer buy coffee? (Yes/No)
• Dimension 2: Did the customer buy milk? (Yes/No)
• A log-linear model can estimate the probability of a customer buying coffee and milk
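A small NumPy least-squares sketch of multiple regression; the rows are made up and generated from y = 1 + 2·x1 + x2, so the recovered coefficients are easy to check:

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])   # two predictors x1, x2
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])                                   # response

A = np.column_stack([np.ones(len(X)), X])                # column of ones for the intercept b0
(b0, b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)     # -> approximately 1.0, 2.0, 1.0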
17
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis
◼ Divide data into buckets and store the average (or sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: each bucket covers the same range of values (see the sketch below)
◼ Equal-frequency (equi-depth): each bucket contains roughly the same number of values
[Figure: example histogram]
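A minimal NumPy sketch of equal-width bucketing, reusing the price values that appear on the binning slide later in this deck; only the bucket edges and per-bucket summaries would be kept:

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

counts, edges = np.histogram(prices, bins=3)              # 3 equal-width buckets over [4, 34]
bucket = np.digitize(prices, edges[1:-1])                 # bucket index (0, 1 or 2) for every value
means = [prices[bucket == b].mean() for b in range(3)]    # one average per bucket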
18
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis
19
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis
20
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered but not if data is
“smeared” (doesn't have clear groupings)
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms. The choice depends on the specific data and the
goal of the analysis.
◼ Only store the cluster representations (such as the centroid and diameter)
instead of all individual data points (see the sketch after this list)
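A minimal scikit-learn sketch on made-up 2-D points: after clustering, only the two centroids (plus, if needed, the point-to-cluster labels) have to be stored:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],     # one tight group ...
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])    # ... and another, far away

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
centroids = km.cluster_centers_   # 2 points kept instead of all 6
labels = km.labels_               # which cluster each original point belongs to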
21
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Sampling
◼ Analyzing the entire data set might be computationally very expensive;
obtaining a small sample s to represent the whole data set N reduces the
processing time
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor performance in the
presence of skew (unevenly distributed)
◼ Develop adaptive sampling methods, e.g., stratified sampling
22
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Types of Sampling:
◼ Simple random sampling
◼ There is an equal probability of selecting any particular item
◼ Sampling without replacement
◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
◼ A selected object is not removed from the population
◼ Stratified sampling
◼ Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
◼ Divide your data into groups (strata) based on some characteristic.
Take a random sample from each group
◼ Used in conjunction with skewed data (see the sketch after this list)
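A short pandas sketch of the three sampling types on a made-up, skewed customer table (90% retail, 10% enterprise):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "segment": ["retail"] * 90 + ["enterprise"] * 10,   # skewed strata
    "spend": rng.normal(100, 20, size=100),
})

srs_without = df.sample(n=10, replace=False, random_state=0)            # simple random, without replacement
srs_with    = df.sample(n=10, replace=True,  random_state=0)            # simple random, with replacement
stratified  = df.groupby("segment").sample(frac=0.1, random_state=0)    # ~10% from each stratum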
23
Sampling: With or without Replacement
[Figure: drawing samples from the raw data with and without replacement]
24
Numerosity Reduction:
Data Cube Aggregation
◼ The lowest level of a data cube (the base cuboid) is the most detailed level of the
cube, e.g. [Product A, Region East, January 2023: $1000 sales], [Product B, …]
◼ It holds the aggregated data for an individual entity of interest
◼ Aggregating up to a coarser level (e.g., totals per product and region instead of per
month) reduces the volume of data to store and query (see the sketch below)
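A minimal pandas sketch of rolling detailed fact rows up to a coarser level; the table and figures are illustrative:

import pandas as pd

sales = pd.DataFrame({                     # base-cuboid-like detail: product x region x month
    "product": ["A", "A", "B", "B"],
    "region":  ["East", "East", "West", "West"],
    "month":   ["2023-01", "2023-02", "2023-01", "2023-02"],
    "amount":  [1000, 1200, 800, 900],
})

# Roll up to product x region: fewer rows need to be stored and queried
by_product_region = sales.groupby(["product", "region"], as_index=False)["amount"].sum()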
26
Data Reduction Strategies:
Data Compression
◼ Transformations are applied to obtain a reduced ("compressed") representation of the
original data; compression is lossless if the original data can be reconstructed exactly,
and lossy if only an approximation can be recovered
28
Data Reduction Strategies:
Data Compression
[Figure: original data vs. its compressed form (lossless) and its approximated form (lossy compression)]
29
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
30
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
31
Data Transformation:
Normalization
◼ Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is
mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
◼ 0.716 is within the new range [0, 1]
◼ Z-score normalization (μA: mean, σA: standard deviation of attribute A):
v' = (v - μA) / σA
◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
◼ Divide each value by a power of 10 (e.g., 10, 100, 1,000) to bring the values into a
range such as [-1, 1] (see the sketch below)
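A minimal Python sketch reproducing the three normalizations on the $73,600 income example; the decimal-scaling exponent j = 5 is assumed from a maximum income of $98,000:

income = 73_600
min_a, max_a = 12_000, 98_000

minmax  = (income - min_a) / (max_a - min_a) * (1.0 - 0.0) + 0.0   # min-max to [0, 1] -> ~0.716
zscore  = (income - 54_000) / 16_000                               # z-score            -> 1.225
decimal = income / 10 ** 5                                         # decimal scaling    -> 0.736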
32
Data Transformation:
Discretization
◼ There are three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession
◼ Ordinal—values from an ordered set, e.g., military or academic rank
◼ Numeric—real numbers, e.g., age, income
◼ Discretization methods:
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis
◼ unsupervised, top-down split or bottom-up merge
◼ Decision-tree analysis
◼ supervised, top-down split
34
Discretization by Binning Methods
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
(a code sketch of the steps below follows this list)
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
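A minimal Python sketch of the equal-frequency binning and the two smoothing variants shown above:

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]     # equal-frequency bins of 4 values each

# Smoothing by bin means: every value in a bin is replaced by the (rounded) bin mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]   # [[9]*4, [23]*4, [29]*4]

# Smoothing by bin boundaries: each value snaps to whichever of the bin's min/max is closer
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]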
35
Discretization by Classification &
Correlation Analysis
◼ Classification (e.g., decision tree analysis)
◼ Supervised: it uses class labels (the target variable you are trying to predict)
to guide the discretization
◼ Focuses on finding split points that improve classification accuracy (see the sketch after this list)
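A small scikit-learn sketch of the idea on a made-up age/buys example; the tree's internal split thresholds become the supervised cut points for the intervals:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

age  = np.array([[18], [22], [25], [33], [41], [47], [55], [62]])   # attribute to discretize
buys = np.array([0, 0, 0, 1, 1, 1, 0, 0])                           # class label guiding the splits

tree = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0).fit(age, buys)
cuts = sorted(t for t in tree.tree_.threshold if t != -2)   # -2 marks leaves; 2 cut points -> 3 intervals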
36
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
37
Data Transformation:
Concept Hierarchy Generation
◼ Concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is
usually associated with each dimension in a data warehouse, providing different
"levels of detail"
◼ Concept hierarchies facilitate drilling and rolling in data warehouses to view
data at multiple levels of granularity.
◼ Drilling Down: Moving from a higher level to a lower level in the hierarchy. (e.g.
see sales by country then further by city)
◼ Rolling Up: Moving from a lower level to a higher level. (e.g. see sales by city then
further by country)
◼ Concept hierarchy formation: recursively reduce the data by collecting and
replacing low-level concepts (such as numeric values for age) with higher-level
concepts (such as youth, adult, or senior) (see the sketch after this list)
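A minimal pandas sketch of that replacement; the exact age bands for youth/adult/senior are assumptions chosen only for illustration:

import pandas as pd

ages = pd.Series([13, 19, 25, 42, 67, 71])
# Replace low-level numeric ages with higher-level concepts (assumed bands: 0-17, 18-64, 65+)
levels = pd.cut(ages, bins=[0, 17, 64, 120], labels=["youth", "adult", "senior"])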
38
Data Transformation:
Concept Hierarchy Generation
39
Concept Hierarchy Generation
for Nominal Data
40
Automatic Concept Hierarchy Generation
41
Automatic Concept Hierarchy Generation
◼ The system automatically creates the hierarchy street < city < state < country
because "street" has the most distinct values and "country" has the fewest
(see the sketch below)
42
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation
43