3-Data Fundamentals for BI- Part2

The document outlines major tasks in data preprocessing for business intelligence, including data cleaning, integration, reduction, transformation, and discretization. It emphasizes data reduction strategies such as dimensionality reduction through Principal Component Analysis (PCA), numerosity reduction, and attribute selection. Additionally, it discusses heuristic search methods for attribute selection and various non-parametric techniques like clustering and sampling for effective data representation.

Data Fundamentals for BI in a Business
Part 2

1
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

2
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

3
Data reduction strategies:
Principal Component Analysis (PCA)
◼ Simplifying data by finding a projection that captures the largest
amount of variation in data
◼ The original data are projected onto a much smaller space,
resulting in dimensionality reduction (e.g., instead of two features, with
PCA we might have just one).
◼ We find the eigenvectors of the covariance matrix, and these
eigenvectors define the new space
[Figure: data points in the (x1, x2) plane. The vector e is the principal component, the direction of the greatest variation; the red dashed line represents the approximate boundary within which most of the data points lie.]
4
Principal Component Analysis (Steps)

◼ Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
1) Normalize input data: Each attribute falls within the same range
2) Compute Principal Components: PCA calculates k orthogonal (perpendicular)
vectors that point in the directions of the greatest variance in data
3) Represent Data as Linear Combinations: Each original data point can be
reconstructed (at least approximately) using these principal components.
4) The principal components are sorted in order of decreasing “significance” or
strength
5) Reduce Data Size: Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
◼ Works for numeric data only

5
Principal Component Analysis (Example)

◼ A marketing manager wants to segment customers based on their purchasing


behavior. There are two features for each customer:
• X1: Amount spent on electronics per year
• X2: Number of clothing items purchased per year
◼ Four customers (N=4):
Customer X1 X2
A 2 10
B 4 8
C 6 6
D 8 4

◼ You want to reduce this 2-dimensional data to a single "customer profile" score
(k=1) for easier segmentation.

6
Principal Component Analysis (Example)
◼ Normalize Data: standardize the data by subtracting the mean and dividing
by the standard deviation for each feature. This makes the features
comparable.
Customer X1 X2
A -1.5 1.5
B -0.5 0.5
C 0.5 -0.5
D 1.5 -1.5

◼ Compute Covariance Matrix: Calculate the covariance between X1 and X2.


This measures how the two features vary together. Then, find Eigenvectors
and Eigenvalues of the covariance matrix. The eigenvectors represent the
principal components, and the eigenvalues represent their "strength". (This
requires linear algebra; the detailed calculation is skipped here, but many
software tools can do it.)

7
Principal Component Analysis (Example)
◼ Let's assume

◼ Eigenvalue 1: 2.5
◼ Eigenvalue 2: 0

▪ The principal components are sorted: Eigenvalue 1 is much larger than


Eigenvalue 2, so Eigenvector 1 is our most significant principal component.
Assume Eigenvector 1 is [0.7071, -0.7071].

▪ Reduce Data Size: Project each normalized data point onto the first principal
component (Eigenvector 1). This is done by taking the dot product of each
data point with the eigenvector. This gives us the "customer profile" score.
(e.g. Customer A (-1.5, 1.5): Score = (-1.5 × 0.7071) + (1.5 × -0.7071)
= -1.06 + (-1.06) ≈ -2.12; a code sketch follows this slide)
Customer    X1      X2     Customer Profile Score
A          -1.5     1.5    -2.12
B          -0.5     0.5    -0.71
C           0.5    -0.5     0.71
D           1.5    -1.5     2.12
8
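A minimal NumPy sketch of the steps in the example above (NumPy is an assumed tool choice; the slides do not prescribe one). It only centers the data rather than applying the slide's exact scaling, so the scores differ from the table by a constant factor, but the ordering of customers is the same.

```python
# Sketch of the PCA "customer profile" example; NumPy is an assumed tool choice.
import numpy as np

X = np.array([[2, 10],   # Customer A: [electronics spend, clothing items]
              [4,  8],   # Customer B
              [6,  6],   # Customer C
              [8,  4]])  # Customer D

# 1) Center the data (the slide also rescales it; centering is the essential step)
Xc = X - X.mean(axis=0)

# 2) Covariance matrix and its eigen-decomposition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3) Sort components by decreasing eigenvalue ("significance")
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Keep only the strongest component (k = 1) and project the data onto it
pc1 = eigvecs[:, 0]
scores = Xc @ pc1   # one "customer profile" score per customer

print("eigenvalues:", eigvals)
print("first principal component:", pc1)
print("profile scores:", scores)
```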
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

9
Data reduction strategies:
Attribute Subset Selection

◼ Another way to reduce dimensionality of data

◼ Two Main Types of Unnecessary Attributes:


◼ Redundant attributes: Duplicate much or all of the information
contained in one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax paid (the tax is
directly calculated from the purchase price)
◼ Irrelevant attributes: Contain no information that is useful for
the data mining task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting students' GPA

10
Heuristic Search in Attribute Selection

◼ There are 2^d possible attribute combinations of d attributes

◼ e.g. Assume 10 features (d = 10) to predict price. There are 2^10 = 1024
possible combinations of these features. Finding the absolute best
combination by trying every single one is computationally very
expensive.

◼ Heuristic search methods provide practical ways to select relevant


attributes when the number of possible combinations is too large. They don't
guarantee the absolute best solution, but they aim to find a very good one
in a reasonable amount of time.

11
Heuristic Search in Attribute Selection

◼ Some heuristic attribute selection methods:

1) Best single attribute under the attribute independence


assumption: choose by significance tests (Pick only the single
best attribute that improves the model)

2) Best step-wise feature selection:


◼ Start with an empty set of attributes

◼ The best single attribute is picked first

◼ Then the next best attribute conditioned on the first, and so on (see the sketch after this slide)

3) Step-wise attribute elimination: Start with all attributes,


repeatedly eliminate the worst attribute

12
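A rough sketch of best step-wise (forward) feature selection, as referenced above. It assumes scikit-learn is available and uses cross-validated R² of a linear model as the scoring criterion; the slides leave the criterion open, and the helper name forward_selection is just for illustration.

```python
# Forward step-wise attribute selection sketch; scikit-learn is an assumed choice.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None):
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each candidate attribute added to the current subset
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:      # no improvement -> stop
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Toy data: only the first two of five attributes carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
print("selected attributes:", forward_selection(X, y))
```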
Data reduction strategies:
Attribute Creation (Feature Generation)
◼ Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
◼ Three general methodologies
◼ Attribute extraction
◼ Domain-specific
• E.g. Original features: Pixel values of an image
• Extracted features: Edges, corners, textures, or shapes, ….

◼ Mapping data to new space (see: data reduction)


◼ E.g. wavelet transformation

◼ Attribute construction (Combining features)


• E.g. Original features: User ratings for individual products.
• Constructed features: Average rating for each user.

13
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

14
Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of data
representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
• Ex.: data on house sizes (x) and prices (y). Assume a linear
relationship: Y = w X + b
• Instead of storing every house's size and price, only store the slope (w)
and intercept (b) of the line, can then use these parameters to
reconstruct (approximately) the price of a house given its size.
◼ Non-parametric methods
◼ Do not assume models
◼ use techniques like histograms, clustering, or sampling to represent the
data in a compressed form.
◼ Major families: histograms, clustering, sampling, …

15
Numerosity Reduction
Parametric Data Reduction
◼ Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (y)
(also called response variable or measurement) and of one or more
independent variables (x) (aka. explanatory variables or predictors)
◼ The parameters are estimated so as to give a "best fit" of the data
◼ Used for prediction (including forecasting of time-series data), and
modeling of causal relationships
[Figure: a fitted line y = x + 1 through data points, with an observed value Y1 and its fitted value Y1' at X1.]

16
Numerosity Reduction
Parametric Data Reduction
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Predicts a continuous value based on a single independent variable
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Allows a response variable Y to be modeled as a linear function of
multidimensional feature vector (multiple independent variables)
◼ a hyperplane (in higher dimensions) instead of a line.
◼ Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Models the probabilities of different combinations of categorical variables.
• E.g. Dimension 1: Did the customer buy coffee? (Yes/No)
• Dimension 2: Did the customer buy milk? (Yes/No)
• A log-linear model can estimate the probability of a customer buying coffee and milk

17
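A small sketch of the parametric idea above: fit Y = wX + b to house sizes and prices, then keep only the two coefficients instead of the raw data. The sizes and prices below are made-up illustration values, not data from the slides.

```python
# Parametric numerosity reduction: store only the regression coefficients.
import numpy as np

sizes  = np.array([ 50,  80, 100, 120, 150, 200], dtype=float)   # m^2 (toy values)
prices = np.array([110, 165, 205, 245, 300, 395], dtype=float)   # k$  (toy values)

# Least-squares fit of a line: prices ~ w * sizes + b
w, b = np.polyfit(sizes, prices, deg=1)
print(f"stored parameters: w={w:.2f}, b={b:.2f}")

# The raw data can now be discarded; prices are reconstructed approximately
approx = w * sizes + b
print("reconstruction error:", np.round(prices - approx, 1))
```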
Numerosity Reduction:
Non-Parametric Data Reduction

◼ Histogram Analysis
◼ Divide data into buckets and store the average (or sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: Each bucket covers the same range of values.
◼ Equal-frequency: Each bucket contains (approximately) the same number of data points.
[Figure: example histogram with counts from 0 to 40 over values from 10,000 to 90,000.]

18
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis

◼ Equal-width (distance) partitioning


◼ Divides the range into N intervals of equal size
◼ if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B − A)/N.
• Ex: customer ages ranging from 10 to 70 (A=10, B=70), You want 3
intervals (N=3).
• Interval width: W = (70 - 10) / 3 = 20
• Intervals: 10-30, 31-50, 51-70
◼ Simple to understand and implement, but outliers (very large or very
small values) may dominate the presentation, so skewed data is not
handled well

19
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis

◼ Equal-depth (frequency) partitioning


◼ Divides the range into N intervals, each containing
approximately the same number of samples
• Ex: You have 100 customer ages, and you want 5 intervals.
Ideally, each interval should have 100 / 5 = 20 customers.
• Sort the ages and then create the intervals so that roughly 20
ages fall into each.
• So, the bucket widths might be different. For instance, the first
bucket might be 0-15 years, the second 16-22 years, the third
23-30 years, and so on.
◼ Good data scaling

20
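A brief sketch of both partitioning rules using pandas (pandas.cut and pandas.qcut are assumed tool choices); the customer ages are randomly generated for illustration.

```python
# Equal-width vs. equal-frequency partitioning sketch with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(10, 71, size=100))   # customer ages in [10, 70]

# Equal-width: 3 intervals of identical width, W = (70 - 10) / 3 = 20
equal_width = pd.cut(ages, bins=3)
print(equal_width.value_counts().sort_index())     # counts may be very uneven

# Equal-frequency (equal-depth): 5 intervals with roughly 20 ages each
equal_depth = pd.qcut(ages, q=5)
print(equal_depth.value_counts().sort_index())     # widths differ, counts are ~equal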
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered but not if data is
“smeared” (doesn't have clear groupings)
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms. The choice depends on the specific data and the
goal of the analysis.
◼ Only store cluster representations (like the centroid and
diameter) instead of all individual data points.

21
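A sketch of clustering-based reduction, assuming scikit-learn's KMeans as the clustering algorithm (the slide leaves the algorithm open): 600 synthetic points are replaced by three centroid/diameter pairs.

```python
# Clustering-based numerosity reduction sketch; scikit-learn is an assumed choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three well-separated blobs of 2-D points (illustrative data)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Keep only one centroid and "diameter" per cluster instead of 600 points
for k in range(3):
    members = data[km.labels_ == k]
    centroid = km.cluster_centers_[k]
    diameter = 2 * np.max(np.linalg.norm(members - centroid, axis=1))
    print(f"cluster {k}: centroid={np.round(centroid, 2)}, diameter={diameter:.2f}")
```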
Numerosity Reduction:
Non-Parametric Data Reduction

◼ Sampling
◼ Analyzing the entire dataset might be computationally very expensive,
obtaining a small sample s to represent the whole data set N reduces the
processing time.
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor performance in the
presence of skew (unevenly distributed)
◼ Develop adaptive sampling methods, e.g., stratified sampling

22
Numerosity Reduction:
Non-Parametric Data

◼ Types of Sampling:
◼ Simple random sampling
◼ There is an equal probability of selecting any particular item
◼ Sampling without replacement
◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
◼ A selected object is not removed from the population
◼ Stratified sampling
◼ Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
◼ Divide your data into groups (strata) based on some characteristic.
Take a random sample from each group
◼ Used in conjunction with skewed data

23
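A short pandas sketch of the sampling variants listed above; the data frame, column names, and the 5% sampling fraction are illustrative assumptions.

```python
# Simple random sampling (with/without replacement) and stratified sampling sketch.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "region": rng.choice(["East", "West", "North"], size=1000, p=[0.7, 0.2, 0.1]),
    "sales": rng.exponential(scale=100, size=1000),
})

srs_without = df.sample(n=50, replace=False, random_state=0)  # without replacement
srs_with    = df.sample(n=50, replace=True,  random_state=0)  # with replacement

# Stratified: sample ~5% from each region so skewed groups stay represented
stratified = df.groupby("region").sample(frac=0.05, random_state=0)
print(stratified["region"].value_counts())
```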
Sampling: With or without Replacement

[Figure: drawing a sample from the raw data, with or without replacement.]
24
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

25
Numerosity Reduction:
Data Cube Aggregation

◼ The lowest level of a data cube (base cuboid) is the most detailed
level of the data cube, e.g. [Product A, Region East, January 2023:
$1000 sales], [Product B, ….]
◼ It holds the aggregated data for an individual entity of interest

◼ Multiple levels of aggregation in data cubes (e.g. Aggregating by


Region and Time for example: sum of sales for all products in East
region in January)
◼ Further reduce the size of data to deal with
◼ Use the smallest representation which is enough to solve the task
and is sufficient to answer your query

26
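A minimal sketch of rolling a detailed base cuboid up to coarser levels with pandas groupby; the sales records are invented for illustration and the "cube" here is simply a flat table.

```python
# Data cube aggregation sketch: base cuboid -> higher-level aggregates.
import pandas as pd

base = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A", "B"],
    "region":  ["East", "West", "East", "West", "East", "East"],
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "sales":   [1000, 800, 600, 700, 1100, 650],
})

# Base cuboid: most detailed level (one row per product/region/month)
print(base)

# Aggregate away "product": sales per region and month
print(base.groupby(["region", "month"], as_index=False)["sales"].sum())

# Aggregate further: total sales per region only
print(base.groupby("region", as_index=False)["sales"].sum())
```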
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

27
Data Reduction Strategies:
Data Compression

◼ The main goal of data compression is to represent the same information


using fewer bits. This saves storage space and speeds up data transmission.
◼ String compression
◼ Typically lossless (decompressed to get the exact original text)
◼ Audio/video compression
◼ Typically lossy compression (Some information is lost during compression), with
progressive refinement
◼ Time-sequence data (not audio)
◼ Includes sensor readings, stock prices, or weather data
◼ Typically short and varies slowly with time. Can be lossy or lossless.

◼ Dimensionality and numerosity reduction may also be considered as forms of


data compression

28
Data Reduction Strategies:
Data Compression

[Figure: the original data is reduced to compressed data; lossless compression reconstructs the original data exactly, while lossy compression yields only an approximation of the original data.]

29
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

30
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
31
Data Transformation:
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
◼ 0.716 is within the new range [0, 1]
◼ Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling
v' = v / 10^j
◼ Divide each value by a power of 10 (e.g., 10, 100, 1000) to bring the
values into a desired range.
◼ Ex. v = 1,500 can be divided by 1,000; then v is mapped to v' = 1500/1000 = 1.5
(a small code sketch of all three methods follows this slide)

32
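A small sketch of the three normalization methods with the income figures from the slide; the function names min_max, z_score, and decimal_scaling are just illustrative.

```python
# Min-max, z-score, and decimal-scaling normalization sketch.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Rescale v from [vmin, vmax] to [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Number of standard deviations v lies from the mean
    return (v - mean) / std

def decimal_scaling(v, j):
    # Divide by a power of 10
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))      # ~0.716
print(z_score(73_600, 54_000, 16_000))      # 1.225
print(decimal_scaling(1_500, 3))            # 1.5
```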
Data Transformation:
Discretization
◼ There are three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession

◼ Ordinal—values from an ordered set, e.g., academic levels, ratings

◼ Numeric—integer or real values, e.g., age, height, weight, income

◼ Discretization: Divide the range of a continuous attribute into


intervals
◼ Interval labels can then be used to replace actual data values
• E.g. {0-18: Teenager}, {19-65: Adult}, {66+: Senior}
◼ Reduce data size by discretization
◼ Supervised vs. unsupervised
◼ Split (top-down) vs. merge (bottom-up)
◼ Discretization can be performed recursively on an attribute
◼ Prepare for further analysis, e.g., classification (Discretization is often used
as a preprocessing step for classification)
33
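A minimal sketch of replacing numeric ages with interval labels, as in the example above, using pandas.cut (an assumed tool choice); the bin edges roughly follow the slide's {0-18, 19-65, 66+} split.

```python
# Discretization sketch: replace numeric ages with interval labels.
import pandas as pd

ages = pd.Series([12, 17, 25, 43, 64, 70, 81])
labels = pd.cut(ages,
                bins=[0, 18, 65, 120],              # (0,18], (18,65], (65,120]
                labels=["Teenager", "Adult", "Senior"])
print(pd.DataFrame({"age": ages, "label": labels}))
```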
Data Transformation:
Discretization

◼ Discretization methods:
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis
◼ unsupervised, top-down split or bottom-up merge
◼ Decision-tree analysis
◼ supervised, top-down split

34
Discretization by Binning Methods
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
35
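A plain NumPy sketch reproducing the binning example above: equal-frequency bins, smoothing by bin means, and smoothing by bin boundaries.

```python
# Binning and smoothing sketch for the price example.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)          # data is already sorted; 3 bins of 4 values

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value becomes the closer bin boundary
by_boundaries = bins.copy()
for b in by_boundaries:
    lo, hi = b[0], b[-1]
    b[:] = np.where(b - lo <= hi - b, lo, hi)

print("bins:\n", bins)
print("smoothed by means:\n", by_means)
print("smoothed by boundaries:\n", by_boundaries)
```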
Discretization by Classification &
Correlation Analysis
◼ Classification (e.g., decision tree analysis)
◼ Supervised: meaning it uses class labels (the target variable you're trying
to predict) to guide the discretization.
◼ focuses on finding splits that improve the classification accuracy.

◼ Clustering (e.g., k-means)


◼ Use a clustering algorithm to group similar values together. This can be
top-down (split the data into a fixed number of clusters) or bottom-up
(start with each value as its own cluster and merge them).
◼ Unsupervised

36
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

37
Data Transformation:
Concept Hierarchy Generation
◼ Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
and is usually associated with each dimension in a data warehouse ("levels of detail")
◼ Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity.
◼ Drilling Down: Moving from a higher level to a lower level in the hierarchy. (e.g.
see sales by country then further by city)
◼ Rolling Up: Moving from a lower level to a higher level. (e.g. see sales by city then
further by country)
◼ Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)

38
Data Transformation:
Concept Hierarchy Generation

◼ Concept hierarchies can be explicitly specified by domain experts and/or data


warehouse designers (manually define the hierarchy)
◼ Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.
◼ Concept Hierarchies Benefits:
◼ Analyze data at different levels of detail, providing a more comprehensive
view
◼ Data Summarization: Rolling up aggregates data, making it easier to
see overall trends
◼ Speed up query processing

39
Concept Hierarchy Generation
for Nominal Data

◼ Specification of a partial/total ordering of attributes explicitly at the


schema level by users or experts
◼ street < city < state < country
◼ Specification of a hierarchy for a set of values by explicit data
grouping (manually group specific values to create higher-level
concepts)
◼ {Urbana, Champaign, Chicago} (belong to) < Illinois
◼ Specification of only a partial set of attributes (don't have to define
the entire hierarchy at once)
◼ E.g., only street < city, not others

40
Automatic Concept Hierarchy Generation

◼ Some hierarchies can be automatically generated based on the


analysis of the number of distinct values per attribute in the data set
◼ The attribute with the most distinct values is placed at the lowest
level of the hierarchy

◼ E.g., for a set of attributes: {street, city, state, country}


• street: 674,339 distinct values
• city: 3,567 distinct values
• state: 365 distinct values
• country: 15 distinct values

41
Automatic Concept Hierarchy Generation

◼ The system automatically creates the hierarchy: street < city <
state < country because "street" has the most distinct values, and
"country" has the fewest.

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

42
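A tiny sketch of the distinct-value heuristic above using pandas; the location records are toy values chosen only so that the distinct-value counts increase from country down to street.

```python
# Automatic concept hierarchy sketch: order attributes by distinct-value counts.
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "US", "US", "CA", "CA"],
    "state":   ["IL", "IL", "IL", "CA", "ON", "ON"],
    "city":    ["Chicago", "Chicago", "Urbana", "LA", "Toronto", "Toronto"],
    "street":  ["Main St", "Green St", "Oak St", "1st St", "King St", "Bay St"],
})

distinct = df.nunique().sort_values()              # fewest distinct values first
hierarchy = " < ".join(reversed(distinct.index.tolist()))
print(distinct.to_dict())
print("generated hierarchy:", hierarchy)           # street < city < state < country
```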
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

43
References

◼ Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber and Jian Pei
◼ "Data Science for Business" by Foster Provost and
Tom Fawcett

44
