3-Data Fundamentals for BI- Part2

The document outlines major tasks in data preprocessing for business intelligence, including data cleaning, integration, reduction, transformation, and discretization. It emphasizes data reduction strategies such as dimensionality reduction through Principal Component Analysis (PCA), numerosity reduction, and attribute selection. Additionally, it discusses heuristic search methods for attribute selection and various non-parametric techniques like clustering and sampling for effective data representation.

Data Fundamentals for BI in a Business
Part 2

1
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

2
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

3
Data reduction strategies:
Principal Component Analysis (PCA)
◼ Simplifying data by finding a projection that captures the largest
amount of variation in data
◼ The original data are projected onto a much smaller space,
resulting in dimensionality reduction (e.g., instead of two features, with
PCA we might have just one).
◼ We find the eigenvectors of the covariance matrix, and these
eigenvectors define the new space
[Figure: data points in the (x1, x2) plane. The vector e is the principal component, the direction of the greatest variation; the red dashed line represents the approximate boundary within which most of the data points lie.]
4
Principal Component Analysis (Steps)

◼ Given N data vectors in n dimensions, find k ≤ n orthogonal vectors
(principal components) that can best be used to represent the data
1) Normalize input data: Each attribute falls within the same range
2) Compute Principal Components: PCA calculates k orthogonal (perpendicular)
vectors that point in the directions of the greatest variance in data
3) Represent Data as Linear Combinations: Each original data point can be
reconstructed (at least approximately) using these principal components.
4) The principal components are sorted in order of decreasing “significance” or
strength
5) Reduce Data Size: Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low variance (i.e.,
using the strongest principal components, it is possible to reconstruct a good
approximation of the original data)
◼ Works for numeric data only

5
Principal Component Analysis (Example)

◼ A marketing manager wants to segment customers based on their purchasing


behavior. There are two features for each customer:
• X1: Amount spent on electronics per year
• X2: Number of clothing items purchased per year
◼ Four customers (N=4):
Customer X1 X2
A 2 10
B 4 8
C 6 6
D 8 4

◼ You want to reduce this 2-dimensional data to a single "customer profile" score
(k=1) for easier segmentation.

6
Principal Component Analysis (Example)
◼ Normalize Data: standardize the data by subtracting the mean and dividing
by the standard deviation for each feature. This makes the features
comparable.
Customer X1 X2
A -1.5 1.5
B -0.5 0.5
C 0.5 -0.5
D 1.5 -1.5

◼ Compute Covariance Matrix: Calculate the covariance between X1 and X2.


This measures how the two features vary together. Then, find Eigenvectors
and Eigenvalues of the covariance matrix. The eigenvectors represent the
principal components, and the eigenvalues represent their "strength". (This
requires linear algebra; the detailed calculation is skipped here, but many
software tools can do it.)

7
Principal Component Analysis (Example)
◼ Let's assume

◼ Eigenvalue 1: 2.5
◼ Eigenvalue 2: 0

▪ The principal components are sorted: Eigenvalue 1 is much larger than


Eigenvalue 2, so Eigenvector 1 is our most significant principal component.
Assume Eigenvector 1 is [0.7071, -0.7071].

▪ Reduce Data Size: Project each normalized data point onto the first principal
component (Eigenvector 1). This is done by taking the dot product of each
data point with the eigenvector. This gives us the "customer profile" score.
(e.g. Customer A (-1.5, 1.5): Score = (-1.5 × 0.7071) + (1.5 × -0.7071)
= -1.06 + (-1.06) ≈ -2.12; a code sketch follows this slide)
Customer    X1      X2     Customer Profile Score
A          -1.5     1.5    -2.12
B          -0.5     0.5    -0.71
C           0.5    -0.5     0.71
D           1.5    -1.5     2.12
8
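A minimal NumPy sketch of the steps in the example above (NumPy is an assumed tool choice; the slides do not prescribe one). It only centers the data rather than applying the slide's exact scaling, so the scores differ from the table by a constant factor, but the ordering of customers is the same.

```python
# Sketch of the PCA "customer profile" example; NumPy is an assumed tool choice.
import numpy as np

X = np.array([[2, 10],   # Customer A: [electronics spend, clothing items]
              [4,  8],   # Customer B
              [6,  6],   # Customer C
              [8,  4]])  # Customer D

# 1) Center the data (the slide also rescales it; centering is the essential step)
Xc = X - X.mean(axis=0)

# 2) Covariance matrix and its eigen-decomposition
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3) Sort components by decreasing eigenvalue ("significance")
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4) Keep only the strongest component (k = 1) and project the data onto it
pc1 = eigvecs[:, 0]
scores = Xc @ pc1   # one "customer profile" score per customer

print("eigenvalues:", eigvals)
print("first principal component:", pc1)
print("profile scores:", scores)
```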
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

9
Data reduction strategies:
Attribute Subset Selection

◼ Another way to reduce dimensionality of data

◼ Two Main Types of Unnecessary Attributes:


◼ Redundant attributes: Duplicate much or all of the information
contained in one or more other attributes
◼ E.g., purchase price of a product and the amount of sales tax paid (the tax is
directly calculated from the purchase price)
◼ Irrelevant attributes: Contain no information that is useful for
the data mining task at hand
◼ E.g., students' ID is often irrelevant to the task of predicting students' GPA

10
Heuristic Search in Attribute Selection

◼ There are 2^d possible attribute combinations of d attributes

◼ e.g. Assume 10 features (d = 10) to predict price. There are 2^10 = 1024
possible combinations of these features. Finding the absolute best
combination by trying every single one is computationally very
expensive.

◼ Heuristic search methods provide practical ways to select relevant


attributes when the number of possible combinations is too large. They don't
guarantee the absolute best solution, but they aim to find a very good one
in a reasonable amount of time.

11
Heuristic Search in Attribute Selection

◼ Some heuristic attribute selection methods:

1) Best single attribute under the attribute independence


assumption: choose by significance tests (Pick only the single
best attribute that improves the model)

2) Best step-wise feature selection:


◼ Start with an empty set of attributes

◼ The best single attribute is picked first

◼ Then the next best attribute conditioned on the first, and so on (see the sketch after this slide)

3) Step-wise attribute elimination: Start with all attributes,


repeatedly eliminate the worst attribute

12
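A rough sketch of best step-wise (forward) feature selection, as referenced above. It assumes scikit-learn is available and uses cross-validated R² of a linear model as the scoring criterion; the slides leave the criterion open, and the helper name forward_selection is just for illustration.

```python
# Forward step-wise attribute selection sketch; scikit-learn is an assumed choice.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=None):
    n_features = X.shape[1]
    selected, remaining = [], list(range(n_features))
    best_score = -np.inf
    while remaining and (max_features is None or len(selected) < max_features):
        # Score each candidate attribute added to the current subset
        scores = {j: cross_val_score(LinearRegression(),
                                     X[:, selected + [j]], y, cv=3).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:      # no improvement -> stop
            break
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected

# Toy data: only the first two of five attributes carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
print("selected attributes:", forward_selection(X, y))
```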
Data reduction strategies:
Attribute Creation (Feature Generation)
◼ Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
◼ Three general methodologies
◼ Attribute extraction
◼ Domain-specific
• E.g. Original features: Pixel values of an image
• Extracted features: Edges, corners, textures, or shapes, ….

◼ Mapping data to new space (see: data reduction)


◼ E.g. wavelet transformation

◼ Attribute construction (Combining features)


• E.g. Original features: User ratings for individual products.
• Constructed features: Average rating for each user.

13
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

14
Numerosity Reduction
◼ Reduce data volume by choosing alternative, smaller forms of data
representation
◼ Parametric methods (e.g., regression)
◼ Assume the data fits some model, estimate model parameters, store only
the parameters, and discard the data (except possible outliers)
• Ex.: data on house sizes (x) and prices (y). Assume a linear
relationship: Y = w X + b
• Instead of storing every house's size and price, only store the slope (w)
and intercept (b) of the line, can then use these parameters to
reconstruct (approximately) the price of a house given its size.
◼ Non-parametric methods
◼ Do not assume models
◼ use techniques like histograms, clustering, or sampling to represent the
data in a compressed form.
◼ Major families: histograms, clustering, sampling, …

15
Numerosity Reduction
Parametric Data Reduction
◼ Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (y)
(also called response variable or measurement) and of one or more
independent variables (x) (aka. explanatory variables or predictors)
◼ The parameters are estimated so as to give a "best fit" of the data
◼ Used for prediction (including forecasting of time-series data), and
modeling of causal relationships
[Figure: a fitted line y = x + 1 through data points, with an observed value Y1 and its fitted value Y1' at X1.]

16
Numerosity Reduction
Parametric Data Reduction
◼ Linear regression: Y = w X + b
◼ Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
◼ Predicts a continuous value based on a single independent variable
◼ Multiple regression: Y = b0 + b1 X1 + b2 X2
◼ Allows a response variable Y to be modeled as a linear function of
multidimensional feature vector (multiple independent variables)
◼ a hyperplane (in higher dimensions) instead of a line.
◼ Log-linear models:
◼ Approximate discrete multidimensional probability distributions
◼ Models the probabilities of different combinations of categorical variables.
• E.g. Dimension 1: Did the customer buy coffee? (Yes/No)
• Dimension 2: Did the customer buy milk? (Yes/No)
• A log-linear model can estimate the probability of a customer buying coffee and milk

17
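A small sketch of the parametric idea above: fit Y = wX + b to house sizes and prices, then keep only the two coefficients instead of the raw data. The sizes and prices below are made-up illustration values, not data from the slides.

```python
# Parametric numerosity reduction: store only the regression coefficients.
import numpy as np

sizes  = np.array([ 50,  80, 100, 120, 150, 200], dtype=float)   # m^2 (toy values)
prices = np.array([110, 165, 205, 245, 300, 395], dtype=float)   # k$  (toy values)

# Least-squares fit of a line: prices ~ w * sizes + b
w, b = np.polyfit(sizes, prices, deg=1)
print(f"stored parameters: w={w:.2f}, b={b:.2f}")

# The raw data can now be discarded; prices are reconstructed approximately
approx = w * sizes + b
print("reconstruction error:", np.round(prices - approx, 1))
```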
Numerosity Reduction:
Non-Parametric Data Reduction

◼ Histogram Analysis
◼ Divide data into buckets and store the average (or sum) for each bucket
◼ Partitioning rules:
◼ Equal-width: Each bucket covers the same range of values.
◼ Equal-frequency: Each bucket contains (approximately) the same number of data points.
[Figure: example histogram with counts from 0 to 40 over values from 10,000 to 90,000.]

18
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis

◼ Equal-width (distance) partitioning


◼ Divides the range into N intervals of equal size
◼ if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B − A)/N.
• Ex: customer ages ranging from 10 to 70 (A=10, B=70), You want 3
intervals (N=3).
• Interval width: W = (70 - 10) / 3 = 20
• Intervals: 10-30, 31-50, 51-70
◼ Simple to understand and implement, but outliers (very large or very
small values) may dominate the presentation, so skewed data is not
handled well

19
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Histogram Analysis

◼ Equal-depth (frequency) partitioning


◼ Divides the range into N intervals, each containing
approximately the same number of samples
• Ex: You have 100 customer ages, and you want 5 intervals.
Ideally, each interval should have 100 / 5 = 20 customers.
• Sort the ages and then create the intervals so that roughly 20
ages fall into each.
• So, the bucket widths might be different. For instance, the first
bucket might be 0-15 years, the second 16-22 years, the third
23-30 years, and so on.
◼ Good data scaling

20
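A brief sketch of both partitioning rules using pandas (pandas.cut and pandas.qcut are assumed tool choices); the customer ages are randomly generated for illustration.

```python
# Equal-width vs. equal-frequency partitioning sketch with pandas.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
ages = pd.Series(rng.integers(10, 71, size=100))   # customer ages in [10, 70]

# Equal-width: 3 intervals of identical width, W = (70 - 10) / 3 = 20
equal_width = pd.cut(ages, bins=3)
print(equal_width.value_counts().sort_index())     # counts may be very uneven

# Equal-frequency (equal-depth): 5 intervals with roughly 20 ages each
equal_depth = pd.qcut(ages, q=5)
print(equal_depth.value_counts().sort_index())     # widths differ, counts are ~equal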
Numerosity Reduction:
Non-Parametric Data Reduction
◼ Clustering
◼ Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
◼ Can be very effective if data is clustered but not if data is
“smeared” (doesn't have clear groupings)
◼ Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
◼ There are many choices of clustering definitions and clustering
algorithms. The choice depends on the specific data and the
goal of the analysis.
◼ Only store cluster representations (like the centroid and
diameter) instead of all individual data points.

21
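A sketch of clustering-based reduction, assuming scikit-learn's KMeans as the clustering algorithm (the slide leaves the algorithm open): 600 synthetic points are replaced by three centroid/diameter pairs.

```python
# Clustering-based numerosity reduction sketch; scikit-learn is an assumed choice.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Three well-separated blobs of 2-D points (illustrative data)
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
                  for c in ([0, 0], [5, 5], [0, 5])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

# Keep only one centroid and "diameter" per cluster instead of 600 points
for k in range(3):
    members = data[km.labels_ == k]
    centroid = km.cluster_centers_[k]
    diameter = 2 * np.max(np.linalg.norm(members - centroid, axis=1))
    print(f"cluster {k}: centroid={np.round(centroid, 2)}, diameter={diameter:.2f}")
```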
Numerosity Reduction:
Non-Parametric Data Reduction

◼ Sampling
◼ Analyzing the entire dataset might be computationally very expensive,
obtaining a small sample s to represent the whole data set N reduces the
processing time.
◼ Key principle: Choose a representative subset of the data
◼ Simple random sampling may have very poor performance in the
presence of skew (unevenly distributed)
◼ Develop adaptive sampling methods, e.g., stratified sampling

22
Numerosity Reduction:
Non-Parametric Data

◼ Types of Sampling:
◼ Simple random sampling
◼ There is an equal probability of selecting any particular item
◼ Sampling without replacement
◼ Once an object is selected, it is removed from the population
◼ Sampling with replacement
◼ A selected object is not removed from the population
◼ Stratified sampling
◼ Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
◼ Divide your data into groups (strata) based on some characteristic.
Take a random sample from each group
◼ Used in conjunction with skewed data

23
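A short pandas sketch of the sampling variants listed above; the data frame, column names, and the 5% sampling fraction are illustrative assumptions.

```python
# Simple random sampling (with/without replacement) and stratified sampling sketch.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "region": rng.choice(["East", "West", "North"], size=1000, p=[0.7, 0.2, 0.1]),
    "sales": rng.exponential(scale=100, size=1000),
})

srs_without = df.sample(n=50, replace=False, random_state=0)  # without replacement
srs_with    = df.sample(n=50, replace=True,  random_state=0)  # with replacement

# Stratified: sample ~5% from each region so skewed groups stay represented
stratified = df.groupby("region").sample(frac=0.05, random_state=0)
print(stratified["region"].value_counts())
```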
Sampling: With or without Replacement

[Figure: drawing a sample from the raw data, with or without replacement.]
24
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

25
Numerosity Reduction:
Data Cube Aggregation

◼ The lowest level of a data cube (base cuboid) is the most detailed
level of the data cube, e.g. [Product A, Region East, January 2023:
$1000 sales], [Product B, ….]
◼ It holds the aggregated data for an individual entity of interest

◼ Multiple levels of aggregation in data cubes (e.g. Aggregating by


Region and Time for example: sum of sales for all products in East
region in January)
◼ Further reduce the size of data to deal with
◼ Use the smallest representation which is enough to solve the task
and is sufficient to answer your query

26
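A minimal sketch of rolling a detailed base cuboid up to coarser levels with pandas groupby; the sales records are invented for illustration and the "cube" here is simply a flat table.

```python
# Data cube aggregation sketch: base cuboid -> higher-level aggregates.
import pandas as pd

base = pd.DataFrame({
    "product": ["A", "A", "B", "B", "A", "B"],
    "region":  ["East", "West", "East", "West", "East", "East"],
    "month":   ["Jan", "Jan", "Jan", "Feb", "Feb", "Feb"],
    "sales":   [1000, 800, 600, 700, 1100, 650],
})

# Base cuboid: most detailed level (one row per product/region/month)
print(base)

# Aggregate away "product": sales per region and month
print(base.groupby(["region", "month"], as_index=False)["sales"].sum())

# Aggregate further: total sales per region only
print(base.groupby("region", as_index=False)["sales"].sum())
```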
Data Reduction Strategies

◼ Data reduction strategies


◼ Dimensionality reduction, e.g., remove unimportant attributes

◼ Wavelet transforms

◼ Principal Components Analysis (PCA)

◼ Feature subset selection, feature creation

◼ Numerosity reduction (some simply call it: Data Reduction)

◼ Regression and Log-Linear Models

◼ Histograms, clustering, sampling

◼ Data cube aggregation

◼ Data compression

27
Data Reduction Strategies:
Data Compression

◼ The main goal of data compression is to represent the same information


using fewer bits. This saves storage space and speeds up data transmission.
◼ String compression
◼ Typically lossless (decompressed to get the exact original text)
◼ Audio/video compression
◼ Typically lossy compression (Some information is lost during compression), with
progressive refinement
◼ Time-sequence data (not audio)
◼ Includes sensor readings, stock prices, or weather data
◼ Typically short and varies slowly with time. Can be lossy or lossless.

◼ Dimensionality and numerosity reduction may also be considered as forms of


data compression

28
Data Reduction Strategies:
Data Compression

[Figure: the original data is reduced to compressed data; lossless compression reconstructs the original data exactly, while lossy compression yields only an approximation of the original data.]

29
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

30
Data Transformation
◼ A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
◼ Methods
◼ Smoothing: Remove noise from data
◼ Attribute/feature construction
◼ New attributes constructed from the given ones
◼ Aggregation: Summarization, data cube construction
◼ Normalization: Scaled to fall within a smaller, specified range
◼ min-max normalization
◼ z-score normalization
◼ normalization by decimal scaling
◼ Discretization: Concept hierarchy climbing
31
Data Transformation:
Normalization
◼ Min-max normalization: to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
◼ 0.716 is within the new range [0, 1]
◼ Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
◼ Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
◼ Normalization by decimal scaling
v' = v / 10^j
◼ Divide each value by a power of 10 (e.g., 10, 100, 1000) to bring the
values into a desired range.
◼ Ex. v = 1,500 can be divided by 1,000; then v is mapped to v' = 1500/1000 = 1.5
(a small code sketch of all three methods follows this slide)

32
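A small sketch of the three normalization methods with the income figures from the slide; the function names min_max, z_score, and decimal_scaling are just illustrative.

```python
# Min-max, z-score, and decimal-scaling normalization sketch.
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    # Rescale v from [vmin, vmax] to [new_min, new_max]
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    # Number of standard deviations v lies from the mean
    return (v - mean) / std

def decimal_scaling(v, j):
    # Divide by a power of 10
    return v / 10 ** j

print(min_max(73_600, 12_000, 98_000))      # ~0.716
print(z_score(73_600, 54_000, 16_000))      # 1.225
print(decimal_scaling(1_500, 3))            # 1.5
```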
Data Transformation:
Discretization
◼ There are three types of attributes
◼ Nominal—values from an unordered set, e.g., color, profession

◼ Ordinal—values from an ordered set, e.g., academic levels, ratings

◼ Numeric—integer or real values, e.g., age, height, weight, income

◼ Discretization: Divide the range of a continuous attribute into


intervals
◼ Interval labels can then be used to replace actual data values
• E.g. {0-18: Teenager}, {19-65: Adult}, {66+: Senior}
◼ Reduce data size by discretization
◼ Supervised vs. unsupervised
◼ Split (top-down) vs. merge (bottom-up)
◼ Discretization can be performed recursively on an attribute
◼ Prepare for further analysis, e.g., classification (Discretization is often used
as a preprocessing step for classification)
33
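A minimal sketch of replacing numeric ages with interval labels, as in the example above, using pandas.cut (an assumed tool choice); the bin edges roughly follow the slide's {0-18, 19-65, 66+} split.

```python
# Discretization sketch: replace numeric ages with interval labels.
import pandas as pd

ages = pd.Series([12, 17, 25, 43, 64, 70, 81])
labels = pd.cut(ages,
                bins=[0, 18, 65, 120],              # (0,18], (18,65], (65,120]
                labels=["Teenager", "Adult", "Senior"])
print(pd.DataFrame({"age": ages, "label": labels}))
```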
Data Transformation:
Discretization

◼ Discretization methods:
◼ Binning
◼ Top-down split, unsupervised
◼ Histogram analysis
◼ Top-down split, unsupervised
◼ Clustering analysis
◼ unsupervised, top-down split or bottom-up merge
◼ Decision-tree analysis
◼ supervised, top-down split

34
Discretization by Binning Methods
❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
35
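A plain NumPy sketch reproducing the binning example above: equal-frequency bins, smoothing by bin means, and smoothing by bin boundaries.

```python
# Binning and smoothing sketch for the price example.
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = prices.reshape(3, 4)          # data is already sorted; 3 bins of 4 values

# Smoothing by bin means: every value becomes its bin's (rounded) mean
by_means = np.repeat(np.round(bins.mean(axis=1)), 4).reshape(3, 4)

# Smoothing by bin boundaries: every value becomes the closer bin boundary
by_boundaries = bins.copy()
for b in by_boundaries:
    lo, hi = b[0], b[-1]
    b[:] = np.where(b - lo <= hi - b, lo, hi)

print("bins:\n", bins)
print("smoothed by means:\n", by_means)
print("smoothed by boundaries:\n", by_boundaries)
```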
Discretization by Classification &
Correlation Analysis
◼ Classification (e.g., decision tree analysis)
◼ Supervised: meaning it uses class labels (the target variable you're trying
to predict) to guide the discretization.
◼ focuses on finding splits that improve the classification accuracy.

◼ Clustering (e.g., k-means)


◼ Use a clustering algorithm to group similar values together. This can be
top-down (split the data into a fixed number of clusters) or bottom-up
(start with each value as its own cluster and merge them).
◼ Unsupervised

36
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

37
Data Transformation:
Concept Hierarchy Generation
◼ Concept hierarchy organizes concepts (i.e., attribute values) hierarchically
and is usually associated with each dimension in a data warehouse ("levels of detail")
◼ Concept hierarchies facilitate drilling and rolling in data warehouses to view
data in multiple granularity.
◼ Drilling Down: Moving from a higher level to a lower level in the hierarchy. (e.g.
see sales by country then further by city)
◼ Rolling Up: Moving from a lower level to a higher level. (e.g. see sales by city then
further by country)
◼ Concept hierarchy formation: Recursively reduce the data by collecting and
replacing low level concepts (such as numeric values for age) by higher level
concepts (such as youth, adult, or senior)

38
Data Transformation:
Concept Hierarchy Generation

◼ Concept hierarchies can be explicitly specified by domain experts and/or data


warehouse designers (manually define the hierarchy)
◼ Concept hierarchy can be automatically formed for both numeric and nominal
data. For numeric data, use discretization methods shown.
◼ Concept Hierarchies Benefits:
◼ Analyze data at different levels of detail, providing a more comprehensive
view
◼ Data Summarization: Rolling up aggregates data, making it easier to
see overall trends
◼ Speed up query processing

39
Concept Hierarchy Generation
for Nominal Data

◼ Specification of a partial/total ordering of attributes explicitly at the


schema level by users or experts
◼ street < city < state < country
◼ Specification of a hierarchy for a set of values by explicit data
grouping (manually group specific values to create higher-level
concepts)
◼ {Urbana, Champaign, Chicago} (belong to) < Illinois
◼ Specification of only a partial set of attributes (don't have to define
the entire hierarchy at once)
◼ E.g., only street < city, not others

40
Automatic Concept Hierarchy Generation

◼ Some hierarchies can be automatically generated based on the


analysis of the number of distinct values per attribute in the data set
◼ The attribute with the most distinct values is placed at the lowest
level of the hierarchy

◼ E.g., for a set of attributes: {street, city, state, country}


• street: 674,339 distinct values
• city: 3,567 distinct values
• state: 365 distinct values
• country: 15 distinct values

41
Automatic Concept Hierarchy Generation

◼ The system automatically creates the hierarchy: street < city <
state < country because "street" has the most distinct values, and
"country" has the fewest.

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values

42
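A tiny sketch of the distinct-value heuristic above using pandas; the location records are toy values chosen only so that the distinct-value counts increase from country down to street.

```python
# Automatic concept hierarchy sketch: order attributes by distinct-value counts.
import pandas as pd

df = pd.DataFrame({
    "country": ["US", "US", "US", "US", "CA", "CA"],
    "state":   ["IL", "IL", "IL", "CA", "ON", "ON"],
    "city":    ["Chicago", "Chicago", "Urbana", "LA", "Toronto", "Toronto"],
    "street":  ["Main St", "Green St", "Oak St", "1st St", "King St", "Bay St"],
})

distinct = df.nunique().sort_values()              # fewest distinct values first
hierarchy = " < ".join(reversed(distinct.index.tolist()))
print(distinct.to_dict())
print("generated hierarchy:", hierarchy)           # street < city < state < country
```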
Major Tasks in Data Preprocessing
◼ Data cleaning
◼ Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
◼ Data integration
◼ Integration of multiple databases, data cubes, or files
◼ Data reduction
◼ Dimensionality reduction
◼ Numerosity reduction
◼ Data compression
◼ Data transformation and data discretization
◼ Normalization
◼ Concept hierarchy generation

43
References

◼ Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber and Jian Pei
◼ "Data Science for Business" by Foster Provost and
Tom Fawcett

44
