Data Warehousing - CH3
Mr. Fisha M.
Mekelle Institute of Technology
Data Preprocessing
Today’s real-world databases are:
Highly susceptible to noisy, missing, and inconsistent data due
to their typically huge size and their likely origin from multiple,
heterogeneous sources.
Why Preprocess Data:
To improve data quality, which in turn helps improve the mining
process and its results.
Low-quality data will lead to low-quality mining results.
Factors comprising data quality:
Accuracy
Completeness
Consistency
Timeliness
Believability
Interpretability
Major Tasks in Data Preprocessing
[Figure: overview of the major preprocessing tasks: data cleaning, data integration, data reduction, and data transformation.]
Cont’d...
Data Preprocessing involves the following tasks:
1. Data Cleaning
Attempts to fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the data.
Missing Values
Methods for filling in missing values:
1. Ignore the tuple:
- This is usually done when the class label is missing.
- This method is not very effective unless the tuple contains
several attributes with missing values.
- It is especially poor when the percentage of missing values per
attribute varies considerably.
- By ignoring the tuple, we do not make use of the values of the
remaining attributes in the tuple.
Cont’d...
2. Fill in the missing value manually:
- Time-consuming and may not be feasible given a large data set
with many missing values.
3. Use a global constant to fill in the missing value:
- Replace all missing attribute values by the same constant, such
as a label like “Unknown”.
4. Use a measure of central tendency for the attribute (e.g., the
mean or median) to fill in the missing value.
5. Use the attribute mean or median for all samples belonging to
the same class as the given tuple.
6. Use the most probable value to fill in the missing value:
- This may be determined with regression, inference-based tools
(e.g., a Bayesian formalism), or decision tree induction.
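A minimal Python sketch of several of these strategies, assuming pandas is available; the table, column names, and values below are illustrative, not from the slides:

import pandas as pd

# Illustrative data set with missing values (NaN); "class_label" plays the
# role of the class attribute, "income" has two missing entries.
df = pd.DataFrame({
    "class_label": ["A", "A", "B", "B"],
    "income": [3000.0, None, 4500.0, None],
})

# Method 1: ignore (drop) tuples that contain missing values.
dropped = df.dropna()

# Method 3: fill with a global constant such as a sentinel value.
constant_filled = df.fillna({"income": -1.0})

# Method 4: fill with a measure of central tendency (here, the mean).
mean_filled = df.fillna({"income": df["income"].mean()})

# Method 5: fill with the mean of all samples in the same class.
by_class = df.copy()
by_class["income"] = df.groupby("class_label")["income"].transform(
    lambda s: s.fillna(s.mean())
)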
Cont’d...
2. Noisy Data
Noise is a random error or variance in a measured variable.
Data smoothing techniques:
Cont’d...
1. Binning:
- Smooths a sorted data value by consulting its “neighborhood,”
that is, the values around it.
- The sorted values are distributed into a number of “buckets,”
or bins.
- Because binning methods consult the neighborhood of values,
they perform local smoothing.
- Smoothing by bin means: each value in a bin is replaced by
the mean value of the bin.
- Smoothing by bin medians: each bin value is replaced by the
bin median.
- Smoothing by bin boundaries: the minimum and maximum
values in a given bin are identified as the bin boundaries. Each
bin value is then replaced by the closest boundary value.
Eg.
Sorted data for price (in dollars): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into (equal-frequency) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34
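The example above can be reproduced with a short Python sketch; this is a minimal implementation of equal-frequency binning with the three smoothing variants (numpy assumed available):

import numpy as np

def smooth_by_bins(values, n_bins, method="means"):
    """Sort values, split into equal-frequency bins, smooth locally."""
    data = np.sort(np.asarray(values, dtype=float))
    smoothed = []
    for bin_vals in np.array_split(data, n_bins):
        if method == "means":
            smoothed.extend([bin_vals.mean()] * len(bin_vals))
        elif method == "medians":
            smoothed.extend([np.median(bin_vals)] * len(bin_vals))
        elif method == "boundaries":
            lo, hi = bin_vals.min(), bin_vals.max()
            # replace each value by the closest bin boundary
            smoothed.extend(lo if v - lo <= hi - v else hi for v in bin_vals)
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bins(prices, 3, "means"))       # 9, 9, 9, 22, 22, 22, 29, 29, 29
print(smooth_by_bins(prices, 3, "boundaries"))  # 4, 4, 15, 21, 21, 24, 25, 25, 34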
Cont’d...
2. Regression
- A technique that conforms data values to a function.
- Linear regression involves finding the “best” line to fit two
attributes (or variables) so that one attribute can be used to
predict the other.
- Multiple linear regression is an extension of linear regression,
where more than two attributes are involved and the data are fit
to a multidimensional surface.
- (A sketch of regression-based smoothing, together with
clustering-based outlier detection, follows this list.)
3. Outlier Analysis
- Outliers may be detected by clustering, for example, where
similar values are organized into groups, or “clusters.”
- Intuitively, values that fall outside of the set of clusters may be
considered outliers.
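A minimal sketch of both techniques, assuming numpy is available; the data are synthetic and the outlier threshold is an illustrative choice, not from the slides:

import numpy as np

rng = np.random.default_rng(0)

# Regression-based smoothing: fit the "best" line y = a*x + b between two
# attributes and use the fitted values as the smoothed attribute.
x = np.arange(50, dtype=float)
y = 2.0 * x + 5.0 + rng.normal(scale=4.0, size=50)  # noisy linear data
a, b = np.polyfit(x, y, deg=1)                      # least-squares line
y_smoothed = a * x + b

# Clustering-based outlier detection: group values around k centers with a
# tiny 1-D k-means loop, then flag values far from their nearest center.
values = np.concatenate([rng.normal(10, 1, 30), rng.normal(50, 1, 30), [200.0]])
centers = np.array([10.0, 50.0])                    # assumed initial centers
for _ in range(10):                                 # Lloyd iterations
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    centers = np.array([values[labels == k].mean() for k in range(len(centers))])
labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
dist = np.abs(values - centers[labels])
outliers = values[dist > 3 * dist.std()]            # crude distance threshold
print(outliers)                                     # flags the value 200.0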
Cont’d...
2. Data Integration
Data mining often requires data integration — the merging of
data from multiple data stores.
Careful integration can help reduce and avoid redundancies and
inconsistencies in the resulting data set. This can help improve
the accuracy and speed of the subsequent data mining process.
The semantic heterogeneity and structure of data pose great
challenges in data integration.
It involves:
- The entity identification problem
- Redundancy and correlation analysis (a code sketch follows this list)
- Tuple duplication
- Data value conflict detection and resolution
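For numeric attributes, redundancy can be checked with the correlation coefficient (for nominal attributes a chi-square test would be used instead). A minimal numpy sketch, with illustrative attribute names (a near-duplicate height attribute merged from a second source):

import numpy as np

rng = np.random.default_rng(1)
height_cm = rng.normal(170, 10, 100)
height_in = height_cm / 2.54 + rng.normal(0, 0.1, 100)  # near-duplicate attribute

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(height_cm, height_in)[0, 1]
print(f"correlation = {r:.3f}")

# |r| close to 1 suggests one attribute is derivable from the other and is
# therefore redundant in the integrated data set.
if abs(r) > 0.9:
    print("strongly correlated -> likely redundant")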
Cont’d...
3. Data Reduction
is a technique that can be applied to obtain a reduced
representation of the data set that is much smaller in volume, yet
closely maintains the integrity of the original data.
That is, mining on the reduced data set should be more efficient
yet produce the same (or almost the same) analytical results.
Data reduction strategies include:
1 Dimensionality reduction
2 Numerosity reduction
3 Data compression
Cont’d...
i Dimensionality reduction
is the process of reducing the number of random variables or
attributes under consideration. It includes:
Wavelet transforms and principal components analysis, which
transform or project the original data onto a smaller space
(a code sketch follows after item iii below).
Attribute subset selection: a method of dimensionality
reduction in which irrelevant, weakly relevant, or redundant
attributes or dimensions are detected and removed.
ii Numerosity reduction
A technique that replaces the original data volume by
alternative, smaller forms of data representation.
This technique may be parametric or non-parametric.
In parametric methods, a model is used to estimate the data, so
that typically only the data parameters need to be stored
instead of the actual data (e.g., regression and log-linear
models).
Nonparametric methods for storing reduced representations of
the data include histograms, sampling, and data cube aggregation.
iii Data compression
Transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
Lossless compression: the original data can be reconstructed
from the compressed data without any information loss.
Lossy compression: only an approximation of the original data
can be reconstructed from the compressed data.
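A minimal sketch illustrating all three strategies on synthetic data, assuming numpy is available; PCA stands in for dimensionality reduction, random sampling for numerosity reduction, and zlib for lossless compression:

import zlib
import numpy as np

def pca_project(X, k):
    """Project data onto its first k principal components (via SVD)."""
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)  # rows of Vt are PCs
    return X_centered @ Vt[:k].T  # reduced, k-dimensional representation

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))          # 200 tuples, 10 attributes

# Dimensionality reduction: 10 attributes -> 3 derived attributes.
X_reduced = pca_project(X, k=3)         # shape (200, 3)

# Numerosity reduction: simple random sampling without replacement.
sample = X[rng.choice(len(X), size=20, replace=False)]  # shape (20, 10)

# Lossless compression: random data compresses poorly, but the original
# bytes are recovered exactly.
compressed = zlib.compress(X.tobytes())
restored = np.frombuffer(zlib.decompress(compressed), dtype=X.dtype).reshape(X.shape)
assert np.array_equal(X, restored)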
4. Data Transformation
In this preprocessing step, the data are transformed or
consolidated so that the resulting mining process may be more
efficient, and the patterns found may be easier to understand.
Includes:
1 Smoothing, which works to remove noise from the data.
Techniques include binning, regression, and clustering.
2 Attribute construction (or feature construction) where new
attributes are constructed and added from the given set of
attributes to help the mining process.
3 Aggregation where summary or aggregation operations are
applied to the data. For example, the daily sales data may be
aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube
for data analysis at multiple abstraction levels.
4 Normalization, where the attribute data are scaled so as to fall
within a smaller range, such as -1.0 to 1.0, or 0.0 to 1.0
(see the sketch at the end of this list).
5 Discretization where the raw values of a numeric attribute (e.g.,
age) are replaced by interval labels (e.g., 0–10, 11–20, etc.) or
conceptual labels (e.g., youth, adult, senior).
6 Concept hierarchy generation for nominal data where attributes
such as street can be generalized to higher-level concepts, like
city or country. Many hierarchies for nominal attributes are
implicit within the database schema and can be automatically
defined at the schema definition level.
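A minimal sketch of normalization and discretization, assuming numpy; the age values, interval boundaries, and concept cut-offs are illustrative assumptions:

import numpy as np

ages = np.array([13, 15, 16, 19, 25, 35, 52, 70])

# Min-max normalization to [0.0, 1.0]:
#   v' = (v - min) / (max - min) * (new_max - new_min) + new_min
ages_norm = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization into interval labels (0-10, 11-20, ...):
bins = [0, 10, 20, 30, 40, 50, 60, 70]
interval_index = np.digitize(ages, bins)  # interval each age falls into

# Discretization into conceptual labels (youth / adult / senior);
# the cut-offs below are assumptions for illustration.
concept = np.where(ages < 20, "youth", np.where(ages < 60, "adult", "senior"))
print(ages_norm, interval_index, concept)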