w2-Data_Preparation

The document outlines the process of data preparation, which includes data integration, selection, reduction, preprocessing, and transformation techniques. Key tasks in data preprocessing involve filling missing values, removing noisy data, identifying outliers, and correcting inconsistencies to ensure quality data for analysis. Various strategies for data reduction, such as aggregation, dimensionality reduction, and clustering, are also discussed to enhance data efficiency and accuracy.

Data Preparation


Outline

◘ Data Integration
◘ Data Selection and Reduction
◘ Data Preprocessing and Data Cleaning
– Filling in Missing Values
– Removing Noisy Data
– Identification of Outliers
– Correcting Inconsistent Data
◘ Data Transformation Techniques
– Normalization
– Discretization
Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Integration

◘ Data integration
– Integration of multiple databases, data cubes, or files
– Obtain data from various sources
Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Selection & Reduction

◘ Data Selection
– Selecting a target data set
– Removing duplicates

◘ Data Reduction
– Obtains a reduced representation of the data set
(smaller in volume, yet producing the same, or almost the same, results)

◘ Why data reduction?


– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run
◘ How to delete rows that are duplicates over a set of columns, keeping only the one with the lowest ID:

◘ This query does that for all rows of tablename having the same column1, column2, and column3.

DELETE FROM tablename
WHERE id IN (
    SELECT id
    FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY column1, column2, column3
                                  ORDER BY id) AS rnum
        FROM tablename
    ) t
    WHERE t.rnum > 1
);

◘ Sometimes a timestamp field is used instead of an ID field.
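
The same duplicate removal can be sketched with pandas; the DataFrame df and the id/column1-column3 names below are assumptions that mirror the SQL above, not part of the original slide.

import pandas as pd

# Hypothetical table mirroring the SQL example: an id plus three key columns.
df = pd.DataFrame({
    "id":      [1, 2, 3, 4],
    "column1": ["a", "a", "b", "b"],
    "column2": ["x", "x", "y", "y"],
    "column3": [10, 10, 20, 30],
})

# Sort by id so the row with the lowest id survives, then drop later duplicates
# over the chosen key columns (rows with id 1 and 2 share the same key here).
deduplicated = (df.sort_values("id")
                  .drop_duplicates(subset=["column1", "column2", "column3"], keep="first"))
print(deduplicated)   # the row with id=2 is removed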


Data Reduction Strategies

1- Data Aggregation — e.g., sum, average


2- Dimensionality Reduction — e.g., remove unimportant attributes
3- Data Compression — e.g., encoding mechanisms
4- Sampling — e.g., represent the data with a small random or stratified sample
5- Clustering — e.g., cluster data
6- Concept hierarchy generation — e.g., street < city < state < country
1- Data Aggregation

◘ Data aggregation is any process in which information is gathered and expressed in a summary form.
◘ Summarization

◘ Example (Histograms):
– A popular data reduction technique
– Divide data into buckets and store average (or sum) for each bucket
[Histogram: value buckets at 10,000 / 30,000 / 50,000 / 70,000 / 90,000, with per-bucket counts ranging from 0 to 40]
Data Aggregation Example
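
As a rough Python sketch of bucket-based aggregation, the snippet below bins synthetic values into the 10,000-90,000 buckets suggested by the histogram and keeps only one count and one mean per bucket; the data themselves are generated for illustration.

import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(10_000, 90_000, size=1_000)      # hypothetical raw values
edges = np.arange(10_000, 90_001, 20_000)             # bucket edges: 10k, 30k, 50k, 70k, 90k

counts, _ = np.histogram(values, bins=edges)          # number of values per bucket
bucket_ids = np.digitize(values, edges[1:-1])         # bucket index of each value
bucket_means = [values[bucket_ids == i].mean() for i in range(len(edges) - 1)]

# The reduced representation: 4 (count, mean) pairs instead of 1,000 raw values.
for lo, hi, c, m in zip(edges[:-1], edges[1:], counts, bucket_means):
    print(f"[{lo:>6.0f}, {hi:>6.0f}): count={c:4d}, mean={m:9.1f}")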
2- Dimensionality Reduction

◘ Attribute subset selection


◘ Remove unimportant attributes
◘ Remove redundant and/or correlating attributes
◘ Combine attributes (sum, multiply, difference)

◘ Example (Decision Tree Induction) :


Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Decision tree: the root splits on A4; the subtrees split on A1 and A6; the leaves are Class 1 / Class 2]

Reduced attribute set: {A1, A4, A6}
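
Besides decision-tree induction, a common subset-selection heuristic is to drop attributes that are highly correlated with another attribute. A minimal pandas sketch, assuming a made-up data frame and a 0.95 correlation threshold:

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical attributes: A2 is essentially a rescaled copy of A1, so it is redundant.
df = pd.DataFrame({"A1": rng.normal(size=200), "A3": rng.normal(size=200)})
df["A2"] = df["A1"] * 2 + rng.normal(scale=0.01, size=200)

corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep upper triangle only
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print("dropped:", to_drop)                                # expected: ['A2']
print("kept   :", list(df.drop(columns=to_drop).columns))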


3- Data Compression

◘ Data compression is the process of reproducing information in a more compact form.
– Number compression
– String compression
– Image/Audio/Video compression

◘ Lossless vs. Lossy Compression

– Original data: 25.888888888
– Lossless compression: 25.[9]8 (e.g., run-length encoding; the original data can be reconstructed exactly)
– Lossy compression: 26 (an approximation; the original data cannot be recovered)
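
A tiny Python illustration of the lossless/lossy distinction; zlib stands in for a generic lossless encoder and rounding for a lossy approximation (both choices are mine, not from the slide).

import zlib

# Lossless: a long run of 8s (as in 25.888888888...) compresses well
# and decompresses back to exactly the original bytes.
original = b"25." + b"8" * 1000
compressed = zlib.compress(original)
assert zlib.decompress(compressed) == original
print(len(original), "->", len(compressed))   # e.g. 1003 -> about 15 bytes

# Lossy: rounding to 26 is even more compact, but the original digits are lost.
approximated = round(25.888888888)            # -> 26
print(approximated)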
4- Sampling
◘ Sampling: obtaining a small sample s to represent the whole data set N.
– Representative sampling
  • Simple random sampling may perform very poorly when the data are skewed
– Stratified sampling
  • Develop adaptive sampling methods
  • Approximate the percentage of each class (stratum) in the overall database

[Figure: Raw Data vs. Cluster/Stratified Sample]
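
A hedged pandas sketch contrasting simple random and stratified sampling; the Gender column, the 90/10 class imbalance, and the 10% sampling fraction are all assumptions made for the example.

import pandas as pd

# Hypothetical data set with an imbalanced class column.
df = pd.DataFrame({"Gender": ["F"] * 900 + ["M"] * 100,
                   "Score":  range(1000)})

# Simple random sample: the class mix of the sample may drift from 90/10.
simple = df.sample(frac=0.1, random_state=0)

# Stratified sample: draw 10% from each class, so the 90/10 mix is preserved.
stratified = df.groupby("Gender", group_keys=False).sample(frac=0.1, random_state=0)

print(simple["Gender"].value_counts(normalize=True))
print(stratified["Gender"].value_counts(normalize=True))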


5- Clustering

◘ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter only)
◘ There are many choices of clustering definitions and clustering
algorithms.

[Figure: data points grouped into clusters C1-C4]

  ID    Gender  Age  Marital Status  Score  Cluster
  1021  F       41   NeverM          55     C1
  1022  M       27   Married         35     C1
  1023  M       20   NeverM          480    C2
  1024  F       34   Married         950    C3
  1025  M       74   Married         500    C2
  1026  M       32   Married         500    C2
  1027  M       18   NeverM          890    C3
  1028  M       54   Married         68     C1
  …     …       …    …               …      …
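
A minimal clustering sketch (scikit-learn assumed to be available): k-means groups the records, and only the centroids need to be stored as the reduced representation. The (Age, Score) pairs loosely follow the table above.

import numpy as np
from sklearn.cluster import KMeans

# (Age, Score) pairs loosely mimicking the table above.
X = np.array([[41, 55], [27, 35], [20, 480], [34, 950],
              [74, 500], [32, 500], [18, 890], [54, 68]], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Reduced representation: one centroid per cluster instead of every record.
print("labels   :", kmeans.labels_)
print("centroids:\n", kmeans.cluster_centers_)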
6- Concept Hierarchy Generation

◘ Replace low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
◘ Specification of a hierarchy for a set of values by explicit data grouping
– {Urbana, Champaign, Chicago} < Illinois
◘ The attribute with the most distinct values is placed at the lowest level of
the hierarchy
– street < city < state < country

country 15 distinct values

state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
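
The ordering rule above (fewest distinct values at the top of the hierarchy) can be sketched in a few lines of pandas; the tiny address table is invented for the example.

import pandas as pd

# Hypothetical address records.
df = pd.DataFrame({
    "street":  ["Oak St 1", "Oak St 2", "Main St 5", "Elm St 9", "Pine St 3"],
    "city":    ["Urbana", "Urbana", "Champaign", "Chicago", "Indianapolis"],
    "state":   ["Illinois", "Illinois", "Illinois", "Illinois", "Indiana"],
    "country": ["USA", "USA", "USA", "USA", "USA"],
})

# Fewer distinct values -> higher level; the attribute with the most
# distinct values ends up at the lowest level of the hierarchy.
levels = df.nunique().sort_values().index.tolist()   # ['country', 'state', 'city', 'street']
print(" < ".join(reversed(levels)))                  # street < city < state < country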


Data Reduction Examples

Reduced table, obtained from the detailed table below by horizontal data reduction (e.g., filtering on Amount <= 5 TL), vertical data reduction (the Description column is dropped), aggregation per sale, and concept hierarchy generation (districts such as Buca are generalized to their city):

  SaleID  Products                Date      TotalAmount  Location
  1       Tomatoes, Cheese, Cola  1.1.2008  45           İzmir
  2       Pasta, Tea              3.1.2008  55           İstanbul
  3       Hair care               5.1.2008  5            İstanbul
  4       Cigarettes, Beer        8.1.2008  25           İzmir

Detailed source table (one row per item sold):

  SaleID  Product     Date      Amount  Location               Description
  1       Tomatoes    1.1.2008  20      İzmir, Buca            .....
  1       Cheese      1.1.2008  10      İzmir, Buca            .....
  1       Cola        1.1.2008  15      İzmir, Buca            .....
  2       Pasta       3.1.2008  25      İstanbul, Mecidiyeköy  .....
  2       Tea         3.1.2008  30      İstanbul, Mecidiyeköy  .....
  3       Hair care   5.1.2008  5       İstanbul, Kadıköy      .....
  4       Cigarettes  8.1.2008  15      İzmir, Bornova         .....
  4       Beer        8.1.2008  10      İzmir, Bornova         .....
Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Why Is Data Preprocessing Important?

◘ No quality data → No quality mining results

◘ Quality decisions must be based on quality data


– e.g., duplicate or missing data may cause incorrect or even misleading
statistics.

◘ Data warehouse needs consistent integration of quality data


Why Data Preprocessing?

Data in the real world is dirty.

◘ Incomplete: lacking or missing attribute values


– e.g., occupation=“”

◘ Noisy: containing errors or outliers


– e.g., Salary=“-10”

◘ Inconsistent: containing discrepancies in codes or names


– e.g., Age=“42” Birthday=“03/07/1976”
– e.g., Was rating “1,2,3”, now rating “A, B, C”
– e.g., discrepancy between duplicate records
– e.g., different names used for the same concept (annual vs. yearly)
Example Errors

  Field          Record 1                         Record 2
  NAME           M.Ulku                           Metin Ü.
  SURNAME        SANER                            SANRE
  BIRTH DATE     10/04/1965                       04.10.1965
  CITY           G.ANTEB                          GAZİANTEP
  ADDRESS        Atatürk Cd. Kemaliye Sok. No.25  Atatrk Cad. Kemaliye Mah. 25/3
  TITLE          Gen. Müdr.                       Genel Müdür
  WORKING PLACE  G.Antep D.S.İ.                   Devlet Su İşleri A.O
  .........      .........                        .........
Major Tasks in Data Preprocessing

1. Fill in missing values


2. Remove noisy data
3. Identify and remove outliers
4. Resolve inconsistencies
1- Filling in Missing Values

For example: 10% of the Salary values are missing

Solutions:
1- Ignore the tuple
– Usually done when class label is missing (in classification)
– Not effective when the percentage of missing values per attribute varies
considerably

2- Fill in the missing value manually


– Tedious and infeasible

3- Fill in it automatically with


– A global constant, e.g., “unknown” (but this may effectively create a new class)
– The attribute mean
– The conditioned mean for all samples belonging to the same class
– The most probable value (use Bayesian formula or Decision Tree)
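
A quick pandas sketch of the automatic filling strategies; the small Salary/Cluster frame is hypothetical and only echoes the example on the next slide.

import pandas as pd

# Hypothetical records; the last Salary value is missing.
df = pd.DataFrame({"Cluster": ["C1", "C1", "C2", "C3", "C1"],
                   "Salary":  [1200, 1000, 500, 800, None]})

# 1) Global constant.
df["Salary_const"] = df["Salary"].fillna(900)

# 2) Attribute mean over all records.
df["Salary_mean"] = df["Salary"].fillna(df["Salary"].mean())

# 3) Conditioned mean: mean of the records in the same class/cluster.
df["Salary_cond"] = df["Salary"].fillna(df.groupby("Cluster")["Salary"].transform("mean"))

print(df)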
Example - Filling in Missing Values

◘ Global Constant: 900 (user defined)
◘ Most Repeated Value: 1000
◘ Mean: 1042
◘ Conditioned mean (cluster C1): 1100
◘ Most probable value: 1200 (most similar to customer 1021)

  ID    Gender  Age  Marital Status  Education    Region    Salary  Cluster
  1021  F       41   Married         Masters      Izmir     1200    C1
  1022  M       27   Married         Bach.        Ankara    1000    C1
  1023  M       20   NeverM          High School  Izmir     1000    C2
  1024  F       34   Married         Bach.        İstanbul  1000    C3
  1025  M       74   Married         Middle       Ankara    500     C2
  1026  M       32   Married         PhD          İstanbul  2000    C2
  1027  M       18   NeverM          High School  Ankara    800     C3
  1028  F       43   Married         Master       Izmir     ?       C1
2- Removing Noisy Data

Solutions:

A. Binning
– First sort data and partition into (width or depth) bins
– Then one can
• (a) Equal Depth and Smoothing by Bin Boundaries
• (b) Equal Depth and Smoothing by Bin Means
• (c) Equal Width and Smoothing by Bin Boundaries
• (d) Equal Width and Smoothing by Bin Means

B. Regression
– Smooth by fitting the data into regression functions
A. Binning

◘ Equal-width partitioning
– Divides the range into N intervals of equal size
– if A and B are the lowest and highest values of the attribute, the width of
intervals will be: W = (B –A)/N.
◘ Equal-depth partitioning
– Divides the range into N intervals, each containing approximately same
number of samples

Equal width B1 B1 B2 B2 B2 B2 B2 B2 B2 B3 B3 B3
Price in € 4 6 14 16 18 19 21 22 23 25 27 33
Equal depth B1 B1 B1 B1 B2 B2 B2 B2 B3 B3 B3 B3

Equal-Width Partitioning ((33 − 4) / 3 ≈ 10):
  Bin1 (4-13):  4, 6
  Bin2 (14-23): 14, 16, 18, 19, 21, 22, 23
  Bin3 (24-33): 25, 27, 33

Equal-Depth Partitioning:
  Bin1: 4, 6, 14, 16
  Bin2: 18, 19, 21, 22
  Bin3: 23, 25, 27, 33
A. Binning

◘ Replace all values in a BIN by ONE value (smoothing values)

  Price in €                            4   6  14  16  18  19  21  22  23  25  27  33
  Equal depth                          B1  B1  B1  B1  B2  B2  B2  B2  B3  B3  B3  B3
  Smoothing by bin means (depth)       10  10  10  10  20  20  20  20  27  27  27  27
  Smoothing by bin boundaries (depth)   4   4  16  16  18  18  22  22  23  23  23  33
  Smoothing by bin means (width)        5   5  19  19  19  19  19  19  19  28  28  28
  Smoothing by bin boundaries (width)   4   6  14  14  14  23  23  23  23  25  25  33
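
The same prices binned in Python, as a hedged sketch: pd.cut gives equal-width bins, pd.qcut gives equal-depth bins, and grouping by the bin labels yields smoothing by bin means.

import pandas as pd

prices = pd.Series([4, 6, 14, 16, 18, 19, 21, 22, 23, 25, 27, 33])
k = 3

width_bins = pd.cut(prices, bins=k)    # equal-width: 3 intervals of (33 - 4) / 3
depth_bins = pd.qcut(prices, q=k)      # equal-depth: 3 bins with ~4 values each

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed_width = prices.groupby(width_bins, observed=True).transform("mean")
smoothed_depth = prices.groupby(depth_bins, observed=True).transform("mean")

print(pd.DataFrame({"price": prices,
                    "equal_width_mean": smoothed_width.round(1),
                    "equal_depth_mean": smoothed_depth.round(1)}))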
A. Binning

❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into Equal-Depth bins: * Partition into Equal-Width bins:


- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
* Smoothing by bin means: * Smoothing by bin means:
- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
* Smoothing by bin boundaries: * Smoothing by bin boundaries:
- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
Binning Example

❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into Equal-Depth bins: * Partition into Equal-Width bins:


- Bin 1: 4, 8, 9, 15 - Bin 1: 4, 8, 9
- Bin 2: 21, 21, 24, 25 - Bin 2: 15, 21, 21, 24
- Bin 3: 26, 28, 29, 34 - Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means: * Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 - Bin 1: 7, 7, 7
- Bin 2: 23, 23, 23, 23 - Bin 2: 20, 20, 20, 20
- Bin 3: 29, 29, 29, 29 - Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries: * Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 1: 4, 9, 9
- Bin 2: 21, 21, 25, 25 - Bin 2: 15, 24, 24, 24
- Bin 3: 26, 26, 26, 34 - Bin 3: 25, 25, 25, 25, 34
Binning Example

For example: 3, 8, 10, 11, 15, 19, 23, 29, 35
Equal-width intervals: (35 − 3) / 3 ≈ 10  →  [3-13], [14-24], [25-35]

Equal-Depth Equal-Width
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:

Means Means
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:

Boundaries
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:
Binning Example

For example: 3, 8, 10, 11, 15, 19, 23, 29, 35
Equal-width intervals: (35 − 3) / 3 ≈ 10  →  [3-13], [14-24], [25-35]

Equal-Depth                  Equal-Width
Bin 1: 3, 8, 10              Bin 1: 3, 8, 10, 11
Bin 2: 11, 15, 19            Bin 2: 15, 19, 23
Bin 3: 23, 29, 35            Bin 3: 29, 35

Smoothing by bin means
Bin 1: 7, 7, 7               Bin 1: 8, 8, 8, 8
Bin 2: 15, 15, 15            Bin 2: 19, 19, 19
Bin 3: 29, 29, 29            Bin 3: 32, 32

Smoothing by bin boundaries
Bin 1: 3, 10, 10             Bin 1: 3, 11, 11, 11
Bin 2: 11, 11, 19            Bin 2: 15, 15, 23
Bin 3: 23, 23, 35            Bin 3: 29, 35
B. Regression

[Figure: noisy data points (X1, Y1) smoothed onto the fitted regression line y = x + 1, giving Y1′]
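
A brief NumPy sketch of regression-based smoothing; the noisy points around the line y = x + 1 are generated just for illustration.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + 1 + rng.normal(scale=0.5, size=x.size)   # noisy observations around y = x + 1

# Fit a straight line and replace each noisy y by its fitted (smoothed) value.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept

print(f"fitted line: y = {slope:.2f} * x + {intercept:.2f}")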
3- Removing Outliers

◘ Outlier: Data points inconsistent with the majority of data


◘ Removal methods
– Clustering
– Curve-fitting

[Figure: clustering of the data; points falling outside all clusters are flagged as outliers]
4- Resolve inconsistencies

◘ Data discrepancy detection


– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule

◘ Data Type Conversion may be necessary


– Different representations, different scales, e.g., metric vs. British units

◘ For example: inconsistency in naming convention


Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Transformation

◘ It is the process of changing the form or structure of existing attributes.
– Convert data into a common format
– Transform data into a new format

◘ It involves converting data into a single common format acceptable to the data mining methodology.
Data Transformation Example

Data Warehouse

appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female

appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
Encoding Errors

◘ Education Field
– C: college
– U: university
– H: high school
– D: doctorate
– M: master
– S : secondary school
– P: primary school
– I : illiterate

but undefined values such as X, Q, Y, or T may appear in the data


Data Transformation

◘ NominalToBinary / Dummy Attributes (see the sketch after this list)
– Numeric / Nominal → Boolean (0-1) or ordinal codes (0-1-2-3-4-…)
– Categorical values become dummy attributes

◘ Normalization: scaled to fall within a small, specified range


– Min-max normalization
– Z-score normalization
– Normalization by decimal scaling

◘ Discretization
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
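
As referenced above, a minimal sketch of NominalToBinary / dummy attributes with pandas; the Gender and Education columns and the ordinal mapping are assumptions chosen for the example.

import pandas as pd

df = pd.DataFrame({"Gender":    ["F", "M", "M", "F"],
                   "Education": ["Masters", "Bach.", "High School", "Bach."]})

# NominalToBinary: every nominal value becomes its own 0/1 dummy attribute.
dummies = pd.get_dummies(df, columns=["Gender", "Education"], dtype=int)
print(dummies)

# Ordinal alternative: map an ordered nominal attribute to 0, 1, 2, ...
order = {"High School": 0, "Bach.": 1, "Masters": 2}
df["Education_ordinal"] = df["Education"].map(order)
print(df)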
Normalization

◘ Min-max normalization to [new_minA, new_maxA]:

    v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

◘ Z-score normalization (μA: mean, σA: standard deviation of attribute A):

    v' = (v − μA) / σA
Normalization Example

◘ Min-max normalization:  v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

– Ex. Let income range from $12,000 to $98,000 be normalized to [0, 1]. Then $73,600 is mapped to

    ((73,600 − 12,000) / (98,000 − 12,000)) × (1 − 0) + 0 = 0.716

◘ Z-score normalization:  v' = (v − μA) / σA

– Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to

    (73,600 − 54,000) / 16,000 = 1.225
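
The same two normalizations as a short NumPy sketch, reusing the income figures from the example above (the four-value array itself is only illustrative).

import numpy as np

incomes = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to [0, 1]: 73,600 -> 0.716.
min_max = (incomes - incomes.min()) / (incomes.max() - incomes.min())

# Z-score normalization with mu = 54,000 and sigma = 16,000: 73,600 -> 1.225.
mu, sigma = 54_000.0, 16_000.0
z_scores = (incomes - mu) / sigma

print(min_max.round(3))
print(z_scores.round(3))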
Standard Deviation

◘ Mean (average): μ = (x1 + x2 + … + xn) / n

◘ Standard deviation: σ = sqrt( ((x1 − μ)² + (x2 − μ)² + … + (xn − μ)²) / n )
Data Transformation Example

  Price in €         4     6    14    16    18    19    21    22    23    24    27    34
  Min-max [0,1]      0   .06   .33   .4    .46   .5    .56   .6    .63   .66   .76   1
  Z-score         -1.8  -1.6  -0.6  -0.3  -0.1   0     0.2   0.4   0.5   0.6   1     1.8
  Decimal scaling  .04   .06   .14   .16   .18   .19   .21   .22   .23   .24   .27   .34

  Min-max:          v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
  Z-score:          v' = (v − μA) / σA
  Decimal scaling:  v' = v / 10^j
Discretization

◘ Discretization:
– Divide the range of a continuous attribute into intervals.
– Some classification algorithms only accept categorical attributes.
– Reduce data size by discretization, especially for numerical data
◘ Discretization Methods
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
  #   Age   Age (discretized)   Buys Computer
  1    10   (0..17]             No
  2    14   (0..17]             No
  3    20   (17..55]            Yes
  4    22   (17..55]            Yes
  5    44   (17..55]            Yes
  6    48   (17..55]            No
  7    52   (17..55]            Yes
  8    70   (55..100]           No
  9    76   (55..100]           No
Fixed k-Interval Discretization

◘ vmin is the minimum observed value


◘ vmax is the maximum observed value

◘ Intervals have width w = (vmax - vmin) / k

◘ The cut points are


vmin + w , vmin + 2w , ... , vmin + (k - 1)w

◘ Replace continuous values in Attribute with discrete ranges or labels


Fixed k-Interval Discretization

◘ Use the Fixed 4-Interval Discretization method to discretize the following data set.

  Interval width: (82 − 10) / 4 = 72 / 4 = 18   →   [10-28], (28-46], (46-64], (64-82]

  Customer ID   Age   Age (discretized)
  1             10    [10-28]
  2             14    [10-28]
  3             20    [10-28]
  4             22    [10-28]
  5             44    (28-46]
  6             48    (46-64]
  7             52    (46-64]
  8             70    (64-82]
  9             76    (64-82]
  10            82    (64-82]
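
A short pandas sketch that reproduces the fixed 4-interval discretization above; pd.cut with the computed edges is assumed as the implementation.

import pandas as pd

ages = pd.Series([10, 14, 20, 22, 44, 48, 52, 70, 76, 82])
k = 4

w = (ages.max() - ages.min()) / k                    # (82 - 10) / 4 = 18
edges = [ages.min() + i * w for i in range(k + 1)]   # 10, 28, 46, 64, 82

# include_lowest keeps the minimum value (10) inside the first interval.
discretized = pd.cut(ages, bins=edges, include_lowest=True)
print(pd.DataFrame({"Age": ages, "Age_interval": discretized}))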
