
DATA MINING: LECTURE 4
Chapter 2: Data Preprocessing

Let's prepare data for mining!


Agenda

• Data Preprocessing
• Data Cleaning
• Data Integration

Major Tasks in Data Preprocessing

• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data reduction
  • Obtains a reduced representation that is smaller in volume but produces the same or similar analytical results
• Data transformation
  • Normalization and aggregation

DATA PREPROCESSING

Data Cleaning

• Importance
  • "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  • "Data cleaning is the number one problem in data warehousing" (DCI survey)
• Data cleaning tasks
  • Fill in missing values
  • Identify outliers and smooth out noisy data
  • Correct inconsistent data
  • Resolve redundancy caused by data integration

Missing Data

• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • inconsistency with other recorded data, leading to deletion
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • failure to register the history or changes of the data
• Missing data may need to be inferred.
How to Handle Missing Data?

• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
• Fill in the missing value manually: tedious + infeasible?
• Fill in automatically with
  • a global constant: e.g., "unknown", a new class?!
  • the attribute mean
  • the attribute mean for all samples belonging to the same class: smarter
  • the most probable value: inference-based, such as regression, a Bayesian formula, or a decision tree (see the sketch after this list)
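
A minimal pandas sketch of the automatic fill-in strategies above; the DataFrame, the income attribute, and the class labels are made up for illustration.

```python
# Illustrative fill-in strategies for missing values (made-up data).
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, None, 42_000, None, 61_000, 58_000],
    "class":  ["low", "low", "low", "high", "high", "high"],
})

# Attribute mean: replace missing values with the overall column mean.
filled_mean = df["income"].fillna(df["income"].mean())

# Class-conditional mean: mean of the samples belonging to the same class.
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

# Global constant: replace every missing value with a placeholder category.
filled_const = df["income"].fillna("unknown")
```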
Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may occur due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistency in naming conventions
• Other data problems which require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data
How to Handle Noisy Data?

• Binning
  • first sort the data and partition it into (equal-frequency) bins
  • then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • smooth by fitting the data to regression functions
• Outlier analysis
  • Clustering may be used to detect and remove outliers.
• Combined computer and human inspection
  • detect suspicious values and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning

• Equal-width (distance) partitioning
  • Divides the range into N intervals of equal size: uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N
  • The most straightforward, but outliers may dominate the presentation
  • Skewed data is not handled well

• Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Good data scaling
  • Managing categorical attributes can be tricky
Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
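
A minimal sketch of the equal-frequency binning and smoothing shown above; the prices and bin count come from the slide, and rounding the bin means to whole dollars is an assumption made to match the slide's output.

```python
# Equal-frequency binning and two smoothing strategies (prices from the slide).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of min and max.
by_bounds = [
    [b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
    for b in bins
]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```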
Binning (Equal Width)

■ Data: [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
■ We want to divide it into 3 bins (n = 3).
■ Step 1: Find the min and max
  a = 5 (minimum value), b = 30 (maximum value)
■ Step 2: Calculate the bin width
  Bin width = (b − a) / n = (30 − 5) / 3 = 25/3 ≈ 8.33
■ Step 3: Assign values to bins
  Bin 1 (5 – 13.33): 5, 7, 10
  Bin 2 (13.33 – 21.66): 15, 18, 21
  Bin 3 (21.66 – 30): 22, 25, 27, 30
Now these bins can be smoothed by mean, median or boundary (a small code sketch follows).
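
A minimal sketch of the same equal-width computation; the data and the number of bins come from the slide, and the rule that the maximum value falls into the last bin is an assumption.

```python
# Equal-width binning of the slide's data into n = 3 bins.
data = [5, 7, 10, 15, 18, 21, 22, 25, 27, 30]
n = 3

a, b = min(data), max(data)          # a = 5, b = 30
width = (b - a) / n                  # 25 / 3 ≈ 8.33

# Assign each value to a bin index 0..n-1 (the max value is clamped into the last bin).
bins = [[] for _ in range(n)]
for v in data:
    idx = min(int((v - a) // width), n - 1)
    bins[idx].append(v)

print(bins)  # [[5, 7, 10], [15, 18, 21], [22, 25, 27, 30]]
```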
Regression

[Figure: scatter of two attributes with a fitted regression line y = x + 1; a known value x1 is used to predict y1']

• Data is fitted to a function
• Linear regression finds the line that best fits two attributes
• One attribute is used to predict the other
• Multiple linear regression is used where more than two attributes are involved
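
A minimal sketch of smoothing by regression: fit y as a linear function of x and replace each noisy y with the fitted value. The data is made up to roughly follow the slide's y = x + 1 line, and NumPy's least-squares polyfit stands in for whatever regression routine is actually used.

```python
# Smooth one attribute by regressing it on another (made-up data).
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.3, 6.9])   # roughly y = x + 1 plus noise

slope, intercept = np.polyfit(x, y, deg=1)      # least-squares fitted line
y_smooth = slope * x + intercept                # smoothed (predicted) values
```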
Cluster Analysis

• Data is clustered
• Values that fall outside the clusters are considered noise
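
One way to realize this idea is density-based clustering; the sketch below uses scikit-learn's DBSCAN, which labels points that belong to no cluster as -1. The data, eps, and min_samples values are made up for illustration.

```python
# Clustering-based outlier detection: points outside any cluster are noise.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # cluster 1
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # cluster 2
              [4.5, 0.2]])                          # isolated point

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
outliers = X[labels == -1]
print(outliers)   # [[4.5 0.2]]
```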
Data Cleaning and Data Reduction

• Binning techniques reduce the number of distinct values per attribute
  • Binning may reduce prices for products into inexpensive, moderate and expensive
  • Useful for decision tree induction
  • May be bad in some circumstances, name one?
Data Cleaning as a Process

• Data discrepancy detection is the first task.
• Types of discrepancies
  • Errors in data collection
  • Deliberate errors (data providers concealing data)
  • Data decay (outdated data, such as changed addresses)
  • Inconsistent data representation and use of codes
  • Data integration errors
• Use metadata (e.g., domain, range, dependency, distribution)
  • Metadata is data about data
• Identify and remove outliers
• Check field overloading
  • Field overloading is squeezing extra data into parts of a field not originally intended for that purpose (e.g., if only 31 of a field's 32 bits are used, extra information is packed into the spare bit)
• Check the uniqueness rule, consecutive rule and null rule
Data Cleaning as a Process

• Using commercial tools for data cleaning
  • Data scrubbing:
    • use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    • Techniques used: parsing and fuzzy matching
  • Data auditing:
    • analyze the data to discover rules and relationships and to detect violators
    • Techniques used: correlation, clustering and descriptive data summaries to find outliers
• Data transformation: migration and integration
  • Data migration tools: allow transformations to be specified
    • e.g., computing age from birthdate
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
    • Only specific transforms are allowed, which sometimes requires custom scripts
Data Cleaning as a Process

• The two-step process involves discrepancy detection and transformation
  • The process is error-prone and time-consuming
  • It is iterative, and some problems may only be removed after several iterations
    • An incorrect year entry such as 20004 may only be fixed after correcting all date entries
  • Recent techniques emphasize interactivity
    • e.g., Potter's Wheel – http://control.cs.berkeley.edu/abc
  • Declarative languages have been developed for specifying data transformations as extensions to SQL
  • Metadata must be updated to speed up future cleaning
DATA INTEGRATION

Data Integration

• Data integration:
  • Combines data from multiple sources (data cubes, databases and flat files) into a coherent store, such as a data warehouse
• Schema integration: e.g., A.cust-id ≡ B.cust-#
  • Integrate data from different sources
• Entity identification problem:
  • Identify real-world entities from multiple data sources
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may differ
  • Possible reasons: different representations, different scales, e.g., metric vs. British units
  • Metadata may be used to resolve the problem
Handling Redundancy in Data Integration

• Redundant data often occur when integrating multiple databases
  • Object identification: The same attribute or object may have different names in different databases
  • Derivable data: One attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis
• Careful integration of the data from multiple sources may help
  • reduce/avoid redundancies and inconsistencies
  • improve mining speed and quality
Correlation Analysis (Numerical Data)

■ Correlation coefficient (also called Pearson's product moment coefficient)

$$ r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B} $$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-product.

■ If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
■ r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
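
A minimal sketch of computing r_{A,B} directly from the first form of the formula above; the two attributes are made-up data, and NumPy's built-in corrcoef is shown only as a cross-check.

```python
# Pearson correlation coefficient from the definition (made-up attributes).
import numpy as np

A = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
B = np.array([1.5, 3.1, 4.4, 6.2, 7.4])

n = len(A)
r = ((A - A.mean()) * (B - B.mean())).sum() / (
    (n - 1) * A.std(ddof=1) * B.std(ddof=1)   # sample standard deviations
)

print(round(r, 4))                 # close to +1: strong positive correlation
print(np.corrcoef(A, B)[0, 1])     # same value from NumPy's built-in
```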
Correlation Analysis (Categorical Data)

■ χ² (chi-square) test, or Pearson's chi-square statistic

$$ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} $$

■ The larger the χ² value, the more likely the variables are related
■ The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
■ Correlation does not imply causality
  – The number of hospitals and the number of car thefts in a city are correlated
  – Both are causally linked to a third variable: population
Chi-Square Calculation: An Example

                             Male       Female      Sum (row)
  Like science fiction       250 (90)   200 (360)     450
  Not like science fiction    50 (210)  1000 (840)   1050
  Sum (col.)                  300       1200         1500

■ χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated from the data distribution in the two categories):

$$ \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93 $$

■ It shows that gender and preferred reading are correlated in the group
■ The degrees of freedom are (r−1)×(c−1) = (2−1)×(2−1) = 1, and the corresponding critical value from the χ² distribution (at the 0.001 significance level) is 10.828; since 507.93 far exceeds it, the two attributes are strongly correlated
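
A minimal sketch reproducing this calculation with scipy.stats; the observed counts come from the table above, and Yates' continuity correction is disabled so the result matches the hand calculation.

```python
# Chi-square test on the science-fiction contingency table from the slide.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250,  200],    # like science fiction
                     [ 50, 1000]])   # do not like science fiction

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # ≈ 507.93
print(dof)       # 1
print(expected)  # [[ 90. 360.] [210. 840.]]
```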
