Data Mining: Concepts and Techniques: - Slides For Textbook - Chapter 2 &3

The document discusses data preprocessing techniques for data mining. It covers topics like data cleaning, integration, transformation, and reduction. Data cleaning involves handling missing data, noisy data, and inconsistencies. Data integration combines data from multiple sources. Data transformation includes normalization, aggregation, and attribute construction. Data reduction aims to reduce data volume while preserving analytical results, using methods like discretization, dimensionality reduction, and data compression.

Uploaded by

habibsultanbscs


Data Mining: Concepts and Techniques

Slides for Textbook Chapter 2 &3

November 14, 2012

Data Mining: Concepts and Techniques

Chapter 3: Data Preprocessing

* Why preprocess the data?
* Data cleaning
* Data integration and transformation
* Data reduction
* Discretization and concept hierarchy generation
* Summary

Why Data Preprocessing?

* Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
* No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data

Multi-Dimensional Measure of Data Quality

* A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility

Major Tasks in Data Preprocessing

* Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
* Data integration
  - Integration of multiple databases, data cubes, or files
* Data transformation
  - Normalization and aggregation
* Data reduction
  - Obtains reduced representation in volume but produces the same or similar analytical results
* Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

Forms of data preprocessing


Data Cleaning

* Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Missing Data

* Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
* Missing data may be due to
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
* Missing data may need to be inferred.

How to Handle Missing Data?

* Ignore the tuple: usually done when class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
* Fill in the missing value manually: tedious + infeasible?
* Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
* Use the attribute mean to fill in the missing value
* Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
* Use the most probable value to fill in the missing value: inference-based such as Bayesian formula or decision tree
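The two mean-based strategies can be compared on a toy example. The following is a minimal sketch, not the textbook's implementation: the `records` data, the `None` marker for a missing value, and the function names are all hypothetical.

```python
from statistics import mean

# Hypothetical toy records: (class_label, income); None marks a missing value.
records = [
    ("low", 20.0),
    ("low", None),
    ("low", 30.0),
    ("high", 80.0),
    ("high", None),
    ("high", 100.0),
]

def fill_with_attribute_mean(rows):
    """Replace each missing value with the mean of all known values."""
    known = [v for _, v in rows if v is not None]
    overall = mean(known)
    return [(c, overall if v is None else v) for c, v in rows]

def fill_with_class_mean(rows):
    """Replace each missing value with the mean of its own class.

    Assumes every class has at least one known value.
    """
    by_class = {}
    for c, v in rows:
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]

global_filled = fill_with_attribute_mean(records)
class_filled = fill_with_class_mean(records)
```

With this data the global mean fills both gaps with the same value, while the class-conditional mean fills each gap with a value typical of its own class, which is why the slide calls it "smarter".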

Noisy Data

* Noise: random error or variance in a measured variable
* Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
* Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?

* Binning method:
  - first sort data and partition into (equi-depth) bins
  - then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
* Clustering
  - detect and remove outliers
* Combined computer and human inspection
  - detect suspicious values and check by human
* Regression
  - smooth by fitting the data into regression functions

Simple Discretization Methods: Binning

* Equal-width (distance) partitioning:
  - It divides the range into N intervals of equal size: uniform grid
  - if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N.
  - The most straightforward
  - But outliers may dominate presentation
  - Skewed data is not handled well.
* Equal-depth (frequency) partitioning:
  - It divides the range into N intervals, each containing approximately same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky.
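The two partitioning schemes can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the equal-width version assumes the attribute has at least two distinct values (so W > 0), and the choice to place the maximum value B in the last interval is an assumption the slides do not spell out.

```python
def equal_width_bins(values, n):
    """Assign values to n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n  # assumes b > a
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value B would fall just past the last interval,
        # so clamp its index to the last bin.
        idx = min(int((v - a) / width), n - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins of approximately equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        # The first `extra` bins take one additional value each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width_bins = equal_width_bins(prices, 3)
depth_bins = equal_depth_bins(prices, 3)
```

On this price list the equal-width bins hold 3, 3, and 6 values, whereas the equal-depth bins hold 4 values each, which illustrates how equal-width partitioning follows the value range rather than the data distribution.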

Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to whole dollars):
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
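The smoothing steps in this example can be reproduced with a short sketch. The function names are hypothetical, the means are rounded to integers to match the slide, and sending a value that is equally close to both boundaries to the lower one is an assumption.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)
```

The exact bin means are 9, 22.75, and 29.25, so the slide's values 23 and 29 are rounded.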

Cluster Analysis


Data Integration

* Data integration:
  - combines data from multiple sources into a coherent store
* Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-#
* Detecting and resolving data value conflicts
  - for the same real world entity, attribute values from different sources are different
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration

* Redundant data occur often when integrating multiple databases
  - The same attribute may have different names in different databases
  - One attribute may be a derived attribute in another table, e.g., annual revenue
* Redundant data can often be detected by correlation analysis
* Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
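One common form of correlation analysis is the Pearson correlation coefficient between two numeric attributes. The sketch below is illustrative only: the attribute values, the function name, and the 0.95 threshold are hypothetical, and the slides do not prescribe a particular coefficient or cutoff.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)  # assumes neither attribute is constant

# Hypothetical attributes: annual revenue is derived from monthly revenue.
monthly_revenue = [10.0, 12.0, 9.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]
store_rating = [3.0, 1.0, 4.0, 1.0, 5.0]

THRESHOLD = 0.95  # hypothetical cutoff for flagging a likely redundancy

r_derived = pearson(monthly_revenue, annual_revenue)
r_other = pearson(monthly_revenue, store_rating)
derived_is_redundant = abs(r_derived) > THRESHOLD
other_is_redundant = abs(r_other) > THRESHOLD
```

A coefficient near +1 or -1 suggests that one attribute carries little information beyond the other, as with the derived annual revenue here; a high correlation is only a hint and still needs a human decision about which attribute to keep.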

Data Transformation

* Smoothing: remove noise from data
* Aggregation: summarization, data cube construction
* Generalization: concept hierarchy climbing
* Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
* Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization

* min-max normalization

\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]

* z-score normalization

\[ v' = \frac{v - \mathit{mean}_A}{\mathit{stand\_dev}_A} \]

* normalization by decimal scaling

\[ v' = \frac{v}{10^{j}} \]

  where j is the smallest integer such that Max(|v'|) < 1
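The three normalization formulas translate directly into code. This is a minimal sketch under stated assumptions: the function names and sample values are hypothetical, the z-score uses the population standard deviation (the slides do not say which variant is meant), and the decimal-scaling search starts at j = 0, so values already below 1 in magnitude are left unchanged.

```python
from math import sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    """Map values linearly from [min_A, max_A] to [new_min, new_max]."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = sqrt(sum((v - mu) ** 2 for v in values) / n)  # assumes sd > 0
    return [(v - mu) / sd for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest j >= 0 with max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled_01 = min_max([200, 300, 400, 600])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
scaled_dec, j = decimal_scaling([-986, 217, 917])
```

For the last call the largest magnitude is 986, so j = 3 and every value is divided by 1000.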


Data Reduction Strategies

* Warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set
* Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results
* Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Numerosity reduction
  - Discretization and concept hierarchy generation

Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree with test nodes A4?, A1?, and A6? and leaves labeled Class 1 and Class 2]

Reduced attribute set: {A1, A4, A6}

Data Compression

* String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
* Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of signal can be reconstructed without reconstructing the whole
* Time sequence is not audio
  - Typically short and vary slowly with time
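The lossless property of string compression can be checked with a round trip. The sketch below uses Python's standard `zlib` module purely as an illustration; the slides do not name a specific algorithm, and the sample text is hypothetical.

```python
import zlib

# Hypothetical, highly repetitive input so that compression pays off.
text = ("data mining " * 50).encode("utf-8")

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

is_lossless = restored == text
is_smaller = len(compressed) < len(text)
```

The compressed bytes cannot be searched or edited as text, which is the "limited manipulation without expansion" caveat: the data must be decompressed before most operations.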

Data Compression

[Figure: Original Data is reduced to Compressed Data; a lossless method recovers the Original Data exactly, while a lossy method recovers only Original Data Approximated]

Histograms

* A popular data reduction technique
* Divide data into buckets and store average (sum) for each bucket
* Can be constructed optimally in one dimension using dynamic programming
* Related to quantization problems.

[Figure: histogram with axis ticks from 10000 to 100000 in steps of 10000 and from 0 to 40 in steps of 5]
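Bucket aggregates can be sketched with a simple equal-width histogram. Note the assumptions: this is the plain equal-width construction, not the optimal dynamic-programming one mentioned above, and the function name, dictionary keys, and sample data are hypothetical.

```python
def equal_width_histogram(values, n_buckets):
    """Summarize values as n_buckets equal-width buckets with count, sum, avg."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    width = (hi - lo) / n_buckets
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width,
         "count": 0, "sum": 0.0}
        for i in range(n_buckets)
    ]
    for v in values:
        # Clamp so that the maximum value lands in the last bucket.
        i = min(int((v - lo) / width), n_buckets - 1)
        buckets[i]["count"] += 1
        buckets[i]["sum"] += v
    for b in buckets:
        # An empty bucket has no average.
        b["avg"] = b["sum"] / b["count"] if b["count"] else None
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
hist = equal_width_histogram(prices, 3)
```

Only the bucket boundaries and aggregates need to be stored, so twelve values shrink to three small records here; the reduction is far larger on real data.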

Clustering

* Partition data set into clusters, and one can store cluster representation only
* Can be very effective if data is clustered but not if data is "smeared"
* Can have hierarchical clustering and be stored in multidimensional index tree structures
* There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Sampling

* Allow a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
* Choose a representative subset of the data
  - Simple random sampling may have very poor performance in the presence of skew
* Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
* Sampling may not reduce database I/Os (page at a time).
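Stratified sampling can be sketched by sampling each class separately at the same rate. This is a minimal illustration, not a prescribed method: the function name, the `key` and `fraction` parameters, the fixed seed, and the rule of keeping at least one row per stratum are all assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of each stratum, without replacement."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for members in strata.values():
        # Keep at least one row so that rare classes are never lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 95 rows of a common class, 5 of a rare one.
rows = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]
sample = stratified_sample(rows, key=lambda r: r[0], fraction=0.2)
n_common = sum(1 for label, _ in sample if label == "common")
n_rare = sum(1 for label, _ in sample if label == "rare")
```

A simple random sample of 20 rows from this data would miss the rare class entirely about a third of the time, whereas the stratified sample always preserves the 95:5 proportion.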

Sampling

[Figure: samples drawn from Raw Data]

Sampling

[Figure: Raw Data and the resulting Cluster/Stratified Sample]

Hierarchical Reduction

* Use multi-resolution structure with different degrees of reduction
* Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
* Parametric methods are usually not amenable to hierarchical representation
* Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by value range of some attributes
  - Each partition can be considered as a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram


Discretization

* Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
* Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes.
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy

* Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.
* Concept hierarchies
  - reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).
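Replacing numeric ages with higher level concepts can be sketched as a lookup against interval boundaries. The slides name the labels but not the boundaries, so the cut points 35 and 60 below are hypothetical, as are the identifiers.

```python
from bisect import bisect_right

# Hypothetical cut points: ages below 35 are "young", 35 to 59 are
# "middle-aged", and 60 or above are "senior".
CUTS = [35, 60]
LABELS = ["young", "middle-aged", "senior"]

def age_to_concept(age):
    """Return the interval label that replaces a numeric age."""
    return LABELS[bisect_right(CUTS, age)]

ages = [23, 35, 47, 60, 71]
labels = [age_to_concept(a) for a in ages]
```

After the replacement the attribute has only three distinct values, however many distinct ages the raw data contained.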

Specification of a set of attributes

* Concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set. The attribute with the most distinct values is placed at the lowest level of the hierarchy.
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3567 distinct values
  - street: 674,339 distinct values
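The ordering rule can be sketched by sorting attributes on their distinct-value counts. The counts are the ones given on the slide; the function and variable names are hypothetical, and in practice the counts would be computed from the data rather than typed in.

```python
# Distinct-value counts from the slide, deliberately listed out of order.
distinct_counts = {
    "street": 674339,
    "country": 15,
    "city": 3567,
    "province_or_state": 65,
}

def generate_hierarchy(counts):
    """Order attributes from the top level (fewest distinct values) down."""
    return sorted(counts, key=counts.get)

hierarchy = generate_hierarchy(distinct_counts)
```

The heuristic is not foolproof: an attribute with few distinct values is not always the most general concept, so a generated hierarchy should be reviewed before use.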

