Data preparation and processing
Mahmoud Rafeek Alfarra
http://mfarra.cst.ps
University College of Science & Technology- Khan yonis
Development of computer systems
2016
Chapter 2 – Lecture 1
Outline
 Introduction
 Domain Expert
 Goal identification and Data Understanding
 Data Cleaning
 Missing values
 Noisy Data
 Inconsistent Data
 Data Integration
 Data Transformation
 Data Reduction
 Feature Selection
 Sampling
 Discretization
Introduction
 Real-world databases used in data mining may have
millions of records and thousands of variables. They are
typically noisy and contain missing and inconsistent
values.
Data quality is a key issue with data mining so data
preparation is a necessary step for serious, effective,
real-world data mining.
Introduction
To increase the accuracy of mining, data preprocessing
has to be performed.
Otherwise, garbage in => garbage out
Data preparation is estimated to take 70-80% of the
time and effort.
Introduction
Domain Expertise
 Data quality expert: “We found these strange records
in your database after running sophisticated
algorithms!”
 Domain Experts: “Oh, those apples - we put them
in the same baskets as oranges because there are too
few apples to bother. Not a big deal. We knew that
already.”
Domain Expertise
Domain Expertise is important for understanding the
data, the problem and interpreting the results.
“The counter resets to 0 if the number of calls exceeds N”.
“The missing values are represented by 0, but the default billed
amount is 0 too.”
Insufficient domain expertise is a primary cause of
poor data quality: the data become unusable.
Goal Identification
 To obtain the highest benefit from data mining, there
must be a clear statement of the business objectives.
 The first and most important step in any targeting-
model project is to establish a clear goal and develop a
process to achieve that goal.
Goal Identification
 Examples of goals for a business:
 You want to attract new customers.
 You want to avoid high-risk customers.
 You want to understand the characteristics of your current customers.
 You want to make your unprofitable customers more profitable.
 You want to retain your profitable customers.
 You want to win back your lost customers.
 You want to improve customer satisfaction.
 You want to increase sales.
 You want to reduce expenses.
Data Understanding
 Data understanding starts with an initial data collection
and proceeds with activities to get familiar with the data,
to identify data quality problems, and to discover first
insights into the data.
Data Understanding
Data Understanding: Relevance
 What data is available for the task?
 Is this data relevant?
 Is additional relevant data available?
 How much historical data is available?
 Who is the data expert?
Data Understanding
Data Understanding: Quantity
 Number of instances (records)
     Rule of thumb: 5,000 or more desired
     If fewer, results are less reliable
 Number of attributes (fields)
     Rule of thumb: for each field, 10 or more instances
     If there are more fields, use feature reduction and selection
 Number of targets
     Rule of thumb: >100 for each class
     If very unbalanced, use stratified sampling
Data Cleaning
Goal identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction
Data Cleaning
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      (empty)          125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         -95k             Yes
6     No       Married          60K              No
7     Yes      (empty)          220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes
Columns are attributes; rows are objects.
Data Cleaning
 Real-world data tends to be incomplete, noisy and
inconsistent.
 Data Cleaning Steps
 Missing values
 Noisy Data
 Inconsistent Data
Missing values
 A missing value (Mv) is an empty cell in the table
that represents a dataset.
(Figure: a table whose rows are instances and whose columns are attributes, with a missing cell marked "?".)
Dealing with missing values
1. Ignore records with missing values:
 This is usually done when the class label is missing.
 This method is not effective, unless the record contains
several attributes with missing values.
Dealing with missing values
2. Fill in the missing value manually:
In general, this approach is time-consuming and may not be
feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value:
Replace all missing values by the same constant, such as
“unknown”. Although this method is simple, it is not
recommended, because results based on “unknown” values are
not “interesting”.
Dealing with missing values
4. Use the attribute mean to fill missing values:
For example in attribute income if the mean income is 28000,
use this value to replace the missing values.
5. Use the attribute mean for all samples belonging to the
same class
For example, if classifying customers according to credit risk,
replace the missing value with the mean income value for
customers in the same credit risk category as that of the given
record.
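Methods 4 and 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the income figures and the "high"/"low" credit-risk labels are made up, missing values are represented by None, and the function names are hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Method 4: replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Method 5: replace None entries with the mean of the known values
    that share the same class label.
    Assumes every class has at least one known value."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical incomes and credit-risk categories (not from the slides)
incomes = [20000, None, 36000, 50000, None, 70000]
risk = ["high", "high", "high", "low", "low", "low"]

print(fill_with_mean(incomes))              # both gaps get the overall mean
print(fill_with_class_mean(incomes, risk))  # each gap gets its own class mean
```

With this toy data the overall mean is 44000, while the class means are 28000 for "high" and 60000 for "low", so the class-based fill stays closer to similar records.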
Dealing with missing values
6. Use an advanced method
such as k-nearest neighbors or a decision tree to
predict the missing value from the other values.
Dealing with missing values
k nearest neighbors Approach
Compute the k nearest neighbors and assign a value
from them.
Dealing with missing values
k nearest neighbors Approach
 For nominal values, use the most common value
among all neighbors.
 For numerical values use the average value.
 In either case, we need to define a proximity measure
between instances, such as Euclidean distance.
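A minimal sketch of the k nearest neighbors approach follows. It assumes records are dictionaries, that missing values are None, and that the other records are complete on the numeric attributes used for the distance; the function name and the toy records are hypothetical. Attributes are not rescaled here, so in practice an attribute with a large range would dominate the Euclidean distance.

```python
import math
from collections import Counter

def knn_impute(records, target, attr, k=3):
    """Estimate target[attr] from the k nearest records that have a value for attr.

    The distance is Euclidean over the numeric attributes of `target`
    other than `attr`.
    """
    features = [a for a, v in target.items()
                if a != attr and isinstance(v, (int, float))]
    candidates = [r for r in records if r.get(attr) is not None]

    def distance(r):
        return math.sqrt(sum((r[a] - target[a]) ** 2 for a in features))

    neighbors = sorted(candidates, key=distance)[:k]
    values = [r[attr] for r in neighbors]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)           # numerical: average value
    return Counter(values).most_common(1)[0][0]    # nominal: most common value

# Hypothetical records (income in thousands)
records = [
    {"age": 25, "income": 30, "status": "Single"},
    {"age": 27, "income": 32, "status": "Single"},
    {"age": 45, "income": 80, "status": "Married"},
    {"age": 48, "income": 85, "status": "Married"},
    {"age": 26, "income": 31, "status": "Married"},
]

print(knn_impute(records, {"age": 26, "income": None, "status": "Single"}, "income"))
print(knn_impute(records, {"age": 26, "income": 31, "status": None}, "status"))
```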
Next:
Data Cleaning: Noisy Data
Chapter 2 – Lecture 2
Introduction
 Noise is a random error in a measured variable.
 Noisy data is meaningless data.
 Any data that has been received, stored or changed
in such a manner that it cannot be read or used by the
program that originally created it can be described as
noisy.
Noisy Data
 Sources of noisy data:
1. Data entry problem.
2. Faulty data collection instruments.
3. Data transmission.
Noisy Data
How to handle noisy data?
 Binning method
 Clustering
 Combined computer and human inspections
 Regression
How to handle noisy data ?
 Binning method:
1. Sort data
2. Partition into equal-frequency groups.
3. One can smooth by group means, smooth by
group median, smooth by group boundaries, etc.
How to handle noisy data ?
Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equal-frequency) groups:
-G1: 4, 8, 9, 15
-G2: 21, 21, 24, 25
-G3: 26, 28, 29, 34
Smoothing by bin means:
-G1: 9, 9, 9, 9
-G2: 23, 23, 23, 23
-G3: 29, 29, 29, 29
Smoothing by bin boundaries:
-G1: 4, 4, 4, 15
-G2: 21, 21, 25, 25
-G3: 26, 26, 26, 34
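The price example above can be reproduced with a short Python sketch. The function names are illustrative, the data length is assumed to be divisible by the number of bins, bin means are rounded to the nearest integer as on the slide, and a value equally far from both boundaries is sent to the lower one (an arbitrary choice made here).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and cut it into n_bins groups of equal size."""
    data = sorted(values)
    size = len(data) // n_bins  # assumes len(values) is divisible by n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
groups = equal_frequency_bins(prices, 3)

print(groups)
print(smooth_by_means(groups))
print(smooth_by_boundaries(groups))
```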
How to handle noisy data ?
Clustering: Outliers may be detected by clustering,
where similar values are organized into groups. Values
that fall outside the set of clusters may be considered
outliers.
How to handle noisy data ?
 Combined computer and human inspections: Outliers
may be identified by having the computer detect suspicious
values, which are then checked by a human.
How to handle noisy data ?
 Regression: Data can be smoothed by fitting the
data to a function.
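As one possible illustration, the sketch below fits a straight line by ordinary least squares and replaces each observed value with the value on the line. The slides do not say which function to fit, so the linear form, the function names and the sample points are all assumptions.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)  # assumes the xs are not all equal
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def smooth(xs, ys):
    """Replace each y by the fitted value at the same x."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
print(smooth(xs, ys))
```

The smoothed values lie exactly on the fitted line, so the random scatter around the trend is removed while the trend itself is kept.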
Inconsistent Data
 Data that is inconsistent with our models should
be dealt with.
 Common sense can also be used to detect this kind
of inconsistency:
 The same name occurring differently within an application.
 Different names that appear to be the same (Dennis vs. Denis).
 Inappropriate values (males being pregnant, or a negative age).
 A change of coding scheme (the rating was “1, 2, 3” and is now “A, B, C”).
 Differences between duplicate records.
Inconsistent Data
 We want to transform all dates to the same format internally
 Some systems accept dates in many formats
 e.g. “Sep 24, 2003” , 9/24/03, 24.09.03, etc
 dates are transformed internally to a standard value
 Frequently, just the year (YYYY) is sufficient
 For more details, we may need the month, the day, the hour,
etc
 Representing date as YYYYMM or YYYYMMDD can be OK.
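A small sketch of this kind of date normalization, using Python's standard datetime module. The list of accepted formats is an assumption built from the three examples above; in particular, "9/24/03" is read as month/day/year and two-digit years follow Python's default pivot. Month abbreviations such as "Sep" are parsed according to the current locale.

```python
from datetime import datetime

# Formats assumed from the examples on the slide; extend as needed.
KNOWN_FORMATS = ["%b %d, %Y", "%m/%d/%y", "%d.%m.%y"]

def to_yyyymmdd(text):
    """Try each known format in turn and return the date as YYYYMMDD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(raw, "->", to_yyyymmdd(raw))
```

All three inputs map to the same internal value, which is what makes later comparisons and grouping reliable.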
Data Integration
Goal identification & Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction
Data Integration
 Combines data from multiple sources into a coherent
store.
 Increasingly, data mining projects require data
from more than one source, such as multiple
databases, data warehouses, flat files and
historical data.
Data Integration
 Data is stored in many systems across enterprise
and outside the enterprise
Sources of data fall into two categories:
 Internal sources that are generated through enterprise
activities such as databases, historical data, Web sites
and warehouses.
 External sources such as credit bureaus, phone
companies and demographical information.
Data Integration
 Data warehouse: a structure that links information
from two or more databases.
 A data warehouse brings data from different data
sources into a central repository.
 It performs some data integration, clean-up and
summarization, and distributes the information to data
marts.
Data Integration
Next:
Data Transformation
Chapter 2 – Lecture 3
Introduction
Data Transformation
 Definition 1: Transform the data into a form
appropriate for a given data mining method.
 Definition 2: Data transformation is the process of
converting data or information from one format to
another, usually from the format of a source system
into the required format of a new destination system.
Data Transformation
 Methods include:
 Smoothing
 Aggregation
 Generalization
 Normalization (min-max)
Data Transformation
Methods of Data Transformation
 Normalization: the attribute values are scaled so that they
fall within a small specified range, such as -1.0 to 1.0.
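Min-max normalization maps a value v to v' = (v - min) / (max - min) * (new_max - new_min) + new_min. A minimal Python sketch, with the target range defaulting to the -1.0 to 1.0 interval mentioned above; the function name and the handling of a constant column are choices made here, not something the slides specify.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant column: map everything to the middle of the range (arbitrary choice).
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

# Hypothetical incomes
print(min_max_normalize([10, 20, 30]))
print(min_max_normalize([10, 20, 30], new_min=0.0, new_max=1.0))
```

Because the minimum and maximum are taken from the data, a single extreme outlier compresses all other values into a narrow band, so noisy data is best cleaned before normalizing.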
Next:
Data Reduction
Chapter 2 – Lecture 4
Introduction
Goal identification and Data Understanding → Data Cleaning → Data Integration → Data Transformation → Data Reduction
Data Reduction
Data Reduction (Selection)
 A warehouse may store terabytes of data, so complex
data analysis or mining may take a very long time to run
on the complete data set.
 Data reduction: obtains a reduced representation of
the data set that is much smaller in volume yet
produces the same (or almost the same) analytical
results.
Data Reduction
 The choice of data representation, and selection,
reduction or transformation of features is probably the
most important issue that determines the quality of a
data-mining solution.
Data Reduction
 The three basic operations in a data-reduction
process are:
 Delete a column (feature selection).
 Delete a row (sampling).
 Reduce the number of values in a column
(Discretization).
Data Reduction
Feature Selection
 We want to choose features (attributes) that are
relevant to our data-mining application in order to
achieve maximum performance with the minimum
measurement and processing effort.
Feature Selection
1. Redundant features
 Duplicate much or all of the information contained in
one or more other attributes
 E.g., purchase price of a product and the amount of
sales tax paid.
Feature Selection
2. Irrelevant features
 Contain no information that is useful for the data
mining task at hand.
E.g., students' ID is often irrelevant to the task of
predicting students' GPA.
Feature Selection
3. Selecting the most relevant fields
 If there are too many fields, select the subset that is most
relevant.
 The top N fields can be selected by ranking them with a computed relevance measure.
 What is a good N?
 Rule of thumb: keep the top 50 fields
Feature Selection
 Two types of feature selection
 Unsupervised: Reduce fields without knowing the class label.
 Supervised: Select fields with respect to the class label.
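One simple supervised way to pick the top N fields is to rank numeric fields by the absolute value of their Pearson correlation with a numeric target. The slides do not name a ranking measure, so this choice, the function names and the student data below are all assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)

def top_n_fields(columns, label, n):
    """Return the names of the n fields most correlated with the label."""
    scores = {name: abs(pearson(vals, label)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical student data: the ID should turn out to be irrelevant to GPA
columns = {
    "student_id": [104, 101, 105, 102, 103],
    "study_hours": [2, 4, 6, 8, 10],
    "absences": [9, 7, 6, 2, 1],
}
gpa = [2.0, 2.5, 3.0, 3.5, 4.0]

print(top_n_fields(columns, gpa, 2))
```

Correlation only captures linear relationships and ignores redundancy between the selected fields, so it is a starting point rather than a complete selection method.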
Sampling
 Sampling: obtaining a small sample s to represent
the whole data set N.
 It allows a mining algorithm to run with a complexity that is
potentially sub-linear in the size of the data.
Sampling
 Key principle: Choose a representative subset of the
data.
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling.
Sampling
Sample Size
(Figure: samples of 8000 points, 2000 points and 500 points.)
Types of Sampling
 Sampling without replacement:
 Once an object is selected, it is removed from the population.
 Sampling with replacement
 A selected object is not removed from the population.
 Stratified sampling:
 Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the data)
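The difference between the first two schemes can be sketched with Python's standard random module. The wrapper names and the seed parameter are illustrative; random.sample draws without replacement and random.choices draws with replacement.

```python
import random

def sample_without_replacement(population, size, seed=None):
    """Each object can be picked at most once; size must not exceed the population."""
    rng = random.Random(seed)
    return rng.sample(population, size)

def sample_with_replacement(population, size, seed=None):
    """A picked object stays in the population, so it may be picked again."""
    rng = random.Random(seed)
    return rng.choices(population, k=size)

population = list(range(100))
without = sample_without_replacement(population, 10, seed=1)
with_repl = sample_with_replacement([1, 2, 3], 10, seed=1)

print(without)    # ten distinct objects
print(with_repl)  # ten draws from only three objects, so repeats are unavoidable
```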
Types of Sampling (Sampling without replacement)
(Figure: raw data.)
Types of Sampling (Sampling with replacement)
(Figure: raw data.)
Types of Sampling
(Figure: raw data next to a cluster/stratified sample.)
Types of Sampling
Stratified sample by Age (about half of each group):
Age          Raw data   Stratified sample
Young        4          2
Middle-age   7          4
Senior       2          1
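The Age example can be reproduced with a short stratified-sampling sketch. The function name, the 50% sampling fraction and the choice to round each group's share up (so that small groups are never dropped) are assumptions made for this illustration.

```python
import math
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Draw about `fraction` of the records from every group defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = math.ceil(len(group) * fraction)  # round up so each group is represented
        sample.extend(rng.sample(group, k))   # without replacement inside the group
    return sample

ages = ["Young"] * 4 + ["Middle-age"] * 7 + ["Senior"] * 2
picked = stratified_sample(ages, key=lambda a: a, fraction=0.5, seed=0)

print(Counter(picked))
```

With 4 Young, 7 Middle-age and 2 Senior records, the sample keeps 2, 4 and 1 of them, so each age group keeps roughly the same share as in the raw data.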
Discretization
 Discretization is very useful for generating a
summary of data, also called “binning”.
 It does not use the class information.
 Suppose we have the following set of values for the
attribute - AGE : 0, 4, 12, 16, 16, 18, 24, 26, 28.
 Two possible ways in which binning can be applied
are equi-width binning and equi-frequency binning.
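Both kinds of binning can be tried on the AGE values above. The sketch below assumes three bins, which the slide does not specify, and the function names are illustrative. Equi-width binning splits the value range into intervals of equal length; equi-frequency binning puts (about) the same number of values in each bin.

```python
def equal_width_bins(values, n_bins):
    """Split the range [min, max] into n_bins intervals of equal width.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        index = min(int((v - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (about) equal size."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

print(equal_width_bins(ages, 3))      # intervals of width 28/3
print(equal_frequency_bins(ages, 3))  # three values per bin
```

Equi-width bins can end up unevenly filled (here 2, 4 and 3 values), while equi-frequency bins are balanced but may have very different widths.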
Next:
Practical Part
Ad

More Related Content

What's hot (20)

Data analytics
Data analyticsData analytics
Data analytics
BindhuBhargaviTalasi
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
pyingkodi maran
 
Statistics for data science
Statistics for data science Statistics for data science
Statistics for data science
zekeLabs Technologies
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
Amr Abd El Latief
 
Data analytics
Data analyticsData analytics
Data analytics
Bhanu Pratap
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
AbdullahAbbasi55
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
Pratik Tambekar
 
Data analytics
Data analyticsData analytics
Data analytics
Tilani Gunawardena PhD(UNIBAS), BSc(Pera), FHEA(UK), CEng, MIESL
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Edureka!
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science Club
Martin Bago
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
FellowBuddy.com
 
Data cubes
Data cubesData cubes
Data cubes
Mohammed
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlow
Sri Ambati
 
Data warehousing and data mart
Data warehousing and data martData warehousing and data mart
Data warehousing and data mart
Amit Sarkar
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 
Neural networks.ppt
Neural networks.pptNeural networks.ppt
Neural networks.ppt
SrinivashR3
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
Sushil Kulkarni
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
Bangalore
 
Data preprocessing in Machine learning
Data preprocessing in Machine learning Data preprocessing in Machine learning
Data preprocessing in Machine learning
pyingkodi maran
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
Amr Abd El Latief
 
DATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptxDATA WRANGLING presentation.pptx
DATA WRANGLING presentation.pptx
AbdullahAbbasi55
 
Data mining slides
Data mining slidesData mining slides
Data mining slides
smj
 
What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)What Is DATA MINING(INTRODUCTION)
What Is DATA MINING(INTRODUCTION)
Pratik Tambekar
 
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Data Analytics For Beginners | Introduction To Data Analytics | Data Analytic...
Edureka!
 
Exploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science ClubExploratory data analysis in R - Data Science Club
Exploratory data analysis in R - Data Science Club
Martin Bago
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
FellowBuddy.com
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Introduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlowIntroduction to Deep Learning, Keras, and TensorFlow
Introduction to Deep Learning, Keras, and TensorFlow
Sri Ambati
 
Data warehousing and data mart
Data warehousing and data martData warehousing and data mart
Data warehousing and data mart
Amit Sarkar
 
Unsupervised learning clustering
Unsupervised learning clusteringUnsupervised learning clustering
Unsupervised learning clustering
Arshad Farhad
 
Neural networks.ppt
Neural networks.pptNeural networks.ppt
Neural networks.ppt
SrinivashR3
 
Introduction to Data Mining
Introduction to Data Mining Introduction to Data Mining
Introduction to Data Mining
Sushil Kulkarni
 
Linear discriminant analysis
Linear discriminant analysisLinear discriminant analysis
Linear discriminant analysis
Bangalore
 

Similar to Data preparation and processing chapter 2 (20)

4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
Mahmoud Alfarra
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
sumit621
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
belay41
 
Chapter 3.pdf
Chapter 3.pdfChapter 3.pdf
Chapter 3.pdf
DrGnaneswariG
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2
Mahmoud Alfarra
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
ShaikSikindar1
 
preproccessing level 3 for students.ppt
preproccessing level 3 for  students.pptpreproccessing level 3 for  students.ppt
preproccessing level 3 for students.ppt
AhmedAlrashdy
 
Machine Learning: A Fast Review
Machine Learning: A Fast ReviewMachine Learning: A Fast Review
Machine Learning: A Fast Review
Ahmad Ali Abin
 
machinelearning-191005133446.pdf
machinelearning-191005133446.pdfmachinelearning-191005133446.pdf
machinelearning-191005133446.pdf
LellaLinton
 
Data processing
Data processingData processing
Data processing
AnupamSingh211
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
Yugal Kumar
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
Data Science Council of America
 
Data Preparation.pptx
Data Preparation.pptxData Preparation.pptx
Data Preparation.pptx
YashikaSengar2
 
1 UNIT-DSP.pptx
1 UNIT-DSP.pptx1 UNIT-DSP.pptx
1 UNIT-DSP.pptx
PothyeswariPothyes
 
Data mining
Data miningData mining
Data mining
Silicon
 
Data Exploration and Transformation.pptx
Data Exploration and Transformation.pptxData Exploration and Transformation.pptx
Data Exploration and Transformation.pptx
lovepreet33653
 
Preprocessing data mining hhxdzsdsasaasa
Preprocessing data mining hhxdzsdsasaasaPreprocessing data mining hhxdzsdsasaasa
Preprocessing data mining hhxdzsdsasaasa
Suvedha8
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Dr. Abdul Ahad Abro
 
Data preprocessing in precision agriculture
Data preprocessing in precision agricultureData preprocessing in precision agriculture
Data preprocessing in precision agriculture
mogana98
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
VISHALMARWADE1
 
4 Data preparation and processing
4  Data preparation and processing4  Data preparation and processing
4 Data preparation and processing
Mahmoud Alfarra
 
Cssu dw dm
Cssu dw dmCssu dw dm
Cssu dw dm
sumit621
 
ML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.pptML-ChapterTwo-Data Preprocessing.ppt
ML-ChapterTwo-Data Preprocessing.ppt
belay41
 
5 data preparation and processing2
5 data preparation and processing25 data preparation and processing2
5 data preparation and processing2
Mahmoud Alfarra
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
ShaikSikindar1
 
preproccessing level 3 for students.ppt
preproccessing level 3 for  students.pptpreproccessing level 3 for  students.ppt
preproccessing level 3 for students.ppt
AhmedAlrashdy
 
Machine Learning: A Fast Review
Machine Learning: A Fast ReviewMachine Learning: A Fast Review
Machine Learning: A Fast Review
Ahmad Ali Abin
 
machinelearning-191005133446.pdf
machinelearning-191005133446.pdfmachinelearning-191005133446.pdf
machinelearning-191005133446.pdf
LellaLinton
 
Data mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updatedData mining and data warehouse lab manual updated
Data mining and data warehouse lab manual updated
Yugal Kumar
 
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdfThe Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
The Simple 5-Step Process for Creating a Winning Data Pipeline.pdf
Data Science Council of America
 
Data mining
Data miningData mining
Data mining
Silicon
 
Data Exploration and Transformation.pptx
Data Exploration and Transformation.pptxData Exploration and Transformation.pptx
Data Exploration and Transformation.pptx
lovepreet33653
 
Preprocessing data mining hhxdzsdsasaasa
Preprocessing data mining hhxdzsdsasaasaPreprocessing data mining hhxdzsdsasaasa
Preprocessing data mining hhxdzsdsasaasa
Suvedha8
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
Dr. Abdul Ahad Abro
 
Data preprocessing in precision agriculture
Data preprocessing in precision agricultureData preprocessing in precision agriculture
Data preprocessing in precision agriculture
mogana98
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
VISHALMARWADE1
 
Ad

More from Mahmoud Alfarra (20)

Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2
Mahmoud Alfarra
 
Computer Programming, Loops using Java
Computer Programming, Loops using JavaComputer Programming, Loops using Java
Computer Programming, Loops using Java
Mahmoud Alfarra
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structure
Mahmoud Alfarra
 
Chapter9 graph data structure
Chapter9  graph data structureChapter9  graph data structure
Chapter9 graph data structure
Mahmoud Alfarra
 
Chapter 8: tree data structure
Chapter 8:  tree data structureChapter 8:  tree data structure
Chapter 8: tree data structure
Mahmoud Alfarra
 
Chapter 7: Queue data structure
Chapter 7:  Queue data structureChapter 7:  Queue data structure
Chapter 7: Queue data structure
Mahmoud Alfarra
 
Chapter 6: stack data structure
Chapter 6:  stack data structureChapter 6:  stack data structure
Chapter 6: stack data structure
Mahmoud Alfarra
 
Chapter 5: linked list data structure
Chapter 5: linked list data structureChapter 5: linked list data structure
Chapter 5: linked list data structure
Mahmoud Alfarra
 
Chapter 4: basic search algorithms data structure
Chapter 4: basic search algorithms data structureChapter 4: basic search algorithms data structure
Chapter 4: basic search algorithms data structure
Mahmoud Alfarra
 
Chapter 3: basic sorting algorithms data structure
Chapter 3: basic sorting algorithms data structureChapter 3: basic sorting algorithms data structure
Chapter 3: basic sorting algorithms data structure
Mahmoud Alfarra
 
Chapter 2: array and array list data structure
Chapter 2: array and array list  data structureChapter 2: array and array list  data structure
Chapter 2: array and array list data structure
Mahmoud Alfarra
 
Chapter1 intro toprincipleofc#_datastructure_b_cs
Chapter1  intro toprincipleofc#_datastructure_b_csChapter1  intro toprincipleofc#_datastructure_b_cs
Chapter1 intro toprincipleofc#_datastructure_b_cs
Mahmoud Alfarra
 
Chapter 0: introduction to data structure
Chapter 0: introduction to data structureChapter 0: introduction to data structure
Chapter 0: introduction to data structure
Mahmoud Alfarra
 
3 classification
3  classification3  classification
3 classification
Mahmoud Alfarra
 
8 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 201020118 programming-using-java decision-making practices 20102011
8 programming-using-java decision-making practices 20102011
Mahmoud Alfarra
 
7 programming-using-java decision-making220102011
7 programming-using-java decision-making2201020117 programming-using-java decision-making220102011
7 programming-using-java decision-making220102011
Mahmoud Alfarra
 
6 programming-using-java decision-making20102011-
6 programming-using-java decision-making20102011-6 programming-using-java decision-making20102011-
6 programming-using-java decision-making20102011-
Mahmoud Alfarra
 
5 programming-using-java intro-tooop20102011
5 programming-using-java intro-tooop201020115 programming-using-java intro-tooop20102011
5 programming-using-java intro-tooop20102011
Mahmoud Alfarra
 
4 programming-using-java intro-tojava20102011
4 programming-using-java intro-tojava201020114 programming-using-java intro-tojava20102011
4 programming-using-java intro-tojava20102011
Mahmoud Alfarra
 
3 programming-using-java introduction-to computer
3 programming-using-java introduction-to computer3 programming-using-java introduction-to computer
3 programming-using-java introduction-to computer
Mahmoud Alfarra
 
Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2Computer Programming, Loops using Java - part 2
Computer Programming, Loops using Java - part 2
Mahmoud Alfarra
 
Computer Programming, Loops using Java
Computer Programming, Loops using JavaComputer Programming, Loops using Java
Computer Programming, Loops using Java
Mahmoud Alfarra
 
Chapter 10: hashing data structure
Chapter 10:  hashing data structureChapter 10:  hashing data structure
Chapter 10: hashing data structure
Mahmoud Alfarra
 
Chapter9 graph data structure
Chapter9  graph data structureChapter9  graph data structure
Data preparation and processing chapter 2

  • 1. Data preparation and processing. Mahmoud Rafeek Alfarra, http://mfarra.cst.ps, University College of Science & Technology, Khan Yonis. Development of computer systems, 2016. Chapter 2 – Lecture 1
  • 2. Outline: Introduction; Domain Expert; Goal identification and Data Understanding; Data Cleaning (Missing values, Noisy Data, Inconsistent Data); Data Integration; Data Transformation; Data Reduction (Feature Selection, Sampling, Discretization)
  • 4. Introduction. Real-world databases used in data mining may have millions of records and thousands of variables. They are noisy and contain missing and inconsistent values. Data quality is a key issue in data mining, so data preparation is a necessary step for serious, effective, real-world data mining.
  • 5. Introduction. To increase the accuracy of mining, data preprocessing has to be performed; otherwise, garbage in => garbage out. Data preparation is estimated to take 70-80% of the time and effort.
  • 6. Domain Expertise. Data quality expert: “We found these strange records in your database after running sophisticated algorithms!” Domain expert: “Oh, those apples - we put them in the same baskets as oranges because there are too few apples to bother. Not a big deal. We knew that already.”
  • 7. Domain Expertise. Domain expertise is important for understanding the data and the problem, and for interpreting the results. Examples of facts that only a domain expert can supply: “The counter resets to 0 if the number of calls exceeds N.” “The missing values are represented by 0, but the default billed amount is 0 too.” Insufficient domain expertise is a primary cause of poor data quality, which can leave the data unusable.
  • 8. Goal Identification. To obtain the highest benefit from data mining, there must be a clear statement of the business objectives. The first and most important step in any targeting-model project is to establish a clear goal and develop a process to achieve that goal.
  • 9. Goal Identification. Examples of goals for a business: attract new customers; avoid high-risk customers; understand the characteristics of current customers; make unprofitable customers more profitable; retain profitable customers; win back lost customers; improve customer satisfaction; increase sales; reduce expenses.
  • 10. Data Understanding. Starts with an initial data collection and proceeds with activities to get familiar with the data, to identify data quality problems, and to discover first insights into the data.
  • 11. Data Understanding: Relevance. What data is available for the task? Is this data relevant? Is additional relevant data available? How much historical data is available? Who is the data expert?
  • 12. Data Understanding: Quantity. Number of instances (records): rule of thumb, 5,000 or more desired; with fewer, results are less reliable. Number of attributes (fields): rule of thumb, 10 or more instances for each field; if there are more fields than that allows, use feature reduction and selection. Number of targets: rule of thumb, more than 100 instances for each class; if the classes are very unbalanced, use stratified sampling.
  • 13. Data Cleaning (process diagram: Goal identification & Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction)
  • 14. Data Cleaning. Example data set (columns are attributes, rows are objects; blank cells are missing values):
      Tid  Refund  Marital Status  Taxable Income  Cheat
      1    Yes                     125K            No
      2    No      Married         100K            No
      3    No      Single          70K             No
      4    Yes     Married         120K            No
      5    No      Divorced        -95k            Yes
      6    No      Married         60K             No
      7    Yes                     220K            No
      8    No      Single          85K             Yes
      9    No      Married         75K             No
      10   No      Single          90K             Yes
  • 15. Data Cleaning. Real-world data tends to be incomplete, noisy and inconsistent. Data cleaning steps: missing values; noisy data; inconsistent data.
  • 16. Missing values. A missing value (MV) is an empty cell in the table that represents a dataset (figure: a table with instances as rows and attributes as columns, with “?” marking a missing cell).
  • 17. Dealing with missing values. 1. Ignore records with missing values: This is usually done when the class label is missing. This method is not effective unless the record contains several attributes with missing values.
  • 18. Dealing with missing values. 2. Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible for a large data set with many missing values. 3. Fill in the missing value with a global constant: Replace all missing values with the same constant, such as “unknown”. Although this method is simple, it is not recommended because results based on “unknown” values are not “interesting”.
  • 19. Dealing with missing values. 4. Use the attribute mean to fill missing values: For example, if the mean of the attribute income is 28000, use this value to replace the missing income values. 5. Use the attribute mean of all samples belonging to the same class: For example, if classifying customers according to credit risk, replace the missing value with the mean income of customers in the same credit risk category as the given record.
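Methods 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names, the use of `None` to mark a missing value, and the sample incomes and risk labels are all assumptions made for the example.

```python
from statistics import mean

def fill_with_mean(values):
    """Method 4: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    overall = mean(observed)
    return [overall if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Method 5: replace None with the mean of observed values in the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    # Assumes every class has at least one observed value.
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical data: income with two missing entries, and a credit-risk class.
income = [20000, None, 36000, 50000, None, 70000]
risk = ["high", "high", "high", "low", "low", "low"]

print(fill_with_mean(income))              # both gaps become 44000
print(fill_with_class_mean(income, risk))  # gaps become 28000 (high) and 60000 (low)
```

The class-conditional version usually gives a closer estimate because it uses only records that resemble the incomplete one, at the cost of needing a class label for every record.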
  • 20. Dealing with missing values. 6. Use an advanced method, such as k-nearest neighbors or a decision tree, to predict the missing value from the other values.
  • 21. Dealing with missing values: k-nearest neighbors approach. Compute the k nearest neighbors and assign a value from them.
  • 22. Dealing with missing values: k-nearest neighbors approach. For nominal values, use the most common value among all neighbors. For numerical values, use the average value. We need to define a proximity measure between instances, such as Euclidean distance.
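The k-nearest-neighbors approach can be sketched as follows. This is a minimal illustration under stated assumptions: records are dictionaries, the caller names the numeric attributes used for the Euclidean distance, and the function name `knn_impute` and the sample records are hypothetical. In practice the distance attributes would normally be scaled first so that no single attribute dominates.

```python
import math
from collections import Counter

def knn_impute(complete, query, missing_attr, distance_attrs, k=3):
    """Estimate query[missing_attr] from the k nearest complete records."""
    def dist(record):
        # Euclidean distance over the chosen numeric attributes.
        return math.sqrt(sum((record[a] - query[a]) ** 2 for a in distance_attrs))

    neighbors = sorted(complete, key=dist)[:k]
    values = [r[missing_attr] for r in neighbors]
    if all(isinstance(v, (int, float)) for v in values):
        return sum(values) / len(values)        # numerical: average
    return Counter(values).most_common(1)[0][0]  # nominal: most common value

# Hypothetical complete records (income in thousands).
complete = [
    {"age": 25, "income": 30, "status": "Single"},
    {"age": 27, "income": 32, "status": "Single"},
    {"age": 45, "income": 80, "status": "Married"},
    {"age": 47, "income": 85, "status": "Married"},
    {"age": 26, "income": 34, "status": "Married"},
]
query = {"age": 26, "income": None, "status": None}

print(knn_impute(complete, query, "income", ["age"]))  # 32.0
print(knn_impute(complete, query, "status", ["age"]))  # Single
```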
  • 24. Chapter 2 – Lecture 2
  • 27. Noisy Data. Noise is a random error in a measured variable. Noisy data is meaningless data. Any data that has been received, stored or changed in such a manner that it cannot be read or used by the program that originally created it can be described as noisy.
  • 28. Noisy Data. Sources of noisy data: 1. Data entry problems. 2. Faulty data collection instruments. 3. Data transmission.
  • 29. How to handle noisy data? Binning method; clustering; combined computer and human inspection; regression.
  • 30. How to handle noisy data? Binning method: 1. Sort the data. 2. Partition it into equal-frequency groups. 3. Smooth by group means, by group medians, by group boundaries, etc.
  • 31. How to handle noisy data? Sorted data for price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34. Partition into (equal-frequency) groups: G1 = 4, 8, 9, 15; G2 = 21, 21, 24, 25; G3 = 26, 28, 29, 34. Smoothing by bin means: G1 = 9, 9, 9, 9; G2 = 23, 23, 23, 23; G3 = 29, 29, 29, 29. Smoothing by bin boundaries: G1 = 4, 4, 4, 15; G2 = 21, 21, 25, 25; G3 = 26, 26, 26, 34.
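The price example above can be reproduced with a short Python sketch. The function names are hypothetical, means are rounded to the nearest integer as on the slide, and ties in boundary smoothing go to the lower boundary, which is an assumption (the example contains no ties).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into n_bins groups of equal size.
    Assumes len(values) is divisible by n_bins."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value by the (rounded) mean of its bin."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```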
  • 32. How to handle noisy data? Clustering: Outliers may be detected by clustering, where similar values are organized into groups. Values that fall outside the set of clusters may be considered outliers.
  • 33. How to handle noisy data? Combined computer and human inspection: Outliers may be identified by having the computer detect suspicious values, which are then checked by a human.
  • 34. How to handle noisy data? Regression: Data can be smoothed by fitting the data to a function.
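As a sketch of smoothing by regression, the simplest choice of function is a straight line fitted by ordinary least squares. The slides do not say which function to fit, so the linear model, the function names and the sample readings below are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth_by_regression(xs, ys):
    """Replace each noisy y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)
print(round(slope, 2), round(intercept, 2))  # 1.97 0.09
print([round(y, 2) for y in smooth_by_regression(xs, ys)])
```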
  • 35. Inconsistent Data. Data that is inconsistent with our models should be dealt with. Common sense can also be used to detect this kind of inconsistency: the same name occurring differently in an application; different names that appear to be the same (Dennis vs. Denis); inappropriate values (males being pregnant, or a negative age); a rating scale that was “1, 2, 3” and is now “A, B, C”; differences between duplicate records.
  • 36. Inconsistent Data. We want to transform all dates to the same format internally. Some systems accept dates in many formats, e.g. “Sep 24, 2003”, 9/24/03, 24.09.03, etc., and transform them internally to a standard value. Frequently, just the year (YYYY) is sufficient. For more detail, we may need the month, the day, the hour, etc. Representing a date as YYYYMM or YYYYMMDD can be OK.
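A minimal sketch of date standardization, using Python's standard `datetime` module to convert the three example formats to YYYYMMDD. The list of accepted formats and the function name are assumptions; the month abbreviation is parsed with the default (English) locale, and two-digit years follow Python's pivot rule, so “03” becomes 2003.

```python
from datetime import datetime

# Accepted input formats, tried in order (an illustrative, hypothetical list).
FORMATS = [
    "%b %d, %Y",  # Sep 24, 2003
    "%m/%d/%y",   # 9/24/03
    "%d.%m.%y",   # 24.09.03
]

def to_yyyymmdd(text):
    """Return the date as a YYYYMMDD string, or raise ValueError."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")

for raw in ["Sep 24, 2003", "9/24/03", "24.09.03"]:
    print(raw, "->", to_yyyymmdd(raw))  # each prints 20030924
```

Trying formats in a fixed order cannot resolve genuinely ambiguous inputs; the separator is the only thing that tells month-first from day-first here, so the source of each date column should still be confirmed with a domain expert.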
  • 37. Data Integration (process diagram: Goal identification & Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction)
  • 38. Data Integration. Combines data from multiple sources into a coherent store. Increasingly, data mining projects require data from more than one data source, such as multiple databases, data warehouses, flat files and historical data.
  • 39. Data Integration. Data is stored in many systems across the enterprise and outside it. Data sources fall into two categories: internal sources, generated through enterprise activities, such as databases, historical data, Web sites and warehouses; and external sources, such as credit bureaus, phone companies and demographic information.
  • 40. Data Integration. A data warehouse is a structure that links information from two or more databases. It brings data from different data sources into a central repository. It performs some data integration, clean-up and summarization, and distributes the information to data marts.
  • 43. Chapter 2 – Lecture 3
  • 47. Data Transformation. Definition 1: Transform the data into a form appropriate for a given data mining method. Definition 2: Data transformation is the process of converting data or information from one format to another, usually from the format of a source system into the required format of a new destination system.
  • 48. Data Transformation. Methods include: smoothing; aggregation; generalization; normalization (min-max).
  • 49. Methods of Data Transformation. Normalization: The attributes are scaled so as to fall within a small specified range, such as -1.0 to 1.0.
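Min-max normalization maps the smallest observed value to the lower end of the target range and the largest to the upper end, scaling everything else linearly in between. A minimal sketch, assuming the -1.0 to 1.0 range mentioned above; the function name and the sample values are hypothetical.

```python
def min_max_normalize(values, new_min=-1.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        raise ValueError("cannot normalize a constant attribute")
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

incomes = [10, 20, 30]
print(min_max_normalize(incomes))            # [-1.0, 0.0, 1.0]
print(min_max_normalize(incomes, 0.0, 1.0))  # [0.0, 0.5, 1.0]
```

Because the result depends on the observed minimum and maximum, a single extreme outlier compresses all other values into a narrow band, so noisy data is usually cleaned before this step.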
  • 50. How to handle noisy data ?
  • 52. Chapter 2 – Lecture 4
  • 54. Introduction (process diagram: Goal identification and Data Understanding, Data Cleaning, Data Integration, Data Transformation, Data Reduction)
  • 57. Data Reduction. A warehouse may store terabytes of data, so complex data analysis or mining may take a very long time to run on the complete data set. Data reduction obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
  • 58. Data Reduction. The choice of data representation, and the selection, reduction or transformation of features, is probably the most important issue that determines the quality of a data-mining solution.
  • 59. Data Reduction. The three basic operations in a data-reduction process are: delete a column (feature selection); delete a row (sampling); reduce the number of values in a column (discretization).
  • 60. Feature Selection. We want to choose features (attributes) that are relevant to our data-mining application in order to achieve maximum performance with the minimum measurement and processing effort.
  • 61. Feature Selection. 1. Redundant features: duplicate much or all of the information contained in one or more other attributes, e.g. the purchase price of a product and the amount of sales tax paid.
  • 62. Feature Selection. 2. Irrelevant features: contain no information that is useful for the data mining task at hand, e.g. a student's ID is often irrelevant to the task of predicting the student's GPA.
  • 63. Feature Selection. 3. Selecting the most relevant fields: If there are too many fields, select a subset that is most relevant. The top N fields can be selected using some computation. What is a good N? Rule of thumb: keep the top 50 fields.
  • 64. Feature Selection. Two types of feature selection: unsupervised (reduce fields without knowing the class label) and supervised (select fields with respect to the class label).
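One simple unsupervised way to spot redundant features, such as purchase price and sales tax, is to drop any numeric column that is almost perfectly correlated with a column already kept. The slides do not prescribe this technique; the Pearson correlation, the 0.95 threshold, the function names and the sample columns are all assumptions made for this sketch.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def drop_redundant(columns, threshold=0.95):
    """Keep a column only if it is not highly correlated with any kept column."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical data: sales tax is a fixed 8% of the price, so it is redundant.
columns = {
    "price": [100, 200, 300, 400],
    "sales_tax": [8, 16, 24, 32],
    "quantity": [3, 1, 4, 2],
}
print(drop_redundant(columns))  # ['price', 'quantity']
```

A supervised variant would instead rank each column by its relationship to the class label and keep the top N.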
  • 65. Sampling. Sampling: obtaining a small sample s to represent the whole data set N. It allows a mining algorithm to run with a complexity that is potentially sub-linear in the size of the data.
  • 66. Sampling. Key principle: choose a representative subset of the data. Simple random sampling may perform very poorly in the presence of skew, so adaptive sampling methods, e.g. stratified sampling, have been developed.
  • 67. Sampling: Sample Size (figure: the same data set drawn with 8000 points, 2000 points and 500 points)
  • 68. Types of Sampling. Sampling without replacement: once an object is selected, it is removed from the population. Sampling with replacement: a selected object is not removed from the population. Stratified sampling: partition the data set, and draw samples from each partition (proportionally, i.e. approximately the same percentage of the data).
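The three sampling types can be sketched with Python's standard `random` module. The function names, the 10% sampling fraction and the skewed yes/no population are assumptions for illustration; each stratum contributes at least one record, which is also a design choice of this sketch rather than something the slides specify.

```python
import random

def sample_without_replacement(data, n, rng):
    """Each object can be selected at most once."""
    return rng.sample(data, n)

def sample_with_replacement(data, n, rng):
    """A selected object stays in the population and may be drawn again."""
    return rng.choices(data, k=n)

def stratified_sample(data, key, fraction, rng):
    """Partition by key(item) and draw the same fraction from every partition."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, n))
    return sample

# Hypothetical skewed population: 10 "yes" records and 90 "no" records.
population = [("yes", i) for i in range(10)] + [("no", i) for i in range(90)]
rng = random.Random(0)  # fixed seed so the run is repeatable

strat = stratified_sample(population, key=lambda r: r[0], fraction=0.1, rng=rng)
print(sum(1 for r in strat if r[0] == "yes"), "yes,",
      sum(1 for r in strat if r[0] == "no"), "no")  # 1 yes, 9 no
```

A simple random sample of 10 records from this population could easily contain no "yes" records at all, which is the skew problem that stratified sampling avoids.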
  • 69. Types of Sampling: sampling without replacement (figure: raw data and the resulting sample)
  • 70. Types of Sampling: sampling with replacement (figure: raw data and the resulting sample)
  • 71. Types of Sampling (figure: raw data and a cluster/stratified sample)
  • 73. Discretization. Discretization, also called “binning”, is very useful for generating a summary of data. It does not use the class information. Suppose we have the following set of values for the attribute AGE: 0, 4, 12, 16, 16, 18, 24, 26, 28. Binning can be applied in two possible ways: equi-width binning or equi-frequency binning.
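Both binning schemes can be applied to the AGE values with a short Python sketch. The choice of three bins is an assumption, since the text above does not fix a bin count, and the function names are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Split the value range [min, max] into n_bins intervals of equal width.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups holding (almost) equal counts."""
    data = sorted(values)
    n = len(data)
    return [data[i * n // n_bins:(i + 1) * n // n_bins] for i in range(n_bins)]

ages = [0, 4, 12, 16, 16, 18, 24, 26, 28]

print(equal_width_bins(ages, 3))      # [[0, 4], [12, 16, 16, 18], [24, 26, 28]]
print(equal_frequency_bins(ages, 3))  # [[0, 4, 12], [16, 16, 18], [24, 26, 28]]
```

Equi-width bins cover intervals of the same size (here about 9.33 years each) but may hold very different numbers of values; equi-frequency bins hold the same number of values but cover intervals of different widths.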