Data Preprocessing

Data preprocessing is a technique used to transform raw data into a clean and efficient format for data mining. It involves data cleaning to handle missing or noisy data, data transformation such as normalization and discretization, and data reduction to reduce storage costs and aid analysis. Common data cleaning techniques include filling in missing values, binning noisy data, and using regression or clustering. Transformation includes scaling, attribute selection, and concept hierarchy generation. Reduction includes aggregation, attribute selection, dimensionality reduction such as PCA, and numerosity reduction using data models.

Uploaded by

Im' Possible

Data Preprocessing in Data Mining

Preprocessing in Data Mining:

Data preprocessing is a data mining technique used to transform raw data into
a useful and efficient format.

Steps Involved in Data Preprocessing:

1. Data Cleaning:
Raw data can contain many irrelevant and missing parts. Data cleaning is done
to handle them. It involves handling missing data, noisy data, etc.

 (a). Missing Data:

This situation arises when some values are missing from the dataset. It can
be handled in various ways.
Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset is quite large and
multiple values are missing within a tuple.
2. Fill the missing values:
There are various ways to do this. You can fill the missing values
manually, with the attribute mean, or with the most probable value.
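The two approaches above can be sketched in a few lines of Python. This is only an illustration: the records, the column meanings, the function names, and the `max_missing` threshold are assumptions made for the example, and missing values are represented as `None`.

```python
from statistics import mean


def drop_sparse_tuples(rows, max_missing=1):
    """Ignore tuples: keep only rows with at most `max_missing` missing values."""
    return [row for row in rows if sum(v is None for v in row) <= max_missing]


def fill_with_attribute_mean(rows):
    """Fill each missing value with the mean of the known values in its column."""
    col_means = []
    for col in zip(*rows):
        known = [v for v in col if v is not None]
        # A column with no known values cannot be filled; leave it as None.
        col_means.append(mean(known) if known else None)
    return [
        [col_means[i] if v is None else v for i, v in enumerate(row)]
        for row in rows
    ]


# Hypothetical records (age, income, score); None marks a missing value.
records = [
    [25, 50000, 7.0],
    [None, 62000, 8.0],
    [35, None, None],   # two missing values: dropped by the first step
    [45, 58000, 6.0],
]

kept = drop_sparse_tuples(records, max_missing=1)
filled = fill_with_attribute_mean(kept)
print(filled)
```

Note that the mean is computed only over the tuples that were kept, so the order of the two steps affects the filled values.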
 (b). Noisy Data:
Noisy data is meaningless data that machines cannot interpret. It can be
caused by faulty data collection, data entry errors, etc. It can be
handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. The data is
divided into segments (bins) of equal size, and each segment is handled
separately. All values in a segment can be replaced by the segment mean,
or each value can be replaced by the nearest segment boundary value.
2. Regression:
Here data is smoothed by fitting it to a regression function. The
regression used may be linear (having one independent variable) or
multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers either fall
outside the clusters, where they can be detected, or go undetected.
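As an illustration of the binning method described above, the sketch below splits sorted values into equal-size bins and smooths them by bin means and by bin boundaries. The price values and function names are made up for the example, and sending ties to the lower boundary is an arbitrary choice.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of `bin_size` items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_means(bins):
    """Replace every value in a bin with the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's min or max boundary."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical sample values, not taken from the notes.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

bins = equal_frequency_bins(prices, bin_size=3)
by_means = smooth_by_means(bins)
by_bounds = smooth_by_boundaries(bins)

print(bins)       # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(by_means)   # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```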
2. Data Transformation:
This step transforms the data into forms suitable for the mining process.
It involves the following techniques:
1. Normalization:
It is done in order to scale the data values into a specified range (such as
-1.0 to 1.0 or 0.0 to 1.0).
2. Attribute Selection (Attribute Construction):
In this strategy, new attributes are constructed from the given set of
attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute with interval
labels or conceptual labels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the
hierarchy. For example, the attribute “city” can be converted to “country”.
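Normalization and discretization from the list above can be sketched as follows. The notes do not name a specific scaling method, so min-max scaling and equal-width intervals are used here as assumptions; the ages, labels, and function names are also illustrative only.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]


def discretize_equal_width(values, labels):
    """Replace numeric values with labels of equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / len(labels)
    if width == 0:
        width = 1.0  # avoid division by zero when all values are identical
    result = []
    for v in values:
        # The maximum value would land one past the last interval; clamp it.
        index = min(int((v - low) / width), len(labels) - 1)
        result.append(labels[index])
    return result


# Hypothetical ages, not taken from the notes.
ages = [18, 25, 40, 58, 78]

scaled = min_max_normalize(ages)               # range 0.0 to 1.0
signed = min_max_normalize(ages, -1.0, 1.0)    # range -1.0 to 1.0
groups = discretize_equal_width(ages, ["young", "middle-aged", "senior"])

print(scaled)
print(signed)
print(groups)  # ['young', 'young', 'middle-aged', 'senior', 'senior']
```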
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes
harder as the volume of data grows. To deal with this, we use data reduction
techniques. They aim to increase storage efficiency and to reduce data storage
and analysis costs.

The various steps to data reduction are:

1. Data Cube Aggregation:
An aggregation operation is applied to the data to construct the data cube.
2. Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded.
To perform attribute selection, one can use the significance level and the
p-value of each attribute. An attribute whose p-value is greater than the
significance level can be discarded.
3. Numerosity Reduction:
This makes it possible to store a model of the data instead of the whole
data, for example: Regression Models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or
lossless. If the original data can be retrieved after reconstruction from
the compressed data, the reduction is called lossless; otherwise it is
called lossy. Two effective methods of dimensionality reduction are wavelet
transforms and PCA (Principal Component Analysis).
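The numerosity reduction step above (storing a regression model instead of the data) can be illustrated with a simple least-squares line. This is a minimal sketch: the sample points and function names are invented for the example, and the reduction is lossy because the line only approximates the original values.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def reconstruct(xs, slope, intercept):
    """Approximate the original y values from the stored model."""
    return [slope * x + intercept for x in xs]


# Hypothetical measurements, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Only two numbers are stored in place of the five y values.
slope, intercept = fit_line(xs, ys)

approx = reconstruct(xs, slope, intercept)
max_error = max(abs(a - y) for a, y in zip(approx, ys))

print(round(slope, 2), round(intercept, 2))  # 1.99 0.05
print(round(max_error, 2))                   # 0.21
```

The saving grows with the number of points: the model always needs two parameters, however many values it stands in for.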
