Data Cleaning and Data Transformation

The quality of the data is critical in analysis, because data that are incomplete, noisy, or inconsistent will affect the results. Data cleaning in data mining is therefore the process of detecting and removing (or correcting) corrupt or incorrect records from a record set, table, or database.
There are several data cleaning methods for handling missing values (a short sketch follows this list):
• Ignore the tuple: this is usually done when the class label is missing. The technique is not very effective unless the tuple contains many attributes with missing values.
• Fill in the missing value manually: this approach works for small data sets with only a few missing values.
• Replace every missing attribute value with a global constant, such as a label like “Unknown” or minus infinity.
• Use the attribute mean to fill in the missing value. For example, if the average customer income is 30,000, that value can be used to replace a missing income value.
• Fill in the missing value with the most probable value.
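The following is a minimal sketch of some of these options using pandas; the DataFrame and its class_label and income columns are hypothetical, not from the original.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values
df = pd.DataFrame({
    "class_label": ["A", None, "B", "A"],
    "income": [25000, np.nan, 42000, np.nan],
})

# 1. Ignore the tuple: drop rows whose class label is missing
df_dropped = df.dropna(subset=["class_label"])

# 2. Replace missing values with a global constant
df_const = df.fillna({"class_label": "Unknown"})

# 3. Use the attribute mean to fill in missing numeric values
df_mean = df.fillna({"income": df["income"].mean()})
```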
Advantages of data cleansing:
• Accurate and quality data
• Improved decision making
• Streamlined business practices
• Increased productivity
• Increased revenue
Benefits of Data Cleaning
The key benefits of the data cleaning process are as follows:
1. It removes errors and inconsistencies that are inevitable when data from different sources are pushed into one dataset.
2. Data cleanup tools make everyone more efficient, since they can quickly get what they need from the data.
3. Fewer errors mean happier users and fewer frustrated employees.
4. The ability to map the different functions and what the data is intended to do.
Steps to clean the data
1. Monitor Errors: Keep a record and watch for trends in where most errors are coming from, as this will make it much easier to identify and fix incorrect or corrupt data. This is particularly important if you are integrating other solutions with your fleet management software, so that errors do not obstruct the work of other departments.
2. Standardize Your Processes: By standardizing the data entry process you ensure a good point of entry and reduce the risk of duplication.
3. Validate Accuracy: Validate the accuracy of the data once the existing database has been cleaned. Research and invest in data tools that allow you to clean the data in real time; some tools now use artificial intelligence or machine learning to better test for accuracy.
4. Scrub for Duplicate Data: Identify duplicates, since this will save time when analysing the data. Duplicates can be avoided by researching and investing in data cleaning tools (a short sketch of duplicate scrubbing follows these steps).
5. Analyse: After the data has been standardized, validated, and scrubbed for duplicates, use third-party sources to append it. Reliable third-party sources can capture data directly from first-party sites, then clean and compile the data to provide more complete information for business intelligence and analytics.
6. Communicate with the Team: Communicate the new standardized cleaning process to the team. Now that the data has been scrubbed, it is important to keep it clean. This will help to develop and strengthen user segmentation and send more targeted information to users and prospects.
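As a minimal sketch of step 4, duplicates can be scrubbed with pandas; the customer_id and email columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", "c@x.com"],
})

# Flag rows that are exact repeats of an earlier row
duplicate_mask = df.duplicated(subset=["customer_id", "email"], keep="first")
print(df[duplicate_mask])

# Keep only the first occurrence of each duplicate group
df_clean = df.drop_duplicates(subset=["customer_id", "email"], keep="first")
```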
What kinds of issues affect the quality of
data?
• Invalid values: some attributes have a known set of valid values, e.g. gender must be “F”
(Female) or “M” (Male). Here it is easy to detect wrong values.
• Formats: the most common issue; values can arrive in different formats, such as a name
written as “Name, Surname” or “Surname, Name”.
• Attribute dependencies: the value of one feature depends on the value of another feature.
For instance, in school data the “number of students” depends on whether the person
“is a teacher?”; if someone is not a teacher, they cannot have any students.
• Uniqueness: duplicated data can appear in features that only allow unique values.
For instance, an ID field cannot have two products with the same identifier.
• Missing values: features in the dataset may have blank or null values.
• Misspellings: values written incorrectly.
• Misfielded values: a field contains the values of another field.
How can I detect and fix these issues?

There are many methods that can be used to find these issues. For example:
• Visualization: visualize the values (or a random sample of them) to see whether they look right.
• Outlier analysis: analyse whether an extreme value could be a human error, e.g. a 200-year-old person in the “age” feature.
• Validation code: write code that checks whether the data is right. For instance, for uniqueness, check whether the length of the data is the same as the length of the vector of unique values (see the sketch below).
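A minimal sketch of such validation code in pandas, assuming hypothetical product_id and age columns:

```python
import pandas as pd

df = pd.DataFrame({"product_id": [101, 102, 102], "age": [34, 200, 28]})

# Uniqueness check: the number of distinct IDs should equal the number of rows
if df["product_id"].nunique() != len(df):
    print("Duplicate product IDs found")

# Outlier check: flag implausible ages for manual review
outliers = df[(df["age"] < 0) | (df["age"] > 120)]
print(outliers)
```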
We can apply many methods to fix the various issues:
• Misspelled data: replace the incorrect value with the most similar valid value (a short sketch follows this list).
• Uniqueness: replace one of the repeated values with a value that is not already present in the feature.
• Missing data: how missing data is handled is a key decision; null values can be replaced with the mean, median, or mode of the feature.
• Formats: ensure values use the same number of decimals and that dates use a consistent format.
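A minimal sketch of fixing misspelled categorical values by replacing them with the most similar valid value, using Python's standard difflib; the list of valid values is hypothetical.

```python
import difflib

valid_values = ["Female", "Male"]

def fix_spelling(value, valid_values, cutoff=0.6):
    """Return the closest valid value, or the original if nothing is similar enough."""
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(fix_spelling("Femle", valid_values))    # -> "Female"
print(fix_spelling("unknown", valid_values))  # -> "unknown" (left unchanged)
```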
Data Transformation in Data Mining
Meaning of Data
Transformation:
• In the data transformation process, data are transformed from one format into another format that is more appropriate for data mining.
• It is the process of converting data or information from one format to another, generally from the format of a source system into the required format of a new destination system.
• This process usually involves converting documents, whereas data conversion may involve converting a program from one computer language to another so that it can run on a different platform. Data transformation requires a special program that can read the data's original base language, determine the language into which the information has to be translated for it to be usable by the new program or system, and then carry out the translation.
Data Transformation has two key
phases:
Data Mapping: The mapping of elements from the source base or system to the destination, to capture all transformations that occur.
• This can become more difficult when there are complex transformations, such as many-to-one or one-to-many transformation rules.
Code Generation: The creation of the actual transformation program.
• The resulting data map specification is used to create an executable program that runs on the computer systems (a short sketch follows).
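As a minimal illustration of data mapping, the sketch below maps hypothetical source field names to destination field names and applies simple transformation rules; the field names and rules are assumptions, not from the original.

```python
# Hypothetical mapping from source fields to destination fields,
# each paired with a transformation rule
field_map = {
    "cust_nm": ("customer_name", str.strip),
    "dob":     ("date_of_birth", lambda v: v.replace("/", "-")),
    "amt":     ("amount",        float),
}

def transform_record(source_record):
    """Apply the data map to a single source record."""
    destination = {}
    for src_field, (dst_field, rule) in field_map.items():
        if src_field in source_record:
            destination[dst_field] = rule(source_record[src_field])
    return destination

print(transform_record({"cust_nm": " Aman ", "dob": "2001/05/14", "amt": "99.5"}))
# -> {'customer_name': 'Aman', 'date_of_birth': '2001-05-14', 'amount': 99.5}
```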
Data Transformation Strategies:
1. Smoothing: Smoothing is the process of removing noise from the data.
2. Aggregation: Aggregation is the process where summary or aggregation operations are applied to the data.
3. Generalization: By climbing concept hierarchies, low-level data are replaced with high-level data.
4. Normalization: Normalization scales attribute data so that they fall within a small specified range, such as 0.0 to 1.0 (a short sketch follows this list).
5. Attribute Construction: In attribute construction, new attributes are constructed from the given set of attributes.
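A minimal sketch of min-max normalization to the range 0.0 to 1.0 using pandas; the income column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [20000, 30000, 45000, 90000]})

# Min-max normalization: rescale values to fall within [0.0, 1.0]
col_min, col_max = df["income"].min(), df["income"].max()
df["income_normalized"] = (df["income"] - col_min) / (col_max - col_min)

print(df)
```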
Is it possible to transform the features to gain more information?
There are many methods that add information for the algorithm (a short sketch follows this list):
Data Binning or Bucketing: a pre-processing technique used to reduce the effects of minor observation errors. The values are split into intervals and replaced by categorical values.
Indicator variables: this technique converts categorical data into Boolean values by creating indicator variables. If a category has more than two values (n), n-1 indicator columns are created.
Centering & Scaling: a feature can be centered by subtracting its mean from all of its values; to scale it, the centered feature is divided by its standard deviation.

Other techniques:
Outliers can be grouped under a single value, or a value can be replaced with the number of times it appears in the feature.
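A minimal sketch of binning, indicator variables, and centering & scaling with pandas; the age, gender, and income columns are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [15, 27, 44, 70],
    "gender": ["F", "M", "F", "M"],
    "income": [12000, 30000, 52000, 41000],
})

# Binning / bucketing: split a numeric feature into intervals with categorical labels
df["age_bin"] = pd.cut(df["age"], bins=[0, 18, 35, 60, 120],
                       labels=["child", "young", "adult", "senior"])

# Indicator variables: n-1 Boolean columns for a categorical feature
df = pd.get_dummies(df, columns=["gender"], drop_first=True)

# Centering & scaling: subtract the mean, then divide by the standard deviation
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()

print(df)
```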
Data Transformation types and dimensional attributes
The main function of an Extract, Transform, and Load (ETL) tool is to transform data. The transformation step is the most important stage of building a structured data warehouse. The major transformation types are:
Format revision: A field may contain both numeric and text data types; the data type needs to be standardized (for example, changed to text) to provide values that can be used consistently.
Decoding of fields: In multiple source systems, the same data items are described by a variety of field values.
Many legacy systems are notorious for using cryptic codes to represent business values.
This Extract, Transform, and Load (ETL) transformation type changes such codes into values that make sense to end users (see the sketch below).
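A minimal sketch of decoding cryptic codes into readable values with pandas; the code-to-value mapping and column name are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"cust_status": ["A", "I", "P", "A"]})

# Hypothetical legacy codes decoded into values meaningful to end users
status_codes = {"A": "Active", "I": "Inactive", "P": "Pending"}
df["cust_status"] = df["cust_status"].map(status_codes)

print(df)
```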
Calculated and derived values: Calculating the total cost and the profit margin before the data are stored in the data warehouse is an instance of a calculated value. Storing a user's age separately would be an example of a derived value (see the sketch below).
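A minimal sketch of calculated and derived values in pandas; the column names and the derivation of age from a date of birth are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "quantity": [3, 5],
    "unit_cost": [10.0, 4.0],
    "revenue": [45.0, 30.0],
    "date_of_birth": pd.to_datetime(["1990-06-01", "2001-11-23"]),
})

# Calculated values: computed from other fields before loading into the warehouse
df["total_cost"] = df["quantity"] * df["unit_cost"]
df["profit_margin"] = (df["revenue"] - df["total_cost"]) / df["revenue"]

# Derived value: age derived from date of birth (approximate, in whole years)
df["age"] = (pd.Timestamp.today() - df["date_of_birth"]).dt.days // 365

print(df)
```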
Splitting single fields: In some source systems the first name, middle name, last name, and other values are stored as a single large text field. Storing the individual components of names and addresses in separate fields in the data repository improves operating performance by allowing indexing and analysis of the individual components (see the sketch below).
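A minimal sketch of splitting a single name field into separate components with pandas; the full_name column and the resulting field names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"full_name": ["John Michael Smith", "Mary Ann Jones"]})

# Split the single text field into first, middle, and last components.
# (Real data would need extra rules for names without a middle name.)
df[["first_name", "middle_name", "last_name"]] = df["full_name"].str.split(" ", n=2, expand=True)

print(df)
```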
Merging of information: This type of data transformation in Extract, Transform, and Load (ETL) does not literally mean merging various fields to create a single field. In this case, merging of data means establishing the relationship between various fields, such as product price, description, and package type, and viewing these fields as a single entity.
