Data Handling and Visualization 3rd Unit

Data Handling and Visualization: Data Acquisition, Data Pre-processing and Preparation,

Data Preprocessing in Data Science


Real-world datasets are generally messy, raw, incomplete, inconsistent, and therefore unusable as-is. They can contain manual entry errors, missing values, inconsistent schemas, etc. Data Preprocessing is the process of converting raw data into a format that is understandable and usable. It is a crucial step in any Data Science project, enabling efficient and accurate analysis, and it ensures that data quality is consistent before any Machine Learning or Data Mining techniques are applied.

Data Preprocessing is an important step in the Data Preparation stage of the Data Science development lifecycle and helps ensure reliable, robust, and consistent results. The main objective of this step is to check and ensure the quality of data before applying any Machine Learning or Data Mining methods. Let’s review some of its benefits -

 Accuracy - Data Preprocessing ensures that input data is accurate and reliable by removing manual entry errors, duplicates, etc.
 Completeness - It ensures that missing values are handled and the data is complete for further analysis.
 Consistency - Data Preprocessing ensures that input data is consistent, i.e., copies of the same data kept in different places should match.
 Timeliness - It checks whether data is updated regularly and in a timely manner.
 Trustworthiness - It checks whether data comes from trustworthy sources.
 Interpretability - Raw data is generally unusable, and Data Preprocessing converts it into an interpretable format.

Key Steps in Data Preprocessing

Let’s explore a few of the key steps involved in the Data Preprocessing stage -

Data Cleaning

Data Cleaning uses methods to handle incorrect, incomplete, inconsistent, or missing values.
Some of the techniques for Data Cleaning include -

 Handling Missing Values
o Input data can contain missing or NULL values, which must be handled before applying any Machine Learning or Data Mining techniques.
o Missing values can be handled in many ways, such as removing rows/columns containing NULL values or imputing NULL values using the mean, mode, regression, etc. (a short sketch follows this list).
 De-noising
o De-noising is the process of removing noise from the data. Noisy data is meaningless data that is not interpretable or understandable by machines or humans. It can occur due to data entry errors, faulty data collection, etc.
o De-noising can be performed with many techniques, such as binning the features, using regression to smooth the features and reduce noise, clustering to detect outliers, etc.
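
As a rough illustration of these cleaning techniques, the following sketch uses pandas on a small made-up table (the column names age, city, and income are illustrative, not from the text); it imputes NULL values with the mean and mode and smooths a numeric feature by binning:

import pandas as pd

# Small made-up dataset with missing values and a possibly noisy numeric feature
df = pd.DataFrame({
    "age": [25, None, 47, 31, None, 52],
    "city": ["Delhi", "Pune", None, "Pune", "Delhi", "Pune"],
    "income": [42000, 39500, 870000, 41000, 40200, 43800],
})

# Option 1: drop any row that contains a NULL value
dropped = df.dropna()

# Option 2: impute NULLs -- mean for numeric columns, mode for categorical ones
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

# De-noising by binning: replace each income with the mean of its equal-width bin
bins = pd.cut(imputed["income"], bins=3)
imputed["income_smoothed"] = imputed.groupby(bins, observed=True)["income"].transform("mean")
print(imputed)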

Data Integration

Data Integration can be defined as combining data from multiple sources. A few of the issues
to be considered during Data Integration include the following -

 Entity Identification Problem - This is the problem of identifying objects/features from multiple databases that correspond to the same entity. For example, customer_id in database A and customer_number in database B may refer to the same entity.
 Schema Integration - It is used to merge two or more database schemas/metadata into a single schema. It essentially takes two or more schemas as input and determines a mapping between them. For example, the entity type CUSTOMER in one schema may be named CLIENT in another schema.
 Detecting and Resolving Data Value Conflicts - The same data can be stored in different ways in different databases, and these differences need to be reconciled when integrating them into a single dataset. For example, dates can be stored in various formats such as DD/MM/YYYY, YYYY/MM/DD, or MM/DD/YYYY (a short sketch follows this list).
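
A minimal sketch of these integration issues, assuming two made-up tables in which the same customer is keyed as customer_id in one source and customer_number in the other, and dates arrive in different formats:

import pandas as pd

# Source A keys customers by "customer_id" and stores dates as YYYY/MM/DD
source_a = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ravi", "Meera"],
    "signup_date": ["2022/11/05", "2023/01/20", "2023/03/14"],
})

# Source B keys the same customers by "customer_number" and stores dates as DD/MM/YYYY
source_b = pd.DataFrame({
    "customer_number": [1, 2, 3],
    "last_order_date": ["15/01/2023", "16/02/2023", "03/04/2023"],
})

# Resolving data value conflicts: parse each source's dates with its own format
source_a["signup_date"] = pd.to_datetime(source_a["signup_date"], format="%Y/%m/%d")
source_b["last_order_date"] = pd.to_datetime(source_b["last_order_date"], format="%d/%m/%Y")

# Entity identification / schema integration: map both key columns to one name, then merge
source_b = source_b.rename(columns={"customer_number": "customer_id"})
combined = source_a.merge(source_b, on="customer_id", how="inner")
print(combined)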

Data Reduction

Data Reduction is used to reduce the volume or size of the input data. Its main objective is to
reduce storage and analysis costs and improve storage efficiency. A few of the popular
techniques to perform Data Reduction include -

 Dimensionality Reduction - It is the process of reducing the number of features in the input dataset. It can be performed in various ways, such as selecting the features with the highest importance, Principal Component Analysis (PCA), etc. (a short sketch follows this list).
 Numerosity Reduction - In this method, various techniques can be applied to reduce the
volume of data by choosing alternative smaller representations of the data. For example, a
variable can be approximated by a regression model, and instead of storing the entire variable,
we can store the regression model to approximate it.
 Data Compression - In this method, data is compressed. Data Compression can be lossless or lossy, depending on whether information is lost during compression.
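
As a small sketch of Dimensionality Reduction, the snippet below applies Principal Component Analysis with scikit-learn to a synthetic dataset (the data and the choice of two components are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic dataset: 100 samples with 6 correlated features
rng = np.random.default_rng(42)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4)) + 0.05 * rng.normal(size=(100, 4))])

# Put features on a comparable scale before PCA
X_scaled = StandardScaler().fit_transform(X)

# Reduce the 6 original features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
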
Data Transformation

Data Transformation is a process of converting data into a format that helps in building
efficient ML models and deriving better insights. A few of the most common methods for
Data Transformation include -

 Smoothing - Data Smoothing is used to remove noise in the dataset, and it helps identify
important features and detect patterns. Therefore, it can help in predicting trends or future
events.
 Aggregation - Data Aggregation is the process of transforming large volumes of data into an organized, summarized format that is easier to understand and analyze. For example, a company may look at monthly sales data for a product instead of raw sales transactions to understand its performance better and forecast future sales.
 Discretization - Data Discretization is the process of converting numerical or continuous variables into a set of intervals/bins, which makes the data easier to analyze. For example, an age feature can be converted into intervals such as (0-10, 11-20, …) or categories such as (child, young, …).
 Normalization - Data Normalization is the process of rescaling a numeric variable into a specified range such as [-1, 1] or [0, 1]. A few of the most common approaches are Min-Max Normalization and Data Standardization (also called Data Scaling). A short sketch of discretization and normalization follows this list.
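
The sketch below illustrates Discretization and Min-Max Normalization on a made-up age/salary table (the bin edges, labels, and column names are illustrative):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "age": [4, 17, 25, 38, 52, 71],
    "salary": [0, 12000, 38000, 65000, 90000, 30000],
})

# Discretization: convert the continuous age variable into labelled intervals
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 10, 20, 40, 60, 100],
    labels=["child", "teen", "young adult", "adult", "senior"],
)

# Min-Max normalization: rescale salary into the range [0, 1]
df["salary_scaled"] = MinMaxScaler().fit_transform(df[["salary"]])
print(df)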

Applications of Data Preprocessing

Data Preprocessing is important in the early stages of the development lifecycle of Machine Learning and AI applications. A few of the most common applications include -

 Improved Accuracy of ML Models - The various techniques used to preprocess data, such as Data Cleaning and Transformation, ensure that data is complete, accurate, and understandable, resulting in efficient and accurate ML models.
 Reduced Costs - Data Reduction techniques can help companies save storage and compute costs by reducing the volume of the data.
 Visualization - Preprocessed data is easily consumable and understandable, and it can be used to build dashboards that provide valuable insights.

Conclusion

 Data Preprocessing is the process of converting raw datasets into a format that is consumable, understandable, and usable for further analysis. It is an important step in any Data Analysis project, ensuring the accuracy, consistency, and completeness of the input datasets.
 The key steps in this stage include - Data Cleaning, Data Integration, Data Reduction, and
Data Transformation.

 It can help build accurate ML models, reduce analysis costs, and support dashboards built on top of the prepared data.

Data Acquisition
