
Data Preparation

● It is rare that you get data in exactly the form you need it. Often you’ll
need to create some new variables, rename existing ones, reorder the
observations, or drop records in order to make the data a little easier to work
with.
● Data sets commonly have issues with:
1. Accuracy.
2. Quality.
3. Consistency.
4. Irrelevant data.
What is Data Preparation?

● The data preparation process transforms raw data from multiple sources into a
standardized format. This ‘preparation’ makes the data ready for exploration
and analysis.

● Data preparation is often referred to informally as data prep. It is also known as
data wrangling: the process of combining, cleansing, structuring, and
transforming data so it can be used in business intelligence, analytics, and
visualization applications.
Data Preparation

● A data scientist spends about 80% of the time preparing data.

● It is an important and non-negotiable step before the data is ready to be
explored and analyzed.
Importance of Data Preparation
● The importance of data preparation can be measured by this simple fact:
your analytics are wholly dependent on your data. If you feed garbage to
the system, the analytics you receive will be garbage as well (garbage in,
garbage out: GIGO). The power of data lies in how it is captured,
processed, and turned into actionable insights.
● For example: the data should be in the correct scale and format, and
contain meaningful features for the problem we want the machine to solve.
Importance of Data Preparation
1. Ensure data produces reliable analytics results.
2. Identify and fix data issues that might otherwise go undetected.
3. Enable more informed business decision making.
4. Reduce data management and analytics costs.
(Figure: Projected worldwide spending on data preparation)
Data Preparation Steps: How is Data Prepared?

● Here are the five major data preparation steps used by data experts everywhere:

1. Load the data.
2. Clean the data.
3. Validate the data.
4. Transform and enrich the data.
5. Start the ETL process.
Data Preparation
● Load the data set and store it in a DataFrame.
- The data could be stored in different formats, such as:
● CSV files (.csv)
● Excel files (.xlsx)
● Text files (.txt)
● SQL databases
● APIs (JSON)
In the data preparation stage, the data is loaded and read into a DataFrame.
Common dataset websites
1. https://archive.ics.uci.edu/ml/datasets.php

2. https://www.kaggle.com/
Step 1: Reading data into a DataFrame
Loading data into a DataFrame; the data could be in different formats,
e.g. .xlsx and .csv, using the pd.read_csv / pd.read_excel functions from the Pandas library.
Reading a CSV dataset:
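A minimal sketch of reading a CSV file with Pandas (the file name data.csv is illustrative, not from the original slides):

import pandas as pd

# Read a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("data.csv")

# Excel files are read similarly (.xlsx needs the openpyxl package)
# df = pd.read_excel("data.xlsx")

# Inspect the first few rows to confirm the load
print(df.head())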
Step 2: Handling Missing Values

● Perhaps the data was not available, not applicable, or the event did not
happen. It could be that the person who entered the data did not know the
right value, or missed filling it in. Data mining methods vary in the way they
treat missing values.

● There should be a strategy to treat missing values; let’s see how we can do it.
Step 2: Handling Missing Values

Some default missing values:

● NA: Not Available / Not Applicable
● N/A: Not Available / Not Applicable
● NaN: Not a Number
● <Empty Cell>
● Null
Checking for NULL values (NA’s)
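A short sketch of this check, assuming df is the DataFrame loaded in Step 1:

# Count missing values (NaN / None) per column
print(df.isnull().sum())

# Total number of missing values in the whole DataFrame
print(df.isnull().sum().sum())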
Handling Missing Data
1. Remove the missing data:

● Deleting any NaN or Null value is the process of removing the entire record that contains
the missing value. Although it is a simple process, its disadvantage is a reduction of the power of
the model as the sample size decreases.

❖ The advantage of this method is that it is a quick and dirty way of fixing the missing-values issue. But
it is not always the go-to method, as you might sometimes end up losing critical information by
deleting records or entire features.
Deleting Missing Values
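A minimal sketch using Pandas dropna(), assuming df is the loaded DataFrame:

# Drop every row that contains at least one missing value
df_rows_dropped = df.dropna()

# Or drop entire columns (features) that contain missing values
df_cols_dropped = df.dropna(axis=1)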
2. Retain the data through imputation

● Imputation (filling) overcomes the problem of removing missing
records and produces a complete dataset that can be used for analysis and
modeling.

● The gaps can be filled with values such as the mean, median, mode, minimum, maximum,
the previous value, the next value, or any other value.

● Values can also be interpolated by using the interpolate() function.
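A one-line sketch of interpolation in Pandas (df as before):

# Fill numeric gaps by linear interpolation between neighboring values
df_interp = df.interpolate(method="linear")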


1. Last observation carried forward (LOCF)
● Also commonly known as forward filling.

● It is the process of replacing a missing value with the last observed record. It is a
widely used imputation method for time-series data. This method is advantageous
because it is easy to communicate, but it is based on the assumption that the
outcome remains unchanged by the missing data, which is often unlikely.
Last observation carried forward (LOCF)
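A minimal forward-filling sketch in Pandas (df as before):

# LOCF: propagate the last observed value forward into each gap
df_ffill = df.ffill()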
2. Next observation carried backward (NOCB)
● As the name suggests, it is the exact opposite of forward filling, and it is also
commonly known as backward filling.

● It takes the first observation after the missing value and carries it backward.
Next Observation Carried Backward (NOCB)
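A minimal backward-filling sketch in Pandas (df as before):

# NOCB: carry the first observation after each gap backward
df_bfill = df.bfill()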
3. Mean, Mode and Median imputation

● Imputation is a way to fill in the missing values with estimated ones. The
objective is to employ known relationships that can be identified in the valid
values of the data set to assist in estimating the missing values. For numeric
data, mean / median imputation is one of the most frequently used
methods, while for categorical data the mode is preferred.

❖ The advantage of this method is that we don’t remove the data, which prevents
data loss.
❖ The drawback is that you don’t know how accurate using the mean, median,
or mode is going to be in a given situation.
Mean Imputation
Median Imputation
Minimum Value Imputation
Maximum Value Imputation
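A sketch of these four fill strategies on one numeric column; the column name "age" is hypothetical, and each line is an alternative, not a sequence:

col = "age"  # illustrative column name

df[col] = df[col].fillna(df[col].mean())    # mean imputation
df[col] = df[col].fillna(df[col].median())  # median imputation
df[col] = df[col].fillna(df[col].min())     # minimum value imputation
df[col] = df[col].fillna(df[col].max())     # maximum value imputation

# For a categorical column, use the mode instead ("city" is illustrative)
df["city"] = df["city"].fillna(df["city"].mode()[0])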
Step 3: Check for Duplicates

● The presence of a copy of an original record is called a duplicate record.

● Duplicated data can be a reason for inaccurate model performance, and it
can bias the data and corrupt the results.
Duplicated Data:
● Checking for the existence of duplicated data and counting the duplicated records:
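A short sketch of the check in Pandas (df as before):

# Boolean mask marking rows that are exact copies of an earlier row
print(df.duplicated())

# Count the duplicated rows
print(df.duplicated().sum())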
Duplicated Data
● Remove the duplicated data by using the Pandas function drop_duplicates():
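A minimal sketch:

# Keep the first occurrence of each record and drop the copies
df_unique = df.drop_duplicates()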
Step 4: Separating categorical and numerical data.
● Categorical data and numerical data need different kinds of treatment
because of their different natures.
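A sketch of the split using Pandas select_dtypes (df as before):

# Numerical columns (integers and floats)
numeric_df = df.select_dtypes(include=["number"])

# Categorical / text columns
categorical_df = df.select_dtypes(include=["object", "category"])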
