
Data Preparation

● It is rare that you get data in exactly the form you need it. Often you’ll
need to create some new variables, rename existing ones, reorder the
observations, or drop records in order to make the data a little easier to work
with.
● Data sets commonly have issues with:
1. Accuracy.
2. Quality.
3. Consistency.
4. Irrelevant data.
What is Data Preparation?

● The data preparation process transforms raw data from multiple sources into a
standardized format. This ‘preparation’ makes the data ready for exploration
and analysis.

● Data preparation is often referred to informally as data prep. It is also known as
data wrangling: the process of combining, cleansing, structuring, and
transforming data so it can be used in business intelligence, analytics, and
visualization applications.
Data Preparation

● A data scientist spends about 80% of the time preparing data.

● It is an important and non-negotiable step before the data is ready to be
explored and analyzed.
Importance of Data Preparation
● The importance of data preparation can be measured by this simple fact:
your analytics are wholly dependent on your data. If you feed garbage to
the system, the analytics you receive will be garbage as well (garbage in,
garbage out: GIGO). The power of data lies in how it is captured,
processed, and turned into actionable insights.
● For example: the data should be in the correct scale and format, and
contain meaningful features for the problem we want the machine to solve.
Importance of Data Preparation
1. Ensure data produces reliable analytics results.
2. Identify and fix data issues that might otherwise go undetected.
3. Enable more informed business decision making.
4. Reduce data management and analytics costs.
(Figure: Projected worldwide spending on data preparation)
Data Preparation Steps: How is Data Prepared?

● Here are the five major data preparation steps used by data experts everywhere:

1. Load the data.
2. Clean the data.
3. Validate the data.
4. Transform and enrich the data.
5. Start the ETL process.
Data Preparation
● Load the data set and store it in a DataFrame.
- The data could be stored in different formats, such as:
● CSV files (.csv)
● Excel files (.xlsx)
● Text files (.txt)
● SQL databases
● APIs (JSON)
In the data preparation stage, the data is loaded and read into a DataFrame.
Common dataset websites
1. https://archive.ics.uci.edu/ml/datasets.php

2. https://www.kaggle.com/
Step 1: Reading data into a DataFrame
Loading data into a DataFrame; the data could be in different formats,
e.g. .xlsx and .csv, using the pd.read_csv / pd.read_excel functions from the Pandas library.
Reading a CSV dataset:
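A minimal sketch of reading a CSV file with Pandas (the file name data.csv is illustrative, not from the original slides):

import pandas as pd

# Read a CSV file into a DataFrame (file name is hypothetical)
df = pd.read_csv("data.csv")

# Excel files are read similarly (.xlsx needs the openpyxl package)
# df = pd.read_excel("data.xlsx")

# Inspect the first few rows to confirm the load
print(df.head())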
Step 2: Handling Missing Values

● Perhaps the data was not available, not applicable, or the event did not
happen. It could be that the person who entered the data did not know the
right value, or missed filling it in. Data mining methods vary in the way they
treat missing values.

● There should be a strategy to treat missing values; let’s see how we can do it.
Step 2: Handling Missing Values

Some default missing values:

● NA: Not Available / Not Applicable
● N/A: Not Available / Not Applicable
● NaN: Not a Number
● <Empty Cell>
● Null
Checking for NULL values (NA’s)
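A short sketch of this check, assuming df is the DataFrame loaded in Step 1:

# Count missing values (NaN / None) per column
print(df.isnull().sum())

# Total number of missing values in the whole DataFrame
print(df.isnull().sum().sum())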
Handling Missing Data
1. Remove the missing data:

● Deleting any NaN or Null value is the process of removing the entire record that contains
the missing value. Although it is a simple process, its disadvantage is a reduction of the power of
the model as the sample size decreases.

❖ The advantage of this method is that it is a quick and dirty way of fixing the missing-values issue. But
it is not always the go-to method, as you might sometimes end up losing critical information by
deleting records or entire features.
Deleting Missing Values
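A minimal sketch using Pandas dropna(), assuming df is the loaded DataFrame:

# Drop every row that contains at least one missing value
df_rows_dropped = df.dropna()

# Or drop entire columns (features) that contain missing values
df_cols_dropped = df.dropna(axis=1)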
2. Retain the data through imputation

● Imputation (filling) overcomes the problem of removing missing
records and produces a complete dataset that can be used for analysis and
modeling.

● The gaps can be filled with values such as the mean, median, mode, minimum, maximum,
the previous value, the next value, or any other value.

● Values can also be interpolated by using the interpolate() function.
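A one-line sketch of interpolation in Pandas (df as before):

# Fill numeric gaps by linear interpolation between neighboring values
df_interp = df.interpolate(method="linear")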


1. Last observation carried forward (LOCF)
● Also commonly known as forward filling.

● It is the process of replacing a missing value with the last observed record. It is a
widely used imputation method for time-series data. This method is advantageous
because it is easy to communicate, but it is based on the assumption that the
outcome remains unchanged by the missing data, which is often unlikely.
Last observation carried forward (LOCF)
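A minimal forward-filling sketch in Pandas (df as before):

# LOCF: propagate the last observed value forward into each gap
df_ffill = df.ffill()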
2. Next observation carried backward (NOCB)
● As the name suggests, it is the exact opposite of forward filling, and it is also
commonly known as backward filling.

● It takes the first observation after the missing value and carries it backward.
Next Observation Carried Backward (NOCB)
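A minimal backward-filling sketch in Pandas (df as before):

# NOCB: carry the first observation after each gap backward
df_bfill = df.bfill()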
3. Mean, Mode and Median imputation

● Imputation is a way to fill in the missing values with estimated ones. The
objective is to employ known relationships that can be identified in the valid
values of the data set to assist in estimating the missing values. For numeric
data, mean / median imputation is one of the most frequently used
methods, while for categorical data the mode is preferred.

❖ The advantage of this method is that we don’t remove the data, which prevents
data loss.
❖ The drawback is that you don’t know how accurate using the mean, median,
or mode is going to be in a given situation.
Mean Imputation
Median Imputation
Minimum Value Imputation
Maximum Value Imputation
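A sketch of these four fill strategies on one numeric column; the column name "age" is hypothetical, and each line is an alternative, not a sequence:

col = "age"  # illustrative column name

df[col] = df[col].fillna(df[col].mean())    # mean imputation
df[col] = df[col].fillna(df[col].median())  # median imputation
df[col] = df[col].fillna(df[col].min())     # minimum value imputation
df[col] = df[col].fillna(df[col].max())     # maximum value imputation

# For a categorical column, use the mode instead ("city" is illustrative)
df["city"] = df["city"].fillna(df["city"].mode()[0])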
Step 3: Check for Duplicates

● The presence of a copy of an original record is called a duplicate record.

● Duplicated data can be a reason for inaccurate model performance, and it
can bias the data and corrupt the results.
Duplicated Data:
● Checking for the existence of duplicated data and counting the duplicated records:
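A short sketch of the check in Pandas (df as before):

# Boolean mask marking rows that are exact copies of an earlier row
print(df.duplicated())

# Count the duplicated rows
print(df.duplicated().sum())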
Duplicated Data
● Remove the duplicated data by using the Pandas function drop_duplicates():
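A minimal sketch:

# Keep the first occurrence of each record and drop the copies
df_unique = df.drop_duplicates()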
Step 4: Separating categorical and numerical data.
● Categorical data and numerical data need different kinds of treatment
because of their different natures.
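A sketch of the split using Pandas select_dtypes (df as before):

# Numerical columns (integers and floats)
numeric_df = df.select_dtypes(include=["number"])

# Categorical / text columns
categorical_df = df.select_dtypes(include=["object", "category"])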
