
Data Pre-Processing in Machine Learning & Model Building
Trainer: Ms. Nidhi Grover Raheja
Data Pre-processing in Machine Learning
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step in creating a machine learning model.
• When creating a machine learning project, it is not always the case that we come across clean, formatted data.
• Thus, before doing any operation with data, it is essential to clean it and put it into a usable format. For this, we perform data preprocessing tasks.
Data Pre-processing in Model Building
Data preprocessing includes the steps we need to follow to transform or encode data so that it can be easily parsed by the machine.
Need of Data Pre-Processing Tasks
• Quality decisions must be based on quality data.
• Data preprocessing is important to get this quality data, without which it would just be a Garbage In, Garbage Out scenario.
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing comprises the tasks required for cleaning the data and making it suitable for a machine learning model, which also increases the model's accuracy and efficiency.
7 Steps of Data Pre-Processing
1. Gathering the data
2. Importing the dataset & libraries
3. Dividing the dataset into dependent & independent variables
4. Checking for missing values
5. Encoding categorical values
6. Splitting the dataset into training & test sets
7. Feature scaling
1. Data Gathering
• Here is a curated list of websites you can get datasets from:
1. Kaggle: my personal favorite place to get datasets. https://www.kaggle.com/datasets
2. UCI Machine Learning Repository: one of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. http://mlr.cs.umass.edu/ml/
3. This super awesome GitHub repo has tons of links to high-quality datasets. Link: https://github.com/awesomedata/awesome-public-datasets
4. The Zanran numerical data search engine: http://www.zanran.com/q/
5. And if you are looking for governments' open data, here are a few portals:
Indian Government: http://data.gov.in
US Government: https://www.data.gov/
British Government: https://data.gov.uk/
French Government: https://www.data.gouv.fr/en/
2. Importing Libraries
• NumPy
• Pandas
• Matplotlib and/or Seaborn
• Scikit-Learn
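
A minimal sketch of the usual import block, using the conventional aliases for these libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn is usually imported module by module, for example:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler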
3. Import Dataset and Perform 2 Steps
• Step 1: We can use the read_csv function as below:
import pandas as pd
dataset_frame = pd.read_csv('Dataset.csv')
• Step 2: Extract the dependent and independent variables from the data frame.
• Note: A dependent variable is a variable whose value depends on another variable, whereas an independent variable is a variable whose value does not depend on another variable.
• Example 1: A teacher decides to analyze the effect of revision time and intelligence on the test performance of 100 students.
 Independent variables: Revision time (measured in hours) and Intelligence (measured using IQ score)
 Dependent variable: Test mark (measured from 0 to 100)
• Example 2: How does an increment affect employees' motivation?
 Independent variable: Increment
 Dependent variable: Employees' motivation
• Example 3: How can higher education lead to higher income?
 Independent variable: Higher education
 Dependent variable: Higher income
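
A minimal sketch of both steps, assuming a hypothetical 'Dataset.csv' whose last column is the dependent variable:

import pandas as pd

# Step 1: load the dataset into a data frame
dataset_frame = pd.read_csv('Dataset.csv')

# Step 2: independent variables (X) = all columns except the last;
# dependent variable (y) = the last column
X = dataset_frame.iloc[:, :-1].values
y = dataset_frame.iloc[:, -1].values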
Independent Vs Dependent Variable
4. Handling Missing Data
Ways to handle missing data:
• There are mainly two ways to handle missing data:
1. By deleting the particular row
2. By imputation: replacing missing entries with the mean or a constant value
To handle missing values, we will use the Scikit-learn library in our code, which contains various utilities for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (the older Imputer class of sklearn.preprocessing has been removed from recent versions).
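
A minimal sketch of both options on a toy data frame (the column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [25, np.nan, 35, 40],
                   'Salary': [50000, 60000, np.nan, 80000]})

# Option 1: delete the rows that contain missing values
df_dropped = df.dropna()

# Option 2: impute missing entries with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)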
5. Encoding Data
https://pub.towardsai.net/5-useful-encoding-techniques-in-machine-learning-f735567399f4
All about Categorical Variable Encoding | by Baijayanta Roy | Towards Data Science

• Encoding is a technique for converting categorical variables into numerical values so that they can be easily fitted to a machine learning model.
• Nominal variables are variables that have no inherent
order. They are simply categories that can be
distinguished from each other.
• Ordinal variables have an inherent order. They can be
ranked from highest to lowest or vice versa.

Nominal: where the order of data does not matter. Ordinal: where the order of data matters.


Case Study for Encoding
Suppose a dataset has an education column which we will use to predict the salary of a person. The education column has categories like 'Bachelors', 'Masters' and 'PhD'. Based on these categories we can rearrange the column and assign a rank to each category: based on education level, 'PhD' gets the highest rank (PhD-1, Masters-2, Bachelors-3).
Ordinal Encoding or Label Encoding
• This categorical data encoding technique is used when the categorical feature is ordinal.
• In label encoding, every label is converted into an integer value. We encode the text values by assigning a running sequence to each distinct text value.
• We will create a variable that contains the categories representing the qualification of a person. Ranks are assigned based on the importance of the category.
Disadvantage: Label encoding implies a hierarchy among a column's values, which is misleading for nominal features present in the dataset.
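
A minimal sketch of the qualification example, once with a hand-crafted rank map matching the case study and once with scikit-learn's OrdinalEncoder:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Qualification': ['Bachelors', 'PHD', 'Masters', 'PHD']})

# Hand-crafted ranks matching the case study (PHD-1, Masters-2, Bachelors-3)
rank_map = {'PHD': 1, 'Masters': 2, 'Bachelors': 3}
df['Qualification_rank'] = df['Qualification'].map(rank_map)

# The same idea with scikit-learn, listing categories from lowest to highest
encoder = OrdinalEncoder(categories=[['Bachelors', 'Masters', 'PHD']])
df['Qualification_ordinal'] = encoder.fit_transform(df[['Qualification']]).ravel()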
Target Guided Ordinal Encoding
• In this method, we calculate the mean of the output for each category of the variable and then rank the categories by that mean, as the sketch below illustrates.
• We can apply this technique to ordinal variables, but we cannot do this with nominal variables, since we don't know the order of nominal variables, unlike ordinal variables where the order is known.
• To overcome this limitation for nominal variables we use another technique called Mean Encoding.
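
A minimal sketch with pandas, assuming a hypothetical categorical column 'City' and a numeric target 'Salary':

import pandas as pd

df = pd.DataFrame({'City':   ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Pune'],
                   'Salary': [40, 55, 45, 30, 60, 35]})

# Mean of the target per category, then rank the categories by that mean
means = df.groupby('City')['Salary'].mean().sort_values()
ranks = {city: rank for rank, city in enumerate(means.index, start=1)}
df['City_encoded'] = df['City'].map(ranks)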
One Hot Encoding
• This is a categorical data encoding technique used when the features are nominal (don't have an order).
• In one hot encoding, for every level of a categorical variable, we create a new variable. Each category is represented by a binary variable containing either 1 or 0, where 1 represents the presence of that category and 0 represents its absence.
• One hot encoding overcomes the problem of label encoding.
• For example, suppose we have a column containing 3 categories; then in one hot encoding, 3 columns will be created, one for each category.
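
A minimal sketch with pandas' get_dummies on an illustrative 'Colour' column:

import pandas as pd

df = pd.DataFrame({'Colour': ['Red', 'Blue', 'Green', 'Blue']})

# One binary column per category: 1 marks presence, 0 absence
one_hot = pd.get_dummies(df['Colour'], prefix='Colour', dtype=int)
print(one_hot)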
Dummy Variables Trap
• The binary features newly created in one-hot encoding are known as dummy variables. How many dummy variables there are depends on the number of levels present in the categorical variable.
• The dummy variable trap occurs when two or more dummy variables created by one-hot encoding are highly correlated (multicollinear). This means that one variable can be predicted from the others, making it difficult to interpret the estimated coefficients in regression models.
We can skip the last column 'Green', as 0,0 already signifies green. This means that if we have 'n' categories, one hot encoding should create 'n-1' columns.
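
Continuing the sketch above, get_dummies can drop one redundant column directly:

# With n categories, n-1 dummies suffice: all zeros stands for the dropped one
one_hot = pd.get_dummies(df['Colour'], prefix='Colour', drop_first=True, dtype=int)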
Disadvantage of One-Hot Encoding
• Suppose we have a column with 100 distinct categories. If we try to convert them into dummy variables, we will get 99 columns. This increases the dimension of the overall dataset, which leads to the curse of dimensionality.
• So basically, if a column has a large number of categories, we should not apply this technique.
One Hot Encoding (Multiple Categories) — Nominal Categories
• In this method, we consider only the categories with the most repetitions, for example the top 10 most frequent categories, and apply one-hot encoding to those categories only, as the sketch below shows.
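
A minimal sketch, assuming df has a high-cardinality nominal column 'City' (a hypothetical name):

# The 10 most frequent categories in the column
top_10 = df['City'].value_counts().head(10).index

# One dummy column for each of those categories only;
# rows outside the top 10 get 0 in every dummy column
for category in top_10:
    df[f'City_{category}'] = (df['City'] == category).astype(int)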



Binary Encoding
• Binary encoding is a technique for transforming categorical data into numerical data by encoding categories as integers and then converting those integers into binary code.
• Binary encoding is a special case of one hot encoding in which binary digits, 0 and 1, are used for the encoding.
• This technique is preferable when there is a large number of categories. Suppose you have 100 or more different categories: one hot encoding will create 100 or more different columns, but binary encoding only needs 7 columns, since the binary code of 100 is 1100100.
For binary encoding, one has to follow these steps:
• The categories are first converted to numeric order starting from 1 (the order is created as categories appear in the dataset and does not imply any ordinal nature).
• Those integers are then converted into binary code, so, for example, 3 becomes 011 and 4 becomes 100.
• The digits of the binary number then form separate columns, as sketched below.
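
A minimal pure-pandas sketch of these three steps on an illustrative column (the third-party category_encoders package also offers a ready-made BinaryEncoder):

import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Pune', 'Delhi', 'Chennai']})

# Step 1: integer codes starting from 1, in order of appearance
codes = {cat: i + 1 for i, cat in enumerate(df['City'].unique())}
df['code'] = df['City'].map(codes)

# Steps 2-3: the binary digits of each code, one column per bit
n_bits = max(codes.values()).bit_length()
for shift in range(n_bits - 1, -1, -1):
    df[f'City_bin_{shift}'] = (df['code'] // (2 ** shift)) % 2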
Mean Encoding
• Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. In mean target encoding, the label for each category of the feature is decided by the mean value of the target variable on the training data.
• The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning.
• The mean encoding approach is as follows (see the sketch after this list):
1. Select a categorical variable you would like to transform.
2. Group by the categorical variable and obtain the aggregated sum over the 'Target' variable (total number of 1's for each category in 'Temperature').
3. Group by the categorical variable and obtain the aggregated count over the 'Target' variable.
4. Divide the step 2 result by the step 3 result and join it back to the training data.
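
A minimal sketch using the slide's 'Temperature' feature with an illustrative binary 'Target'; steps 2-4 collapse into a per-category mean:

import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Hot', 'Mild', 'Cold', 'Hot'],
                   'Target':      [1, 0, 1, 1, 0, 0]})

# Mean of the target per category (sum / count), mapped back onto the column
mean_map = df.groupby('Temperature')['Target'].mean()
df['Temperature_encoded'] = df['Temperature'].map(mean_map)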
Dataset Splitting
When splitting a dataset there are two competing concerns:
1. If you have less training data, your parameter estimates have greater variance.
2. If you have less testing data, your performance statistic will have greater variance.
The data should be divided in such a way that neither variance is too high, which depends mostly on the amount of data you have. If your data is too small then no split will give you satisfactory variance and you will have to do cross-validation, but if your data is huge then it doesn't really matter whether you choose an 80:20 or a 90:10 split.
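
A minimal sketch of an 80:20 split with scikit-learn, assuming the X and y arrays from the extraction step earlier:

from sklearn.model_selection import train_test_split

# test_size=0.2 gives an 80:20 split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)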
Feature Scaling
• Feature scaling is the process of normalising the range of features in a dataset.
• Real-world datasets often contain features that vary in magnitude, range and units. Therefore, for machine learning models to interpret these features on the same scale, we need to perform feature scaling.
• More specifically, we will look at 3 different scalers in the Scikit-learn library for feature scaling:
1. Normalisation or MinMaxScaler
2. Standardisation or StandardScaler
3. RobustScaler
Normalisation
• Normalisation, also known as min-max scaling, is a scaling technique whereby the values in a column are shifted and rescaled so that they are bounded within a fixed range of 0 and 1.
• MinMaxScaler is the Scikit-learn class for normalisation.
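
A minimal sketch on a toy array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Each column is rescaled to [0, 1] via (x - min) / (max - min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)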
Standardisation
• On the other hand, standardisation or Z-score normalisation is another scaling technique whereby the values in a column are rescaled so that they demonstrate the properties of a standard Gaussian distribution, that is, mean = 0 and variance = 1.
• StandardScaler is the Scikit-learn class for standardisation.
• It standardises a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
• StandardScaler makes the mean of the distribution 0. For roughly Gaussian data, about 68% of the values will then lie between -1 and 1.
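
A minimal sketch on the same toy array:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Each column is rescaled to mean 0 and unit variance via (x - mean) / std
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)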
