
Data Pre-Processing in Machine Learning & Model Building
Trainer: Ms. Nidhi Grover Raheja
Data Pre-processing in Machine Learning
• Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model.
• It is the first and most crucial step in creating a machine learning model.
• When creating a machine learning project, it is not always the case that we come across clean, formatted data.
• Thus, before doing any operation with data, it is essential to clean it and put it into a usable format. For this, we perform data preprocessing tasks.
Data Pre-processing in Model Building
Data preprocessing includes the steps we need to follow to transform or encode data so that it can be easily parsed by the machine.
Need of Data Pre-Processing Tasks
• Quality decisions must be based on quality data.
• Data preprocessing is important to get this quality data, without which it would just be a Garbage In, Garbage Out scenario.
• Real-world data generally contains noise and missing values, and may be in an unusable format that cannot be fed directly to machine learning models.
• Data preprocessing comprises the tasks required for cleaning the data and making it suitable for a machine learning model, which also increases the model's accuracy and efficiency.
7 Steps of Data Pre-Processing
1. Gathering the data
2. Importing the dataset & libraries
3. Dividing the dataset into dependent & independent variables
4. Checking for missing values
5. Encoding categorical values
6. Splitting the dataset into training & test sets
7. Feature scaling
1. Data Gathering
• Here is a curated list of websites you can get datasets from:
1. Kaggle: my personal favorite place to get datasets. https://www.kaggle.com/datasets
2. UCI Machine Learning Repository: one of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. http://mlr.cs.umass.edu/ml/
3. This super awesome GitHub repo has tons of links to high-quality datasets. Link: https://github.com/awesomedata/awesome-public-datasets
4. The Zanran numerical data search engine: http://www.zanran.com/q/
5. And if you are looking for governments' open data, here are a few portals:
Indian Government: http://data.gov.in
US Government: https://www.data.gov/
British Government: https://data.gov.uk/
French Government: https://www.data.gouv.fr/en/
2. Importing Libraries
• NumPy
• Pandas
• Matplotlib and/or Seaborn
• Scikit-Learn
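
A minimal sketch of the usual import block, using the conventional aliases for these libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn is usually imported module by module, for example:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler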
3. Import Dataset and Perform 2 Steps
• Step 1: We can use the read_csv function as below:
import pandas as pd
dataset_frame = pd.read_csv('Dataset.csv')
• Step 2: Extract the dependent and independent variables from the data frame.
• Note: A dependent variable is a variable whose value depends on another variable, whereas an independent variable is a variable whose value does not depend on another variable.
• Example 1: A teacher decides to analyze the effect of revision time and intelligence on the test performance of 100 students.
 Independent variables: Revision time (measured in hours) and Intelligence (measured using IQ score)
 Dependent variable: Test mark (measured from 0 to 100)
• Example 2: How does an increment affect employees' motivation?
 Independent variable: Increment
 Dependent variable: Employees' motivation
• Example 3: How can higher education lead to higher income?
 Independent variable: Higher education
 Dependent variable: Higher income
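
A minimal sketch of both steps, assuming a hypothetical 'Dataset.csv' whose last column is the dependent variable:

import pandas as pd

# Step 1: load the dataset into a data frame
dataset_frame = pd.read_csv('Dataset.csv')

# Step 2: independent variables (X) = all columns except the last;
# dependent variable (y) = the last column
X = dataset_frame.iloc[:, :-1].values
y = dataset_frame.iloc[:, -1].values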
Independent Vs Dependent Variable
4. Handling Missing Data
Ways to handle missing data:
• There are mainly two ways to handle missing data:
1. By deleting the particular row
2. By imputation: replacing missing entries with the mean or a constant value
To handle missing values, we will use the Scikit-learn library in our code, which contains various utilities for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (the older Imputer class of sklearn.preprocessing has been removed from recent versions).
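
A minimal sketch of both options on a toy data frame (the column names are illustrative):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'Age': [25, np.nan, 35, 40],
                   'Salary': [50000, 60000, np.nan, 80000]})

# Option 1: delete the rows that contain missing values
df_dropped = df.dropna()

# Option 2: impute missing entries with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)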
5. Encoding Data
https://pub.towardsai.net/5-useful-encoding-techniques-in-machine-learning-f735567399f4
All about Categorical Variable Encoding | by Baijayanta Roy | Towards Data Science

• Encoding is a technique for converting categorical variables into numerical values so that they can be easily fitted to a machine learning model.
• Nominal variables are variables that have no inherent
order. They are simply categories that can be
distinguished from each other.
• Ordinal variables have an inherent order. They can be
ranked from highest to lowest or vice versa.

Nominal: where the order of data does not matter. Ordinal: where the order of data matters.


Case Study for Encoding
Suppose a dataset has an education column which we will use to predict the salary of a person. The education column has categories like 'Bachelors', 'Masters' and 'PhD'. Based on these categories we can rearrange the column and assign a rank to each category: based on education level, 'PhD' gets the highest rank (PhD-1, Masters-2, Bachelors-3).
Ordinal Encoding or Label Encoding
• This categorical data encoding technique is used when the categorical feature is ordinal.
• In label encoding, every label is converted into an integer value. We encode the text values by assigning a running sequence to each distinct text value.
• We will create a variable that contains the categories representing the qualification of a person. Ranks are assigned based on the importance of the category.
Disadvantage: Label encoding implies a hierarchy among a column's values, which is misleading for nominal features present in the dataset.
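
A minimal sketch of the qualification example, once with a hand-crafted rank map matching the case study and once with scikit-learn's OrdinalEncoder:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Qualification': ['Bachelors', 'PHD', 'Masters', 'PHD']})

# Hand-crafted ranks matching the case study (PHD-1, Masters-2, Bachelors-3)
rank_map = {'PHD': 1, 'Masters': 2, 'Bachelors': 3}
df['Qualification_rank'] = df['Qualification'].map(rank_map)

# The same idea with scikit-learn, listing categories from lowest to highest
encoder = OrdinalEncoder(categories=[['Bachelors', 'Masters', 'PHD']])
df['Qualification_ordinal'] = encoder.fit_transform(df[['Qualification']]).ravel()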
Target Guided Ordinal Encoding
• In this method, we calculate the mean of the output for each category of the variable and then rank the categories by that mean, as the sketch below illustrates.
• We can apply this technique to ordinal variables, but we cannot do this with nominal variables, since we don't know the order of nominal variables, unlike ordinal variables where the order is known.
• To overcome this limitation for nominal variables we use another technique called Mean Encoding.
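
A minimal sketch with pandas, assuming a hypothetical categorical column 'City' and a numeric target 'Salary':

import pandas as pd

df = pd.DataFrame({'City':   ['Delhi', 'Mumbai', 'Delhi', 'Pune', 'Mumbai', 'Pune'],
                   'Salary': [40, 55, 45, 30, 60, 35]})

# Mean of the target per category, then rank the categories by that mean
means = df.groupby('City')['Salary'].mean().sort_values()
ranks = {city: rank for rank, city in enumerate(means.index, start=1)}
df['City_encoded'] = df['City'].map(ranks)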
One Hot Encoding
• This is a categorical data encoding technique used when the features are nominal (don't have an order).
• In one hot encoding, for every level of a categorical variable, we create a new variable. Each category is represented by a binary variable containing either 1 or 0, where 1 represents the presence of that category and 0 represents its absence.
• One hot encoding overcomes the problem of label encoding.
• For example, suppose we have a column containing 3 categories; then in one hot encoding, 3 columns will be created, one for each category.
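
A minimal sketch with pandas' get_dummies on an illustrative 'Colour' column:

import pandas as pd

df = pd.DataFrame({'Colour': ['Red', 'Blue', 'Green', 'Blue']})

# One binary column per category: 1 marks presence, 0 absence
one_hot = pd.get_dummies(df['Colour'], prefix='Colour', dtype=int)
print(one_hot)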
Dummy Variables Trap
• The binary features newly created in one-hot encoding are known as dummy variables. How many dummy variables there are depends on the number of levels present in the categorical variable.
• The dummy variable trap occurs when two or more dummy variables created by one-hot encoding are highly correlated (multicollinear). This means that one variable can be predicted from the others, making it difficult to interpret the estimated coefficients in regression models.
We can skip the last column 'Green', as 0,0 already signifies green. This means that if we have 'n' categories, one hot encoding should create 'n-1' columns.
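
Continuing the sketch above, get_dummies can drop one redundant column directly:

# With n categories, n-1 dummies suffice: all zeros stands for the dropped one
one_hot = pd.get_dummies(df['Colour'], prefix='Colour', drop_first=True, dtype=int)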
Disadvantage of One-Hot Encoding
• Suppose we have a column with 100 distinct categories. If we try to convert them into dummy variables, we will get 99 columns. This increases the dimension of the overall dataset, which leads to the curse of dimensionality.
• So basically, if a column has a large number of categories, we should not apply this technique.
One Hot Encoding (Multiple Categories) — Nominal Categories
• In this method, we consider only the categories with the most repetitions, for example the top 10 most frequent categories, and apply one-hot encoding to those categories only, as the sketch below shows.
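
A minimal sketch, assuming df has a high-cardinality nominal column 'City' (a hypothetical name):

# The 10 most frequent categories in the column
top_10 = df['City'].value_counts().head(10).index

# One dummy column for each of those categories only;
# rows outside the top 10 get 0 in every dummy column
for category in top_10:
    df[f'City_{category}'] = (df['City'] == category).astype(int)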



Binary Encoding
• Binary encoding is a technique for transforming categorical data into numerical data by encoding categories as integers and then converting those integers into binary code.
• Binary encoding is a special case of one hot encoding in which binary digits, 0 and 1, are used for the encoding.
• This technique is preferable when there is a large number of categories. Suppose you have 100 or more different categories: one hot encoding will create 100 or more different columns, but binary encoding only needs 7 columns, since the binary code of 100 is 1100100.
For binary encoding, one has to follow these steps:
• The categories are first converted to numeric order starting from 1 (the order is created as categories appear in the dataset and does not imply any ordinal nature).
• Those integers are then converted into binary code, so, for example, 3 becomes 011 and 4 becomes 100.
• The digits of the binary number then form separate columns, as sketched below.
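
A minimal pure-pandas sketch of these three steps on an illustrative column (the third-party category_encoders package also offers a ready-made BinaryEncoder):

import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Pune', 'Delhi', 'Chennai']})

# Step 1: integer codes starting from 1, in order of appearance
codes = {cat: i + 1 for i, cat in enumerate(df['City'].unique())}
df['code'] = df['City'].map(codes)

# Steps 2-3: the binary digits of each code, one column per bit
n_bits = max(codes.values()).bit_length()
for shift in range(n_bits - 1, -1, -1):
    df[f'City_bin_{shift}'] = (df['code'] // (2 ** shift)) % 2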
Mean Encoding
• Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. In mean target encoding, the label for each category of the feature is decided by the mean value of the target variable on the training data.
• The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning.
• The mean encoding approach is as follows (see the sketch after this list):
1. Select a categorical variable you would like to transform.
2. Group by the categorical variable and obtain the aggregated sum over the 'Target' variable (total number of 1's for each category in 'Temperature').
3. Group by the categorical variable and obtain the aggregated count over the 'Target' variable.
4. Divide the step 2 result by the step 3 result and join it back to the training data.
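
A minimal sketch using the slide's 'Temperature' feature with an illustrative binary 'Target'; steps 2-4 collapse into a per-category mean:

import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Hot', 'Mild', 'Cold', 'Hot'],
                   'Target':      [1, 0, 1, 1, 0, 0]})

# Mean of the target per category (sum / count), mapped back onto the column
mean_map = df.groupby('Temperature')['Target'].mean()
df['Temperature_encoded'] = df['Temperature'].map(mean_map)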
Dataset Splitting
When splitting a dataset there are two competing concerns:
1. If you have less training data, your parameter estimates have greater variance.
2. If you have less testing data, your performance statistic will have greater variance.
The data should be divided in such a way that neither variance is too high, which depends mostly on the amount of data you have. If your data is too small then no split will give you satisfactory variance and you will have to do cross-validation, but if your data is huge then it doesn't really matter whether you choose an 80:20 or a 90:10 split.
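
A minimal sketch of an 80:20 split with scikit-learn, assuming the X and y arrays from the extraction step earlier:

from sklearn.model_selection import train_test_split

# test_size=0.2 gives an 80:20 split; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)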
Feature Scaling
• Feature scaling is the process of normalising the range of features in a dataset.
• Real-world datasets often contain features that vary in magnitude, range and units. Therefore, for machine learning models to interpret these features on the same scale, we need to perform feature scaling.
• More specifically, we will look at 3 different scalers in the Scikit-learn library for feature scaling:
1. Normalisation or MinMaxScaler
2. Standardisation or StandardScaler
3. RobustScaler
Normalisation
• Normalisation, also known as min-max scaling, is a scaling technique whereby the values in a column are shifted and rescaled so that they are bounded within a fixed range of 0 and 1.
• MinMaxScaler is the Scikit-learn class for normalisation.
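
A minimal sketch on a toy array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Each column is rescaled to [0, 1] via (x - min) / (max - min)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)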
Standardisation
• On the other hand, standardisation or Z-score normalisation is another scaling technique whereby the values in a column are rescaled so that they demonstrate the properties of a standard Gaussian distribution, that is, mean = 0 and variance = 1.
• StandardScaler is the Scikit-learn class for standardisation.
• It standardises a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
• StandardScaler makes the mean of the distribution 0. For roughly Gaussian data, about 68% of the values will then lie between -1 and 1.
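
A minimal sketch on the same toy array:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

# Each column is rescaled to mean 0 and unit variance via (x - mean) / std
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)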
