L1_Data Pre-processing & Steps of Building a Model (1)
1. Data Gathering
• Here is a curated list of websites you can get datasets from:
1.Kaggle: Kaggle is my personal favorite place to get datasets.
https://www.kaggle.com/datasets
2.UCI Machine Learning Repository: One of the oldest sources of datasets on
the web, and a great first stop when looking for interesting datasets.
http://mlr.cs.umass.edu/ml/
3.This super awesome GitHub repo has tons of links to high-quality
datasets.
Link: https://github.com/awesomedata/awesome-public-datasets
4.The Zanran numerical data search engine: http://www.zanran.com/q/
5.And if you are looking for governments' open data, here are a few portals:
Indian Government: http://data.gov.in
US Government: https://www.data.gov/
British Government: https://data.gov.uk/
French Government: https://www.data.gouv.fr/en/
2. Importing Libraries
• NumPy
• Pandas
• Matplotlib and/or Seaborn
• Scikit-Learn
3. Import Dataset and Perform Two
Steps
• Step 1: We can use the read_csv function as below:
import pandas as pd
dataset_frame = pd.read_csv('Dataset.csv')
• Step 2: Extracting the dependent and independent
variables from the data frame.
• Note: A dependent variable is a variable whose value
depends on another variable, whereas an independent
variable is a variable whose value does not depend on
another variable.
• Example - 1: A teacher decides to analyze the effect
of revision time on the test performance of 100
students.
Independent Variables: Revision time (which is measured in
hours) and Intelligence (which is measured using IQ score)
Dependent Variable: Test Mark (which can be measured
from 0 to 100)
• Example - 2 : How does increment affect employees’
motivation?
Independent variable: Increment
Dependent variable: Employees' motivation
• Example - 3: How does higher education affect
income?
Independent variable: Higher education
Dependent variable: Income
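A minimal sketch of Step 2, using a small hypothetical data frame modeled on Example 1 (column names are assumptions) and extracting the last column as the dependent variable:

```python
import pandas as pd

# Hypothetical data: revision time and IQ (independent), test mark (dependent)
df = pd.DataFrame({
    "RevisionHours": [2, 5, 8, 3],
    "IQ": [100, 110, 120, 95],
    "TestMark": [45, 60, 90, 50],
})

X = df.iloc[:, :-1].values  # independent variables (all columns but the last)
y = df.iloc[:, -1].values   # dependent variable (the last column)
print(X.shape, y.shape)     # (4, 2) (4,)
```

The same `iloc` pattern works on a frame loaded with `read_csv`, as long as the target is the last column.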
[Figure: Independent vs. Dependent Variable]
4. Handling Missing data
Ways to handle missing data:
• There are mainly two ways to handle missing data:
1. Deleting the particular row
2. Imputation: calculating the mean, or replacing with a constant value
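Both options can be sketched with pandas and scikit-learn's `SimpleImputer` (the Age/Salary data below is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Age": [25, np.nan, 40, 35],
    "Salary": [50000, 60000, np.nan, 52000],
})

# Option 1: delete the rows that contain missing values
dropped = df.dropna()

# Option 2: impute missing values with the column mean
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```

Deleting rows is simple but loses data; imputation keeps every row at the cost of introducing estimated values.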
Reference: All about Categorical Variable Encoding, by Baijayanta Roy (Towards Data Science)
Label Encoding
• Disadvantage: Label
Encoding assumes a
hierarchy among the
categories, which is
misleading for nominal
features in the dataset.
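The disadvantage is easy to see with scikit-learn's `LabelEncoder` on a hypothetical Color column: the encoder assigns integer codes alphabetically, implying Blue < Green < Red even though no such order exists.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Green", "Blue", "Green"]
le = LabelEncoder()
codes = le.fit_transform(colors)
# Classes are sorted alphabetically: Blue=0, Green=1, Red=2
print(list(codes))  # [2, 1, 0, 1]
```

A model may treat these integers as magnitudes, which is meaningless for a nominal feature like colour.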
Target Guided Ordinal
Encoding
• In this method, we
calculate the mean of
the output (target) for
each category of the
variable and then rank
the categories by that
mean.
• We can apply this technique to ordinal variables, where the order of
the categories is known, but not to nominal variables, where no such
order exists.
• To overcome this limitation for nominal variables, we use another
technique called Mean Encoding.
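A minimal sketch of target-guided encoding in pandas, using a hypothetical City column and a binary Sold target: compute the mean of the target per category, then rank the categories by that mean.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["A", "B", "A", "C", "B", "C"],
    "Sold": [1, 0, 1, 0, 0, 1],
})

# Mean of the target for each category: A=1.0, B=0.0, C=0.5
means = df.groupby("City")["Sold"].mean()

# Rank the categories by that mean (1 = lowest mean)
ranks = means.rank().astype(int)

df["City_encoded"] = df["City"].map(ranks)
```

Here B gets rank 1, C rank 2, and A rank 3, so the encoded column preserves how strongly each category is associated with the target.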
One Hot Encoding
• This is a categorical data encoding technique used when the features are nominal (i.e., they
have no order).
• In one hot encoding, for every level of a categorical value, we create a new variable. Each
category is represented with a binary variable containing either 1 or 0 where 1 represents the
presence of that category and 0 represents the absence.
• One hot encoding overcomes the problem of label encoding.
• For example, suppose we have a column containing 3 categories; one hot
encoding will then create 3 columns, one for each category.
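The example above can be sketched with pandas' `get_dummies` on a hypothetical Color column with 3 categories:

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# Three categories -> three binary columns, one per category
encoded = pd.get_dummies(df, columns=["Color"])
print(list(encoded.columns))  # ['Color_Blue', 'Color_Green', 'Color_Red']
```

Each row now has a 1 in exactly one of the three dummy columns.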
Dummy Variables Trap
• The binary features newly created by One-Hot Encoding are
known as dummy variables. The number of dummy variables
depends on the number of levels in the categorical variable.
• The Dummy Variable Trap occurs when two or more dummy
variables created by one-hot encoding are highly correlated
(multi-collinear). This means that one variable can be predicted
from the others, making the regression coefficients difficult
to interpret.
We can skip the last column
'Green', since (0, 0) in the
remaining columns already
signifies green. In general,
for a variable with 'n'
categories, one hot encoding
should create 'n - 1'
columns.
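With pandas' `get_dummies`, dropping that redundant column is a one-flag change; `drop_first=True` removes the first dummy so 'n' categories yield 'n - 1' columns (the Color data is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Green"]})

# drop_first=True drops one dummy column to avoid the dummy variable trap
encoded = pd.get_dummies(df, columns=["Color"], drop_first=True)
print(list(encoded.columns))  # ['Color_Green', 'Color_Red']
```

The dropped category ('Blue', the first alphabetically) is now represented by all-zeros in the remaining columns.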
Disadvantage of One-Hot Encoding
• Suppose we have a column with 100 categories. If we convert them
into dummy variables, we get 99 columns. This increases the
dimensionality of the overall dataset, which can lead to the curse
of dimensionality.
• So basically, if a column contains a large number of categories,
we should not apply this technique directly.
One Hot Encoding (Multiple
Nominal Categories)
• In this method, we consider only the most frequently occurring
categories, e.g. the top 10 most repeated categories, and apply
one hot encoding only to those.
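A sketch of this top-k variant using pandas (the City data is hypothetical; the top 2 categories stand in for the top 10):

```python
import pandas as pd

s = pd.Series(["NY", "LA", "NY", "SF", "NY", "LA", "CHI"])

# Keep only the most frequently repeated categories (top 2 here)
top = s.value_counts().head(2).index

df = pd.DataFrame({"City": s})
for cat in top:
    # One dummy column per frequent category; rare categories get all zeros
    df[f"City_{cat}"] = (df["City"] == cat).astype(int)
```

Rare categories like "SF" and "CHI" receive no column of their own, which keeps the dimensionality bounded.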
7. Feature Scaling
• Common scaling techniques are:
1. Normalisation or MinMaxScaler
2. Standardization or StandardScaler
3. RobustScaler
Normalisation
• Normalisation, also known as min-max scaling, is a scaling technique
whereby the values in a column are rescaled so that they are bounded
within a fixed range of 0 to 1.
• MinMaxScaler is the Scikit-learn function for normalisation.
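A minimal `MinMaxScaler` sketch on a single hypothetical column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])

# Rescales each column to the range [0, 1]
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.ravel())  # 0, 1/3, 2/3, 1
```

The minimum maps to 0, the maximum to 1, and everything else falls proportionally in between.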
Standardisation
• On the other hand, standardization or Z-score normalisation is another scaling
technique whereby the values in a column are rescaled so that they
demonstrate the properties of a standard Gaussian distribution, that is mean =
0 and variance = 1.
• StandardScaler is the Scikit-learn function for standardization.
• It standardizes a feature by subtracting the mean and then scaling to unit
variance. Unit variance means dividing all the values by the standard deviation.
• StandardScaler makes the mean of the distribution 0. If the feature is
roughly normally distributed, about 68% of the values will lie between -1 and 1.
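A minimal `StandardScaler` sketch on a single hypothetical column, confirming the mean-0 / unit-variance property:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[2.0], [4.0], [6.0], [8.0]])

# Subtracts the column mean, then divides by the standard deviation
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(), X_scaled.std())  # ~0.0 and 1.0
```

Unlike MinMaxScaler, the result is not bounded to a fixed range; outliers can still produce large scaled values.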