Feature Scaling in Machine Learning
Feature scaling is a technique often applied as part of data preparation for machine learning. The
goal of scaling (sometimes also referred to as normalization) is to change the values of numeric
columns in the dataset to a common scale, without distorting differences in the ranges of
values or losing information.
If one feature (e.g. human height) varies less than another (e.g. weight) purely because of their
respective units (metres vs. kilograms), an algorithm such as PCA may determine that the direction
of maximal variance corresponds more closely to the 'weight' axis when the features are not
scaled, leading to incorrect results.
Scaling avoids these problems by creating new values that maintain the general distribution and
ratios of the source data, while keeping the values within a common scale applied across all
numeric columns used in the model. This prevents the results from being heavily influenced by
high-magnitude features, as the sketch below illustrates.
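As a minimal sketch of this effect (the height and weight figures below are made up purely for illustration):

import numpy as np

# Hypothetical data: height in metres, weight in kilograms
rng = np.random.default_rng(0)
height = rng.normal(1.7, 0.1, 1000)    # spread of roughly 0.1 m
weight = rng.normal(70.0, 15.0, 1000)  # spread of roughly 15 kg

# Before scaling, the variance of weight dwarfs that of height, so a
# variance-based method such as PCA would follow the weight axis
print(np.var(height), np.var(weight))  # ~0.01 vs. ~225

# After z-score scaling, both features contribute comparably
height_z = (height - height.mean()) / height.std()
weight_z = (weight - weight.mean()) / weight.std()
print(np.var(height_z), np.var(weight_z))  # both ~1.0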
Types of Scaling –
The two most commonly used scaling techniques are Z-score and Min-Max Scaling.
1. Z-Score Scaling – A method in which all the values are converted to z-scores.
Z-score scaling removes the mean (i.e. brings the mean to zero) and scales the data to unit
variance. The scaling shrinks the range of the feature values; for example, if the original
range was 500 to 50,000 with a mean of 20,000, the new range could be shrunk to roughly
-5 to +5 with a mean approximately equal to 0.
Z-scores are calculated using the following formula:

z = (x - μ) / σ

where μ is the mean of the feature and σ is its standard deviation.
One thing to note at this point is that the z-score expression makes use of the mean and
standard deviation of the distribution, both of which are affected by the presence of outlier
values. This can lead to imbalanced feature scales when significant outliers are present.
Being aware of outliers and treating them appropriately for the business context therefore
becomes very crucial. A minimal sketch of z-score scaling follows below.
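The sketch below uses scikit-learn's StandardScaler; the salary figures are made up to mirror the 500-to-50,000 range described above:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data mirroring the range discussed above
salaries = np.array([[500.0], [12000.0], [20000.0], [28000.0], [50000.0]])

scaler = StandardScaler()
salaries_z = scaler.fit_transform(salaries)

print(salaries_z.ravel())                   # small values centred on 0
print(salaries_z.mean(), salaries_z.std())  # ~0.0 and 1.0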
2. Min-Max Scaling: The Min-Max method linearly rescales every feature to the [0, 1]
interval. A consequence of this bounded range, in contrast to z-score scaling, is that the
scaled features end up with smaller standard deviations, which can suppress the effect of
outliers.
The scaled values are obtained using the following formula:

x_scaled = (x - min(x)) / (max(x) - min(x))
Similar to z-score scaling, Min-Max scaling is also affected by outliers, since the minimum
and maximum values appear directly in the expression (an outlier will typically be the min or
max of a feature). This reinforces the need to inspect for outliers with both scaling methods.
Min-Max scaling can be a better option when you are able to accurately estimate the minimum
and maximum observable values. For example, in image processing, pixel intensities are
known to lie in the range 0 to 255 for each RGB channel, as in the sketch below.
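A quick sketch of Min-Max scaling using scikit-learn's MinMaxScaler (the pixel values are made up, with 0 and 255 as the known bounds):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pixel intensities; 0 and 255 are the known observable bounds
pixels = np.array([[0.0], [64.0], [128.0], [255.0]])

minmax = MinMaxScaler()  # default feature_range is (0, 1)
pixels_01 = minmax.fit_transform(pixels)

print(pixels_01.ravel())  # approximately [0.  0.251  0.502  1.]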
Inverse Scaling –
There are certain scenarios where the interpretability of the features is very important. For
instance, suppose we have trained machine learning models on scaled data and we want to
present our results/findings to the stakeholders. The stakeholders will be more comfortable
with the original scales (units) than with the bounded scaled values. Inverse scaling can be
used to bring the values back to their original scales and avoid these interpretability issues.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Splitting the dataset into train and test sets prior to scaling to avoid data leakage
# (only applicable for supervised learning algorithms)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=100)

# X_train_cont / X_test_cont are assumed to hold the continuous numeric columns, e.g.:
# X_train_cont = X_train.select_dtypes(include="number")
# X_test_cont = X_test.select_dtypes(include="number")

# Fitting the StandardScaler object on the train set only
zscore = StandardScaler()
zscore.fit(X_train_cont)

# Scaling the train and test sets using the fitted StandardScaler object
X_train_scaled = pd.DataFrame(zscore.transform(X_train_cont), columns=X_train_cont.columns)
X_test_scaled = pd.DataFrame(zscore.transform(X_test_cont), columns=X_test_cont.columns)
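Continuing the snippet above, inverse scaling is a single call on the fitted scaler; this sketch reuses the zscore object and the scaled frame from the code above:

# Recovering the original units for reporting (see Inverse Scaling above)
X_train_original = pd.DataFrame(zscore.inverse_transform(X_train_scaled),
                                columns=X_train_scaled.columns)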