


IN-DEPTH ANALYSIS

All about Categorical Variable Encoding


Convert categorical variables to numbers for machine learning model building

Baijayanta Roy
Jul 17, 2019 · 13 min read

Last Updated : 12th February 2020

Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance also varies based on how the categorical variables are encoded.

Categorical variables can be divided into two categories: Nominal (no particular order) and Ordinal (having some order).


A few examples of nominal variables:

Red, Yellow, Pink, Blue

Singapore, Japan, USA, India, Korea

Cow, Dog, Cat, Snake

Examples of ordinal variables:

High, Medium, Low

Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree

Excellent, Okay, Bad

There are many ways we can encode these categorical variables as numbers and use them in an algorithm. I will cover most of them, from basic to more advanced, in this post. I will be covering these encodings:

1) One Hot Encoding


2) Label Encoding
3) Ordinal Encoding

4) Helmert Encoding
5) Binary Encoding
6) Frequency Encoding
7) Mean Encoding
8) Weight of Evidence Encoding
9) Probability Ratio Encoding
10) Hashing Encoding
11) Backward Difference Encoding
12) Leave One Out Encoding
13) James-Stein Encoding
14) M-estimator Encoding

15) Thermometer Encoder (To be updated)


For explanation, I will use this data-frame, which has two independent variables, or features (Temperature and Color), and one label (Target). It also has Rec-No, which is a sequence number for the record. There are a total of 10 records in this data-frame. The Python code would look as below.
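The original code appears only as a screenshot, so a minimal reconstruction is sketched below. The specific Temperature, Color and Target values are illustrative stand-ins rather than the article's exact data, but they keep the structure described above (10 records, two categorical features, one binary target).

import pandas as pd

# Illustrative stand-in for the article's sample data-frame;
# the index serves as the Rec-No sequence number
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color':       ['Red', 'Yellow', 'Blue', 'Blue', 'Red',
                    'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target':      [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
})
print(df)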


We will use Pandas, Scikit-learn, and category_encoders (a Scikit-learn contribution library) to show the different encoding methods in Python.
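A typical set of imports for the examples that follow (category_encoders is installed separately, for example with pip install category_encoders):

import pandas as pd
import numpy as np
from sklearn import preprocessing
import category_encoders as ce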


One Hot Encoding


In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors depends on the number of categories in the feature. This method produces a lot of columns, which slows down learning significantly if the number of categories for the feature is very high. Pandas has the get_dummies function, which is quite easy to use. For the sample data-frame, the code would be as below:
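A minimal sketch with pandas (the prefix names are my own choices):

# One 0/1 column per category of Temperature and Color
df_onehot = pd.get_dummies(df, columns=['Temperature', 'Color'],
                           prefix=['Temp', 'Color'])
print(df_onehot.head())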

Scikit-learn has OneHotEncoder for this purpose, but it does not add the new feature columns to the data-frame by itself (some additional code is needed, as shown in the code sample below).
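A minimal sketch with scikit-learn; note that the column-name helper differs between versions (older releases use get_feature_names, newer ones get_feature_names_out, and recent releases replace sparse with sparse_output):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['Temperature', 'Color']])

# OneHotEncoder returns a plain array, so we rebuild labelled columns
# ourselves and attach them to the original data-frame
onehot_cols = pd.DataFrame(
    encoded, columns=ohe.get_feature_names(['Temperature', 'Color']))
df_ohe = pd.concat([df, onehot_cols], axis=1)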


One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included. Usually, for regression, we use N-1 (drop the first or last column of the one-hot-encoded features), but for classification the recommendation is to use all N columns without dropping any, as most tree-based algorithms build the tree based on all available variables. One-hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features as it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about (completely represent) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at ALL the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.

In tree-based methods, the dropped category would simply never be considered. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into N binary variables and not drop any.

Label Encoding
In this encoding, each category is assigned a value from 0 to N-1 (where N is the number of categories for the feature). One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat them as having some order or relationship. In the example below, it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). Scikit-learn code for the data-frame is as follows:
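A minimal sketch with scikit-learn's LabelEncoder (the new column name is my own):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder assigns integers in alphabetical order:
# Cold -> 0, Hot -> 1, Very Hot -> 2, Warm -> 3
df['Temp_label_encoded'] = le.fit_transform(df['Temperature'])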


Pandas' factorize also performs the same function.
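For example (again with an illustrative column name):

# factorize assigns integers in order of first appearance in the data:
# Hot -> 0, Cold -> 1, Very Hot -> 2, Warm -> 3
df['Temp_factorized'] = pd.factorize(df['Temperature'])[0]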

Ordinal Encoding
We do ordinal encoding to ensure that the encoding retains the ordinal nature of the variable. This is reasonable only for ordinal variables, as I mentioned at the beginning of this article. This encoding looks almost similar to Label Encoding but is slightly different: label encoding does not consider whether the variable is ordinal or not, and it will simply assign a sequence of integers either

as per the order of data (Pandas assigned Hot (0), Cold (1), “Very Hot” (2) and
Warm (3)) or


as per alphabetical sorted order (scikit-learn assigned Cold(0), Hot(1), “Very Hot”
(2) and Warm (3)).

If we take the temperature scale as the order, then the ordinal values should run from Cold to Very Hot. Ordinal encoding will assign values as Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding is done starting from 1.

Refer to this code using Pandas, where first we need to define the intended order of the variable through a dictionary; then we can map each row of the variable as per the dictionary.
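A minimal sketch of that approach (the dictionary and column names are my own):

# Define the intended order explicitly, then map each row to its rank
temp_order = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}
df['Temp_ordinal'] = df['Temperature'].map(temp_order)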

Though it is very straightforward, it requires us to spell out the ordinal values and the actual mapping from text to integer as per that order.

Helmert Encoding


In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. The version in category_encoders is sometimes referred to as Reverse Helmert Coding; the name 'reverse' is used to differentiate it from forward Helmert coding, which compares each level to the subsequent levels instead.
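A minimal sketch using category_encoders' HelmertEncoder (drop_invariant simply drops the constant intercept column it would otherwise add):

import category_encoders as ce

helmert_encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
df_helmert = helmert_encoder.fit_transform(df)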

Binary Encoding
Binary encoding converts each category into binary digits, and each binary digit creates one feature column. If there are n unique categories, binary encoding results in only about log₂(n) features (the number of binary digits needed to represent n). In this example, we have four categories; thus the total number of binary-encoded features will be three. Compared to One Hot Encoding, this requires far fewer feature columns (for 100 categories, One Hot Encoding will have 100 features, while binary encoding needs just seven).

For binary encoding, one has to follow these steps:

The categories are first converted to numeric order starting from 1 (the order is created as the categories appear in the dataset and does not imply any ordinal nature)

Then those integers are converted into binary code, so for example 3 becomes 011,
4 becomes 100

Then the digits of the binary number form separate columns.



Refer to the below diagram for better intuition.

We will use the category_encoders package for this, and the function name is
BinaryEncoder.
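A minimal sketch with category_encoders:

import category_encoders as ce

# Four Temperature categories -> three binary columns
binary_encoder = ce.BinaryEncoder(cols=['Temperature'])
df_binary = binary_encoder.fit_transform(df)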


Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In the cases where the
frequency is related somewhat with the target variable, it helps the model to
understand and assign the weight in direct and inverse proportion, depending on the
nature of the data. Three-step for this :

Select a categorical variable you would like to transform

Group by the categorical variable and obtain counts of each category

Join it back with the training dataset

Pandas code can be constructed as below:
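A minimal pandas sketch (the new column name is my own; raw counts could be used instead of normalised frequencies):

# Frequency (proportion) of each Temperature category, mapped back onto the rows
freq = df.groupby('Temperature').size() / len(df)
df['Temp_freq_encoded'] = df['Temperature'].map(freq)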

Mean Encoding
Mean Encoding or Target Encoding is one popular encoding approach followed by Kagglers. There are many variations of this; here I will cover the basic version and the smoothing version. Mean encoding is similar to label encoding, except here the labels are correlated directly with the target. For example, in mean target encoding, each category in the feature is replaced with the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning. Mean encoding is, however, notorious for over-fitting; thus regularization with cross-validation or some other approach is a must on most occasions. The mean encoding approach is as below:

1. Select a categorical variable you would like to transform

2. Group by the categorical variable and obtain aggregated sum over the “Target”
variable. (total number of 1’s for each category in ‘Temperature’)

3. Group by the categorical variable and obtain aggregated count over “Target”
variable

4. Divide the result of step 2 by the result of step 3 and join it back with the training data.


Sample code for the data-frame:
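A minimal sketch of steps 1-4 in pandas (the new column name is my own):

# Mean of Target per Temperature category (sum / count), mapped back onto the rows
mean_map = df.groupby('Temperature')['Target'].mean()
df['Temp_mean_encoded'] = df['Temperature'].map(mean_map)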


Mean encoding can embody the target in the label, whereas label encoding does not
correlate with the target. In the case of a large number of features, mean encoding
could prove to be a much simpler alternative. Mean encoding tends to group the
classes, whereas the grouping is random in case of label encoding.

There are many variations of target encoding in practice, like smoothing. Smoothing can be implemented as below:
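One common form of smoothing blends each category mean with the global mean, weighting by the category count; the smoothing factor m below is an illustrative choice, not a value from the article:

global_mean = df['Target'].mean()
agg = df.groupby('Temperature')['Target'].agg(['count', 'mean'])

m = 2  # smoothing strength: larger m pulls small categories closer to the global mean
smooth = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
df['Temp_smooth_encoded'] = df['Temperature'].map(smooth)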


Weight of Evidence Encoding


Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique to
separate good and bad. This method was developed primarily to build a predictive
model to evaluate the risk of loan default in the credit and financial industry. Weight of
evidence (WOE) is a measure of how much the evidence supports or undermines a
hypothesis.

It is computed as below:
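For each group (category), WoE = ln( P(Goods) / P(Bads) ), i.e. the natural log of the proportion of all good outcomes that fall in the group divided by the proportion of all bad outcomes that fall in the group.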

WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited for Logistic Regression because the Logit transformation is simply
the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded
predictors in Logistic Regression, the predictors are all prepared and coded to the same
scale. The parameters in the linear logistic regression equation can be directly
compared.

The WoE transformation has (at least) three advantages:


1) It can transform an independent variable so that it establishes a monotonic relationship to the dependent variable. It does more than this: to secure a monotonic relationship it would be enough to "recode" it to any ordered measure (for example 1, 2, 3, 4...), but the WoE transformation orders the categories on a "logistic" scale, which is natural for Logistic Regression.
2) For variables with too many (sparsely populated) discrete values, these can be
grouped into categories (densely populated), and the WoE can be used to express
information for the whole category
3) The (univariate) effect of each category on the dependent variable can be compared
across categories and variables because WoE is a standardized value (for example you
can compare WoE of married people to WoE of manual workers)

It also has (at least) three drawbacks:


1) Loss of information (variation) due to binning to a few categories
2) It is a “univariate” measure, so it does not take into account the correlation
between independent variables
3) It is easy to manipulate (over-fit) the effect of variables according to how categories
are created

The code snippet below shows how one can calculate WoE.

Once we calculate the WoE for each group, we can map it back to the data-frame.
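A minimal sketch, treating Target = 1 as "Good" and Target = 0 as "Bad"; the small constant guards against empty groups:

import numpy as np

woe_df = pd.DataFrame(index=df['Temperature'].unique())
woe_df['Good'] = df[df['Target'] == 1].groupby('Temperature').size()
woe_df['Bad'] = df[df['Target'] == 0].groupby('Temperature').size()
woe_df = woe_df.fillna(0.0001)  # avoid division by zero / log(0)

# Distribution of Goods and Bads across the groups, then the WoE per group
woe_df['P_Good'] = woe_df['Good'] / woe_df['Good'].sum()
woe_df['P_Bad'] = woe_df['Bad'] / woe_df['Bad'].sum()
woe_df['WoE'] = np.log(woe_df['P_Good'] / woe_df['P_Bad'])

# Map the per-group WoE values back onto the data-frame
df['Temp_WoE'] = df['Temperature'].map(woe_df['WoE'])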


Probability Ratio Encoding


Probability Ratio Encoding is similar to Weight of Evidence (WoE), with the only difference being that the ratio of good and bad probabilities is used directly rather than its logarithm. For each label, we calculate the mean of target=1, that is the probability of being 1 (P(1)), and also the probability of the target=0 (P(0)). Then we calculate the ratio P(1)/P(0) and replace the labels by that ratio. We need to add a minimal value to P(0) to avoid any divide-by-zero scenarios where, for a particular category, there is no target=0.
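A minimal pandas sketch (the small constant plays the role of the "minimal value" mentioned above):

prob = df.groupby('Temperature')['Target'].mean().to_frame('P_1')
prob['P_0'] = (1 - prob['P_1']).replace(0, 0.0001)  # avoid division by zero
df['Temp_prob_ratio'] = df['Temperature'].map(prob['P_1'] / prob['P_0'])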


Hashing
Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high.

(Sample Code — Will be updated in a future version of this article)
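In the meantime, a minimal sketch with category_encoders' HashingEncoder (the number of output components is an illustrative choice):

import category_encoders as ce

hashing_encoder = ce.HashingEncoder(cols=['Color'], n_components=4)
df_hashed = hashing_encoder.fit_transform(df)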

Backward Difference Encoding


In backward difference coding, the mean of the dependent variable for a level is
compared with the mean of the dependent variable for the prior level. This type of
coding may be useful for a nominal or an ordinal variable.

This technique falls under the contrast coding system for categorical features. A feature
of K categories, or levels, usually enters a regression as a sequence of K-1 dummy
variables.

(Sample Code — Will be updated in a future version of this article)
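In the meantime, a minimal sketch with category_encoders (drop_invariant again removes the constant intercept column):

import category_encoders as ce

bd_encoder = ce.BackwardDifferenceEncoder(cols=['Temperature'], drop_invariant=True)
df_backward = bd_encoder.fit_transform(df)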

Leave One Out Encoding



This is very similar to target encoding but excludes the current row’s target when
calculating the mean target for a level to reduce the effect of outliers.

(Sample Code — Will be updated in a future version of this article)
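In the meantime, a minimal sketch with category_encoders (note that the target is passed alongside the features):

import category_encoders as ce

loo_encoder = ce.LeaveOneOutEncoder(cols=['Temperature'])
df_loo = loo_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])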

James-Stein Encoding
For a given feature value, the James-Stein estimator returns a weighted average of:

1. The mean target value for the observed feature value.

2. The mean target value (regardless of the feature value).

The James-Stein encoder shrinks the average toward the overall average; it is a target-based encoder. The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.

(Sample Code — Will be updated in a future version of this article)
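In the meantime, a minimal sketch with category_encoders:

import category_encoders as ce

js_encoder = ce.JamesSteinEncoder(cols=['Temperature'])
df_js = js_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])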

M-estimator Encoding
M-Estimate Encoder is a simplified version of the Target Encoder. It has only one hyper-parameter, m, which represents the power of regularization: the higher the value of m, the stronger the shrinking. Recommended values for m are in the range of 1 to 100.

(Sample Code — Will be updated in a future version of this article)
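In the meantime, a minimal sketch with category_encoders (m = 5 is an illustrative choice):

import category_encoders as ce

m_encoder = ce.MEstimateEncoder(cols=['Temperature'], m=5.0)
df_m = m_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])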

FAQ:
I received many queries about which method to use and about how to treat the test data when there is no target value. I am adding an FAQ section here, which I hope will assist.

FAQ 01: Which method should I use?

Answer: There is no single method that works for every problem or dataset. You may have to try a few to see which gives a better result. The general guideline is to refer to the cheat-sheet shown at the end of the article.

FAQ 02: How do I create a categorical encoding such as target encoding when the test data won't have any target value?


Answer: We need to use the mapping values created at the time of training. This is the same concept as in scaling or normalization, where we use the training data to scale or normalize the test data. We save the map and use the same map during testing-time pre-processing. We can even create a dictionary of each category and its mapped value and then use that dictionary at testing time. Here I am using mean encoding to explain this.

Training Time
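A minimal sketch, assuming the data has already been split into train_df and test_df (both names are mine):

# Learn the category -> mean(Target) mapping from the training data only
mean_map = train_df.groupby('Temperature')['Target'].mean().to_dict()
train_df['Temp_mean_encoded'] = train_df['Temperature'].map(mean_map)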

Testing Time
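At testing time we reuse exactly that mapping; in this sketch a category never seen in training falls back to the overall training mean:

global_mean = train_df['Target'].mean()
test_df['Temp_mean_encoded'] = test_df['Temperature'].map(mean_map).fillna(global_mean)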


Conclusion
It is essential to understand that, for all machine learning models, these encodings do not work well in all situations or for every dataset. Data scientists still need to experiment and find out which works best for their specific case. If the test data contains classes the training data did not, then some of these methods won't work, as the features won't be comparable. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these on smaller datasets first and then decide where to put more focus on tuning the encoding process. You can use the below cheat-sheet as a guiding tool.



Thanks for reading. You can connect me on LinkedIn.
