


IN-DEPTH ANALYSIS

All about Categorical Variable Encoding


Convert categorical variables to numbers for machine learning model building

Baijayanta Roy
Jul 17, 2019 · 13 min read

Last Updated : 12th February 2020

Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance also varies based on how the categorical variables are encoded.

Categorical variables can be divided into two categories: Nominal (no particular order) and Ordinal (having some order).


A few examples of nominal variables:

Red, Yellow, Pink, Blue

Singapore, Japan, USA, India, Korea

Cow, Dog, Cat, Snake

Examples of ordinal variables:

High, Medium, Low

Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree

Excellent, Okay, Bad

There are many ways we can encode these categorical variables as numbers and use them in an algorithm. I will cover most of them, from basic to more advanced, in this post. I will be covering these encodings:

1) One Hot Encoding


2) Label Encoding
3) Ordinal Encoding

4) Helmert Encoding
5) Binary Encoding
6) Frequency Encoding
7) Mean Encoding
8) Weight of Evidence Encoding
9) Probability Ratio Encoding
10) Hashing Encoding
11) Backward Difference Encoding
12) Leave One Out Encoding
13) James-Stein Encoding
14) M-estimator Encoding

15) Thermometer Encoder (To be updated)


For explanation, I will use this data-frame, which has two independent variables, or features (Temperature and Color), and one label (Target). It also has Rec-No, which is a sequence number for the record. There are a total of 10 records in this data-frame. The Python code would look as below.
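The original code appears only as a screenshot, so a minimal reconstruction is sketched below. The specific Temperature, Color and Target values are illustrative stand-ins rather than the article's exact data, but they keep the structure described above (10 records, two categorical features, one binary target).

import pandas as pd

# Illustrative stand-in for the article's sample data-frame;
# the index serves as the Rec-No sequence number
df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                    'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color':       ['Red', 'Yellow', 'Blue', 'Blue', 'Red',
                    'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target':      [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
})
print(df)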


We will use Pandas, Scikit-learn, and category_encoders (a Scikit-learn contribution library) to show the different encoding methods in Python.
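A typical set of imports for the examples that follow (category_encoders is installed separately, for example with pip install category_encoders):

import pandas as pd
import numpy as np
from sklearn import preprocessing
import category_encoders as ce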


One Hot Encoding


In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category. The number of vectors depends on the number of categories in the feature. This method produces a lot of columns, which slows down learning significantly if the number of categories for the feature is very high. Pandas has the get_dummies function, which is quite easy to use. For the sample data-frame, the code would be as below:
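A minimal sketch with pandas (the prefix names are my own choices):

# One 0/1 column per category of Temperature and Color
df_onehot = pd.get_dummies(df, columns=['Temperature', 'Color'],
                           prefix=['Temp', 'Color'])
print(df_onehot.head())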

Scikit-learn has OneHotEncoder for this purpose, but it does not add the new feature columns to the data-frame by itself (some additional code is needed, as shown in the code sample below).
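A minimal sketch with scikit-learn; note that the column-name helper differs between versions (older releases use get_feature_names, newer ones get_feature_names_out, and recent releases replace sparse with sparse_output):

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoded = ohe.fit_transform(df[['Temperature', 'Color']])

# OneHotEncoder returns a plain array, so we rebuild labelled columns
# ourselves and attach them to the original data-frame
onehot_cols = pd.DataFrame(
    encoded, columns=ohe.get_feature_names(['Temperature', 'Color']))
df_ohe = pd.concat([df, onehot_cols], axis=1)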


One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to encode the one that is not included. Usually, for regression, we use N-1 (drop the first or last column of the one-hot-encoded features), but for classification the recommendation is to use all N columns without dropping any, as most tree-based algorithms build the tree based on all available variables. One-hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features as it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about (completely represent) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at ALL the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.

In tree-based methods, the dropped category would simply never be considered. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into N binary variables and not drop any.

Label Encoding
In this encoding, each category is assigned a value from 0 to N-1 (where N is the number of categories for the feature). One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat them as having some order or relationship. In the example below, it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). Scikit-learn code for the data-frame is as follows:
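A minimal sketch with scikit-learn's LabelEncoder (the new column name is my own):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# LabelEncoder assigns integers in alphabetical order:
# Cold -> 0, Hot -> 1, Very Hot -> 2, Warm -> 3
df['Temp_label_encoded'] = le.fit_transform(df['Temperature'])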


Pandas' factorize also performs the same function.
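For example (again with an illustrative column name):

# factorize assigns integers in order of first appearance in the data:
# Hot -> 0, Cold -> 1, Very Hot -> 2, Warm -> 3
df['Temp_factorized'] = pd.factorize(df['Temperature'])[0]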

Ordinal Encoding
We do ordinal encoding to ensure that the encoding retains the ordinal nature of the variable. This is reasonable only for ordinal variables, as I mentioned at the beginning of this article. This encoding looks almost similar to Label Encoding but is slightly different: label encoding does not consider whether the variable is ordinal or not, and it will simply assign a sequence of integers either

as per the order of data (Pandas assigned Hot (0), Cold (1), “Very Hot” (2) and
Warm (3)) or


as per alphabetical sorted order (scikit-learn assigned Cold(0), Hot(1), “Very Hot”
(2) and Warm (3)).

If we take the temperature scale as the order, then the ordinal values should run from Cold to Very Hot. Ordinal encoding will assign values as Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding is done starting from 1.

Refer to this code using Pandas, where first we need to define the intended order of the variable through a dictionary; then we can map each row of the variable as per the dictionary.
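A minimal sketch of that approach (the dictionary and column names are my own):

# Define the intended order explicitly, then map each row to its rank
temp_order = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}
df['Temp_ordinal'] = df['Temperature'].map(temp_order)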

Though it is very straightforward, it requires us to spell out the ordinal values and the actual mapping from text to integer as per that order.

Helmert Encoding


In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. The version in category_encoders is sometimes referred to as Reverse Helmert Coding; the name 'reverse' is used to differentiate it from forward Helmert coding, which compares each level to the subsequent levels instead.
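A minimal sketch using category_encoders' HelmertEncoder (drop_invariant simply drops the constant intercept column it would otherwise add):

import category_encoders as ce

helmert_encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
df_helmert = helmert_encoder.fit_transform(df)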

Binary Encoding
Binary encoding converts each category into binary digits, and each binary digit creates one feature column. If there are n unique categories, binary encoding results in only about log₂(n) features (the number of binary digits needed to represent n). In this example, we have four categories; thus the total number of binary-encoded features will be three. Compared to One Hot Encoding, this requires far fewer feature columns (for 100 categories, One Hot Encoding will have 100 features, while binary encoding needs just seven).

For binary encoding, one has to follow these steps:

The categories are first converted to numeric order starting from 1 (the order is created as the categories appear in the dataset and does not imply any ordinal nature)

Then those integers are converted into binary code, so for example 3 becomes 011,
4 becomes 100

Then the digits of the binary number form separate columns.



Refer to the below diagram for better intuition.

We will use the category_encoders package for this, and the function name is
BinaryEncoder.
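A minimal sketch with category_encoders:

import category_encoders as ce

# Four Temperature categories -> three binary columns
binary_encoder = ce.BinaryEncoder(cols=['Temperature'])
df_binary = binary_encoder.fit_transform(df)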


Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In the cases where the
frequency is related somewhat with the target variable, it helps the model to
understand and assign the weight in direct and inverse proportion, depending on the
nature of the data. Three-step for this :

Select a categorical variable you would like to transform

Group by the categorical variable and obtain counts of each category

Join it back with the training dataset

Pandas code can be constructed as below:
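A minimal pandas sketch (the new column name is my own; raw counts could be used instead of normalised frequencies):

# Frequency (proportion) of each Temperature category, mapped back onto the rows
freq = df.groupby('Temperature').size() / len(df)
df['Temp_freq_encoded'] = df['Temperature'].map(freq)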

Mean Encoding
Mean Encoding or Target Encoding is one popular encoding approach followed by Kagglers. There are many variations of this; here I will cover the basic version and the smoothing version. Mean encoding is similar to label encoding, except here the labels are correlated directly with the target. For example, in mean target encoding, each category in the feature is replaced with the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning. Mean encoding is, however, notorious for over-fitting; thus regularization with cross-validation or some other approach is a must on most occasions. The mean encoding approach is as below:

1. Select a categorical variable you would like to transform

2. Group by the categorical variable and obtain aggregated sum over the “Target”
variable. (total number of 1’s for each category in ‘Temperature’)

3. Group by the categorical variable and obtain aggregated count over “Target”
variable

4. Divide the result of step 2 by the result of step 3 and join it back with the training data.


Sample code for the data-frame:
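A minimal sketch of steps 1-4 in pandas (the new column name is my own):

# Mean of Target per Temperature category (sum / count), mapped back onto the rows
mean_map = df.groupby('Temperature')['Target'].mean()
df['Temp_mean_encoded'] = df['Temperature'].map(mean_map)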


Mean encoding can embody the target in the label, whereas label encoding does not
correlate with the target. In the case of a large number of features, mean encoding
could prove to be a much simpler alternative. Mean encoding tends to group the
classes, whereas the grouping is random in case of label encoding.

There are many variations of target encoding in practice, like smoothing. Smoothing can be implemented as below:
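One common form of smoothing blends each category mean with the global mean, weighting by the category count; the smoothing factor m below is an illustrative choice, not a value from the article:

global_mean = df['Target'].mean()
agg = df.groupby('Temperature')['Target'].agg(['count', 'mean'])

m = 2  # smoothing strength: larger m pulls small categories closer to the global mean
smooth = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
df['Temp_smooth_encoded'] = df['Temperature'].map(smooth)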


Weight of Evidence Encoding


Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique to
separate good and bad. This method was developed primarily to build a predictive
model to evaluate the risk of loan default in the credit and financial industry. Weight of
evidence (WOE) is a measure of how much the evidence supports or undermines a
hypothesis.

It is computed as below:
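For each group (category), WoE = ln( P(Goods) / P(Bads) ), i.e. the natural log of the proportion of all good outcomes that fall in the group divided by the proportion of all bad outcomes that fall in the group.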

WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.

WoE is well suited for Logistic Regression because the Logit transformation is simply
the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded
predictors in Logistic Regression, the predictors are all prepared and coded to the same
scale. The parameters in the linear logistic regression equation can be directly
compared.

The WoE transformation has (at least) three advantages:


1) It can transform an independent variable so that it establishes a monotonic relationship to the dependent variable. It does more than this: to secure a monotonic relationship it would be enough to "recode" it to any ordered measure (for example 1, 2, 3, 4...), but the WoE transformation orders the categories on a "logistic" scale, which is natural for Logistic Regression.
2) For variables with too many (sparsely populated) discrete values, these can be
grouped into categories (densely populated), and the WoE can be used to express
information for the whole category
3) The (univariate) effect of each category on the dependent variable can be compared
across categories and variables because WoE is a standardized value (for example you
can compare WoE of married people to WoE of manual workers)

It also has (at least) three drawbacks:


1) Loss of information (variation) due to binning to a few categories
2) It is a “univariate” measure, so it does not take into account the correlation
between independent variables
3) It is easy to manipulate (over-fit) the effect of variables according to how categories
are created

The code snippet below shows how one can calculate WoE.

Once we calculate the WoE for each group, we can map it back to the data-frame.
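A minimal sketch, treating Target = 1 as "Good" and Target = 0 as "Bad"; the small constant guards against empty groups:

import numpy as np

woe_df = pd.DataFrame(index=df['Temperature'].unique())
woe_df['Good'] = df[df['Target'] == 1].groupby('Temperature').size()
woe_df['Bad'] = df[df['Target'] == 0].groupby('Temperature').size()
woe_df = woe_df.fillna(0.0001)  # avoid division by zero / log(0)

# Distribution of Goods and Bads across the groups, then the WoE per group
woe_df['P_Good'] = woe_df['Good'] / woe_df['Good'].sum()
woe_df['P_Bad'] = woe_df['Bad'] / woe_df['Bad'].sum()
woe_df['WoE'] = np.log(woe_df['P_Good'] / woe_df['P_Bad'])

# Map the per-group WoE values back onto the data-frame
df['Temp_WoE'] = df['Temperature'].map(woe_df['WoE'])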


Probability Ratio Encoding


Probability Ratio Encoding is similar to Weight of Evidence (WoE), with the only difference being that the ratio of good and bad probabilities is used directly rather than its logarithm. For each label, we calculate the mean of target=1, that is the probability of being 1 (P(1)), and also the probability of the target=0 (P(0)). Then we calculate the ratio P(1)/P(0) and replace the labels by that ratio. We need to add a minimal value to P(0) to avoid any divide-by-zero scenarios where, for a particular category, there is no target=0.
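A minimal pandas sketch (the small constant plays the role of the "minimal value" mentioned above):

prob = df.groupby('Temperature')['Target'].mean().to_frame('P_1')
prob['P_0'] = (1 - prob['P_1']).replace(0, 0.0001)  # avoid division by zero
df['Temp_prob_ratio'] = df['Temperature'].map(prob['P_1'] / prob['P_0'])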


Hashing
Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high.

(Sample Code — Will be updated in a future version of this article)
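In the meantime, a minimal sketch with category_encoders' HashingEncoder (the number of output components is an illustrative choice):

import category_encoders as ce

hashing_encoder = ce.HashingEncoder(cols=['Color'], n_components=4)
df_hashed = hashing_encoder.fit_transform(df)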

Backward Difference Encoding


In backward difference coding, the mean of the dependent variable for a level is
compared with the mean of the dependent variable for the prior level. This type of
coding may be useful for a nominal or an ordinal variable.

This technique falls under the contrast coding system for categorical features. A feature
of K categories, or levels, usually enters a regression as a sequence of K-1 dummy
variables.

(Sample Code — Will be updated in a future version of this article)
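In the meantime, a minimal sketch with category_encoders (drop_invariant again removes the constant intercept column):

import category_encoders as ce

bd_encoder = ce.BackwardDifferenceEncoder(cols=['Temperature'], drop_invariant=True)
df_backward = bd_encoder.fit_transform(df)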

Leave One Out Encoding



This is very similar to target encoding but excludes the current row’s target when
calculating the mean target for a level to reduce the effect of outliers.

(Sample Code — Will be updated in a future version of this article)
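In the meantime, a minimal sketch with category_encoders (note that the target is passed alongside the features):

import category_encoders as ce

loo_encoder = ce.LeaveOneOutEncoder(cols=['Temperature'])
df_loo = loo_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])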

James-Stein Encoding
For a given feature value, the James-Stein estimator returns a weighted average of:

1. The mean target value for the observed feature value.

2. The mean target value (regardless of the feature value).

The James-Stein encoder shrinks the average toward the overall average; it is a target-based encoder. The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.

(Sample Code — Will be updated in a future version of this article)
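In the meantime, a minimal sketch with category_encoders:

import category_encoders as ce

js_encoder = ce.JamesSteinEncoder(cols=['Temperature'])
df_js = js_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])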

M-estimator Encoding
M-Estimate Encoder is a simplified version of the Target Encoder. It has only one hyper-parameter, m, which represents the power of regularization: the higher the value of m, the stronger the shrinking. Recommended values for m are in the range of 1 to 100.

(Sample Code — Will be updated in a future version of this article)
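In the meantime, a minimal sketch with category_encoders (m = 5 is an illustrative choice):

import category_encoders as ce

m_encoder = ce.MEstimateEncoder(cols=['Temperature'], m=5.0)
df_m = m_encoder.fit_transform(df[['Temperature', 'Color']], df['Target'])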

FAQ:
I received many queries about which method to use and about how to treat the test data when there is no target value. I am adding an FAQ section here, which I hope will assist.

FAQ 01: Which method should I use?

Answer: There is no single method that works for every problem or dataset. You may have to try a few to see which gives a better result. The general guideline is to refer to the cheat-sheet shown at the end of the article.

FAQ 02: How do I create a categorical encoding such as target encoding when the test data won't have any target value?


Answer: We need to use the mapping values created at the time of training. This is the same concept as in scaling or normalization, where we use the training data to scale or normalize the test data. We save the map and use the same map during testing-time pre-processing. We can even create a dictionary of each category and its mapped value and then use that dictionary at testing time. Here I am using mean encoding to explain this.

Training Time
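A minimal sketch, assuming the data has already been split into train_df and test_df (both names are mine):

# Learn the category -> mean(Target) mapping from the training data only
mean_map = train_df.groupby('Temperature')['Target'].mean().to_dict()
train_df['Temp_mean_encoded'] = train_df['Temperature'].map(mean_map)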

Testing Time
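At testing time we reuse exactly that mapping; in this sketch a category never seen in training falls back to the overall training mean:

global_mean = train_df['Target'].mean()
test_df['Temp_mean_encoded'] = test_df['Temperature'].map(mean_map).fillna(global_mean)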


Conclusion
It is essential to understand that, for all machine learning models, these encodings do not work well in all situations or for every dataset. Data scientists still need to experiment and find out which works best for their specific case. If the test data contains classes the training data did not, then some of these methods won't work, as the features won't be comparable. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these on smaller datasets first and then decide where to put more focus on tuning the encoding process. You can use the below cheat-sheet as a guiding tool.



Thanks for reading. You can connect me on LinkedIn.
