All About Categorical Variable Encoding
IN-DEPTH ANALYSIS
Baijayanta Roy
Jul 17, 2019 · 13 min read
Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values. Many algorithms' performance varies based on how categorical variables are encoded.
Categorical variables can be divided into two categories: nominal (no particular order) and ordinal (ordered).
There are many ways we can encode these categorical variables as numbers and use them in an algorithm. I will cover most of them, from basic to more advanced, in this post. These are the encodings I will cover:
1) One Hot Encoding
2) Label Encoding
3) Ordinal Encoding
4) Helmert Encoding
5) Binary Encoding
6) Frequency Encoding
7) Mean Encoding
8) Weight of Evidence Encoding
9) Probability Ratio Encoding
10) Hashing Encoding
11) Backward Difference Encoding
12) Leave One Out Encoding
13) James-Stein Encoding
14) M-estimator Encoding
One Hot Encoding
In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of that category, creating one binary feature column per category.
Scikit-learn has OneHotEncoder for this purpose, but it does not create the labeled feature columns by itself; some additional code is needed, as shown in the code sample below.
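Here is a minimal sketch of that extra step; the 'Temperature' column and its values are illustrative assumptions, and get_feature_names_out assumes scikit-learn 1.0 or later:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# illustrative data, not the article's exact frame
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']})

enc = OneHotEncoder()
one_hot = enc.fit_transform(df[['Temperature']]).toarray()

# extra code: turn the raw array back into labeled feature columns
df_onehot = pd.DataFrame(one_hot, columns=enc.get_feature_names_out(['Temperature']))
df = pd.concat([df, df_onehot], axis=1)
print(df)
```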
One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to encode the one that is left out. Usually, for regression, we use N-1 (dropping the first or last column of the one-hot-coded features), but for classification the recommendation is to use all N columns, as most tree-based algorithms build the tree from all available variables. One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features as it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables give complete information about (completely represent) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at all the features at the same time during training, for example, support vector machines, neural networks, and clustering algorithms.
Tree-based methods, on the other hand, may never get to consider the dropped category. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into N binary variables and not drop any.
Label Encoding
In this encoding, each category is assigned an integer value from 1 through N (where N is the number of categories for the feature); scikit-learn's LabelEncoder assigns 0 through N-1. One major issue with this approach is that there is no relation or order between these classes, yet the algorithm might interpret the integers as implying one. In the example below, the encoding may suggest Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). The scikit-learn code for the data frame is as follows:
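A minimal sketch with made-up data (the 'Temperature' column is an illustrative assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
le = LabelEncoder()
# integers follow alphabetical order: Cold(0), Hot(1), Very Hot(2), Warm(3)
df['Temp_label_encoded'] = le.fit_transform(df['Temperature'])
print(df)
```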
Ordinal Encoding
We do ordinal encoding to ensure that the encoding retains the ordinal nature of the variable. This is reasonable only for ordinal variables, as I mentioned at the beginning of this article. This encoding looks almost similar to Label Encoding but is slightly different, as Label Encoding does not consider whether the variable is ordinal; it simply assigns a sequence of integers, either
in the order the data appears (Pandas assigned Hot (0), Cold (1), "Very Hot" (2), and Warm (3)), or
in alphabetically sorted order (scikit-learn assigned Cold (0), Hot (1), "Very Hot" (2), and Warm (3)).
If we take the temperature scale as the order, the ordinal values should run from Cold to Very Hot, so ordinal encoding will assign values as Cold (1) < Warm (2) < Hot (3) < Very Hot (4). Usually, ordinal encoding is done starting from 1.
Refer to the code below using Pandas: first, we assign the original order of the variable through a dictionary; then we map each row of the variable as per the dictionary.
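A sketch of that dictionary-based mapping, with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
# dictionary holding the true order of the variable
temp_dict = {'Cold': 1, 'Warm': 2, 'Hot': 3, 'Very Hot': 4}
# map each row to its ordinal value as per the dictionary
df['Temp_ordinal'] = df['Temperature'].map(temp_dict)
print(df)
```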
Though it is very straightforward, it requires code that spells out the ordinal values and the actual text-to-integer mapping as per that order.
Helmert Encoding
In this encoding, the mean of the dependent variable for a level is compared to the
mean of the dependent variable over all previous levels.
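One possible sketch using the category_encoders package (the 'Temperature' data is made up):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
# drop_invariant drops any all-constant output columns
encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
df_helmert = encoder.fit_transform(df)
print(df_helmert)
```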
Binary Encoding
Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, binary encoding needs only about log₂(n) features. In this example, we have four categories; thus, the binary-encoded representation needs three feature columns. Compared to One Hot Encoding, this requires far fewer feature columns (for 100 categories, One Hot Encoding will have 100 features, while binary encoding needs just seven).
The categories are first converted to numeric order starting from 1 (the order is created as categories appear in the dataset and does not imply any ordinal nature).
Then those integers are converted into binary code, so, for example, 3 becomes 011 and 4 becomes 100.
We will use the category_encoders package for this, and the function name is
BinaryEncoder.
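For example (illustrative data; four categories become three binary columns):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
encoder = ce.BinaryEncoder(cols=['Temperature'])
df_binary = encoder.fit_transform(df)
print(df_binary)
```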
Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In cases where the frequency is somewhat related to the target variable, it helps the model understand and assign weights in direct or inverse proportion, depending on the nature of the data. There are three steps (see the sketch after this list):
1. Select the categorical variable you would like to transform.
2. Group by the categorical variable and obtain the count (or frequency) of each category.
3. Join the counts back with the dataset.
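A sketch of these three steps in Pandas, with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm',
                                   'Hot', 'Warm', 'Hot']})
# normalized frequency of each category
fe = df['Temperature'].value_counts(normalize=True)
# join the frequencies back to the dataset
df['Temp_freq_encoded'] = df['Temperature'].map(fe)
print(df)
```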
Mean Encoding
Mean Encoding or Target Encoding is a popular encoding approach followed by Kagglers. There are many variations of this. Here I will cover the basic version and the
smoothing version. Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. For example, in mean target encoding, the label for each category of the feature is decided by the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. The advantages of mean target encoding are that it does not affect the volume of the data and it helps in faster learning. However, mean encoding is notorious for over-fitting; thus, regularization with cross-validation or some other approach is a must on most occasions. The mean encoding approach is as below:
1. Select the categorical variable you would like to transform.
2. Group by the categorical variable and obtain the aggregated sum over the "Target" variable (the total number of 1's for each category in 'Temperature').
3. Group by the categorical variable and obtain the aggregated count over the "Target" variable.
4. Divide the step 2 result by the step 3 result and join it back with the train data (see the sketch after this list).
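A sketch of these steps in Pandas (the 'Temperature' and binary 'Target' values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
# step 2: sum of Target per category (number of 1's)
sums = df.groupby('Temperature')['Target'].sum()
# step 3: count of Target per category
counts = df.groupby('Temperature')['Target'].count()
# step 4: divide and join back with the train data
df['Temp_mean_encoded'] = df['Temperature'].map(sums / counts)
print(df)
```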
Mean encoding can embody the target in the label, whereas label encoding has no correlation with the target. In the case of a large number of features, mean encoding could prove to be a much simpler alternative. Mean encoding tends to group the classes, whereas the grouping is random with label encoding.
There are many variations of this target encoding in practice, like smoothing. Smoothing can be implemented as below:
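One possible smoothing sketch; the weight variable here is a hypothetical smoothing factor, not a value from the article:

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
weight = 10  # hypothetical smoothing factor; higher pulls harder toward the global mean
global_mean = df['Target'].mean()
agg = df.groupby('Temperature')['Target'].agg(['count', 'mean'])
# blend each category's mean with the global mean, weighted by category size
smoothed = (agg['count'] * agg['mean'] + weight * global_mean) / (agg['count'] + weight)
df['Temp_smooth_encoded'] = df['Temperature'].map(smoothed)
print(df)
```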
Weight of Evidence Encoding
Weight of Evidence (WoE) measures the "strength" of a grouping technique in separating good and bad outcomes. It is computed as below:
WoE = ln( P(Goods) / P(Bads) )
WoE will be 0 if P(Goods) / P(Bads) = 1, that is, if the outcome is random for that group. If P(Bads) > P(Goods), the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, P(Goods) > P(Bads) in a group, then WoE > 0.
WoE is well suited for Logistic Regression because the Logit transformation is simply
the log of the odds, i.e., ln(P(Goods)/P(Bads)). Therefore, by using WoE-coded
predictors in Logistic Regression, the predictors are all prepared and coded to the same
scale. The parameters in the linear logistic regression equation can be directly
compared.
The code snippet below explains how one can calculate WoE. Once we calculate WoE for each group, we can map it back to the data frame.
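A possible sketch of such a calculation; the clip guards are my addition to avoid taking the log of zero on groups that are purely good or purely bad:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
# P(Goods) = probability of Target = 1 per group; P(Bads) = 1 - P(Goods)
woe = df.groupby('Temperature')['Target'].mean().to_frame('Goods')
woe['Bads'] = 1 - woe['Goods']
# guard against log(0) or division by zero for pure groups
woe['WoE'] = np.log(woe['Goods'].clip(lower=1e-6) / woe['Bads'].clip(lower=1e-6))
# map the WoE of each group back to the data frame
df['Temp_WoE'] = df['Temperature'].map(woe['WoE'])
print(df)
```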
Hashing
Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high.
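A sketch using category_encoders' HashingEncoder, with illustrative data (the n_components value shown is simply its default):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
# n_components fixes the output dimensionality regardless of cardinality
encoder = ce.HashingEncoder(cols=['Temperature'], n_components=8)
df_hashed = encoder.fit_transform(df)
print(df_hashed)
```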
Backward Difference Encoding
This technique falls under the contrast coding systems for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables. In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level.
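A minimal sketch with category_encoders (illustrative data):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm']})
encoder = ce.BackwardDifferenceEncoder(cols=['Temperature'])
df_bd = encoder.fit_transform(df)
print(df_bd)
```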
Leave One Out Encoding
This is very similar to target encoding, but it excludes the current row's target when calculating the mean target for a level, to reduce the effect of outliers.
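A sketch with category_encoders' LeaveOneOutEncoder (illustrative data):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
encoder = ce.LeaveOneOutEncoder(cols=['Temperature'])
# each row is encoded with the mean target of its category, excluding itself
df_loo = encoder.fit_transform(df[['Temperature']], df['Target'])
print(df_loo)
```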
James-Stein Encoding
For a feature value, the James-Stein estimator returns a weighted average of:
1. the mean target value for the observed feature value, and
2. the mean target value regardless of the feature value.
The James-Stein encoder shrinks the average toward the overall average. It is a target-based encoder. The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.
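For illustration, a sketch with category_encoders' JamesSteinEncoder (made-up data):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
encoder = ce.JamesSteinEncoder(cols=['Temperature'])
# each category's mean target is shrunk toward the overall mean
df_js = encoder.fit_transform(df[['Temperature']], df['Target'])
print(df_js)
```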
M-estimator Encoding
The M-Estimate Encoder is a simplified version of the Target Encoder. It has only one hyperparameter, m, which represents the power of regularization. The higher the value of m, the stronger the shrinking. Recommended values for m are in the range of 1 to 100.
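A sketch with category_encoders' MEstimateEncoder; the m value below is an arbitrary choice for illustration:

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
# m controls regularization strength; a higher m shrinks harder
encoder = ce.MEstimateEncoder(cols=['Temperature'], m=5.0)
df_m = encoder.fit_transform(df[['Temperature']], df['Target'])
print(df_m)
```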
FAQ:
I received many queries about which encoding to use and how to treat the test data when there is no target. I am adding an FAQ section here, which I hope will help.
Faq 01: Which encoding method works best?
Answer: There is no single method that works for every problem or dataset. You may have to try a few to see which gives the better result. The general guideline is to refer to the cheat sheet shown at the end of the article.
Faq 02: How do I create a categorical encoding, for a method like target encoding, when the test data won't have any target value?
Answer: We need to use the mapping values created at the time of training. This is the same concept as in scaling or normalization, where we use the train data to fit the scaler and apply it to the test data. We save the map and use the same map in testing-time pre-processing. We can even create a dictionary of each category and its mapped value, and then use that dictionary at testing time. Here I am using mean encoding to explain this.
Training Time
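A possible training-time sketch with mean encoding (illustrative data):

```python
import pandas as pd

train = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm'],
    'Target':      [1,     1,      1,          0,      1,     0],
})
# learn the category -> mean(Target) map on the training data only
mean_map = train.groupby('Temperature')['Target'].mean().to_dict()
train['Temp_mean_encoded'] = train['Temperature'].map(mean_map)
print(train)
```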
Testing Time
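And the corresponding testing-time sketch, continuing from the snippet above; the fallback to the global training mean for unseen categories is a design choice for illustration, not from the article:

```python
# reuse the dictionary learned at training time; no target is needed here
test = pd.DataFrame({'Temperature': ['Hot', 'Warm', 'Cold', 'Freezing']})
test['Temp_mean_encoded'] = test['Temperature'].map(mean_map)
# 'Freezing' was never seen in training, so fall back to the global train mean
test['Temp_mean_encoded'] = test['Temp_mean_encoded'].fillna(train['Target'].mean())
print(test)
```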
Conclusion
It is essential to understand that these encodings do not work well for every machine learning model, in every situation, or for every dataset. Data scientists still need to experiment and find out which works best for their specific case. If the test data has different classes, some of these methods won't work, as the features won't be comparable. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these with smaller datasets and then decide where to focus your tuning of the encoding process. You can use the cheat sheet below as a guiding tool.
Categorical encoding cheat sheet (Source)