An investigation of categorical
variable encoding techniques in
machine learning: binary versus
one-hot and feature hashing
CEDRIC SEGER
Abstract
Machine learning methods can be used for solving important binary clas-
sification tasks in domains such as display advertising and recommender
systems. In many of these domains categorical features are common and
often of high cardinality. Using one-hot encoding in such circumstances
leads to very high-dimensional vector representations, causing memory
and computational concerns for machine learning models. This thesis in-
vestigated the viability of a binary encoding scheme in which categorical
values were mapped to integers that were then encoded in a binary for-
mat. This binary scheme allowed for representing categorical features us-
ing log2(d)-dimensional vectors, where d is the dimension associated with
a one-hot encoding. To evaluate the performance of the binary encoding,
it was compared against one-hot and feature hashed representations with
the use of linear logistic regression and neural network based models.
These models were trained and evaluated using data from two publicly
available datasets: Criteo and Census. The results showed that a one-hot
encoding with a linear logistic regression model gave the best performance
according to the PR-AUC metric. This was, however, at the expense of
using 118- and 65,953-dimensional vector representations for Census and
Criteo respectively. A binary encoding led to a lower performance but
used only 35 and 316 dimensions respectively. For Criteo, binary encoding
suffered significantly in performance and feature hashing was perceived
as a more viable alternative. It was also found that employing a neural
network helped mitigate any loss in performance associated with using
binary and feature hashed representations.
Sammanfattning
Machine learning methods can be used to solve important binary classification
tasks in domains such as display advertising and recommender systems. In many
of these domains categorical variables are common and often of high cardinality.
Using one-hot encoding in such circumstances leads to very high-dimensional
vector representations, which causes memory and computational problems for
machine learning models. This thesis investigated the viability of a binary
encoding scheme in which categorical values were mapped to integer values that
were then encoded in a binary format. This binary scheme made it possible to
represent categorical values using log2(d)-dimensional vectors, where d is the
dimension associated with a one-hot encoding. To evaluate the performance of
the binary encoding, it was compared against one-hot and a hash-based encoding.
A linear logistic regression model and a neural network were trained using data
from two publicly available datasets, Criteo and Census, and their final
performance was compared. The results showed that a one-hot encoding with a
linear logistic regression model gave the best performance according to the
PR-AUC metric. This method, however, used 118- and 65,953-dimensional vector
representations for Census and Criteo respectively. A binary encoding led to
lower performance in general, but used only 35 and 316 dimensions respectively.
The binary encoding performed significantly worse specifically for the Criteo
data; there, the hash-based encoding was a more attractive solution. The loss
in performance associated with the binary and hash-based encodings could be
mitigated by using a neural network.
Contents

1 Introduction
  1.1 Background
  1.2 Problem
  1.3 Purpose
  1.4 Objectives
  1.5 Methodology
  1.6 Outline
2 Background
  2.1 Classification
    2.1.1 Linear Logistic Regression
    2.1.2 Artificial Neural Networks
    2.1.3 Learning
  2.2 Feature Representation
    2.2.1 One-hot
    2.2.2 Binary
    2.2.3 Feature hashing
3 Method
  3.1 Data
    3.1.1 Census
    3.1.2 Criteo
  3.2 Models
    3.2.1 Learning
  3.3 Input Pipeline
  3.4 Metrics
4 Results
  4.1 Results for Census Data
  4.2 Results for Criteo Data
  4.3 Discussion
    4.3.1 Limitations
5 Conclusion
  5.1 Further Research
Chapter 1
Introduction
1.1 Background
Many important problems that machine learning tries to solve are of a bi-
nary nature. Recommendation systems in which the goal is to recommend
a product can be phrased as a binary classification problem by predicting
whether a person may like an item or not. The quality of such recommen-
dation engines can have far reaching business and customer impact as ev-
1.2 Problem
A one-hot representation, although commonly used, has several disad-
vantages. For example, one-hot requires storing a dictionary that maps
categorical features to vector indices. When the cardinality of the categor-
ical features is large, these dictionaries can pose a significant strain on a
computer’s memory resources [1]. In addition, in sparse and high dimen-
sional feature domains, storing the parameter vectors for one-hot encoded
data becomes troublesome [15], even for simple models. Thus the problem
that we seek to solve is to find different ways to represent categorical data
1.3 Purpose
The purpose of this report is to investigate the relative performance dif-
ferences that result from using different encoding techniques for categorical features.
1.4 Objectives
In order to answer the research question and achieve the aim of this study
several goals need to be accomplished: data with a large number of cate-
gorical features needs to be collected. Using this data, binary classification
models - using one-hot, binary and feature hashed representations of cat-
egorical input - need to be trained. Lastly, suitable performance measure-
ments need to be defined and used to compare the trained models.
1.5 Methodology
To accomplish the goals, this study employs a quantitative, empirical re-
search approach. For the data collection part, two publicly available datasets
- Census income data [17] and Criteo ad-click prediction data [18] - have
been chosen due to their different characteristics. This should help in pro-
viding a more general answer to the research question. Further, any con-
tinuous data features are either discarded or converted to categorical fea-
tures. This allows the research to focus solely on the impact of encoding
categorical features and limits the influence of external factors.
In terms of models, a linear logistic regression model is used since this
is a widely used model in practice [2]. However, due to the popularity of
neural networks, a neural network based logistic regression model is also
trained. This follows our hypothesis that the type of feature representation
and amount of compression will matter less for a neural network than for
a simple linear model. For each combination of model, dataset, and input
encoding, a model is trained, resulting in a total of 12 trained models.
To evaluate the trained models - and thereby gauge the performance
implications of the various input encoding strategies - precision, recall and
area-under-curve for the precision-recall curve are used as performance
metrics. These metrics are commonly employed in binary classification
tasks [19].
1.6 Outline
In order to familiarize readers with the essential concepts discussed in this
report, a comprehensive background is given in chapter two. The method
is described in chapter three and details the data, models, metrics and ex-
periments. Chapter four describes the results, including a discussion and
limitations section. Chapter five concludes and suggests potential future
work.
Chapter 2
Background
2.1 Classification
Classification problems are concerned with classifying samples into dis-
tinct categories. In terms of binary classification, the two classes are often
referred to as the positive class and the negative class and the goal is to
determine whether a sample belongs to the positive or negative class.
a = w^T x                                 (Affine transformation)

P(Y = 1 | x) = e^a / (1 + e^a)            (Logistic function)        (2.1)

P(Y = 1 | x) = sigmoid(a) = ŷ             (Short version)
where w represents a parameter vector and x the input to the model. While
the relationship between input and output is modeled with the help of the
model parameters, the sigmoid function in 2.1 restricts the output of the
model to the range (0, 1) and allows for the output to be interpreted as a
probability.
2.1.3 Learning
In order for a model to be useful it is necessary to learn the parameters
of the model. Logistic regression models, and many other machine learn-
ing models, makes use of the principle of maximum likelihood to fit the
parameters of the model [21]. Particularly it is common to minimize the
negative log-likelihood of the data rather than maximizing the likelihood.
For logistic regression models that differentiate between only two classes,
it is possible to write the negative log-likelihood as

L(w) = - Σ_i [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]        (2.2)

where y_i is the label of sample i and ŷ_i is the probability of the sample
belonging to the positive class as defined in equation 2.1. The expression in
equation 2.2 is also commonly referred to as a cost function in the machine
learning literature.
where m represents the batch size and η is a constant called the learning
rate. Learning is thus made possible by iterating through the training data
and performing the updates as shown in equation 2.3. Note that neural
network based logistic regression and linear logistic regression models can
both be trained using the maximum likelihood approach and the concept
of following gradient information.
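To make this concrete, the following is a minimal NumPy sketch of one mini-batch gradient step for linear logistic regression; the function and variable names (sgd_step, eta for the learning rate) and the toy data are illustrative and not taken from the thesis.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(w, X_batch, y_batch, eta=0.1):
    """One mini-batch update for linear logistic regression.

    w: parameter vector (n_features,); X_batch: (m, n_features); y_batch: labels in {0, 1}.
    """
    m = X_batch.shape[0]
    y_hat = sigmoid(X_batch @ w)                # predicted P(Y = 1 | x) per sample
    grad = X_batch.T @ (y_hat - y_batch) / m    # gradient of the mean negative log-likelihood
    return w - eta * grad                       # step against the gradient

# Toy usage: 100 samples with 5 binary features, label determined by the first feature.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5)).astype(float)
y = X[:, 0]
w = np.zeros(5)
for _ in range(200):        # repeated full-batch steps, purely for illustration
    w = sgd_step(w, X, y)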
2.2.1 One-hot
The most common approach to converting categorical features to a suit-
able format for use as input to a machine learning model is one-hot en-
coding. Continuing with the example of predicting a person’s salary, it is
possible that the person’s type of employment is an important factor to
consider. For example, a lawyer tends to make more money than a stu-
dent. Assuming that we wish to differentiate between four types of em-
ployment - student, teacher, doctor and banker - it is possible to represent
this information using one-hot encoding as shown in figure 2.2.
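As a small illustration of this mapping, the following Python sketch one-hot encodes the four employment types from the example above; the vocabulary and helper names are illustrative, not code from the thesis.

import numpy as np

vocabulary = ["student", "teacher", "doctor", "banker"]
index = {value: i for i, value in enumerate(vocabulary)}   # dictionary mapping value -> vector index

def one_hot(value):
    vec = np.zeros(len(vocabulary))
    vec[index[value]] = 1.0      # exactly one position is set
    return vec

print(one_hot("doctor"))         # [0. 0. 1. 0.]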
2.2.2 Binary
Categorical data can be represented in a binary format by first assigning
a numerical value to each category and then converting it to its binary
representation. For a feature with d unique values, this results in log2(d)
binary (on or off) values. The process is shown graphically in
figure 2.3.
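The same toy vocabulary can be used to sketch the binary encoding: each category is assigned an integer, which is then written out with log2(d) bits (rounded up when d is not a power of two). The helper below is an illustrative sketch, not the thesis implementation.

import math

vocabulary = ["student", "teacher", "doctor", "banker"]     # d = 4 unique values
index = {value: i for i, value in enumerate(vocabulary)}
n_bits = max(1, math.ceil(math.log2(len(vocabulary))))      # log2(4) = 2 bits

def binary_encode(value):
    i = index[value]
    return [(i >> bit) & 1 for bit in reversed(range(n_bits))]   # most significant bit first

print(binary_encode("student"))   # [0, 0]
print(binary_encode("banker"))    # [1, 1]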
To the best of our knowledge, few attempts [16] have been made to
study binary encoding in a formal setting.
2.2.3 Feature hashing

If we assume x is an integer key, then a hash function may, for example, map any
integer, x, to another integer in the set {0, 1, 2, 3, 4}. An important point is
that it is possible to design hash functions for a variety of keys - a hash
function that maps string keys to integers is an example. This property
has classically allowed hash functions to be used for creating efficient data
structures such as hash-tables.
The application of hashing in a machine learning context becomes clear
by noting that raw categorical data is usually stored in string format. Thus
it is possible to treat the raw string as a key for input to a hash function.
For example, when dealing with categorical features, in practice it is com-
mon to concatenate the name of the category and its actual value at a data point
[15, 1]. If the category is ’employment’ and a particular sample has the
value ’student’, then the input to the hash function would be ’employ-
ment=student’. By design, the output of the hash function is an integer
number that can be used to index into a feature vector, similar to how a
hash-table look-up is performed. The process of converting categorical
values to a suitable feature vector using hashing is illustrated in figure 2.4.
The hash function used in figure 2.4 hashes keys to integers in the range
[0, 3] and hence results in 4-dimensional vectors. In practice we are free to
choose the range of the output and thereby allow for dimensionality re-
duction. It is possible to choose to hash the values of the employment
category to the set {0, 1} and hence represent the employment category
using a 2-dimensional vector. The cost of reducing the dimension of a vec-
tor, however, is the potential loss of information: two keys can hash to the
same index. Hash collisions become more likely with a smaller number
of dimensions. Empirically, it seems that hash collisions do not signifi-
cantly impact prediction performance of machine learning models [26, 12].
To reduce the impact of hashing collisions, using a second hashing func-
tion has been proposed [12].
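The following Python sketch illustrates the hashing trick described above, including a sign chosen by a second hash to dampen collisions, as proposed in [12]; the use of MD5 (picked only because it is stable across runs) and all names are assumptions made for the example, not details from the thesis.

import hashlib

import numpy as np

def hashed_vector(category, value, n_dims=4):
    key = f"{category}={value}".encode("utf-8")
    h = int(hashlib.md5(key).hexdigest(), 16)         # deterministic hash of 'category=value'
    bucket = h % n_dims                               # index in [0, n_dims)
    sign = 1.0 if (h // n_dims) % 2 == 0 else -1.0    # second hash decides the sign
    vec = np.zeros(n_dims)
    vec[bucket] += sign
    return vec

print(hashed_vector("employment", "student", n_dims=4))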
Beyond reducing the dimensionality, other interesting properties of
feature hashing include the ability to use it in an online fashion and its ability
to handle variable-length vocabularies. For example, feature hashing
can be used as a fast method for text-feature vector extraction [28].
Weinberger et al. [12] also argue that feature hashing preserves information
as well as random projections and show that hashed feature vectors ap-
proximately preserve similarity measures such as inner products between
sample data points.
Chapter 3
Method
for simple models [15], a linear logistic regression model is used for pre-
diction. A neural network based logistic regression model is also tested.
This follows from our hypothesis that the type of feature representation
will matter less for a neural network than for a linear model. Generalized
linear logistic regression models and neural networks based approaches
are widely used in industry [29, 2, 10], making them interesting models to
study. Tree-based models would be other interesting alternatives to study,
but they were not included due to time limitations.
This chapter continues by describing these choices in greater detail,
including the specific experimental setup.
3.1 Data
Two datasets - Census and Criteo - are used for conducting experiments.
Both datasets contain a large number of categorical features, making them
ideal for testing performance implications for categorical feature compres-
sion. The two datasets also have different characteristics, which are out-
lined in the next sections.
3.1.1 Census
The Census Income data [17] consists of 45,222 samples of income data
for adults in the United States taken from the census bureau database.
The goal is to predict whether a person makes more than 50,000 USD in
salary. Each sample consists of 14 mixed continuous and categorical fea-
tures. Some features contained little or highly sparse information and as
such were discarded from the data. Any remaining continuous features
were converted to categorical by discretizing into 10 equal-sized bins. The
resulting categorical features and their cardinalities are shown in table 3.1a.
The class distribution for the complete data is shown in table 3.1b.
(a) Feature cardinalities

Feature            Cardinality
age                10
workclass          7
education          16
marital status     7
occupation         14
relationship       6
race               5
sex                2
hours per week     10
native country     41
TOTAL              118

(b) Class distributions

Class   %
1       24.4%
0       75.6%
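As an illustration of this discretization step, the pandas sketch below cuts synthetic stand-ins for two continuous Census columns into 10 bins; it interprets 'equal-sized' as equal-frequency bins via pd.qcut, which is an assumption (pd.cut would give equal-width bins instead).

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
census = pd.DataFrame({"age": rng.integers(17, 90, size=1000),
                       "hours_per_week": rng.integers(1, 99, size=1000)})

for column in ["age", "hours_per_week"]:
    # 10 equal-frequency bins; the resulting interval labels are treated as categories.
    census[column] = pd.qcut(census[column], q=10, duplicates="drop").astype(str)

print(census["age"].nunique())   # roughly 10 categorical bin labels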
3.1.2 Criteo
The Criteo dataset [18] is a real-world dataset comprised of seven days
of display ad logs from Criteo. Each ad is described by 13 integer and
26 categorical features and the goal is to predict click or no click for each
ad. The original data contains some categorical features with very high
cardinality and rare occurrences. In order to reduce the cardinality of such
features, all infrequently occurring values (feature values occurring fewer than
500 times) were mapped to a new, common category. Further, all contin-
uous features were discretized by mapping them into bins derived from
the relevant feature’s 95th percentile. If the 95th percentile of a continuous
feature turned out to be larger than 100, that feature was simply mapped
into single-sized bins from zero to 150 (resulting in 150 bins of size one).
The resulting features used for prediction had cardinalities in the range of
three to 6,899 with a total sum of cardinalities of 65,953. The class distri-
bution for the complete data can be seen in table 3.2.
Class %
1 24.4%
0 75.6%
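The cardinality-reduction rule for rare categorical values can be sketched as follows; only the 500-occurrence cut-off comes from the description above, while the function name, the placeholder token and the toy data are illustrative assumptions.

import pandas as pd

def group_rare_values(series, min_count=500, rare_token="__rare__"):
    """Map values occurring fewer than min_count times to one shared category."""
    counts = series.value_counts()
    rare = counts[counts < min_count].index
    return series.where(~series.isin(rare), rare_token)

# Toy usage with a hypothetical categorical column.
s = pd.Series(["a"] * 600 + ["b"] * 600 + ["c"] * 3)
print(group_rare_values(s).value_counts())   # 'c' is folded into '__rare__'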
3.2 Models
To test our hypothesis we run experiments on the two datasets using both
a linear and non-linear model. An overview of these models is illustrated
in figure 3.1.
P(Y | x) = sigmoid(w^T x + b)
neural networks [21]. The network consists of 256 and 128 units in the first
and second layer respectively. Specifically, the neural network computes
the following:
l_1 = h(w_1^T x + b_1)
l_2 = h(w_2^T l_1 + b_2)
P(Y | x) = sigmoid(w_3^T l_2 + b_3)

where l_1, l_2 represent the two layers, h is the ReLU activation function,
w_1, w_2, w_3 are the model parameters and b_1, b_2, b_3 are bias terms.
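A minimal NumPy sketch of this forward computation is given below; only the 256- and 128-unit layer sizes, the ReLU activation and the sigmoid output come from the text, while the weight shapes, initialization and toy input are illustrative assumptions.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w1, b1, w2, b2, w3, b3):
    l1 = relu(w1.T @ x + b1)           # first hidden layer, 256 units
    l2 = relu(w2.T @ l1 + b2)          # second hidden layer, 128 units
    return sigmoid(w3.T @ l2 + b3)     # P(Y = 1 | x)

n = 35                                 # e.g. the binary-encoded Census input dimension
rng = np.random.default_rng(0)
w1, b1 = 0.01 * rng.normal(size=(n, 256)), np.zeros(256)
w2, b2 = 0.01 * rng.normal(size=(256, 128)), np.zeros(128)
w3, b3 = 0.01 * rng.normal(size=128), 0.0
x = rng.integers(0, 2, size=n).astype(float)
print(forward(x, w1, b1, w2, b2, w3, b3))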
3.2.1 Learning
All the learning problems considered are binary classification tasks. Cor-
respondingly the final output of the models is P (Y |x) = sigmoid(...) and
represents the probability of a sample, x, belonging to the positive class.
Learning is done by minimizing the binary cross entropy between the true
labels and the predicted conditional probabilities. The cross entropy, or negative log-
likelihood, is widely used as it provides well-behaved gradient updates
[21], which are required for learning to be efficient.
In addition to the log loss, an l2 regularization term is also included
as part of the cost function for the neural network models. L2 regulariza-
tion was not included for the linear model as we found this to worsen the
performance. The gradient of the loss is propagated through the model
to update the parameters using stochastic gradient descent. Learning is
done on minibatches of size 32 with the Adam optimizer [30]. Due to the
large size of the Criteo data, a larger batch size of 512 was used in order
to speed up the training procedure. In general, a small mini-batch size is
motivated by recent research by Masters and Luschi [31], which suggests that
smaller batch sizes improve the stability and reliability of learning by provid-
ing more up-to-date gradient calculations. Further, some of the learning
problems become very high dimensional when encoding input as one-hot
vectors; a smaller batch-size therefore also helps to reduce the memory
footprint.
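A hedged sketch of this training setup using Keras is shown below; the thesis does not state which framework was used, and the regularization strength and placement, the synthetic data and the input dimension are assumptions, while the layer sizes, sigmoid output, binary cross entropy, Adam optimizer and batch size of 32 come from the text.

import numpy as np
import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)            # regularization strength is an assumption
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dense(128, activation="relu", kernel_regularizer=l2),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(curve="PR", name="pr_auc")])

rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(1024, 35)).astype("float32")   # synthetic stand-in data
y_train = rng.integers(0, 2, size=1024).astype("float32")
model.fit(X_train, y_train, batch_size=32, epochs=1, verbose=0)   # batch size 512 was used for Criteo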
In order to train and evaluate the models, the datasets were split into
training and test sets. For the Census data, training and test sets were
constructed by randomly partitioning the data into 80% training and 20%
for testing. The models were then trained for 40 epochs¹ on the training
data and finally evaluated once on the test data. For neural networks, due
to their non-convex optimization, this process was repeated ten times and
the results averaged for the Census dataset.
For Criteo, due to its large size, training for 40 epochs is infeasible. In-
stead, the original data with 45,840,617 samples was split into train and
test sets by taking the last 6,548,660 samples to form the test set. Similar
procedures have been done by others [32] and the reason is that the Criteo
data is chronologically ordered: the last 6,548,660 samples roughly corre-
spond to the 7th day of the collected data. Training was carried out for a
total of one epoch on the training set and the model was evaluated once
on the test set.
¹ One epoch corresponds to iterating over the full training data once.

3.3 Input Pipeline

The raw input is read in from a csv file and each feature is separately
transformed into the chosen representation, such as one-hot or binary encoding.
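A simple sketch of such a per-feature transformation is given below; the toy DataFrame stands in for rows read from the csv file and the encoder helper is an illustrative assumption, not the thesis implementation.

import numpy as np
import pandas as pd

def make_one_hot_encoder(values):
    """Build a per-feature one-hot encoder over the observed vocabulary."""
    index = {v: i for i, v in enumerate(sorted(values))}
    def encode(v):
        vec = np.zeros(len(index))
        vec[index[v]] = 1.0
        return vec
    return encode

frame = pd.DataFrame({"workclass": ["private", "gov", "private"],
                      "sex": ["male", "female", "female"]})
encoders = {col: make_one_hot_encoder(frame[col].unique()) for col in frame.columns}
X = np.concatenate(
    [np.stack([encoders[col](v) for v in frame[col]]) for col in frame.columns],
    axis=1)
print(X.shape)   # (3, 4): two 2-dimensional one-hot blocks concatenated per row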
3.4 Metrics
To compare the performance implications of the different encoding tech-
niques, precision, recall and the area under the precision-recall curve
are used as evaluation metrics. The metrics are defined as follows [19]:
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
where TP is true positives, FN is false negatives and FP stands for false
positives. The recall metric measures the fraction of positive examples
that are labeled correctly. Precision measures the fraction of times that the
classifier is correct when predicting a positive class. For a logistic regres-
sion model that has a probabilistic output, the precision and recall values
are associated with a chosen threshold. A precision-recall curve can be
constructed by plotting points in a (recall, precision)-space by calculating
precision and recall at various thresholds. The area under the precision-
recall curve can be used as a simple metric for comparing the capacity of the
models.
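For reference, the precision-recall curve and its area can be computed as in the following scikit-learn sketch; the library choice and the toy labels and scores are illustrative assumptions.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])   # model outputs P(Y = 1 | x)

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)                        # area under the precision-recall curve
print(round(pr_auc, 3))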
These are common performance measures for binary classification tasks
in other research [2] and are particularly well suited for when the data ex-
hibits class imbalances [19]. Since both Criteo and Census are imbalanced
with respect to class distribution and have the goal of binary classification,
these metrics are suitable to use. Specifically, the precision-recall curve is a
better measure than accuracy for imbalanced datasets since it takes into ac-
count the trade-off in enhancing accuracy by biasing the classifier towards
positive examples [33, 34].
To give a concrete example of why accuracy is not a sufficient metric
for imbalanced data: for a dataset in which 99% of the samples are positive,
a naive classifier can achieve 99% accuracy by simply predicting the positive
class for all samples.
Chapter 4
Results
The precision and recall metrics for each model and input representa-
tion tell a similar story to that of PR-AUC. While precision is relatively
similar for all input encoding techniques and models, recall shows a big-
ger difference. In particular, feature hashing achieved a significantly higher
recall when using a neural network model than when using a linear
model.
ᵃ The standard deviation is based on 10 re-runs of training and evaluating
each model. The PR-AUC reported is the mean of these 10 runs.
4.3 Discussion
All the models, with their respective encoding techniques, performed bet-
ter than a random model would perform. A random model would be ex-
pected to achieve a PR-AUC of approximately 0.240¹, considerably worse
than the worst performing models on Census and Criteo. This suggests
that the models learned are somewhat useful for making predictions. Over-
all a linear model combined with a one-hot encoding of categorical vari-
ables consistently gave the best results. This can be expected since a one-
hot encoding implies that all categories are independent and learns a sep-
arate parameter for each of them; there is no sharing of parameters
between categories. In contrast, a binary encoding, by using fewer param-
eters to represent input, imposes a different assumption about the cate-
gories. For example using a binary coding for a category with four unique
values, it is possible to use the following encoding:
category 1: [0, 0]
category 2: [0, 1]
category 3: [1, 0]
category 4: [1, 1]
which implies that category four is made up of category two and cate-
gory three; category four shares the parameters of these other categories.
Feature hashing achieves the same compression rate as binary encoding, and while
feature hashing tries to preserve the structure of a one-hot encoded vector,
hash collisions inevitably occur. This effect should be especially
pronounced for the Census data, as the compressed vector only has 35 di-
mensions; indeed, feature hashing performs worse than binary on Census. For
the Criteo data, however, where the larger compressed representation has 316 dimen-
sions, feature hashing outperforms binary. This seems to indi-
cate that the binary representation, by imposing explicit parameter shar-
ing between categories, makes it more difficult for a model to perform
well. It seems that an independent encoding achieves better performance
in general.
¹ Calculation takes into account the class distributions of the data sets.
4.3.1 Limitations
The results presented are limited in that only two datasets have been con-
sidered: Census and Criteo. In addition, the evaluation data was chosen
in a simple way; for Census a random 20% of the data was chosen for eval-
uation while for Criteo the last day of data was used. For more robust
² This indicates that it is harder to detect the positive samples using a compressed
representation, but having labeled a sample as positive, the probability for the models to
be correct is about the same.
Chapter 5
Conclusion
The aim of this report was to investigate how the performance of binary
classification models is affected when using a binary encoding of cat-
egorical features rather than a one-hot encoding. Performance was also
measured against a feature hashed representation of input. The input en-
coding schemes were tested on two binary classification tasks and experi-
ments with both a linear and non-linear model were carried out.
The results provide no evidence in favor of a binary encoding with
respect to predictive performance. For high cardinality data, in particular,
encodings more similar to one-hot seem to be easier to optimize and yield
better results. Considering that compression is of greatest practical interest
when the one-hot dimensionality is large, feature hashing may provide a
better alternative to a binary encoding. This is indicated by the results
from the Criteo data.
Further, applying a neural network on top of the compressed repre-
sentations led to better performance but at the cost of introducing more
parameters and thereby more time-consuming computations.
Due to the limitations in the number of experiments carried out, the
results should not be interpreted as definitive. Instead the results show
an indication of the potential performance aspects of the various models
and encoding techniques. More testing and experimentation is encour-
aged and required; suggestions are given in the next section.
Bibliography
[20] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, 2nd ed., ser. Springer Series in Statistics. Springer, 2009.
[31] D. Masters and C. Luschi, “Revisiting Small Batch Training for Deep
Neural Networks,” ArXiv e-prints, Apr. 2018.
[32] R. Wang, B. Fu, G. Fu, and M. Wang, “Deep & cross network
for ad click predictions,” CoRR, vol. abs/1708.05123, 2017. [Online].
Available: http://arxiv.org/abs/1708.05123