Behavior Analysis with Machine Learning Using R
Enrique Garcia Ceja
Chapman & Hall/CRC
The R Series
Series Editors
John M. Chambers, Department of Statistics, Stanford University, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, USA
Hadley Wickham, RStudio, Boston, Massachusetts, USA
Contents

List of Figures
Welcome
Preface
B Datasets
B.1 COMPLEX ACTIVITIES
B.2 DEPRESJON
B.3 ELECTROMYOGRAPHY
B.4 FISH TRAJECTORIES
B.5 HAND GESTURES
B.6 HOME TASKS
B.7 HOMICIDE REPORTS
B.8 INDOOR LOCATION
B.9 SHEEP GOATS
B.10 SKELETON ACTIONS
B.11 SMARTPHONE ACTIVITIES
B.12 SMILES
Bibliography
Index
List of Figures

7.3 Two different feature vectors for classifying tired and not tired.
7.4 Example database with 10 transactions.
7.5 Flattening a matrix into a 1D array.
7.6 Encoding 3 accelerometer timeseries as an image.
7.7 Three activities captured with an accelerometer represented as images.
7.8 Four timeseries (top) with their respective RPs (bottom). (Author: Norbert Marwan/Pucicu at German Wikipedia. Source: Wikipedia, CC BY-SA 3.0, https://creativecommons.org/licenses/by-sa/3.0/legalcode).
Welcome
This book was written in R with the bookdown package developed by Yihui Xie. The front cover
and comics were illustrated by Vance Capley.
Website: http://www.enriquegc.com
Preface
Supplemental Material
The supplemental material consists of examples’ code, shiny apps, and
datasets. The source code for the examples and the shiny apps can be
downloaded from https://github.com/enriquegit/behavior-crc-code. In-
structions on how to set up the code, run shiny apps, and get the datasets
are in Appendix A. A reference for all the utilized datasets is in Appendix
B.
Conventions
DATASET names are written in uppercase italics. Functions are referred
to by their name followed by parentheses (omitting their arguments), for
example: myFunction(). Class labels are written in italics and between
single quotes: ‘label1’. The following icons are used to provide additional
contextual information:
The folder icon will appear at the beginning of a section (if applicable)
to indicate which scripts were used for the corresponding examples.
Acknowledgments
I want to thank Ketil Stølen and Robert Kabacoff who reviewed the
book and gave me valuable suggestions.
I want to thank Michael Riegler, Darlene E., Jaime Mondragon and Ariana,
Viviana M., Linda Sicilia, Ania Aguirre, Anton Aguilar, Gagan Chhabra,
Aleksander Karlsen, 刘爽, Ragnhild Halvorsrud, Tine Nordgreen, Petter
Jakobsen, Jim Tørresen, my former master’s and PhD advisor Ramon
F. Brena, and my former colleagues at the University of Oslo and SINTEF.
I want to thank Vance Capley who brought to life the front cover and
comic illustrations, Francescoozzimo who drew the comic for chapter 10,
and Katia Liuntova who animated the online front cover. The examples
in this book rely heavily on datasets. I want to thank all the people that
made all their datasets used here publicly available. Thanks to Yihui Xie
who developed the bookdown R package with which this book was written.
Thanks to Rob Calver, Vaishali Singh, and the CRC Press team who
helped me during the publishing process.
I want to thank all the music bands I listened to during my writing-
breaks: Lionheart, Neaera, Hatebreed, Sworn Enemy, Killswitch Engage,
As I Lay Dying, Lamb of God, Himsa, Slipknot, Madball, Fleshgod Apoc-
alypse, Bleeding Through, Caliban, Chimaira, Heaven Shall Burn, Dark-
est Hour, Demon Hunter, Frente de Ira, Desarmador, Después del Odio,
Gatomadre, Rey Chocolate, ill niño, Soulfly, Walls of Jericho, Arrecife,
Corcholata, Amon Amarth, Abinchova, Fit for a King, Annisokay, Sylo-
sis, Meshuggah.
1 Introduction to Behavior and Machine Learning
In recent years, machine learning has emerged as one of the key technologies
that enable and support many of the services and products that
we use in our everyday lives, and it continues to expand quickly. Machine learning
has also helped to accelerate research and development in almost every
field including natural sciences, engineering, social sciences, medicine,
art and culture. Even though all those fields (and their respective sub-
fields) are very diverse, most of them have something in common: They
involve living organisms (cells, microbes, plants, humans, animals, etc.)
and living organisms express behaviors. This book teaches you machine
learning and data-driven methods to analyze different types of behav-
iors. Some of those methods include supervised, unsupervised, and deep
learning. You will also learn how to explore, encode, preprocess, and
visualize behavioral data. While the examples in this book focus on be-
havior analysis, the methods and techniques can be applied in any other
context.
This chapter starts by introducing the concepts of behavior and machine
learning. Next, basic machine learning terminology is presented and you
will build your first classification and regression models. Then, you will
learn how to evaluate the performance of your models and important
concepts such as underfitting, overfitting, bias, and variance.
The definitions are similar and both include humans and animals. Fol-
lowing those definitions, this book will focus on the automatic analysis of
human and animal behaviors; however, the methods can also be applied
to robots and to a wide variety of problems in different domains. There
are three main reasons why one may want to analyze behaviors in an
automatic manner:
This simple rule should work well and will do the job. Imagine that now
your boss tells you that the system needs to recognize green apples as
well. Our previous rule will no longer work, and we will need to include
additional rules and thresholds. On the other hand, a machine learning
algorithm will automatically learn such rules based on the updated data.
So, you only need to update your data with examples of green apples
and “click” the re-train button!
The result of learning is knowledge that the system can use to solve new
instances of a problem. In this case, when you show a new image to the
system, it should be able to recognize the type of fruit. Figure 1.3 shows
this general idea.
FIGURE 1.3 Overall Machine Learning phases. The ‘?’ represents the
new unknown object for which we want to obtain a prediction using the
learned model.
Not every machine learning method needs the expected output or la-
bels (more on this in the Taxonomy section 1.3).
From the figure, four main types of machine learning methods can be
observed:
𝑓 (𝑥) = 𝑦 (1.1)
where 𝑓 is a function that maps some input data 𝑥 (for example im-
ages) to an output 𝑦 (types of fruits). Usually, an algorithm will try
to learn the best model 𝑓 given some data consisting of 𝑛 pairs (𝑥, 𝑦)
of examples. During learning, the algorithm has access to the expected
output/label 𝑦 for each input 𝑥. At inference time, that is, when we want
to make predictions for new examples, we can use the learned model 𝑓
and feed it with a new input 𝑥 to obtain the corresponding predicted
value 𝑦.
1.4 Terminology
This section introduces some basic terminology that will be helpful for
the rest of the book.
1.4.1 Tables
Since data is the most important ingredient in machine learning, let’s
start with some related terms. First, data needs to be stored/structured
so it can be easily manipulated and processed. Most of the time, datasets
will be stored as tables or in R terminology, data frames. Figure 1.5 shows
the classic mtcars dataset2 stored in a data frame.
The columns represent variables and the rows represent examples also
known as instances or data points. In this table, there are 5 variables
mpg, cyl, disp, hp and the model (the first column). In this example, the
first column does not have a name, but it is still a variable. Each row
represents a specific car model with its values per variable. In machine
learning terminology, rows are more commonly called instances whereas
in statistics they are often called data points or observations. Here, those
terms will be used interchangeably.
Figure 1.6 shows a data frame for the iris dataset which consists of
different kinds of plants [Fisher, 1936]. Suppose that we are interested in
predicting the Species based on the other variables. In machine learning
terminology, the variable of interest (the one that depends on the others)
is called the class or label for classification problems. For regression, it
is often referred to as y. In statistics, it is more commonly known as the
response, dependent, or y variable, for both classification and regression.
In machine learning terminology, the rest of the variables are called fea-
tures or attributes. In statistics, they are called predictors, independent
2 mtcars dataset: https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/mtcars.html, extracted from the 1974 Motor Trend US magazine.
variables, or just X. From the context, most of the time it should be easy
to identify dependent from independent variables regardless of the used
terminology. The word feature vector is also very common in machine
learning. A feature vector is just a structure containing the features of
a given instance. For example, the features of the first instance in Fig-
ure 1.6 can be stored as a feature vector [5.4, 3.9, 1.3, 0.4] of size 4. In a
programming language, this can be implemented with an array.
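In R, for instance, a feature vector can simply be stored as a numeric vector. The following is a small sketch using the built-in iris data frame (the row ordering of the built-in data may differ from Figure 1.6):

# Feature vector (size 4) of the first instance in iris.
x <- as.numeric(iris[1, 1:4])
# Its corresponding class label.
y <- iris$Species[1]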
common steps. It all starts with the data collection. Then the data ex-
ploration and so on, until the results are presented. These steps can be
followed in sequence, but you can always jump from one step to another
one. In fact, most of the time you will end up using an iterative approach
by going from one step to the other (forward or backward) as needed.
The big gray box at the bottom means that machine learning methods
can be used in all those steps and not just during training or evaluation.
For example, one may use dimensionality reduction methods in the data
exploration phase to plot the data or classification/regression methods
in the cleaning phase to impute missing values. Now, let’s give a brief
description of each of those phases:
to the normal class but just a small proportion will be of type ‘illegal
transaction’. In this case, we may want to do some preprocessing to
try to balance the dataset. Some models are also sensitive to feature-
scale differences. For example, a variable weight could be in kilograms
but another variable height in centimeters. Before training a predictive
model, the data needs to be prepared in such a way that the models
can get the most out of it. Chapter 5 will present some common pre-
processing steps.
• Training and evaluation. Once the data is preprocessed, we can pro-
ceed to train the models. Furthermore, we also need ways to evaluate
their generalization performance on new unseen instances. The purpose
of this phase is to try and fine-tune different models to find the one
that performs the best. Later in this chapter, some model evaluation
techniques will be introduced.
• Interpretation and presentation of results. The purpose of this
phase is to analyze and interpret the models’ results. We can use per-
formance metrics derived from the evaluation phase to make informed
decisions. We may also want to understand how the models work in-
ternally and how the predictions are derived.
Then, the train set is used to train (fit) a model, and the test set to
evaluate how well that model performs on new data. The performance
can be measured using performance metrics such as the accuracy for
classification problems. The accuracy is the percent of correctly classified
instances.
simple_model.R
This table has 2 variables: speed and class. The first one is a numeric
variable. The second one is a categorical variable. In this case, it can
take two possible values: ‘tiger’ or ‘leopard’.
The output tells us that the data frame has 100 rows and 2 columns.
Now we may be interested to know how many of those correspond to
tigers. We can use the table() function to get that information.
Here we see that 50 instances are of type ‘leopard’ and also that 50 in-
stances are of type ‘tiger’. In fact, this is how the dataset was intention-
ally generated. The next thing we can do is to compute some summary
statistics for each column. R already provides a very convenient function
for that purpose. Yes, it is the summary() function.
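As a sketch (assuming, as in the rest of this example, that the data frame is called dataset with columns speed and class):

# Count how many instances there are per class.
table(dataset$class)
# Summary statistics for each column.
summary(dataset)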
FIGURE 1.11 Feline speeds with vertical dashed lines at the means.
𝑀 = {𝜇1 , … , 𝜇𝑛 } (1.2)
Those centrality measures (the class means in this particular case) are
called the parameters of the model. Training a model consists of finding
those optimal parameters that will allow us to achieve the best perfor-
mance on new instances that were not part of the training data. In most
cases, we will need an algorithm to find those parameters. In our ex-
ample, the algorithm consists of simply computing the mean speed for
each class. That is, for each class, sum all the corresponding speeds and
divide them by the number of data points that belong to that class.
Once those parameters are found, we can start making predictions on
new data points. This is called inference or prediction. In this case, when
a new data point arrives, we can predict its class by computing its dis-
tance to each of the 𝑛 centrality measures in 𝑀 and return the class of
the closest one.
return(params)
}
The first argument is the training data and the second argument is the
centrality function we want to use (the mean, by default). This func-
tion iterates each class, computes the centrality measure based on the
speed, and stores the results in a named array called params which is then
returned at the end.
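Putting that description together, a minimal sketch of simple.model.train() could look as follows (the full code is in simple_model.R and may differ slightly):

simple.model.train <- function(data, centrality = mean){
  # Unique class labels.
  classes <- unique(data$class)
  params <- NULL
  for(c in classes){
    # Centrality measure (the mean, by default) of the speed for this class.
    m <- centrality(data$speed[data$class == c])
    params <- c(params, m)
  }
  # Name each parameter with its corresponding class.
  names(params) <- classes
  return(params)
}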
Most of the time, training a model involves feeding it with the training
data and any additional hyperparameters specific to each model.
Now that we have a function that performs the training, we need an-
other one that performs the actual inference or prediction on new data
points. Let’s call this one simple.classifier.predict(). Its first argument
is a data frame with the instances we want to get predictions for. The
second argument is the named vector of parameters learned during train-
ing. This function will return an array with the predicted class for each
instance in newdata.
return(predictions)
}
This function iterates through each row and computes the distance to
each centrality measure and returns the name of the class that was the
closest one. The distance computation is done with the following line of
code:
First, it computes the absolute difference between the speed and each
centrality measure stored in params and then, it returns the class name of
the minimum one. Now that we have defined the training and prediction
procedures, we are ready to test our classifier!
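A sketch of simple.classifier.predict() consistent with that description (again, the full code is in simple_model.R):

simple.classifier.predict <- function(newdata, params){
  predictions <- NULL
  # Iterate through each instance in newdata.
  for(i in 1:nrow(newdata)){
    # Class whose centrality measure is closest to this instance's speed.
    pred <- names(which.min(abs(params - newdata$speed[i])))
    predictions <- c(predictions, pred)
  }
  return(predictions)
}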
In section 1.6, two evaluation methods were presented. Hold-out and
k-fold cross-validation. These methods allow you to estimate how your
model will perform on new data. Let’s start with hold-out validation.
First, we need to split the data into two independent sets. We will use
70% of the data to train our classifier and the remaining 30% to test it.
The following code splits dataset into a trainset and testset.
The last argument replace is set to FALSE because we do not want re-
peated numbers. This ensures that any instance only belongs to either
the train or the test set. We don’t want an instance to be copied
into both sets.
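A sketch of this split (the seed value is arbitrary and idxs is just an illustrative name):

set.seed(1234)
# Randomly pick 70% of the row indices for the train set.
idxs <- sample(nrow(dataset), size = round(nrow(dataset) * 0.7), replace = FALSE)
trainset <- dataset[idxs,]
testset <- dataset[-idxs,]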
Now it’s time to test our functions. We can train our model using the
trainset by calling our previously defined function simple.model.train().
Our predict function returns predictions for each instance in the test set.
We can use the head() function to print the first predictions. The first
two instances were classified as tigers, the third one as leopard, and so
on.
But how good are those predictions? Since we know what the true classes
are (also known as ground truth) in our test set, we can compute
the performance. In this case, we will compute the accuracy, which is
the percentage of correct classifications. Note that we did not use the
class information when making predictions, we only used the speed. We
pretended that we didn’t have the true class. We will use the true class
only to evaluate the model’s performance.
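Assuming predictions holds the predicted classes for the test set, the test accuracy can be computed with something like:

# Percentage of correctly classified instances in the test set.
sum(predictions == testset$class) / nrow(testset)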
The train accuracy was 85.7%. As expected, this was higher than the
test accuracy. Typically, what you report is the performance on the test
set, but we can use the performance on the train set to look for signs of
over/under-fitting which will be covered in the following sections.
# Number of folds.
k <- 5
set.seed(123)
Again, we can use the sample() function. This time we want to select
random integers between 1 and 𝑘. The total number of integers will be
equal to the total number of instances 𝑛 in the entire dataset. Note that
this time we set replace = TRUE since 𝑘 < 𝑛, so this implies that we need
to pick repeated numbers. Each number will represent the fold to which
each instance belongs to. As before, we need to make sure that each
instance belongs only to one of the sets. Here, we are guaranteeing that
by assigning each instance a single fold number. We can use the table()
function to print how many instances ended up in each fold. Here, we
see that the folds will contain between 17 and 23 instances.
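A sketch of this fold assignment:

# Assign each of the n instances to one of the k folds.
folds <- sample(1:k, size = nrow(dataset), replace = TRUE)
table(folds)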
𝑘-fold cross-validation consists of iterating 𝑘 times. In each iteration, one
of the folds is selected as the test set and the remaining folds are used
to build the train set. Within each iteration, the model is trained with
the train set and evaluated with the test set. At the end, the average
accuracy across folds is reported.
accuracies <- NULL
for(i in 1:k){
  testset <- dataset[which(folds == i),]
  trainset <- dataset[which(folds != i),]
  params <- simple.model.train(trainset, mean)
  predictions <- simple.classifier.predict(testset, params)
  accuracies <- c(accuracies, sum(predictions == testset$class) / nrow(testset))
}
The test mean accuracy across the 5 folds was ≈ 83% which is very
similar to the accuracy estimated by hold-out validation.
Note that in section 1.6 a validation set was also mentioned. This
one is useful when you want to fine-tune a model and/or try dif-
ferent preprocessing methods on your data. In case you are using
hold-out validation, you may want to split your data into three sets:
train/validation/test sets. So, you train your model using the train
set and estimate its performance using the validation set. Then you
can fine-tune your model. For example, here, instead of the mean as
centrality measure, you can try to use the median and measure the
performance again with the validation set. When you are pleased with
your settings, you estimate the final performance of the model with
the test set only once.
In the case of 𝑘-fold cross-validation, you can set aside a test set at
the beginning. Then you use the remaining data to perform cross-
validation and fine-tune your model. Within each iteration, you test
the performance with the validation data. Once you are sure you are
not going to do any parameter tuning, you can train a model with the
train and validation sets and test the generalization performance using
the test set.
simple_model.R
return(predictions)
}
print(params)
#> tiger leopard
#> 48.88246 54.5836
The MAE on the test set was 2.56. That is, on average, our simple model
had a deviation of 2.56 km/hr with respect to the true values, which is
not bad. We can also compute the MAE on the train set.
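The MAE is just the mean of the absolute differences between the predicted and the true speeds. Assuming predictions.train (a hypothetical name) holds the predicted speeds for the train set:

# Mean absolute error on the train set.
mean(abs(predictions.train - trainset$speed))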
The MAE on the train set was 2.16, which is better than the test set MAE
(small MAE values are preferred). Now, you have built, trained, and
evaluated a regression model!
This was a simple example, but it illustrates the basic idea of regression
and how it differs from classification. It also shows how the performance
of regression models is typically evaluated with the MAE as opposed to
the accuracy used in classification. In chapter 8, more advanced methods
such as neural networks will be introduced, which can be used to solve
regression problems.
In this section, we have gone through several of the data analysis pipeline
phases. We did a simple exploratory analysis of the data and then we
built, trained, and validated the models to perform both classification
and regression. Finally, we estimated the overall performance of the mod-
els and presented the results. Here, we coded our models from scratch,
but in practice, you typically use models that have already been imple-
mented and tested. All in all, I hope these examples have given you the
feeling of what it is like to work with machine learning.
the position of the line that reduces the classification error is searched
for. We implicitly estimated the position of the line by finding the mean
values for each of the classes.
Now, imagine that we do not only have access to the speed but also to the
felines’ age. This extra information could help us reduce the prediction
error since age plays an important role in how fast a feline is. Figure 1.14
(left) shows how it will look like if we plot age in the x-axis and speed in
the y-axis. Here, we can see that for both, tigers and leopards, the speed
seems to increase as age increases. Then, at some point, as age increases
the speed begins to decrease.
Constructing a classifier with a single vertical line as we did before will
not work in this 2-dimensional case where we have 2 predictors. Now we
will need a more complex decision boundary (function) to separate the
two classes. One approach would be to use a line as before but this time
we allow the line to have a slope (angle). Everything below the line is
classified as ‘tiger’ and everything else as ‘leopard’. Thus, the learning
phase involves finding the line’s position and its slope that achieves the
smallest error.
Figure 1.14 (left) shows a possible decision line. Even though this func-
tion is more complex than a vertical line, it will still produce a lot of
misclassifications (it does not clearly separate both classes). This is called
underfitting, that is, the model is so simple that it is not able to capture
the underlying data patterns.
Let’s try a more complex function, for example, a curve. Figure 1.14
(middle) shows that a curve does a better job at separating the two
In this example, we saw how underfitting and overfitting can affect the
generalization performance of a model in a classification setting but the
same can occur in regression problems.
There are several methods that aim to reduce overfitting, but many of
them are specific to the type of model. For example, with decision trees
(covered in chapter 2), one way to reduce overfitting is to limit their
depth or build ensembles of trees (chapter 3). Neural networks are also
highly prone to overfitting since they can be very complex and have
millions of parameters. In chapter 8, several techniques to reduce the
effect of overfitting will be presented.
Bias. Imagine that an infinite (or very large) number of train sets can be generated and for
each, a predictive model is trained. Then we average the predictions of
all those models and see how much that average differs from the true
value.
Variance. How much the predictions change for a given data point when
training a model using a different train set each time.
Bias and variance are closely related to underfitting and overfitting. High
variance is a sign of overfitting. That is, a model is so complex that it
will fit a particular train set very well. Every time it is trained with a
different train set, the train error will be low, but it will likely generate
very different predictions for the same test points and a much higher test
error.
Figure 1.16 illustrates the relation between overfitting and high variance
with a regression problem.
set. The simpler model does not fit the train data so well but has a
smaller Δ and a lower error on the test point as well. Visually, the
function (red curve) of the complex model also varies a lot across train
sets whereas the shapes of the simpler model functions look very similar.
On the other hand, if a model is too simple, it will underfit causing highly
biased results without being able to capture the input-output relation-
ships. This results in a high train error and in consequence, a high test
error as well.
1.11 Summary
In this chapter, several introductory machine learning concepts and
terms were introduced and they are the basis for the methods that will
be covered in the following chapters.
• Behavior can be defined as “an observable activity in a human or
animal”.
• Three main reasons why we may want to analyze behavior automat-
ically were discussed: react, understand, and document/archive.
• One way to observe behavior automatically is through the use of sen-
sors and/or data.
• Machine Learning consists of a set of computational algorithms that
automatically find useful patterns and relationships from data.
• The three main building blocks of machine learning are: data, algo-
rithms, and models.
• The main types of machine learning are supervised learning, semi-
supervised learning, partially-supervised learning, and unsu-
pervised learning.
• In R, data is usually stored in data frames. Data frames have variables
(columns) and instances (rows). Depending on the task, variables can
be independent or dependent.
• A predictive model is a model that takes some input and produces
an output. Classifiers and regressors are predictive models.
• A data analysis pipeline consists of several tasks including data collec-
tion, cleaning, preprocessing, training/evaluation, and presentation of
results.
• Model evaluation can be performed with hold-out validation or k-
fold cross-validation.
• Overfitting occurs when a model ‘memorizes’ the training data in-
stead of finding useful underlying patterns.
• The test error can be decomposed into noise, bias, and variance.
2 Predicting Behavior with Classification Models
1. Compute the distance between the query instance and all train-
ing instances.
2. Return the most common class label among the k nearest train-
ing instances (neighbors).
training time! The training phase consists only of storing the training in-
stances so they can be compared to the query instance at prediction time.
The hyper-parameter k is usually specified by the user and depends on
each application. We also need to specify a distance function that returns
small distances for similar instances and big distances for very dissimilar
instances. For numeric features, the Euclidean distance is one of the
most commonly used distance functions. The Euclidean distance between
two points can be computed as follows:
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}    (2.1)
where 𝑝 and 𝑞 are 𝑛-dimensional feature vectors and 𝑖 is the index to the
vectors’ elements. Figure 2.1 shows the idea graphically (adapted from
the 𝑘-NN article1 in Wikipedia). The query instance is depicted with the
‘?’ symbol. If we choose 𝑘 = 3 (represented by the inner dashed circle)
the predicted class is ‘square’ because there are two squares but only
one circle. If 𝑘 = 5 (outer dotted circle), the predicted class is ‘circle’.
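In R, the Euclidean distance between two feature vectors can be computed with a one-line function (the function name here is only illustrative):

euclidean.distance <- function(p, q){
  sqrt(sum((p - q)^2))
}
euclidean.distance(c(1, 2, 3), c(2, 2, 5))
#> [1] 2.236068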
Typical values for 𝑘 are small odd numbers like 1, 3, 5. The 𝑘-NN
algorithm can also be used for regression with a small modification:
1 https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
indoor_classification.R indoor_auxiliary.R
You might already have experienced some troubles with geolocation ser-
vices when you are inside a building. Part of this is because GPS tech-
nologies do not provide good indoors-accuracy due to several sources
of interference. For some applications, it would be beneficial to have
accurate location estimations inside buildings even at room-level. For
example, in domotics and localization services in big public places like
airports or shopping malls. Having good indoor location estimates can
also be used in behavior analysis such as extracting trajectory patterns.
In this section, we will implement 𝑘-NN to perform indoor location in a
building based on Wi-Fi signals. For instance, we can use a smartphone
to scan the nearby Wi-Fi access points and based on this information,
determine our location at room-level. This can be formulated as a clas-
sification problem: Given a set of Wi-Fi signals as input, predict the
location where the device is located.
For this classification problem, we will use the INDOOR LOCATION
dataset (see Appendix B) which was collected with an Android smart-
phone. The smartphone application scans the nearby access points and
stores their information and label. The label is provided by the user
and represents the room where the device is located. Several instances
for every location were recorded. To generate each instance, the device
scans and records the MAC address and signal strength of the nearby
access points. A delay of 500 ms is set between scans. For each location,
approximately 3 minutes of data were collected while the user walked
in the specific room. Figure 2.2 depicts the layout of the building where
the data was collected. The data has four different locations: ‘bedroomA’,
‘bedroomB’, ‘tvroom’, and the ‘lobby’. The lobby (not shown in the lay-
out) is at the same level as bedroom A but on the first floor.
Table 2.1 shows the first rows of the dataset. The first column is the
class. The scanid column is a unique identifier for the given Wi-Fi scan
(instance). To preserve privacy, MAC addresses were converted into in-
teger values. Every instance is composed of several rows. For example,
the first instance with scanid=1 has two rows (one row per mac address).
Intuitively, the same location should have similar MAC addresses across
scans. From the table, we can see that at bedroomA access points with
MAC address 1 and 2 are usually found by the device.
Since each instance is composed of several rows, we will convert our
data frame into a list of lists where each inner list represents a single
instance with the class (locationId), a unique id, and a data frame with
the corresponding access points. The example code can be found in the
script indoor_classification.R.
First, we read the dataset from the csv file and store it in the data frame
df. To make things easier, the data frame is converted into a list of lists
using the auxiliary function wifiScansToList() which is defined in the
script indoor_auxiliary.R. Next, we print the number of instances in the
dataset, that is, the number of lists. The dataset contains 365 instances.
The 365 is just a coincidence; the data was not collected every day
during a year but on a single day. Next, we extract the first instance
with dataset[[1]]. Here, we see that each instance has three pieces of
information. The class (locationId), a unique id (scanId), and a set of
access points stored in a data frame. The first instance has two access
points with MAC addresses 1 and 2. There is also information about the
signal strength, though it will not be used here.
Since we would expect that similar locations have similar MAC addresses
and locations that are far away from each other have different MAC ad-
dresses, we need a distance measure that captures this notion of similar-
ity. In this case, we cannot use the Euclidean distance on MAC addresses.
Even though they were encoded as integer values, they do not represent
magnitudes but unique identifiers. Each instance is composed of a set
of 𝑛 MAC addresses stored in the accessPoints data frame. To compute
the distance between two instances (two sets) we can use the Jaccard
distance. This distance is based on element sets:
j(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}    (2.2)
𝑆1 = {𝑎, 𝑏, 𝑐, 𝑑, 𝑒}
𝑆2 = {𝑒, 𝑓, 𝑔, 𝑎}
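The union of these two sets has 7 elements and their intersection has 2 elements (a and e), so their Jaccard distance is (7 − 2)/7 ≈ 0.71. A minimal version of the jaccardDistance() function (the one actually used below is defined in indoor_auxiliary.R) could be:

jaccardDistance <- function(A, B){
  u <- length(union(A, B))
  i <- length(intersect(A, B))
  return((u - i) / u)
}
jaccardDistance(c("a","b","c","d","e"), c("e","f","g","a"))
#> [1] 0.7142857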
Now let’s try to compute the distance between instances with different
classes.
The distance between instances of the same class was 0.33 whereas the
distance between instances of the different classes was 0.66. So, our func-
tion is working as expected.
In the extreme case when the sets 𝐴 and 𝐵 are identical, the distance
will be 0. When there are no common elements in the sets, the distance
will be 1. Armed with this distance metric, we can now implement the
𝑘-NN function in R. The knn_classifier() implementation is in the script
indoor_auxiliary.R. Its first argument is the dataset (the list of instances).
The second argument, k, is the number of nearest neighbors to use, and
the last two arguments are the indices of the train and test instances,
respectively. These indices are pointers to the elements in the dataset
variable.
knn_classifier <- function(dataset, k, trainSetIndices, testSetIndices){
  groundTruth <- NULL
  predictions <- NULL
  for(queryInstance in testSetIndices){
    distancesToQuery <- NULL
    for(trainInstance in trainSetIndices){
      jd <- jaccardDistance(dataset[[queryInstance]]$accessPoints$mac,
                            dataset[[trainInstance]]$accessPoints$mac)
      distancesToQuery <- c(distancesToQuery, jd)
    }
    # Indices (within trainSetIndices) of the k nearest neighbors.
    nnIndices <- order(distancesToQuery)[1:k]
    # Classes of the k nearest neighbors; the most common one is the prediction.
    nnClasses <- sapply(trainSetIndices[nnIndices], function(idx) dataset[[idx]]$locationId)
    predictions <- c(predictions, Mode(nnClasses))
    groundTruth <- c(groundTruth, dataset[[queryInstance]]$locationId)
  }
  return(list(predictions = predictions,
              groundTruth = groundTruth))
}
For each instance queryInstance in the test set, the knn_classifier() com-
putes its Jaccard distance to every instance in the train set and
stores those distances in distancesToQuery. Then, those distances are
sorted in ascending order and the most common class among the first 𝑘
elements is returned as the predicted class. The function Mode() returns
the most common element. Finally, knn_classifier() returns a list with
the predictions for every instance in the test set and their respective
ground truth class for evaluation.
Now, we can try our classifier. We will use 70% of the dataset as train
set and the remaining as the test set.
The function knn_classifier() predicts the class for each test set instance
and returns a list with their predictions and their ground truth classes.
With this information, we can compute the accuracy on the test set
which is the percentage of correctly classified instances. In this example,
we set 𝑘 = 3.
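A sketch of this experiment (the seed value is arbitrary; dataset, result, and the index variables follow the names used above):

set.seed(1234)
# 70% of the instances for training, the rest for testing.
trainSetIndices <- sample(length(dataset), size = round(length(dataset) * 0.7))
testSetIndices <- setdiff(1:length(dataset), trainSetIndices)
result <- knn_classifier(dataset, k = 3, trainSetIndices, testSetIndices)
# Accuracy on the test set.
sum(result$predictions == result$groundTruth) / length(result$groundTruth)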
library(caret)
cm <- confusionMatrix(factor(result$predictions),
factor(result$groundTruth))
cm$table # Access the confusion matrix.
#> Reference
#> Prediction bedroomA bedroomB lobby tvroom
#> bedroomA 26 0 3 1
#> bedroomB 0 17 0 1
#> lobby 0 1 28 0
#> tvroom 0 0 0 33
The columns of the confusion matrix represent the true classes and the
rows the predictions. For example, from the total 31 instances of type
‘lobby’, 28 were correctly classified as ‘lobby’ while 3 were misclassified
as ‘bedroomA’. Something I find useful is to plot the confusion matrix as
proportions instead of counts (Figure 2.3). From this confusion matrix
we see that for the class ‘bedroomB’, 94% of the instances were correctly
classified while 6% were mislabeled as ‘lobby’. On the other hand, in-
stances of type ‘bedroomA’ were always classified correctly.
A confusion matrix is a good way to analyze the classification results
per class and it helps to spot weaknesses which can be used to improve
the model, for example, by extracting additional features.
Recall: The proportion of positives classified as such. It is also called the true positive rate or sensitivity.

recall = \frac{TP}{P}    (2.4)

Specificity: The proportion of negatives classified as such. It is also called the true negative rate.

specificity = \frac{TN}{N}    (2.5)

Precision: The fraction of true positives among those classified as positives. Also known as the positive predictive value.

precision = \frac{TP}{TP + FP}    (2.6)

F1-score: This is the harmonic mean of precision and recall.

F1\text{-}score = 2 \cdot \frac{precision \cdot recall}{precision + recall}    (2.7)
The mean of the metrics across all classes can be computed by taking
the mean for each column of the returned object:
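With caret, for example, the per-class metrics are stored in cm$byClass (one row per class), so a sketch of this computation is:

# Average each performance metric across all classes.
colMeans(cm$byClass)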
values (ground truth). For example, the first element in the list is a P
and it was correctly classified as a P. The eighth element is a P but it
was misclassified as N. The associated confusion matrix for these ground
truth and predicted classes is shown at the bottom.
There are 7 true positives and 3 true negatives. In total, 10 instances
were correctly classified (TP and TN) and 5 were misclassified (FP and
FN). From this matrix we can calculate the total number of positives (P)
by taking the sum of the first column, 10 in this case. The total number
of negatives (N) is obtained by summing the second column, 5 in this
case. Having this information we can compute any of
the previous performance metrics: accuracy, recall, specificity, precision,
and F1-score.
Be aware that there is no standard that defines whether the true classes
or the predicted classes go in the rows or columns, thus, you need to
check for this every time you encounter a new confusion matrix.
As shown in the example, decision trees are easy to interpret and the
final result can be explained by just following the path. Now let’s see
how these decision trees are learned from data. Consider the following
artificial concert dataset (Figure 2.7).
The first four variables are features and the last column is the class.
The class is the decision whether or not we should go to a music concert
based on the other variables. In this case, all variables are binary except
Price which has three possible values: low, medium, and high.
• Tired: Indicates whether the person is tired or not.
• Rain: Whether it is raining or not.
• Metal: Indicates whether this is a heavy metal concert or not.
• Price: Ticket price.
• Go: The decision of whether to go to the music concert or not.
The main question when building a tree is which feature should be at
the root (top). Once you answer this question, you may need to grow
the tree by adding another feature (node) as one of the root’s children.
To decide which new feature to add you need to answer the same first
question: “What feature should be at the root of this subtree?”. This
is a recursive definition! The tree keeps growing until you reach a leaf
node, there are no more features to select from, or you have reached a
predefined maximum depth.
For the concert dataset we need to find which is the best variable to be
placed at the root. Let’s suppose we need to choose between Price and
Metal. Figure 2.8 shows these two possibilities.
If we select Price, there are three possible subnodes, one for each value:
low, medium, and high. If Price is low then four instances fall into this
subtree (the first four from the table). For all of them, the value of Go
is 1. If Price is high, two instances fall into this category and their Go
FIGURE 2.8 Two example trees with one variable split by Price (left)
and Metal (right).
value is 0, thus if the price is high then you should not go to the concert
according to this data. There are six instances for which the Price value
is medium. From those, two of them have Go=1 and the remaining four
have Go=0. For cases when the price is low or high we can arrive at a
solution. If the price is low then go to the concert, if the price is high
then do not go. However, if the price is medium it is still not clear what
to do since this subnode is not pure. That is, the labels of the instances
are mixed: two with an output of 1 and four with an output of 0. In this
case we can try to use another feature to decide and grow the tree but
first, let’s look at what happens if we decide to use Metal as the first
feature at the root. In this case, we end up with two subsets with six
instances each. And for each subnode, it is still not clear what decision
we should take because the output is ‘mixed’ (Go: 3, NotGo: 3). At this
point we would need to continue growing the tree below each subnode.
Intuitively, it seems like Price is a better feature since its subnodes are
more pure. Then we can use another feature to split the instances whose
Price is medium. For example, using the Metal variable. Figure 2.9 shows
how this would look. Since one of the subnodes of Metal is still not
pure we can further split it using the Rain variable, for example. At this
point, we cannot split any further. Note that the Tired variable was
never used.
So far, we have chosen the root variable based on which one looks more
pure but to automate the process, we need a way to measure this purity
in a quantitative manner. One way to do that is by using the entropy.
Entropy is a measure of uncertainty from information theory. It is 0 when
there is no uncertainty and 1 when there is complete uncertainty. The
entropy of a discrete variable 𝑋 with values 𝑥1 … 𝑥𝑛 and probability
mass function 𝑃 (𝑋) is:
FIGURE 2.9 Tree splitting example. Left: tree splits. Right: High-
lighted instances when splitting by Price and Metal.
H(X) = -\sum_{i=1}^{n} P(x_i) \log P(x_i)    (2.8)
Take for example a fair coin with probability of heads and tails = 0.5
each. The entropy for that coin is:
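H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1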
Since we do not know what the result will be when we flip the coin,
the entropy is maximum. Now consider the extreme case when the coin
is biased such that the probability of heads is 1 and the probability of
tails is 0. The entropy in this case is zero:
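H(X) = -(1 \log_2 1 + 0 \log_2 0) = 0, taking 0 \log 0 to be 0 by convention.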
Thus, we can use this to compute the entropy for the three possible
values of Price with respect to the class. The positives are the instances
where Go=1 and the negatives are the instances where Go=0:
H_{price=low}(4, 0) = -\left(\frac{4}{4+0}\log\frac{4}{4+0} + \frac{0}{4+0}\log\frac{0}{4+0}\right) = 0

H_{price=medium}(2, 4) = -\left(\frac{2}{2+4}\log\frac{2}{2+4} + \frac{4}{2+4}\log\frac{4}{2+4}\right) = 0.918

H_{price=high}(0, 2) = -\left(\frac{0}{0+2}\log\frac{0}{0+2} + \frac{2}{0+2}\log\frac{2}{0+2}\right) = 0
The average of those three can be calculated by taking into account the
number of corresponding instances for each value and the total number
of instances (12):
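\frac{4}{12}(0) + \frac{6}{12}(0.918) + \frac{2}{12}(0) \approx 0.459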
The entropy of the entire dataset, with six positive and six negative instances, is 𝐻(6, 6) = 1.
Now we can compute the information gain for Price. Intuitively, the
information gain tells you how powerful this variable is at dividing the
instances based on their class, that is, how much you are learning:
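infoGain(Price) = H(6, 6) - 0.459 = 1 - 0.459 = 0.541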
Since you want to learn fast, you want your root node to be the one with
the highest information gain. For the rest of the variables the information
gain is:
infoGain(Tired) = 0
infoGain(Rain) = 0.020
infoGain(Metal) = 0
The highest information gain is produced by Price, thus, it is selected as
the root node. Then, the process continues recursively for each branch
but excluding Price. Since branches with values low and high are already
done, we only need to further split medium. Sometimes it is not possible
to have completely pure nodes like with low and high. This can happen
for example, when there are no more attributes left or when two or
more instances have the same feature values but different labels. In those
situations the final prediction is the most common label (majority vote).
There exist many implementations of decision trees. Some implementa-
tions compute variable importance using the entropy (as shown here)
but others use the Gini index, for example. Each implementation also
treats numeric variables in different ways. Pruning the tree using differ-
ent techniques is also common in order to reduce its size.
Some of the most common implementations are C4.5 trees [Quinlan,
2014] and CART [Steinberg and Colla, 2009]. The latter is implemented
in the rpart R package [Therneau and Atkinson, 2019] which will be used
in the following section to build a model that predicts physical activities
from smartphone sensor data.
smartphone_activities.R
Usually, classification models are not trained with the raw data but with
feature vectors extracted from the raw data. Feature vectors have the
advantage of being more compact, thus, making the learning phase more
efficient. For activity recognition, the feature extraction process consists
of defining a moving window of size 𝑤 that starts at position 𝑖. At the
beginning, 𝑖 is the index pointing to the first accelerometer readings.
Then, 𝑛 statistical features are computed on the elements covered by
the window such as mean, standard deviation, 0-crossings, etc. This will
produce an 𝑛-dimensional feature vector and the process is repeated by
moving the window 𝑠 steps forward. Typical values of 𝑠 are such that
the overlap between the previous window position and the next one is
2 http://www.cis.fordham.edu/wisdm/dataset.php
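As a rough sketch (not the exact code used by the dataset's authors; function and column names are assumptions), window-based feature extraction over a single accelerometer axis could look like this:

# acc: data frame with one accelerometer axis in column x.
# w: window size in samples; s: step between consecutive window starts.
extract.features <- function(acc, w = 200, s = 200){
  features <- NULL
  starts <- seq(1, nrow(acc) - w + 1, by = s)
  for(i in starts){
    window <- acc$x[i:(i + w - 1)]
    # Two example features per window: mean and standard deviation.
    features <- rbind(features, c(meanX = mean(window), sdX = sd(window)))
  }
  return(as.data.frame(features))
}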
Once we have the set of feature vectors and their associated class labels,
we can use them to train a classifier and make predictions on new data
(Figure 2.12).
FIGURE 2.12 The extracted feature vectors are used to train a classi-
fier.
For this example, we will use the file with features already extracted. The
authors used windows of 10 seconds which is equivalent to 200 observa-
tions given the 20 Hz sampling rate and they used 0% overlap. From
each window, they extracted 43 features such as the mean, standard
deviation, absolute deviations, etc.
Let’s read and print the first rows of the dataset. The script for this
section is smartphone_activities.R. The data frame has several columns,
but we only print the first five features and the class which is stored in
the last column.
# Read data.
df <- read.csv(datapath,stringsAsFactors = F)
#> X0 X1 X2 X3 X4 class
#> 1 0.04 0.09 0.14 0.12 0.11 Jogging
#> 2 0.12 0.12 0.06 0.07 0.11 Jogging
#> 3 0.14 0.09 0.11 0.09 0.09 Jogging
#> 4 0.06 0.10 0.09 0.09 0.11 Walking
#> 5 0.12 0.11 0.10 0.08 0.10 Walking
#> 6 0.09 0.09 0.10 0.12 0.08 Walking
#> 7 0.12 0.12 0.12 0.13 0.15 Upstairs
#> 8 0.10 0.10 0.10 0.10 0.11 Upstairs
#> 9 0.08 0.07 0.08 0.08 0.05 Upstairs
Our aim is to predict the class based on all the numeric features. We will
use the rpart package [Therneau and Atkinson, 2019] which implements
classification and regression trees. We will assess the performance of
the decision tree with 10-fold cross-validation. We can use the sample()
function to generate the folds. This function will sample 𝑛 integers from
1 to 𝑘 where 𝑛 is the number of rows in the data frame.
# Generate folds.
k <- 10
folds <- sample(1:k, size = nrow(df), replace = TRUE)
The folds variable stores the fold each instance belongs to. For example,
the first instance belongs to fold 10, the second instance belongs to fold
6, and so on. We can now generate our test and train sets. We will
iterate 𝑘 = 10 times. For each iteration 𝑖, the test set is built using the
instances that belong to fold 𝑖 and the train set will be composed of the
remaining instances (those that do not belong to fold 𝑖). Next, the rpart()
function is used to train the decision tree with the train set. By default,
rpart() performs 10-fold cross-validation internally. To avoid this, we set
the parameter xval = 0. Then, we can use the trained model to obtain
the predictions on the test set with the generic predict() function. The
ground truth classes and the predictions are stored so the performance
metrics can be computed.
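A sketch of this procedure (treeClassifier, trainSet, predictions, groundTruth, df, folds, and k are the names used in this example; testSet and foldPredictions are assumptions; the full code is in smartphone_activities.R):

library(rpart)
# Make sure the class is a factor so rpart builds a classification tree.
df$class <- as.factor(df$class)
predictions <- NULL
groundTruth <- NULL
for(i in 1:k){
  # Instances in fold i form the test set; the rest form the train set.
  testSet <- df[which(folds == i),]
  trainSet <- df[which(folds != i),]
  # Train the decision tree; xval = 0 disables rpart's internal cross-validation.
  treeClassifier <- rpart(class ~ ., data = trainSet, xval = 0)
  # Predict the classes of the test set instances.
  foldPredictions <- predict(treeClassifier, newdata = testSet, type = "class")
  predictions <- c(predictions, as.character(foldPredictions))
  groundTruth <- c(groundTruth, as.character(testSet$class))
}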
cm <- confusionMatrix(as.factor(predictions),
                      as.factor(groundTruth))
# Print accuracy
cm$overall["Accuracy"]
#> Accuracy
#> 0.7895903
The overall accuracy was 78% and by looking at the individual perfor-
mance metrics, some classes had low scores like ‘walking downstairs’
and ‘walking upstairs’. From the confusion matrix (Figure 2.13), it can
be seen that those two activities were often confused with each other
but also with the ‘walking’ activity. The package rpart.plot [Milborrow,
2019] can be used to plot the resulting tree (Figure 2.14).
library(rpart.plot)
# Plot the tree from the last fold.
rpart.plot(treeClassifier, fallen.leaves = F,
shadow.col = "gray", legend.y = 1)
The ‘Walking’ activity has the highest prior probability (≈ 0.39), that is, it’s the most common activity present in the dataset.
So, if we didn’t have any other information, our best bet would be to
predict the most frequent activity.
# Prior probabilities.
table(trainSet$class) / nrow(trainSet)
#> Downstairs Jogging Sitting Standing Upstairs Walking
#> 0.09882885 0.29607561 0.05506472 0.04705157 0.11793713 0.38504212
These results look promising, but they can still be improved. In the next
chapter, I will show you how to improve these results with Ensemble
Learning which is a method that is used to aggregate many models.
𝑃 (𝐶 = Walking|𝑓1 , … , 𝑓𝑛 ).
This reads as the conditional probability that the class is ‘Walking’ given
the observed evidence. For each instance, the evidence that we can ob-
serve are its features 𝑓1 , … , 𝑓𝑛 . In this dataset, each instance has 39
features. If we want to estimate the most likely class, all we need to do
is to compute the conditional probability for each class and return the
highest one:
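y = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} P(C_k | f_1, \dots, f_n)    (2.10)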
where 𝐾 is the total number of possible classes. The arg max notation
means: Evaluate the right hand expression for every class 𝑘 and return
the 𝑘 that resulted with the maximum probability. If instead of arg max
we had max (without the arg) that would mean to return the actual
maximum probability instead of the class 𝑘.
Now let’s see how we can compute 𝑃 (𝐶𝑘 |𝑓1 , … , 𝑓𝑛 ). To compute a con-
ditional probability we can use Bayes’ rule:
P(H|E) = \frac{P(H)\,P(E|H)}{P(E)}    (2.11)
f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / 2\sigma^2}    (2.12)
Suppose that for some feature 𝑓1 when the class is ‘Walking’, its mean
is 5 and its standard deviation is 3. That is, we filter the train set and
only select those instances with class ‘Walking’ and compute the mean
and standard deviation for feature 𝑓1. Figure 2.15 shows how its pdf
looks.
dnorm(x=1.7, mean = 5, sd = 3)
#> [1] 0.07261739
In Figure 2.16 the solid circle shows the likelihood when 𝑥 = 1.7.
If we have more than one feature we need to compute the likelihood
for each and take their product: 𝑃(𝑓1|𝐶 = Walking) ∗ 𝑃(𝑓2|𝐶 =
Walking) ∗ ⋯ ∗ 𝑃(𝑓𝑛|𝐶 = Walking). Each feature and class pair has its
own 𝜇 and 𝜎 parameters. Thus, Naive Bayes requires learning 𝐾 ∗ 𝐹 ∗ 2
𝐾 is the number of classes, 𝐹 is the number of features, and the 2 stands
for the mean and standard deviation.
We have seen how we can compute 𝑃(𝐶𝑘|𝑓1, … , 𝑓𝑛) using Bayes’ rule
by calculating the prior 𝑃(𝐻) and 𝑃(𝐸|𝐻) which is the product of the
likelihoods for each feature. If we substitute Bayes’ rule (omitting the
denominator) in equation (2.10) we get our Naive Bayes classifier:

y = \operatorname*{arg\,max}_{k \in \{1,\dots,K\}} P(C_k) \prod_{i=1}^{F} P(f_i | C_k)    (2.13)
In the following section we will implement our own Naive Bayes algo-
rithm in R and test it on the SMARTPHONE ACTIVITIES dataset.
Then, we will compare our implementation with that of the well known
e1071 package [Meyer et al., 2019].
Naive Bayes works well with missing values since the features are in-
dependent. At prediction time, if an instance has one or more missing
values, then those features are just ignored and the posterior probabil-
ity is computed based only on the available variables. Another advan-
tage of the feature independence assumption is that feature selection
algorithms run very fast with Naive Bayes. When building a predictive
model, not all features may provide useful information and some fea-
tures may even degrade the performance. Feature selection algorithms
aim to find the best set of features and some of them need to try a
huge number of feature combinations. With Naive Bayes, the param-
eters only need to be learned once and then different combinations of
features can be evaluated by omitting the ones that are not used. With
decision trees, for example, we would need to build entire new trees
every time we want to try different input features.
Here, we have shown how we can use a Gaussian pdf to compute the
likelihood 𝑃 (𝐸|𝐻) when the features are numeric. This assumes that
the features have a normal distribution. However, this is not always
the case. In practice, Naive Bayes can work really well even if that
assumption is not met. Furthermore, nothing prevents us from using
another distribution to estimate the likelihood or even defining a spe-
cific distribution for each feature. For categorical variables, 𝑃 (𝐸|𝐻) is
estimated using the frequencies of the feature values.
naive_bayes.R
three axes of the accelerometer sensor. The following code snippet prints
the first rows of the train set. The RESULTANT feature is in column
39 and the class is the last column (40).
head(trainset[,c(39:40)])
#> RESULTANT class
#> 1004 11.14 Walking
#> 623 1.24 Upstairs
#> 2693 9.90 Standing
#> 934 10.44 Upstairs
#> 4496 10.43 Walking
#> 2948 15.28 Jogging
First, we compute the prior probabilities for each class in the train set
and store them in the variable priors. This corresponds to the 𝑃 (𝐶𝑘 )
part in equation (2.13).
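A sketch of this computation:

# Prior probability of each class in the train set.
priors <- table(trainset$class) / nrow(trainset)
priors["Jogging"]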
This means that 30% of the instances in the train set are of type ‘Jog-
ging’. Now we need to compute the 𝑃 (𝑓𝑖 |𝐶𝑘 ) part from equation (2.13).
Its first argument x is the input value. The second argument m is the
mean, and the last argument s is the standard deviation. For illustration
purposes we are defining this function manually but remember that this
pdf is already implemented with the base dnorm() function.
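A direct implementation of equation (2.12) could be:

# Gaussian probability density function with mean m and standard deviation s.
f <- function(x, m, s){
  (1 / (s * sqrt(2 * pi))) * exp(-((x - m)^2) / (2 * s^2))
}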
According to equation (2.13) we need to compute 𝑃 (𝑓𝑖 |𝐶𝑘 ) for each fea-
ture 𝑖 and class 𝑘. Let’s assume there are only two classes, ‘Standing’ and
‘Jogging’. Thus, we need to compute the mean and standard deviation
for each, and for the feature RESULTANT (column 39).
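A sketch of these computations:

# Mean and standard deviation of RESULTANT for each of the two classes.
mean.standing <- mean(trainset$RESULTANT[trainset$class == "Standing"])
sd.standing <- sd(trainset$RESULTANT[trainset$class == "Standing"])
mean.jogging <- mean(trainset$RESULTANT[trainset$class == "Jogging"])
sd.jogging <- sd(trainset$RESULTANT[trainset$class == "Jogging"])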
mean.standing
#> [1] 9.405795
mean.jogging
#> [1] 13.70145
Note that the mean value for ‘Jogging’ is higher for this feature. This
was expected since this feature captures the overall movement across all
axes. Now we have everything we need to start making predictions on
new instances. We have the priors and we have the means and standard
deviations for each feature-class pair.
Let’s select the first instance from the test set and try to predict its class.
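For example:

# Take the first instance of the test set.
query <- testset[1,]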
Now we compute the posterior probability for each class using the
learned means and standard deviations:
# Compute P(Standing)P(RESULTANT|Standing)
priors["Standing"] * f(query$RESULTANT, mean.standing, sd.standing)
#> 0.003169748
# Compute P(Jogging)P(RESULTANT|Jogging)
priors["Jogging"] * f(query$RESULTANT, mean.jogging, sd.jogging)
#> 0.03884481
The posterior for ‘Jogging’ was higher (0.038) so we classify the query in-
stance as ‘Jogging’. If we check the true class we see that it was correctly
classified!
In this example we assumed that there was only one feature and we
computed each step manually. However, this can easily be extended to
deal with more features. So let’s just do that. We can write two functions,
one for training the classifier and the other for making predictions.
The following function will be used to train the classifier. It takes as
input a data frame with 𝑛 features. This function assumes that the class
is the last column. The function returns a list with the learned priors,
means, and standard deviations.
naive.bayes.train <- function(data){
  # Unique classes.
  classes <- unique(data$class)
  # Number of features.
  nfeatures <- ncol(data) - 1
  # List of per-class parameter matrices, indexed by class name.
  list.means.sds <- list()
  for(c in classes){
    # Matrix with one row per feature: column 1 = mean, column 2 = standard deviation.
    M <- matrix(0, nrow = nfeatures, ncol = 2)
    # Populate matrix.
    for(i in 1:nfeatures){
      feature.values <- data[which(data$class == c),i]
      M[i,1] <- mean(feature.values)
      M[i,2] <- sd(feature.values)
    }
    list.means.sds[[c]] <- M
  }
  # Prior probabilities for each class.
  priors <- table(data$class) / nrow(data)
  return(list(list.means.sds=list.means.sds,
              priors=priors))
}
The function iterates through each class and for each, it creates a matrix
M with 𝐹 rows and 2 columns where 𝐹 is the number of features. The first
column stores the means and the second the standard deviations. Those
matrices are saved in a list indexed by the class name so at prediction
time we can retrieve each matrix individually. At the end, the prior
probabilities are computed. Finally, a list is returned. The first element
of the list is the list of matrices and the second element contains the priors.
The next function will make predictions based on the learned parameters.
Its first argument is the learned parameters and the second a data frame
with the instances we want to make predictions for.
naive.bayes.predict <- function(params, data){
  # Number of features (the class is assumed to be the last column).
  nfeatures <- ncol(data) - 1
  classes <- names(params$priors)
  predictions <- NULL
  n <- nrow(data)
  # Iterate instances.
  for(i in 1:n){
    query <- data[i,]
    max.probability <- -Inf
    predicted.class <- ""
    # Iterate classes.
    for(c in classes){
      # Start with the prior P(C_k).
      acum.prob <- params$priors[c]
      # Iterate features.
      for(j in 1:nfeatures){
        # Compute P(feature|class)
        tmp <- f(query[,j],
                 params$list.means.sds[[c]][j,1],
                 params$list.means.sds[[c]][j,2])
        # Accumulate result.
        acum.prob <- acum.prob * tmp
      }
      # Keep the class with the highest posterior so far.
      if(acum.prob > max.probability){
        max.probability <- acum.prob
        predicted.class <- c
      }
    }
    predictions <- c(predictions, predicted.class)
  }
  return(predictions)
}
This function iterates through each instance and computes the posterior
for each class and stores the one that achieved the highest value as the
prediction. Finally, it returns the list with all predictions.
Now we are ready to train our Naive Bayes classifier. All we need to do
is call the function naive.bayes.train() and pass the train set.
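For example:

# Train the Naive Bayes classifier with the train set.
nb.model <- naive.bayes.train(trainset)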
The learned parameters are stored in nb.model and we can make predic-
tions with the naive.bayes.predict() function by passing the nb.model and
a test set.
# Make predictions.
predictions <- naive.bayes.predict(nb.model, testset)
# The ground truth labels are in the last column of the test set.
groundTruth <- testset$class
# Compute the confusion matrix (confusionMatrix() is from the caret package).
cm <- confusionMatrix(as.factor(predictions),
                      as.factor(groundTruth))
# Print accuracy
cm$overall["Accuracy"]
#> Accuracy
#> 0.7501538
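The object cm2 used below comes from a packaged Naive Bayes implementation whose code is not reproduced here. As an illustration only (the package actually used in naive_bayes.R may be a different one), the e1071 package could be used:

library(e1071)

# Train a packaged Naive Bayes classifier on the same features and evaluate it.
nb.pkg <- naiveBayes(x = trainset[, -ncol(trainset)],
                     y = as.factor(trainset$class))
pkg.predictions <- predict(nb.pkg, testset[, -ncol(testset)])
cm2 <- confusionMatrix(as.factor(pkg.predictions),
                       as.factor(groundTruth))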
# Print accuracy
cm2$overall["Accuracy"]
#> Accuracy
#> 0.7501538
As you can see, the result was the same as the one obtained with our
implementation! We implemented our own for illustrative purposes but
it is advisable to use already tested and proven packages. Furthermore,
this one also supports categorical variables.
2.5 Dynamic Time Warping

dtw_example.R
To compare two sequences we could use the well-known Euclidean distance.
However, since the two sequences may not be aligned in time, the result
could be misleading. Furthermore, the two sequences may differ in length.
To account for this “time-shift” effect in timeseries data, Dynamic Time
Warping (DTW) [Sakoe et al., 1990] can be used instead.
DTW is a method that:
• Finds an optimal match between two time-dependent sequences.
• Computes their dissimilarity.
• Finds the optimal deformation (mapping) of one of the sequences onto
the other.
Another advantage of DTW is that the timeseries do not need to be
of the same length. Suppose we have two timeseries, a query, and a
reference we want to compare with:
𝑞𝑢𝑒𝑟𝑦 = (2, 2, 2, 4, 4, 3)
𝑟𝑒𝑓 = (2, 2, 3, 3, 2)
The first thing to note is that the sequences differ in length. Figure 2.19
shows their plot. The query is the solid line and seems to be shifted to
the right one position with respect to the reference. The plot also shows
the resulting alignment after applying the DTW algorithm (dashed lines
between the sequences). The resulting distance (after aligning) between
the sequences is 3. In the following, we will see how the problem can
be formalized and how it can be computed. Don’t worry if you find the
math notation a bit difficult to grasp at this point. A step-by-step example
will follow which should help to explain how the method works.
FIGURE 2.19 DTW alignment between the query and reference se-
quences (solid line is the query).
The problem can be formalized as follows. Consider two sequences:

$X = (x_1, x_2, \dots, x_{T_x})$

$Y = (y_1, y_2, \dots, y_{T_y})$

where $x_i$ and $y_i$ are vectors. In the previous example, the vectors only
have one element since the sequences are 1-dimensional, but DTW also
works with multidimensional sequences. $T_x$ and $T_y$ are the sequences'
lengths. Let

$d(i_x, i_y)$

denote the dissimilarity (distance) between vector $x_{i_x}$ of $X$ and vector $y_{i_y}$ of $Y$. The alignment between the two sequences is described by two warping functions $\phi_x$ and $\phi_y$ that map a common index $k$ onto positions of each sequence:

$i_x = \phi_x(k), \quad k = 1, 2, \dots, T$

$i_y = \phi_y(k), \quad k = 1, 2, \dots, T.$

The total dissimilarity between the two sequences for a given warping $\phi$ is:

$d_\phi(X, Y) = \sum_{k=1}^{T} d(\phi_x(k), \phi_y(k)) \qquad (2.14)$

The aim is to find the warping function $\phi$ that minimizes the total dissimilarity, subject to the following constraints:

• Boundary conditions: $\phi_x(1) = 1$, $\phi_y(1) = 1$ and $\phi_x(T) = T_x$, $\phi_y(T) = T_y$.
• Monotonicity: $\phi_x(k+1) \geq \phi_x(k)$ and $\phi_y(k+1) \geq \phi_y(k)$.
• Continuity: $\phi_x(k+1) - \phi_x(k) \leq 1$ and $\phi_y(k+1) - \phi_y(k) \leq 1$.

To illustrate how the method works, consider again the two example sequences:
𝑄 = (2, 2, 2, 4, 4, 3)
𝑅 = (2, 2, 3, 3, 2)
The first step is to compute a local cost matrix. This is just a matrix
that contains the distance between every pair of points between the two
sequences. For this example, we will use the Manhattan distance. Since
our sequences are 1-dimensional this distance can be computed as the
absolute difference |𝑥𝑖 − 𝑦𝑖 |. Figure 2.20 shows the resulting local cost
matrix.
For example, position (1, 1) = 0 (row,column) because the first element
of 𝑄 is 2 and the first element of 𝑅 is also 2, thus, |2 − 2| = 0. The
rest of the matrix is filled in the same way. In dynamic programming,
partial results are computed and stored in a table. Figure 2.21 shows
the final dynamic programming table computed from the local cost ma-
trix. Initially, this table is empty. We start to fill it from bottom left
at position (1, 1). From the local cost matrix, the cost at position (1, 1)
is 0 so the cost at that position in the dynamic programming table is
0. Then we can start filling in the contiguous cells. The only direction
from which we can arrive at position (1, 2) is from the west (W). The
cost at position (1, 2) in the local cost matrix is 0 and the minimum cost
of the cell to the west (1, 1) is also 0, so 𝑊 ∶ 0 + 0 = 0. For each
cell we add the current cost plus the minimum cost when coming
from the contiguous cell. The minimum costs are marked with red. For
some cells it is possible to arrive from three different directions: S, W,
and SW, thus we need to compute the cost when coming from each of
those. The final minimum cost at position (5, 6) is 3. Thus, that is the
global DTW distance. In the example, it is possible to get the minimum
at (5, 6) when arriving from the south or southwest.
Once the table is filled in, we can backtrack starting at (5, 6) to find
the warping functions. Figure 2.22 shows the final warping functions.
Because of the endpoint constraints, we know that 𝜙𝑄 (1) = 1, 𝜙𝑅 (1) =
1, 𝜙𝑄 (6) = 6, and 𝜙𝑅 (6) = 5. Then, from (5, 6) the minimum contiguous
value is 2 coming from SW, thus 𝜙𝑄 (5) = 5, 𝜙𝑅 (5) = 4, and so on. Note
that we could also have chosen to arrive from the south with the same
minimum value of 2 but still this would have resulted in the same overall
distance. The dashed line in figure 2.21 shows the full backtracking.
library("dtw")
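The alignment object inspected below can be obtained with the dtw() function (a sketch using the example sequences):

# Define the query and reference sequences from the example.
query <- c(2, 2, 2, 4, 4, 3)
ref <- c(2, 2, 3, 3, 2)
# keep = TRUE retains the internal cost matrices so they can be inspected.
alignment <- dtw(query, ref, keep = TRUE)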
alignment$localCostMatrix
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0 0 1 1 0
#> [2,] 0 0 1 1 0
#> [3,] 0 0 1 1 0
#> [4,] 2 2 1 1 2
#> [5,] 2 2 1 1 2
#> [6,] 1 1 0 0 1
alignment$distance
#> [1] 3
alignment$index1
#> [1] 1 2 3 4 5 6
alignment$index2
#> [1] 1 1 2 3 4 5
The local cost matrix is the same one as in Figure 2.20 but in rotated
form. The resulting object also has the dynamic programming table
which can be plotted along with the resulting backtracking (see Figure
2.23).
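One way to produce such a plot (a sketch; the script may use a different plot type):

# Cumulative cost (dynamic programming) matrix with the warping path overlaid.
plot(alignment, type = "density")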
And finally, the aligned sequences can be plotted. The previous Figure
2.19 shows the result of the following command.
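A call along these lines produces that figure (the exact arguments in dtw_example.R may differ):

# Plot both sequences and the point-to-point alignment (dashed lines).
plot(alignment, type = "two", off = 1, match.lty = 2,
     main = "DTW resulting alignment",
     xlab = "time", ylab = "values")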
hand_gestures.R, hand_gestures_auxiliary.R
The accelerometer's 𝑥, 𝑦, and 𝑧 axes are combined into a single magnitude timeseries:

$Magnitude(t) = \sqrt{a_x(t)^2 + a_y(t)^2 + a_z(t)^2} \qquad (2.16)$

This magnitude is used to preprocess the data. Since the sequences of each gesture are of varying
length, storing them as a data frame could be problematic because
data frames have fixed sizes. Instead, the gen.instances() function pro-
cesses the files and returns all hand gestures as a list. This function also
computes the magnitude (equation (2.16)). The following code (from
hand_gestures.R) calls the gen.instances() function and stores the results
in the instances variable which is a list. Then, we select the first and
second instances to be the query and the reference.
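A sketch of that code (the path passed to gen.instances() is an assumption):

# Generate the list of gesture instances from the raw files.
instances <- gen.instances("../data/hand_gestures/")
# Use the first instance as the query and the second one as the reference.
query <- instances[[1]]
ref <- instances[[2]]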
Each element in instances is also a list that stores the type and values
(magnitude) of each gesture.
print(ref$type)
#> [1] "1"
Here, the first two instances are of type ‘1’. We can also print the mag-
nitude values.
# Print values.
print(query$values)
#> [1] 9.167477 9.291464 9.729926 9.901090 ....
In this case, both classes are ‘1’. We can use the dtw() function to compute
the similarity between the query and the reference instance and
plot the resulting alignment (Figure 2.26).
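A sketch of that dtw() call:

# Compute the DTW alignment between the two gestures' magnitude timeseries.
alignment <- dtw(query$values, ref$values, keep = TRUE)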
# Plot result.
plot(alignment, type="two", off=1, match.lty=2, match.indices=40,
main="DTW resulting alignment",
xlab="time", ylab="magnitude")
Before doing the classification, the DTW distances between every pair of
gesture instances are precomputed with the auxiliary function
matrix.distances() and saved to disk so they can be reused:

D <- matrix.distances(instances)
# Save results.
save(D, file="D.RData")
# The full classification code is in hand_gestures.R. In essence, each test
# instance (query) is assigned the class of its nearest instance according
# to the precomputed DTW distances stored in D.
for(query in testSet){
  # ... find the closest instance to 'query' using D, append its class to
  # ... 'predictions' and the true class to 'groundTruth' ...
}
} # end of (outer) for

cm <- confusionMatrix(factor(predictions),
                      factor(groundTruth))
The overall recall was 0.87 which is not bad. From the confusion ma-
trix (Figure 2.27), we can see that the class ‘a’ was often confused with
‘circleLeft’ and vice versa. This makes sense since both have similar mo-
tions (see Figure 2.24). Also, ‘b’ was often confused with ‘circleLeft’. The
‘square’ class was always correctly classified. This example demonstrated
how DTW can be used with 𝑘-NN to recognize hand gestures.
dummy_classifiers.R
When faced with a new problem, you may be tempted to start trying
to solve it by using a complex model. Then, you proceed to train your
complex model and evaluate it. The results look reasonably good so you
think you are done. However, this good performance could only be an
illusion. Sometimes there are underlying problems with the data that
can give the false impression that a model is performing well. Examples
of such problems are imbalanced datasets, no correlation between the
features and the classes, features not containing enough information, etc.
Dummy models can be used to spot some of those problems. Dummy
models use little or no information at all when making predictions (we’ll
see how in a moment).
Furthermore, for some problems (especially in regression) it is not clear
what is considered to be a good performance. There are problems in
which doing slightly better than random is considered a great achieve-
ment (e.g., in forecasting) but for other problems that would be unac-
ceptable. Thus, we need some type of baseline to assess whether or not a
particular model is bringing some benefit. Dummy models are not only
used to spot problems but can be used as baselines as well.
Dummy models are also called baseline models or dumb models. One
student I was supervising used to call them stupid models. When I am
angry, I also call them that, but today I’m in a good mood so I’ll
refer to them as dummy.
Now, I will present three types of dummy classifiers and how they can
be implemented in R.
# In percentages.
table(dataset$class) / nrow(dataset)
#> Upstairs Walking
#> 0.08768084 0.91231916
We can see that more than 90% of the instances belong to class ‘Walking’.
It’s time to define the dummy classifier!
# Learn the most frequent class from the train set.
most.frequent.class.train <- function(data){
  counts <- table(data$class)
  # Select the class with the highest count.
  most.frequent <- names(which.max(counts))
  return(most.frequent)
}

# Predict: 'params' is the class name learned by the train function.
most.frequent.class.predict <- function(params, data){
  # Return the same label for as many rows as there are in data.
  return(rep(params, nrow(data)))
}
The only thing the predict function does is to return the params argument
that contains the class name repeated 𝑛 times, where 𝑛 is the number
of rows in the test data frame.
Let’s try our functions. The dataset has already been split into 50% for
training and 50% for testing. First we train the dummy model using the
train set. Then, the learned parameter is printed.
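A sketch of those two steps:

# Train the dummy classifier: learn the most frequent class in the train set.
dummy.model1 <- most.frequent.class.train(trainset)
# Print the learned parameter; for this dataset it is 'Walking'.
dummy.model1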
Now we can make predictions on the test set and compute the accuracy.
# Make predictions.
predictions <- most.frequent.class.predict(dummy.model1, testset)
# Compute the confusion matrix; both factors need the same set of levels.
cm <- confusionMatrix(factor(predictions, levels = unique(dataset$class)),
                      factor(testset$class, levels = unique(dataset$class)))
# Print accuracy
cm$overall["Accuracy"]
#> Accuracy
#> 0.9087719
The accuracy was 90.8%. It seems that the dummy classifier was not
that dummy after all! Let’s print the confusion matrix to inspect the
predictions.
#> Reference
#> Prediction Walking Upstairs
#> Walking 1036 104
#> Upstairs 0 0
From the confusion matrix we can see that all ‘Walking’ activities were
correctly classified but none of the ‘Upstairs’ classes were identified. This
is because the dummy model only predicts ‘Walking’. Here we can see
that even though it seemed like the dummy model was doing pretty
well, it was not that good after all.
We can now try with a decision tree from the rpart package.
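A sketch of that comparison (the exact call is in dummy_classifiers.R):

library(rpart)

# Make sure the class column is a factor (needed for classification).
trainset$class <- as.factor(trainset$class)
testset$class <- as.factor(testset$class)
# Train a decision tree and evaluate it on the same test set.
tree <- rpart(class ~ ., trainset)
tree.predictions <- predict(tree, testset, type = "class")
cm.tree <- confusionMatrix(tree.predictions, testset$class)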
# Print accuracy
cm.tree$overall["Accuracy"]
#> Accuracy
#> 0.9263158
Decision trees are more powerful than dummy classifiers but the accuracy
was very similar!
# The random classifier only needs to learn the unique classes in the train set.
random.classifier.train <- function(data){
  unique.classes <- unique(data$class)
  return(unique.classes)
}
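A corresponding prediction function might look like the following (the function names here are assumptions; the actual implementation is in dummy_classifiers.R):

random.classifier.predict <- function(params, data){
  # Pick a random class (from the learned unique classes) for each row.
  return(sample(params, size = nrow(data), replace = TRUE))
}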
At prediction time, it just picks a random label for each instance in the
test set. This model achieved an accuracy of only 49.0% using the same
dataset, but it correctly identified more instances of type ‘Upstairs’.
#> Reference
#> Prediction Walking Upstairs
#> Walking 506 54
#> Upstairs 530 50
In fact, the previous rule can be thought of as a very simple decision tree
with only one root node. Surprisingly, sometimes simple rules can be dif-
ficult to beat by more complex models. In this section I’ve been focusing
on classification problems, but dummy models can also be constructed
for regression. The simplest one would be to predict the mean value of
𝑦 regardless of the feature values. Another dummy model could predict
a random value between the min and max of 𝑦. If there is a categorical
feature, one could predict the mean value based on the category. In fact,
that is what we did in chapter 1 in the simple regression example.
In summary, one can construct any type of dummy model depending
on the application. The takeaway is that dummy models allow us to
assess how more complex models perform with respect to some baselines
and help us to detect possible problems in the data and features. What I
typically do when solving a problem is to start with simple models and/or
rules and then, try more complex models. Of course, manual thresholds
and simple rules can work remarkably well in some situations but they
are not scalable. Depending on the use case, one can just implement
the simple solution or go for something more complex if the system is
expected to grow or be used in more general ways.
2.7 Summary
This chapter focused on classification models. Classifiers predict a cat-
egory based on the input features. Here, it was demonstrated how classi-
fiers can be used to detect indoor locations, classify activities, and hand
gestures.
• 𝑘-Nearest Neighbors (𝑘-NN) predicts the class of a test point as
the majority class of the 𝑘 nearest neighbors.
• Some classification performance metrics are recall, specificity, pre-
cision, accuracy, F1-score, etc.
• Decision trees are easy-to-interpret classifiers trained recursively
based on feature importance (for example, purity).
• Naive Bayes is a type of classifier where features are assumed to be
independent.
• Dynamic Time Warping (DTW) computes the similarity between
two timeseries after aligning them in time. This can be used for classi-
fication for example, in combination with 𝑘-NN.
• Dummy models can help to spot possible errors in the data and can
also be used as baselines.
3
Predicting Behavior with Ensemble Learning
3.1 Bagging
Bagging stands for “bootstrap aggregating” and is an ensemble learning
method proposed by Breiman [1996]. Ummm…, Bootstrap, aggregating?
Let’s start with the aggregating part. As the name implies, this method
is based on training several base learners (e.g., decision trees) and com-
bining their outputs to produce a single final prediction. One way to
combine the results is by taking the majority vote for classification tasks
or the average for regression. In an ideal case, we would have enough
data to train each base learner with an independent train set. However,
in practice we may only have a single train set of limited size. Training
several base learners with the same train set is equivalent to having a
single learner, provided that the training procedure of the base learners
is deterministic. Even if the training procedure is not deterministic, the
resulting models might be very similar. What we would like to have is ac-
curate base learners but at the same time they should be diverse. Then,
how can those base learners be trained? Well, this is where the bootstrap
part comes into play.
Bootstrapping means generating new train sets by sampling instances
with replacement from the original train set. If the original train set has
𝑁 instances, the method selects 𝑁 instances at random to produce a new
train set. With replacement means that repeated instances are allowed.
This has the effect of generating a new train set of size 𝑁 by removing
some instances and duplicating other instances. By using this method,
𝑛 different train sets can be generated and used to train 𝑛 different
learners.
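For example, assuming a data frame trainset, one bootstrapped train set can be generated as follows:

# Sample N row indices with replacement and build the bootstrapped train set.
N <- nrow(trainset)
idxs <- sample(N, size = N, replace = TRUE)
bootstrapped.trainset <- trainset[idxs, ]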
It has been shown that having more diverse base learners increases per-
formance. One way to generate diverse learners is by using different train
sets as just described. In his original work, Breiman [1996] used decision
trees as base learners. Decision trees are considered to be very unstable.
This means that small changes in the train set produce very different
trees – but this is a good thing for bagging! Most of the time, the ag-
gregated predictions will produce better results than the best individual
learner from the ensemble.
Figure 3.1 shows bootstrapping in action. The train set is sampled with
replacement 3 times. The numbers represent indices to arbitrary train
instances. Here, we can see that in the first sample, the instance number
5 is missing but instead, instance 2 is duplicated. All samples have five
elements. Then, each sample is used to train individual decision trees.
One of the disadvantages of ensemble methods is their higher compu-
tational cost both during training and inference. Another disadvantage
of ensemble methods is that they are more difficult to interpret. Still,
there exist model agnostic interpretability methods [Molnar, 2019] that
can help to analyze the results. In the next section, I will show you how
to implement your own Bagging model with decision trees in R.
bagging_activities.R, iterated_bagging_activities.R
# Fragment of the my_bagging() training function; the complete code is in
# bagging_activities.R and a fuller sketch is shown after the description below.
N <- nrow(data)
# ... (training loop and construction of the 'res' object, described below) ...
return(res)
}
First, a list that will store each individual learner is defined models <-
list(). Then, the function iterates ntrees times. In each iteration, a boot-
strapped train set is generated and used to train a rpart model. The xval
= 0 parameter tells rpart not to perform cross-validation internally. The
cp parameter is also set to 0. This value controls the amount of pruning.
The default is 0.01, which leads to smaller trees. Smaller trees are more
similar to each other, but since we want diversity we set cp to 0 so that
bigger and, as a consequence, more diverse trees are generated.
Finally, an object of class "my_bagging" is returned. This is just a list con-
taining the trained base learners. The class = "my_bagging" argument is
important. It tells R that this object is of type my_bagging. Setting the
class will allow us to use the generic predict() function, and R will auto-
matically call the corresponding predict.my_bagging() function which we
will shortly define. The class name and the function name after predict.
need to be the same.
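Putting the description together, a minimal sketch of my_bagging() could look as follows (this is an approximation consistent with the text, not the exact code from bagging_activities.R):

library(rpart)

my_bagging <- function(theFormula, data, ntrees = 10){
  N <- nrow(data)
  # List that will store each individual learner.
  models <- list()
  for(i in 1:ntrees){
    # Bootstrapped train set: N rows sampled with replacement.
    sampledset <- data[sample(N, size = N, replace = TRUE), ]
    # Train a tree without internal cross-validation (xval = 0) and
    # without pruning (cp = 0) to favor diversity.
    treeModel <- rpart(theFormula, sampledset, xval = 0, cp = 0)
    models <- c(models, list(treeModel))
  }
  res <- structure(list(models = models), class = "my_bagging")
  return(res)
}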
predict.my_bagging <- function(object, newdata, ...){
  ntrees <- length(object$models)
  N <- nrow(newdata)
  # Matrix that stores the individual predictions of each tree.
  M <- matrix(data = "", nrow = N, ncol = ntrees)
  # Populate matrix.
  # Each column of M contains all predictions for a given tree.
  # Each row contains the predictions for a given instance.
  for(i in 1:ntrees){
    m <- object$models[[i]]
    tmp <- as.character(predict(m, newdata, type = "class"))
    M[,i] <- tmp
  }
  # Final predictions: majority vote across the trees for each instance.
  predictions <- character()
  for(i in 1:N){
    predictions <- c(predictions, names(which.max(table(M[i,]))))
  }
  return(predictions)
}
Now let’s dissect the predict.my_bagging() function. First, note that the
function name starts with predict. followed by the type of object. Follow-
ing this convention will allow us to call predict() and R will call the cor-
responding method based on the class of the object. The first argument
object is an object of type “my_bagging” as returned by my_bagging().
The second argument newdata is the test set we want to generate pre-
dictions for. A matrix M that will store the predictions for each tree is
defined. This matrix has 𝑁 rows and 𝑛𝑡𝑟𝑒𝑒𝑠 columns, where 𝑁 is the
number of instances in newdata.
The following will perform 5-fold cross-validation and print the results.
set.seed(1234)
k <- 5
folds <- sample(k, size = nrow(df), replace = TRUE)
# Initialize the vectors that will accumulate results across folds.
predictions <- NULL; groundTruth <- NULL
for(i in 1:k){
trainSet <- df[which(folds != i), ]
testSet <- df[which(folds == i), ]
treeClassifier <- my_bagging(class ~ ., trainSet, ntree = 10)
foldPredictions <- predict(treeClassifier, testSet)
predictions <- c(predictions, as.character(foldPredictions))
groundTruth <- c(groundTruth, as.character(testSet$class))
}
cm <- confusionMatrix(as.factor(predictions), as.factor(groundTruth))
# Print accuracy
cm$overall["Accuracy"]
#> Accuracy
#> 0.861388
The accuracy was much better now compared to 0.789 from the previous
chapter without using Bagging!
The effect of adding more trees to the ensemble can also be analyzed.
The script iterated_bagging_activities.R does 5-fold cross-validation as
we just did but starts with 1 tree in the ensemble and repeats the process
by adding more trees until 50.
Figure 3.2 shows the effect on the train and test accuracy with different
number of trees. Here, we can see that 3 trees already produce a signif-
icant performance increase compared to 1 or 2 trees. This makes sense
since having only 2 trees does not add additional information. If the two
trees produce different predictions, then it becomes a random choice be-
tween the two labels. In fact, 2 trees produced worse results than 1 tree.
But we cannot make strong conclusions since the experiment was run
only once. One possibility to break ties when there are only two trees
is to use the averaged probabilities of each label. rpart can return those
probabilities by setting type = "prob" in the predict() function which is
the default behavior. This is left as an exercise for the reader. In the
following section, Random Forest will be described which is a way of
introducing more diversity to the base learners.
library(randomForest)
rf <- randomForest(class ~ ., trainSet, ntree = 10)
By default, ntree = 500. Among other things, you can control how many
random features are sampled at each split with the mtry argument. By
default, for classification mtry = floor(sqrt(ncol(x))) and for regression
mtry = max(floor(ncol(x)/3), 1).
set.seed(1234)
k <- 5
folds <- sample(k, size = nrow(df), replace = TRUE)
predictions <- NULL; groundTruth <- NULL
for(i in 1:k){
  trainSet <- df[which(folds != i), ]
  testSet <- df[which(folds == i), ]
  rf <- randomForest(class ~ ., trainSet, ntree = 10)
  predictions <- c(predictions, as.character(predict(rf, testSet)))
  groundTruth <- c(groundTruth, as.character(testSet$class))
}
cm <- confusionMatrix(as.factor(predictions), as.factor(groundTruth))
# Print accuracy
cm$overall["Accuracy"]
#> Accuracy
#> 0.870801
Those results are better than the previous ones with Bagging. Figure 3.3
shows the results when doing 5-fold cross-validation for different num-
ber of trees (the complete script is in iterated_randomForest_activities.R).
From these results, we can see a behavior similar to that of Bagging. That is,
the accuracy increases very quickly and then it stabilizes.
If we directly compare Bagging vs. Random Forest, Random Forest out-
performs Bagging (Figure 3.4). The complete code to generate the plot
is in the script iterated_bagging_rf.R.
3.3 Stacked Generalization

Stacked Generalization, or simply Stacking, consists of a set of first-level
learners (the base learners) and a meta-learner that is trained on the base learners’
predictions. The predictions of the base learners are known as the meta-
features. The meta-features along with their true labels 𝑦 are used to
build a new train set that is used to train a meta-learner. The rationale
behind this is that the predictions themselves contain information that
can be used by the meta-learner.
The procedure to train a Stacking model is as follows:

1. Train each of the first-level learners in 𝐿 with the train set 𝐷.
2. Use the trained first-level learners to predict the labels of the instances in 𝐷.
3. Build a new train set 𝐷′ whose features are those predictions (the meta-features) and whose labels are the original labels 𝑦.
4. Train the meta-learner with 𝐷′.
Note that steps 2 and 3 can lead to overfitting because the predictions
are made with the same data used to train the models. To avoid this,
steps 2 and 3 are usually performed using 𝑘-fold cross-validation. After
𝐷′ has been generated, the learners in 𝐿 can be retrained using all data
in 𝐷.
Ting and Witten [1999] showed that the performance can increase by
adding confidence information about the predictions. For example, the
probabilities produced by the first-level learners. Most classifiers can
output probabilities.
At prediction time, each first-level learner predicts the class, and option-
ally, the class probabilities of a given instance. These predictions are used
to form a feature vector (meta-features) that is fed to the meta-learner to
obtain the final prediction. Usually, first-level learners are high perform-
ing classifiers such as Random Forests, Support Vector Machines, Neural
Networks, etc. The meta-learner should also be a powerful classifier.
In the next section, I will introduce Multi-view Stacking which is similar
to Generalized Stacking except that each first-level learner is trained
with features from a different view.
3.4 Multi-view Stacking for Home Tasks Recognition

stacking_algorithms.R, stacking_activities.R

In multi-view learning, each view can have different properties, thus, different types of models may be needed for each view.
When aggregating features from all views, new variable correlations may
be introduced which could impact the performance. Another limitation
is that features need to be in the same format (feature vectors, images,
etc.), so they can be aggregated.
For video classification, we could have two views. One represented by
sequences of images, and the other by the corresponding audio. For the
video part, we could encode the features as the images themselves, i.e.,
matrices. Then, a Convolutional Neural Network (covered in chapter 8)
could be trained directly from those images. For the audio part, statis-
tical features can be extracted and stored as normal feature vectors. In
this case, the two representations (views) are different. One is a matrix
and the other a one-dimensional feature vector. Combining them to train
a single classifier could be problematic given the nature of the views and
their different encoding formats. Instead, we can train two models, one
for each view and then combine the results. This is precisely the idea of
Multi-view Stacking [Garcia-Ceja et al., 2018a]. Train a different model
for each view and combine the outputs like in Stacking.
Here, Multi-view Stacking will be demonstrated using the HOME TASKS
dataset. This dataset was collected from two sources. Acceleration and
audio. The acceleration was recorded with a wrist-band watch and the
audio using a cellphone. This dataset consists of 7 common home tasks:
‘mop floor’, ‘sweep floor’, ‘type on computer keyboard’, ‘brush teeth’, ‘wash
hands’, ‘eat chips’, and ‘watch t.v.’. Three volunteers performed each
activity for approximately 3 minutes.
The acceleration and audio signals were segmented into 3-second win-
dows. From each window, different features were extracted. From the
acceleration, 16 features were extracted from the 3 axes (𝑥,𝑦,𝑧) such as
mean, standard deviation, maximum values, mean magnitude, area un-
der the curve, etc. From the audio signals, 12 features were extracted,
namely, Mel Frequency Cepstral Coefficients (MFCCs). To preserve vol-
unteers’ privacy, the original audio was not released. The dataset already
contains the extracted features from acceleration and audio. The first
column is the label.
In order to implement Multi-view Stacking, two Random Forests will
be trained, one for each view (acceleration and audio). The predicted
outputs will be stacked to form the new training set 𝐷′ and a Random
Forest trained with 𝐷′ will act as the meta-learner.
The next code snippet taken from stacking_algorithms.R shows the multi-
view stacking function implemented in R.
# Construct meta-features.
metaFeatures <- data.frame(label = trueLabels,
((probs.v1 + probs.v2) / 2),
pred1 = predicted.v1,
pred2 = predicted.v2)
#train meta-learner
metalearner <- randomForest(label ~.,
metaFeatures, nt = 100)
return(res)
}
The first argument D is a data frame containing the training data. v1cols
and v2cols are the column names of the two views. Finally, argument
k specifies the number of folds for the internal cross-validation to avoid
overfitting (Steps 2 and 3 as described in the generalized stacking proce-
dure).
The function iterates through each fold and trains a Random Forest
with the train data for each of the two views. Within each iteration, the
trained models are used to predict the labels and probabilities on the
internal test set. Predicted labels and probabilities on the internal test
sets are concatenated across all folds (predicted.v1, predicted.v2).
After cross-validation, the meta-features are generated by creating a data
frame with the predictions of each view. Additionally, the average of class
probabilities is added as a meta-feature. The true labels are also added.
The purpose of cross-validation is to avoid overfitting but at the end, we
do not want to waste data so both learners are re-trained with all data
D.
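Based on that description, a minimal sketch of the training function could look as follows (argument names follow the text and the excerpt above; internal details such as the 'label' column name, ntree = 100, and the structure of the returned object are assumptions):

library(randomForest)

mvstacking <- function(D, v1cols, v2cols, k = 10){
  D$label <- as.factor(D$label)
  # Folds for the internal cross-validation (steps 2 and 3 of stacking).
  folds <- sample(1:k, size = nrow(D), replace = TRUE)
  trueLabels <- NULL
  predicted.v1 <- NULL; predicted.v2 <- NULL
  probs.v1 <- NULL; probs.v2 <- NULL
  for(i in 1:k){
    train <- D[folds != i, ]; test <- D[folds == i, ]
    # Train one first-level learner per view.
    m.v1 <- randomForest(label ~ ., train[, c("label", v1cols)], ntree = 100)
    m.v2 <- randomForest(label ~ ., train[, c("label", v2cols)], ntree = 100)
    # Predicted labels and class probabilities on the internal test set.
    predicted.v1 <- c(predicted.v1, as.character(predict(m.v1, test)))
    predicted.v2 <- c(predicted.v2, as.character(predict(m.v2, test)))
    probs.v1 <- rbind(probs.v1, predict(m.v1, test, type = "prob"))
    probs.v2 <- rbind(probs.v2, predict(m.v2, test, type = "prob"))
    trueLabels <- c(trueLabels, as.character(test$label))
  }
  # Construct meta-features (as in the excerpt above).
  metaFeatures <- data.frame(label = as.factor(trueLabels),
                             ((probs.v1 + probs.v2) / 2),
                             pred1 = as.factor(predicted.v1),
                             pred2 = as.factor(predicted.v2))
  # Train meta-learner.
  metalearner <- randomForest(label ~ ., metaFeatures, ntree = 100)
  # Re-train the view-specific learners with all the data.
  learnerV1 <- randomForest(label ~ ., D[, c("label", v1cols)], ntree = 100)
  learnerV2 <- randomForest(label ~ ., D[, c("label", v2cols)], ntree = 100)
  res <- structure(list(metalearner = metalearner,
                        learnerV1 = learnerV1, learnerV2 = learnerV2,
                        v1cols = v1cols, v2cols = v2cols),
                   class = "mvstacking")
  return(res)
}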
# Build meta-features
metaFeatures <- data.frame(((raw.v1 + raw.v2) / 2),
pred1 = pred.v1,
pred2 = pred.v2)
return(predictions)
}
The object parameter is the trained model and newdata is a data frame
from which we want to make the predictions. First, labels and prob-
abilities are predicted using the two views. Then, a data frame with
the meta-features is assembled with the predicted label and the aver-
aged probabilities. Finally, the meta-learner is used to predict the final
classes using the meta-features.
The script stacking_activities.R shows how to use our mvstacking() func-
tion. With the following two lines we can train and make predictions.
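Those two lines are not reproduced here; assuming column-name vectors v1cols and v2cols and a predict method for the returned object (as suggested by the excerpts above), they might look like:

# Train Multi-view Stacking and predict on new data (sketch).
m.stacking <- mvstacking(trainset, v1cols, v2cols, k = 10)
predictions <- predict(m.stacking, newdata = testset)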
The script performs 10-fold cross-validation and for the sake of compari-
son, it builds three models. One with only audio features, one with only
acceleration features, and the Multi-view Stacking one combining both
types of features.
Table 3.1 shows the results for each view and with Multi-view Stacking.
Clearly, combining both views with Multi-view Stacking achieved the
best results compared to using a single view.
Figure 3.6 shows the resulting confusion matrices for the three cases. By
looking at the recall (anti-diagonal) of the individual classes, it seems
that audio features are better at recognizing some activities like ‘sweep’
and ‘mop floor’ whereas the accelerometer features are better for classi-
fying ‘eat chips’, ‘wash hands’, ‘type on keyboard’, etc. thus, those two
views are somehow complementary. All recall values when using Multi-
view Stacking are higher than for any of the other views.
3.5 Summary
In this chapter, several ensemble learning methods were introduced. In
general, ensemble models perform better than single models.
• The main idea of ensemble learning is to train several models and
combine their results.
• Bagging is an ensemble method consisting of 𝑛 base-learners, each,
trained with bootstrapped training samples.
• Random Forest is an ensemble of trees. It introduces randomness to
the trees by selecting random features in each split.
• Another ensemble method is called stacked generalization. It con-
sists of a set of base-learners and a meta-learner. The later is trained
using the outputs of the base-learners.
• Multi-view learning can be used when an instance can be repre-
sented by two or more views (for example, different sensors).
4
Exploring and Visualizing Behavioral Data
EDA.R
One of the reasons may be that you found the dataset online and maybe
the project is already over. In those cases, you can try to contact the
authors. I have done that several times and they were very responsive. It
is also a good idea to try to find experts in the field even if they were not
involved in the project. This will allow you to understand things from
their perspective and possibly to explain patterns/values that you may
find later in the process.
It is good practice to check the min and max values of all variables to
see if they have different ranges since some algorithms are sensitive to
different scales.
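For example (assuming the data frame is called dataset):

# Print summary statistics for every variable in the data frame.
summary(dataset)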
The output of the summary() function also shows some strange values.
The statistics of the variable XAVG are all 0𝑠. Some other variables like
ZAVG were encoded as characters and it seems that the ‘?’ symbol is
appended to the numbers. In summary, the summary() function (I know,
too many summaries in this sentence), allowed us to spot some errors
in the dataset. What we do with that information will depend on the
domain and application.
library(ggplot2)

t <- table(dataset$class)
t <- as.data.frame(t)
colnames(t) <- c("class","count")
# Bar plot of the class counts (the plot object p is printed below).
p <- ggplot(t, aes(x = class, y = count)) + geom_bar(stat = "identity")
print(p)
The most common activity turned out to be ‘Walking’ with 2081 in-
stances. It seems that the volunteers were a bit sporty since ‘Jogging’
is the second most frequent activity. One thing to note is that there
are some big differences here. For example, ‘Walking’ vs. ‘Standing’.
Those differences in class counts can have an impact when training clas-
sification models. This is because classifiers try to minimize the overall
error regardless of the performance of individual classes, thus, they tend
to prioritize the majority classes. This is called the class imbalance
problem. This occurs when there are many instances of some classes
but fewer of some other classes. For some applications this can be a
problem. For example, in fraud detection, datasets have many legiti-
mate transactions but just a few of illegal ones. This will bias a classifier
to be good at detecting legitimate transactions but what we are really
interested in is detecting the illegal transactions.

4.4 User-class Sparsity Matrix
For the activity recognition example, some persons may go jogging fre-
quently while others may never go jogging at all. Some behaviors will
be present or absent depending on each individual. We can plot this in-
formation with what I call a user-class sparsity matrix. Figure 4.2
shows this matrix for the activities dataset. The code to generate this
plot is included in the script EDA.R.
The x-axis shows the user ids and the y-axis the classes. A colored en-
try (gray in this case) means that the corresponding user has at least
one associated instance of the corresponding class. For example, user 3
performed all activities and thus, the dataset contains at least one in-
stance for each of the six activities. On the other hand, user 25 only has
instances for two activities. Users are sorted in descending order (users
that have more classes are at the left). At the bottom of the plot, the
sparsity is shown (0.18). This is just the percentage of empty cells in the
matrix. When all users have at least one instance of every class the spar-
sity is 0. When the sparsity is different from 0, one needs to decide what
to do depending on the application. The following cases are possible:
• Some users did not perform all activities. If the classifier was trained
with, for example, 6 classes and a user never goes ‘jogging’, the clas-
sifier may still sometimes predict ‘jogging’ even if a particular user
never does that. This can degrade the predictions’ performance for
that particular user and can be worse if that user never performs other
activities. A possible solution is to train different classifiers with dif-
ferent class subsets. If you know that some users never go ‘jogging’
then you train a classifier that excludes ‘jogging’ and use that one
for that set of users. The disadvantage of this is that there are many
possible combinations so you need to train many models. Since several
classifiers can generate prediction scores and/or probabilities per class,
another solution would be to train a single model with all classes and
predict the most probable class excluding those that are not part of a
particular user.
• Some users can have unique classes. For example, suppose there is a
new user that has an activity labeled as ‘Eating’ which no one else
has, and thus, it was not included during training. In this situation,
the classifier will never predict ‘Eating’ since it was not trained for that
activity. One solution could be to add the new user’s data with the new
labels and retrain the model. But if not too many users have the activ-
ity ‘Eating’ then, in the worst case, they will die from starvation. In a
less severe case, the overall system performance can degrade because
as the number of classes increases, it becomes more difficult to find
separation boundaries between categories, thus, the models become
less accurate. Another possible solution is to build user-dependent
models for each user. These, and other types of models in multi-user
settings will be covered in chapter 9.
4.5 Boxplots
Boxplots are a good way to visualize the relationship between variables
and classes. R already has the boxplot() function. In the SMARTPHONE
ACTIVITIES dataset, the RESULTANT variable represents the ‘total
amount of movement’ considering the three axes [Kwapisz et al., 2010].
The following code displays a set of boxplots (one for each class) with
respect to the RESULTANT variable (Figure 4.3).
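A sketch of that code:

# Boxplots of the RESULTANT variable, one box per class.
boxplot(RESULTANT ~ class, data = dataset)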
The solid black line in the middle of each box marks the median. Overall,
we can see that this variable can be good at separating high-intensity
activities like jogging, walking, etc. from low-intensity ones like sitting or
standing. With boxplots we can inspect one feature at a time. If you want
to visualize the relationship between predictors, correlation plots can be
used instead. Correlation plots will be presented in the next subsection.
The following code snippet uses the corrplot library to generate a corre-
lation plot (Figure 4.5) for the HOME TASKS dataset. Remember that
this dataset contains two sets of features. One set extracted from au-
dio and the other one extracted from the accelerometer sensor. First,
the Pearson correlation between each pair of variables is computed
with the cor() function and then the corrplot() function is used to gen-
erate the actual plot. Here, we specify that we only want to display the
upper diagonal with type = "upper". The tl.pos argument controls where
to print the labels. In this example, at the top and in the diagonal.
Setting diag = FALSE instructs the function not to print the principal di-
agonal which is all ones since it is the correlation between each variable
and itself.
library(corrplot)
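A sketch of those two steps, following the arguments described above:

# Pearson correlations between all numeric variables (column 1 is the label).
CORRS <- cor(dataset[, -1])
# Plot only the upper triangle, labels at the top and diagonal, hide the diagonal.
corrplot(CORRS, type = "upper", tl.pos = "td", diag = FALSE)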
It looks like the correlations between sound features (v1_) and acceler-
ation features (v2_) are not too high. In this case, this is good since
we want both sources of information to be as independent as possible so
that they capture different characteristics and complement each other
as explained in section 3.4. On the other hand, there are high correla-
tions between some acceleration features. For example v2_maxY with
v2_sdMagnitude.
Please, be aware that the Pearson correlation only captures linear re-
lationships.
The qtlcharts package can be used to generate interactive correlation plots:

library(qtlcharts)

iplotCorr(dataset[,-1], reorder=F,
          chartOpts=list(cortitle="Correlation matrix",
                         scattitle="Scatterplot"))
Please note that at the time this book was written, printed paper does
not support interactive plots. Check the online html version instead to
see the actual result or run the code on a computer.
4.7 Timeseries
Behavior is something that usually depends on time. Thus, being able
to visualize timeseries data is essential. To illustrate how timeseries data
can be plotted, I will use the ggplot package and the HAND GESTURES
dataset. Recall that the data was collected with a tri-axial accelerome-
ter, thus, for each hand gesture we have 3-dimensional timeseries. Each
dimension represents one of the x, y, and z axes. First, we read one of
the text files that stores a hand gesture from user 1. Each column rep-
resents an axis. Then, we need to do some formatting. We will create a
data frame with three columns. The first one is a timestep represented
as integers from 1 to the number of points per axis. The second column
is a factor that represents the axis x, y, or z. The last column contains
the actual values.
Note that the last column (values) contains the values of all axes instead
of having one column per axis. Now we can use the ggplot() function.
The lines are colored by type of axis and this is specified with colour =
type. The type column should be a factor. The line type is also dependent
on the type of axis and is specified with linetype = type. The resulting
plot is shown in Figure 4.6.
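A sketch of such a call, assuming the formatted data frame is called dat and has the columns timestep, type, and values:

library(ggplot2)

# One line per axis; color and line type depend on the axis.
ggplot(dat, aes(x = timestep, y = values, colour = type, linetype = type)) +
  geom_line() +
  labs(x = "timestep", y = "acceleration")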
library(dygraphs)
2 For a comprehensive list of available features of the dygraph package, the reader is advised to check its demos website: https://rstudio.github.io/dygraphs/index.html
Then we can generate a minimal plot with one line of code with:
dygraph(dataset)
If you run the code, you will be able to zoom in by clicking and dragging
over a region. A double click will restore the zoom. It is possible to add
a lot of customization to the plots. For example, the following code adds
a text title, fills the area under the lines, adds a point of interest line,
and shades the region between 30 and 40.
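A sketch of those customizations (the title, the position of the point of interest, and other specific values are illustrative):

dygraph(dataset, main = "Timeseries example") %>%
  dyOptions(fillGraph = TRUE) %>%                        # fill the area under the lines
  dyEvent(10, "point of interest", labelLoc = "bottom") %>%
  dyShading(from = 30, to = 40)                          # shade the region between 30 and 40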
iterative_mds.R
One way to explore a dataset is to plot its data points and check whether points of the
same class are closer compared to points from different classes. This can
give you an idea of the difficulty of the problem at hand. If points of
the same class are very close and grouped together then, it is likely that
a classification model will not have trouble separating the data points.
But how do we plot such relationships with high dimensional data? One
method is by using multidimensional scaling (MDS) which consists of a
set of techniques aimed at reducing the dimensionality of data so it can
be visualized in 2D or 3D. The objective is to plot the data such that
the original distances between pairs of points are preserved in a given
lower dimension 𝑑.
There exist several MDS methods but most of them take a distance
matrix as input (for example, Euclidean distance). In R, generating a
distance matrix from a set of points is easy. As an example, let’s generate
some sample data points.
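For example (the values are random, so the distances will not be exactly the same as the ones printed below):

# Three sample points with two coordinates each.
set.seed(1234)
df <- data.frame(v1 = runif(3), v2 = runif(3))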
dist(df)
#> 1 2
#> 2 0.09998603
#> 3 0.48102824 0.42486143
The output is the Euclidean distance between the pairs of rows (1, 2),
(1, 3) and (2, 3).
We can also reduce the data into 3 dimensions and use the scatterplot3d
package to generate a 3D scatter plot:
library(scatterplot3d)
fit <- cmdscale(d,k = 3)
x <- fit[,1]; y <- fit[,2]; z <- fit[,3]
scatterplot3d(x, y, z,
xlab = "",
ylab = "",
zlab = "",
main="Accelerometer features in 3D",
pch=19,
color=cols,
tick.marks = F,
cex.symbols = 0.5,
cex.lab = 0.7,
mar = c(1,0,1,0))
legend("topleft",legend = labels,
pch=19,
col=unique(cols),
cex=0.7,
horiz = F)
From those plots, it can be seen that the different points are more or less
grouped together based on the type of activity. Still, there are several
points with no clear grouping which would make them difficult to classify.
In section 3.4 of chapter 3, we achieved a classification accuracy of 85%
when using only the accelerometer data.
4.9 Heatmaps
Heatmaps are a good way to visualize the ‘intensity’ of events. For ex-
ample, a heatmap can be used to depict website interactions by overlap-
ping colored pixels relative to the number of clicks. This visualization
eases the process of identifying the most relevant sections of the given
website.
source("auxiliary_eda.R")
# Normalize matrices.
res <- normalizeMatrices(map.control, map.condition)
Then, the pheatmap package [Kolde, 2019] can be used to create the actual
heatmap from the matrices.
library(pheatmap)
library(gridExtra)
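A sketch of how the two heatmaps could be drawn in a single figure (the structure of res and the exact pheatmap() arguments are assumptions):

p1 <- pheatmap(res$control, cluster_rows = FALSE, cluster_cols = FALSE,
               main = "control group", silent = TRUE)
p2 <- pheatmap(res$condition, cluster_rows = FALSE, cluster_cols = FALSE,
               main = "condition group", silent = TRUE)
# Arrange both heatmaps one above the other.
grid.arrange(p1$gtable, p2$gtable, nrow = 2)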
Figure 4.10 shows the two heatmaps. Here, we can see that overall, the
condition group has lower activity levels. It can also be observed that
people in the control group wake up at around 6:00, whereas in the
condition group the activity does not start to increase until around 7:00
in the morning. Activity levels around midnight look higher during
weekends compared to weekdays.
All in all, heatmaps provide a good way to look at the overall patterns of
a dataset and can provide some insights to further explore some aspects
of the data.
FIGURE 4.10 Activity level heatmaps for the control and condition group.

4.10 Automated EDA

There are packages that provide some degree of automation for exploring a dataset, for example, the DataExplorer package. Its plot_str() function plots the structure of a data frame. The following example uses the HOME TASKS dataset; the complete code is available in script EDA.R. The output is shown in Figure 4.11. This plot shows the number of observations,
the number of variables, the variable names, and their types.
library(DataExplorer)
dataset <- read.csv(file.path(datasets_path, "home_tasks/sound_acc.csv"))
plot_str(dataset)
introduce(dataset)
#> rows                   1386
#> columns                  29
#> discrete_columns          1
#> continuous_columns       28
#> all_missing_columns       0
#> total_missing_values      0
#> complete_rows          1386
#> total_observations    40194
#> memory_usage         328680
library(inspectdf)
show_plot(inspect_cat(dataset))
Here, we can see that the most frequent class is ‘eat_chips’ and the
least frequent one is ‘sweep’. We can confirm this by printing the actual
counts:
table(dataset$label)
#> brush_teeth eat_chips mop_floor sweep type_on_keyboard
#> 180 282 181 178 179
#> wash_hands watch_tv
#> 180 206
4.11 Summary
One of the first tasks in a data analysis pipeline is to familiarize yourself
with the data. There are several techniques and tools that can provide
support during this process.
• Talking with field experts can help you to better understand the data.
• Generating summary statistics is a good way to gain general insights
of a dataset. In R, the summary() function will compute such statistics.
• For classification problems, one of the first steps is to check the distri-
bution of classes.
• In multi-user settings, generating a user-class sparsity matrix can
be useful to detect missing classes per user.
• Boxplots and correlation plots are used to understand the behavior
of the variables.
• R, has several packages for creating interactive plots such as dygraphs
for timeseries and qtlcharts for correlation plots.
• Multidimensional scaling (MDS) can be used to project high-
dimensional data into 2 or 3 dimensions so they can be plotted.
• R has some packages like DataExplorer that provide some degree of
automation for exploring a dataset.
5
Preprocessing Behavioral Data
preprocessing.R
Behavioral data comes in many flavors and forms, but when training
predictive models, the data needs to be in a particular format. Some
sources of variation when collecting data are:
• Sensors’ format. Each type of sensor and manufacturer stores data
in a different format. For example, .csv files, binary files, images, pro-
prietary formats, etc.
• Sampling rate. The sampling rate is how many measurements are
taken per unit of time. For example, a heart rate sensor may return
a single value every second, thus, the sampling rate is 1 Hz. An ac-
celerometer that captures 50 values per second has a sampling rate of
50 Hz.
• Scales and ranges. Some sensors may return values in degrees (e.g.,
a temperature sensor) while others may return values in some other
scale, for example, in centimeters for a proximity sensor. Furthermore,
ranges can also vary. That is, a sensor may capture values in the range
of 0–1000, for example.
During the data exploration step (chapter 4) we may also find that values
are missing, inconsistent, noisy, and so on, thus, we also need to take care
of that.
This chapter provides an overview of some common methods used to
clean and preprocess the data before one can start training reliable mod-
els.
Some of these preprocessing steps can result in information injection
if not implemented correctly, and this can cause overfitting. That is, in-
advertently transferring information from the train set to the test set.
This is something undesirable because both sets need to be indepen-
dent so the generalization performance can be estimated accurately.
You can find more details about information injection and how to avoid
it in section 5.5 of this chapter.
FIGURE 5.1 Device placed on the neck of the sheep. (Author: Lady-
ofHats. Source: Wikipedia (CC0 1.0)).
library(naniar)
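A sketch of the call that produces Figure 5.2 (the name of the data frame holding the sheep readings is an assumption):

# Count and plot the number of missing values per variable.
gg_miss_var(df)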
Figure 5.2 shows the resulting output. The plot shows that there are
missing values in four variables: pressure, cz, cy, and cx. The last three
correspond to the compass (magnetometer). For pressure, the number of
missing values is more than 2 million! For the rest, it is a bit less (more
than 1 million).
To further explore this issue, we can plot each observation in a row with
the function vis_miss().
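A sketch of that call:

# Plot the position of missing values (black) for every row and variable.
# The warning for large data is disabled because this dataset has millions of rows.
vis_miss(df, warn_large_data = FALSE)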
Figure 5.3 shows every observation per row and missing values are black
colored (if any). From this image, it seems that missing values are sys-
tematic. It looks like there is a clear stripes pattern, especially for the
compass variables. Based on these observations, it doesn’t look like ran-
dom sensor failures or random noise.
If we explore the data frame’s values, for example with the RStudio
viewer (Figure 5.4), two things can be noted. First, for the compass
values, there is a missing value for each present value. Thus, it looks like
50% of compass values are missing. For pressure, it seems that there are
7 missing values for each available value.
FIGURE 5.4 Displaying the data frame in RStudio. Source: Data from Kamminga, MSc J.W. (University of Twente) (2017): Generic online animal activity recognition on collar tags. DANS. https://doi.org/10.17026/dans-zp6-fmna
So, what could be the root cause of those missing values? Remember
that at the beginning of this chapter it was mentioned that one of the
sources of variation is sampling rate. If we look at the data set
documentation, all sensors have a sampling rate of 200 Hz except for
the compass and the pressure sensor. The compass has a sampling rate
of 100 Hz. That is half compared to the other sensors! This explains
why 50% of the rows are missing. Similarly, the pressure sensor has a
sampling rate of 25 Hz. By visualizing and then inspecting the missing
data, we have just found out that the missing values are not caused by
random noise or sensor failures but because some sensors are not as fast
as others!
Now that we know there are missing values we need to decide what to do
with them. The following subsection lists some ways to deal with missing
values.
5.1.1 Imputation
Imputation is the process of filling in missing values. One of the reasons
for imputing missing values is that some predictive models cannot deal
with missing data. Another reason is that it may help in increasing the
predictions’ performance, for example, if we are trying to predict the
sheep behavior from a discrete set of categories based on the inertial
data. There are different ways to handle missing values:
• Discard rows. If the rows with missing values are not too many, they
can simply be discarded.
• Mean value. Fill the missing values with the mean value of the cor-
responding variable. This method is simple and can be effective. One
of the problems with this method is that it is sensitive to outliers (as
it is the arithmetic mean).
• Median value. The median is robust against outliers, thus, it can be
used instead of the arithmetic mean to fill the gaps.
• Replace with the closest value. For timeseries data, as is the case
of the sheep readings, one could also replace missing values with the
closest known value.
• Predict the missing values. Use the other variables to predict the
missing one. This can be done by training a predictive model. A regres-
sor if the variable is numeric or a classifier if the variable is categorical.
Another problem with the mean and median values is that they can be
correlated with other variables, for example, with the class that we want
to predict. One way to avoid this, is to compute the mean (or median)
for each class, but still, some hidden correlations may bias the estimates.
In R, the simputation package [van der Loo, 2019] has implemented var-
ious imputation techniques including group-wise median imputation, hot-deck imputation, 𝑘-nearest neighbors imputation, and model-based imputation with linear models, among others.
library(simputation)
Originally, the missing values are encoded as NaN but in order to use
the simputation package functions, we need them as NA. First, NaNs are
replaced with NA. The first argument of impute_lm() is a data frame and
the second argument is a formula. We discard the first 4 variables of the
data frame since we do not want to use them as predictors. The left-
hand side of the formula (everything before the ~ symbol) specifies the
variables we want to impute. The right-hand side specifies the variables
used to build the linear models. The ‘.’ indicates that we want to use all
variables while the ‘-’ is used to specify variables that we do not want to
include. The vignettes1 of the package contain more detailed examples.
1 https://cran.r-project.org/web/packages/simputation/vignettes/intro.html
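As a toy illustration of impute_lm() and the formula conventions just described (the data frame and column names here are made up and are not the sheep dataset):

library(simputation)

# Small data frame with two missing 'pressure' values.
d <- data.frame(ax = rnorm(10), ay = rnorm(10), az = rnorm(10))
d$pressure <- 2 * d$ax + rnorm(10, sd = 0.1)
d$pressure[c(2, 5)] <- NA
# Impute 'pressure' using all variables except 'az' as predictors.
d <- impute_lm(d, pressure ~ . - az)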
The mean, median, etc. and the predictive models to infer missing
values should be trained using data only from the train set to avoid
information injection.
5.2 Smoothing
Smoothing comprises a set of algorithms with the aim of highlighting
patterns in the data or as a preprocessing step to clean the data and
remove noise. These methods are widely used on timeseries data but also
with spatio-temporal data such as images. With timeseries data, they
are often used to emphasize long-term patterns and reduce short-term
signal artifacts. For example, in Figure 5.5 a stock chart was smoothed
using two methods: moving average and exponential moving average.
The smoothed versions make it easier to spot the overall trend rather
than focusing on short-term variations.
The most common smoothing method for timeseries is the simple mov-
ing average. With this method, the first element of the resulting
smoothed series is computed by taking the average of the elements within
a window of predefined size. The window’s position starts at the first ele-
ment of the original series. The second element is computed in the same
way but after moving the window one position to the right. Figure 5.6
shows this procedure on a series with 5 elements and a window size of
size 3. After the third iteration, it is not possible to move the window
one more step to the right while covering 3 elements since the end of
the timeseries has been reached. Because of this, the smoothed series will
have some missing values at the end. Specifically, it will have 𝑤 − 1 fewer
elements where 𝑤 is the window size. A simple solution is to compute
the average of the elements covered by the window even if they are less
than the window size.
In the previous example the average is taken from the elements to the
right of the pointer. There is a variation called centered moving average
in which the center point of the window has the same elements to the
2 https://commons.wikimedia.org/wiki/File:Moving_Average_Types_comparison_-_Simple_and_Exponential.png
FIGURE 5.5 Stock chart with two smoothed versions. One with moving average and the other one with an exponential moving average. (Author: Alex Kofman. Source: Wikipedia (CC BY-SA 3.0) [https://creativecommons.org/licenses/by-sa/3.0/legalcode]).
FIGURE 5.6 Simple moving average step by step with window size =
3. Top: original array; bottom: smoothed array.
left and right (Figure 5.7). Note that with this version of moving average
some values at the beginning and at the end will be empty. Also note
that the window size should be odd. In practice, both versions produce
very similar results.
FIGURE 5.7 Centered moving average step by step with window size
= 3.
# Simple moving average (the function name and default window size are illustrative).
movingAvg <- function(x, w = 5){
  n <- length(x)
  # Output vector; the last w-1 positions will remain NA.
  smoothedX <- rep(NA, n)
  for(i in 1:(n-w+1)){
    smoothedX[i] <- mean(x[i:(i-1+w)])
  }
  return(smoothedX)
}
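Example usage on a synthetic noisy signal with a window of size 21 (the window size used for Figure 5.8):

x <- sin(seq(0, 10, length.out = 500)) + rnorm(500, sd = 0.2)
smoothed <- movingAvg(x, w = 21)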
Figure 5.8 shows the result after plotting both the original vector and the
smoothed one. It can be observed that many of the small peaks are no
longer present in the smoothed version. The window size is a parameter
that needs to be defined by the user. If it is set too large some important
information may be lost from the signal.
FIGURE 5.8 Original time series and smoothed version using a moving
average window of size 21.
5.3 Normalization
Having variables on different scales can have an impact during learning
and at inference time. Consider a study where the data was collected
using a wristband that has a light sensor and an accelerometer. The
measurement unit of the light sensor is lux whereas the accelerometer’s
is 𝑚/𝑠2 . After inspecting the dataset, you realize that the min and max
values of the light sensor are 0 and 155, respectively. The min and max
values for the accelerometer are −0.4 and 7.45, respectively. Why is this a
problem? Well, several learning methods such as 𝑘-NN and Nearest centroid
are based on distances; thus, the distances will be dominated by the variables
with bigger scales. Furthermore, other methods like neural networks (covered
in chapter 8) are also affected by different scales. They have a harder
time learning their parameters (weights) when data is not normalized.
On the other hand, some methods are not affected, for example, tree-
based learners such as decision trees and random forests. Since most of
the time you may want to try different methods, it is a good idea to
normalize your predictor variables.
A common normalization technique is to scale all the variables between
0 and 1. Suppose there is a numeric vector 𝑥 that you want to normalize
between 0 and 1. Let 𝑚𝑎𝑥(𝑥) and 𝑚𝑖𝑛(𝑥) be the maximum and mini-
mum values of 𝑥. The following can be used to normalize the 𝑖𝑡ℎ value
of 𝑥:
$$z_i = \frac{x_i - min(x)}{max(x) - min(x)} \qquad (5.1)$$
where 𝑧𝑖 is the new normalized 𝑖𝑡ℎ value. Thus, the formula is applied to
every value in 𝑥. The 𝑚𝑎𝑥(𝑥) and 𝑚𝑖𝑛(𝑥) values are parameters learned
from the data. Notice that if you split your data into training and
test sets, the max and min values (the parameters) must be learned only from
the train set and then used to normalize both the train and test set. This
is to avoid information injection (section 5.5). Be also aware that after
the parameters are learned from the train set, and once the model is
deployed in production, it is likely that some input values will be ‘out of
range’. If the train set is not very representative of what you will find in
real life, some values will probably be smaller than the learned 𝑚𝑖𝑛(𝑥)
and some will be greater than the learned 𝑚𝑎𝑥(𝑥). Even if the train set
Since label is a categorical variable, the class counts are printed. For the
three remaining variables, we get some statistics including their min and
max values. As we can see, the min value of v1_mfcc1 is very different
from the min value of v1_mfcc2 and the same is true for the maximum
values. Thus, we want all variables to be between 0 and 1 in order to use
classification methods sensitive to different scales. Let’s assume we want
to train a classifier with this data, so we divide it into train and test sets. The following function normalizes each numeric variable using the min and max values learned from the train set:
normalize <- function(trainset, testset){
  # Iterate columns; min and max are learned from the train set only.
  for(i in 1:ncol(trainset)){
    if(!is.numeric(trainset[, i])) next  # skip categorical columns such as the label
    minv <- min(trainset[, i]); maxv <- max(trainset[, i])
    trainset[, i] <- (trainset[, i] - minv) / (maxv - minv)
    testset[, i] <- (testset[, i] - minv) / (maxv - minv)
  }
  return(list(train=trainset, test=testset))
}
Now we can use the previous function to normalize the train and test
sets. The function returns a list of two elements: a normalized train and
test sets.
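A sketch of the call (trainset and testset are assumed to be the data frames produced by the split above):

res <- normalize(trainset, testset)
trainset <- res$train
testset <- res$test
summary(trainset)   # the numeric variables now lie between 0 and 1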
Now, the variables on the train set are exactly between 0 and 1 for all
numeric variables. For the test set, not all min values will be exactly
0 but a bit higher. Conversely, some max values will be lower than 1.
This is because the test set may have a min value that is greater than
the min value of the train set and a max value that is smaller than the
max value of the train set. Note, however, that this is not guaranteed in
general: if the test set contains values outside the range observed in the
train set, the corresponding normalized values will fall outside 0 and 1.
5.4 Imbalanced Classes
Class imbalance occurs when a dataset has many more instances of some classes than others. For example, a medical dataset may contain thousands of images from healthy tissue but just a dozen with signs of cancer. Of
course, having just a few cases with diseases is a good thing for the
world! but not for machine learning methods. This is because predictive
models will try to learn their parameters such that the error is reduced,
and most of the time this error is based on accuracy. Thus, the models
will be biased towards making correct predictions for the majority classes
(the ones with higher counts) while paying little attention to minority
classes. This is a problem because for some applications we are more
interested in detecting the minority classes (illegal transactions, cancer
cases, etc.).
Suppose a given database has 998 instances with class ‘no cancer’ and
only 2 instances with class ‘cancer’. A trivial classifier that always pre-
dicts ‘no cancer’ will have an accuracy of 99.8% but will not be able to
detect any of the ‘cancer’ cases! So, what can we do?
• Collect more data from the minority class. In practice, this can
be difficult, expensive, etc. or just impossible because the study was
conducted a long time ago and it is no longer possible to replicate the
context.
• Delete data from the majority class. Randomly discard instances
from the majority class. In the previous example, we could discard 996
instances of type ‘no cancer’. The problem with this is that we end up
with insufficient data to learn good predictive models. If you have a
huge dataset this can be an option, but in practice, this is rarely the
case and you have the risk of having underrepresented samples.
• Create synthetic data. One of the most common solutions is to cre-
ate synthetic data from the minority classes. In the following sections
two methods that do that will be discussed: random oversampling and
Synthetic Minority Oversampling Technique (SMOTE).
• Adapt your learning algorithm. Another option is to use an al-
gorithm that takes into account class counts and weights them ac-
cordingly. This is called cost-sensitive classification. For example, the
rpart() method to train decision trees has a weights parameter which
can be used to assign more weight to minority classes. When train-
ing neural networks it is also possible to assign different weights to
different classes.
5.4.1 Random Oversampling
shiny_random-oversampling.Rmd
This method consists of duplicating data points from the minority class.
The following code will create an imbalanced dataset with 200 instances
of class ‘class1’ and only 15 instances of class ‘class2’.
set.seed(1234)
If we want to exactly balance the class counts, we will need 185 additional
instances of type ‘class2’. We can use our well known sample() function
to pick 185 points from data frame df2 (which contains only instances
of class ‘class2’) and store them in new.points. Notice the replace = T
parameter. This allows the function to pick repeated elements. Then,
the new data points are appended to the imbalanced data set which now
becomes balanced.
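A minimal sketch of the oversampling step just described (df2 and new.points follow the text; the name of the imbalanced data frame and its label column are assumptions):

df2 <- imbalancedDf[imbalancedDf$label == "class2", ]           # minority-class instances
new.points <- df2[sample(nrow(df2), size = 185, replace = T), ] # pick 185 rows, with repetition
balancedDf <- rbind(imbalancedDf, new.points)                   # append to the original data
table(balancedDf$label)                                         # both classes now have 200 instances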
4 https://shiny.rstudio.com/
5.4.2 SMOTE
shiny_smote-oversampling.Rmd
SMOTE is another method that can be used to augment the data points
from the minority class [Chawla et al., 2002]. One of the limitations of
random oversampling is that it creates duplicates. This has the effect of
having fixed boundaries and the classifiers can overspecialize. To avoid
this, SMOTE creates entirely new data points.
SMOTE operates on the feature space (on the predictor variables). To
generate a new point, take the difference between a given point 𝑎 (taken
from the minority class) and one of its randomly selected nearest neigh-
bors 𝑏. The difference is multiplied by a random number between 0 and
1 and added to 𝑎. This has the effect of selecting a random point along the
line segment between 𝑎 and 𝑏.
# Percent to oversample.
N <- 1200
The parameter N is set to 1200. This will create 12 new data points
for every minority class instance (15). Thus, the method will return 180
instances. In this case, 𝑘 is set to 5. Finally, the new points are appended
to the imbalanced dataset having a total of 195 samples of class ‘class2’.
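To make the idea concrete, the generation of a single synthetic point can be sketched as follows (the vectors a and b are illustrative values, not data from the actual script):

a <- c(2.0, 3.0)                      # a minority-class point
b <- c(2.5, 3.8)                      # one of its k nearest minority-class neighbors
new.point <- a + runif(1) * (b - a)   # random point along the segment between a and b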
Again, a shiny app is included with this chapter’s code. Figure 5.11 shows
the distribution of the original points and after applying SMOTE. Note
how the boundary of ‘class2’ changes after applying SMOTE. It slightly
spans in all directions. This is particularly visible in the lower right cor-
ner. This boundary expansion is what allows the classifiers to generalize
better as compared to training them using random oversampled data.
5.5 Information Injection
Suppose that as one of the preprocessing steps, you need to subtract the
mean value of a feature for each instance. For now, suppose a dataset
has a single feature 𝑥 of numeric type and a categorical response variable
𝑦. The dataset has 𝑛 rows. As a preprocessing step, you decide that you
need to subtract the mean of 𝑥 from each data point. Since you want to
predict 𝑦 given 𝑥, you train a classifier by splitting your data into train
and test sets as usual. So you proceed with the steps depicted in Figure
5.12.
First, (a) you compute the 𝑚𝑒𝑎𝑛 value of the variable 𝑥 from the
entire dataset. This 𝑚𝑒𝑎𝑛 is known as the parameter. In this case, there
is only one parameter but there could be several. For example, we could
additionally need to compute the standard deviation. Once we know the
mean value, the dataset is divided into train and test sets (b). Finally,
the 𝑚𝑒𝑎𝑛 is subtracted from each element in both train and test sets (c).
Without realizing it, we have allowed information to leak between the train
and test sets! But, how did this happen? Well, the mean parameter was
computed using information from the entire dataset. Then, that 𝑚𝑒𝑎𝑛
parameter was used on the test set, but it was calculated using data
points that also belong to that same test set!
Figure 5.13 shows how to correctly do the preprocessing to avoid informa-
tion injection. The dataset is first split (a). Then, the 𝑚𝑒𝑎𝑛 parameter
is calculated only with data points from the train set. Finally, the mean
parameter is subtracted from both sets. Here, the mean contains infor-
mation only from the train set.
In the previous example, we assumed that the dataset was split into train
and test sets only once. The same idea applies when performing 𝑘-fold
cross-validation. In each of the 𝑘 iterations, the preprocessing parameters
need to be learned only from the train split.
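A minimal sketch of the correct procedure (the variable and column names here are assumptions for illustration):

idx <- sample(nrow(dataset), size = 0.7 * nrow(dataset))  # (a) split first
trainset <- dataset[idx, ]
testset <- dataset[-idx, ]
m <- mean(trainset$x)            # (b) parameter learned from the train set only
trainset$x <- trainset$x - m     # (c) apply it to the train set
testset$x <- testset$x - m       #     and to the test set, using the same parameter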
You should be aware of the dummy variable trap which means that one
variable can be predicted from the others. For example, if the possible
values are just male and female, then if the dummy variable for male
is 1, we know that the dummy variable for female must be 0. The
The caret package has a function dummyVars() that can be used to one-hot
encode the categorical variables of a data frame. Since the STUDENTS’
MENTAL HEALTH dataset [Nguyen et al., 2019] has several categorical
variables, it can be used to demonstrate how to apply dummyVars(). This
dataset collected at a University in Japan contains survey responses from
students about their mental health and help-seeking behaviors. We begin
by loading the data.
Figure 5.16 shows the output plot. We can see that the last rows contain
many missing values so we will discard them and only keep the first rows
(1–268).
If we inspect the resulting data frame (Figure 5.17) we see that it has
3 variables, one for each possible value: Long, Medium, and Short. If
this variable is used as a predictor variable, we should delete one of its
columns to avoid the dummy variable trap. We can do this by setting
the parameter fullRank = TRUE.
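As a sketch (the data frame name and the column name Stay_Cate are assumptions; Stay_Cate stands for the categorical variable with values ‘Long’, ‘Medium’, and ‘Short’ discussed above):

library(caret)
dummyObj <- dummyVars(~ Stay_Cate, data = dataset, fullRank = TRUE)
encoded <- data.frame(predict(dummyObj, newdata = dataset))
head(encoded)   # one column per category, with one category dropped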
In this situation, the column with ‘Long’ was discarded (Figure 5.18).
If you want to one-hot encode all variables at once you can use ~ . as
the formula. But be aware that the dataset may have some categories
encoded as numeric and thus will not be transformed. For example,
the Age_cate encodes age categories but the categories are represented
as integers from 1 to 5. In this case, it may be ok not to encode this
variable since lower integer numbers also imply smaller ages and bigger
integer numbers represent older ages. If you still want to encode this
variable you could first convert it to character by appending a letter at
the beginning. Sometimes you should encode a variable, for example, if
it represents colors. In that situation, it does not make sense to leave it
as numeric since there is no semantic order between colors.
5.7 Summary
Programming functions that train predictive models expect the data to
be in a particular format. Furthermore, some methods make assumptions
about the data like having no missing values, having all variables in the
same scale, and so on. This chapter presented several commonly used
methods to preprocess datasets before using them to train models.
• When collecting data from different sensors, we can face several sources
of variation like sensors’ format, different sampling rates, differ-
ent scales, and so on.
• Some preprocessing methods can lead to information injection. This
happens when information is leaked between the train and test sets, for example, when preprocessing parameters are learned from the entire dataset.
• Missing values is a common problem in many data analysis tasks. In
R, the naniar package can be used to spot missing values.
6
Discovering Behaviors with Unsupervised Learning
So far, we have been working with supervised learning methods, that is,
models for which the training instances have two elements: (1) a set of in-
put values (features) and (2) the expected output (label). As mentioned
in chapter 1, there are other types of machine learning methods and one
of those is unsupervised learning which is the topic of this chapter.
In unsupervised learning, the training instances do not have a response
variable (e.g., a label). Thus, the objective is to extract knowledge from
the available data without any type of guidance (supervision). For exam-
ple, given a set of variables that characterize a person, we would like to
find groups of people with similar behaviors. For physical activity behav-
iors, this could be done by finding groups of very active people versus
finding groups of people with low physical activity. Those groups can
be useful for delivering targeted suggestions or services thus, enhancing
and personalizing the user experience.
This chapter starts with one of the most popular unsupervised learning
algorithms: 𝑘-means clustering. Next, an example of how this tech-
nique can be applied to find groups of students with similar characteris-
tics is presented. Then, association rules mining is presented, which
is another type of unsupervised learning method. Finally, association
rules are used to find criminal patterns from a homicide database.
6.1 𝑘-means Clustering
kmeans_steps.R
This is one of the most commonly used unsupervised methods due to its
simplicity and efficacy. Its objective is to find groups of points such that
points in the same group are similar and points from different groups are
as dissimilar as possible. The number of groups 𝑘 needs to be defined a
priori. The method is based on computing distances to centroids. The
centroid of a set of points is computed by taking the mean of each of
their features. The 𝑘-means algorithm is as follows:
Generate k centroids at random.
Repeat until no change or max iterations:
    Assign each data point to the closest centroid.
    Update centroids.
To measure the distance between a data point and a centroid, the Eu-
clidean distance is typically used, but other distances can be used as well
depending on the application. As an example, let’s cluster user responses
from the STUDENTS’ MENTAL HEALTH dataset. This database con-
tains questionnaire responses about depression, acculturative stress, so-
cial connectedness, and help-seeking behaviors from students at a Univer-
sity in Japan. To demonstrate how 𝑘-means works, we will only choose
two variables so we can plot the results. The variables are ToAS (To-
tal Acculturative Stress) and ToSC (Total Social Connectedness). The
ToAS measures the emotional challenges when adapting to a new culture
while ToSC measures emotional distance with oneself and other people.
For the clustering, the parameter 𝑘 will be set to 3, that is, we want
to group the points into 3 disjoint groups. The code that implements
the 𝑘-means algorithm can be found in the script kmeans_steps.R. The
algorithm begins by selecting 3 centroids at random. Figure 6.1 shows
a scatterplot of the variables ToAS and ToSC along with the random
centroids.
Next, at the first iteration, each point is assigned to the closest centroid.
This is depicted in Figure 6.2 (top left). Then, the centroids are updated
(moved) based on the new assignments. In the next iteration, the points
are reassigned to the closest centroids and so on. Figure 6.2 shows the
first 4 iterations of the algorithm.
From iteration 1 to 2 the centroids moved considerably. After that, they
began to stabilize. Formally, the algorithm tries to minimize the total
within cluster variation of all clusters. The cluster variation of a single
cluster 𝐶𝑘 is defined as:

$$W(C_k) = \sum_{x_i \in C_k} \|x_i - \mu_k\|^2 \qquad (6.1)$$

where 𝜇𝑘 is the centroid of cluster 𝐶𝑘. The total within cluster variation of all clusters is then:

$$TWCV = \sum_{i=1}^{k} W(C_i) \qquad (6.2)$$
that is, the sum of all within-cluster variations across all clusters. The
objective is to find the 𝜇𝑘 centroids that make 𝑇 𝑊 𝐶𝑉 minimal. Find-
ing the global optimum is a difficult problem. However, the iterative
algorithm described above often produces good approximations.
group_students.R
,"APD","AHome","APH","Afear",
"ACS","AGuilt","ToAS")
The first argument of kmeans() is a data frame or a matrix and the sec-
ond argument the number of clusters. Figure 6.4 shows the resulting
clustering. The kmeans() method returns an object that contains several
components including cluster that stores the assigned cluster for each
data point and centers that stores the centroids.
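A minimal sketch of the call (normdf is assumed to be the normalized data frame with the selected variables; the value of 𝑘 here is just an example):

set.seed(1234)
clusters <- kmeans(normdf, 4)    # group the points into 4 clusters
clusters$centers                 # the learned centroids
table(clusters$cluster)          # number of points assigned to each cluster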
Choosing the number of clusters 𝑘 is not always straightforward; it depends on the task at hand, but there is a method called the Silhouette index that can be used to select the optimal 𝑘 based on an optimality criterion. This index is presented in the next section.
6.2 The Silhouette Index
The Silhouette index measures how well each point fits its assigned cluster. Let 𝑎(𝑖) be the mean distance from point 𝑖 to the other points in its own cluster, and 𝑏(𝑖) the smallest mean distance from point 𝑖 to the points of any other cluster. The silhouette index of point 𝑖 is then:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad (6.3)$$
One nice thing about this index is that it can be presented visually. To
generate a silhouette plot, use the generic plot() function and pass the
object returned by silhouette().
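For instance, the plot for 𝑘 = 4 can be generated like this (a sketch mirroring the 𝑘 = 7 code shown below; silhouette() comes from the cluster package):

library(cluster)
set.seed(1234)
clusters <- kmeans(normdf, 4)
si <- silhouette(clusters$cluster, dist(normdf))
plot(si, cex.names = 0.6, col = 1:4,
     main = "Silhouette plot, k=4",
     border = NA)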
Figure 6.8 shows the silhouette plot when 𝑘 = 4. The horizontal lines
represent the individual silhouette indices. In this plot, all of them are
positive. The height of each cluster gives a visual idea of the number of
data points contained in it with respect to other clusters. We can see
for example that cluster 2 is the smallest one. On the right side, is the
number of points in each cluster and their average silhouette index. At
the bottom, the total silhouette index is printed (0.35). We can try to
cluster the points into 7 groups instead of 4 and see what happens.
set.seed(1234)
clusters <- kmeans(normdf, 7)
si <- silhouette(clusters$cluster, dist(normdf))
plot(si, cex.names=0.6, col = 1:7,
main = "Silhouette plot, k=7",
border=NA)
Here, cluster 2 and 4 have data points with negative indices and the
overall score is 0.26. This suggests that 𝑘 = 4 produces more coherent
clusters as compared to 𝑘 = 7.
Let’s compute all those metrics using an example. The following table
shows a synthetic example database of transactions from shoppers with
unhealthy behaviors.
$$lift(\{soda\} \Rightarrow \{icecream\}) = \frac{supp(\{soda, icecream\})}{supp(\{soda\})\, supp(\{icecream\})} = \frac{2/10}{(7/10)(3/10)} = 0.95$$
However, this is not the only application of association rules. There are
other problems that can be structured as transactions of items. For ex-
ample in medicine, diseases can be seen as transactions and symptoms
as items. Thus, one can apply association rules algorithms to find symp-
toms and disease relationships. Another application is in recommender
systems. Take, for example, movies. Transactions can be the set of movies
watched by every user. If you watched a movie 𝑚 then, the recommender
system can suggest another movie that co-occurred frequently with 𝑚
and that you have not watched yet. Furthermore, other types of rela-
tional data can be transformed into transaction-like structures to find
patterns and this is precisely what we are going to do in the next section
to mine criminal patterns.
crimes_process.R crimes_rules.R
years). After these cleaning and preprocessing steps, the dataset has 3
columns and 328238 rows (see Figure 6.11). The script used to perform
the preprocessing is crimes_process.R.
Now, we have a data frame that contains only the relevant information.
Each row will be used to generate one transaction. An example transac-
tion may be {R.Wife, Knife, Adult}. This one represents the case where
the perpetrator is an adult who used a knife to kill his wife. Note the
‘R.’ at the beginning of ‘Wife’. This ‘R.’ was added for clarity in order
to identify that this item is a relationship. One thing to note is that
every transaction will consist of exactly 3 items. This is a bit different
than the market basket case in which every transaction can include a
varying number of products. Although this item-size constraint was a
design decision based on the structure of the original data, this will not
prevent us from performing the analysis to find interesting rules.
To find the association rules, the arules package [Hahsler et al., 2019] will
be used. This package has an interface to an efficient implementation in
C of the Apriori algorithm. This package needs the transactions to be
as.character(colnames(M))
#> [1] "R.Acquaintance" "R.Wife" "R.Stranger"
#> [4] "R.Girlfriend" "R.Ex-Husband" "R.Brother"
#> [7] "R.Stepdaughter" "R.Husband" "R.Friend"
#> [10] "R.Family" "R.Neighbor" "R.Father"
#> [13] "R.In-Law" "R.Son" "R.Ex-Wife"
#> [16] "R.Boyfriend" "R.Mother" "R.Sister"
#> [19] "R.Common-Law Husband" "R.Common-Law Wife" "R.Stepfather"
#> [22] "R.Stepson" "R.Stepmother" "R.Daughter"
#> [25] "R.Boyfriend/Girlfriend" "R.Employer" "R.Employee"
#> [28] "Blunt Object" "Strangulation" "Rifle"
#> [31] "Knife" "Shotgun" "Handgun"
#> [34] "Drowning" "Firearm" "Suffocation"
#> [37] "Fire" "Drugs" "Explosives"
#> [40] "Fall" "Gun" "Poison"
#> [43] "teen" "adult" "lateAdulthood"
#> [46] "child"
The following snippet shows how to convert the matrix into an arules
transactions object. Before the conversion, the package arules needs
to be loaded. For convenience, the transactions are saved in a file
transactions.RData.
library(arules)
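The conversion itself can be sketched in a couple of lines (M is assumed to be the logical TRUE/FALSE item matrix built in the preprocessing script):

transactions <- as(M, "transactions")            # coerce the binary item matrix into transactions
save(transactions, file = "transactions.RData")  # save for later use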
Now that the database is in the required format we can start the analysis.
The crimes_rules.R script has the code to perform the analysis. First, the
transactions file that we generated before is loaded:
library(arules)
library(arulesViz)
load("transactions.RData")  # the transactions object saved earlier (path assumed)
Note that additionally to the arules package, we also loaded the arulesViz
package [Hahsler, 2019]. This package has several functions to generate
cool plots of the learned rules! A summary of the transactions can be
printed with the summary() function:
# Print summary.
summary(transactions)
The summary shows the total number of rows (transactions) and the
number of columns. It also prints the most frequent items, in this case,
adult with 257026 occurrences, Handgun with 160586, and so on. The
itemset sizes are also displayed. Here, all itemsets have a size of 3 (by
design). Some other summary statistics are also printed.
We can use the itemFrequencyPlot() function from the arules package
to plot the frequency of items.
itemFrequencyPlot(transactions,
type = "relative",
topN = 15,
main = 'Item frequencies')
The type argument specifies that we want to plot the relative frequencies.
Use "absolute" instead to plot the total counts. topN is used to select how
many items are plotted. Figure 6.12 shows the output.
Now it is time to find some interesting rules! This can be done with the
apriori() function as follows:
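A sketch of the call is shown below; the exact support and confidence thresholds used in the book's script are not reproduced here, so the values are placeholders:

resrules <- apriori(transactions,
                    parameter = list(support = 0.001,    # placeholder threshold
                                     confidence = 0.5))  # placeholder threshold
summary(resrules)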
By looking at the summary, we see that the algorithm found 141 rules
that satisfy the support and confidence thresholds. The rule length dis-
tribution is also printed. Here, 45 rules are of size 2 and 96 rules are of
size 3. Then, some standard statistics are shown for support, confidence,
and lift. The inspect() function can be used to print the actual rules.
Rules can be sorted by one of the importance measures. The following
code sorts by lift and prints the first 20 rules. Figure 6.13 shows the
output.
# Print the first n (20) rules with highest lift in decreasing order.
inspect(sort(resrules, by='lift', decreasing = T)[1:20])
The first rule with a lift of 4.27 says that if a homicide was committed
by an adult and the victim was the stepson, then it is likely that a blunt
object was used for the crime. By looking at the rules, one can also note
that whenever blunt object appears either in the lhs or rhs, the victim
was most likely an infant. Another thing to note is that when the victim
was the boyfriend, the crime was likely committed with a knife. This is also
mentioned in the reports ‘Homicide trends in the United States’ [Cooper
et al., 2012]:
The resulting rules can be plotted with the plot() function (see Figure
6.14). By default, it generates a scatterplot with the support in the 𝑥
axis and confidence in the 𝑦 axis colored by lift.
The plot shows that rules with a high lift also have a low support and
confidence. Hahsler [2017] mentioned that rules with high lift typically
have low support. The plot can be customized for example to show the
support and lift in the axes and color them by confidence. The axes can
be set with the measure parameter and the coloring with the shading pa-
rameter. The function also supports different plotting engines including
static and interactive. The following code generates a customized inter-
active plot by setting engine = "htmlwidget". This is very handy if you
want to know which points correspond to which rules. By hovering the
mouse on the desired point the corresponding rule is shown as a tooltip
box (Figure 6.15). The interactive plots also allow you to zoom in on regions by
clicking and dragging.
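A sketch of that call (the parameter values follow the description above):

plot(resrules, measure = c("support", "lift"), shading = "confidence",
     engine = "htmlwidget")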
The arulesViz package has a nice option to plot rules as a graph. This is
done by setting method = "graph". We can also make the graph interactive
for easier exploration by setting engine="htmlwidget". For clarity, the font
size is reduced with cex=0.9. Here we plot the first 25 rules.
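A sketch of the call (how the first 25 rules are selected, e.g., sorted by lift, is an assumption):

plot(head(sort(resrules, by = "lift"), n = 25),
     method = "graph", engine = "htmlwidget", cex = 0.9)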
Figure 6.16 shows a zoomed-in portion of the entire graph. Circles rep-
resent rules and rounded squares items. The size of the circle is relative
to the support and color relative to the lift. Incoming arrows represent
the items in the antecedent and the outgoing arrow of a circle points
to the item in the consequent part of the rule. From this graph, some
interesting patterns can be seen. First, when the age category of the per-
petrator is lateAdulthood, the victims were the husband or ex-wife. When
the perpetrator is a teen, the victim was likely a friend or stranger.
The arulesViz package has a cool function ruleExplorer() that generates
a shiny app with interactive controls and several plot types. When run-
ning the following code (output not shown) you may be asked to install
additional shiny related packages.
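The call itself is a one-liner (passing the learned rules is an assumption; the function also accepts transactions):

ruleExplorer(resrules)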
# Subset transactions.
rulesGirlfriend <- subset(resrules, subset = lhs %in% "R.Girlfriend")
6.4 Summary
One of the types of machine learning is unsupervised learning in
which there are no labels. This chapter introduced some unsupervised
methods such as clustering and association rules.
• The objective of 𝑘-means clustering is to find groups of points such
that points in the same group are similar and points from different
groups are as dissimilar as possible.
• The centroid of a group is calculated by taking the mean value of
each feature.
• In 𝑘-means, one needs to specify the number of groups 𝑘 before run-
ning the algorithm.
• The Silhouette Index is a measure that tells us how well a set of
points were clustered. This measure can be used to find the optimal
number of groups 𝑘.
• Association rules can find patterns in an unsupervised manner.
• The Apriori algorithm is the most well-known method for finding
association rules.
• Before using the Apriori algorithm, one needs to format the data as
transactions.
• A transaction is an event that involves a set of items.
7
Encoding Behavioral Data
Behavioral data comes in many different flavors and shapes. Data stored
in databases also have different structures (relational, graph, plain text,
etc.). As mentioned in chapter 1, before training a predictive model, data
goes through a series of steps, from data collection to preprocessing (Fig-
ure 1.7). During those steps, data is transformed and shaped with the
aim of easing the operations in the subsequent tasks. Finally, the data
needs to be encoded in a very specific format as expected by the predic-
tive model. For example, decision trees and many other classifier methods
expect their input data to be formatted as feature vectors while Dy-
namic Time Warping expects the data to be represented as timeseries.
Images are usually encoded as 𝑛-dimensional matrices. When it comes
to social network analysis, a graph is the preferred representation.
So far, I have been mentioning two key terms: encode and represen-
tation. The Cambridge Dictionary1 defines the verb encode as:
1 https://dictionary.cambridge.org/dictionary/english/encode
2 https://techterms.com/definition/encoding
Both definitions are similar, but in this chapter’s context, the second one
makes more sense. The Cambridge Dictionary3 defines representation as:
TechTerms.com returned no results for that word. From now on, I will
use the term encode to refer to the process of transforming the data
and representation as the way data is ‘conceptually’ described. Note the
‘conceptually’ part which means the way we humans think about it. This
means that data can have a conceptual representation but that does not
necessarily mean it is digitally stored in that way. For example, a physical
activity like walking captured with a motion sensor can be conceptually
represented by humans as a feature vector but its actual digital format
inside a computer is binary (see Figure 7.1).
FIGURE 7.2 Example of some raw data encoded into different repre-
sentations.
The process of designing and extracting feature vectors from raw data
is known as feature engineering. This also involves the process of
deciding which features to extract. This requires domain knowledge as
the features should capture the information needed to solve the problem.
Suppose we want to classify if a person is ‘tired’ or ‘not tired’. We have
access to some details about the person like age, height, the activities
performed during the last 30 minutes, and so on. For simplicity, let’s
assume we can generate feature vectors of size 2 and we have two options:
• Option 1. Feature vectors where the first element is age and the second
element is height.
• Option 2. Feature vectors where the first element is the number of
squats done by the user during the last 30 minutes and the second
element is heart rate.
Clearly, for this specific classification problem the second option is more
likely to produce better results. The first option may not even contain
enough information and will lead the predictive model to produce ran-
dom predictions. With the second option, the boundaries between classes
are more clear (see Figure 7.3) and classifiers will have an easier time
finding them.
FIGURE 7.3 Two different feature vectors for classifying tired and not
tired.
In R, feature vectors are stored as data frames where rows are individual
instances and columns are features. Some of the advantages and limita-
tions of feature vectors are listed below.
Advantages:
• Efficient in terms of memory.
7.2 Timeseries
A timeseries is a sequence of data points ordered in time. We have al-
ready worked with timeseries data in previous chapters when classify-
ing physical activities and hand gestures (chapter 2). Timeseries can be
multi-dimensional. For example, typical inertial sensors capture motion
forces in three axes. Timeseries analysis methods can be used to find un-
derlying time-dependent patterns while timeseries forecasting methods
aim to predict future data points based on historical data. Timeseries
analysis is a very extensive topic and there are a number of books on the
topic. For example, the book “Forecasting: Principles and Practice” by
Hyndman and Athanasopoulos [2018] focuses on timeseries forecasting
with R.
In this book we mainly use timeseries data collected from sensors in the
context of behavior predictions using machine learning. We have already
seen how classification models (like decision trees) can be trained with
timeseries converted into feature vectors (section 2.3.1) or by using the
raw timeseries data with Dynamic Time Warping (section 2.5.1).
Advantages:
• Many problems have this form and can be naturally modeled as time-
series.
7.3 Transactions
Sometimes we may want to represent data as transactions, as we did
in section 6.3. Data represented as transactions are usually intended
to be used by association rule mining algorithms (see section 6.3). As
a minimum, a transaction has a unique identifier and a set of items.
Items can be types of products, symptoms, ingredients, etc. A set of
transactions is called a database. Figure 7.4 taken from chapter 6 shows
an example database with 10 transactions. In this example, items are
sets of products from a supermarket.
7.4 Images
timeseries_to_images.R plot_activity_images.R
vision-based tasks and are very flexible models in the sense that they
can be adapted for a variety of applications with little effort.
Before CNNs were introduced by LeCun [LeCun et al., 1998], image clas-
sification used to be feature-based. One first needed to extract hand-
crafted features from images and then use a classifier to make predic-
tions. Also, images can be flattened into one-dimensional arrays where
each element represents a pixel (Figure 7.5). Then, those 1D arrays can
be used as feature vectors to perform training and inference.
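As a tiny illustration of the flattening in Figure 7.5 (the matrix values are made up):

m <- matrix(1:9, nrow = 3, byrow = TRUE)  # a small 3x3 'image'
flat <- as.vector(t(m))                   # flatten row by row: 1 2 3 4 5 6 7 8 9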
The script then moves to the next window with no overlap and repeats
the process. Actually, the script saves each image as one line of text.
The first 100 elements correspond to the 𝑥 axis, the next 100 to 𝑦, and
the remaining to 𝑧. Thus each line has 300 values. Finally, the user
id and the corresponding activity label are added at the end. This for-
mat will make it easy to read the file and reconstruct the images later
on. The resulting file is called images.txt and is already included in the
smartphone_activities dataset folder.
We can see that the patterns for ‘jogging’ look more “chaotic” compared
to the others while the ‘sitting’ activity looks like a plain solid square.
Then, we can use those images to train a CNN and perform inference.
CNNs will be covered in chapter 8 and used to build adaptive models
using these activity images.
Advantages:
• Spatial relationships can be captured.
• Can be multi-dimensional. For example 3D RGB images.
• Can be efficiently processed with CNNs.
Limitations:
• Computational time can be higher than when processing feature vec-
tors. Still, modern hardware and methods allow us to perform opera-
tions very efficiently.
• It can take some extra processing to convert non-image data into im-
ages.
FIGURE 7.8 Four timeseries (top) with their respective RPs (bot-
tom). (Author: Norbert Marwan/Pucicu at German Wikipedia. Source:
Wikipedia (CC BY-SA 3.0) [https://creativecommons.org/licenses/by-sa/3.0/legalcode]).
The first RP (leftmost) does not seem to have a clear pattern (white
noise) whereas the other three show some patterns like diagonals of
different sizes, some square and circular shapes, and so on. RPs can
be characterized by small-scale and large-scale patterns. Examples of
small-scale patterns are diagonals, horizontal/vertical lines, dots, etc.
Large-scale patterns are called typology and they depict the global char-
acteristics of the dynamic system 5 .
The visual interpretation of RPs requires some experience and is out of
the scope of this book. However, they can be used as a visual pattern
extraction tool to represent the data and then, in conjunction with ma-
chine learning methods like CNNs, used to solve classification problems.
5 http://www.recurrence-plot.tk/glance.php
But how are RPs computed? Well, that is the topic of the next section.
$$R_{i,j}(x) = \begin{cases} 1 & \text{if } \|\vec{x}_i - \vec{x}_j\| \leq \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (7.1)$$
where 𝑥⃗ are the states and ||⋅|| is a norm (for example Euclidean dis-
tance). 𝑅𝑖,𝑗 is the square matrix and will be 1 if 𝑥𝑖⃗ ≈ 𝑥𝑗⃗ up to an
error 𝜖. The 𝜖 is important since systems often do not recur exactly to a
previously visited state.
The threshold 𝜖 needs to be set manually which can be difficult in some
situations. If not set properly, the RP can end up having excessive ones
or zeros. If you plan to use RPs as part of an automated process and
feed them to a classifier, you can use the distance matrix instead. The
advantage is that you don’t need to specify any parameter except for
the distance function. The distance matrix can be defined as:
$$D_{i,j}(x) = \|\vec{x}_i - \vec{x}_j\| \qquad (7.2)$$

which is similar to equation (7.1) but without the extra step of applying
a threshold.
Advantages:
• RPs capture dynamic patterns of a system.
• They can be used to extract small and large scale patterns.
• Timeseries can be easily encoded as RPs.
• Can be used as input to CNNs for supervised learning tasks.
Limitations:
• Computationally intensive since all pairs of distances need to be cal-
culated.
• Their visual interpretation requires experience.
• A threshold needs to be defined and it is not always easy to find the
correct value. However, the distance matrix can be used instead.
recurrence_plots.R
rp <- function(x, threshold){
  N <- length(x)
  M <- matrix(0, N, N); D <- matrix(0, N, N)   # recurrence plot and distance matrix
  for(i in 1:N){
    for(j in 1:N){
      d <- abs(x[i] - x[j])                    # distance between elements i and j
      # Store result in D. Start filling values from bottom left.
      D[N - (i-1), j] <- d
      if(d <= threshold) M[N - (i-1), j] <- 1  # RP entry is 1 if the distance is within the threshold
    }
  }
  return(list(D = D, RP = M))
}
This function first defines two square matrices M and D to store the re-
currence plot and the distance matrix, respectively. Then, it iterates the
matrices from bottom left to top right and fills the corresponding val-
ues for M and D. The distance between elements i and j from the vector
is computed. That distance is directly stored in D. To generate the RP
we check if the distance is less or equal to the threshold. If that is the
case the corresponding entry in M is set to 1. Finally, both matrices are
returned by the function.
Now, we can try our rp() function on the HAND GESTURES dataset to
convert one of the timeseries into a RP. First, we read one of the gesture
files. For example, the first gesture ‘1’ from user 1. We only extract the
acceleration from the 𝑥 axis and store it in variable x.
df <- read.csv(file.path(datasets_path,
"hand_gestures/1/1_20130703-120056.txt"),
header = F)
x <- df$V1
# Plot vector x.
plot(x, type="l", main="Hand gesture 1", xlab = "time", ylab = "")
Now the rp() function that we just defined is used to calculate the RP
and distance matrix of vector x. We set a threshold of 0.5 and store the
result in res.
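The call is a single line (names follow the function defined above):

res <- rp(x, 0.5)   # compute the recurrence plot and distance matrix of x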
Let’s first plot the distance matrix stored in res$D. The pheatmap() func-
tion can be used to generate the plot.
library(pheatmap)
pheatmap(res$D, main="Distance matrix of gesture 1", cluster_row = FALSE,
cluster_col = FALSE,
legend = F,
color = colorRampPalette(c("white", "black"))(50))
From figure 7.10 we can see that the diagonal cells are all white. Those
represent values of 0, the distance between a point and itself. Apart from
that, there are no other human intuitive patterns to look for. Now, let’s
see what the recurrence plot stored in res$RP looks like (Figure 7.11).
pheatmap(res$RP, main = "Recurrence plot of gesture 1",  # plot title assumed
         cluster_row = FALSE, cluster_col = FALSE,
         legend = F,
         color = colorRampPalette(c("white", "black"))(50))
shiny_rp.R This shiny app allows you to select hand gestures, plot their
corresponding distance matrix and recurrence plot, and see how the
threshold affects the final result.
7.6 Bag-of-Words
The main idea of the Bag-of-Words (BoW) encoding is to represent a
complex entity as a set of its constituent parts. It is called Bag-of-Words
because one of the first applications was in natural language processing.
Say there is a set of documents about different topics such as medicine,
arts, engineering, etc., and you would like to classify them automati-
cally based on their words. In BoW, each document is represented as
a table that contains the unique words across all documents and their
respective counts for each document. With this representation, one may
see that documents about medicine will contain higher counts of words
like treatment, diagnosis, health, etc., compared to documents about art
or engineering. Figures 7.13 and 7.14 show the conceptual view and the
table view, respectively.
From these representations, it is now easy to build a document classifier.
The word-counts table can be used as an input feature vector. That is,
each position in the feature vector represents a word and its value is an
integer representing the total count for that word.
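A toy illustration of the word-count idea with two tiny 'documents' (the words are made up):

doc1 <- c("treatment", "diagnosis", "health", "treatment")
doc2 <- c("painting", "art", "health")
vocab <- sort(unique(c(doc1, doc2)))                       # unique words across all documents
counts <- rbind(doc1 = as.vector(table(factor(doc1, levels = vocab))),
                doc2 = as.vector(table(factor(doc2, levels = vocab))))
colnames(counts) <- vocab
counts   # each row is the Bag-of-Words of one document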
Once the feature vectors are labeled, we can build the word-count table
but instead of having ‘meaningful’ words, the entries will be ids with
their corresponding counts. As you might have guessed, one limitation
is that we do not know how many clusters (labels) there should be for a
given problem. One approach is to try out for different values of 𝑘 and
use the one that optimizes your performance metric of interest.
But what does this BoW thing have to do with behavior? Well, we can use this
method to decompose complex behaviors into simpler ones and encode
them as BoW as we will see in the next subsection for complex activities
analysis.
Advantages
Limitations
bagwords/bow_functions.R bagwords/bow_run.R
So far, I have been talking about BoW applications for text and images.
In this section, I will show you how to decompose complex activities
from accelerometer data into simpler activities and encode them as BoW.
In chapters 2 and 3, we trained supervised models for simple activity
recognition. Those activities were like: walking, jogging, standing, etc.
For those, it is sufficient to divide them into windows of size equivalent
to a couple of seconds in order to infer their labels. On the other hand,
the duration of complex activities is longer and they are composed of
many simple activities. One example is the activity shopping. When
we are shopping we perform many different activities including walking,
taking groceries, paying, standing while looking at the stands, and so on.
Another example is commuting. When we commute, we need to walk
but also take the train, or drive, or cycle.
Using the same approach for simple activity classification on complex
ones may not work. Representing a complex activity using fixed-size
windows can cause some conflicts. For example, a window may be cover-
ing the time span when the user was walking, but walking can be present
in different types of complex activities. If a window happens to be part of
a segment when the person was walking, there is not enough information
to know which was the complex activity at that time. This is where BoW
comes into play. If we represent a complex activity as a bag of simple ac-
tivities then, a classifier will have an easier time differentiating between
classes. For instance, when exercising, the frequencies (counts) of high-
intensity activities (like running or jogging) will be higher compared to
when someone is shopping.
In practice, it would be very tedious to manually label all possible sim-
ple activities to form the BoW. Instead, we will use the unsupervised
approach discussed in the previous section to automatically label the
simple activities so we only need to manually label the complex ones.
Here, I will use the COMPLEX ACTIVITIES dataset which consists of
five complex activities: ‘commuting’, ‘working’, ‘being at home’, ‘shop-
ping’ and ‘exercising’. The duration of the activities varies from some
minutes to a couple of hours. Accelerometer data at 50 Hz was collected
with a cellphone placed on the user’s belt. The dataset has 80 accelerom-
eter files, each representing a complex activity.
The task is to go from the raw accelerometer data of the complex activity
to a BoW representation where each word will represent a simple activity.
The overall steps are as follows:
1. Divide the raw data into small fixed-length windows and gener-
ate feature vectors from them. Intuitively, these are the simple
activities.
2. Cluster the feature vectors into groups and compute the centroid of each group.
3. Label each feature vector (simple activity) by assigning it to the closest centroid.
4. Convert the sequences of labeled simple activities into histograms of counts. These histograms are the Bag-of-Words of the complex activities.
Figure 7.15 shows the overall steps graphically. All the functions to per-
form the above steps are implemented in bow_functions.R. The functions
are called in the appropriate order in bow_run.R.
First of all, and to avoid overfitting, we need to hold out an independent
set of instances. These instances will be used to generate the clusters and
their respective centroids. The dataset is already divided into a train and
test set. The train set contains 13 instances out of the 80. The remaining
67 are assigned to the test set.
In the first step, we need to extract the feature vectors from the raw
data. This is implemented in the function extractSimpleActivities(). This
function divides the raw data of each file into fixed-length windows of
size 150 which corresponds to 3 seconds. Each window can be thought of
as a simple activity. For each window, it extracts 14 features like mean,
standard deviation, correlation between axes, etc. The output is stored in
the folder simple_activities/. Each file corresponds to one of the complex
activities. There is one such folder for the train set and another one for the test set.
This is because we divided the data into train and test sets. So we need
to extract the features from both sets by setting the train parameter
accordingly.
The second step consists of clustering the extracted feature vectors.
To avoid overfitting, this step is only performed on the train set.
The function clusterSimpleActivities() implements this step. The fea-
ture vectors are grouped into 15 groups. This can be changed by set-
ting constants$wordsize <- 15 to some other value. The function stores
all feature vectors from all files in a single data frame and runs
𝑘-means. Finally, the resulting centroids are saved in the text file
clustering/centroids.txt inside the train set directory.
The next step is to label each feature vector (simple activity) by assigning
it to its closest centroid. The function assignSimpleActivitiesToCluster()
reads the centroids from the text file, and for each simple activity in
the test set it finds the closest centroid using the Euclidean distance.
The label (an integer from 1 to 15) of the closest centroid is assigned
and the resulting files are saved in the labeled_activities/ directory.
Each file contains the assigned labels (integers) for the corresponding
feature vectors file in the simple_activities/ directory. Thus, if a file in-
side simple_activities/ has 100 feature vectors then, its corresponding
file in labeled_activities/ should have 100 labels.
In the last step, the function convertToHistogram() will generate the bag
of words from the labeled activities. The BoW are stored as histograms
(encoded as vectors) with each element representing a label and its cor-
responding counts. In this case, the labels are 𝑤1..𝑤15. The 𝑤 stands
for word and was only appended for clarity to show that this is a label.
This function will convert the counts into percentages (normalization)
in case we want to perform classification, that is, the percentage of time
that each word (simple activity) occurred during the entire complex ac-
tivity. The resulting histograms/histograms.csv file contains the BoW as
one histogram per row. One per each complex activity. The first column
is the complex activity’s label in text format.
Figures 7.16 and 7.17 show the histogram for one instance of ‘working’
and ‘exercising’. The x-axis shows the labels of the simple activities and
the y-axis their relative frequencies.
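Putting the pieces together, the call sequence in bow_run.R can be sketched as follows (the function names come from the text; the exact arguments, apart from the train parameter, are assumptions):

source("bagwords/bow_functions.R")       # path assumed
extractSimpleActivities(train = TRUE)    # 1. feature vectors from the train set windows
extractSimpleActivities(train = FALSE)   #    and from the test set windows
clusterSimpleActivities()                # 2. k-means on the train feature vectors (15 groups)
assignSimpleActivitiesToCluster()        # 3. label each window with its closest centroid
convertToHistogram()                     # 4. build the normalized BoW histograms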
7.7 Graphs
Graphs are one of the most general data structures (and my favorite
one). The two basic components of a graph are its vertices and edges.
Vertices are also called nodes and edges are also called arcs. Vertices are
connected by edges. Figure 7.18 shows three different types of graphs.
Graph (a) is an undirected graph that consists of 3 vertices and 3 edges.
Graph (b) is a directed graph, that is, its edges have a direction. Graph
(c) is a weighted directed graph because its edges have a direction and
they also have an associated weight.
Advantages:
• Many real-world situations can be naturally represented as graphs.
• Some partial order is preserved.
plot_graphs.R
In the previous section, it was shown how complex activities can be rep-
resented as Bag-of-Words. This was done by decomposing the complex
activities into simpler ones. The BoW is composed of the simple activities
counts (frequencies). In the process of building the BoW in the previous
section, some intermediate text files stored in labeled_activities/ were
generated. These files contain the sequence of simple activities (their ids
as integers) that constitute the complex activity. From these sequences,
histograms were generated and in doing so, the order was lost.
One thing we can do is build a graph where vertices represent simple ac-
tivities and edges represent the interactions between them. For instance,
if we have a sequence of simple activities ids like: 3, 2, 2, 4 we can repre-
sent this as a graph with 3 vertices and 3 edges. One vertex per activity.
The first edge would go from vertex 3 to vertex 2, the next one from ver-
tex 2 to vertex 2, and so on. In this way we can use a graph to capture
the interactions between simple activities.
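A minimal sketch of the idea using the igraph package and the short example sequence above (3, 2, 2, 4):

library(igraph)
ids <- c(3, 2, 2, 4)                     # sequence of simple-activity ids
labels <- as.character(sort(unique(ids)))
A <- matrix(0, length(labels), length(labels), dimnames = list(labels, labels))
for(i in 1:(length(ids) - 1)){
  from <- as.character(ids[i]); to <- as.character(ids[i + 1])
  A[from, to] <- A[from, to] + 1         # count transitions from one activity to the next
}
g <- graph_from_adjacency_matrix(A, mode = "directed", weighted = TRUE)
plot(g, edge.width = E(g)$weight, edge.label = E(g)$weight)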
The script plot_graphs.R implements a function named ids.to.graph()
that reads the sequence files from labeled_activities/ and converts them
into weighted directed graphs. The weight of the edge (𝑎, 𝑏) is equal to
the total number of transitions from vertex 𝑎 to vertex 𝑏. The script
uses the igraph package [Csardi and Nepusz, 2006] to store and plot the
graphs.
Figure 7.20 shows the resulting plot. The plot can be customized to
change the vertex and edge color, size, curvature, etc. For more details
please read the igraph package documentation.
The width of the edges is proportional to its weight. For instance, tran-
sitions from simple activity 3 to itself are very frequent (53.2% of the
time) for the ‘work’ complex activity, but transitions from 8 to 4 are
very infrequent. Note that with this graph representation, some tempo-
ral dependencies are preserved but the complete sequence order is lost.
Still this captures more information compared to BoW. The relationships
between consecutive simple activities are preserved.
It is also possible to get the adjacency matrix with the method
as_adjacency_matrix().
as_adjacency_matrix(g)
7.8 Summary
Depending on the problem at hand, the data can be encoded in different
forms. Representing data in a particular way can simplify the problem
solving process and the application of specialized algorithms. This chap-
ter presented different ways in which data can be encoded along with
some of their advantages/disadvantages.
• Feature vectors are fixed-size arrays that capture the properties of
an instance. This is the most common form of data representation in
machine learning.
8
Predicting Behavior with Deep Learning
For the rest of the chapter I will mostly use the term units to refer to
neurons/nodes. I will also use the term network to refer to artificial
neural networks.
Before going into details of how multi-layer ANNs work, let’s start with
a very simple neural network consisting of a single unit. See Figure
8.1. Even though this network only has one node, it is already composed
of several interesting elements which are the basis of more complex net-
works. First, it has 𝑛 input variables 𝑥1 … 𝑥𝑛 which are real numbers.
Second, the unit has a set of 𝑛 weights 𝑤1 … 𝑤𝑛 associated with each
input. These weights can take real numbers as values. Finally, there is
an output 𝑦′ which is binary (it can take two values: 1 or 0).
the inputs are multiplied by their corresponding weights and the results
are summed. If the sum is greater than a given threshold, then the output
is 1 and 0 otherwise. Formally:
$$y' = \begin{cases} 1 & \text{if } \sum_i w_i x_i > t, \\ 0 & \text{if } \sum_i w_i x_i \leq t \end{cases} \qquad (8.1)$$
Suppose that today was payday and the theater is projecting an action
movie. Then, we can set the input variables 𝑚𝑜𝑛𝑒𝑦 = 1 and ℎ𝑜𝑟𝑟𝑜𝑟 = 0.
Now we want to decide if we should go to the movie theater or not. To
get the final answer we can use Equation (8.1). This formula tells us
that we need to multiply each input variable with their corresponding
weights and add them:
(𝑚𝑜𝑛𝑒𝑦)(5) + (ℎ𝑜𝑟𝑟𝑜𝑟)(−3)
(1)(5) + (0)(−3) = 5
Since 5 > 𝑡 (remember the threshold 𝑡 = 3), the final output will be
1, thus, the advice is to go to the movies. Let’s try the scenario when
you have money but they are projecting a horror movie: 𝑚𝑜𝑛𝑒𝑦 = 1,
ℎ𝑜𝑟𝑟𝑜𝑟 = 1.
(1)(5) + (1)(−3) = 2
In this case, 2 < 𝑡 and the final output is 0. Even if you have money,
you should not waste it on a movie that you know you most likely will
not like. This process of applying operations to the inputs and obtaining
the final result is called forward propagation because the inputs are
‘pushed’ all the way through the network (a single perceptron in this
case). For bigger networks, the outputs of the current layer become the
inputs of the next layer, and so on.
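The movie-theater perceptron can be written in a couple of lines of R (the weights 5 and −3 and the threshold 3 are the ones used in the example above):

perceptron <- function(x, w = c(5, -3), t = 3) as.integer(sum(w * x) > t)
perceptron(c(1, 0))   # money = 1, horror = 0 -> 1 (go to the movies)
perceptron(c(1, 1))   # money = 1, horror = 1 -> 0 (stay home)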
For convenience, a simplified version of Equation (8.1) can be used.
This alternative representation is useful because it provides flexibility
to change the internals of the units (neurons) as we will see. The first
simplification consists of representing the inputs and weights as vectors:
$$\sum_i w_i x_i = w \cdot x \qquad (8.2)$$
The threshold 𝑡 can be moved to the other side of the inequality and renamed as a bias term 𝑏 = −𝑡, so the output of the unit becomes:

$$y' = f(x) = \begin{cases} 1 & \text{if } w \cdot x + b > 0, \\ 0 & \text{otherwise} \end{cases} \qquad (8.3)$$
The unit can also be written in terms of a generic activation function 𝑔:

$$f(x) = g(w \cdot x + b) \qquad (8.4)$$

In the case of the perceptron, 𝑔 is the step function:

$$g(x) = step(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \qquad (8.5)$$
The first limitation imposes some restrictions on its applicability. For ex-
ample, a perceptron cannot be used to predict real-valued outputs which
are needed, for example, in regression problems. These limitations can be addressed by replacing the step activation function with a smooth function such as the sigmoid function:

$$s(x) = \frac{1}{1 + e^{-x}} \qquad (8.6)$$
This function has an ‘S’ shape (Figure 8.5) and as opposed to a step
function, this one is smooth. The range of this function is from 0 to 1.
If we substitute the activation function in Equation (8.4) with the sig-
moid function we get our sigmoid unit:
$$f(x) = \frac{1}{1 + e^{-(w \cdot x + b)}} \qquad (8.7)$$
Sigmoid units have been one of the most commonly used types of units
when building bigger neural networks. Another advantage is that the
outputs are real values that can be interpreted as probabilities. For in-
stance, if we want to make binary decisions we can set a threshold. For
example, if the output of the sigmoid unit is > 0.5 then return a 1. Of
course, that threshold would depend on the application. If we need more
confidence about the result we can set a higher threshold.
In the last years, another type of unit has been successfully applied to
train neural networks, the rectified linear unit or ReLU for short
(Figure 8.6).
The activation function of this unit is the rectifier function:
$$rectifier(x) = \begin{cases} 0 & \text{if } x < 0, \\ x & \text{if } x \geq 0 \end{cases} \qquad (8.8)$$
This one is also called the ramp function and is one of the simplest non-
linear functions and probably the most common one used in modern big
neural networks.
In practice, many other activation functions are used but the most
common ones are sigmoid and ReLU units. In the following link, you
can find an extensive list of activation functions: https://en.wikipedia.org/wiki/Activation_function
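Written as plain R functions, the two most common activations look like this (a small sketch for illustration):

sigmoid <- function(x) 1 / (1 + exp(-x))   # Equation (8.6)
relu    <- function(x) pmax(0, x)          # the rectifier in Equation (8.8)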
So far, we have been talking about single units. In the next section, we
will see how these single units can be assembled to build bigger artificial
neural networks.
This network only has one hidden layer. Hidden layers are called hidden
because they do not have direct contact with the external world. Finally,
there is an output layer with a single unit. We could also have an output
layer with more than one unit. Most of the time, we will have fully
connected neural networks. That is, all units have incoming connections
from all nodes in the previous layer (as in the previous example).
We already saw how a unit can produce a result based on the inputs
by using forward propagation. For more complex networks the process is
the same! Consider the network shown in Figure 8.8. It consists of two
inputs and one output. It also has one hidden layer with 2 units.
Each node is labeled as 𝑛𝑙,𝑛 where 𝑙 is the layer and 𝑛 is the unit num-
ber. The two input values are 1 and 0.5. They could be temperature
measurements, for example. Each edge has an associated weight. For
simplicity, let’s assume that the activation function of the units is the
identity function 𝑔(𝑥) = 𝑥. The bold underlined number inside the nodes
of the hidden and output layers are the biases. Here we assume that the
network is already trained (later we will see how those weights and bi-
ases are learned). To get the final result, for each node, its inputs are
multiplied by their corresponding weights and added. Then, the bias is
added. Next, the activation function is applied. In this case, it is just the
identity function (returns the same value). The outputs of the nodes in
the hidden layer become the inputs of the next layer and so on.
In this example, first we need to compute the outputs of nodes 𝑛2,1 and
𝑛2,2 :
output of 𝑛2,1 = (1)(2) + (0.5)(1) + 1 = 3.5
output of 𝑛2,2 = (1)(−3) + (0.5)(5) + 0 = −0.5
Finally, we can compute the output of the last node using the outputs
of the previous nodes:
output of 𝑛3,1 = (3.5)(1) + (−0.5)(−1) + 3 = 7.
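To make the computation concrete, here is a small R sketch that reproduces this forward pass with the weights and biases shown in Figure 8.8:

# Forward pass of the 2-2-1 network (identity activation function).
x <- c(1, 0.5)                      # Input values.
W1 <- matrix(c(2, 1,                # Weights into n2,1.
               -3, 5),              # Weights into n2,2.
             nrow = 2, byrow = TRUE)
b1 <- c(1, 0)                       # Biases of the hidden units.
w2 <- c(1, -1)                      # Weights into the output unit n3,1.
b2 <- 3                             # Bias of the output unit.

h <- as.vector(W1 %*% x) + b1       # Hidden outputs: 3.5 and -0.5.
out <- sum(w2 * h) + b2             # Output of n3,1: 7.
print(h); print(out)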
To train a network, we need a way to measure its prediction errors, that is, a loss function. A common loss function for regression is the mean squared error:

$$L(\theta) = \frac{1}{N}\sum_{n=1}^{N} (y'_n - y_n)^2 \qquad (8.9)$$
The mean squared error (MSE) loss function is commonly used for
regression problems. For classification problems, the average cross-
entropy loss function is usually preferred (covered later in this chap-
ter).
$$\operatorname*{arg\,min}_{\theta}\; L(\theta)$$

This notation means: find and return the weights and biases that make the loss function as small as possible.
The most common method to train neural networks is called gradient
descent. The algorithm updates the parameters in an iterative fashion
based on the loss. This algorithm is suitable for complex functions with
millions of parameters.
Suppose there is a network with only 1 weight and no bias with MSE
as loss function (Equation (8.9)). Figure 8.10 shows a plot of the loss
function. This is a quadratic function that only depends on the value of
𝑤. The task is to find the 𝑤 where the function is at its minimum.
FIGURE 8.11 Function with 1 global minimum and several local min-
ima.
But in what direction and how much is 𝑤 moved in each iteration? The
direction and magnitude are estimated by computing the derivative of the loss function with respect to the weight, $\frac{\partial L}{\partial w}$. The derivative is also called the gradient and denoted by $\nabla L$. The iterative gradient descent procedure is listed below:
loop until convergence or max iterations (epochs):
    for each $w_i$ in $W$ do:
        $w_i = w_i - \alpha \frac{\partial L(W)}{\partial w_i}$
The outer loop is run until the algorithm converges or until a predefined
number of iterations is reached. Each iteration is also called an epoch.
Each weight is updated with the rule: $w_i = w_i - \alpha \frac{\partial L(W)}{\partial w_i}$. The derivative part will give us the direction and magnitude. The $\alpha$ is called the
learning rate and it controls how ‘fast’ we move. The learning rate is
a constant defined by the user, thus, it is a hyperparameter. A high
learning rate can cause the algorithm to miss the local minima and the
loss can start to increase. A small learning rate will cause the algorithm
to take more time to converge. Figure 8.12 illustrates both scenarios.
Selecting an appropriate learning rate will depend on the application
but common values are between 0.0001 and 0.05.
Let’s see how gradient descent works with a step by step example. Con-
sider a very simple neural network consisting of an input layer with only
one input feature and an output layer with one unit and no bias. To make
it even simpler, the activation function of the output unit is the identity
function 𝑓(𝑥) = 𝑥. Assume that as training data we have a single data
point. Figure 8.13 shows the simple network and the training data. The
training data point only has one input variable (𝑥) and an output (𝑦).
We want to train this network such that it can make predictions on new
data points. The training point has an input feature of 𝑥 = 3 and the
expected output is 𝑦 = 1.5. For this particular training point, it seems
that the output is equal to the input divided by 2. Thus, based on this
single training data point the network should learn how to divide any
other input by 2.
product between the input value and the single weight, and the
activation function has no effect (it returns the same value as its
input). We can rewrite the loss function as $L(w) = (xw - y)^2$.
2. We need to define a learning rate. For now, we can set it to
𝛼 = 0.05.
3. The weights need to be initialized at random. Let’s assume the
single weight is ‘randomly’ initialized with 𝑤 = 2.
Now we can use gradient descent to iteratively update the weight. Re-
member that the updating rule is:
$$w = w - \alpha \frac{\partial L(w)}{\partial w} \qquad (8.11)$$

$$\frac{\partial L(w)}{\partial w} = 2x(xw - y) \qquad (8.12)$$

$$w = w - \alpha \, 2x(xw - y) \qquad (8.13)$$
Now, we can start doing predictions with our very simple neural network! To do so, we use forward propagation on the new input data using the learned weight.
Even though the predictions are not perfect, they are very close to the
expected value (division by 2) considering that the network is very simple
and was only trained with a single data point and for only 3 epochs!
If the training set has more than one data point, then we need to compute
the derivative of each point and accumulate them (the derivative of a
sum is equal to the sum of the derivatives). In the previous example, the
update rule becomes:
$$w = w - \alpha \sum_{i=1}^{N} 2x_i(x_i w - y_i) \qquad (8.14)$$
gradient_descent.R
We start by creating a sample training set with 3 points. Again, the output is the input divided by 2.
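A minimal sketch of such a training set (the exact x values used in gradient_descent.R may differ):

# Sample training set with 3 points; the output is the input divided by 2.
train_set <- data.frame(x = c(-4.0, 3.0, 7.5))
train_set$y <- train_set$x / 2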
# Gradient descent.
gradient.descent <- function(train_set, lr = 0.01, epochs = 5){
  w <- runif(1, -1, 1) # Initialize the single weight at random (assumed).
  for(i in 1:epochs){
    derivative.sum <- 0.0
    loss.sum <- 0.0
    for(j in 1:nrow(train_set)){
      # Accumulate the derivative and the squared error of each point.
      derivative.sum <- derivative.sum + 2 * train_set$x[j] * (train_set$x[j] * w - train_set$y[j])
      loss.sum <- loss.sum + (train_set$x[j] * w - train_set$y[j])^2
    }
    # Update weight.
    w <- w - lr * derivative.sum
    print(paste0("epoch: ", i, " loss: ", loss.sum / nrow(train_set), " w: ", w))
  }
  return(w)
}
Now, let’s train the network with a learning rate of 0.01 and for 10
epochs. This function will print for each epoch, the loss and the current
weight.
set.seed(123)
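# Train for 10 epochs with a learning rate of 0.01 (call sketched from the
# description above; learned_w is used for the predictions below).
learned_w <- gradient.descent(train_set, lr = 0.01, epochs = 10)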
From the output, we can see that the loss decreases as the weight is
updated. The final value of the weight at iteration 10 is 0.49805. We can
now make predictions on new data.
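Forward propagation for this one-unit network is simply the product of the learned weight and the input. A minimal sketch of the fp() helper used below (the script defines its own version):

# Forward propagation: single unit, no bias, identity activation.
fp <- function(w, x){
  w * x
}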
fp(learned_w, -88)
#> [1] -43.8286
Now, you can try to change the training set to make the network learn
a different arithmetic operation!
In the previous example, we considered a very simple neural network
consisting of a single unit. In this case, the partial derivative with respect
to the single weight was calculated directly. For bigger networks with
more layers and activations, the final output becomes a composition
of functions. That is, the activation values of a layer 𝑙 depend on its
weights which are also affected by the previous layer’s 𝑙 − 1 weights and
so on. So, the derivatives (gradients) can be computed using the chain
rule 𝑓(𝑔(𝑥))′ = 𝑓 ′ (𝑔(𝑥)) ⋅ 𝑔′ (𝑥). This can be performed efficiently by an
algorithm known as backpropagation.
Here, 𝐶 refers to the loss function which is also called the cost func-
tion. In modern deep learning libraries like TensorFlow, this procedure
is efficiently implemented with a computational graph. If you want to
learn the details about backpropagation I recommend you to check this
post by DEEPLIZARD (https://deeplizard.com/learn/video/XE3krf3CQls)
which consists of 5 parts including videos.
Then, at each epoch all batches are iterated and the parameters are
updated based on each batch and not the entire training set, for example:
[1] http://neuralnetworksanddeeplearning.com/chap2.html
$$w = w - \alpha \sum_{i=1}^{m} 2x_i(x_i w - y_i) \qquad (8.15)$$
Again, an epoch is one pass through all the batches (the entire training set). Now
you may be wondering why this method is more efficient if an epoch still
involves the same number of operations but they are split into chunks.
Part of the reason is that, since the parameter updates are more frequent, the loss also improves more quickly. Another reason is that the operations within
each batch can be optimized and performed in parallel, for example, by
using a GPU. One thing to note is that each update is based on less
information by only using 𝑚 points instead of the entire data set. This
can introduce some noise in the learning but at the same time this can
help to get out of local minima. In practice, SGD needs more epochs to
converge compared to gradient descent but overall, it will take less time.
From now on, this is the method we will use to train our networks.
Be aware that when using GPUs, a big batch size can cause out of
memory errors since the GPU may not have enough memory to allocate
the batch.
[3] https://keras.io/
[4] https://en.wikipedia.org/wiki/Theano_(software)
[5] https://en.wikipedia.org/wiki/Microsoft_Cognitive_Toolkit
In the next section, we will start with a simple model built with Keras
and the following examples will introduce more functions. By the end
of this chapter you will be able to build and train efficient deep neural
networks including Convolutional Neural Networks.
keras_simple_network.R
library(keras)

# Instantiate an empty sequential model to which layers will be added
# (this step is assumed; the script defines the model before adding layers).
model <- keras_model_sequential()
We can now start adding layers (only one in this example). To do so, the
layer_dense() method can be used. The dense name means that this will
be a densely (fully) connected layer. This layer will be the output layer
with a single unit.
model %>%
layer_dense(units = 1,
use_bias = FALSE,
activation = 'linear',
input_shape = 1)
The first argument units = 1 specifies the number of units in this layer.
By default, a bias is added in each layer. To make it the same as in
the previous example, we will not use a bias so use_bias is set to FALSE.
The activation specifies the activation function. Here it is set to 'linear'
which means that no activation function is applied 𝑓(𝑥) = 𝑥. Finally,
we need to specify the number of inputs with input_shape. In this case,
there is only one feature.
Before training the network we need to compile the model and specify the
learning algorithm. In this case, stochastic gradient descent with a learn-
ing rate of 𝛼 = 0.01. We also need to specify which loss function to use
(we’ll use mean squared error). At every epoch, some performance met-
rics can be computed. Here, we specify that we want the mean squared
error and mean absolute error. These metrics are computed on the train
data. After compiling the model, the summary() method can be used to
print a textual description of it. Figure 8.16 shows the output of the
summary() function.
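A minimal sketch of the compile step just described, followed by the summary() call (the exact code in keras_simple_network.R may differ slightly):

model %>% compile(
  optimizer = optimizer_sgd(lr = 0.01),
  loss = 'mse',
  metrics = c('mse', 'mae')
)

summary(model)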
From this output, we see that the network consists of a single dense layer
with 1 unit. To start the actual training procedure we need to call the
fit() function. Its first argument is the input training data (features) as
a matrix. The second argument specifies the corresponding true outputs.
We let the algorithm run for 30 epochs. The batch size is set to 3 which
is also the total number of data points in our data. In this example the
dataset is very small so we set the batch size equal to the total number
of instances. In practice, datasets can contain thousands of instances but
the batch size will be relatively small (e.g., 8, 16, 32, etc.).
Additionally, there is a validation_split parameter that specifies the frac-
tion of the train data to be used for validation. This is set to 0 (the
default) since the dataset is very small. If the validation split is greater
than 0, its performance metrics will also be computed. The verbose pa-
rameter sets the amount of information to be printed during training.
A 0 will not print anything. A 2 will print one line of information per
epoch. The last parameter view_metrics specifies if you want the progress
of the loss and performance metrics to be plotted. The fit() function
returns an object with summary statistics collected during training and
is saved in the variable history.
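A sketch of that call, consistent with the parameters just described (train.x and train.y are placeholder names for the matrix of input features and the vector of true outputs):

history <- model %>% fit(
  train.x,
  train.y,
  epochs = 30,
  batch_size = 3,
  validation_split = 0,
  verbose = 2,
  view_metrics = TRUE
)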
Figure 8.17 presents the output of the fit() function in RStudio. In the
console, the training loss, mean squared error, and mean absolute error
are printed during each epoch. In the viewer pane, plots of the same
metrics are shown. Here, we can see that the loss is nicely decreasing
over time. The loss at epoch 30 should be close to 0.
The results can slightly differ every time the training is run due to
random weight initializations performed by the back end.
Once the model is trained, we can perform inference on new data points
with the predict_on_batch() function. Here we are passing three data
points.
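A sketch of such a call (the three input values are arbitrary placeholders):

# Perform inference on three new data points.
model %>% predict_on_batch(matrix(c(1, 5, 10), ncol = 1))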
Now, try setting a higher learning rate, for example, 0.05. With this
learning rate, the algorithm will converge much faster. On my computer,
at epoch 11 the loss was already 0.
One practical thing to note is that if you make any changes in the
compile() or fit() functions, you will have to rerun the code that in-
stantiates and defines the network. This is because the model object
saves the current state including the learned weights. If you rerun the
fit() function on a previously trained model, it will start with the
previously learned weights.
Let’s start with point number 1 (add more units to the output layer).
This means that if the number of classes is 𝑘, then the last layer needs to
have 𝑘 units, one for each class. That’s it! Figure 8.18 shows an example
of a neural network with an output layer having 3 units. Each unit
predicts a score for each of the 3 classes. Let’s call the vector of predicted
scores 𝑦′ .
Point number 2 says that a softmax activation function should be used
in the output layer. When training the network, just as with regression,
we need a way to compute the error between the predicted values 𝑦′ and
the true values 𝑦. In this case, 𝑦 is a one-hot encoded vector with a 1 at
the position of the true class and 0𝑠 elsewhere. If you are not familiar
with one-hot encoding, you can check the topic in chapter 5. As opposed
to other classifiers like decision trees, 𝑘-NN, etc., neural networks need
the classes to be one-hot encoded.
With regression problems, one way to compare the prediction with the
true value is by using the squared difference: (𝑦′ −𝑦)2 . With classification,
𝑦 and 𝑦′ are vectors so we need another way to compare them. The true
values 𝑦 are represented as a vector of probabilities with a 1 at the
position of the true class. The output scores 𝑦′ do not necessarily sum
up to 1 thus, they are not proper probabilities. Before comparing 𝑦 and
𝑦′ we need both to be probabilities. The softmax activation function is
used to convert 𝑦′ into a vector of probabilities. The softmax function is
applied individually to each element of a vector:
$$softmax(x, i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}} \qquad (8.16)$$
# Softmax function.
softmax <- function(scores){
  exp(scores) / sum(exp(scores))
}

# Raw class scores (values reconstructed to match the printed output below).
scores <- c(3.0, 0.03, 1.2)
probabilities <- softmax(scores)
print(probabilities)
#> [1] 0.82196 0.04217 0.13587
print(sum(probabilities)) # Should sum up to 1.
#> [1] 1
To compare the predicted probabilities $y'$ against the true one-hot vector $y$, the cross-entropy, $CE(y', y) = -\sum_i y_i \log(y'_i)$, can be used. It can be implemented in R as follows:

# Cross-entropy
CE <- function(A,B){
- sum(B * log(A))
}
y <- c(1, 0, 0)
print(CE(softmax(scores), y))
#> [1] 0.1961
Now we know how to compute the cross-entropy for each training in-
stance. The total loss function is then, the average cross-entropy
across the training points. The next section shows how to build a
neural network for classification using Keras.
keras_electromyography.R
# Format data
trainset <- format.to.array(trainset, numclasses = 4)
valset <- format.to.array(valset, numclasses = 4)
testset <- format.to.array(testset, numclasses = 4)
Let’s print the first one-hot encoded classes from the train set:
head(trainset$y)
The first three instances belong to the class ‘paper’ because the 1𝑠 are in
the third position. The corresponding integers are 0-rock, 1-scissors, 2-
paper, 3-OK. So ‘paper’ comes in the third position. The fourth instance
belongs to the class ‘OK’, the fifth to ‘rock’, and so on.
Now it’s time to define the neural network architecture! We will do so
inside a function:
# Function that defines and compiles the network
# (the name and wrapper are assumed; the script may differ).
get.nn <- function(ninputs, nclasses, lr){
  model <- keras_model_sequential()
  model %>%
    layer_dense(units = 32, activation = 'relu',
                input_shape = ninputs) %>%
    layer_dense(units = 16, activation = 'relu') %>%
    layer_dense(units = nclasses, activation = 'softmax')
  model %>% compile(loss = 'categorical_crossentropy',
                    optimizer = optimizer_sgd(lr = lr),
                    metrics = c('accuracy'))
  return(model)
}
The first argument takes the number of inputs (features), the second
argument specifies the number of classes and the last argument is the
learning rate 𝛼. The first line instantiates an empty keras sequential
model. Then we add three layers. The first two are hidden layers and
the last one will be the output layer. The input layer is implicitly defined
when setting the input_shape parameter in the first layer. The first hidden
layer has 32 units with a ReLU activation function. Since this is the first
hidden layer, we also need to specify what is the expected input by
setting the input_shape. In this case, the number of input features is 64.
The next hidden layer has 16 ReLU units. For the output layer, the
number of units needs to be equal to the number of classes (4, in this
case). Since this is a classification problem we also set the activation
function to softmax.
Then, the model is compiled and the loss function is set to
categorical_crossentropy because this is a classification problem. Stochas-
tic gradient descent is used with a learning rate passed as a parameter.
During training, we want to monitor the accuracy. Finally, the function
returns the compiled model.
Now we can call our function to create the model. This one will have 64
inputs and 4 outputs and the learning rate is set to 0.01. It is always
useful to print a summary of the model with the summary() function.
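A sketch of that call, using the (hypothetically named) function defined above:

model <- get.nn(ninputs = 64, nclasses = 4, lr = 0.01)
summary(model)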
From the summary, we can see that the network has 3 layers. The sec-
ond column shows the output shape which in this case corresponds to
the number of units in each layer. The last column shows the number of
parameters of each layer. For example, the first layer has 2080 parame-
ters! Those come from the weights and biases. There are 64 (inputs) *
32 (units) = 2048 weights plus the 32 biases (one for each unit). The
biases are included by default on each layer unless otherwise specified.
The second layer receives 32 inputs on each of its 16 units. Thus 32 *
16 + 16 (biases) = 528. The last layer has 16 inputs from the previous
layer on each of its 4 units plus 4 biases giving a total of 68 parameters.
In total, the network has 2676 parameters. Here, we see how fast the
number of parameters grows when adding more layers and units. Now,
we use the fit() function to train the model.
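A sketch of the training call consistent with the description that follows (trainset and valset are the formatted arrays created earlier):

history <- model %>% fit(
  trainset$x, trainset$y,
  epochs = 300,
  batch_size = 8,
  validation_data = list(valset$x, valset$y)
)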
The model is trained for 300 epochs with a batch size of 8. We used
the validation_data parameter to specify the validation set to compute
the performance on unseen data. The training will take some minutes
to complete. Bigger models can take hours or even several days. Thus,
it is a good idea to save a model once it is trained. You can do so with
the save_model_hdf5() or save_model_tf() methods. The former saves the
model in hdf5 format while the latter saves it in TensorFlow's SavedModel
format. The SavedModel is stored as a directory containing the necessary
serialized files to restore the model’s state.
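For example, to save the model in hdf5 format (the file name matches the one loaded below):

# Save the trained model.
save_model_hdf5(model, "electromyography.hdf5")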
# Load model.
model <- load_model_hdf5("electromyography.hdf5")
The source code files include the trained models used in this book in
case you want to reproduce the results. Both the hdf5 and SavedModel
versions are included.
Figure 8.20 shows the train and validation loss and accuracy as produced
by plot(history). We see that both the training and validation loss are
decreasing over time. The accuracy increases over time.
Now, we evaluate the performance of the trained model with the test set
using the evaluate() function.
# Evaluate model.
model %>% evaluate(testset$x, testset$y)
The accuracy was pretty decent (≈ 84%). To get the actual class predic-
tions you can use the predict_classes() function.
# Predict classes.
classes <- model %>% predict_classes(testset$x)
head(classes)
#> [1] 2 2 1 3 0 1
Note that this function returns the classes with numbers starting with
0 just as in the original dataset.
Sometimes it is useful to access the actual predicted scores for each class.
This can be done with the predict_on_batch() function.
To obtain the actual classes from the scores, we can compute the index
of the maximum column. Then we subtract 1 so the classes start at 0.
Since the true classes are also one-hot encoded we need to do the same
to get the ground truth.
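A sketch of those steps (the variable names match the accuracy computation below):

# Predicted scores for each class (one column per class).
scores <- model %>% predict_on_batch(testset$x)
# Index of the maximum score per row, minus 1 so classes start at 0.
classes <- max.col(scores) - 1
# The true classes are one-hot encoded; apply the same operation.
groundTruth <- max.col(testset$y) - 1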
# Compute accuracy.
sum(classes == groundTruth) / length(classes)
#> [1] 0.8474576
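The confusion matrix below is built from string labels. One way to obtain them from the integer classes (a sketch; the script may construct them differently):

# Map the integer classes (0-rock, 1-scissors, 2-paper, 3-OK) to strings.
labels <- c("rock", "scissors", "paper", "ok")
str.predictions <- labels[classes + 1]
str.groundTruth <- labels[groundTruth + 1]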
library(caret)
cm <- confusionMatrix(as.factor(str.predictions),
as.factor(str.groundTruth))
cm$table
#> Reference
#> Prediction ok paper rock scissors
#> ok 681 118 24 27
#> paper 54 681 47 12
#> rock 29 18 771 1
#> scissors 134 68 8 867
Now, try to modify the network by making it deeper (adding more layers)
and fine-tune the hyperparameters like the learning rate, batch size, etc.,
to increase the performance.
8.4 Overfitting
One important thing to look at when training a network is overfitting.
That is, when the model memorizes instead of learning (see chapter 1).
Overfitting means that the model becomes very specialized at mapping
inputs to outputs from the train set but fails to do so with new test
samples. One reason is that a model can become so complex, with so many parameters, that it will perfectly adapt to its training data but will miss more general patterns, preventing it from performing well on unseen
instances. To diagnose this, one can plot loss/accuracy curves during
training epochs.
In Figure 8.21 we can see that after some epochs the validation loss starts
to increase even though the train loss is still decreasing. This is because
the model is getting better on reducing the error on the train set but
its performance starts to decrease when presented with new instances.
Conversely, one can observe a similar effect with the accuracy. The model
keeps improving its performance on the train set but at some point,
the accuracy on the validation set starts to decrease. Usually, one stops the training before the validation performance starts to degrade.

8.4.1 Early Stopping
keras_electromyography_earlystopping.R
Neural networks are trained for several epochs using gradient descent.
But the question is: for how many epochs? As can be seen in Figure
8.21, too many epochs can lead to overfitting and too few can cause
underfitting. Early stopping is a simple but effective method to reduce
the risk of overfitting. The method consists of setting a large number of
epochs and stop updating the network’s parameters when a condition
is met. For example, one condition can be to stop when there is no
performance improvement on the validation set after 𝑛 epochs or when
there is a decrease of some percent in accuracy.
Keras provides some mechanisms to implement early stopping and this
is accomplished via callbacks. A callback is a function that is run at
different stages during training such as at the beginning or end of an
epoch or at the beginning or end of a batch operation. Callbacks are
passed as a list to the fit() function. You can define custom callbacks
or use some of the built-in ones including callback_early_stopping(). This
callback will cause the training to stop when a metric stops improving.
The metric can be accuracy, loss, etc. The following callback will stop
the training if after 10 epochs (patience) there is no improvement of at
least 1% (min_delta) in accuracy on the validation set.
callback_early_stopping(monitor = "val_acc",
min_delta = 0.01,
patience = 10,
verbose = 1,
mode = "max")
If the mode parameter is set to "max", training will stop when the monitored metric has stopped increasing.
It may be the case that the best validation performance was achieved
not in the last epoch but at some previous point. By setting the
restore_best_weights parameter to TRUE the model weights from the epoch
with the best value of the monitored metric will be restored.
The script keras_electromyography_earlystopping.R shows how to use the
early stopping callback in Keras with the electromyography dataset. The
following code is an extract that shows how to define the callback and
pass it to the fit() function.
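A sketch of that extract, consistent with the description below (a patience of 50 epochs, a minimum improvement of 1%, and restoring the best weights; the callback name and the total number of epochs are placeholders):

# Early stopping callback.
my.callback <- callback_early_stopping(monitor = "val_acc",
                                       min_delta = 0.01,
                                       patience = 50,
                                       verbose = 1,
                                       mode = "max",
                                       restore_best_weights = TRUE)

history <- model %>% fit(
  trainset$x, trainset$y,
  epochs = 500,
  batch_size = 8,
  validation_data = list(valset$x, valset$y),
  callbacks = list(my.callback)
)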
This code will cause the training to stop if after 50 epochs there is no
improvement in accuracy of at least 1% and will restore the model’s
weights to the ones during the epoch with the highest accuracy. Figure
8.22 shows how the training stopped at epoch 241.
If we evaluate the final model on the test set, we see that the accuracy
is 86.4%, a noticeable increase compared to the 84.7% that we got when
training for 300 epochs without early stopping.
# Evaluate model.
model %>% evaluate(testset$x, testset$y)
#> $loss
#> [1] 0.3777530
#> $acc
#> [1] 0.8641243
8.4.2 Dropout
Dropout is another technique to reduce overfitting proposed by Srivas-
tava et al. [2014]. It consists of ‘dropping’ some of the units from a hidden
layer for each sample during training. In theory, it can also be applied
to input and output layers but that is not very common. The incoming
and outgoing connections of a dropped unit are discarded. Figure 8.23
shows an example of applying dropout to a network. In Figure 8.23 (b),
the middle unit was removed from the network whereas in Figure 8.23
(c), the top and bottom units were removed.
Each unit has an associated probability 𝑝 (independent of other units) of
being dropped. This probability is another hyperparameter but typically
it is set to 0.5. Thus, during each iteration and for each sample, half of
the units are discarded. The effect of this is having simpler networks
(see Figure 8.23) and thus, less prone to overfitting. Intuitively, you can
also think of dropout as training an ensemble of neural networks,
each having a slightly different structure.
From the perspective of one unit that receives inputs from the previous
hidden layer with dropout, approximately half of its incoming connec-
tions will be gone (if 𝑝 = 0.5). See Figure 8.24.
Dropout has the effect of making units not rely on any single incoming
connection. This makes the whole network able to compensate for the
lack of connections by learning alternative paths. In practice and for
many applications, this results in a more robust model. A side effect of
applying dropout is that the expected value of the activation function of
a unit will be diminished because some of the previous activations will
be 0. Recall that the output of a neuron is computed as:

$$f(x) = g(w \cdot x + b)$$
where 𝑥 contains the input values from the previous layer, 𝑤 the cor-
responding weights and 𝑔() is the activation function. With dropout,
approximately half of the values of 𝑥 will be 0 (if 𝑝 = 0.5). To
compensate for that, the input values need to be scaled, in this case,
by a factor of 2.
In Keras, dropout can be added with the layer_dropout() function. For example:

model %>%
layer_dense(units = 256, activation = 'relu', input_shape = 1000) %>%
layer_dropout(0.5) %>%
layer_dense(units = 128, activation = 'relu') %>%
layer_dropout(0.5) %>%
layer_dense(units = 2, activation = 'softmax')
8.5 Fine-tuning a Neural Network

There is also no formula for determining the batch size, the learning rate, the type of activation function, the number of epochs to train for, and so on. All those are called the hyperparameters of the
network. Hyperparameter tuning is a complex optimization problem and
there is a lot of research going on that tackles the issue from different
angles. My suggestion is to start with a simple architecture that has
been used before to solve a similar problem and then fine-tune it for
your specific task. If you are not aware of such a network, there are
some guidelines (described below) to get you started. Always keep in
mind that those are only recommendations, so you do not need to abide
by them and you should feel free to try configurations that deviate from
those guidelines depending on your problem at hand.
Training neural networks is a time-consuming process, especially in deep
networks. Training a network can take from several minutes to weeks.
In many cases, performing cross-validation is not feasible. A common
practice is to divide the data into train/validation/test sets. The train-
ing data is used to train a network with a given architecture and a set
of hyperparameters. The validation set is used to evaluate the general-
ization performance of the network. Then, you can try different archi-
tectures and hyperparameters and evaluate the performance again and
again with the validation set. Typically, the network’s performance is
monitored during training epochs by plotting the loss and accuracy of
the train and validation sets. Once you are happy with your model, you
test its performance on the test set only once and that is the result
that is reported.
Here are some starting point guidelines, however, also take into consid-
eration that those hyperparameters can be dependent on each other. So,
if you modify a hyperparameter it may impact other(s).
Number of hidden layers. Most of the time one or two hidden layers
are enough to solve not too complex problems. One piece of advice is to start with
one hidden layer and if that one is not enough to capture the complexity
of the problem, add another layer and so on.
Number of units. If a network has too few units it can underfit, that
is, the model will be too simple to capture the underlying data patterns.
If the network has too many units this can result in overfitting. Also, it
will take more time to learn the parameters. Some guidelines mention
that the number of units should be somewhere between the number of
input features and the number of units in the output layer [6]. Huang [2003] has even proposed a formula for the two-hidden-layer case to calculate the number of units that are enough to learn $N$ samples: $2\sqrt{(m+2)N}$, where $m$ is the number of output units.
My suggestion is to first gain some practice and intuition with simple
problems. A good way to do so is with the TensorFlow playground (ht
tps://playground.tensorflow.org/) created by Daniel Smilkov and Shan
Carter. This is a web-based implementation of a neural network that
you can fine-tune to solve a predefined set of classification and regression
problems. For example, Figure 8.25 shows how I tried to solve the XOR
problem with a neural network with 1 hidden layer and 1 unit with a
sigmoid activation function. After more than 1,000 epochs the loss is
still quite high (0.38). Try to add more neurons and/or hidden layers
and see if you can solve the XOR problem with fewer epochs.
Batch size. Batch sizes typically range between 4 and 512. Big batch
sizes provide a better estimate of the gradient but are more computa-
tionally expensive. On the other hand, small batch sizes are faster to
compute but will introduce more noise in the gradient estimation, requir-
ing more epochs to converge. When using a GPU or other specialized
hardware, the computations can be performed in parallel thus, allowing
bigger batch sizes to be computed in a reasonable time. Some people
argue that the noise introduced with small batch sizes is good to escape
[6] https://www.heatonresearch.com/2017/06/01/hidden-layers.html
from local minima. Keskar et al. [2016] showed that in practice, big batch
sizes can result in degraded models. A good starting point is 32 which
is the default in Keras.
Learning rate. This is one of the most important hyperparameters.
The learning rate specifies how fast gradient descent ‘moves’ when try-
ing to find an optimal minimum. However, this doesn’t mean that the
algorithm will learn faster if the learning rate is set to a high value. If
it is too high, the loss can start oscillating. If it is too low, the learn-
ing will take a lot of time. One way to fine-tune it, is to start with the
default one. In Keras, the default learning rate for stochastic gradient
descent is 0.01. Then, based on the loss plot across epochs, you can de-
crease/increase it. If learning is taking long, try to increase it. If the loss
seems to be oscillating or stuck, try reducing it. Typical values are 0.1,
0.01, 0.001, 0.0001, 0.00001. In addition to stochastic gradient descent, Keras provides implementations of other optimizers, like Adam, which have adaptive learning rates, but still, one needs to specify an initial one.
8.6 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are mostly used for image classification tasks but can also be used for regression and for time series data. If we wanted to perform image classification with a traditional neural network, first we would need to build a feature vector by either:

1. extracting hand-crafted features from the images using image processing techniques, or
2. flattening the image pixels into a one-dimensional array and using the raw pixel values as inputs.
The first solution requires a lot of image processing expertise and do-
main knowledge. Extracting features from images is not a trivial task
and requires a lot of preprocessing to reduce noise, artifacts, segment
the objects of interest, remove background, etc. Additionally, consider-
able effort is spent on feature engineering. The drawback of the second
solution is that spatial information is lost, that is, the relationship be-
tween neighboring pixels. CNNs solve the two previous problems by au-
tomatically extracting features while preserving spatial information. As
opposed to traditional networks, CNNs can take as input 𝑛-dimensional
images and process them efficiently. The main building blocks of a CNN
are:
1. Convolution layers
2. Pooling operations
3. Traditional fully connected layers
Figure 8.26 shows a simple CNN and its basic components. First, the
input image goes through a convolution layer with 4 kernels (details
about the convolution operation are described in the next subsection).
This layer is in charge of extracting features by applying the kernels
on top of the image. The result of this operation is a convolved image,
also known as feature maps. The number of feature maps is equal to
the number of kernels, in this case, 4. Then, a pooling operation is
applied on top of the feature maps. This operation reduces the size of
the feature maps by downsampling them (details on this in a following
subsection). The output of the pooling operation is a set of feature maps
with reduced size. Here, the outputs are 4 reduced feature maps since
the pooling operation is applied to each feature map independently of
the others. Then, the feature maps are flattened into a one-dimensional
array. Conceptually, this array represents all the features extracted from
the previous steps. These features are then used as inputs to a neural
network with its respective input, hidden, and output layers. An ‘*’ and
underlined text means that parameter learning occurs in that layer. For
example, in the convolution layer, the parameters of the kernels need to
be learned. On the other hand, the pooling operation does not require
parameter learning since it is a fixed operation. Finally, the parameters
of the neural network are learned too, including the hidden layers and
the output layer.
One can build more complex CNNs by stacking more convolution layers
and pooling operations. By doing so, the level of abstraction increases.
For example, the first convolution extracts simple features like horizon-
tal, vertical, diagonal lines, etc. The next convolution could extract more
complex features like squares, triangles, and so on. The parameter learn-
ing of all layers (including the convolution layers) occurs during the
same forward and backpropagation step just as with a normal neural
network. Both, the features and the classification task are learned at the
same time! During learning, batches of images are forward propagated
and the parameters are adjusted accordingly to minimize the error (for
example, the average cross-entropy for classification). The same meth-
ods for training normal neural networks are used for CNNs, for example,
stochastic gradient descent.
8.6.1 Convolutions
Convolutions are used to automatically extract feature maps from im-
ages. A convolution operation consists of a kernel also known as a fil-
ter which is a matrix with real values. Kernels are usually much smaller
than the original image. For example, for a grayscale image of height
and width of 100x100 a typical kernel size would be 3x3. The size of
the kernel is a hyperparameter. The convolution operation consists of
applying the kernel over the image starting at the upper left corner and
moving forward row by row until reaching the bottom right corner. The
stride controls how many elements the kernel is moved at a time and
this is also a hyperparameter. A typical value for the stride is 1.
The convolution operation computes the sum of the element-wise prod-
uct between the kernel and the image region it is covering. The output
of this operation is used to generate the convolved image (feature map).
Figure 8.27 shows the first two iterations and the final iteration of the
convolution operation on an image. In this case, the kernel is a 3x3 ma-
trix with 1s in its first row and 0s elsewhere. The original image has a
size of 5x5x1 (height, width, depth) and it seems to be a number 7.
In the first iteration, the kernel is aligned with the upper left corner of
the original image. An element-wise multiplication is performed and the
results are summed. The operation is shown at the top of the figure.
In the first iteration, the result was 3 and it is set at the corresponding
position of the final convolved image (feature map). In the next iteration,
the kernel is moved one position to the right and again, the final result
is 3 which is set in the next position of the convolved image. The process
continues until the kernel reaches the bottom right corner. At the last
iteration (9), the result is 1.
Now, the convolved image (feature map) represents the features ex-
tracted by this particular kernel. Also, note that the feature map is
a 3x3 matrix which is smaller than the original image. It is also possible
to force the feature map to have the same size as the original image by
padding it with zeros.
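To make the operation concrete, here is a small R sketch of a convolution with stride 1 and no padding. The 5x5 image is a hypothetical '7' (not necessarily the exact values of Figure 8.27) and the kernel has 1s in its first row:

# Minimal 2D convolution (stride = 1, no padding).
convolve2d <- function(img, kernel){
  kh <- nrow(kernel); kw <- ncol(kernel)
  out <- matrix(0, nrow(img) - kh + 1, ncol(img) - kw + 1)
  for(i in 1:nrow(out)){
    for(j in 1:ncol(out)){
      region <- img[i:(i + kh - 1), j:(j + kw - 1)]
      out[i, j] <- sum(region * kernel) # Element-wise product, then sum.
    }
  }
  out
}

# Hypothetical 5x5 grayscale image resembling a 7.
img <- matrix(c(1,1,1,1,1,
                0,0,0,1,0,
                0,0,1,0,0,
                0,0,1,0,0,
                0,0,1,0,0), nrow = 5, byrow = TRUE)

# Kernel with 1s in its first row (detects horizontal lines).
kernel <- matrix(c(1,1,1,
                   0,0,0,
                   0,0,0), nrow = 3, byrow = TRUE)

convolve2d(img, kernel) # Produces a 3x3 feature map.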
Before learning starts, the kernel values are initialized at random. In this
example, the kernel has 1s in the first row and it has 3x3 = 9 parameters.
This is what makes CNNs so efficient since the same kernel is applied to
the entire image. This is known as ‘parameter sharing’. Our kernel has 1s
at the top and 0s elsewhere so it seems that this kernel learned to detect
horizontal lines. If we look at the final convolved image, we see that the
horizontal lines were emphasized by this kernel. This would be a good
candidate kernel to differentiate between 7s and 0s, for example, since 0s do not have long horizontal lines. But maybe it will have difficulties
discriminating between 7s and 5s since both have horizontal lines at the
top.
In this example, only 1 kernel was used but in practice, you may want
more kernels, each in charge of identifying the best features for the given
problem. For example, another kernel could learn to identify diagonal
lines which would be useful to differentiate between 7s and 5s. The num-
ber of kernels per convolution layer is a hyperparameter. In the previous
example, we could have defined to have 4 kernels instead of one. In that
case, the output of that layer would have been 4 feature maps of size
3x3 each (Figure 8.28).
What would be the output of a convolution layer with 4 kernels of size 3x3 if it is applied to an RGB color image of size 5x5x3? In that case, the
output will be the same (4 feature maps of size 3x3) as if the image were
in grayscale (5x5x1). Remember that the number of output feature maps
is equal to the number of kernels regardless of the depth of the image.
However, in this case, each kernel will have a depth of 3. Each depth is
applied independently to the corresponding R, G, and B image channels.
Thus, each kernel has 3x3x3 = 27 parameters that need to be learned.
After applying each kernel to each image channel (in this example, 3
channels), the results of each channel are added and this is why we
end up with one feature map per kernel. The following course website has
a nice interactive animation of how convolutions are applied to an image
with 3 channels: https://cs231n.github.io/convolutional-networks/. In the
next section (‘CNNs with Keras’), a couple of examples that demonstrate
how to calculate the number of parameters and the outputs’ shape will
be presented as well.
8.6.2 Pooling Operations

Pooling operations reduce the size of the feature maps by downsampling them. Figure 8.29 shows an example of max pooling applied to a 4x4 image with a window of size 2x2 and a stride of 2.

FIGURE 8.29 Max pooling with a window of size 2x2 and stride = 2.
The result of this operation is an image of size 2x2 which is half of the
original one. Aside from max pooling, average pooling can be applied
instead. In that case, it computes the mean value across all values covered
by the window.
keras_cnns.R
In Keras, a convolution layer can be added with the layer_conv_2d() function:

# Convolution layer.
layer_conv_2d(filters = 4, # Number of kernels.
kernel_size = c(3,3), # Kernel size.
strides = c(1,1), # Stride.
padding = "same", # Type of padding.
activation = 'relu', # Activation function.
input_shape = c(5,5,1)) # Input image dimensions.
# Only specified in first layer.
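In Keras, a max pooling layer can be added with the layer_max_pooling_2d() function; a minimal sketch:

# Max pooling with a 2x2 window.
layer_max_pooling_2d(pool_size = c(2, 2))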
The pool_size specifies the window size (height, width). By default, the
strides will be equal to pool_size but if desired, this can be changed with
the strides parameter. This function also accepts a padding parameter similar to the one described for layer_conv_2d().
To illustrate this convolution and pooling operations I will use two simple
examples. The complete code for the two examples can be found in the
script keras_cnns.R.
8.7.1 Example 1
Let’s create our first CNN in Keras. For now, this CNN will not be
trained but only its architecture will be defined. The objective is to
understand the building blocks of the network. In the next section, we
will build and train a CNN that detects smiles from image faces.
Our network will consist of 1 convolution layer, 1 max pooling layer, 1 fully connected hidden layer, and an output layer.
library(keras)

# Instantiate an empty sequential model (assumed, as in the previous examples).
model <- keras_model_sequential()

model %>%
layer_conv_2d(filters = 4,
kernel_size = c(3,3),
padding = "valid",
activation = 'relu',
input_shape = c(10,10,1)) %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_flatten() %>%
layer_dense(units = 32, activation = 'relu') %>%
layer_dense(units = 2, activation = 'softmax')
summary(model)
The first convolution layer has 4 kernels of size 3x3 and a ReLU as
the activation function. The padding is set to "valid" so no padding
will be performed. The input image is of size 10x10x1 (height, width,
depth). Then, we apply max pooling with a window size of 2x2. Later,
the output is flattened and fed into a fully connected layer with 32 units.
Finally, the output layer has 2 units with a softmax activation function
for classification.
From the summary, the output of the first Conv2D layer is (None, 8,
8, 4). The ‘None’ means that the number of input images is not fixed
and depends on the batch size. The next two numbers correspond to
the height and width which are both 8. This is because the image was
not padded and after applying the convolution operation on the original
10x10 height and width image, its dimensions are reduced to 8. The last
number (4) is the number of feature maps which is equal to the number
of kernels (filters=4). The number of parameters is 40 (last column).
This is because there are 4 kernels with 3x3 = 9 parameters each, and
there is one bias per kernel included by default: 4 × 3 × 3 + 4 = 40.
The output of MaxPooling2D is (None, 4, 4, 4). The height and width
are 4 because the pool size was 2 and the stride was 2. This had the effect
of reducing to half the height and width of the output of the previous
layer. Max pooling preserves the number of feature maps, thus, the last
number is 4 (the number of feature maps from the previous layer). Max
pooling does not have any learnable parameters since it applies a fixed
operation every time.
Before passing the downsampled feature maps to the next fully connected
layer they need to be flattened into a 1-dimensional array. This is done
with the layer_flatten() function. Its output has a shape of (None, 64)
which corresponds to the 4 × 4 × 4 = 64 features of the previous layer.
The next fully connected layer has 32 units each with a connection with
every one of the 64 input features. Each unit has a bias. Thus the number
of parameters is 64 × 32 + 32 = 2080.
Finally, the output layer has 32 × 2 + 2 = 66 parameters. The entire network has 2,186 parameters! Now, you can try to modify the kernel size, the strides, the padding, and the input shape and see how the output
dimensions and the number of parameters vary.
8.7.2 Example 2
Now let’s try another example, but this time the input image will have
a depth of 3 simulating an RGB image.
# Instantiate an empty sequential model (assumed, as in Example 1).
model2 <- keras_model_sequential()

model2 %>%
layer_conv_2d(filters = 16,
kernel_size = c(3,3),
padding = "same",
activation = 'relu',
input_shape = c(28,28,3)) %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_flatten() %>%
layer_dense(units = 64, activation = 'relu') %>%
layer_dense(units = 5, activation = 'softmax')
summary(model2)
Figure 8.31 shows that the output height and width of the first Conv2D
layer is 28 which is the same as the input image size. This is because this
time we set padding = "same" and the image dimensions were preserved.
The 16 corresponds to the number of feature maps which was set with
filters = 16.
The total parameter count for this layer is 448. Each kernel has 3 × 3 = 9 weights per channel, there are 16 kernels, and each kernel has a depth of 3 because the input image is RGB: 9 [weights] × 3 [depth] × 16 [kernels] + 16 [biases] = 448.
Notice that even though each kernel has a depth of 3 the output number
of feature maps of this layer is 16 and not 16×3 = 48. This is because as
mentioned before, each kernel produces a single feature map regardless
of the depth because the values are summed depth-wise. The rest of the
layers are similar to the previous example.
keras_smile_detection.R
In this section, we will build a CNN that detects smiling and non-smiling
faces from pictures from the SMILES dataset. This information could
be used, for example, to analyze smiling patterns during job interviews,
exams, etc. For this task, we will use a cropped [Sanderson and Lovell,
2009] version of the Labeled Faces in the Wild (LFW) database [Huang
et al., 2008]. A subset of the database was labeled by Arigbabu et al.
[2016], Arigbabu [2017]. The labels are provided as two text files, each,
containing the list of files that correspond to smiling and non-smiling
faces. The dataset can be downloaded from: http://conradsanderson.id
.au/lfwcrop/ and the labels list from: https://data.mendeley.com/datase
ts/yz4v8tb3tp/5. See Appendix B for instructions on how to setup the
dataset.
The smiling set has 600 pictures and the non-smiling has 603 pictures.
Figure 8.32 shows an example of one image from each of the sets.
The script keras_smile_detection.R has the full code of the analysis. First,
we load the list of smiling pictures.
library(pixmap)
# Print dimensions.
dim(smiling.images)
#> [1] 600 64 64 3
If we print the minimum and maximum values we see that they are 0
and 1 so there is no need for normalization.
max(smiling.images)
#> [1] 1
min(smiling.images)
#> [1] 0
The next step is to randomly split the dataset into train and test sets.
We will use 85% for the train set and 15% for the test set. We set
the validation_split parameter of the fit() function to choose a small
percent (10%) of the train set as the validation set during training.
After creating the train and test sets, the train set images and labels are
stored in trainX and trainY, respectively and the test set data is stored
in testX and testY. The labels in trainY and testY were one-hot encoded.
Now that the data is in place, let’s build the CNN.
model %>%
layer_conv_2d(filters = 8,
kernel_size = c(3,3),
activation = 'relu',
input_shape = c(64,64,3)) %>%
layer_max_pooling_2d(pool_size = c(2, 2)) %>%
layer_dropout(0.25) %>%
layer_conv_2d(filters = 16,
kernel_size = c(3,3),
activation = 'relu') %>%
  # ... (remaining layers omitted here; see keras_smile_detection.R for
  #      the full model definition)
# Compile model.
model %>% compile(
loss = 'categorical_crossentropy',
optimizer = optimizer_sgd(lr = 0.01),
metrics = c("accuracy")
)
# Fit model.
history <- model %>% fit(
trainX, trainY,
epochs = 50,
batch_size = 8,
validation_split = 0.10,
verbose = 1,
view_metrics = TRUE
)
plot(history)
After epoch 25 (see Figure 8.33) it looks like the training loss is de-
creasing faster than the validation loss. After epoch 40 it seems that the
model starts to overfit (the validation loss is increasing a bit). If we look
at the validation accuracy, it seems that it starts to get flat after epoch
30. Now we evaluate the model on the test set:
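A sketch of the evaluation call (testX and testY were created during the train/test split described above):

# Evaluate model on the test set.
model %>% evaluate(testX, testY)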
#> $acc
#> [1] 0.9222222
From those 16, all but one were correctly classified. The correct ones
are shown in green and the incorrect one in red. Some faces seem to
be smiling (last row, third image) but the mouth is closed, though. It
seems that this CNN classifies images as ‘smiling’ only when the mouth
is open which may be the way the train labels were defined.
8.9 Summary
Deep learning (DL) consists of a set of different architectures and
algorithms. As of now, it mainly focuses on artificial neural networks
(ANNs). This chapter introduced two main types of DL models (ANNs
and CNNs) and their application to behavior analysis.
• Artificial neural networks (ANNs) are mathematical models inspired
by the brain. But that does not mean they work the same as the brain.
• The perceptron is one of the simplest ANNs.
• ANNs consist of an input layer, hidden layer(s) and an output layer.
• Deep networks have many hidden layers.
• Gradient descent can be used to learn the parameters of a network.
• Overfitting is a recurring problem in ANNs. Some methods like
dropout and early stopping can be used to reduce the effect of
overfitting.
• A Convolutional Neural Network (CNN) is a type of ANN that can
process 𝑁 -dimensional arrays very efficiently. They are used mainly
for computer vision tasks.
• CNNs consist of convolution and pooling layers.
9
Multi-user Validation
Every person is different. We all have different physical and mental char-
acteristics. Every person reacts differently to the same stimulus and con-
ducts physical and motor activities in particular ways. As we have seen,
predictive models rely on the training data; and for user-oriented ap-
plications, this data encodes their behaviors. When building predictive
models, we want them to be general and to perform accurately on new
unseen instances. Sometimes this generalization capability comes at a
price, especially in multi-user settings. A multi-user setting is one in
which the results depend heavily on the target user, that is, the user
on which the predictions are made. Take, for example, a hand gesture
recognition system. At inference time, a specific person (the target user)
performs a gesture and the system should recognize it. The input data
comes directly from the user. On the other hand, a non multi-user
system does not depend on a particular person. A classifier that labels
fruits on images or a regression model that predicts house prices does
not depend directly on a particular person.
Some time ago I had to build an activity recognition system based on
inertial data from a wrist band. So I collected the data, trained the mod-
els, and evaluated them. The performance results were good. However,
it turned out that when the system was tested on a new sample group it
failed. The reason? The training data was collected from people within
a particular age group (young) but the target market of the product
was for much older people. Older people tend to walk more slowly, thus,
the system was predicting ‘no movement’ when in fact, the person was
walking at a very slow pace. This is an extreme example, but even within
the same age groups, there can be differences between users (inter-user
variance). Even the same user can evolve over time and change her/his
behaviors (intra-user variance).
So, how do we evaluate multi-user systems to reduce the unexpected
effects once the system is deployed? Most of the time, there’s going to
9.1 Mixed Models

With a mixed model, we would just remove the userid column and per-
form 𝑘-fold cross-validation or hold-out validation as usual. In fact, this
is what we have been doing so far. By doing so, some random data points
will end up in the train set and others in the test set regardless of which
data point was generated by which user. The user rows are just mixed,
thus the mixed model name. This model assumes that the data was gen-
erated by a single user. One disadvantage of validating a system using
a mixed model is that the performance results could be overestimated.
When randomly splitting into train and test sets, some data points for
a given user could end up in each of the splits. At inference time, when the model is presented with data from a user it has already seen during training, it has an advantage, which can lead to overestimated performance on truly new users.

When should a mixed model be used to validate a system?
1. When you know you will have available train data belonging
to the intended target users.
2. In many cases, a dataset is missing the information about the mapping between rows and users, that is, a userid column is not present. In those cases, the best performance estimate can be obtained with a mixed model.
preprocess_skeleton_actions.R classify_skeleton_actions.R
# Print dimensions.
dim(df)
#> [1] 20 3 66
From the file name, we see that this corresponds to action 7 (basketball
shoot), from subject 1 and trial 1. The readMat() function reads the file
contents and stores them as a 3D array in df. If we print the dimensions
we see that the first one corresponds to the number of joints, the second
one are the positions (x, y, z), and the last dimension is the number of
frames, in this case 66 frames.
We extract the first time-frame as follows:
frame <- df[, , 1] # Extract the first time-frame (indexing assumed).
# Print dimensions.
dim(frame)
#> [1] 20 3
Each frame can then be plotted. The plotting code is included in the
script. Figure 9.2 shows how the skeleton looks like for six of the time
frames. The script also has code to animate the actions.
We will represent each action (file) as a feature vector. The same script
also shows the code to extract the feature vectors from each action.
To extract the features, a reference point in the skeleton is selected,
in this case the spine (joint 3). Then, for each time frame, the distance
between all joints (excluding the reference point) and the reference point
is calculated. Finally, for each distance, the mean, min, and max are
computed across all time frames. Since there are 19 joints (excluding
the spine), we end up with 19 × 3 = 57 features. Figure 9.3 shows what the final dataset looks like. It only shows the first four features out of the 57, plus the user id and the labels.
The following examples assume that the file dataset.csv with the
extracted features already exists in the skeleton_actions/ directory.
To generate this file, run the feature extraction code in the script
preprocess_skeleton_actions.R.
FIGURE 9.3 First rows of the skeleton dataset after feature extraction
showing the first four features. Source: Original data from C. Chen, R.
Jafari, and N. Kehtarnavaz, “UTD-MHAD: A Multimodal Dataset for
Human Action Recognition Utilizing a Depth Camera and a Wearable
Inertial Sensor”, Proceedings of IEEE International Conference on Image
Processing, Canada, September 2015.
source(file.path("..","auxiliary_functions","globals.R"))
source(file.path("..","auxiliary_functions","functions.R"))
library(randomForest)
library(caret)
The unique.actions variable stores the name of all actions. We will need it
later to define the levels of the factor object. Next, we generate 10 folds
and define some variables to store the performance metrics including the
accuracy, recall, and precision. In each iteration during cross-validation,
we will compute and store those performance metrics.
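A sketch of the fold generation and the metric containers (assuming the data frame is called dataset; the script may do this differently):

set.seed(1234)
k <- 10
# Assign each row to one of the k folds at random.
folds <- sample(k, size = nrow(dataset), replace = TRUE)

# Containers for the per-fold performance metrics.
accuracies <- NULL; recalls <- NULL; precisions <- NULL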
#Normalize.
res <- normalize(trainset, testset)
Finally, the average performance across folds for each of the metrics is
printed.
Now, imagine that you want to estimate the performance of the model
in a situation where a completely new user is shown to the model, that
is, the model does not know anything about this user. We can model
those situations using a user-independent model which is the topic
of the next section.
unique.users
#> [1] "s1" "s2" "s3" "s4" "s5" "s6" "s7" "s8"
Then, we iterate through each user, build the corresponding train and
test sets, and train the classifiers. Here, we make sure that the test set
only includes data points belonging to a single user.
set.seed(1234)
for(user in unique.users){
mean(accuracies)
#> [1] 0.5807805
mean(recalls)
#> [1] 0.5798611
mean(precisions)
#> [1] 0.6539715
Those numbers are surprising! In the previous section, our mixed model
had an accuracy of 92.7% and now the user-independent model has
an accuracy of only 58.0%! This is because the latter didn’t know any-
thing about the target user. Since each person is different, the user-
independent model was not able to capture the patterns of new users
and this had a big impact on the performance.
When should a user-independent model be used to validate a
system?
for(i in 1:nrow(user.data)){
  # ... per-user evaluation loop (the full body is in the script); the resulting
  # metrics, including F1, are later averaged with na.rm = TRUE ...
}
mean(accuracies)
#> [1] 0.943114
mean(recalls)
#> [1] 0.9425154
mean(precisions)
#> [1] 0.9500772
This time, the average accuracy was 94.3% which is higher than the ac-
curacy achieved with the mixed model and the user-independent model.
The average recall and precision were also higher compared to the other
types of models. The reason is that each model was targeted to a
particular user.
When should a user-dependent model be used to validate a
system?
1. When the model will be trained only using data from the target
user.
Training a model from scratch is very time-consuming and requires a lot of
effort, especially during the data collection and labeling phase.
The idea of transfer learning dates back to 1991 [Pratt et al., 1991] but
with the advent of deep learning and in particular, with Convolutional
Neural Networks (see chapter 8), it has gained popularity because it
has proven to be a valuable tool when solving challenging problems. In
2014 a CNN architecture called VGG16 was proposed by Simonyan and
Zisserman [2014] and won the ILSVRC image recognition competition.
This CNN was trained with more than 1 million images to recognize
1000 categories. It consists of several convolution layers, max pooling
operations, and fully connected layers. In total, the network has ≈ 138
million parameters and it took some weeks to train.
What if you wanted to add a new category to the 1000 labels? Or maybe,
you only want to focus on a subset of the categories? With transfer learn-
ing you can take advantage of a network that has already been trained
and adapt it to your particular problem. In the case of deep learning,
the approach consists of ‘freezing’ the first layers of a network and only
retraining (updating) the last layers for the particular problem. During
training, the frozen layers’ parameters will not change and the unfrozen
ones are updated as usual during the gradient descent procedure. As
discussed in chapter 8, the first layers can act as feature extractors and
be reused. With this approach, you can easily retrain a VGG16 network
on an average computer and within a reasonable time. In fact, Keras
already provides interfaces to common pre-trained models that you can
reuse.
In the following section we will use this idea to build a user-adaptive
model for activity recognition using transfer learning.
keras/adaptive_cnn.R
# Read data.
df <- read.csv(filepath, stringsAsFactors = F)
# Shuffle rows.
set.seed(1234)
df <- df[sample(nrow(df)),]
print(mapping)
#> Walking Downstairs Jogging Standing Upstairs Sitting
#> 0 1 2 3 4 5
Now we store the unique users’ ids in the users variable. After print-
ing the variable’s values, notice that there are 19 distinct users in this
database. The original database has more users but we only kept those
that performed all the activities. Then, we select one of the users to act
as the target user. I will just select one of them at random (turned out
to be user 24). Feel free to select another user if you want.
Next, we split the data into two sets. The first set trainset contains the
data from all users, excluding the target user. We create two
variables: train.y and train.x. The first one has the labels as integers
and the second one has the actual image pixels (features). The second
set target.data contains data only from the target user.
Then, we split the target user's data into 50% test data and 50% adaptive
data (code omitted here) so that we end up with the following 4 variables:
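A sketch of that split is shown below (the shuffling and column handling in
the actual script may differ).

# Split the target user's data 50/50 into adaptive and test partitions.
set.seed(1234)
idxs <- sample(nrow(target.data), size = nrow(target.data) %/% 2)
target.adaptive <- target.data[idxs, ]   # Used to adapt the model.
target.test <- target.data[-idxs, ]      # Used only for evaluation.
# From these, labels and pixel features are extracted into the 4 variables:
# target.adaptive.x, target.adaptive.y, target.test.x, target.test.y.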
We also need to normalize the data and reshape it into the actual
image format since, in their current form, the pixels are stored as
1-dimensional arrays. We learn the normalization parameters only from
the train set and then, use the normalize.reshape() function (defined in
the same script file) to perform the actual normalization and formatting.
# Learn min and max values from train set for normalization.
maxv <- max(train.x)
minv <- min(train.x)
Let's inspect the structure of the final datasets.
dim(train.x)
#> [1] 6399 10 10 3
dim(target.adaptive.x)
#> [1] 124 10 10 3
dim(target.test.x)
#> [1] 124 10 10 3
Here, we see that the train set has 6399 instances (images). The adaptive
and test sets both have 124 instances.
Now that we are done with the preprocessing, it is time to build the
CNN model! This one will be the initial user-independent model and is
trained with all the train data train.x, train.y.
# Create a sequential model.
model <- keras_model_sequential()

model %>%
  layer_conv_2d(name = "conv1",
                filters = 8,
                kernel_size = c(2,2),
                activation = 'relu',
                input_shape = c(10,10,3)) %>%
  layer_conv_2d(name = "conv2",
                filters = 16,
                kernel_size = c(2,2),
                activation = 'relu') %>%
  layer_max_pooling_2d(pool_size = c(2, 2)) %>%
  layer_flatten() %>%
  layer_dense(name = "hidden1", units = 32,
              activation = 'relu') %>%
  # Output layer (reconstructed): one unit per activity class.
  layer_dense(units = 6, activation = 'softmax')
This CNN has two convolutional layers followed by a max pooling oper-
ation, a fully connected layer, and an output layer. One important thing
to note is that we have specified a name for each layer with the
name parameter. For example, the first convolution’s name is conv1, the
second one is conv2, and the fully connected layer was named hidden1.
Those names must be unique because they will be used to select specific
layers to freeze and unfreeze.
If we print the model’s summary (Figure 9.5) we see that in total it has
9,054 trainable parameters and 0 non-trainable parameters. This
means that all the parameters of the network will be updated during the
gradient descent procedure, as usual.
# Print summary.
summary(model)
The next code will compile the model and initiate the training phase.
# Compile model.
model %>% compile(
loss = 'sparse_categorical_crossentropy',
optimizer = optimizer_sgd(lr = 0.01),
metrics = c("accuracy")
)
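The fit() call is not shown in this excerpt; a minimal sketch consistent with
the description (the epoch count, batch size, and validation split are
assumptions) is:

# Train the user-independent model on all non-target users.
history <- model %>% fit(
  train.x, train.y,
  epochs = 50,
  batch_size = 32,
  validation_split = 0.10
)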
plot(history)
Figure 9.6 shows a plot of the loss and accuracy during training. Then,
we save the model so we can load it later. Let’s also estimate the model’s
performance on the target user test set.
# Save model.
save_model_hdf5(model, "user-independent.hdf5")
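The evaluation call and the code that creates the adaptive model are not
shown above. A sketch of both follows; it reloads the saved model and
freezes the first convolutional layer (named conv1), whose 104 parameters
will then remain fixed. The target.test.y name is an assumption.

# Performance of the user-independent model on the target user's test set.
model %>% evaluate(target.test.x, target.test.y)

# Build the user-adaptive model: reload the saved model and freeze 'conv1'.
adaptive.model <- load_model_hdf5("user-independent.hdf5")
freeze_weights(adaptive.model, from = "conv1", to = "conv1")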
summary(adaptive.model)
After printing the summary (Figure 9.7), note that the number of train-
able and non-trainable parameters has changed. Now, the non-
trainable parameters are 104 (before they were 0). These 104 parameters
correspond to the first convolutional layer but this time they will not be
updated during the gradient descent training phase.
The following code will retrain the model using the adaptive data but
keeping the first convolutional layer fixed.
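A sketch of that retraining step follows (the epoch count and batch size are
assumptions; after freezing layers, the model must be compiled again before
fitting):

# Recompile the adaptive model and retrain it on the target user's adaptive data.
adaptive.model %>% compile(
  loss = 'sparse_categorical_crossentropy',
  optimizer = optimizer_sgd(lr = 0.01),
  metrics = c("accuracy")
)

history <- adaptive.model %>% fit(
  target.adaptive.x, target.adaptive.y,
  epochs = 50,
  batch_size = 8,
  validation_split = 0 # No validation data; the adaptive set is very small.
)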
Note that this time the validation_split was set to 0. This is because
the target user data set is very small, so there is not enough data to
use as a validation set. One possible approach to overcome this is to
leave a percentage of users out when building the train set for the user-
independent model. Then, use those left-out users to find which are
the most appropriate layers to keep frozen. Once you are happy with
the results, evaluate the model on the target user.
9.5 Summary
Many real-life scenarios involve multi-user settings. That is, the system
heavily depends on the specific behavior of a given target user. This
chapter covered different types of models that can be used to evaluate
the performance of a system in such a scenario.
10 Detecting Abnormal Behaviors
Abnormal data points are instances that are rare or do not occur very
often. They are also called outliers. Some examples include illegal bank
transactions, defective products, natural disasters, etc. Detecting abnor-
mal behaviors is an important topic in the fields of health care, ecology,
economy, psychology, and so on. For example, abnormal behaviors in
wildlife creatures can be an indication of abrupt changes in the environ-
ment and rare behavioral patterns in a person may be an indication of
health deterioration.
Anomaly detection can be formulated as a binary classification task and
solved by training a classifier to distinguish between normal and abnor-
mal instances. The problem with this approach is that anomalous points
are rare and there may not be enough to train a classifier. This can also
lead to class imbalance problems. Furthermore, the models should be
able to detect abnormal points even if they are very different from the
training data. To address those issues, several anomaly detection meth-
ods have been developed over the years and this chapter introduces two
of them: Isolation Forests and autoencoders.
This chapter starts by explaining how Isolation Forests work and then,
an example of how to apply them for abnormal trajectory detection is
presented. Next, a method (ROC curve) to evaluate the performance of
such models is described. Finally, another method called autoencoder
that can be used for anomaly detection is explained and applied to the
abnormal trajectory detection problem.
0.51, 1.6, 1.7, and 1.8. The code to reproduce this example is in the script
example_isolate_point.R. If we look at the highlighted normal instance we
can see that it took 8 partitions to isolate it.
Instead of generating a single tree, we can generate an ensemble of 𝑛
trees and average their path lengths. Figure 10.2 shows the average path
length for the same previous normal and anomalous instances as the
number of trees in the ensemble is increased.
After 200 trees, the average path length of the normal instance starts to
converge to 8.7 and the path length of the anomalous one converges to
3.1. This shows that anomalies have shorter path lengths on average.
In practice, an Isolation Tree is recursively grown until a predefined
maximum height is reached (more on this later), or when all instances
are isolated, or all instances in a partition have the same values. Once
all Isolation Trees in the ensemble (Isolation Forest) are generated, the
instances can be sorted according to their average path length to the
root. Then, instances with the shorter path lengths can be marked as
anomalies.
Instead of directly using the average path lengths for deciding whether
or not an instance is an anomaly, the authors of the method proposed an
anomaly score that is between 0 and 1. The reason is that this
score is easier to interpret since it’s normalized. The closer the anomaly
score is to 1 the more likely the instance is an anomaly. Instances with
anomaly scores << 0.5 can be marked as normal. The anomaly score
for an instance 𝑥 is computed with the formula:
𝑠(𝑥) = 2^(−𝐸(ℎ(𝑥))/𝑐(𝑛))        (10.1)
where ℎ(𝑥) is the path length of 𝑥 to the root of a given tree and 𝐸(ℎ(𝑥))
is the average of the path lengths of 𝑥 across all trees in the ensemble.
𝑛 is the number of instances in the train set. 𝑐(𝑛) is the average path
length of an unsuccessful search in a binary search tree:
𝑐(𝑛) = 2𝐻(𝑛 − 1) − 2(𝑛 − 1)/𝑛, where 𝐻(𝑖) is the harmonic number and can
be estimated as ln(𝑖) + 0.5772156649 (the Euler–Mascheroni constant).
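To make the formula concrete, here is a small R sketch of the score
computation (the function names are illustrative, not from the book's
scripts):

# c(n): average path length of an unsuccessful search in a binary search tree.
avg.bst.path <- function(n){
  H <- function(i) log(i) + 0.5772156649 # Harmonic number approximation.
  2 * H(n - 1) - (2 * (n - 1) / n)
}

# Anomaly score (Equation 10.1) from the average path length E(h(x)).
anomaly.score <- function(avg.path.length, n){
  2^(-avg.path.length / avg.bst.path(n))
}

# Shorter average path lengths produce scores closer to 1.
anomaly.score(3.1, 256) > anomaly.score(8.7, 256)
#> [1] TRUE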
library(R.matlab)
# Read data.
df <- readMat("../fishDetections_total3102.mat")$fish.detections
We use the dim() function to print the dimensions of the array. From
the output, we can see that there are 3102 individual trajectories and
each trajectory has 7 attributes. Let's explore the contents of a
single trajectory. The following code snippet extracts the first trajectory
and prints its structure.
The bounding box represents the square region where the fish was de-
tected in the video footage. Figure 10.5 shows an example of a fish and
its bounding box (not from the original dataset but for illustration pur-
pose only). Also note that the dataset does not contain the images but
only the bounding boxes’ coordinates.
Each trajectory has a different number of video frames. We can get the
frame count by inspecting the length of one of the coordinates.
FIGURE 10.5 Fish bounding box (in red). (Author: Nick Hobgood.
Source: wikimedia.org (CC BY-SA 3.0) [https://creativecommons.org/li
censes/by-sa/3.0/legalcode]).
The first trajectory has 37 frames but on average, they have 10 frames.
For our analyses, we only include trajectories with a minimum of 10
frames since it may be difficult to extract patterns from shorter paths.
Furthermore, we are not going to use the bounding boxes themselves but
the center point of the box.
At this point, it would be a good idea to plot the data to see what it looks like. To
do so, I will use the anipaths package [Scharf, 2020] which has a function
to animate trajectories! I will not cover the details here on how to use
the package but the complete code is in the same script visualize_fish.R.
The output result is in the form of an ‘index.html’ file that contains the
interactive animation. For simplicity, I only selected 50 and 10 normal
and abnormal trajectories (respectively) to be plotted. Figure 10.6 shows
the resulting plot. The plot also includes some controls to play, pause,
change the speed of the animation, etc.
The ‘normal’ and ‘abnormal’ labels were determined by visual inspec-
tion by experts. The abnormal cases include events such as predator
avoidance and aggressive movements (due to another fish or because of
being frightened).
The x and y coordinates of the center points from a given trajectory trj
for all time frames will be stored in x.coord and y.coord. The next line
‘shifts’ the frame numbers so they all start in 0 (to simplify preprocess-
ing). Finally we store the coordinates and frame times in a temporal
data frame for further preprocessing.
At this point we will use the trajr package [McLean and Volponi, 2018]
which includes functions to plot and perform operations on trajectories.
The TrajFromCoords() function can be used to create a trajectory object
from a data frame. Note that the data frame needs to have a prede-
fined order. That is why we first stored the x coordinates, then the y
coordinates, and finally the time in the tmp data frame.
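A sketch of those two steps is shown below (the column and variable names
are illustrative, and the exact arguments in the actual script may differ):

# Temporal data frame in the required order (x, y, time), then the trajectory
# object created at 1 frame per second.
tmp <- data.frame(x = x.coord, y = y.coord, time = times)
tmp.trj <- TrajFromCoords(tmp, timeCol = 3, fps = 1)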
The temporal data frame is passed as the first argument and the frames
per second is set to 1. Now we plot the tmp.trj object.
From Figure 10.7 we can see that there are big time gaps between some
points. This is because some time frames are missing. If we print the first
rows of the trajectory and look at the time, we see that for example, time
steps 4, 5, and 6 are missing.
head(tmp.trj)
#> x y time displacementTime polar displacement
Before continuing, it would be a good idea to try to fill those gaps. The
function TrajResampleTime() does exactly that by applying linear interpo-
lation along the trajectory.
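A minimal call could look like this (the step time of 1 matches the 1
frame-per-second setting, but is an assumption):

# Resample at regular 1-second steps; gaps are filled by linear interpolation.
resampled <- TrajResampleTime(tmp.trj, stepTime = 1)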
If we plot the resampled trajectory (Figure 10.8) we will see how the
missing points were filled.
FIGURE 10.8 The original trajectory (circles) and after filling the gaps
with linear interpolation (crosses).
We do the feature extraction for each trajectory and save the results as
a .csv file fishFeatures.csv which is already included in the dataset. Let’s
read and print the first rows of the dataset.
# Read dataset.
dataset <- read.csv("fishFeatures.csv", stringsAsFactors = T)
Each row represents one trajectory. We can use the table() function to
get the counts for ‘normal’ and ‘abnormal’ cases.
table(dataset$label)
#> abnormal normal
#> 54 1093
In Figure 10.9 we see that several abnormal points are on the right-hand
side, but many others are in the same space as the normal points, so it's
time to train an Isolation Forest and see to what extent it can detect
the abnormal cases!
One of the nice things about Isolation Forest is that it does not need
examples of the abnormal cases during training. If we want, we can also
include the abnormal cases but since we don’t have many we will reserve
them for the test set. The script isolation_forest_fish.R contains the
code to train the model. We will split the data into a train set (80%)
consisting only of normal instances and a test set with both normal and
abnormal instances. The train set is stored in the data frame train.normal
and the test set in test.all. Since the method is based on trees, we don’t
need to normalize the data.
First, we need to define the parameters of the Isolation Forest. We can
do so by passing the values at creation time.
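The creation code is not shown in this excerpt. Assuming the solitude
package (whose isolationForest class exposes the sample_size, num_trees, and
nproc parameters mentioned below), it would look roughly like this:

library(solitude)

# Define the Isolation Forest (the variable name is illustrative).
m.iforest <- isolationForest$new(sample_size = 256,
                                 num_trees = 100,
                                 nproc = 1)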
As suggested in the original paper [Liu et al., 2008], the sampling size is
set to 256 and the number of trees to 100. The nproc parameter specifies
the number of CPU cores to use. I set it to 1 to ensure we get reproducible
results.
Now we can train the model with the train set. The first two columns
are removed since they correspond to the trajectories ids and class label.
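Under the same assumption, the training call is a one-liner:

# Fit the forest on the normal train set, dropping the id and label columns.
m.iforest$fit(train.normal[, -c(1,2)])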
Once the model is trained, we can start making predictions. Let's start
by making predictions on the train set (later we'll do it on the test set).
Since the train set only consists of normal instances, we need to find the
highest anomaly score so that we can set a threshold to detect the abnormal
cases. The following code will print the highest anomaly scores.
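A sketch of how those scores can be obtained and turned into a threshold
(variable names are illustrative):

# Anomaly scores on the (all-normal) train set.
train.scores <- m.iforest$predict(train.normal[, -c(1,2)])
# Inspect the highest scores and keep the maximum as a candidate threshold.
head(sort(train.scores$anomaly_score, decreasing = TRUE))
threshold <- max(train.scores$anomaly_score)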
Now, we predict the anomaly scores on the test set and if the score is >
𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 then we classify that point as abnormal. The predicted.labels
array will contain 0s and 1s. A 1 means that the instance is abnormal.
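A sketch of the test-set predictions:

# Score the test set and flag instances above the threshold as abnormal (1).
test.scores <- m.iforest$predict(test.all[, -c(1,2)])
predicted.labels <- as.integer(test.scores$anomaly_score > threshold)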
Now that we have the predicted labels we can compute some performance
metrics.
#> Reference
#> Prediction 0 1
#> 0 218 37
#> 1 0 17
# Print sensitivity
cm$byClass["Sensitivity"]
#> Sensitivity
#> 0.3148148
#> Reference
#> Prediction 0 1
#> 0 206 8
#> 1 12 46
This time we were able to identify 46 of the abnormal cases! This gives a
sensitivity of 46/54 = 0.85 which is much better than before. However,
nothing is for free. If we look at the normal class, this time we had 12
misclassified points (false positives).
library(PRROC)
roc_obj <- roc.curve(scores.class0 = test.scores$anomaly_score,
weights.class0 = gt.all,
curve = TRUE,
rand.compute = TRUE)
FIGURE 10.10 ROC curve and AUC. The dashed line represents a
random model.
Here we can see how the sensitivity and FPR increase as the threshold
decreases. In the best case we want a sensitivity of 1 and an FPR of 0. This
ideal point is located at the top-left corner; our model does not reach that
level of performance, but it falls only a bit short. The dashed diagonal line
is the curve for a random model. We can also access the thresholds table:
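The table is stored in the curve element of the object returned by
roc.curve():

# Each row: false positive rate, sensitivity, and the corresponding threshold.
head(roc_obj$curve)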
The first column is the FPR, the second column is the sensitivity, and
the last column is the threshold. Choosing the best threshold is not
straightforward and will depend on the compromise we want to have
between sensitivity and FPR.
Note that the plot also prints an AUC = 0.963. This is known as the
Area Under the Curve (AUC) and as the name implies, it is the area
under the ROC curve. A perfect model will have an AUC of 1.0. Our
model achieved an AUC of 0.963 which is pretty good. A random model
will have an AUC around 0.5. A value below 0.5 means that the model is
performing worse than random. The AUC is a performance metric that
measures the quality of a model regardless of the selected threshold and
is typically presented in addition to accuracy, recall, precision, etc.
If someone tells you something negative about yourself (e.g., that you
don’t play football well), assume that they have an AUC below 0.5
(worse than random). At least, that’s what I do to cope with those
situations. (If you invert the predictions of a binary classifier that does
worse than random you will get a classifier that is better than random).
10.3 Autoencoders
In its simplest form, an autoencoder is a neural network whose output
layer has the same shape as the input layer. If you are not familiar
with artificial neural networks, you can take a look at chapter 8. An
autoencoder will try to learn how to generate an output that is as similar
as possible to its input. As with lossy file compression, there is no
guarantee that the reconstructed file will be exactly the same as the
original. However, autoencoders have many applications including:
• Dimensionality reduction for visualization.
• Data denoising.
• Data generation (variational autoencoders).
• Anomaly detection (this is what we are interested in!).
Recall that when training a neural network we need to define a loss
function. The loss function captures how well the network is learning.
It measures how different the predictions are from the true expected
outputs. In the context of autoencoders, this difference is known as the
reconstruction error and can be measured using the mean squared
error (similar to regression).
keras_autoencoder_fish.R
# Create a sequential model.
autoencoder <- keras_model_sequential()

autoencoder %>%
  layer_dense(units = 32, activation = 'relu',
              input_shape = ncol(train.normal)-2) %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 8, activation = 'relu') %>%
  layer_dense(units = 16, activation = 'relu') %>%
  layer_dense(units = 32, activation = 'relu') %>%
  # Output layer: same size as the input, linear activation.
  layer_dense(units = ncol(train.normal)-2, activation = 'linear')
This is a normal neural network with an input layer having the same
number of units as number of features (8). This network has 5 hidden
layers of size 32, 16, 8, 16, and 32, respectively. The output layer has 8
units (the same as the input layer). All activation functions are RELU’s
except the last one which is linear because the network should be able to
produce any number as output. Now we can compile and fit the model.
We set mean squared error (MSE) as the loss function. We use the normal
instances in the train set (train.normal) as the input and expected output.
The validation split is set to 10% so we can plot the reconstruction error
(loss) on unseen instances. Finally, the model is trained for 100 epochs.
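The compile and fit calls are omitted in this excerpt; a sketch consistent
with that description (the optimizer and batch size are assumptions)
follows:

autoencoder %>% compile(
  loss = 'mse',
  optimizer = 'adam',
  metrics = c('mse')
)

# Input and expected output are the same: the normal training instances.
history <- autoencoder %>% fit(
  as.matrix(train.normal[, -c(1,2)]),
  as.matrix(train.normal[, -c(1,2)]),
  epochs = 100,
  batch_size = 32,
  validation_split = 0.10
)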
From Figure 10.12 we can see that as the training progresses, the loss
and the MSE decrease.
We can now compute the MSE on the normal and abnormal test sets.
The test.normal data frame only contains normal test instances and
test.abnormal only contains abnormal test instances.
Clearly, the MSE of the normal test set is much lower than the abnormal
test set. This means that the autoencoder had a difficult time trying to
reconstruct the abnormal points because it never saw similar ones before.
To find a good threshold we can start by analyzing the reconstruction
errors on the train set. First, we need to get the predictions.
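A sketch of how errors.train.normal can be computed: the reconstruction
error of each instance is the mean squared difference between the instance
and its reconstruction.

# Reconstruct the normal train instances and compute per-instance errors.
train.pred <- predict(autoencoder, as.matrix(train.normal[, -c(1,2)]))
errors.train.normal <- rowMeans((as.matrix(train.normal[, -c(1,2)]) - train.pred)^2)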
mean(errors.train.normal)
#> [1] 0.8113273
quantile(errors.train.normal)
#> 0% 25% 50% 75% 100%
#> 0.0158690 0.2926631 0.4978471 0.8874694 15.0958992
The mean reconstruction error of the normal instances in the train set is
0.811. If we look at the quantiles, we can see that most of the instances
have an error of <= 0.887. With this information we can set threshold
<- 1.0. If the reconstruction error is > 𝑡ℎ𝑟𝑒𝑠ℎ𝑜𝑙𝑑 then we will consider
that point as an anomaly.
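A sketch of that step (the combined test set and variable names are
illustrative):

# Score all test instances and flag those whose error exceeds the threshold.
test.all <- rbind(test.normal, test.abnormal)
test.pred <- predict(autoencoder, as.matrix(test.all[, -c(1,2)]))
errors.test <- rowMeans((as.matrix(test.all[, -c(1,2)]) - test.pred)^2)
predicted.labels <- as.integer(errors.test > threshold)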
#> Reference
#> Prediction 0 1
#> 0 202 8
#> 1 16 46
FIGURE 10.13 ROC curve and AUC. The dashed line represents a
random model.
From the ROC curve in Figure 10.13 we can see that the AUC was 0.93
which is lower than the 0.96 achieved by the Isolation Forest but with
some fine tuning and training for more epochs, the autoencoder should
be able to achieve similar results.
10.4 Summary
This chapter presented two anomaly detection models, namely Isolation
Forests and autoencoders. Examples of how those models can be used
for anomaly trajectory detection were also presented. This chapter also
introduced ROC curves and AUC which can be used to assess the per-
formance of a model.
• Isolation Forests work by generating random partitions of the fea-
tures until all instances are isolated.
• Abnormal points are more likely to be isolated during the first parti-
tions.
• The average tree path length of abnormal points is smaller than that
of normal points.
• An anomaly score that ranges between 0 and 1 is calculated based
on the path length and the closer to 1 the more likely the point is an
anomaly.
• A ROC curve is used to visualize the sensitivity and false positive
rate of a model for different thresholds.
• The area under the curve (AUC) can be used to summarize the perfor-
mance of a model.
• A simple autoencoder is an artificial neural network whose output
layer has the same shape as the input layer.
• Autoencoders are used to encode the data into a lower dimension from
which it can then be reconstructed.
• The reconstruction error (loss) is a measure of how distant a pre-
diction is from the ground truth and can be used as an anomaly score.
A Setup Your Environment
The examples in this book were tested with R 4.0.5. You can get the
latest R version from its official website: www.r-project.org/
As IDE, I use RStudio (https://rstudio.com/) but you can use your
favorite one. Most of the code examples in this book rely on datasets.
The following two sections describe how to get and install the datasets
and source code. If you want to try out the examples, I recommend that
you follow the instructions in the following two sections.
The last section includes instructions on how to install Keras and Tensor-
Flow, which are the required libraries to build and train deep learning
models. Deep learning is covered in chapter 8. Before that, you don’t
need those libraries.
You can get the code using git or if you are not familiar with it, click on
the ‘Code’ button and then ‘Download zip’. Then, extract the file into a
local directory of your choice.
There is a directory for each chapter and two additional directories:
auxiliary_functions/ and install_functions/.
The install_functions/ directory contains a script that checks whether the
required packages are installed and installs them if they are not present.
This is just a convenient way to install
everything at once but you can always install each package individually
with the usual install.packages() method.
When running the examples, it is assumed that the working directory
is the same directory where the script resides. For example, if you want to try
indoor_classification.R, and that script is located in C:/code/Predicting
Behavior with Classification Models/ then, your working directory should
be C:/code/Predicting Behavior with Classification Models/. In Windows,
if RStudio is not already open, double-clicking an R script will launch
RStudio (if it is set as the default program) and set the working directory
accordingly.
You can check your current working directory by typing getwd() and you
can set your working directory with setwd(). Alternatively, in RStudio,
you can set your working directory in the menu bar ‘Session’ -> ‘Set
Working Directory’ -> ‘To Source File Location’.
To run the shiny apps (https://shiny.rstudio.com/), the following packages
need to be installed:
install.packages("shiny")
install.packages("shinydashboard")
TensorFlow has two main versions: a CPU and a GPU version. The GPU
version takes advantage of the capabilities of some video cards to per-
form faster operations. The examples in this book can be run with both
versions. The following instructions apply to the CPU version. Installing
the GPU version requires some platform-specific details. I recommend that
you first install the CPU version and, if you want or need to perform
faster computations, then go with the GPU version.
Installing Keras with TensorFlow (CPU version) as backend takes four
simple steps:
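The four steps themselves are not listed in this excerpt. A typical sequence
with the keras R package (which can also set up a Python environment such as
Anaconda/Miniconda for you, see https://www.anaconda.com) is sketched below:

# 1. Install the keras R package.
install.packages("keras")
# 2. Load it.
library(keras)
# 3. Install Keras and TensorFlow (CPU version) into a Python environment.
install_keras()
# 4. Verify the installation (see the code below).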
library(tensorflow)
tf$constant("Hello World")
#> tf.Tensor(b'Hello World', shape=(), dtype=string)
The first time in a session that you run TensorFlow related code with the
CPU version, you may get warning messages like the following, which
you can safely ignore.
#> tensorflow/stream_executor/platform/default/dso_loader.cc:55]
#> Could not load dynamic library 'cudart64_101.dll';
#> dlerror: cudart64_101.dll not found
If you want to install the GPU version, first, you need to make sure
you have a compatible video card. More information on how to install
the GPU version is available here: https://keras.rstudio.com/reference/install_keras.html
and here: https://tensorflow.rstudio.com/installation/gpu/local_gpu/
B Datasets
This Appendix has a list with a description of all the datasets used in this
book. A compressed file with a compilation of most of the datasets can
be downloaded here: https://github.com/enriquegit/behavior-crc-datasets
I recommend that you download the datasets compilation file and extract
its contents to a local directory. Due to large file sizes or license
restrictions, not all of the datasets are included in the compiled set.
But you can download them separately. Even though a dataset may not
be included in the compiled set, it will have a corresponding directory
with a README file with instructions on how to obtain it.
Each dataset in the following list states whether or not it is included in
the compiled set. The datasets are ordered alphabetically.
B.2 DEPRESJON
Included: Yes.
This dataset contains motor activity recordings of 23 unipolar and
bipolar depressed patients and 32 healthy controls. Motor activity was
monitored with an actigraph watch worn at the right wrist (Actiwatch,
Cambridge Neurotechnology Ltd, England, model AW4). The sampling
frequency was 32 Hz. The device uses the inertial sensors data to com-
pute an activity count every minute which is stored as an integer value
in the memory unit of the actigraph watch. The number of counts is pro-
portional to the intensity of the movement. The dataset also contains
some additional information about the patients and the control group.
For more details please see Garcia-Ceja et al. [2018b].
B.3 ELECTROMYOGRAPHY
Included: Yes.
This dataset was made available by Kirill Yashuk. The data was collected
using an armband device that has 8 sensors placed on the skin surface
that measure electrical activity from the right forearm at a sampling
rate of 200 Hz. A video of the device can be seen here: https://youtu.be/OuwDHfY2Awg.
on computer keyboard’, ‘brush teeth’, ‘wash hands’, ‘eat chips’, and ‘watch
t.v’. Each volunteer performed each activity for approximately 3 minutes.
If the activity lasted less than 3 minutes, another session was recorded
until completing the 3 minutes. The data were collected with a wrist-
band (Microsoft Band 2) and a cellphone. The wrist-band was used to
collect accelerometer data and was worn by the volunteers in their dom-
inant hand. The accelerometer sensor returns values from the x, y, and
z axes, and the sampling rate was set to 31 Hz. A cellphone was used
to record environmental sound with a sampling rate of 8000 Hz and it
was placed on a table in the same room the user was performing the ac-
tivity. To preserve privacy, the dataset does not contain the raw audio
recordings but extracted features: sixteen features from the accelerometer
sensor and 12 Mel frequency cepstral coefficients from the audio
recordings. For more information, please see Garcia-Ceja et al. [2018a].
scans and records the MAC address and signal strength of the nearby
access points. A delay of 500 ms is set between scans. For each location,
approximately 3 minutes of data were collected while the user walked
around the specific location. The data includes four different locations:
‘bedroomA’, ‘bedroomB’, ‘tv room’ and the ‘lobby’. To preserve privacy,
the MAC addresses are encoded as integer numbers. For more informa-
tion, please, see Garcia and Brena [2012].
B.12 SMILES
Included: No.
This dataset contains color face images of 64 × 64 pixels and is pub-
lished here: http://conradsanderson.id.au/lfwcrop/. This is a cropped
version [Sanderson and Lovell, 2009] of the Labeled Faces in the Wild
(LFW) database [Huang et al., 2008]. Please, download the color version
(lfwcrop_color.zip) and copy all ppm files into the faces/ directory.
A subset of the database was labeled by Arigbabu et al. [2016], Arigbabu
[2017]. The labels are provided as two text files (SMILE_list.txt, NON-
SMILE_list.txt), each, containing the list of files that correspond to
smiling and non-smiling faces (CC BY 4.0 https://creativecommons.org/licenses/by/4.0/legalcode).
The smiling set has 600 pictures and the
non-smiling has 603 pictures.