Context Based Emotion Recognition Using EMOTIC Dataset
Abstract—In our everyday lives and social interactions we often try to perceive the emotional states of people. There has been a lot of research in providing machines with a similar capacity for recognizing emotions. From a computer vision perspective, most previous efforts have focused on analyzing facial expressions and, in some cases, also the body pose. Some of these methods work remarkably well in specific settings; however, their performance is limited in natural, unconstrained environments. Psychological studies show that the scene context, in addition to facial expression and body pose, provides important information for our perception of people's emotions. However, the processing of context for automatic emotion recognition has not been explored in depth, partly due to the lack of proper data. In this paper we present EMOTIC, a dataset of images of people in a diverse set of natural situations, annotated with their apparent emotion. The EMOTIC dataset combines two different types of emotion representation: (1) a set of 26 discrete categories, and (2) the continuous dimensions Valence, Arousal, and Dominance. We also present a detailed statistical and algorithmic analysis of the dataset along with an annotators' agreement analysis. Using the EMOTIC dataset we train different CNN models for emotion recognition, combining the information of the bounding box containing the person with the contextual information extracted from the scene. Our results show how scene context provides important information for automatically recognizing emotional states, and they motivate further research in this direction.
1 INTRODUCTION
Fig. 1. How is this kid feeling? Try to recognize his emotional states from
the person bounding box, without scene context.
and the emotions felt by observers of the image. For example, in the image of Fig. 2b, we see a kid who seems to be annoyed for having an apple instead of chocolate and another who seems happy to have chocolate. However, as observers, we might not have any of those sentiments when looking at the photo. Instead, we might think the situation is not fair and feel disapproval. Also, if we see an image of an athlete that has lost a match, we can recognize that the athlete feels sad. However, an observer of the image may feel happy if the observer is a fan of the team that won the match.

2.1 Emotion Recognition Datasets
Most of the existing datasets for emotion recognition using computer vision are centered on facial expression analysis. For example, the GENKI database [19] contains frontal face images of a single person with different illumination, geographic, personal, and ethnic settings. Images in this dataset are labelled as smiling or non-smiling. Another common facial expression analysis dataset is the ICML Face-Expression Recognition dataset [20], which contains 28,000 images annotated with 6 basic emotions and a neutral category. On the other hand, the UCDSEE dataset [21] has a set of 9 emotion expressions acted by 4 persons. The lab setting is strictly kept the same in order to focus mainly on the facial expression of the person.

Dynamic body movement is also an essential source for estimating emotion. Studies such as [22], [23] establish the relationship between affect and body posture using as ground truth the base rate of human observers. The data consist of a spontaneous set of images acquired under a restrictive setting (people playing Wii games). The GEMEP database [24] is multi-modal (audio and video) and has 10 actors playing 18 affective states. The dataset has videos of actors showing emotions through acting; body pose and facial expression are combined.

The Looking at People (LAP) challenges and competitions [25] involve specialized datasets containing images, sequences of images, and multi-modal data. The main focus of these datasets is the complexity and variability of human body configuration, and they include data related to personality traits (spontaneous), gesture recognition (acted), apparent age recognition (spontaneous), cultural event recognition (spontaneous), action/interaction recognition, and human pose recognition (spontaneous).

The Emotion Recognition in the Wild (EmotiW) challenges [26] host 3 databases: (1) the AFEW database [27], which focuses on emotion recognition from video frames taken from movies and TV shows, where the actions are annotated with attributes like name, age of actor, age of character, pose, gender, expression of person, the overall clip expression, and the 6 basic emotions plus a neutral category; (2) the SFEW, a subset of the AFEW database containing images of face-frames annotated specifically with the 6 basic emotions and a neutral category; and (3) the HAPPEI database [28], which addresses the problem of group level emotion estimation. Thus, [28] offers a first attempt to use context for the problem of predicting happiness in groups of people.

Finally, the COCO dataset has recently been annotated with object attributes [29], including some emotion categories for people, such as happy and curious. These attributes show some overlap with the categories that we define in this paper. However, COCO attributes are not intended to be exhaustive for emotion recognition, and not all the people in the dataset are annotated with affect attributes.

3 EMOTIC DATASET
The EMOTIC dataset is a collection of images of people in unconstrained environments annotated according to their apparent emotional states. The dataset contains 23,571 images and 34,320 annotated people. Some of the images were manually collected from the Internet using the Google search engine. For that we used a combination of queries containing various places, social environments, different activities, and a variety of keywords on emotional states. The rest of the images belong to two public benchmark datasets: COCO [30] and Ade20k [31]. Overall, the images show a wide diversity of contexts, containing people in different places, in different social settings, and doing different activities.

Fig. 2 shows three examples of annotated images in the EMOTIC dataset. Images were annotated using Amazon Mechanical Turk (AMT). Annotators were asked to label each image according to what they think the people in the images are feeling. Notice that we have the capacity of making reasonable guesses about other people's emotional state due to our capacity for empathy, putting ourselves into another's situation, and also because of our common sense knowledge and our ability to reason about visual information. For example, in Fig. 2b, the person is performing an activity that requires Anticipation to adapt to the trajectory. Since he is doing a thrilling activity, he seems excited about it and he is engaged or focused in this activity. In Fig. 2c, the kid feels a strong desire (yearning) for eating the chocolate instead of the apple. Because of his situation we can interpret his facial expression as disquietment and annoyance. Notice that images are also annotated according to the continuous dimensions Valence, Arousal, and Dominance. We describe the emotion annotation modalities of the EMOTIC dataset and the annotation process in Sections 3.1 and 3.2, respectively.

After the first round of annotations (1 annotator per image), we divided the images into three sets: Training (70 percent), Validation (10 percent), and Testing (20 percent), maintaining a similar affective category distribution across the different sets. After that, Validation and Testing were annotated by 4 and 2 extra annotators, respectively. As a consequence, images in the Validation set are annotated by a total of 5 annotators, while images in the Testing set are annotated by 3 annotators (these numbers can vary slightly for some images since we removed noisy annotations).

We used the annotations from the Validation set to study the consistency of the annotations across different annotators. This study is shown in Section 3.3. The data statistics and algorithmic analysis of the EMOTIC dataset are detailed in Sections 3.4 and 3.5, respectively. A simple way to sanity-check such a split is sketched below.
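As a rough illustration of how the per-category balance of such a split can be checked, the snippet below randomly partitions person annotations 70/10/20 and reports the category frequencies in each subset. This is a minimal sketch with hypothetical field names (`annotations`, `categories`), not the exact procedure used to build EMOTIC.

```python
import random
from collections import Counter

def split_and_check(annotations, categories, seed=0):
    """Randomly split annotated people 70/10/20 and report per-category
    frequencies so the affective distribution can be compared across sets.
    `annotations` is assumed to be a list of dicts with a 'categories' list."""
    rng = random.Random(seed)
    shuffled = annotations[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    splits = {
        "train": shuffled[: int(0.7 * n)],
        "val": shuffled[int(0.7 * n): int(0.8 * n)],
        "test": shuffled[int(0.8 * n):],
    }
    for name, subset in splits.items():
        counts = Counter(c for ann in subset for c in ann["categories"])
        freqs = {c: counts[c] / max(len(subset), 1) for c in categories}
        print(name, {c: round(f, 3) for c, f in freqs.items()})
    return splits
```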
3.1 Emotion Representation
The EMOTIC dataset combines two different types of emotion representation:
Continuous Dimensions. Images are annotated according to the VAD model [7], which represents emotions by a combination of 3 continuous dimensions: Valence, Arousal, and Dominance.
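For concreteness, a single annotated person can be pictured as a small record holding the bounding box, the selected discrete categories, and the three continuous values. The sketch below is purely illustrative: the field names are hypothetical and the 1-10 range for the continuous dimensions is an assumption about the annotation scale, not a specification of the released format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record for one annotated person in an EMOTIC-style setup.
@dataclass
class PersonAnnotation:
    image_file: str
    bbox: Tuple[int, int, int, int]                        # (x1, y1, x2, y2) of the visible body
    categories: List[str] = field(default_factory=list)    # subset of the 26 discrete labels
    valence: float = 5.0                                    # assumed 1-10 scale
    arousal: float = 5.0
    dominance: float = 5.0

example = PersonAnnotation(
    image_file="example.jpg",
    bbox=(120, 40, 360, 480),
    categories=["Anticipation", "Engagement", "Excitement"],
    valence=7.0, arousal=8.0, dominance=6.0,
)
```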
Fig. 3. Examples of annotated people in the EMOTIC dataset for each of the 26 emotion categories (Table 1). The person in the red bounding box is annotated with the corresponding category.
clearly looks Engaged in the activity. He also seems Confident in getting the ball.

After these observations we conducted different quantitative analyses of the annotation agreement. We focused first on analyzing the agreement level in the category annotation. Given a category assigned to a person in an image, we consider as an agreement measure the number of annotators agreeing on that particular category. Accordingly, we calculated, for each category and for each annotation in the validation set, the agreement amongst the annotators and sorted those values across categories. Fig. 7 shows the distribution of the percentage of annotators agreeing on an annotated category across the validation set.

We also computed the agreement between all the annotators for a given person using Fleiss' Kappa ($\kappa$). Fleiss' Kappa is a common measure to evaluate the agreement level among a fixed number of annotators when assigning categories to data. In our case, given a person to annotate, there is a subset of 26 categories. If we have $N$ annotators per image, each of the 26 categories can be selected by $n$ annotators, where $0 \leq n \leq N$. Given an image, we first compute the Fleiss' Kappa for each emotion category, and then the general agreement level for this image is computed as the average of these Fleiss' Kappa values across the different emotion categories. We obtained that more than 50 percent of the images have $\kappa > 0.30$. Fig. 8a shows the distribution of kappa values for all the annotated people in the validation set, sorted in decreasing order. Random annotations or total disagreement produce $\kappa \approx 0$; in our case, $\kappa \approx 0.3$ on average, suggesting a significant agreement level even though the task of emotion recognition is subjective.

For the continuous dimensions, the agreement is measured by the standard deviation (SD) of the different annotations.
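As a concrete reference for the agreement measure, the snippet below implements the standard Fleiss' Kappa formula for a ratings table and shows it on a binary (selected / not selected) count table for one category. It is a minimal sketch of the statistic itself, not the authors' exact aggregation code; the input layout (counts per subject and category) is an assumption.

```python
import numpy as np

def fleiss_kappa(table):
    """Fleiss' Kappa for a (subjects x categories) table of rating counts.
    Each row must sum to the number of raters N."""
    table = np.asarray(table, dtype=float)
    n_sub, _ = table.shape
    n_rat = table.sum(axis=1)[0]                  # raters per subject (assumed constant)
    p_j = table.sum(axis=0) / (n_sub * n_rat)     # proportion of all ratings per category
    p_i = (np.square(table).sum(axis=1) - n_rat) / (n_rat * (n_rat - 1))
    p_bar, p_e = p_i.mean(), np.square(p_j).sum()
    return (p_bar - p_e) / (1.0 - p_e)

# Example: binary (selected / not selected) counts for one emotion category,
# 5 annotators, 4 annotated people. Averaging such values over the 26
# categories gives an overall agreement score for the image.
counts = np.array([[4, 1],   # 4 of 5 annotators selected the category
                   [5, 0],
                   [2, 3],
                   [1, 4]])
print(round(fleiss_kappa(counts), 3))
```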
TABLE 2. Instruction Summary for Each HIT
Fig. 9. Dataset statistics. (a) Number of people annotated for each emotion category; (b), (c) & (d) number of people annotated for every value of the three continuous dimensions, viz. Valence, Arousal & Dominance.
Overall, these observations suggest that some common sense knowledge patterns relating emotions and context could potentially be extracted automatically from the data.

4 CNN MODEL FOR EMOTION RECOGNITION IN SCENE CONTEXT
We propose a baseline CNN model for the problem of emotion recognition in context. The pipeline of the model is shown in Fig. 13 and is divided into three modules: body feature extraction, image (context) feature extraction, and fusion network. One module takes the whole image as input and generates scene-related features, another takes the visible body of the person and generates body-related features, and the third module combines these features to perform a fine-grained regression of the two types of emotion representation (Section 3.1).

The body feature extraction module takes the visible part of the body of the target person as input and generates body-related features. These features include important cues such as face and head aspects, pose, and body appearance. In order to capture these aspects, this module is pre-trained on ImageNet [40], an object-centric dataset that includes the category person.

The image feature extraction module takes the whole image as input and generates scene-context features. The fusion network combines both sets of features and outputs the estimation of the 26 discrete categories and the 3 continuous dimensions. The loss for this prediction is defined by $L = \lambda_{disc} L_{disc} + \lambda_{cont} L_{cont}$, where $L_{disc}$ and $L_{cont}$ represent the losses corresponding to learning the discrete categories and the continuous dimensions, respectively. The parameters $(\lambda_{disc}, \lambda_{cont})$ weight the contribution of each loss and are set empirically using the validation set.

Criterion for Discrete Categories ($L_{disc}$). The discrete category estimation is a multilabel problem with an inherent class imbalance issue, as the number of training examples is not the same for each class (see Fig. 9a).

In our experiments, we use a weighted Euclidean loss for the discrete categories. Empirically, we found the Euclidean loss to be more effective than the Kullback-Leibler divergence or a multi-class multi-classification hinge loss. More precisely, given a prediction $\hat{y}^{disc}$, the weighted Euclidean loss is defined as follows:

$$L^{2}_{disc}(\hat{y}^{disc}) = \sum_{i=1}^{26} w_i \left( \hat{y}^{disc}_i - y^{disc}_i \right)^2, \qquad (1)$$

where $\hat{y}^{disc}_i$ is the prediction for the $i$-th category and $y^{disc}_i$ is the ground-truth label. The parameter $w_i$ is the weight assigned to each category. Weight values are defined as $w_i = \frac{1}{\ln(c + p_i)}$, where $p_i$ is the probability of the $i$-th category and $c$ is a parameter that controls the range of valid values for $w_i$.
Using this weighting scheme, the values of $w_i$ are bounded as the number of instances of a category approaches 0. This is particularly relevant in our case, since we set the weights based on the occurrence of each category within each batch. Experimentally, we obtained better results using this approach than by setting global weights based on the entire dataset.
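To make the weighting concrete, here is a minimal NumPy sketch of the weighted Euclidean loss of Equation (1) with per-batch weights $w_i = 1/\ln(c + p_i)$; the value of $c$ and the batch layout are assumptions for illustration, not the exact settings used in the paper.

```python
import numpy as np

def category_weights(y_true, c=1.2, eps=1e-6):
    """Per-batch weights w_i = 1 / ln(c + p_i), where p_i is the empirical
    probability of category i in the current batch (c=1.2 is a hypothetical value)."""
    p = y_true.mean(axis=0)                  # shape (26,): category frequency in the batch
    return 1.0 / np.log(c + p + eps)

def weighted_euclidean_loss(y_pred, y_true, c=1.2):
    """Equation (1): sum_i w_i * (y_pred_i - y_true_i)^2, averaged over the batch."""
    w = category_weights(y_true, c)
    per_sample = np.sum(w * (y_pred - y_true) ** 2, axis=1)
    return per_sample.mean()

# Toy batch: 4 people, 26 binary category labels and continuous-valued predictions.
rng = np.random.default_rng(0)
y_true = (rng.random((4, 26)) > 0.8).astype(float)
y_pred = rng.random((4, 26))
print(round(weighted_euclidean_loss(y_pred, y_true), 4))
```

Note that with $c > 1$ the weights stay bounded even when a category does not occur in the batch, which is the behavior described above.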
Criterion for Continuous Dimensions ($L_{cont}$). We model the estimation of the continuous dimensions as a regression problem. Since multiple annotators annotate the data based on subjective evaluation, we compare the performance of two different robust losses: (1) a margin Euclidean loss $L^{2}_{cont}$, and (2) the Smooth L1 loss $SL1_{cont}$. The former defines a margin of error ($v_k$) within which the error is not taken into account when computing the loss. The margin Euclidean loss for the continuous dimensions is defined as:

$$L^{2}_{cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \left( \hat{y}^{cont}_k - y^{cont}_k \right)^2, \qquad (2)$$

where $\hat{y}^{cont}_k$ and $y^{cont}_k$ are the prediction and the ground truth for the $k$-th dimension, respectively, and $v_k \in \{0, 1\}$ is a binary weight representing the error margin: $v_k = 0$ if $|\hat{y}^{cont}_k - y^{cont}_k| < \theta$, and $v_k = 1$ otherwise. If a prediction falls within the error margin, i.e., the error is smaller than $\theta$, it does not contribute to updating the weights of the network.

The Smooth L1 loss uses the absolute error, switching to the squared error when the error is smaller than a threshold (set to 1 in our experiments). This loss has been widely used for object detection [44] and, in our experiments, has been shown to be less sensitive to outliers. Precisely, the Smooth L1 loss is defined as follows:

$$SL1_{cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \begin{cases} 0.5\, x_k^2, & \text{if } |x_k| < 1 \\ |x_k| - 0.5, & \text{otherwise,} \end{cases} \qquad (3)$$

where $x_k = \hat{y}^{cont}_k - y^{cont}_k$, and $v_k$ is a weight assigned to each of the continuous dimensions; it is set to 1 in our experiments.
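The following NumPy sketch mirrors Equations (2) and (3) for the three continuous dimensions. The margin value `theta` is a placeholder, since the paper sets the margin empirically, and the toy values are arbitrary.

```python
import numpy as np

def margin_euclidean_loss(y_pred, y_true, theta=0.1):
    """Equation (2): squared error per dimension, ignored when the absolute
    error is below the margin theta (theta=0.1 is a placeholder value)."""
    err = y_pred - y_true                       # shape (batch, 3): V, A, D
    v = (np.abs(err) >= theta).astype(float)    # binary margin weight v_k
    return np.mean(np.sum(v * err ** 2, axis=1))

def smooth_l1_loss(y_pred, y_true, weights=(1.0, 1.0, 1.0)):
    """Equation (3): Smooth L1 per dimension with per-dimension weights v_k
    (set to 1 in the paper's experiments)."""
    x = y_pred - y_true
    per_dim = np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)
    return np.mean(np.sum(np.asarray(weights) * per_dim, axis=1))

# Toy batch of VAD predictions and ground truth.
y_true = np.array([[0.7, 0.6, 0.5], [0.3, 0.8, 0.4]])
y_pred = np.array([[0.65, 0.9, 0.5], [0.1, 0.7, 0.45]])
print(round(margin_euclidean_loss(y_pred, y_true), 4),
      round(smooth_l1_loss(y_pred, y_true), 4))
```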
We train our recognition system end-to-end, learning the parameters jointly using stochastic gradient descent with momentum. The first two modules are initialized with models pre-trained on Places [37] and ImageNet [45], while the fusion network is trained from scratch. The batch size is set to 52, twice the number of discrete emotion categories. After testing multiple batch sizes (including 26, 52, 78, and 108), we found empirically that a batch size of 52 gives the best performance on the validation set.
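As an illustration of this two-branch design, the sketch below builds a body branch and an image branch from pre-trained backbones plus a small fusion network that jointly regresses the 26 categories and 3 continuous dimensions. It is a hedged approximation: the actual backbones, feature sizes, and fusion layers of the paper's model are not specified here, and torchvision's ResNet-18 is used only as a stand-in.

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoBranchEmotionNet(nn.Module):
    """Sketch of a body + image (context) fusion model; layer sizes are illustrative."""
    def __init__(self, n_categories=26, n_dimensions=3):
        super().__init__()
        self.n_categories = n_categories
        body = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # The paper initializes the scene branch from Places [37]; ImageNet weights
        # are used here only as a stand-in since torchvision does not ship Places.
        image = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        self.body_branch = nn.Sequential(*list(body.children())[:-1])    # body features
        self.image_branch = nn.Sequential(*list(image.children())[:-1])  # scene features
        self.fusion = nn.Sequential(                                      # trained from scratch
            nn.Linear(512 * 2, 256), nn.ReLU(),
            nn.Linear(256, n_categories + n_dimensions),
        )

    def forward(self, body_crop, full_image):
        b = self.body_branch(body_crop).flatten(1)
        i = self.image_branch(full_image).flatten(1)
        out = self.fusion(torch.cat([b, i], dim=1))
        return out[:, :self.n_categories], out[:, self.n_categories:]

model = TwoBranchEmotionNet()
scores, vad = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum, as described
```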
5 EXPERIMENTS
We trained four different instances of our CNN model, corresponding to the combinations of the two input types and the two continuous loss functions described in Section 4.1. The input types are body (i.e., the upper branch in Fig. 13), denoted by B, and body plus image (i.e., both branches shown in Fig. 13), denoted by B+I. The continuous loss types are denoted in the experiments by L2 for the Euclidean loss (Equation (2)) and SL1 for the Smooth L1 loss (Equation (3)).

Results for the discrete categories in the form of Average Precision (AP) per category (the higher, the better) are summarized in Table 3. Notice that the B+I model outperforms the B model in all categories except one. The combination of body and image features (the B+I (SL1) model) is better than the B model.

TABLE 3
Average Precision (AP) Obtained on the Test Set per Category

                           CNN Inputs and L_cont type
Emotion Categories       B (L2)   B (SL1)   B+I (L2)   B+I (SL1)
1. Affection             21.80    16.55     21.16      27.85
2. Anger                 06.45    04.67     06.45      09.49
3. Annoyance             07.82    05.54     11.18      14.06
4. Anticipation          58.61    56.61     58.61      58.64
5. Aversion              05.08    03.64     06.45      07.48
6. Confidence            73.79    72.57     77.97      78.35
7. Disapproval           07.63    05.50     11.00      14.97
8. Disconnection         20.78    16.12     20.37      21.32
9. Disquietment          14.32    13.99     15.54      16.89
10. Doubt/Confusion      29.19    28.35     28.15      29.63
11. Embarrassment        02.38    02.15     02.44      03.18
12. Engagement           84.00    84.59     86.24      87.53
13. Esteem               18.36    19.48     17.35      17.73
14. Excitement           73.73    71.80     76.96      77.16
15. Fatigue              07.85    06.55     08.87      09.70
16. Fear                 12.85    12.94     12.34      14.14
17. Happiness            58.71    51.56     60.69      58.26
18. Pain                 03.65    02.71     04.42      08.94
19. Peace                17.85    17.09     19.43      21.56
20. Pleasure             42.58    40.98     42.12      45.46
21. Sadness              08.13    06.19     10.36      19.66
22. Sensitivity          04.23    03.60     04.82      09.28
23. Suffering            04.90    04.38     07.65      18.84
24. Surprise             17.20    17.03     16.42      18.81
25. Sympathy             10.66    09.35     11.44      14.71
26. Yearning             07.82    07.40     08.34      08.34
Mean                     23.86    22.36     24.88      27.38

Results are shown for models whose input is just the body (B) and models whose input is both the body and the whole image (B+I). The type of $L_{cont}$ used is indicated in parentheses (L2 refers to Equation (2) and SL1 to Equation (3)).

Results for the continuous dimensions in the form of Average Absolute Error (AAE) per dimension (the lower, the better) are summarized in Table 4. In this case, all the models provide similar results and the differences are not significant.

TABLE 4
Average Absolute Error (AAE) Obtained on the Test Set per Continuous Dimension

                           CNN Inputs and L_cont type
Continuous Dimensions    B (L2)   B (SL1)   B+I (L2)   B+I (SL1)
Valence                  0.0537   0.0545    0.0546     0.0528
Arousal                  0.0600   0.0630    0.0648     0.0611
Dominance                0.0570   0.0567    0.0573     0.0579
Mean                     0.0569   0.0581    0.0589     0.0573

Results are shown for models whose input is just the body (B) and models whose input is both the body and the whole image (B+I). The type of $L_{cont}$ used is indicated in parentheses (L2 refers to Equation (2) and SL1 to Equation (3)).
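For reference, the per-category AP and per-dimension AAE reported in Tables 3 and 4 can be computed with standard tooling; the sketch below uses scikit-learn's average precision and a plain mean absolute error, under the assumption that predictions and ground truth are stored as simple arrays.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def per_category_ap(y_true, y_score):
    """Average Precision for each discrete category.
    y_true: (N, C) binary ground truth; y_score: (N, C) predicted scores."""
    return np.array([average_precision_score(y_true[:, i], y_score[:, i])
                     for i in range(y_true.shape[1])])

def per_dimension_aae(y_true, y_pred):
    """Average Absolute Error for each continuous dimension (V, A, D)."""
    return np.mean(np.abs(y_pred - y_true), axis=0)

# Toy example: 4 people, 3 categories (the same code works for the 26 EMOTIC labels).
cat_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]])
cat_score = np.array([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.7, 0.1, 0.6], [0.2, 0.6, 0.9]])
vad_true = np.array([[0.7, 0.6, 0.5], [0.3, 0.8, 0.4], [0.5, 0.5, 0.5], [0.6, 0.4, 0.7]])
vad_pred = vad_true + 0.05
print(per_category_ap(cat_true, cat_score), per_dimension_aae(vad_true, vad_pred))
```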
Fig. 14 shows a summary of the results obtained for each instance in the testing set. Specifically, Fig. 14a shows the Jaccard coefficient (JC) for all the samples in the test set. The JC is computed as follows: per each category
Fig. 15. Ground truth and results on images randomly selected with different JC scores.
[34] E. G. Fernández-Abascal, B. García, M. Jiménez, M. Martín, and F. Domínguez, Psicología de la emoción. Editorial Universitaria Ramón Areces, 2010.
[35] R. W. Picard, Affective Computing. Cambridge, MA, USA: MIT Press, 1997.
[36] Y. Groen, A. B. M. Fuermaier, A. E. Den Heijer, O. Tucha, and M. Althaus, "The empathy and systemizing quotient: The psychometric properties of the Dutch version and a review of the cross-cultural stability," J. Autism Developmental Disorders, vol. 45, no. 9, pp. 2848–2864, 2015. [Online]. Available: http://dx.doi.org/10.1007/s10803-015-2448-z
[37] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places: A 10 million image database for scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1452–1464, 2017.
[38] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proc. 21st ACM Int. Conf. Multimedia, 2013, pp. 223–232.
[39] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, "DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks," arXiv preprint arXiv:1410.8586, 2014.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[41] J. Alvarez and L. Petersson, "DecomposeMe: Simplifying ConvNets for end-to-end learning," CoRR, vol. abs/1606.05426, 2016.
[42] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int. Conf. Mach. Learn., 2015, pp. 448–456.
[43] R. Caruana, "A dozen tricks with multitask learning," in Neural Networks: Tricks of the Trade. New York, NY, USA: Springer, 2012, pp. 163–189.
[44] R. Girshick, "Fast R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 1440–1448.
[45] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.

Ronak Kosti received the master's degree in machine intelligence from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), in 2014. His master's research was based on depth estimation from a single image using artificial neural networks. He is working toward the PhD degree at the Universitat Oberta de Catalunya, Spain, advised by Prof. Agata Lapedriza. He is working with the Scene Understanding and Artificial Intelligence (SUNAI) group in computer vision, specifically in the area of affective computing.

Jose M. Alvarez received the PhD degree from the Autonomous University of Barcelona, in 2010, under the supervision of Prof. Antonio Lopez and Prof. Theo Gevers. Previous to CSIRO, he worked as a postdoctoral researcher with the Courant Institute of Mathematical Science, New York University, under the supervision of Prof. Yann LeCun. He is a senior research scientist with NVIDIA. Previously, he was a senior deep learning researcher with the Toyota Research Institute, and prior to that he was with Data61, CSIRO, Australia (formerly NICTA) as a researcher.

Adria Recasens received the Telecommunications Engineer's degree and the Mathematics Licentiate degree from the Universitat Politècnica de Catalunya. He is working toward the PhD degree in computer vision in the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests include various topics in computer vision and machine learning. He is focusing most of his research on automatic gaze-following.

Agata Lapedriza received the MS degree in mathematics from the Universitat de Barcelona and the PhD degree in computer science from the Computer Vision Center, Universitat Autònoma de Barcelona. She is an associate professor with the Universitat Oberta de Catalunya. She was a visiting researcher with the Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology (MIT), from 2012 until 2015. She has been a visiting researcher with the Affective Computing group at the MIT Media Lab since September 2017. Her research interests include image understanding, scene recognition and characterization, and affective computing.