
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 42, NO. 11, NOVEMBER 2020

Context Based Emotion Recognition Using EMOTIC Dataset

Ronak Kosti, Jose M. Alvarez, Adria Recasens, and Agata Lapedriza

Abstract—In our everyday lives and social interactions we often try to perceive the emotional states of people. There has been a lot of research on providing machines with a similar capacity for recognizing emotions. From a computer vision perspective, most of the previous efforts have focused on analyzing facial expressions and, in some cases, also body pose. Some of these methods work remarkably well in specific settings. However, their performance is limited in natural, unconstrained environments. Psychological studies show that the scene context, in addition to facial expression and body pose, provides important information to our perception of people's emotions. However, the processing of context for automatic emotion recognition has not been explored in depth, partly due to the lack of proper data. In this paper we present EMOTIC, a dataset of images of people in a diverse set of natural situations, annotated with their apparent emotion. The EMOTIC dataset combines two different types of emotion representation: (1) a set of 26 discrete categories, and (2) the continuous dimensions Valence, Arousal, and Dominance. We also present a detailed statistical and algorithmic analysis of the dataset along with an analysis of annotators' agreement. Using the EMOTIC dataset we train different CNN models for emotion recognition, combining the information of the bounding box containing the person with the contextual information extracted from the scene. Our results show how scene context provides important information to automatically recognize emotional states and motivate further research in this direction.

Index Terms—Emotion recognition, affective computing, pattern recognition

1 INTRODUCTION

Over the past years, the interest in developing automatic systems for recognizing emotional states has grown rapidly. We can find several recent works showing how emotions can be inferred from cues like text [1], voice [2], or visual information [3], [4]. The automatic recognition of emotions has many applications in environments where machines need to interact with or monitor humans. For instance, an automatic tutor in an online learning platform could provide better feedback to a student according to her level of motivation or frustration. Also, a car with the capacity of assisting a driver can intervene or give an alarm if it detects that the driver is tired or nervous.

In this paper we focus on the problem of emotion recognition from visual information. Concretely, we want to recognize the apparent emotional state of a person in a given image. This problem has been broadly studied in computer vision mainly from two perspectives: (1) facial expression analysis, and (2) body posture and gesture analysis. Section 2 gives an overview of related work on these perspectives and also on some of the common public datasets for emotion recognition.

Although the face and body pose give a lot of information on the affective state of a person, our claim in this work is that scene context information is also a key component for understanding emotional states. Scene context includes the surroundings of the person, like the place category, the place attributes, the objects, or the actions occurring around the person. Fig. 1 illustrates the importance of scene context for emotion recognition. When we just see the kid it is difficult to recognize his emotion (from his facial expression it seems he is feeling Surprise). However, when we see the context (Fig. 2a) we see the kid is celebrating his birthday, blowing out the candles, probably with his family or friends at home. With this additional information we can interpret his face and posture much better and recognize that he probably feels engaged, happy, and excited.

The importance of context in emotion perception is well supported by different studies in psychology [5], [6]. In general situations, facial expression is not sufficient to determine the emotional state of a person, since the perception of the emotion is heavily influenced by different types of context, including the scene context [2], [3], [4].

In this work, we present two main contributions. Our first contribution is the creation and publication of the EMOTIC (from EMOTions In Context) dataset. The EMOTIC database is a collection of images of people annotated according to their apparent emotional states. Images are spontaneous and unconstrained, showing people doing different things in different environments. Fig. 2 shows some examples of images in the EMOTIC database along with their corresponding annotations. As shown, annotations combine two different types of emotion representation: Discrete Emotion Categories and the Continuous Emotion Dimensions Valence, Arousal, and Dominance [7]. The EMOTIC dataset is publicly available for download at the EMOTIC website (http://sunai.uoc.edu/emotic/). Details of the dataset construction process and dataset statistics can be found in Section 3.

R. Kosti and A. Lapedriza are with Universitat Oberta de Catalunya, Barcelona 08018, Spain. E-mail: {rkosti, alapedriza}@uoc.edu.
A. Recasens is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139. E-mail: [email protected].
J.M. Alvarez is with NVIDIA, Santa Clara, CA 95051. E-mail: [email protected].
Manuscript received 21 May 2018; revised 22 Apr. 2019; accepted 25 Apr. 2019. Date of publication 14 May 2019; date of current version 1 Oct. 2020. (Corresponding author: Ronak Kosti.) Recommended for acceptance by T. Berg. Digital Object Identifier no. 10.1109/TPAMI.2019.2916866.

Fig. 1. How is this kid feeling? Try to recognize his emotional states from
the person bounding box, without scene context.

Our second contribution is the creation of a baseline system for the task of emotion recognition in context. In particular, we present and test a Convolutional Neural Network (CNN) model that jointly processes the window of the person and the whole image to predict the apparent emotional state of the person. Section 4 describes the CNN model and the implementation details, while Section 5 presents our experiments and a discussion of the results. All the trained models resulting from this work are also publicly available at the EMOTIC website.

This paper is an extension of the conference paper "Emotion Recognition in Context", presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017 [8]. We present here an extended version of the EMOTIC dataset, with further statistical dataset analysis, an analysis of scene-centric algorithms on the data, and a study of the annotation consistency among different annotators. This new release of the EMOTIC database contains 44.4% more annotated people than its previous, smaller version. With the new extended dataset we retrained all the proposed baseline CNN models with additional loss functions. We also present a comparative analysis of two different scene context features, showing how the context contributes to recognizing emotions in the wild.

2 RELATED WORK

Emotion recognition has been broadly studied by the computer vision community. Most of the existing work has focused on the analysis of facial expression to predict emotions [9], [10]. The basis of these methods is the Facial Action Coding System [11], which encodes the facial expression using a set of specific localized movements of the face, called Action Units. These facial-based approaches [9], [10] usually use facial-geometry based features or appearance features to describe the face. Afterwards, the extracted features are used to recognize Action Units and the basic emotions proposed by Ekman and Friesen [12]: anger, disgust, fear, happiness, sadness, and surprise. Currently, state-of-the-art systems for emotion recognition from facial expression analysis use CNNs to recognize emotions or Action Units [13].

Fig. 2. Sample images in the EMOTIC dataset along with their annotations.

In terms of emotion representation, some recent works based on facial expression [14] use the continuous dimensions of the VAD Emotional State Model [7]. The VAD model describes emotions using 3 numerical dimensions: Valence (V), which measures how positive or pleasant an emotion is, ranging from negative to positive; Arousal (A), which measures the agitation level of the person, ranging from non-active/calm to agitated/ready to act; and Dominance (D), which measures the level of control a person feels over the situation, ranging from submissive/non-control to dominant/in-control. On the other hand, Du et al. [15] proposed a set of 21 facial emotion categories, defined as different combinations of the basic emotions, like 'happily surprised' or 'happily disgusted'. With this categorization the authors can give fine-grained detail about the expressed emotion.

Although research in emotion recognition from a computer vision perspective is mainly focused on the analysis of the face, there are some works that also consider additional visual cues or multimodal approaches. For instance, in [16] the location of the shoulders is used as additional information to the face features to recognize basic emotions. More generally, Schindler et al. [17] used the body pose to recognize 6 basic emotions, performing experiments on a small dataset of non-spontaneous poses acquired under controlled conditions. Mou et al. [18] presented a system of affect analysis in still images of groups of people, recognizing group-level arousal and valence by combining face, body and contextual information.

Emotion Recognition in Scene Context and Image Sentiment Analysis are different problems that share some characteristics. Emotion Recognition aims to identify the emotions of a person depicted in an image. Image Sentiment Analysis consists of predicting what a person will feel when observing a picture. This picture does not necessarily contain a person. When an image contains a person, there can be a difference between the emotions experienced by the person in the image and the emotions felt by observers of the image. For example, in the image of Fig. 2b, we see a kid who seems to be annoyed for having an apple instead of chocolate and another who seems happy to have chocolate. However, as observers, we might not have any of those sentiments when looking at the photo. Instead, we might think the situation is not fair and feel disapproval. Also, if we see an image of an athlete that has lost a match, we can recognize that the athlete feels sad. However, an observer of the image may feel happy if the observer is a fan of the team that won the match.

2.1 Emotion Recognition Datasets

Most of the existing datasets for emotion recognition using computer vision are centered on facial expression analysis. For example, the GENKI database [19] contains frontal face images of a single person with different illumination, geographic, personal and ethnic settings. Images in this dataset are labelled as smiling or non-smiling. Another common facial expression analysis dataset is the ICML Face-Expression Recognition dataset [20], which contains 28,000 images annotated with the 6 basic emotions and a neutral category. On the other hand, the UCDSEE dataset [21] has a set of 9 emotion expressions acted by 4 persons. The lab setting is strictly kept the same in order to focus mainly on the facial expression of the person.

Dynamic body movement is also an essential source for estimating emotion. Studies such as [22], [23] establish the relationship between affect and body posture using as ground truth the base rate of human observers. The data consist of a spontaneous set of images acquired under a restrictive setting (people playing Wii games). The GEMEP database [24] is multi-modal (audio and video) and has 10 actors playing 18 affective states. The dataset has videos of actors showing emotions through acting. Body pose and facial expression are combined.

The Looking at People (LAP) challenges and competitions [25] involve specialized datasets containing images, sequences of images and multi-modal data. The main focus of these datasets is the complexity and variability of human body configuration, covering personality traits (spontaneous), gesture recognition (acted), apparent age recognition (spontaneous), cultural event recognition (spontaneous), action/interaction recognition and human pose recognition (spontaneous).

The Emotion Recognition in the Wild (EmotiW) challenges [26] host 3 databases: (1) the AFEW database [27], which focuses on emotion recognition from video frames taken from movies and TV shows, where the actions are annotated with attributes like name, age of actor, age of character, pose, gender, expression of person, the overall clip expression and the 6 basic emotions plus a neutral category; (2) the SFEW, a subset of the AFEW database containing face-frame images annotated specifically with the 6 basic emotions and a neutral category; and (3) the HAPPEI database [28], which addresses the problem of group-level emotion estimation. Thus, [28] offers a first attempt to use context for the problem of predicting happiness in groups of people.

Finally, the COCO dataset has recently been annotated with object attributes [29], including some emotion categories for people, such as happy and curious. These attributes show some overlap with the categories that we define in this paper. However, COCO attributes are not intended to be exhaustive for emotion recognition, and not all the people in the dataset are annotated with affect attributes.

3 EMOTIC DATASET

The EMOTIC dataset is a collection of images of people in unconstrained environments annotated according to their apparent emotional states. The dataset contains 23,571 images and 34,320 annotated people. Some of the images were manually collected from the Internet using the Google search engine. For that we used a combination of queries containing various places, social environments, different activities and a variety of keywords on emotional states. The rest of the images belong to 2 public benchmark datasets: COCO [30] and Ade20k [31]. Overall, the images show a wide diversity of contexts, containing people in different places, in different social settings, and doing different activities.

Fig. 2 shows three examples of annotated images in the EMOTIC dataset. Images were annotated using Amazon Mechanical Turk (AMT). Annotators were asked to label each image according to what they think the people in the images are feeling. Notice that we have the capacity of making reasonable guesses about other people's emotional state due to our capacity for empathy, putting ourselves into another's situation, and also because of our common sense knowledge and our ability to reason about visual information. For example, in Fig. 2b, the person is performing an activity that requires Anticipation to adapt to the trajectory. Since he is doing a thrilling activity, he seems excited about it and he is engaged or focused on this activity. In Fig. 2c, the kid feels a strong desire (Yearning) to eat the chocolate instead of the apple. Because of his situation we can interpret his facial expression as disquietment and annoyance. Notice that images are also annotated according to the continuous dimensions Valence, Arousal, and Dominance. We describe the emotion annotation modalities of the EMOTIC dataset and the annotation process in Sections 3.1 and 3.2, respectively.

After the first round of annotations (1 annotator per image), we divided the images into three sets: Training (70 percent), Validation (10 percent), and Testing (20 percent), maintaining a similar affective category distribution across the different sets. After that, Validation and Testing were annotated by 4 and 2 extra annotators, respectively. As a consequence, images in the Validation set are annotated by a total of 5 annotators, while images in the Testing set are annotated by 3 annotators (these numbers can vary slightly for some images since we removed noisy annotations).

We used the annotations from the Validation set to study the consistency of the annotations across different annotators. This study is shown in Section 3.3. The data statistics and algorithmic analysis of the EMOTIC dataset are detailed in Sections 3.4 and 3.5, respectively.

3.1 Emotion Representation

The EMOTIC dataset combines two different types of emotion representation:

Continuous Dimensions. Images are annotated according to the VAD model [7], which represents emotions by a combination of 3 continuous dimensions: Valence, Arousal and Dominance. In our representation each dimension takes an integer value in the range [1, 10]. Fig. 4 shows examples of people annotated with different values of the given dimension.

Emotion Categories. In addition to VAD, we also established a list of 26 emotion categories that represent various states of emotion. The list of the 26 emotion categories and their corresponding definitions can be found in Table 1. Also, Fig. 3 shows (per category) examples of people showing different emotional categories.

TABLE 1
Proposed Emotion Categories with Definitions

1. Affection: fond feelings; love; tenderness
2. Anger: intense displeasure or rage; furious; resentful
3. Annoyance: bothered by something or someone; irritated; impatient; frustrated
4. Anticipation: state of looking forward; hoping on or getting prepared for possible future events
5. Aversion: feeling disgust, dislike, repulsion; feeling hate
6. Confidence: feeling of being certain; conviction that an outcome will be favorable; encouraged; proud
7. Disapproval: feeling that something is wrong or reprehensible; contempt; hostile
8. Disconnection: feeling not interested in the main event of the surrounding; indifferent; bored; distracted
9. Disquietment: nervous; worried; upset; anxious; tense; pressured; alarmed
10. Doubt/Confusion: difficulty to understand or decide; thinking about different options
11. Embarrassment: feeling ashamed or guilty
12. Engagement: paying attention to something; absorbed into something; curious; interested
13. Esteem: feelings of favourable opinion or judgement; respect; admiration; gratefulness
14. Excitement: feeling enthusiasm; stimulated; energetic
15. Fatigue: weariness; tiredness; sleepy
16. Fear: feeling suspicious or afraid of danger, threat, evil or pain; horror
17. Happiness: feeling delighted; feeling enjoyment or amusement
18. Pain: physical suffering
19. Peace: well being and relaxed; no worry; having positive thoughts or sensations; satisfied
20. Pleasure: feeling of delight in the senses
21. Sadness: feeling unhappy, sorrow, disappointed, or discouraged
22. Sensitivity: feeling of being physically or emotionally wounded; feeling delicate or vulnerable
23. Suffering: psychological or emotional pain; distressed; anguished
24. Surprise: sudden discovery of something unexpected
25. Sympathy: state of sharing others' emotions, goals or troubles; supportive; compassionate
26. Yearning: strong desire to have something; jealous; envious; lust

The list of emotion categories was created as follows. We manually collected an affective vocabulary from dictionaries and books on psychology [32], [33], [34], [35]. This vocabulary consists of a list of approximately 400 words representing a wide variety of emotional states. After a careful study of the definitions and the similarities amongst these definitions, we formed clusters of words with similar meanings. The clusters were formalized into 26 categories such that they were distinguishable in a single image of a person with her context. We created the final list of 26 emotion categories taking into account a Visual Separability criterion: words that have a close meaning may not be visually separable. For instance, Anger is defined by the words rage, furious and resentful. These affective states are different, but it is not always possible to separate them visually in a single image. Thus, our list of affective categories can be seen as the first level of a hierarchy, where each category has associated subcategories.

Notice that the final list of affective categories also includes the 6 basic emotions (categories 2, 5, 16, 17, 21, 24), but we used the more general term Aversion for the category Disgust. Thus, the category Aversion also includes the subcategories dislike, repulsion, and hate, apart from disgust.
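As a minimal illustration of what a single EMOTIC annotation carries (the field names and the container structure below are illustrative assumptions, not the dataset's actual file format), one annotated person combines a bounding box, a subset of the 26 categories, and the three integer-valued continuous dimensions:

```python
from dataclasses import dataclass
from typing import List, Tuple

# The 26 discrete categories, in the order of Table 1.
EMOTION_CATEGORIES = [
    "Affection", "Anger", "Annoyance", "Anticipation", "Aversion",
    "Confidence", "Disapproval", "Disconnection", "Disquietment",
    "Doubt/Confusion", "Embarrassment", "Engagement", "Esteem", "Excitement",
    "Fatigue", "Fear", "Happiness", "Pain", "Peace", "Pleasure", "Sadness",
    "Sensitivity", "Suffering", "Surprise", "Sympathy", "Yearning",
]

@dataclass
class AnnotatedPerson:
    """One annotated person in an EMOTIC image (illustrative structure)."""
    image_file: str
    bbox: Tuple[int, int, int, int]   # person bounding box (x1, y1, x2, y2)
    categories: List[str]             # subset of the 26 discrete categories
    valence: int                      # each continuous dimension is in [1, 10]
    arousal: int
    dominance: int
    gender: str = "unknown"           # annotated in the continuous-dimension HIT
    age_group: str = "adult"          # child / teenager / adult

# Example instance (hypothetical values).
person = AnnotatedPerson(
    image_file="example.jpg", bbox=(34, 20, 180, 310),
    categories=["Engagement", "Excitement", "Anticipation"],
    valence=7, arousal=6, dominance=5,
)
```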
3.2 Collecting Annotations

We used the Amazon Mechanical Turk (AMT) crowd-sourcing platform to collect the annotations of the EMOTIC dataset. We designed two Human Intelligence Tasks (HITs), one for each of the 2 formats of emotion representation. The two annotation interfaces are shown in Fig. 5. Each annotator is shown a person-in-context enclosed in a red bounding box along with the annotation format next to it. Fig. 5a shows the interface for discrete category annotation, while Fig. 5b displays the interface for continuous dimension annotation. Notice that, in the last box of the continuous dimension interface, we also ask AMT workers to annotate the gender and estimate the age (range) of the person enclosed in the red bounding box. The design of the annotation interface has two main goals: i) the task is easy to understand, and ii) the interface fits the HIT on one screen, which avoids scrolling.

To make sure annotators understand the task, we showed them how to annotate the images step by step, by explaining two examples in detail. Also, instructions and examples were attached at the bottom of each page as a quick reference for the annotator. Finally, a summary of the detailed instructions was shown at the top of each page (Table 2).

We adopted two strategies to avoid noisy annotations in the EMOTIC dataset. First, we conducted a qualification task for annotator candidates. This qualification task has two parts: (i) an Emotional Quotient HIT (based on the standard EQ task [36]) and (ii) 2 sample image annotation tasks, one for each of our 2 emotion representations (discrete categories and continuous dimensions). For the sample annotations, we had a set of acceptable labels. The responses of the annotator candidates to this qualification task were evaluated, and those who responded satisfactorily were allowed to annotate images from the EMOTIC dataset. The second strategy to avoid noisy annotations was to randomly insert 2 control images into every annotation batch of 20 images; the correct assortment of labels for the control images was known beforehand. Annotators selecting incorrect labels on these control images were not allowed to annotate further, and their annotations were discarded.

3.3 Agreement Level among Different Annotators

Since emotion perception is a subjective task, different people can perceive different emotions after seeing the same image. For example, in both Figs. 6a and 6b, the person in the red box seems to feel Affection, Happiness and Pleasure, and the annotators have annotated these categories consistently. However, not everyone has selected all these emotions. Also, we see that annotators do not agree on the emotions Excitement and Engagement. We consider, however, that these categories are reasonable in this situation. Another example is that of Roger Federer hitting a tennis ball in Fig. 6c. He is seen predicting the ball (Anticipation), clearly looks Engaged in the activity, and also seems Confident of getting the ball.

Fig. 3. Examples of annotated people in EMOTIC dataset for each of the 26 emotion categories (Table 1). The person in the red bounding box is
annotated by the corresponding category.

After these observations we conducted different quantitative analyses of the annotation agreement. We focused first on analyzing the agreement level in the category annotation. Given a category assigned to a person in an image, we consider as an agreement measure the number of annotators agreeing on that particular category. Accordingly, we calculated, for each category and for each annotation in the validation set, the agreement amongst the annotators and sorted those values across categories. Fig. 7 shows the distribution of the percentage of annotators agreeing on an annotated category across the validation set.

Fig. 4. Examples of annotated images in the EMOTIC dataset for each of the 3 continuous dimensions Valence, Arousal & Dominance. The person in the red bounding box has the corresponding value of the given dimension.

Fig. 5. AMT interface designs: (a) for discrete category annotations and (b) for continuous dimension annotations.

We also computed the agreement between all the annotators for a given person using Fleiss' Kappa (κ). Fleiss' Kappa is a common measure to evaluate the agreement level among a fixed number of annotators when assigning categories to data. In our case, given a person to annotate, there is a subset of the 26 categories. If we have N annotators per image, each of the 26 categories can be selected by n annotators, where 0 ≤ n ≤ N. Given an image, we compute the Fleiss' Kappa for each emotion category first, and then the general agreement level on this image is computed as the average of these Fleiss' Kappa values across the different emotion categories. We found that more than 50 percent of the images have κ > 0.30. Fig. 8a shows the distribution of kappa values across the validation set for all the annotated people, sorted in decreasing order. Random annotations or total disagreement produce κ ≈ 0; in our case κ ≈ 0.3 on average, suggesting a significant agreement level even though the task of emotion recognition is subjective.
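The exact way the per-category tables are built before averaging is not fully specified above, so the following is only a minimal sketch: a standard Fleiss' Kappa over a counts table, here treating the 26 binary category decisions for one annotated person as the rated items (one plausible reading of the procedure, not necessarily the authors' exact computation).

```python
import numpy as np

def fleiss_kappa(counts):
    """Standard Fleiss' Kappa.

    counts: (n_items, n_labels) array; counts[i, j] = number of annotators
    who assigned label j to item i. Every row must sum to the same number
    of annotators.
    """
    counts = np.asarray(counts, dtype=float)
    n_raters = counts.sum(axis=1)[0]
    # Per-item agreement: fraction of annotator pairs that agree.
    p_item = np.sum(counts * (counts - 1), axis=1) / (n_raters * (n_raters - 1))
    p_bar = p_item.mean()
    # Chance agreement from the overall label distribution.
    p_label = counts.sum(axis=0) / counts.sum()
    p_e = np.sum(p_label ** 2)
    return (p_bar - p_e) / (1.0 - p_e + 1e-12)

# One annotated person: 5 annotators x 26 categories, binary selections
# (toy random values; real selections come from the EMOTIC validation set).
rng = np.random.default_rng(0)
selections = rng.integers(0, 2, size=(5, 26))
# Treat each category as an "item" rated selected / not-selected.
counts = np.stack([selections.sum(0), 5 - selections.sum(0)], axis=1)
print("kappa for this person:", fleiss_kappa(counts))
```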
For the continuous dimensions, the agreement is measured by the standard deviation (SD) of the different annotations. The average SD across the Validation set is 1.04, 1.57 and 1.84 for Valence, Arousal and Dominance, respectively, indicating that Dominance has a higher dispersion (1.84) than the other dimensions. This reflects that annotators disagree more often on Dominance than on the other dimensions, which is understandable since Dominance is more difficult to interpret than Valence or Arousal [7]. As a summary, Fig. 8b shows the standard deviations for all the images in the validation set for all 3 dimensions, sorted in decreasing order.

TABLE 2
Instruction Summary for Each HIT

Emotion Category: "Consider each emotion category separately and, if it is applicable to the person in the given context, select that emotion category."
Continuous Dimension: "Consider each emotion dimension separately, observe what level is applicable to the person in the given context, and select that level."

Fig. 6. Annotations of five different annotators for 3 images in EMOTIC.

Fig. 7. Representation of agreement between multiple annotators. Categories sorted in decreasing order according to the average number of annotators who agreed on that category.

Fig. 8. (a) Kappa values (sorted) and (b) standard deviation (sorted), for each annotated person in the validation set.

3.4 Dataset Statistics

The EMOTIC dataset contains 34,320 annotated people, of whom 66 percent are male and 34 percent are female. Among them there are 10 percent children, 7 percent teenagers and 83 percent adults.

Fig. 9a shows the number of annotated people for each of the 26 emotion categories, sorted in decreasing order. Notice that the data is unbalanced, which makes the dataset particularly challenging. An interesting observation is that there are more examples for categories associated with positive emotions, like Happiness or Pleasure, than for categories associated with negative emotions, like Pain or Embarrassment. The category with the most examples is Engagement. This is because in most of the images people are doing something or are involved in some activity, showing some degree of engagement. Figs. 9b, 9c and 9d show the number of annotated people for each value of the 3 continuous dimensions. In this case we also observe unbalanced data, but fairly distributed across the 3 dimensions, which is good for modelling.

Fig. 10 shows the co-occurrence rates of any two categories. Every value in the matrix at (r, c) (r represents the row category and c the column category) is the co-occurrence probability (in %) of category r given that the annotation also contains category c, that is, P(r|c). We observe, for instance, that when a person is labelled with the category Annoyance, there is a 46.05 percent probability that this person is also annotated with the category Anger. This means that when a person seems to be feeling Annoyance it is likely (by 46.05 percent) that this person might also be feeling Anger. We also used K-Means clustering on the category annotations to find groups of categories that occur frequently. We found, for example, that these category groups are common in the EMOTIC annotations: {Anticipation, Engagement, Confidence}, {Affection, Happiness, Pleasure}, {Doubt/Confusion, Disapproval, Annoyance}, {Yearning, Annoyance, Disquietment}.
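The conditional co-occurrence matrix just described reduces to a few lines of code. A minimal sketch, assuming the category annotations are available as a binary matrix of shape (number of annotated people, 26):

```python
import numpy as np

def cooccurrence_matrix(labels):
    """P(r | c) in percent for the 26 discrete categories.

    labels: (n_people, 26) binary matrix; labels[p, i] = 1 if person p is
    annotated with category i.
    """
    labels = np.asarray(labels, dtype=float)
    joint = labels.T @ labels            # joint[r, c] = #people with both r and c
    per_category = labels.sum(axis=0)    # per_category[c] = #people with c
    return 100.0 * joint / np.maximum(per_category[None, :], 1.0)

# Example usage (indices follow Table 1, 0-based: Anger = 1, Annoyance = 2):
# cooc = cooccurrence_matrix(labels)
# print(cooc[1, 2])   # P(Anger | Annoyance) in percent
```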
Fig. 11 shows the distribution of each continuous dimension across the different emotion categories. For every plot, categories are arranged in increasing order of their average value of the given dimension (calculated over all the instances containing that particular category). Thus, we observe from Fig. 11a that emotion categories like Suffering, Annoyance and Pain correlate with low Valence values (feeling less positive) on average, whereas emotion categories like Pleasure, Happiness and Affection correlate with higher Valence values (feeling more positive). It is also interesting to note that a category like Disconnection lies in the mid-range of Valence values, which makes sense. When we observe Fig. 11b, it is easy to understand that emotion categories like Disconnection, Fatigue and Sadness show low Arousal values, and we see high activeness for emotion categories like Anticipation, Confidence and Excitement. Finally, Fig. 11c shows that people are not in control when they show emotion categories like Suffering, Pain and Sadness, whereas when Dominance is high, emotion categories like Esteem, Excitement and Confidence occur more often.

An important remark about the EMOTIC dataset is that there are people whose faces are not visible. More than 25 percent of the people in EMOTIC have their faces partially occluded or at very low resolution, so we cannot rely on facial expression analysis for recognizing their emotional state.
Fig. 9. Dataset statistics. (a) Number of people annotated for each emotion category; (b), (c) & (d) number of people annotated for every value of the three continuous dimensions, viz. Valence, Arousal & Dominance.

Fig. 10. Co-occurrence between the 26 emotion categories. Each row represents the occurrence probability of every other category given the category of that particular row.

Fig. 11. Distribution of continuous dimension values across emotion categories. The average value of a dimension is calculated for every category and then plotted in increasing order for every distribution.

Fig. 12. Illustration of 2 current scene-centric methods for extracting contextual features from the scene: AlexNet Places CNN outputs (place categories and attributes) and SentiBank ANP outputs for three example images of the EMOTIC dataset.

3.5 Algorithmic Scene Context Analysis

This section illustrates how current scene-centric systems can be used to extract contextual information that can be potentially useful for emotion recognition. In particular, we illustrate this idea with a CNN trained on the Places dataset [37] and with the SentiBank Adjective-Noun Pair (ANP) detectors [38], [39], a Visual Sentiment Ontology for Image Sentiment Analysis. As a reference, Fig. 12 shows Places and ANP outputs for sample images of the EMOTIC dataset.

We used the AlexNet Places CNN [37] to predict the scene category and scene attributes for the images in EMOTIC. This information helps to divide the analysis into place categories and place attributes. We observed that the distribution of emotions varies significantly among different place categories. For example, we found that people on a 'ski_slope' frequently experience Anticipation or Excitement, which are associated with the activities that usually happen in this place category. Comparing sport-related and working-environment related images, we find that people in sport-related images usually show Excitement, Anticipation and Confidence, while they show Sadness or Annoyance less frequently. Interestingly, Sadness and Annoyance appear with higher frequency in working environments. We also observe interesting patterns when correlating the continuous dimensions with place attributes and categories.
For instance, places where people usually show high Dominance are sport-related places and sport-related attributes. On the contrary, low Dominance is shown in 'jail_cell' or with attributes like 'enclosed_area' or 'working', where the freedom of movement is restricted. In Fig. 12, the predictions by the Places CNN describe the scene in general: in the top image there is a girl sitting in a 'kindergarten_classroom' (place category), which usually is situated in enclosed areas with 'no_horizon' (attributes).

We also find interesting patterns when we compute the correlation between detected ANPs and the emotions labelled in the image. For example, in images with people labelled with Affection, the most frequent ANP is 'young_couple', while in images with people labelled with Excitement we frequently found the ANPs 'last_game' and 'playing_field'. Also, we observe a high correlation between images with Peace and ANPs like 'old_couple' and 'domestic_scenes', and between Happiness and the ANPs 'outdoor_wedding', 'outdoor_activities', 'happy_family' or 'happy_couple'.

Overall, these observations suggest that some common sense knowledge patterns relating emotions and context could potentially be extracted, automatically, from the data.
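A minimal sketch of the kind of per-place analysis described above, assuming the Places CNN predictions for every EMOTIC image and the binary category annotations have already been computed and stored as arrays (the variable names and category ids are illustrative):

```python
import numpy as np

def emotion_distribution_by_place(place_ids, labels, n_places):
    """Fraction of annotated people showing each category, per predicted place.

    place_ids: (n_people,) integer id of the Places category predicted for the
               image of each annotated person.
    labels:    (n_people, 26) binary matrix of discrete emotion annotations.
    Returns a (n_places, 26) matrix of per-place category frequencies.
    """
    place_ids = np.asarray(place_ids)
    labels = np.asarray(labels, dtype=float)
    dist = np.zeros((n_places, labels.shape[1]))
    for p in range(n_places):
        mask = place_ids == p
        if mask.any():
            dist[p] = labels[mask].mean(axis=0)
    return dist

# e.g., compare a 'ski_slope'-like category against an office-like category
# once place predictions are available (the ids below are placeholders):
# dist = emotion_distribution_by_place(place_ids, labels, n_places=365)
# print(dist[ski_slope_id], dist[office_id])
```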
Fig. 13. Proposed end-to-end model for emotion recognition in context. The model consists of two feature extraction modules and a fusion network for jointly estimating the discrete categories and the continuous dimensions.

4 CNN MODEL FOR EMOTION RECOGNITION IN SCENE CONTEXT

We propose a baseline CNN model for the problem of emotion recognition in context. The pipeline of the model is shown in Fig. 13 and is divided into three modules: body feature extraction, image (context) feature extraction and fusion network. The body feature extraction module takes the visible body of the person as input and generates body-related features; the image feature extraction module takes the whole image as input and generates scene-related features; finally, the fusion network combines these features to do a fine-grained regression of the two types of emotion representation (Section 3.1).

The body feature extraction module takes the visible part of the body of the target person as input and generates body-related features. These features include important cues like face and head aspects, pose, and body appearance. In order to capture these aspects, this module is pre-trained on ImageNet [40], which is an object-centric dataset that includes the category person.

The image feature extraction module takes the whole image as input and generates scene-context features. These contextual features can be interpreted as an encoding of the scene category, its attributes and the objects present in the scene, or the dynamics between other people present in the scene. To capture these aspects, we pre-train this module on the scene-centric Places dataset [37].

The fusion module combines the features of the two feature extraction modules and estimates the discrete emotion categories and the continuous emotion dimensions.

Both feature extraction modules are based on the one-dimensional filter CNN proposed in [41]. These CNN networks provide competitive performance while the number of parameters is low. Each network consists of 16 convolutional layers with 1-dimensional kernels alternating between horizontal and vertical orientations, effectively modeling 8 layers with 2-dimensional kernels. Then, to maintain the location of different parts of the image, we use a global average pooling layer to reduce the features of the last convolutional layer. To avoid internal covariate shift we add a batch normalization layer [42] after each convolutional layer and rectified linear units to speed up the training.

The fusion network module consists of two fully connected (FC) layers. The first FC layer is used to reduce the dimensionality of the features to 256, and then a second fully connected layer is used to learn independent representations for each task [43]. The output of this second FC layer branches off into 2 separate representations, one with 26 units representing the discrete emotion categories, and a second with 3 units representing the 3 continuous dimensions (Section 3.1).
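The following is a minimal PyTorch sketch of this two-branch design. It substitutes standard ResNet-18 backbones for the one-dimensional-filter CNNs of [41], and the hidden sizes and module names are illustrative assumptions rather than the authors' exact implementation (in practice the branches would be initialized from ImageNet- and Places-pretrained weights).

```python
import torch
import torch.nn as nn
from torchvision import models

class EmotionInContext(nn.Module):
    """Two-branch body + image model with a fusion network (sketch)."""

    def __init__(self, n_categories=26, n_dimensions=3, hidden=256):
        super().__init__()
        # Stand-ins for the body (ImageNet-pretrained) and image
        # (Places-pretrained) feature extractors used in the paper.
        body = models.resnet18()
        image = models.resnet18()
        self.body_branch = nn.Sequential(*list(body.children())[:-1])    # -> (B, 512, 1, 1)
        self.image_branch = nn.Sequential(*list(image.children())[:-1])  # -> (B, 512, 1, 1)
        self.fusion = nn.Sequential(
            nn.Linear(512 + 512, hidden),   # first FC reduces fused features to 256
            nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden),      # second FC learns a shared representation
            nn.ReLU(inplace=True),
        )
        self.head_disc = nn.Linear(hidden, n_categories)   # 26 discrete categories
        self.head_cont = nn.Linear(hidden, n_dimensions)   # Valence, Arousal, Dominance

    def forward(self, body_crop, full_image):
        b = self.body_branch(body_crop).flatten(1)
        i = self.image_branch(full_image).flatten(1)
        z = self.fusion(torch.cat([b, i], dim=1))
        return self.head_disc(z), self.head_cont(z)

model = EmotionInContext()
y_disc, y_cont = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
print(y_disc.shape, y_cont.shape)   # torch.Size([2, 26]) torch.Size([2, 3])
```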
4.1 Loss Function and Training Setup

We define the loss function as a weighted combination of two separate losses. A prediction $\hat{y}$ is composed of the predictions for each of the 26 discrete categories and the 3 continuous dimensions, $\hat{y} = (\hat{y}^{disc}, \hat{y}^{cont})$, where $\hat{y}^{disc} = (\hat{y}^{disc}_1, \ldots, \hat{y}^{disc}_{26})$ and $\hat{y}^{cont} = (\hat{y}^{cont}_1, \hat{y}^{cont}_2, \hat{y}^{cont}_3)$. Given a prediction $\hat{y}$, the loss is defined by $L = \lambda_{disc} L_{disc} + \lambda_{cont} L_{cont}$, where $L_{disc}$ and $L_{cont}$ represent the losses corresponding to learning the discrete categories and the continuous dimensions, respectively. The parameters $(\lambda_{disc}, \lambda_{cont})$ weight the contribution of each loss and are set empirically using the validation set.

Criterion for Discrete Categories ($L_{disc}$). The discrete category estimation is a multilabel problem with an inherent class imbalance issue, as the number of training examples is not the same for each class (see Fig. 9a).

In our experiments, we use a weighted Euclidean loss for the discrete categories. Empirically, we found the Euclidean loss to be more effective than the Kullback-Leibler divergence or a multi-class multi-classification hinge loss. More precisely, given a prediction $\hat{y}^{disc}$, the weighted Euclidean loss is defined as follows:

$$L_{2disc}(\hat{y}^{disc}) = \sum_{i=1}^{26} w_i \left(\hat{y}^{disc}_i - y^{disc}_i\right)^2, \qquad (1)$$

where $\hat{y}^{disc}_i$ is the prediction for the i-th category and $y^{disc}_i$ is the ground-truth label. The parameter $w_i$ is the weight assigned to each category. Weight values are defined as $w_i = \frac{1}{\ln(c + p_i)}$, where $p_i$ is the probability of the i-th category and $c$ is a parameter that controls the range of valid values for $w_i$. Using this weighting scheme, the values of $w_i$ remain bounded as the number of instances of a category approaches 0. This is particularly relevant in our case, as we set the weights based on the occurrence of each category in each batch. Experimentally, we obtained better results using this approach than setting global weights based on the entire dataset.
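A compact PyTorch sketch of this weighted Euclidean criterion, with the per-batch weighting $w_i = 1/\ln(c + p_i)$ (the value of $c$ and the averaging over the batch are illustrative assumptions):

```python
import torch

def weighted_euclidean_loss(pred_disc, target_disc, c=1.2):
    """Weighted Euclidean loss over the 26 discrete categories (Eq. 1).

    pred_disc, target_disc: tensors of shape (batch, 26); targets are the
    multi-hot category annotations. Weights w_i = 1 / ln(c + p_i) are
    recomputed from the category frequencies of the current batch.
    """
    p = target_disc.float().mean(dim=0)      # per-category probability in this batch
    w = 1.0 / torch.log(c + p)               # stays bounded even when p_i -> 0
    return (w * (pred_disc - target_disc.float()) ** 2).sum(dim=1).mean()

# Example with random predictions and labels.
pred = torch.rand(8, 26)
target = (torch.rand(8, 26) > 0.8).float()
print(weighted_euclidean_loss(pred, target))
```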

TABLE 3
Average Precision (AP) Obtained on the Test Set per Category

Emotion Categories      B (L2)   B (SL1)   B+I (L2)   B+I (SL1)
1. Affection             21.80    16.55     21.16      27.85
2. Anger                 06.45    04.67     06.45      09.49
3. Annoyance             07.82    05.54     11.18      14.06
4. Anticipation          58.61    56.61     58.61      58.64
5. Aversion              05.08    03.64     06.45      07.48
6. Confidence            73.79    72.57     77.97      78.35
7. Disapproval           07.63    05.50     11.00      14.97
8. Disconnection         20.78    16.12     20.37      21.32
9. Disquietment          14.32    13.99     15.54      16.89
10. Doubt/Confusion      29.19    28.35     28.15      29.63
11. Embarrassment        02.38    02.15     02.44      03.18
12. Engagement           84.00    84.59     86.24      87.53
13. Esteem               18.36    19.48     17.35      17.73
14. Excitement           73.73    71.80     76.96      77.16
15. Fatigue              07.85    06.55     08.87      09.70
16. Fear                 12.85    12.94     12.34      14.14
17. Happiness            58.71    51.56     60.69      58.26
18. Pain                 03.65    02.71     04.42      08.94
19. Peace                17.85    17.09     19.43      21.56
20. Pleasure             42.58    40.98     42.12      45.46
21. Sadness              08.13    06.19     10.36      19.66
22. Sensitivity          04.23    03.60     04.82      09.28
23. Suffering            04.90    04.38     07.65      18.84
24. Surprise             17.20    17.03     16.42      18.81
25. Sympathy             10.66    09.35     11.44      14.71
26. Yearning             07.82    07.40     08.34      08.34
Mean                     23.86    22.36     24.88      27.38

Results for models where the input is just the body (B), and models where the inputs are both the body and the whole image (B+I). The type of L_cont used is indicated in parentheses (L2 refers to Equation (2) and SL1 refers to Equation (3)).

TABLE 4
Average Absolute Error (AAE) Obtained on the Test Set per Continuous Dimension

Continuous Dimensions   B (L2)   B (SL1)   B+I (L2)   B+I (SL1)
Valence                 0.0537   0.0545    0.0546     0.0528
Arousal                 0.0600   0.0630    0.0648     0.0611
Dominance               0.0570   0.0567    0.0573     0.0579
Mean                    0.0569   0.0581    0.0589     0.0573

Results for models where the input is just the body (B), and models where the inputs are both the body and the whole image (B+I). The type of L_cont used is indicated in parentheses (L2 refers to Equation (2) and SL1 refers to Equation (3)).
Criterion for Continuous Dimensions ($L_{cont}$). We model the estimation of the continuous dimensions as a regression problem. Since multiple annotators annotate the data based on subjective evaluation, we compare the performance of two different robust losses: (1) a margin Euclidean loss $L_{2cont}$, and (2) the Smooth L1 loss $SL_{1cont}$. The former defines a margin of error within which the error is not considered when computing the loss. The margin Euclidean loss for the continuous dimensions is defined as:

$$L_{2cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \left(\hat{y}^{cont}_k - y^{cont}_k\right)^2, \qquad (2)$$

where $\hat{y}^{cont}_k$ and $y^{cont}_k$ are the prediction and the ground truth for the k-th dimension, respectively, and $v_k \in \{0, 1\}$ is a binary weight that represents the error margin: $v_k = 0$ if $|\hat{y}^{cont}_k - y^{cont}_k| < \theta$, and $v_k = 1$ otherwise. If a prediction is within the error margin, i.e., the error is smaller than $\theta$, it does not contribute to updating the weights of the network.

The Smooth L1 loss uses the absolute error, switching to the squared error when the error is less than a threshold (set to 1 in our experiments). This loss has been widely used for object detection [44] and, in our experiments, has been shown to be less sensitive to outliers. Precisely, the Smooth L1 loss is defined as follows:

$$SL_{1cont}(\hat{y}^{cont}) = \sum_{k=1}^{3} v_k \begin{cases} 0.5\, x_k^2, & \text{if } |x_k| < 1 \\ |x_k| - 0.5, & \text{otherwise,} \end{cases} \qquad (3)$$

where $x_k = \hat{y}^{cont}_k - y^{cont}_k$, and $v_k$ is a weight assigned to each of the continuous dimensions, which is set to 1 in our experiments.

We train our recognition system end-to-end, learning the parameters jointly using stochastic gradient descent with momentum. The first two modules are initialized using pre-trained models from Places [37] and ImageNet [45], while the fusion network is trained from scratch. The batch size is set to 52, twice the number of discrete emotion categories. We found empirically, after testing multiple batch sizes (including multiples of 26 such as 26, 52, 78 and 108), that a batch size of 52 gives the best performance (on the validation set).
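A PyTorch sketch of the two continuous criteria defined in Equations (2) and (3), assuming predictions and targets are (batch, 3) tensors for Valence, Arousal and Dominance (the margin value used here is only illustrative):

```python
import torch

def margin_euclidean_loss(pred, target, theta=0.1):
    """Margin Euclidean loss (Eq. 2): errors below theta are ignored."""
    err = pred - target
    v = (err.abs() >= theta).float()                 # binary weight v_k per dimension
    return (v * err ** 2).sum(dim=1).mean()

def smooth_l1_cont_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """Smooth L1 loss (Eq. 3) with per-dimension weights v_k (set to 1 here)."""
    x = pred - target
    per_dim = torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)
    v = torch.tensor(weights, device=pred.device)
    return (v * per_dim).sum(dim=1).mean()

pred = torch.rand(8, 3)
target = torch.rand(8, 3)
print(margin_euclidean_loss(pred, target), smooth_l1_cont_loss(pred, target))
```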
5 EXPERIMENTS

We trained four different instances of our CNN model, which are the combinations of two different input types and the two different continuous loss functions described in Section 4.1. The input types are body (i.e., the upper branch in Fig. 13), denoted by B, and body plus image (i.e., both branches shown in Fig. 13), denoted by B+I. The continuous loss types are denoted in the experiments by L2 for the Euclidean loss (Equation (2)) and SL1 for the Smooth L1 loss (Equation (3)).

Results for the discrete categories in the form of Average Precision (AP) per category (the higher, the better) are summarized in Table 3. Notice that the B+I model outperforms the B model in all categories except one. The combination of body and image features (the B+I (SL1) model) is better than the B model.

Results for the continuous dimensions in the form of Average Absolute Error (AAE) per dimension (the lower, the better) are summarized in Table 4. In this case, all the models provide similar results and the differences are not significant.

Fig. 14. Results per each sample (test set, sorted): (a) Jaccard coefficient (JC) of the recognized discrete categories; (b) Average Absolute Error (AAE) in the estimation of the three continuous dimensions.

Fig. 14 shows a summary of the results obtained per instance in the testing set. Specifically, Fig. 14a shows the Jaccard coefficient (JC) for all the samples in the test set. The JC coefficient is computed as follows: for each category, we use as detection threshold the value where Precision = Recall. Then, the JC coefficient is computed as the number of detected categories that are also present in the ground truth (the number of categories in the intersection of detections and ground truth) divided by the total number of categories that are either in the ground truth or detected (the union of detected categories and categories in the ground truth). The higher the JC the better, with a maximum value of 1, where the detected categories and the ground-truth categories are exactly the same. In the graphic, examples are sorted in decreasing order of the JC coefficient. Notice that these results also support that the B+I model outperforms the B model.
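A minimal sketch of this per-sample evaluation, assuming the per-category Precision = Recall thresholds have been precomputed (the placeholder thresholds and indices below are illustrative):

```python
import numpy as np

def jaccard_coefficient(detected, ground_truth):
    """Jaccard coefficient between detected and ground-truth category sets.

    detected, ground_truth: binary vectors of length 26 (1 = category present).
    """
    detected = np.asarray(detected, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    union = np.logical_or(detected, ground_truth).sum()
    if union == 0:
        return 1.0   # nothing detected and nothing annotated
    return np.logical_and(detected, ground_truth).sum() / union

# Example: scores thresholded at the per-category Precision = Recall point.
scores = np.random.rand(26)
thresholds = np.full(26, 0.5)            # illustrative placeholder thresholds
detected = scores >= thresholds
ground_truth = np.zeros(26, dtype=bool)
ground_truth[[3, 5, 11]] = True          # e.g., Anticipation, Confidence, Engagement
print(jaccard_coefficient(detected, ground_truth))
```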
For the case of the continuous dimensions, Fig. 14b shows the Average Absolute Error (AAE) obtained for each sample in the testing set. Samples are sorted in increasing order (best performances on the left). Consistent with the results shown in Table 4, we do not observe a significant difference among the different models.

Finally, Fig. 15 shows qualitative predictions for the best B and B+I models. These examples were randomly selected among samples with high JC in B+I (a-b) and samples with low JC in B+I (g-h). Incorrect category recognition is indicated in red. As shown, in general, the B+I model outperforms B, although there are some exceptions, like Fig. 15c.

Fig. 15. Ground truth and results on images randomly selected with different JC scores.

5.1 Context Features Comparison

The goal of this section is to compare different context features for the problem of emotion recognition in context. A key aspect of incorporating the context in an emotion recognition model is being able to obtain information from the context that is actually relevant for emotion recognition. Since context information extraction is a scene-centric task, the information extracted from the context should be based on a scene-centric feature extraction system. That is why our baseline model uses a Places CNN for the context feature extraction module. However, recent works in sentiment analysis (detecting the emotion of a person when he/she observes an image) also provide a system for scene feature extraction that can be used for encoding the relevant contextual information for emotion recognition.

To compute body features, denoted by B_f, we fine-tune an AlexNet ImageNet CNN on the EMOTIC database and use the average pooling of the last convolutional layer as features. For the context (image), we compare two different feature types, denoted by I_f and I_S. I_f are obtained by fine-tuning an AlexNet Places CNN on the EMOTIC database and taking the average pooling of the last convolutional layer as features (similar to B_f), while I_S is a feature vector composed of the sentiment scores of the ANP detectors from the implementation of [39].

To fairly compare the contribution of the different context features, we train Logistic Regressors on the following features and combinations of features: (1) B_f, (2) B_f + I_f, and (3) B_f + I_S. For the discrete categories we obtain mean APs of 23.00, 27.70, and 29.45, respectively. For the continuous dimensions, we obtain AAEs of 0.0704, 0.0643, and 0.0713, respectively. We observe that, for the discrete categories, both I_f and I_S contribute relevant information to emotion recognition in context. Interestingly, I_S performs better than I_f, even though these features have not been trained using EMOTIC. However, these features are specifically designed for sentiment analysis, which is a problem closely related to extracting relevant contextual information for emotion recognition, and they are trained with a large dataset of images.
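A sketch of this comparison using scikit-learn, assuming the pooled body and context features have already been extracted and cached as arrays (the feature dimensions, variable names and random data below are illustrative placeholders, not the actual EMOTIC features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import average_precision_score

# Hypothetical cached features: body (Bf) and ANP sentiment scores (IS).
n_train, n_test = 1000, 200
Bf_tr, Bf_te = np.random.randn(n_train, 256), np.random.randn(n_test, 256)
IS_tr, IS_te = np.random.rand(n_train, 100), np.random.rand(n_test, 100)
Y_tr = (np.random.rand(n_train, 26) > 0.8).astype(int)   # multi-hot categories
Y_te = (np.random.rand(n_test, 26) > 0.8).astype(int)

def mean_ap(train_x, train_y, test_x, test_y):
    """One logistic regressor per category; report mean Average Precision."""
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    clf.fit(train_x, train_y)
    scores = clf.predict_proba(test_x)
    return average_precision_score(test_y, scores, average="macro")

print("Bf only :", mean_ap(Bf_tr, Y_tr, Bf_te, Y_te))
print("Bf + IS :", mean_ap(np.hstack([Bf_tr, IS_tr]), Y_tr,
                           np.hstack([Bf_te, IS_te]), Y_te))
```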

6 CONCLUSIONS [11] E. Friesen and P. Ekman, “Measuring facial movement. Environ-


mental psychology and nonverbal behavior.,” Sep. 1976, vol. 1,
In this paper we pointed out the importance of considering no. 1, pp. 56–75.
the person scene context in the problem of automatic emo- [12] P. Ekman and W. V. Friesen, “Constants across cultures in the face
and emotion,” J. Personality Social Psychology, vol. 17, no. 2, 1971,
tion recognition in the wild. We presented the EMOTIC data- Art. no. 124.
base, a dataset of 23,571 natural unconstrained images with [13] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez,
34,320 people labeled according to their apparent emotions. “Emotionet: An accurate, real-time algorithm for the automatic
annotation of a million facial expressions in the wild,” in
The images in the dataset are annotated using two different Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit, 2016, pp. 5562–
emotion representations: 26 discrete categories, and the 3 5570.
continuous dimensions Valence, Arousal and Dominance. [14] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, “Analysis
We described in depth the annotation process and analyzed of eeg signals and facial expressions for continuous emotion
detection,” IEEE Trans. Affective Comput., vol. 7, no. 1, pp. 17–28,
the annotation consistency of different annotators. We also Jan. 2016.
provided different statistics and algorithmic analysis on the [15] S. Du, Y. Tao, and A. M. Martinez, “Compound facial expressions of
data, showing the characteristics of the EMOTIC database. In emotion,” Proc. Nat. Acad. Sci., vol. 111, no. 15, pp. E1454–E1462,
addition, we proposed a baseline CNN model for emotion 2014.
[16] M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous prediction
recognition in scene context that combines the information of spontaneous affect from multiple cues and modalities in
of the person (body bounding box) with the scene context valence-arousal space,” IEEE Trans. Affective Comput., vol. 2, no. 2,
information (whole image). We also compare two different pp. 92–105, Apr.-Jun. 2011.
[17] K. Schindler, L. Van Gool, and B. de Gelder, “Recognizing emotions
feature types for encoding the contextual information. Our expressed by body pose: A biologically inspired neural model,”
results show the relevance of using contextual information Neural Netw., vol. 21, no. 9, pp. 1238–1246, 2008.
to recognize emotions and, in conjunction with the EMOTIC [18] W. Mou, O. Celiktutan, and H. Gunes, “Group-level arousal and
dataset, motivate further research in this direction. All the valence recognition in static images: Face, body and context,” in
Proc. 11th IEEE Int. Conf. Workshops Autom. Face Gesture Recognit.,
data and trained models are publicly available for the 2015, vol. 5, pp. 1–6.
research community in the website of the project. [19] “GENKI database.” [Online]. Available: http://mplab.ucsd.edu/
wordpress/?page_id=398, Accessed on: Apr. 12, 2017.
[20] “ICML face expression recognition dataset.” [Online]. Available:
ACKNOWLEDGMENTS https://goo.gl/nn9w4R, Accessed on: Apr. 12, 2017.
[21] J. L. Tracy, R. W. Robins, and R. A. Schriber, “Development of a
This work has been partially supported by the Ministerio de facs-verified set of basic and self-conscious emotion expressions,”
Economia, Industria y Competitividad (Spain), under the Grants Emotion, vol. 9, no. 4, 2009, Art. no. 554.
Ref. TIN2015-66951-C2-2-R and RTI2018-095232-B-C22, and [22] A. Kleinsmith and N. Bianchi-Berthouze, “Recognizing affective
dimensions from body posture,” in Proc. 2nd Int. Conf. Affective
by Innovation and Universities (FEDER funds). The authors Comput. Intell. Interaction, 2007, pp. 48–58. [Online]. Available:
also thank NVIDIA for their generous hardware donations. http://dx.doi.org/10.1007/978-3-540-74889-2_5
Project Page: http://sunai.uoc.edu/emotic/ [23] A. Kleinsmith, N. Bianchi-Berthouze, and A. Steed, “Automatic
recognition of non-acted affective postures,” IEEE Trans. Syst.
Man Cybern. Part B (Cybern.), vol. 41, no. 4, pp. 1027–1038,
REFERENCES Aug. 2011.
Jose M. Alvarez received the PhD degree from the Autonomous University of Barcelona, in 2010, under the supervision of Prof. Antonio Lopez and Prof. Theo Gevers. He is a senior research scientist with NVIDIA. Previously, he was a senior deep learning researcher with the Toyota Research Institute and, prior to that, a researcher with Data61, CSIRO, Australia (formerly NICTA). Before CSIRO, he worked as a postdoctoral researcher with the Courant Institute of Mathematical Sciences, New York University, under the supervision of Prof. Yann LeCun.

Adria Recasens received the Telecommunications Engineer's degree and the Mathematics Licentiate degree from the Universitat Politècnica de Catalunya. He is working toward the PhD degree in computer vision at the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests include various topics in computer vision and machine learning, and he focuses most of his research on automatic gaze-following.

Agata Lapedriza received the MS degree in mathematics from the Universitat de Barcelona and the PhD degree in computer science from the Computer Vision Center, Universitat Autònoma de Barcelona. She is an associate professor with the Universitat Oberta de Catalunya. She was a visiting researcher with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), from 2012 until 2015, and she has been a visiting researcher with the Affective Computing group at the MIT Media Lab since September 2017. Her research interests include image understanding, scene recognition and characterization, and affective computing.
Ronak Kosti received the master's degree in machine intelligence from the Dhirubhai Ambani Institute of Information and Communication Technology (DA-IICT), in 2014. His master's research was based on depth estimation from a single image using artificial neural networks. He is working toward the PhD degree at the Universitat Oberta de Catalunya, Spain, advised by Prof. Agata Lapedriza. He works with the Scene Understanding and Artificial Intelligence (SUNAI) group on computer vision, specifically in the area of affective computing.