
Received May 9, 2019, accepted June 12, 2019, date of publication July 16, 2019, date of current version August 9, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2929050

Emotion Recognition in Conversation: Research Challenges, Datasets, and Recent Advances
SOUJANYA PORIA1, NAVONIL MAJUMDER2, RADA MIHALCEA3, AND EDUARD HOVY4
1 ISTD, Singapore University of Technology and Design, Singapore 487372
2 CIC, Instituto Politécnico Nacional, Mexico City 07738, Mexico
3 Computer Science and Engineering, University of Michigan, Ann Arbor, MI 48109, USA
4 Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Corresponding author: Soujanya Poria ([email protected])


This work was supported in part by the Michigan Institute for Data Science, in part by the National Science Foundation under Grant
1815291, in part by the John Templeton Foundation under Grant 61156, and in part by the DARPA under Grant
HR001117S0026-AIDA-FP-045.

ABSTRACT Emotion is intrinsic to humans and consequently, emotion understanding is a key part of
human-like artificial intelligence (AI). Emotion recognition in conversation (ERC) is becoming increasingly
popular as a new research frontier in natural language processing (NLP) due to its ability to mine opinions
from the plethora of publicly available conversational data on platforms such as Facebook, YouTube, Reddit,
Twitter, and others. Moreover, it has potential applications in health-care systems (as a tool for psychological
analysis), education (understanding student frustration), and more. In addition, ERC is extremely
important for generating emotion-aware dialogues that require an understanding of the user's emotions.
Catering to these needs calls for effective and scalable conversational emotion-recognition algorithms.
However, it is a difficult problem to solve because of several research challenges. In this paper, we discuss
these challenges and shed light on recent research in this field. We also describe the drawbacks of these
approaches and discuss the reasons why they fail to successfully overcome the research challenges in ERC.

INDEX TERMS Emotion recognition, sentiment analysis, dialogue systems, natural language processing.

The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/

I. INTRODUCTION
Emotion is often defined as an individual's mental state associated with thoughts, feelings, and behavior. Stoics like Cicero organized emotions into four categories: metus (fear), aegritudo (pain), libido (lust), and laetitia (pleasure). Later, evolutionary theories of emotion were initiated in the late 19th century by Darwin and Prodger [1]. Darwin hypothesized that emotions evolved through natural selection and, hence, have cross-culturally universal counterparts. In recent times, Plutchik [2] categorized emotion into eight primary types, visualized by the wheel of emotions. Further, Ekman [3] argued for the correlation between emotion and facial expression.

Natural language is often indicative of one's emotion. Hence, emotion recognition has been enjoying popularity in the field of NLP [4], [5], due to its widespread applications in opinion mining, recommender systems, health-care, and so on. Strapparava and Mihalcea [6] addressed the task of emotion detection in news headlines. A number of emotion lexicons [7], [8] have been developed to tackle the textual emotion recognition problem.

Only in the past few years has emotion recognition in conversation (ERC) gained attention from the NLP community [9]–[12], due to the growing availability of public conversational data. ERC can be used to analyze conversations that take place on social media. It can also aid in analyzing conversations in real time, which can be instrumental in legal trials, interviews, e-health services, and more.

Unlike vanilla emotion recognition of sentences/utterances, ERC ideally requires context modeling of the individual utterances. This context can be attributed to the preceding utterances and relies on the temporal sequence of utterances. Compared to the recently published works on ERC [10]–[12], both lexicon-based [8], [13], [14] and modern deep learning-based [4], [5] vanilla emotion recognition approaches fail to work well on ERC datasets, as these works ignore conversation-specific factors such as the presence of contextual cues, the temporality of speakers' turns, and speaker-specific information. Fig. 4a and Fig. 4b show an example where the same utterance changes its meaning depending on its preceding utterance.

A. TASK DEFINITION
Given the transcript of a conversation along with speaker
information of each constituent utterance, the ERC task aims
to identify the emotion of each utterance from a set of pre-
defined emotions. Fig. 2 illustrates one such conversation
between two people, where each utterance is labeled by
the underlying emotion. Formally, given an input sequence of N utterances [(u1, p1), (u2, p2), ..., (uN, pN)], where each utterance ui = [ui,1, ui,2, ..., ui,T] consists of T words ui,j and is spoken by party pi, the task is to predict the emotion label ei of each utterance ui.
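To make the inputs and outputs of this definition concrete, here is a minimal Python sketch of the task interface. The dialogue, the speaker names, and the trivial "always neutral" baseline are illustrative assumptions, not taken from any dataset or published model.

```python
from typing import List, Tuple

# A conversation is an ordered list of (utterance, speaker) pairs.
Conversation = List[Tuple[str, str]]

# Ekman's six basic emotions plus a "neutral" catch-all; the actual
# label set varies by dataset.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral"]

def recognize_emotions(conversation: Conversation) -> List[str]:
    """Toy ERC baseline: returns one emotion label per utterance.

    A real model would condition each prediction on the preceding
    utterances and on speaker identity; here we simply return
    "neutral" to illustrate the expected input/output shapes.
    """
    return ["neutral" for _utterance, _speaker in conversation]

if __name__ == "__main__":
    dialogue: Conversation = [
        ("I have been looking for a job for months.", "Pa"),
        ("Maybe you should look harder.", "Pb"),
        ("That is really helpful, thanks.", "Pa"),
    ]
    for (utterance, speaker), emotion in zip(dialogue, recognize_emotions(dialogue)):
        print(f"{speaker}: {utterance} -> {emotion}")
```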

B. CONTROLLING VARIABLES IN CONVERSATIONS
Conversations are broadly categorized into two categories: task-oriented and chit-chat (also referred to as non-task-oriented). Both types of conversations are governed by different factors or pragmatics [15], such as topic, interlocutors' personality, argumentation logic, viewpoint, intent [16], and so on. Fig. 1 shows how these factors play out in a dyadic conversation. Firstly, the topic (Topic) and the interlocutor personalities (P*) always influence the conversation, irrespective of time. A speaker makes up his/her mind (S*t) about the reply (U*t) based on the contextual preceding utterances (U*<t) from both speaker and listener, the previous utterance being the most important one, since it usually makes the largest change in the joint task model (for task-oriented conversations) or in the speaker's emotional state (for chit-chat). Delving deeper, the pragmatic features, as explained by Hovy [15], like argumentation logic, interlocutor viewpoint, inter-personal relationship and dependency, and situational awareness, are encoded in the speaker state (S*t). The intent (I*t) of the speaker is decided based on the previous intent I*t-2 and the speaker state S*t, as the interlocutor may change his/her intent based on the opponent's utterance and the current situation. Then, the speaker formulates an appropriate emotion E*t for the response based on the state S*t and intent I*t. Finally, the response U*t is produced based on the speaker state S*t, intent I*t, and emotion E*t. We surmise that considering these factors would help represent the argument and discourse structure of the conversation, which leads to improved conversation understanding, including emotion recognition. A minimal sketch of this update scheme is given below, after the caption of Fig. 1.

FIGURE 1. Interaction among different controlling variables during a dyadic conversation between persons A and B. Grey and white circles represent hidden and observed variables, respectively. P represents personality, U represents utterance, S represents interlocutor state, I represents interlocutor intent, E represents emotion, and Topic represents the topic of the conversation. This can easily be extended to multi-party conversations.
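The dependency structure just described can be restated as a short sketch. The functions f, g, h, and m below are placeholders for whatever (possibly learned) update functions realize these dependencies; this is a schematic rendering of Fig. 1, not an implementation from the literature.

```python
def dialogue_step(topic, personality, prev_state, prev_intent, context_utterances,
                  f, g, h, m):
    """One speaker turn, following the variable dependencies of Fig. 1.

    f, g, h, m are placeholder update functions (e.g., learned networks):
      state   S_t depends on P, Topic, the previous state, and preceding utterances U_{<t}
      intent  I_t depends on the previous intent I_{t-2} and the current state S_t
      emotion E_t depends on the current state S_t and intent I_t
      reply   U_t depends on S_t, I_t, and E_t
    """
    state = f(personality, topic, prev_state, context_utterances)    # S_t
    intent = g(prev_intent, state)                                    # I_t
    emotion = h(state, intent)                                        # E_t
    utterance = m(state, intent, emotion)                             # U_t
    return state, intent, emotion, utterance
```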
Early computational work on dialogue mostly focused on task-oriented cases, in which the overall conversational intent and step-by-step sub-goals played a large part [17], [18]. Cohen and Levesque [19] developed a model and logic to represent intentions and their connections to utterances, whose operators explicate the treatment of beliefs about the interlocutor's beliefs and vice versa, recursively. Emotion, however, played no role in this line of research. In more recent work, chatbots and chit-chat dialogue have become more prominent, in part due to the use of distributed (such as embedding) representations that do not readily support logical inference.

In a conversational setting, D'Mello et al. [20] and Yang et al. [21] worked with small datasets with three and four emotion labels, respectively. This was followed by Phan et al. [22], where emotion detection on conversation transcripts was attempted. Recently, several works [23], [24] have devised deep learning-based techniques for ERC. These works are crucial, as we surmise an instrumental role of ERC in emotion-aware, a.k.a. affective, dialogue generation, which falls within the topic of "text generation under pragmatic constraints" as proposed by Hovy [15]. Fig. 3 illustrates one such conversation between a human (user) and a medical chatbot (health-assistant). The assistant responds with emotion based on the user's input. Depending on whether the user suffered an injury earlier or not, the health-assistant responds with excitement (evoking urgency) or happiness (evoking relief).

As ERC is a new research field, outlining research challenges, available datasets, and benchmarks can potentially aid future research on ERC. In this paper, we aim to serve this purpose by discussing various factors that contribute to the emotion dynamics in a conversation. We surmise that this paper will not only help researchers to better understand the challenges and recent works on ERC but also show possible future research directions. The rest of the paper is organized as follows: Section II presents the key research challenges; Section III and Section IV cover the datasets and recent progress in this field; finally, Section V concludes the paper.


FIGURE 2. An abridged dialogue from the IEMOCAP dataset.

FIGURE 3. Illustration of an affective conversation where the emotion depends on the context.

II. RESEARCH CHALLENGES
Recent works on ERC, e.g., DialogueRNN [11] or ICON [23], strive to address several key research challenges that make the task of ERC difficult to solve:

A. CATEGORIZATION OF EMOTIONS
Emotion is defined using two types of models: categorical and dimensional. A categorical model classifies emotion into a fixed number of discrete categories, whereas a dimensional model describes emotion as a point in a continuous multi-dimensional space.

On the categorical front, Plutchik's [2] wheel of emotions defines eight discrete primary emotion types, each of which has finer related subtypes. On the other hand, Ekman [3] concludes six basic emotions: anger, disgust, fear, happiness, sadness, and surprise.

Most dimensional categorization models [25], [26] adopt two dimensions: valence and arousal. Valence represents the degree of emotional positivity, and arousal represents the intensity of the emotion. In contrast with categorical models, dimensional models map emotion into a continuous spectrum rather than hard categories. This enables easy and intuitive comparison of two emotional states using vector operations (a small sketch is given at the end of this subsection), whereas such comparison is non-trivial for categorical models. As there are multiple categorical and dimensional taxonomies available, it is challenging to select one particular model for annotation. Choosing a simple categorization model, e.g., Ekman's model, has a major drawback, as such models are unable to ground complex emotions. On the other hand, complex emotion models such as Plutchik's make it very difficult for annotators to discriminate between related emotions, e.g., discerning anger from rage. Complex emotion models also increase the risk of obtaining a lower inter-annotator agreement.

The popular ERC dataset IEMOCAP [27] adopted both categorical and dimensional models. However, newer ERC datasets like DailyDialog [28] have employed only the categorical model due to its more intuitive nature. Most of the available datasets for emotion recognition in conversation adopted simple taxonomies, which are slight variants of Ekman's model. Each emotional utterance in the EmoContext dataset is labeled with one of the following emotions: happiness, sadness, and anger. The majority of the utterances in EmoContext do not elicit any of these three emotions and are annotated with an extra label: others. Naturally, the inter-annotator agreement for the EmoContext dataset is higher due to its simplistic emotion taxonomy. However, the short context length and simple emotion taxonomy make ERC on this dataset less challenging.
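To illustrate why dimensional labels support easy comparison, the sketch below treats two emotional states as points in valence-arousal space and compares them with ordinary vector operations; the coordinates are made-up values chosen only for illustration.

```python
import math

def distance(state_a, state_b):
    """Euclidean distance between two (valence, arousal) points in [-1, 1]^2."""
    return math.dist(state_a, state_b)

# Hypothetical coordinates, chosen only to illustrate the comparison.
excited = (0.8, 0.9)    # high valence, high arousal
content = (0.7, 0.2)    # high valence, low arousal
angry   = (-0.7, 0.8)   # low valence, high arousal

print(distance(excited, content))  # smaller: related positive states
print(distance(excited, angry))    # larger: opposite valence
```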

B. BASIS OF EMOTION ANNOTATION
Annotation with emotion labels is challenging, as the label depends on the annotator's perspective. Self-assessment by the interlocutors in a conversation is arguably the best way to annotate utterances. However, in practice it is unfeasible, as real-time tagging of unscripted conversations would impact the conversation flow. Post-conversation self-annotation could be an option, but it has not been done yet. As such, many ERC datasets [27] are scripted and annotated by a group of people uninvolved with the script and conversation. The annotators are given the context of the utterances as prior knowledge for accurate annotation. Often pre-existing transcripts are annotated for quick turn-around, as in EmotionLines [10].

The annotators also need to be aware of the interlocutors' perspective for situation-aware annotation. For example, the emotion behind the utterance "Lehman Brothers' stock is plummeting!!" depends on whether the speaker benefits from the crash. The annotators should be aware of the nature of the association between the speaker and Lehman Brothers for accurate labeling.


FIGURE 4. Role of context in emotion recognition in conversation.

C. CONVERSATIONAL CONTEXT MODELING
Context is at the core of NLP research. According to several recent studies [29], [30], contextual sentence and word embeddings can improve the performance of state-of-the-art NLP systems by a significant margin.

The notion of context can vary from problem to problem. For example, while calculating word representations, the surrounding words carry contextual information. Likewise, to classify a sentence in a document, the other neighboring sentences are considered as its context. In Poria et al. [31], surrounding utterances are treated as context, and the authors experimentally show that contextual evidence indeed aids classification.

Similarly, in conversational emotion detection, to determine the emotion of an utterance at time t, the preceding utterances at time < t can be considered as its context. However, computing this context representation often exhibits major difficulties due to emotional dynamics.

Emotional dynamics of conversations consist of two important aspects: self and inter-personal dependencies [32]. Self-dependency, also known as emotional inertia, deals with the emotional influence that speakers have on themselves during conversations [33]. On the other hand, inter-personal dependencies relate to the emotional influences that the counterparts induce into a speaker. Conversely, during the course of a dialogue, speakers also tend to mirror their counterparts to build rapport [34]. This phenomenon is illustrated in Fig. 2. Here, Pa is frustrated over her long-term unemployment and seeks encouragement (u1, u3). Pb, however, is pre-occupied and replies sarcastically (u4). This enrages Pa into an angry response (u6). In this dialogue, emotional inertia is evident in Pb, who does not deviate from his nonchalant behavior, whereas Pa gets emotionally influenced by Pb. Modeling self and inter-personal relationships and dependencies may also depend on the topic of the conversation as well as various other factors like argument structure, interlocutors' personality, intents, viewpoints on the conversation, attitude towards each other, etc. Hence, analyzing all these factors is key for true self and inter-personal dependency modeling that can lead to enriched context understanding.

The contextual information can come from both local and distant conversational history. While the importance of local context is more obvious, as stated in recent works, distant context often plays a less important role in ERC. Distant contextual information is useful mostly in scenarios where a speaker refers to earlier utterances spoken by any of the speakers in the conversational history.

The usefulness of context is more prevalent in classifying short utterances, like "yeah", "okay", and "no", that can express different emotions depending on the context and discourse of the dialogue. The examples in Fig. 4a and Fig. 4b explain this phenomenon. The emotions expressed by the same utterance "Yeah" in these two examples differ from each other and can only be inferred from the context.

Finding contextualized conversational utterance representations is an active area of research. Leveraging such contextual clues is a difficult task. Memory networks, RNNs, and attention mechanisms have been used in previous works, e.g., HRLCE or DialogueRNN, to grasp information from the context. A minimal sketch of a preceding-utterance context window is shown below.
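As a concrete, if simplistic, rendering of preceding-utterance context, the sketch below gathers up to the last k utterances before a target utterance; the window size, data layout, and example dialogue are assumptions made only for illustration.

```python
from typing import List, Tuple

def context_window(dialogue: List[Tuple[str, str]], t: int, k: int = 5):
    """Return up to k (utterance, speaker) pairs preceding the target at index t.

    dialogue: ordered list of (utterance, speaker) pairs.
    The returned window preserves temporal order; an empty list means the
    target is the first utterance and has no conversational context.
    """
    start = max(0, t - k)
    return dialogue[start:t]

dialogue = [
    ("I got the results today.", "Pa"),
    ("And?", "Pb"),
    ("Yeah.", "Pa"),  # the emotion of this "Yeah" depends on what preceded it
]
print(context_window(dialogue, t=2, k=5))
```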
D. SPEAKER SPECIFIC MODELING
Individuals have their own subtle ways of expressing emotions. For instance, some individuals are more sarcastic than others, and in such cases the usage of certain words varies depending on whether they are being sarcastic. Consider this example, Pa: "The order has been cancelled.", Pb: "This is great!". If Pb is a sarcastic person, then his response expresses negative emotion towards the order being canceled through the word great. On the other hand, Pb's response, great, could be taken literally if the canceled order is beneficial to Pb (perhaps Pb cannot afford the product he ordered). Since the necessary background information is often missing from the conversations, speaker profiling based on preceding utterances often yields improved results.

E. LISTENER SPECIFIC MODELING
During a conversation, the listeners make up their minds about the speaker's utterance as it is spoken. However, there is no textual data on the listener's reaction to the speaker while the speaker speaks. A model must resort to the visual modality, i.e., the listener's facial expressions, to capture the listener's reaction. However, according to DialogueRNN, capturing listener reaction does not yield any improvement, as the listener's subsequent utterance carries their reaction. Moreover, if the listener never speaks in a conversation, his/her reaction remains irrelevant. Nonetheless, listener modeling can be useful in scenarios where continuous emotion recognition of every moment of the conversation is necessary, like audience reaction during a political speech, as opposed to emotion recognition of each utterance.

FIGURE 5. Emotion shift of speakers in a dialogue in comparison with the speaker's previous emotion. Red and blue colors are used to show the emotion shift of Joey and Chandler, respectively.

F. PRESENCE OF EMOTION SHIFT
Due to emotional inertia, participants in a conversation tend to stick to a particular emotional state, unless some external stimulus, usually the other participants, invokes a change. This is illustrated in Fig. 5, where Joey changes his emotion from neutral to anger due to the last utterance of Chandler, which was unexpected and rather shocking to Joey. This is a hard problem to solve, as the state-of-the-art ERC model, DialogueRNN, is more accurate in emotion detection for utterances without emotional shift or when the shift is to a similar emotion (e.g., from fear to sad).

The state-of-the-art methods keep mimicking the same emotion for a particular party, since an abrupt change of emotion is unlikely. Hence, these methods fail in most cases where a change occurs. To tackle this, a new problem of detecting emotion shift can be framed (a sketch of one way to derive such shift labels is shown below):
1) based on the historical utterances and the present utterance, is there an emotion shift (binary classification)?
2) if there is a shift, then what is the target emotion (multi-label classification)?
As a baseline, the performance of a conditional random field (CRF) would be interesting, as it models label dependencies.
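To make the proposed shift-detection task concrete, the sketch below derives binary emotion-shift targets from gold per-utterance emotions, comparing each utterance with the same speaker's previous emotion (as in Fig. 5). This is one plausible labeling scheme, not a published formulation.

```python
def emotion_shift_labels(speakers, emotions):
    """For each utterance, flag whether the speaker's emotion changed
    relative to that same speaker's previous utterance (None if the
    speaker has no earlier utterance in the dialogue)."""
    last_emotion = {}          # speaker -> most recent emotion
    shifts = []
    for speaker, emotion in zip(speakers, emotions):
        previous = last_emotion.get(speaker)
        shifts.append(None if previous is None else previous != emotion)
        last_emotion[speaker] = emotion
    return shifts

speakers = ["Joey", "Chandler", "Joey", "Chandler", "Joey"]
emotions = ["neutral", "neutral", "neutral", "neutral", "anger"]
print(emotion_shift_labels(speakers, emotions))
# [None, None, False, False, True]  -> Joey's final utterance is a shift
```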

G. FINE-GRAINED EMOTION RECOGNITION
Fine-grained emotion recognition aims at recognizing emotion expressed on explicit and implicit topics. It involves a deeper understanding of the topic of the conversation, interlocutor opinion, and stand. For example, in Fig. 6, while both persons take a supportive stand for the government's bill, they use completely opposite emotions to express it. It is not possible for a vanilla emotion recognizer to understand the positive emotion of both interlocutors on the aspect of the government's bill. Only by interpreting Person 2's frustration about the opposition's protest against the bill can a classifier infer Person 2's support for the bill. On the other hand, even though Person 1 does not explicitly express his/her opinion on the opposition, from the discourse of the conversation it can be inferred that Person 1 holds a negative opinion on the opposition.

FIGURE 6. Fine-grained emotion understanding: An example.

H. MULTIPARTY CONVERSATION
In a multiparty conversation, more than two participants are involved. Naturally, emotion recognition in such conversations is more challenging in comparison with dyadic conversations due to the difficulty in tracking individual speaker states and handling co-references.

I. PRESENCE OF SARCASM
Sarcasm is a linguistic tool that uses irony to express contempt. An ERC system incapable of detecting sarcasm mostly fails to predict the emotion of sarcastic utterances correctly.


TABLE 1. Label distribution statistics in different emotion recognition datasets.

Sarcasm detection in a conversation largely depends on the context and discourse of the conversation. For example, the utterance "The part where Obama signed it" can only be detected as sarcastic if we look at the previous utterance "What part of this would be unconstitutional?". Sarcastic nature is also person dependent, which again warrants speaker profiling in the conversation.

J. EMOTION REASONING
The ability to reason is necessary for any explainable AI system. In the context of ERC, it is often desired to understand the cause of an emotion expressed by a speaker. As an example, we can refer to Fig. 2. An ideal ERC system, with the ability of emotion reasoning, should perceive the reason for PersonA's anger, expressed in u6 of Fig. 2. It is evident upon observation that this anger is caused by the persistent nonchalant behavior of PersonB. Readers should not conflate emotion reasoning with context modeling, which we discussed earlier in this section. Unlike context modeling, emotion reasoning does not only find the contextual utterances in the conversational history that trigger the emotion of an utterance, but also determines the function of those contextual utterances on the target utterance. In Fig. 2, it is the indifference of PersonB, reflected by u4 and u5, that makes PersonA angry. Similarly, in Fig. 5, Joey expresses anger once he ascertains Chandler's deception in the previous utterance. It is hard to define a taxonomy or tagset for emotion reasoning, and at present there is no available dataset which contains such rich annotations. Building such a dataset would enable future dialogue systems to frame meaningful argumentation logic and discourse structure, taking one step closer to human-like conversation.

III. DATASETS
In the last few years, emotion recognition in conversation has gained major research interest, mainly because of its potential application in dialogue systems to generate emotion-aware and empathetic dialogues [12]. The primary goal of the ERC task is to label each utterance in a conversation with an emotion label. In this section, we discuss the publicly available ERC datasets as well as their shortcomings.

There are a few publicly available datasets for ERC: IEMOCAP [27], SEMAINE [35], EmotionLines [10], MELD [36], DailyDialog [28], and EmoContext [37]. A detailed comparison of these datasets is drawn in Table 2. Among them, IEMOCAP, SEMAINE, and MELD are multimodal (containing acoustic, visual, and textual information), and the remaining three are textual. Apart from the SEMAINE dataset, the rest of the datasets contain categorical emotion labels. In contrast, each utterance of the SEMAINE dataset is annotated with four real-valued affective attributes: valence ([-1, 1]), arousal ([-1, 1]), expectancy ([-1, 1]), and power ([0, ∞)). We also show the emotion label distribution of these datasets in Table 1. In the EmoContext dataset, an emotion label is assigned to only the last utterance of each dialogue. None of these datasets can be used for emotion reasoning, as they lack the annotation details required for the reasoning task. Readers should also note that none of these datasets contain fine-grained, topic-level emotion annotation.

IV. RECENT ADVANCES
In this section we give a brief introduction to the recent work on this topic. We also compare the approaches and report their drawbacks. As depicted in Fig. 1, recognizing the emotion of an utterance in a conversation primarily depends on the following three factors:
1) the utterance itself and its context, defined by the interlocutors' preceding utterances in the conversation as well as the intent and the topic of the conversation,
2) the speaker's state, comprising variables like personality and argumentation logic, and
3) the emotions expressed in the preceding utterances.
Although IEMOCAP and SEMAINE were developed almost a decade ago, most of the works that used these two datasets did not consider the aforementioned factors.

A. BENCHMARKS AND THEIR DRAWBACKS
Based on these factors, a number of approaches to address the ERC problem have been proposed recently. Conversational memory network (CMN), proposed by Hazarika et al. [38] for dyadic dialogues, is one of the first ERC approaches that


TABLE 2. Comparison among the IEMOCAP, SEMAINE, EmotionLines, MELD, and DailyDialog datasets.

FIGURE 7. Comparison of attention scores over utterance history of CMN and DialogueRNN. Higher attention value signifies more
important contextual information. Note: Figure taken from Majumder et al. [11].

utilizes distinct memories for each speaker for speaker-specific context modeling. Later, Hazarika et al. [23] improved upon this approach with the interactive conversational memory network (ICON), which interconnects these memories to model self- and inter-speaker emotional influence. Neither of these two methods actually exploits the speaker information of the target utterance for classification, which makes the models blind to speaker-specific nuances.

DialogueRNN [11] aims to solve this issue by considering the speaker information of the target utterance and, further, modeling self- and inter-speaker emotional influence with a hierarchical multi-stage RNN with an attention mechanism. On both the IEMOCAP and SEMAINE datasets, DialogueRNN outperformed (Table 3 and Table 4) the other two approaches.

The need to grasp inter-speaker dependency for ERC is also acknowledged and modeled in [39] by leveraging quantum theory and LSTMs. Their network, Quantum-Inspired Interactive Networks (QIN), has outperformed CMN and ICON on the IEMOCAP and MELD datasets. Recently, Yeh et al. [9] proposed an ERC method called Interaction-aware Attention Network (IANN) that leverages inter-speaker relation modeling. Similar to ICON and CMN, IANN (Fig. 8) utilizes distinct memories for each speaker.

All of these models affirm that contextual history and modeling self- and inter-speaker influence are beneficial to ERC (shown in Fig. 7 and Fig. 10). Further, DialogueRNN shows that the nearby utterances are generally more context rich and that ERC performance improves when the future utterances, at time > t, are available. This is indicated by Fig. 10, where DialogueRNN uses both past and future utterances as context with roughly the same frequency. Also, the distant utterances are used less frequently than the nearby utterances. On the other hand, CMN and ICON do not use future utterances as context at all. However, for real-time applications, systems cannot rely on future utterances. In such cases, CMN, ICON, and DialogueRNN with a fixed context window would be befitting.

All these networks, namely CMN, ICON, IANN, and DialogueRNN, perform poorly on utterances with emotion shift. In particular, in the cases where the emotion of the target utterance differs from that of the previous utterance, DialogueRNN


TABLE 3. Comparison between DialogueRNN and baseline methods on IEMOCAP dataset; bold font denotes the best performances.
Average(w) = Weighted average. ICON results differ from the original paper [23] as in our experiment, we disregard their contextual
feature extraction and pre-processing part. More details can be found in Majumder et al. [11].

TABLE 4. Comparison between DialogueRNN and baseline methods on SEMAINE dataset; Acc. = Accuracy, MAE = Mean Absolute Error, r = Pearson
correlation coefficient; bold font denotes the best performances. More details can be found in Majumder et al. [11].

FIGURE 8. Interaction-aware Attention Network (IANN), proposed by Yeh et al. [9].

could only correctly predict 47.5% of such instances. This is low compared to the 69.2% success rate that it achieves in regions with no emotional shift.

Among these approaches, only DialogueRNN is capable of handling multiparty conversations at a large scale. However, on the multiparty conversational dataset MELD, only a small performance improvement (shown in Table 5) is observed for DialogueRNN compared to bc-LSTM, which points to a future research direction on multiparty ERC. ICON and CMN are designed to detect emotions in dyadic dialogues. Adapting ICON and CMN to the multiparty conversational dataset MELD can cause scalability issues when the number of speakers participating in a conversation in the test data is larger than in the training data.
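The results cited above are reported as weighted-average F-scores, accuracy, mean absolute error, and Pearson's r (Tables 3-5). For reference, the sketch below shows how these metrics are typically computed with scikit-learn and SciPy; the label sequences and attribute values are made up solely to demonstrate the calls.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error

# Categorical ERC (e.g., IEMOCAP, MELD): accuracy and weighted-average F-score.
y_true = ["happy", "sad", "neutral", "angry", "neutral"]   # toy gold labels
y_pred = ["happy", "neutral", "neutral", "angry", "sad"]   # toy predictions
print(accuracy_score(y_true, y_pred))
print(f1_score(y_true, y_pred, average="weighted"))        # "Average(w)" / "w-avg"

# Dimensional ERC (e.g., SEMAINE attributes): MAE and Pearson correlation r.
valence_true = np.array([0.3, -0.2, 0.8, 0.1])
valence_pred = np.array([0.1, -0.4, 0.6, 0.2])
print(mean_absolute_error(valence_true, valence_pred))
print(pearsonr(valence_true, valence_pred)[0])             # r only
```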


TABLE 5. Test-set F-score results of bc-LSTM and DialogueRNN for emotion classification in MELD. Note: w-avg denotes weighted average; text-CNN: CNN applied on text, contextual information was not used.

FIGURE 9. The EmoContext dataset [37] and the HRLCE [40] framework.

TABLE 6. Recent works on the EmoContext dataset.

FIGURE 10. Histogram of Δt = distance between the target utterance and its context utterance, based on DialogueRNN's attention scores. Note: Figure taken from Majumder et al. [11].

Due to the sequential nature of the utterances in conversations, RNNs are used for context generation in the aforementioned models. However, there is ample room for improvement, as the RNN-based context representation methods perform poorly in grasping long-distance contextual information.

Recently, two shared tasks, EmotionX (https://sites.google.com/view/emotionx2019/, co-located with the SocialNLP workshop) and EmoContext (https://www.humanizing-ai.com/emocontext.html, co-located with SemEval 2019), have been organized to address the ERC problem. The EmoContext shared task has garnered more than 500 participants, affirming the growing popularity of this research field. Compared to other datasets, the EmoContext dataset [37] has very short conversations, consisting of only three utterances, where the goal is to label the 3rd utterance as shown in Fig. 9a.

Emotion labels of the previous utterances are not present in the EmoContext dataset. The key works [24], [37], [40]–[44] on this dataset have mainly leveraged context modeling using the bc-LSTM architecture [31], which encapsulates the temporal order of the utterances using an LSTM. A common trend can be noticed in these works, where traditional word embeddings, such as GloVe [45], are combined with contextualized word embeddings, such as ELMo [29], to improve performance. Most of these works use an attention mechanism on top of the bc-LSTM to enrich the context representation.

In Fig. 9b, we depict the HRLCE framework, proposed by Huang et al. [40], which comprises an utterance encoder and a context encoder that takes input from the utterance encoder. To represent each utterance, HRLCE utilizes ELMo [29], GloVe [45], and DeepMoji [46]. The context encoder in HRLCE adapts the bc-LSTM framework followed by a multi-head attention layer. A simplified sketch of such an encoder is given below.
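The following is a minimal PyTorch sketch of a bc-LSTM-style context encoder with multi-head attention, in the spirit of the pipeline described above. It assumes pre-computed utterance vectors (e.g., averaged GloVe features concatenated with a contextual sentence embedding) and arbitrary layer sizes; it is a simplified illustration, not the HRLCE implementation.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Simplified bc-LSTM-style context encoder (a sketch, not HRLCE itself).

    Inputs are assumed to be pre-computed utterance vectors; producing
    those vectors (GloVe/ELMo/DeepMoji features) is outside this sketch.
    Requires PyTorch >= 1.9 for batch_first attention.
    """

    def __init__(self, utt_dim: int, hidden: int, n_emotions: int, n_heads: int = 4):
        super().__init__()
        # Bidirectional LSTM over the sequence of utterance vectors.
        self.bilstm = nn.LSTM(utt_dim, hidden, batch_first=True, bidirectional=True)
        # Multi-head self-attention over the LSTM states to enrich context.
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, utt_vectors: torch.Tensor) -> torch.Tensor:
        # utt_vectors: (batch, n_utterances, utt_dim)
        states, _ = self.bilstm(utt_vectors)             # (batch, n_utt, 2*hidden)
        context, _ = self.attn(states, states, states)   # self-attention over the dialogue
        return self.classifier(context)                  # per-utterance emotion logits

# Toy usage: 1 dialogue of 3 utterances, 400-dim utterance features, 7 emotions.
model = ContextEncoder(utt_dim=400, hidden=128, n_emotions=7)
logits = model(torch.randn(1, 3, 400))
print(logits.shape)  # torch.Size([1, 3, 7])
```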


Huang et al. [40] applied the HRLCE framework only on the EmoContext dataset. However, HRLCE can be easily adapted to other ERC datasets. It should be noted that none of the works on the EmoContext dataset utilize speaker information. In fact, in our experiments, we found that DialogueRNN, which makes use of the speaker information, performs similarly (Table 6) to Bae et al. [24], Huang et al. [40], and Chatterjee et al. [37] on the EmoContext dataset. One possible reason for this could be the presence of very short context history in the dataset, which renders speaker information inconsequential.

V. CONCLUSION
Emotion recognition in conversation has been gaining popularity among NLP researchers. In this paper, we summarized the recent advances in this task and highlighted several key research challenges associated with this research area. Further, we pointed out how current work has partly addressed these challenges, while also presenting some shortcomings. Overall, we surmised that an effective emotion-shift recognition model and context encoder can yield significant performance improvement over chit-chat dialogue, and even improve some aspects of task-oriented dialogue. Moreover, challenges like topic-level speaker-specific emotion recognition, ERC on multiparty conversations, and conversational sarcasm detection can form new research directions. Additionally, fine-grained speaker-specific continuous emotion recognition may become of interest for the purpose of tracking emotions during long monologues. We believe that addressing each of the challenges outlined in this paper will not only enhance AI-enabled conversation understanding, but also improve the performance of dialogue systems by catering to affective information.

ACKNOWLEDGMENT
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.

REFERENCES
[1] C. Darwin and P. Prodger, The Expression of the Emotions in Man and Animals. Oxford, U.K.: Oxford Univ. Press, 1998.
[2] R. Plutchik, "A psychoevolutionary theory of emotions," Social Sci. Inf., vol. 21, nos. 4-5, pp. 529-553, 1982.
[3] P. Ekman, "Facial expression and emotion," Amer. Psychol., vol. 48, no. 4, p. 384, 1993.
[4] B. Kratzwald, S. Ilic, M. Kraus, S. Feuerriegel, and H. Prendinger, "Deep learning for affective computing: Text-based emotion recognition in decision support," 2018, arXiv:1803.06397. [Online]. Available: https://arxiv.org/abs/1803.06397
[5] N. Colnerič and J. Demsar, "Emotion recognition on Twitter: Comparative study and training a unison model," IEEE Trans. Affect. Comput., to be published.
[6] C. Strapparava and R. Mihalcea, "SemEval-2007 task 14: Affective text," in Proc. 4th Int. Workshop Semantic Eval., Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 70-74. [Online]. Available: http://dl.acm.org/citation.cfm?id=1621474.1621487
[7] C. Strapparava and A. Valitutti, "WordNet Affect: An affective extension of WordNet," in Proc. LREC, vol. 4, 2004, p. 40.
[8] S. M. Mohammad and P. D. Turney, "Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon," in Proc. NAACL HLT Workshop Comput. Approaches Anal. Gener. Emotion Text, Los Angeles, CA, USA: Association for Computational Linguistics, 2010, pp. 26-34.
[9] S.-L. Yeh, Y.-S. Lin, and C.-C. Lee, "An interaction-aware attention network for speech emotion recognition in spoken dialogs," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), May 2019, pp. 6685-6689.
[10] S.-Y. Chen, C.-C. Hsu, C.-C. Kuo, T.-H. Huang, and L.-W. Ku, "EmotionLines: An emotion corpus of multi-party conversations," 2018, arXiv:1802.08379. [Online]. Available: https://arxiv.org/abs/1802.08379
[11] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, and E. Cambria, "DialogueRNN: An attentive RNN for emotion detection in conversations," 2018, arXiv:1811.00405. [Online]. Available: https://arxiv.org/abs/1811.00405
[12] H. Zhou, M. Huang, T. Zhang, X. Zhu, and B. Liu, "Emotional chatting machine: Emotional conversation generation with internal and external memory," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 730-738.
[13] C.-H. Wu, Z.-J. Chuang, and Y.-C. Lin, "Emotion recognition from text using semantic labels and separable mixture models," ACM Trans. Asian Lang. Inf. Process., vol. 5, no. 2, pp. 165-183, 2006.
[14] S. Shaheen, W. El-Hajj, H. Hajj, and S. Elbassuoni, "Emotion recognition from text based on automatically generated rules," in Proc. IEEE Int. Conf. Data Mining Workshop, Dec. 2014, pp. 383-392.
[15] E. Hovy, "Generating natural language under pragmatic constraints," J. Pragmatics, vol. 11, no. 6, pp. 689-719, 1987.
[16] J. J. Schlöder and R. Fernández, "Clarifying intentions in dialogue: A corpus study," in Proc. 11th Int. Conf. Comput. Semantics, 2015, pp. 46-51.
[17] B. J. Grosz and C. L. Sidner, "Attention, intentions, and the structure of discourse," Comput. Linguistics, vol. 12, no. 3, pp. 175-204, 1986.
[18] D. E. Appelt, Planning English Sentences. Cambridge, U.K.: Cambridge Univ. Press, 1992.
[19] P. R. Cohen and H. J. Levesque, "Speech acts and rationality," in Proc. 23rd Annu. Meeting Assoc. Comput. Linguistics, Chicago, IL, USA: Association for Computational Linguistics, 1985, pp. 49-60.
[20] S. K. D'Mello, S. D. Craig, J. Sullins, and A. C. Graesser, "Predicting affective states expressed through an emote-aloud procedure from AutoTutor's mixed-initiative dialogue," Int. J. Artif. Intell. Educ., vol. 16, no. 1, pp. 3-28, 2006.
[21] C. Yang, K. H.-Y. Lin, and H.-H. Chen, "Emotion classification using Web blog corpora," in Proc. IEEE/WIC/ACM Int. Conf. Web Intell. (WI), Nov. 2007, pp. 275-278.
[22] D. A. Phan, H. Shindo, and Y. Matsumoto, "Multiple emotions detection in conversation transcripts," in Proc. 30th Pacific Asia Conf. Lang., Inf. Comput., Seoul, South Korea, Oct. 2016, pp. 85-94. [Online]. Available: https://www.aclweb.org/anthology/Y16-2006
[23] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, and R. Zimmermann, "ICON: Interactive conversational memory network for multimodal emotion detection," in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 2594-2604.
[24] S. Bae, J. Choi, and S.-G. Lee, "SNU_IDS at SemEval-2019 task 3: Addressing training-test class distribution mismatch in conversational classification," 2019, arXiv:1903.02163. [Online]. Available: https://arxiv.org/abs/1903.02163
[25] J. A. Russell, "A circumplex model of affect," J. Personality Social Psychol., vol. 39, no. 6, pp. 1161-1178, Dec. 1980.
[26] A. Mehrabian, "Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament," Current Psychol., vol. 14, no. 4, pp. 261-292, 1996. doi: 10.1007/BF02686918.
[27] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Lang. Resour. Eval., vol. 42, no. 4, p. 335, 2008.
[28] Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, "DailyDialog: A manually labelled multi-turn dialogue dataset," 2017, arXiv:1710.03957. [Online]. Available: https://arxiv.org/abs/1710.03957
[29] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," 2018, arXiv:1802.05365. [Online]. Available: https://arxiv.org/abs/1802.05365
[30] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018, arXiv:1810.04805. [Online]. Available: https://arxiv.org/abs/1810.04805


[31] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, and L.-P. Morency, "Context-dependent sentiment analysis in user-generated videos," in Proc. ACL, vol. 1, 2017, pp. 873-883.
[32] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, "MELD: A multimodal multi-party dataset for emotion recognition in conversations," in Proc. 57th Conf. Assoc. Comput. Linguistics, Florence, Italy: Association for Computational Linguistics, Jul. 2019, pp. 527-536.
[33] P. Kuppens, N. B. Allen, and L. B. Sheeber, "Emotional inertia and psychological maladjustment," Psychol. Sci., vol. 21, no. 7, pp. 984-991, 2010.
[34] C. Navarretta, "Mirroring facial expressions and emotions in dyadic conversations," in Proc. LREC, 2016, pp. 469-474.
[35] G. McKeown, M. Valstar, R. Cowie, M. Pantic, and M. Schröder, "The SEMAINE database: Annotated multimodal records of emotionally colored conversations between a person and a limited agent," IEEE Trans. Affect. Comput., vol. 3, no. 1, pp. 5-17, Jan./Mar. 2012.
[36] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, "MELD: A multimodal multi-party dataset for emotion recognition in conversations," 2018, arXiv:1810.02508. [Online]. Available: https://arxiv.org/abs/1810.02508
[37] A. Chatterjee, U. Gupta, M. K. Chinnakotla, R. Srikanth, M. Galley, and P. Agrawal, "Understanding emotions in text using deep learning and big data," Comput. Hum. Behav., vol. 93, pp. 309-317, Apr. 2019.
[38] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, and R. Zimmermann, "Conversational memory network for emotion recognition in dyadic dialogue videos," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Hum. Lang. Technol., vol. 1, 2018, pp. 2122-2132.
[39] Y. Zhang, Q. Li, D. Song, P. Zhang, and P. Wang, "Quantum-inspired interactive networks for conversational sentiment analysis," in Proc. 28th Int. Joint Conf. Artif. Intell. (IJCAI), 2019, pp. 1-8.
[40] C. Huang, A. Trabelsi, and O. R. Zaïane, "ANA at SemEval-2019 task 3: Contextual emotion detection in conversations through hierarchical LSTMs and BERT," 2019, arXiv:1904.00132. [Online]. Available: https://arxiv.org/abs/1904.00132
[41] A. V. González, V. P. B. Hansen, J. Bingel, and A. Søgaard, "CoAStaL at SemEval-2019 Task 3: Affect classification in dialogue using attentive BiLSTMs," in Proc. 13th Int. Workshop Semantic Eval., 2019, pp. 169-174.
[42] A. Bouchekif, P. Joshi, L. Bouchekif, and H. Afli, "EPITA-ADAPT at SemEval-2019 Task 3: Detecting emotions in textual conversations using deep learning models combination," in Proc. 13th Int. Workshop Semantic Eval., 2019, pp. 215-219.
[43] W. Ragheb, J. Azé, S. Bringay, and M. Servajean, "LIRMM-advanse at SemEval-2019 Task 3: Attentive conversation modeling for emotion detection and classification," in Proc. 13th Int. Workshop Semantic Eval., 2019, pp. 251-255.
[44] W. Ragheb, J. Azé, S. Bringay, and M. Servajean, "Attention-based modeling for emotion detection and classification in textual conversations," 2019, arXiv:1906.07020. [Online]. Available: https://arxiv.org/abs/1906.07020
[45] J. Pennington, R. Socher, and C. Manning, "GloVe: Global vectors for word representation," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2014, pp. 1532-1543.
[46] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann, "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm," 2017, arXiv:1708.00524. [Online]. Available: https://arxiv.org/abs/1708.00524

SOUJANYA PORIA received the Ph.D. degree in computer science from the University of Stirling, U.K. He is currently an Assistant Professor of computer science with the Singapore University of Technology and Design (SUTD), Singapore. He has coauthored more than 70 papers, published in top-tier conferences and journals, such as ACL, EMNLP, AAAI, NAACL, Neurocomputing, and Computational Intelligence Magazine. He was a recipient of the prestigious early career research award called the NTU Presidential Postdoctoral Fellowship, in 2018. He has given several invited talks at venues like CICLing 2018, which is a large international NLP conference.

NAVONIL MAJUMDER received the M.Sc. degree from the CIC, Instituto Politécnico Nacional, Mexico, in 2017, where he is currently pursuing the Ph.D. degree. He is also a Visiting Researcher with Nanyang Technological University (NTU). He has more than 15 research publications in top-tier conferences and journals, such as ACL, EMNLP, AAAI, Knowledge-Based Systems, and IEEE Intelligent Systems. His research interests include natural language processing, sentiment analysis, and multimodal language processing. His research works have occasionally been featured in news portals, such as KDnuggets and Datanami. He received the Lázaro Cárdenas Gold Medal from the President of Mexico for his M.Sc. study.

RADA MIHALCEA is currently a Professor of computer science and engineering with the University of Michigan and the Director of the Michigan Artificial Intelligence Laboratory. Her research interests include lexical semantics, multilingual NLP, and computational social sciences. She serves or has served on the editorial boards of the journals Computational Linguistics, Language Resources and Evaluation, Natural Language Engineering, the Journal of Artificial Intelligence Research, the IEEE Transactions on Affective Computing, and Transactions of the Association for Computational Linguistics. She was a recipient of the NSF CAREER Award, in 2008, and the Presidential Early Career Award for Scientists and Engineers from President Obama, in 2009. She was a Program Co-Chair of EMNLP 2009 and ACL 2011, and a General Chair of NAACL 2015 and *SEM 2019. She currently serves as the ACL Vice-President Elect. In 2013, she was made an honorary citizen of her hometown of Cluj-Napoca, Romania.

EDUARD HOVY received the Ph.D. degree in computer science from Yale University. He received honorary doctorates from the National University of Distance Education (UNED) in Madrid, in 2013, and the University of Antwerp, in 2015. He is currently a Research Professor with the Language Technologies Institute, Carnegie Mellon University. He is one of the original 17 Fellows of the Association for Computational Linguistics. He has published more than 500 research papers. His research focuses on various topics, including aspects of the computational semantics of human language. He is a Fellow of the AAAI. He serves or has served on the editorial boards of several journals, such as the ACM Transactions on Asian Language Information Processing (TALIP) and Language Resources and Evaluation (LRE).
