Abstract—Emotion detection is one of the most challenging problems in the automated understanding of language. Understanding human emotions from text alone, without facial expressions, is considered a complicated task. Therefore, building a machine that understands the context of sentences and differentiates between emotions has recently motivated the machine learning community. We propose a system to detect emotions using deep learning approaches. The main input to the system is a combination of GloVe word embeddings, BERT embeddings and a set of psycholinguistic features (e.g. from the AffectiveTweets Weka package). The proposed system (EmoDet2) combines a fully connected neural network architecture and a BiLSTM neural network to obtain performance results that show substantial improvements (F1-score 0.748) over the baseline model provided by the SemEval-2019 Task 3 organizers (F1-score 0.58).

Index Terms—Neural Network, BERT, Deep Learning, Machine Learning, Emotions, Sentiment.

I. INTRODUCTION

The challenge of defining emotions has motivated psychology researchers for a long period of time. In Lindsley et al. [1], the authors defined emotion as a complex behavioral phenomenon involving many levels of neural and chemical integration. The list of basic emotions varies in content and length [2]. Ekman [3] identified six basic emotions: anger, disgust, fear, happiness, sadness, and surprise, while Izard [4] included anger, contempt, disgust, sadness, enjoyment-joy, fear, interest-excitement, surprise-astonishment (and possibly guilt, shame, and shyness) in his list of basic emotions.

In the past decades, we have seen a rapid growth of user-generated content on different social media platforms, such as Facebook and Twitter, covering a variety of topics on a daily basis. This content carries people's sentiments and emotions, expressing happiness, sadness, and anger. Using social media data, we can analyze and track public opinion to help predict attitudes towards certain products or political issues, or even prevent depressed people from committing suicide [5] [6]. However, detecting emotions from text alone, without combining it with facial expressions, is difficult, and few corpora exist for emotion-labeled text. To understand emotion, we need a way to learn from labeled data and predict labels for new, unlabeled data using machine learning [7] and deep learning [8] techniques. Several machine learning approaches have been used to detect and predict emotions and sentiments [9]; however, it has been observed that deep learning approaches can achieve better performance than traditional machine learning approaches [10] [11].

Our system can determine the emotion in English textual dialogue and classify it into four categories (Happy, Sad, Angry and Other). The main inputs to the system are utterances along with three turns of context. We extract features using a combination of GloVe word embeddings, BERT embeddings and a set of psycholinguistic features (e.g. from the AffectiveTweets Weka package). The proposed system combines a fully connected neural network architecture and a BiLSTM neural network. We have used the data provided by the organizers of Task 3, the EmoContext competition, of SemEval-2019. The performance results of the model show substantial improvements (F1-score 0.748) over the baseline model provided by the SemEval-2019 Task 3 organizers (F1-score 0.58).

The organization of this paper is as follows. First, we go through the related work in Section II. After that, Section III provides more details about the model architecture. In Section IV, we discuss the testing and evaluation of the model. Section V presents the conclusion of this research.

II. RELATED WORK

Emotions can be defined as complex states of feeling that come from physical and psychological changes in our lives and depend on the mood and the personality of the speaker. Several researchers have investigated how to define emotions, including Ekman [3], who identified the six basic emotions anger, disgust, fear, happiness, sadness and surprise. Machine learning researchers have built algorithms to understand emotion. Chatterjee et al. [12] worked on an LSTM model and fed it with two types of word embedding: semantic word embedding using GloVe [13] and sentiment word embedding using Sentiment Specific Word Embedding (SSWE). In [10], the researchers used the data provided by the shared task (Task 3: EmoContext) of the SemEval-2019 workshop to build the EmoDet model, which ensembles a fully connected neural network architecture and an LSTM neural network. The SEDAT model [11] detects sentiments and emotions in Arabic tweets, using word and document embeddings and a set of semantic features in a CNN-LSTM and a fully connected neural network architecture.
In EmoNet [14], the researchers worked on building emotional chatbots to achieve a better understanding of other humans. They used huge labeled datasets and built a system to classify 24 fine-grained emotions. Their system consisted of a Gated Recurrent Neural Network (GRNN), which is considered simpler and faster than the LSTM model. In Illendula and Sheth [15], the researchers studied the effect of emojis and images. They used a BiLSTM model with an attention mechanism and fed it with fastText embeddings, EmojiNet, and features extracted from the images. Rosenthal et al. [16] used a multi-view ensemble approach to detect emotions. They trained models on feature spaces such as bag-of-words and word2vec, using traditional machine learning approaches, namely Logistic Regression and Support Vector Machines.

III. OUR APPROACH

Our system, EmoDet2, can determine the emotion and sentiment in English textual dialogue and classify it into four categories (Happy, Sad, Angry and Other). In this section, we describe the overall system design.

A. Collecting and Pre-Processing Data

For our approach, we have used the public data from the shared task (Task 3: EmoContext) of SemEval-2019 [17]. The task provides training, development and testing datasets to be used by all participants. The number of training, development and testing samples for each emotion is shown in Table I. We can see that the distribution between the different classes is not balanced.

TABLE I
TRAINING, DEVELOPMENT AND TESTING DATASETS

          Train Data   Dev Data   Test Data
Anger         5506        150         298
Happy         4243        142         284
Sad           5463        125         250
Other        14948       2338        4677
Total        30160       2755        5509

The training corpus contains 5 columns:
• ID – a unique number to identify each training sample.
• Turn 1 – The first turn in the three-turn conversation, written by User 1.
• Turn 2 – The second turn, which is a reply to the first turn in the conversation, written by User 2.
• Turn 3 – The third turn, which is a reply to the second turn in the conversation, written by User 1.
• Label – The human-judged emotion label of Turn 3 based on the conversation for the given training sample. It is always one of the four values – 'happy', 'sad', 'angry', and 'other'.

We did not apply standard pre-processing steps such as stemming and stopword removal. We converted all of the emojis in the text to their textual forms and used the Ekphrasis package [18] to handle spelling mistakes and to insert spaces around special characters. It is worth mentioning that we did not apply this pre-processing to the data fed to the BERT model, because BERT can extract more features from the raw data.

B. Extracting Feature Vectors

We have explored different encoding techniques to convert the text into vector representations, such as word2vec trained on Google News, GloVe, and fastText embeddings. We have also explored different features to represent each turn in the dataset as well as the concatenated turns. Our approach extracts feature vectors from the text as follows.

First, we extract a 300-dimensional vector using the pretrained word2vec embedding model trained on Google News [19]. We also extract GloVe embeddings, which consist of a 300-dimensional vector. BERT embeddings are used to obtain a 173-dimensional vector using the transformers package [20].

Second, we extract semantic features by converting the whole conversation into a 145-dimensional vector using three vectors from the AffectiveTweets Weka package [21], as follows: 43 features are extracted using the TweetToLexiconFeatureVectorAttribute, which calculates attributes for sentences using a variety of lexical resources; a two-dimensional vector is obtained using the SentimentStrength features from the same package; and a 100-dimensional vector is obtained by vectorizing the sentence into embedding attributes.
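To make the data handling and pre-processing described in Section III-A concrete, the following Python sketch loads the conversations, converts emojis to text and applies an Ekphrasis-based normalization. It is a minimal sketch under stated assumptions rather than the exact pipeline used in our experiments: the file name train.txt, the tab-separated layout and the particular Ekphrasis configuration are illustrative choices.

import pandas as pd
import emoji
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer

# Assumed layout: tab-separated columns id, turn1, turn2, turn3, label.
train = pd.read_csv("train.txt", sep="\t")

# One possible Ekphrasis configuration: a social-media tokenizer that splits
# off special characters and spell correction for elongated words.
text_processor = TextPreProcessor(
    normalize=["url", "email", "user"],
    fix_html=True,
    segmenter="twitter",
    corrector="twitter",
    unpack_contractions=True,
    spell_correct_elong=True,
    tokenizer=SocialTokenizer(lowercase=True).tokenize,
)

def preprocess(turn: str) -> str:
    # Convert emojis to their textual form, e.g. a smiley becomes ":smiling_face:".
    turn = emoji.demojize(turn)
    # Ekphrasis returns a list of tokens; re-join them with single spaces.
    return " ".join(text_processor.pre_process_doc(turn))

for col in ["turn1", "turn2", "turn3"]:
    train[col + "_clean"] = train[col].astype(str).apply(preprocess)
# The raw turns are kept as well, since the BERT model is fed the raw text
# without this pre-processing.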
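The embedding features of Section III-B can be sketched as follows. This is an illustrative approximation under stated assumptions, not the exact extractor: the word2vec binary path is a placeholder, GloVe vectors are applied analogously, the BERT vector shown is the standard 768-dimensional mean-pooled output of bert-base (standing in for the 173-dimensional BERT vector described above), and the 145 AffectiveTweets dimensions are produced separately with the Weka filters named above and concatenated afterwards.

import numpy as np
import torch
from gensim.models import KeyedVectors
from transformers import BertModel, BertTokenizer

# Assumed local path to the Google News word2vec binary [19].
w2v = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def average_word2vec(tokens):
    # Average the 300-dimensional vectors of the in-vocabulary tokens;
    # a GloVe matrix would be used in exactly the same way.
    vecs = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
bert.eval()

def bert_sentence_vector(text):
    # Mean-pool the last hidden layer over the tokens of the raw (unprocessed) text.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state.mean(dim=1).squeeze(0).numpy()

def conversation_features(tokens, raw_text, affective_feats):
    # Concatenate the averaged word embeddings, the BERT sentence vector and
    # the 145 AffectiveTweets features exported from Weka (not reproduced here).
    return np.concatenate(
        [average_word2vec(tokens), bert_sentence_vector(raw_text), affective_feats])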
C. Network Architecture

EmoDet2 has been built using ensembling methods with different sub-models, as shown in Fig. 1: EmoDense, EmoDet-BiLSTM-submodel1, EmoDet-BiLSTM-submodel2, EmoDet-BERT-BiLSTM (cased), and EmoDet-BERT-BiLSTM (uncased). More details on these sub-models are given in the following subsections.

1) EmoDense: This sub-model uses a feed-forward neural network that consists of four dense hidden layers with 512, 256, 128 and 64 neurons, respectively. The activation function for all layers is ReLU [22], and we add 0.2 dropout between the layers. The output layer consists of four sigmoid neurons that predict the class of the conversation. For optimization, we use the Adam optimizer [23] with a 0.0001 learning rate and Mean Squared Error as the loss function. Moreover, we save the output prediction weights to predict the testing dataset. The fit function uses 40 epochs, a batch size of 16, and a validation split of 0.33. The best epoch on the validation set is chosen and applied to the test data (more details in Fig. 2).

2) EmoDet-BiLSTM: The standard Recurrent Neural Network (RNN) [24] is distinguished from a feed-forward network by its memory. A special kind of RNN is the Long Short-Term Memory (LSTM), which is composed of a memory cell, an input gate, an output gate and a forget gate. The Bidirectional Long Short-Term Memory (BiLSTM) [25] is an advanced form of the LSTM in which the network is fed the data once from beginning to end and once from end to beginning, which lets it learn more information from the data. We have applied two types of models using BiLSTM.

For EmoDet-BiLSTM-submodel1, we have one input, the encoded sentence, which goes into a lookup table of 300-dimensional pretrained GloVe vectors that represent the words. After that, it goes into a BiLSTM that consists of two layers, each with 256 neurons and followed by 0.2 dropout to avoid overfitting. We then take the output of the BiLSTM, flatten it, and feed it into a fully connected neural network with four dense hidden layers of 512, 256, 128 and 64 neurons. The activation function for each layer is ReLU, with 0.2 dropout between the layers. The output layer consists of 4 sigmoid neurons that predict the class of the conversation. For optimization, we again use the Adam optimizer with a 0.0001 learning rate and Mean Squared Error as the loss function. We save the output prediction weights to predict the testing datasets. The fit function uses 100 epochs, a batch size of 32, and a validation split of 0.33. EmoDet-BiLSTM-submodel1 is shown in Fig. 3.

For EmoDet-BiLSTM-submodel2, there are two inputs: the encoded sentence, obtained from the 300-dimensional pretrained GloVe embeddings, and the features extracted with the AffectiveTweets Weka package for Turn 3, which consist of 445 dimensions. The encoded sentence goes into two BiLSTM layers, each with 256 nodes and 0.2 dropout to avoid overfitting, and the output is then flattened. The second input goes into a fully connected neural network with four dense hidden layers of 512, 256, 128 and 64 neurons; the activation function for each layer is ReLU, with 0.2 dropout between them. After that, the output of the second input is concatenated with the output from the first input and fed into a fully connected neural network with four dense layers of 512, 256, 128 and 64 neurons, again with ReLU activations and 0.2 dropout. The output layer consists of 4 sigmoid neurons that predict the class of the conversation. For optimization, we use the Adam optimizer with a 0.0001 learning rate and Mean Squared Error as the loss function. The fit function uses 100 epochs, a batch size of 32, and a validation split of 0.33. See Fig. 4 for EmoDet-BiLSTM-submodel2.
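A minimal Keras sketch of the EmoDense sub-model of Section III-C1 is shown below, assuming an input feature vector of size feat_dim (the concatenated embedding and psycholinguistic features) and one-hot labels over the four classes; feat_dim and the checkpoint file name are illustrative assumptions.

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

feat_dim = 918      # assumed, e.g. 300 (word2vec) + 300 (GloVe) + 173 (BERT) + 145 (AffectiveTweets)
num_classes = 4     # happy, sad, angry, other

# Four dense hidden layers (512/256/128/64, ReLU) with 0.2 dropout in between
# and four sigmoid output neurons, as described for EmoDense.
model = Sequential([
    Dense(512, activation="relu", input_shape=(feat_dim,)),
    Dropout(0.2),
    Dense(256, activation="relu"),
    Dropout(0.2),
    Dense(128, activation="relu"),
    Dropout(0.2),
    Dense(64, activation="relu"),
    Dropout(0.2),
    Dense(num_classes, activation="sigmoid"),
])

# Adam with a 0.0001 learning rate and Mean Squared Error as the loss.
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse", metrics=["accuracy"])

# Keep the weights of the best epoch on the validation split so that they can
# be reloaded later to predict the test set.
checkpoint = ModelCheckpoint("emodense_best.h5", monitor="val_loss",
                             save_best_only=True)

# X_train: (n_samples, feat_dim) feature matrix; y_train: one-hot labels (n_samples, 4).
# model.fit(X_train, y_train, epochs=40, batch_size=16, validation_split=0.33,
#           callbacks=[checkpoint])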
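The two-input EmoDet-BiLSTM-submodel2 can be sketched with the Keras functional API as follows; EmoDet-BiLSTM-submodel1 corresponds to the sequence branch alone. This is an illustrative reconstruction of the description above rather than the original code: vocab_size, max_len and the (here empty) GloVe embedding matrix are assumptions.

import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import (Bidirectional, Concatenate, Dense, Dropout,
                                     Embedding, Flatten, Input, LSTM)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

vocab_size, emb_dim, max_len = 20000, 300, 50        # assumed values
embedding_matrix = np.zeros((vocab_size, emb_dim))   # filled from GloVe in practice

# First input: the encoded sentence, looked up in a frozen GloVe table and
# passed through two BiLSTM layers of 256 units with 0.2 dropout, then flattened.
seq_in = Input(shape=(max_len,), name="tokens")
x = Embedding(vocab_size, emb_dim,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False)(seq_in)
x = Bidirectional(LSTM(256, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(256, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Flatten()(x)

# Second input: the 445-dimensional Turn-3 feature vector, passed through a
# 512/256/128/64 dense stack with ReLU activations and 0.2 dropout.
feat_in = Input(shape=(445,), name="turn3_features")
y = feat_in
for units in (512, 256, 128, 64):
    y = Dense(units, activation="relu")(y)
    y = Dropout(0.2)(y)

# Concatenate both branches, apply another 512/256/128/64 dense stack and end
# in four sigmoid neurons that predict the class of the conversation.
z = Concatenate()([x, y])
for units in (512, 256, 128, 64):
    z = Dense(units, activation="relu")(z)
    z = Dropout(0.2)(z)
out = Dense(4, activation="sigmoid")(z)

model = Model(inputs=[seq_in, feat_in], outputs=out)
model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse", metrics=["accuracy"])
# model.fit([token_ids, turn3_feats], labels, epochs=100, batch_size=32,
#           validation_split=0.33)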
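EmoDet2 combines the predictions of its sub-models. The exact combination rule is not spelled out above, so the short sketch below assumes simple averaging of the sigmoid outputs of the trained sub-models, which is one common way to realize such an ensemble; ensemble_predict and submodels_with_inputs are hypothetical names.

import numpy as np

CLASSES = ["happy", "sad", "angry", "other"]

def ensemble_predict(submodels_with_inputs):
    # submodels_with_inputs: list of (keras_model, model_specific_inputs) pairs,
    # since each sub-model (EmoDense, BiLSTM, BERT-BiLSTM) takes its own inputs.
    probs = [model.predict(inputs) for model, inputs in submodels_with_inputs]
    avg = np.mean(probs, axis=0)                 # shape: (n_samples, 4)
    return [CLASSES[i] for i in avg.argmax(axis=1)]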
IV. TESTING AND EVALUATION

For fine-tuning the parameters, we have noticed that the best parameters for our model are as follows: dropout = 0.4, the text for the Word2Vec and Glove Wiki embeddings should be in lower case, and Glove Common Crawl should use the cased form. The best accuracy was obtained using the Glove Common Crawl pre-trained embeddings; we think that cased text can give extra features to the model so that it can understand the context better.

Epoch, Batch   Optimizer   Accuracy   Precision   Recall   F1
(40, 16)       SGD         0.8041     0.1885      0.1731   0.1805
(40, 32)       SGD         0.8488     0.3333      0.0012   0.0024
(40, 16)       Adam        0.8116     0.3923      0.5841   0.4693
(40, 32)       Adam        0.7949     0.3642      0.5950   0.4518

Epoch, Batch   Optimizer   Accuracy   Precision   Recall   F1
(40, 16)       SGD         0.8461     0.2222      0.0144   0.0271
(40, 32)       SGD         0.8441     nan         0.0000   0.0000
(40, 16)       Adam        0.8455     0.2708      0.0156   0.0295
(40, 32)       Adam        0.8490     nan         0.0000   0.0000

We have tested the model using Turn 1 only and Turn 2 only (shown in Tables V and VI). The hyperparameters are as follows: dropout = 0.4 and the text in lower case.

TABLE V
EVALUATING THE MODEL USING WORD2VEC (TURN 1 ONLY)

Epoch, Batch   Optimizer   Accuracy   Precision   Recall   F1
(40, 16)       SGD         0.8461     0.1053      0.0048   0.0092
(40, 32)       SGD         0.8435     0.0877      0.0060   0.0112
(40, 16)       Adam        0.7954     0.2809      0.2788   0.2799
(40, 32)       Adam        0.7873     0.2718      0.2957   0.2832

Next, we have tested the model using Turn 3 only, extracting the word encodings with three pretrained embedding models, including Glove Common Crawl, which is trained on normal text, and Word2Vec; the results are shown in Tables VII and VIII.

TABLE VII
EVALUATING THE MODEL USING WORD2VEC (TURN 3 ONLY)

Epoch, Batch   Optimizer   Accuracy   Precision   Recall   F1
(40, 16)       SGD         0.8726     0.5441      0.7260   0.6220
(40, 32)       SGD         0.8539     0.4658      0.5901   0.5207
(40, 16)       Adam        0.8880     0.5950      0.7380   0.6588
(40, 32)       Adam        0.8546     0.4521      0.5156   0.4818

TABLE VIII
EVALUATING THE MODEL USING GLOVE COMMON CRAWL (TURN 3 ONLY)

Epoch, Batch   Optimizer   Accuracy   Precision   Recall   F1
(40, 16)       SGD         0.8711     0.5377      0.7536   0.6276
(40, 32)       SGD         0.8533     0.4891      0.7260   0.5844
(40, 16)       Adam        0.8934     0.6099      0.7440   0.6703
(40, 32)       Adam        0.8840     0.5755      0.7692   0.6584

We have added the AffectiveTweets features, which consist of 145 dimensions, to the third-turn features from the Glove Common Crawl embeddings; these experiments are shown in Table IX. More experiments for EmoDet-BiLSTM-submodel1 are shown in Table X. We have used the Glove Common Crawl embeddings to encode the text, with the text in cased form. We have used the EmoDet-BiLSTM-submodel2 model to obtain a better result than the previous model (Table XI).
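The accuracy, precision, recall and F1 values reported in the tables can be computed from model predictions with scikit-learn. The sketch below assumes micro-averaging of precision, recall and F1 over the three emotion classes, which is the usual EmoContext convention and may differ from the exact averaging used in the tables above.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

EMOTIONS = ["happy", "sad", "angry"]   # the 'other' class is excluded from the averaging

def report(y_true, y_pred):
    # y_true / y_pred: lists of string labels from {happy, sad, angry, other}.
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=EMOTIONS, average="micro", zero_division=0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1}

# Example (hypothetical variables): report(dev_labels, ensemble_predict(submodels_with_inputs))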
TABLE IX
EVALUATING THE MODEL USING GLOVE COMMON CRAWL WITH AFFECTIVETWEETS (TURN 3 ONLY)

Text type   Accuracy   Precision   Recall   F1
Cased       0.8989     0.6196      0.7752   0.6887
Lower       0.8840     0.5755      0.7692   0.6584

TABLE X
TESTING BiLSTM SUB-MODEL 1

Text type   Accuracy   Precision   Recall   F1
Cased       0.8889     0.5836      0.7885   0.6708
Lower       0.8737     0.5440      0.7656   0.6360

V. CONCLUSION

In this paper, we have presented our system EmoDet2, which uses deep learning architectures to detect the existence of emotions in text. The performance of the system (F1-score 0.75) surpasses the performance of the baseline model (F1-score 0.58), indicating that our approach is promising. In this system, we have used word embedding models together with feature vectors extracted using the AffectiveTweets package. We have also extracted contextual word embeddings from the BERT base model. These vectors feed different deep neural network architectures, feed-forward and LSTM models, to obtain the predictions. We use the SemEval-2019 Task 3 datasets as input for our system and show that EmoDet2 has a high proficiency in detecting emotions in conversational text, surpassing the F1-score of the baseline model provided by the SemEval-2019 Task 3 organizers.

REFERENCES