Learning To Identify Emotions in Text

Carlo Strapparava
FBK-Irst, Italy
strappa@itc.it

Rada Mihalcea
University of North Texas
rada@cs.unt.edu
ABSTRACT
This paper describes experiments concerned with the automatic analysis of emotions in text. We describe the construction of a large data set annotated for six basic emotions: anger, disgust, fear, joy, sadness and surprise, and we propose and evaluate several knowledge-based and corpus-based methods for the automatic identification of these emotions in text.
General Terms
Algorithms, Experimentation

Keywords
emotion annotation, emotion analysis, sentiment analysis

1. INTRODUCTION

Emotions have been widely studied in psychology and the behavioral sciences, as they are an important element of human nature. They have also attracted the attention of researchers in computer science, especially in the field of human-computer interaction, where studies have been carried out on facial expressions (e.g., [3]) or on the recognition of emotions through a variety of sensors (e.g., [13]).

In computational linguistics, the automatic detection of emotions in texts is becoming increasingly important from an applicative point of view. Consider for example the tasks of opinion mining and market analysis, affective computing, or natural language interfaces such as e-learning environments or educational/edutainment games. The following are examples of applicative scenarios in which affective analysis could make valuable and interesting contributions:

Sentiment Analysis. Text categorization according to affective relevance, opinion exploration for market analysis, etc., are examples of applications of these techniques. While positive/negative valence annotation is an active area in sentiment analysis, we believe that a fine-grained emotion annotation could increase the effectiveness of these applications.

Computer Assisted Creativity. The automated generation of evaluative expressions with a bias on a certain polarity orientation is a key component in automatic personalized advertisement and persuasive communication.

Verbal Expressivity in Human Computer Interaction. Future human-computer interaction is expected to emphasize naturalness and effectiveness, and hence the integration of models of possibly many human cognitive capabilities, including affective analysis and generation. For example, the expression of emotions by synthetic characters (e.g., embodied conversational agents) is now considered a key element for their believability. Affective word selection and understanding is crucial for realizing appropriate and expressive conversations.

This paper describes experiments concerned with the emotion analysis of news headlines. In Section 2, we describe the construction of a data set of news titles annotated for emotions, and we propose a methodology for fine-grained and coarse-grained evaluations. In Section 3, we introduce several algorithms for the automatic classification of news headlines according to a given emotion, ranging from simple heuristics (e.g., directly checking specific affective lexicons) to more refined algorithms (e.g., checking similarity in a latent semantic space in which explicit representations of emotions are built, and exploiting Naive Bayes classifiers trained on mood-labeled blogposts). Section 4 presents the evaluation of the algorithms and a comparison with the systems that participated in the SemEval 2007 task on Affective Text [14]. It is worth noting that the proposed methodologies are either completely unsupervised or, when supervision is used, the training data can be easily collected from online mood-annotated materials.
2. DATA SET

For each headline, the annotators were asked to assign a score for each of the six emotions, on a scale from 0 to 100, where 0 means the emotion is not present in the given headline, and 100 represents maximum emotional load. Unlike previous annotations of sentiment or subjectivity [18, 12], which typically rely on binary 0/1 annotations, we decided to use a finer-grained scale, hence allowing the annotators to select different degrees of emotional load.

The test data set was independently labeled by six annotators. The annotators were instructed to select the appropriate emotions for each headline based on the presence of words or phrases with emotional content, as well as the overall feeling invoked by the headline. Annotation examples were also provided, including examples of headlines bearing two or more emotions, to illustrate the case where several emotions are jointly applicable. Finally, the annotators were encouraged to follow their first intuition, and to use the full range of the annotation scale bars.
2.3 Inter-Annotator Agreement
We conducted inter-tagger agreement studies for each of the six emotions. The agreement evaluations were carried out using the Pearson correlation measure, and are shown in Table 1. To measure the agreement among the six annotators, we first measured the agreement between each annotator and the average of the remaining five annotators, and then averaged over the six resulting agreement figures.

Emotion     Agreement
anger         49.55
disgust       44.51
fear          63.81
joy           59.91
sadness       68.19
surprise      36.07

Table 1: Pearson correlation for inter-annotator agreement
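For illustration, this agreement computation can be sketched in a few lines of Python. The score-matrix layout and the function name below are our own assumptions for exposition, not details taken from the paper.

    import numpy as np
    from scipy.stats import pearsonr

    def annotator_agreement(scores):
        """scores: (n_headlines, n_annotators) array with the 0-100
        scores assigned for a single emotion. Returns the average
        Pearson correlation between each annotator and the mean of
        the remaining annotators."""
        n_annotators = scores.shape[1]
        correlations = []
        for a in range(n_annotators):
            # Average the scores of all annotators except annotator a
            others_mean = np.delete(scores, a, axis=1).mean(axis=1)
            r, _ = pearsonr(scores[:, a], others_mean)
            correlations.append(r)
        return float(np.mean(correlations))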
2.4 Fine-Grained and Coarse-Grained Evaluation
Given a gold-standard data set with emotion annotations, we used both fine-grained and coarse-grained evaluation metrics for the evaluation of systems for automatic emotion annotation. Fine-grained evaluations were conducted using the Pearson measure of correlation between the system scores and the gold-standard scores, averaged over all the headlines in the data set. We also ran coarse-grained evaluations, where each emotion score was mapped to a 0/1 classification (0 = [0,50), 1 = [50,100]). For the coarse-grained evaluations, we calculated precision, recall, and F-measure.
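Both evaluation modes are straightforward to implement. The sketch below assumes, for a single emotion, two parallel lists of gold-standard and system scores on the 0-100 scale; the names are illustrative.

    from scipy.stats import pearsonr

    def fine_grained(system_scores, gold_scores):
        """Fine-grained score: Pearson correlation between the
        system scores and the gold-standard scores."""
        r, _ = pearsonr(system_scores, gold_scores)
        return r

    def coarse_grained(system_scores, gold_scores, threshold=50):
        """Coarse-grained score: map [0,50) -> 0 and [50,100] -> 1,
        then compute precision, recall, and F-measure."""
        sys_bin = [1 if s >= threshold else 0 for s in system_scores]
        gold_bin = [1 if g >= threshold else 0 for g in gold_scores]
        tp = sum(1 for s, g in zip(sys_bin, gold_bin) if s == 1 and g == 1)
        fp = sum(1 for s, g in zip(sys_bin, gold_bin) if s == 1 and g == 0)
        fn = sum(1 for s, g in zip(sys_bin, gold_bin) if s == 0 and g == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1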
3. EMOTION CLASSIFICATION

3.1 Knowledge-Based Classification
We approach the task of emotion recognition by exploiting the use of words in a text, and in particular their co-occurrence with words that have explicit affective meaning. As suggested by Ortony et al. [11], we have to distinguish between words directly referring to emotional states (e.g., fear, cheerful) and those having only an indirect reference that depends on the context (e.g., words that indicate possible emotional causes, such as killer, or emotional responses, such as cry). We call the former direct affective words and the latter indirect affective words [16].

As far as direct affective words are concerned, we follow the classification found in WordNet Affect, an extension of the WordNet database [5] that includes a subset of synsets suitable to represent affective concepts. (WordNet Affect is freely available for research purposes at http://wndomains.itc.it; see [15] for a complete description of the resource.) In particular, one or more affective labels (a-labels) are assigned to a number of WordNet synsets. There are also a-labels for those concepts representing moods, situations eliciting emotions, or emotional responses. Starting with WordNet Affect, we collected six lists of affective words by using the synsets labeled with the six emotions considered in our data set. As a baseline, we then implemented a simple algorithm that checks for the presence of these direct affective words in the headlines, and computes a score that reflects the frequency of the words from this affective lexicon in the text.

Sentiment analysis and the recognition of the semantic orientation of texts is an active research area in the field of natural language processing (e.g., [17, 12, 18, 9]). A crucial aspect is the availability of a mechanism for evaluating the semantic similarity between generic terms and affective lexical concepts. To this end, we implemented a semantic similarity mechanism acquired automatically, in an unsupervised way, from a large corpus of texts, in our case the British National Corpus, a very large corpus (over 100 million words) of modern English, both spoken and written (see http://www.hcu.ox.ac.uk/bnc/). Other, more specific corpora could also be considered to obtain a more domain-oriented similarity.

In particular, we implemented a variation of Latent Semantic Analysis (LSA). LSA yields a vector space model that allows for a homogeneous representation (and hence comparison) of words, word sets, sentences and texts. For representing word sets and texts by means of an LSA vector, we used a variation of the pseudo-document methodology described in [1]. This variation also takes into account a tf-idf weighting schema (see [6] for more details). In practice, each document can be represented in the LSA space by summing up the normalized LSA vectors of all the terms contained in it. Thus, a synset in WordNet (and even the set of all words labeled with a particular emotion) can be represented in the LSA space by applying the pseudo-document technique to all the words contained in the synset.

In the LSA space, an emotion can be represented in at least three ways: (i) the vector of the specific word denoting the emotion (e.g., anger); (ii) the vector representing the synset of the emotion (e.g., {anger, choler, ire}); and (iii) the vector of all the words in the synsets labeled with the emotion. In this paper, we performed experiments with all three representations.

Regardless of how an emotion is represented in the LSA space, we can compute a similarity measure between (generic) terms in an input text and affective categories. For example, in an LSA space built from the BNC, the noun gift is highly related to the emotional categories joy and surprise. In summary, the vectorial representation in LSA allows us to represent, in a uniform way, emotional categories, generic terms and concepts (synsets), and eventually full sentences. See [16] for more details.
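To make the two knowledge-based methods concrete, the following sketch implements the WordNet Affect presence baseline and the LSA pseudo-document similarity. It assumes that an emotion-to-word-list lexicon and a dictionary of LSA term vectors have already been built; the tf-idf weighted variation is omitted, and all names here are illustrative rather than the authors' actual implementation.

    import numpy as np

    def presence_score(headline_tokens, emotion_words):
        """Baseline: score reflecting how frequently the headline
        uses words from the affective lexicon of one emotion."""
        hits = sum(1 for tok in headline_tokens if tok in emotion_words)
        return hits / max(len(headline_tokens), 1)

    def pseudo_document(words, lsa_vectors):
        """Pseudo-document technique: sum the normalized LSA vectors
        of all known terms, then normalize the result."""
        vecs = [lsa_vectors[w] / np.linalg.norm(lsa_vectors[w])
                for w in words if w in lsa_vectors]
        if not vecs:
            return None
        v = np.sum(vecs, axis=0)
        return v / np.linalg.norm(v)

    def lsa_similarity(headline_tokens, emotion_words, lsa_vectors):
        """Cosine similarity between the headline and an emotion,
        where the emotion may be represented by a single word, a
        synset, or all words labeled with that emotion."""
        h = pseudo_document(headline_tokens, lsa_vectors)
        e = pseudo_document(emotion_words, lsa_vectors)
        if h is None or e is None:
            return 0.0
        return float(np.dot(h, e))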
3.2 Corpus-Based Classification
In addition to the experiments based on WordNet Affect, we also conducted corpus-based experiments relying on blog entries from LiveJournal.com. We used a collection of blogposts annotated with moods that were mapped to the six emotions used in the classification. While every blog community practices a different genre of writing, LiveJournal.com blogs seem to recount the goings-on of everyday life more closely than any other blog community. The indication of the mood is optional when posting on LiveJournal; therefore, the mood-annotated posts we are using are likely to reflect the true mood of the blog authors, since the moods were explicitly specified without particular coercion from the interface.

Our corpus consists of 8,761 blogposts, with the distribution over the six emotions shown in Table 2. This corpus is a subset of the corpus used in the experiments reported in [10].

LiveJournal mood    Number of blogposts
angry                       951
disgusted                    72
scared                      637
happy                     4,856
sad                       1,794
surprised                   451

Table 2: Blogposts and mood annotations extracted from LiveJournal

In a pre-processing step, we removed all the SGML tags and kept only the body of the blogposts, which was then passed through a tokenizer. We also kept only blogposts with a length comparable to that of the headlines, i.e., 100-400 characters; the average length of the blogposts in the final corpus is 60 words per entry. Six sample entries are shown in Table 3.

The blogposts were then used to train a Naive Bayes classifier, where for each emotion we used the blogs associated with it as positive examples, and the blogs associated with all the other five emotions as negative examples.
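A minimal sketch of this one-vs-rest setup is given below, using scikit-learn's Multinomial Naive Bayes as a stand-in; the paper does not specify the feature extraction or the toolkit used, so everything beyond the one-vs-rest scheme is an assumption.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    MOODS = ["angry", "disgusted", "scared", "happy", "sad", "surprised"]

    def train_mood_classifiers(posts, moods):
        """posts: blogpost texts; moods: parallel list of mood labels.
        Trains one binary classifier per mood, using posts with that
        mood as positive examples and all other posts as negatives."""
        vectorizer = CountVectorizer(lowercase=True)
        X = vectorizer.fit_transform(posts)
        classifiers = {}
        for mood in MOODS:
            y = [1 if m == mood else 0 for m in moods]
            clf = MultinomialNB()
            clf.fit(X, y)
            classifiers[mood] = clf
        return vectorizer, classifiers

    def score_headline(headline, vectorizer, classifiers):
        """Probability of the positive class for each mood/emotion."""
        x = vectorizer.transform([headline])
        return {mood: float(clf.predict_proba(x)[0][1])
                for mood, clf in classifiers.items()}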
anger      I am so angry. Nicci can't get work off for the Used's show on the 30th, and we were stuck in traffic for almost 3 hours today, preventing us from seeing them. bastards
disgust    It's time to snap out of this. It's time to pull things together. This is ridiculous. I'm going nowhere. I'm doing nothing.
fear       He might have lung cancer. It's just a rumor... but it makes sense. is very depressed and that's just the beginning of things
joy        This week has been the best week I've had since I can't remember when! I have been so hyper all week, it's been awesome!!!
sadness    Oh and a girl from my old school got run over and died the other day which is horrible, especially as it was a very small village school so everybody knew her.
surprise   Small note: French men shake your hand as they say good morning to you. This is a little shocking to us fragile Americans, who are used to waving to each other in greeting.

Table 3: Sample blogposts labeled with moods corresponding to the six emotions

4. EVALUATION

We have implemented five different systems for emotion analysis by using the knowledge-based and corpus-based approaches described above.

1. WN-Affect presence, which is used as a baseline system, and which annotates the emotions in a text simply based on the presence of words from the WordNet Affect lexicon.

2. LSA single word, which calculates the LSA similarity between the given text and each emotion, where an emotion is represented as the vector of the specific word denoting the emotion (e.g., joy).

3. LSA emotion synset, where in addition to the word denoting an emotion, its synonyms from the WordNet synset are also used.

4. LSA all emotion words, which augments the previous set by adding the words in all the synsets labeled with a given emotion, as found in WordNet Affect.

5. NB trained on blogs, which is a Naive Bayes classifier trained on the blog data annotated for emotions.

The five systems were evaluated on the data set of 1,000 newspaper headlines. As mentioned earlier, we conduct both fine-grained and coarse-grained evaluations. Table 4 shows the results obtained by each system for the annotation of the six emotions.

                          Fine           Coarse
                            r      Prec.    Rec.     F1
anger
WN-Affect presence       12.08     33.33     3.33    6.06
LSA single word           8.32      6.28    63.33   11.43
LSA emotion synset       17.80      7.29    86.67   13.45
LSA all emotion words     5.77      6.20    88.33   11.58
NB trained on blogs      19.78     13.68    21.67   16.77
disgust
WN-Affect presence       -1.59      0.00     0.00      -
LSA single word          13.54      2.41    70.59    4.68
LSA emotion synset        7.41      1.53    64.71    3.00
LSA all emotion words     8.25      1.98    94.12    3.87
NB trained on blogs       4.77      0.00     0.00      -
fear
WN-Affect presence       24.86    100.00     1.69    3.33
LSA single word          29.56     12.93    96.61   22.80
LSA emotion synset       18.11     12.44    94.92   22.00
LSA all emotion words    10.28     12.55    86.44   21.91
NB trained on blogs       7.41     16.67     3.39    5.63
joy
WN-Affect presence       10.32     50.00     0.56    1.10
LSA single word           4.92     17.81    47.22   25.88
LSA emotion synset        6.34     19.37    72.22   30.55
LSA all emotion words     7.00     18.60    90.00   30.83
NB trained on blogs      13.81     22.71    59.44   32.87
sadness
WN-Affect presence        8.56     33.33     3.67    6.61
LSA single word           8.13     13.13    55.05   21.20
LSA emotion synset       13.27     14.35    58.71   23.06
LSA all emotion words    10.71     11.69    87.16   20.61
NB trained on blogs      16.01     20.87    22.02   21.43
surprise
WN-Affect presence        3.06     13.04     4.68    6.90
LSA single word           9.71      6.73    67.19   12.23
LSA emotion synset       12.07      7.23    89.06   13.38
LSA all emotion words    12.35      7.62    95.31   14.10
NB trained on blogs       3.08      8.33     1.56    2.63

Table 4: Results obtained by each of the five proposed systems for the annotation of the six emotions

As expected, different systems have different strengths. The system based exclusively on the presence of words from the WordNet Affect lexicon has the highest precision, at the cost of low recall. Instead, the LSA system using all the emotion words has by far the largest recall, although its precision is significantly lower. In terms of performance for individual emotions, the system trained on blogs gives the best results for joy, which correlates with the size of the training data set (joy had the largest number of blogposts). The blogs also provide the best results for anger (which also had a relatively large number of blogposts). For all the other emotions, the best performance is obtained with the LSA models.

We also compare our results with those obtained by three systems participating in the SemEval emotion annotation task [14]: SWAT, UPAR7 and UA. Table 5 shows the results obtained by these systems on the same data set, using the same evaluation metrics. We briefly describe below each of these three systems.

UPAR7 [2] is a rule-based system using a linguistic approach. A first pass through the data uncapitalizes common words in the news title. The system then runs the Stanford syntactic parser on the modified titles, and identifies what is being said about the main subject by exploiting the dependency graph obtained from the parser. Each word is first rated separately for each emotion, and the rating of the main subject is then boosted. The system uses a combination of SentiWordNet [4] and WordNet Affect [15], which were semi-automatically enriched on the basis of the original trial data provided during the SemEval task.

UA [8] uses statistics gathered from three search engines (MyWay, AlltheWeb and Yahoo) to determine the kind and the amount of emotion in each headline. Emotion scores are obtained by using Pointwise Mutual Information (PMI). First, the number of documents obtained from the three Web search engines using a query that contains all the headline words and an emotion (the words occur in an independent proximity across the Web documents) is divided by the number of documents containing only the emotion and the number of documents containing all the headline words. Second, an associative score between each content word and an emotion is estimated and used to weight the final PMI score. The final results are normalized to the 0-100 range.

SWAT [7] is a supervised system using a unigram model trained to annotate emotional content. Synonym expansion on the emotion label words is also performed, using the Roget Thesaurus. In addition to the development data provided by the task organizers, the SWAT team annotated an additional set of 1,000 headlines, which was used for training.

           Fine          Coarse
             r      Prec.    Rec.     F1
anger
SWAT       24.51    12.00     5.00    7.06
UA         23.20    12.74    21.60   16.03
UPAR7      32.33    16.67     1.66    3.02
disgust
SWAT       18.55     0.00     0.00      -
UA         16.21     0.00     0.00      -
UPAR7      12.85     0.00     0.00      -
fear
SWAT       32.52    25.00    14.40   18.27
UA         23.15    16.23    26.27   20.06
UPAR7      44.92    33.33     2.54    4.72
joy
SWAT       26.11    35.41     9.44   14.91
UA          2.35    40.00     2.22    4.21
UPAR7      22.49    54.54     6.66   11.87
sadness
SWAT       38.98    32.50    11.92   17.44
UA         12.28    25.00     0.91    1.76
UPAR7      40.98    48.97    22.02   30.38
surprise
SWAT       11.82    11.86    10.93   11.78
UA          7.75    13.70    16.56   15.00
UPAR7      16.71    12.12     1.25    2.27

Table 5: Results of the systems participating in the SemEval task for emotion annotations

For an overall comparison, we calculated the average over all six emotions for each system. Table 6 shows the overall results obtained by our five systems and by the three SemEval systems. The best results in terms of fine-grained evaluations are obtained by the UPAR7 system, which is perhaps due to the deep syntactic analysis performed by this system. Our systems give, however, the best performance in terms of coarse-grained evaluations, with the WN-Affect presence system providing the best precision, and the LSA all emotion words system leading to the highest recall and F-measure.

                         Fine          Coarse
                           r      Prec.    Rec.     F1
WN-Affect presence        9.54    38.28     1.54    4.00
LSA single word          12.36     9.88    66.72   16.37
LSA emotion synset       12.50     9.20    77.71   13.38
LSA all emotion words     9.06     9.77    90.22   17.57
NB trained on blogs      10.81    12.04    18.01   13.22
SWAT                     25.41    19.46     8.61   11.57
UA                       14.15    17.94    11.26    9.51
UPAR7                    28.38    27.60     5.68    8.71

Table 6: Overall average results obtained by the five proposed systems and by the three SemEval systems

5. CONCLUSIONS

In this paper, we described experiments for the automatic annotation of emotions in text. Through comparative evaluations of several knowledge-based and corpus-based methods, carried out on a large data set of 1,000 headlines, we tried to identify the methods that work best for the annotation of emotions. In future work, we plan to explore the lexical structure of emotions, and to integrate deeper semantic processing of the text into the knowledge-based and corpus-based classification methods.

Acknowledgments

Carlo Strapparava was partially supported by the HUMAINE Network of Excellence.

6. REFERENCES
[1] M. Berry. Large-scale sparse singular value computations. International Journal of Supercomputer Applications, 6(1):13-49, 1992.
[2] F. Chaumartin. UPAR7: A knowledge-based system for headline sentiment tagging. In Proceedings of SemEval-2007, Prague, Czech Republic, June 2007.
[3] P. Ekman. Biological and cultural contributions to body and facial movement. In J. Blacking, editor, Anthropology of the Body, pages 34-84. Academic Press, London, 1977.
[4] A. Esuli and F. Sebastiani. SentiWordNet: A publicly available lexical resource for opinion mining. In Proceedings of the 5th Conference on Language Resources and Evaluation, Genova, IT, 2006.
[5] C. Fellbaum. WordNet. An Electronic Lexical Database. The MIT Press, 1998.
[6] A. Gliozzo and C. Strapparava. Domains kernels for text categorization. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), Ann Arbor, June 2005.
[7] P. Katz, M. Singleton, and R. Wicentowski. SWAT-MP: The SemEval-2007 systems for task 5 and task 14. In Proceedings of SemEval-2007, Prague, Czech Republic, June 2007.
[8] Z. Kozareva, B. Navarro, S. Vazquez, and A. Montoyo. UA-ZBSA: A headline emotion classification through web information. In Proceedings of SemEval-2007, Prague, Czech Republic, June 2007.
[9] R. Mihalcea and H. Liu. A corpus-based approach to finding happiness. In Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, Stanford, March 2006.
[10] G. Mishne. Experiments with mood classification in blog posts. In Proceedings of the 1st Workshop on Stylistic Analysis of Text for Information Access (Style 2005), Brazil, 2005.
[11] A. Ortony, G. L. Clore, and M. A. Foss. The psychological foundations of the affective lexicon. Journal of Personality and Social Psychology, 53:751-766, 1987.
[12] B. Pang and L. Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics, Barcelona, Spain, July 2004.
[13] R. Picard. Affective Computing. MIT Press, Cambridge, MA, USA, 1997.
[14] C. Strapparava and R. Mihalcea. SemEval-2007 task 14: Affective Text. In Proceedings of SemEval-2007, Prague, Czech Republic, June 2007.
[15] C. Strapparava and A. Valitutti. WordNet-Affect: An affective extension of WordNet. In Proceedings of the 4th International Conference on Language Resources and Evaluation, Lisbon, May 2004.
[16] C. Strapparava, A. Valitutti, and O. Stock. The affective weight of lexicon. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, Genoa, Italy, May 2006.
[17] P. Turney and M. Littman. Measuring praise and criticism: Inference of semantic orientation from association. ACM Transactions on Information Systems (TOIS), 21(4):315-346, October 2003.
[18] J. Wiebe, T. Wilson, and C. Cardie. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2-3), 2005.