Emotion Detection Analysis Documentation
1. ABSTRACT
2. INTRODUCTION
2.1 Motivation
2.2 Problem Statement
2.3 Domain Introduction
3. REVIEW OF LITERATURE
3.1 Limitations of Prior Art
3.2 Related Work
3.3 Social Media
4. DATA CHARACTERISTICS
5. DATA ACQUISITION
6. METHODOLOGY
6.1 Machine Learning Techniques
6.2 The Classification Problem
6.3 Naïve Bayes Classifier
6.4 Algorithm
7. SYSTEM DESIGN
7.1 Fetching Tweets
7.2 Processing and Storing Tweets
7.3 Extraction of Emotion from Tweets
8. IMPLEMENTATION
8.1 A Bag of Words
8.2 Application of Sentiment Analysis
8.3 Bayesian Theorem Applied to Document
9. EVALUATION METRICS
10. RESULT AND ANALYSIS
11. CONCLUSION
12. FUTURE WORK
13. REFERENCES
1. ABSTRACT
This project addresses the problem of emotion analysis on Twitter, that is, classifying tweets according to the sentiment expressed in them: sad, angry, love, happy, guilt, etc. Twitter is an online micro-blogging and social-networking platform which allows users to write short status updates of at most 140 characters. It is a rapidly expanding service with over 350 million registered users, of which 250 million are active users and half of them log on to Twitter daily, generating nearly 500 million tweets per day. Given this volume of usage, we hope to obtain a reflection of public sentiment by analysing the sentiments expressed in the tweets. Analysing public sentiment is important for many applications, such as firms trying to gauge the market response to their products, predicting political elections and predicting socioeconomic phenomena like the stock exchange. The aim of this project is to develop a functional classifier for accurate and automatic sentiment classification of an unknown tweet stream.
Emotions play an important role in social media and are used to stimulate the cognitive processes behind strategy making. Emotions represent another form of language, universally spoken and understood. The identification and classification of emotions has been a research area since Charles Darwin's age. Textual information can be collected from many sources, such as books, newspapers, web pages and e-mail messages. Nowadays the Internet is the most popular communication medium and is also rich in emotion. Emotions can be extracted from textual input by analyzing punctuation, emotional keywords, syntactic structure and semantic information with the help of natural language processing techniques. A large amount of textual information is available on the web, and it is interesting to extract emotions from it for different goals, such as business ones.
For example, for luxury goods, emotional aspects such as brand, uniqueness and prestige matter more in purchasing decisions than rational aspects such as technical features, functionality or price. In this case the customer is happy to buy a product even at a high price. Emotional marketing aims to stimulate emotions in customers in order to tie them to the brand and so increase sales of the product or service. Nowadays it is not the product itself that is sold, since each category offers a wide choice; the focus is on the relationship that the consumer establishes with the brand and on the emotions that the product communicates.
Twitter is a large and rapidly growing micro-blogging and social networking website where people express their opinions in a short and simple manner. It is common practice for merchants selling products on the Web to ask their customers to review the products, and a growing number of customer reviews of different products appears on Twitter. Electronic products are a common domain for such reviews, and among them mobile phones are a major domain in which customers have a keen interest. The sheer number of reviews makes it difficult for a potential customer to read them all in order to decide whether to buy a product.
An emotion is a particular feeling that characterizes a state of mind, such as joy, anger, love or fear. Automatic emotion detection from text has
attracted growing attention due to its potentially useful applications. For example, psychologists can better assist their patients by analyzing their session transcripts for subtle emotions; reliable emotion detection can help develop powerful human-computer interaction devices; and deep emotional analysis of public data such as tweets and blogs could reveal interesting insights into human nature and behavior. Emotion is expressed as joy, sadness, anger, surprise, hate, fear and so on. According to W. Gerrod Parrott, human emotions are classified through an emotion hierarchy into six classes at the primary level: Love, Joy, Anger, Sadness, Fear and Surprise. The Naïve Bayes classifier is an efficient classifier in terms of accuracy. This paper aims to automatically detect emotions from tweets using a Naïve Bayes classifier. Tweets are extracted and stored in a MySQL database, which stores the data in CSV format. Section 3.2 details the related work; Sections 6 and 6.3 explain the modular design and the Naïve Bayes classifier, respectively.
Given a tweet, classify whether it expresses sadness, joy, anger, disgust, guilt or love. The tweets are stored in CSV format, and those files are given as input. The nltk, numpy, nltk.corpus, nltk.stem, textblob.classifiers, Kfold and optionparser libraries and modules must be installed to perform emotion detection analysis using Python.
This project of analyzing sentiments of tweets comes under the domain of “Pattern Classification” and “Data Mining”. Both of these terms are very closely related and intertwined, and they can be formally defined as the process of discovering “useful” patterns in large sets of data, either automatically (unsupervised) or semi-automatically (supervised). The project would rely heavily on techniques of “Natural Language Processing” in extracting significant patterns and features from the large data set of tweets and on “Machine Learning” techniques for accurately classifying individual unlabelled data samples (tweets) according to whichever pattern model best describes them.
The features that can be used for modeling patterns and classification can be divided into two
main groups: formal language based and informal blogging based. Language based features are
those that deal with formal linguistics and include prior sentiment polarity of individual words
and phrases, and parts of speech tagging of the sentence. Prior sentiment polarity means that some
words and phrases have a natural innate tendency for expressing particular and specific
sentiments in general. For example the word “excellent” has a strong positive connotation while
the word “evil” possesses a strong negative connotation. So whenever a word with positive
connotation is used in a sentence, chances are that the entire sentence would be expressing a
positive sentiment. Parts-of-speech tagging, on the other hand, is a syntactical approach to the problem. It means automatically identifying which part of speech each individual word of a sentence belongs to: noun, pronoun, adverb, adjective, verb, interjection, etc. Patterns can be extracted by analyzing the frequency distribution of these parts of speech (either individually or collectively with some other part of speech) in a particular class of labeled tweets.
Twitter-based features are more informal and relate to how people express themselves on online social platforms and compress their sentiments into the limited space of 140 characters offered by Twitter. They include Twitter hashtags, retweets, word capitalization, word lengthening, question marks, the presence of URLs in tweets, exclamation marks, internet emoticons and internet shorthand/slang.
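As an illustration of the informal features listed above, the following minimal sketch counts a few of them in a raw tweet. The function name `twitter_features`, the feature names and the regular expressions are assumptions made for this example; they are not taken from the project's code.

```python
import re

def twitter_features(tweet):
    """Count a few informal, Twitter-specific cues in a raw tweet (illustrative only)."""
    words = tweet.split()
    return {
        # Hashtags such as "#win"
        "hashtags": len(re.findall(r"#\w+", tweet)),
        # Retweets conventionally start with "RT"
        "is_retweet": tweet.startswith("RT "),
        # Fully capitalized words of at least two letters, e.g. "SOOOO"
        "all_caps_words": sum(
            1 for w in words if len(w) > 1 and w.isalpha() and w.isupper()
        ),
        # Word lengthening: the same character repeated three or more times
        "lengthened_runs": len(re.findall(r"(\w)\1{2,}", tweet)),
        "question_marks": tweet.count("?"),
        "exclamation_marks": tweet.count("!"),
        "has_url": bool(re.search(r"https?://\S+", tweet)),
    }

features = twitter_features("I am SOOOO happy!!! #win http://t.co/x")
```

Each count could then be used as one numerical feature of the tweet alongside the language-based features.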
3. REVIEW OF LITERATURE
3.1 Limitations of Prior Art:
S. Radha Krishna et al. studied different spectral features, such as MFCC, pitch chroma, skewness and centroid, for emotion recognition. The emotions considered in this study are Fear, Anger, Neutral and Happy. The system is evaluated for various combinations of spectral features, and the combination of MFCC and skewness gave better recognition performance than the other combinations. The experiments were conducted and evaluated using Gaussian Mixture Models (GMMs). The database used in this study is the Telugu emotion speech corpus (IIIT-KGP).
Taner Danisman et al. [2] studied the automatic classification of anger, disgust, fear, joy and sad emotions in text. The study was conducted on the ISEAR (International Survey on Emotion Antecedents and Reactions) dataset. For the classification they used a Vector Space Model (VSM) with a total of 801 news headlines provided by the “Affective Task” of the SemEval 2007 workshop, which focuses on the classification of emotions and valences in text. They compared their results with ConceptNet and with powerful text-based classifiers including Naive Bayes and Support Vector Machines (SVM). Their experiments showed that VSM classification performs better than ConceptNet, Naive Bayes and SVM-based classifiers for emotion detection in sentences. They obtained an overall F-measure of 32.22% and a kappa value of 0.18 for five-class emotional text classification on the SemEval dataset, which is better than Naive Bayes (28.52%) and SVM (28.6%).
Swati D. Bhutekar et al. [3] present a methodology to extract emotion from text in real time and add the expression to the textual content during speech synthesis. Their paper also covers the creation of the corpus, the emotion recognition module, etc. In text analysis, all emotional keywords and emotion modification words are manually defined. The test was carried out on a set of textual sentences, with preliminary rules written for 34 different emotions. These rules are used in an automated procedure that assigns emotional state values to words. These values are then used by the speech synthesizer to add emotions to the speech and the input sentence. A pitch detection algorithm has been implemented for pitch recognition. The system is language dependent.
Changqin Quan et al. [4] analyse sentence emotion based on emotion words using Ren-CECps (a Chinese emotion corpus). Several classification methods (including C4.5 decision tree, SVM, NaiveBayes, ZeroR and DecisionTable) were compared. A supervised machine learning method (the polynomial kernel method) is then proposed to recognize the eight basic emotions (Expect, Joy, Love, Surprise, Anxiety, Sorrow, Angry and Hate). Using Ren-CECps, they obtain the emotion lexicons for the eight basic emotions. The polynomial kernel (PK) method is used to compute the similarities between sentences and the eight emotion lexicons. The experiential knowledge derived from Ren-CECps is then used to recognize whether the eight emotion categories are present in a sentence. The experiments showed promising results.
Ali Houjeij, Layla Hamieh et al. [5] designed a system that adopts a novel approach to emotional classification from human dialogue based on text and speech context. Their main objective was to boost the accuracy of speech emotional classification by accounting for the features extracted from the spoken text. The proposed system concatenates text and speech features and feeds them as one input to the classifier. The work builds on past research on music mood classification based on the combination of lyrics and audio features. The innovation in their approach lies in the specific application of text and speech fusion to emotion classification and in the choice of features. Furthermore, in the absence of benchmark data, a dataset of movie quotes was developed for testing emotional classification and for future benchmarking. The comparison of the results obtained in each case shows that the hybrid text-speech approach achieves better accuracy than speech or text mining alone.
Amira F. El Gohary et al. [6] are concerned with the automatic detection of emotions in Arabic text. Their construction is based on a moderate-sized Arabic emotion lexicon used to annotate Arabic children's stories for the six basic emotions: Joy, Fear, Sadness, Anger, Disgust and Surprise. Their approach achieves 65% accuracy for emotion detection in Arabic text.
Shadi Shaheen, Wassim El-Hajj et al. [7] proposed a framework for emotion classification in English sentences where emotions are treated as generalized concepts extracted from the sentences. They generated an intermediate emotional data representation of a given input sentence based on its syntactic and semantic structure. They then generalized this representation using various ontologies such as WordNet and ConceptNet, which resulted in an emotion seed called an emotion recognition rule (ERR). Finally, a suite of classifiers is used to compare the generated ERR with a set of reference ERRs extracted from a training set in a similar fashion. The classifiers used are k-nearest neighbors (KNN) with a handcrafted similarity measure, Pointwise Mutual Information (PMI), and PMI with Information Retrieval (PMI-IR). When applied to different datasets, the proposed approach significantly outperformed the existing state-of-the-art machine learning and rule-based classifiers, with an average F-score of 84%.
Social media can be referred to as the “group of internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content”, as defined by Kaplan and Haenlein. In recent years, in addition to the leaders of the World Wide Web such as Facebook, Google+, LinkedIn and Twitter, new services have appeared for different groups of users: social networks for students, networks for specific groups of professionals, communities of ethnic minorities, and even a special network for all the world’s drinkers. This extends the scope to very different kinds of research, from consumer preferences to psychological characteristics. In early 2015 Facebook retained first place among social platforms, and Twitter was also in the top ten. According to the same study by Simon Kemp [41], more than 2 billion people worldwide are active users of social networks and blogs. On Twitter, the number of monthly active users was 284 million in 2015; in 2016 it exceeded 320 million.
4. DATA CHARACTERISTICS
Twitter is a social networking and microblogging service that lets its users post real-time messages, called tweets. Tweets have many unique characteristics, which pose new challenges and shape the way sentiment analysis is carried out on them compared with other domains.
Writing technique: Incorrect spellings and cyber slang occur more often in tweets than in other domains. As the messages are quick and short, people use acronyms, misspell words, and use emoticons and other characters that convey special meanings.
Availability: The amount of data available is immense. More people tweet in the public domain than on Facebook (as Facebook has many privacy settings), thus making data more readily available. The Twitter API facilitates collection of tweets for training.
Topics: Twitter users post messages about a range of topics, unlike other sites which are designed for a specific topic. This differs from a large fraction of past research, which focused on specific domains such as movie reviews.
Real time: Blogs are updated at longer intervals, as blog posts are longer in nature and writing them takes time. Tweets, on the other hand, are limited to 140 characters and are updated very often. This gives a more real-time feel and represents the first reactions to events.
We now describe some basic terminology related to Twitter.
Emoticons: These are pictorial representations of facial expressions using punctuation and letters. The purpose of emoticons is to express the user’s mood.
Target: Twitter users make use of the “@” symbol to refer to other users on Twitter. Users are automatically alerted if they have been mentioned in this fashion.
Hash tags: Users use hash tags (“#”) to mark topics. They are used by Twitter users to make their tweets visible to a greater audience.
Special symbols: “RT” is used to indicate that a tweet is a repeat of someone else’s earlier tweet.
5. DATA ACQUISITION
Data in the form of raw tweets is acquired by using the Python library “tweetstream”, which provides a package for the simple Twitter streaming API. This API allows two modes of accessing tweets: SampleStream and FilterStream. SampleStream simply delivers a small, random sample of all the tweets streaming in real time. FilterStream delivers tweets which match certain criteria. It can filter the delivered tweets according to three criteria:
A programmer can specify any single one of these filtering criteria or a combination of them. For our purpose we have no such restriction and will thus stick to the SampleStream mode.
Since we wanted to increase the generality of our data, we acquired it in portions at different
points of time instead of acquiring all of it at one go. If we used the latter approach then the
generality of the tweets might have been compromised since a significant portion of the tweets
would be referring to some certain trending topic and would thus have more or less of the same
general mood or sentiment. This phenomenon was observed when we were going through our sample of acquired tweets. For example, the sample acquired near Christmas and New Year’s had a significant portion of tweets referring to these joyous events and was thus of a generally positive sentiment. Sampling our data in portions at different points in time tries to minimize this problem. Hence, we acquired data at four different points in time: 17th of December 2011, 29th of December 2011, 19th of January 2012 and 8th of February 2012.
A tweet acquired by this method has a lot of raw information in it which we may or may not find useful for our particular application. It comes in the form of the Python “dictionary” data type with various key-value pairs. Some of these key-value pairs are listed below:
• User ID
• Presence of hashtags
• Whether it is a re-tweet
• Language under which the twitter user has registered their account
Since this is a lot of information, we keep only the information that we need and discard the rest. For our particular application we iterate through all the tweets in our sample and save the actual text content of the tweets in a separate file, provided that the language of the Twitter user’s account is specified to be English. The original text content of the tweet is given under the dictionary key “text” and the language of the user’s account is given under “lang”.
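A minimal sketch of this selection step is shown below. It assumes each tweet is a Python dictionary with the keys “text” and “lang” described above, and it assumes that English accounts are marked with the value "en"; the function name `english_texts` is hypothetical.

```python
def english_texts(tweets):
    """Return the text of every tweet whose account language is English.

    Assumes each tweet is a dict with "text" and "lang" keys, and that
    English is encoded as "en" (an assumption for this sketch).
    """
    return [
        tweet["text"]
        for tweet in tweets
        if tweet.get("lang") == "en" and "text" in tweet
    ]

sample = [
    {"text": "hello world", "lang": "en", "id": 1},
    {"text": "hola mundo", "lang": "es", "id": 2},
]
texts = english_texts(sample)
```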
Since human labelling is an expensive process, we further filter the tweets to be labelled so that we have the greatest amount of variation in tweets without loss of generality. The filtering criteria applied are stated below:
• Remove very short tweets (tweets with fewer than 20 characters)
• Remove non-English tweets (by comparing the words of each tweet with a list of 2,000 common English words; tweets with less than 15% of their content matching are discarded)
• Remove similar tweets (by comparing every tweet with every other tweet; a tweet with more than 90% of its content matching some other tweet is discarded)
After this filtering, roughly 30% of tweets remain for human labelling on average per sample, which made a total of 10,173 tweets to be labelled.
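The three filtering criteria above can be sketched as follows. The report does not say how “content matching” is measured, so this example assumes simple word-level overlap; the helper names and the tiny word list are hypothetical stand-ins (the report uses a list of 2,000 common English words).

```python
def english_ratio(tweet, common_words):
    """Fraction of a tweet's words that appear in the common-word list."""
    tokens = [t.strip(".,!?;:\"'()").lower() for t in tweet.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in common_words for t in tokens) / len(tokens)

def overlap(a, b):
    """Share of the smaller tweet's distinct words also found in the other tweet."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))

def filter_tweets(tweets, common_words, min_len=20, min_english=0.15, max_overlap=0.90):
    kept = []
    for tweet in tweets:
        if len(tweet) < min_len:                                # very short tweet
            continue
        if english_ratio(tweet, common_words) < min_english:    # probably not English
            continue
        if any(overlap(tweet, k) > max_overlap for k in kept):  # near-duplicate
            continue
        kept.append(tweet)
    return kept

common = {"i", "am", "so", "happy", "today", "the", "is"}
raw = [
    "too short",
    "I am so happy today, the sun is out",
    "I am so happy today, the sun is out",
    "je suis tres content aujourd'hui vraiment",
]
kept = filter_tweets(raw, common)
```

Comparing every tweet with every tweet kept so far is quadratic in the number of tweets, which is acceptable for samples of the size described here but would need a faster scheme for much larger collections.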
6. METHODOLOGY
The existing approaches to sentiment classification fall into two large categories:
The rule-based approach uses a set of rules, based on analysis of the object domain, that can explain and predict the polarity of the text (or of a single sentence). Rules tend to become increasingly complicated in order to increase the accuracy of results.
The disadvantages are the large amount of time and expertise needed to design such rules. Using this approach for the analysis of microblogging may be difficult due to noisy data.
The performance of rule-based approaches tends to be comparably robust across domains and texts, but it is typically inferior to the performance of machine learning methods of polarity classification. Rule-based methods mostly rely on lexicons that list words and their associated sentiment scores. The sentiment scores of words in a text are typically combined (e.g., summed or averaged) in accordance with predefined rules and assumptions in order to obtain a text’s overall sentiment score, which can be used as an indicator for the text’s polarity. A lexicon-based approach uses an affective lexicon to derive the polarity of the examined text. Affective lexicons contain lists of words either divided by certain sentiment classes (e.g. positive, negative, neutral) or providing a single list of words each associated with a numerical value representing its polarity. The following dictionaries (affective lexicons) can be used for the English language:
WordNet: WordNet is one of the largest lexical resources for the English language and is extensively used in scientific research. Multiple words can form a synset, a set of words that may be used synonymously within a word sense. Additionally, synonymous relations are defined between synsets, leading to a taxonomy structure. These relations may be used for generalization over objects (e.g., cheese and bread are a type of food). WordNet 3.0 lists 117,798 nouns in 82,115 synsets. WordNet also contains words of other parts of speech, such as adjectives and verbs; however, they have much lower coverage and their taxonomies are relatively flat.
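The lexicon-based scoring described earlier in this section (combining per-word sentiment scores, for example by summing them) can be sketched as follows. The two-word lexicon and the function names are hypothetical and serve only as an illustration; a real affective lexicon would be far larger.

```python
def lexicon_score(text, lexicon):
    """Sum the scores of the words in `text` that appear in `lexicon`."""
    return sum(lexicon.get(token, 0.0) for token in text.lower().split())

def polarity(text, lexicon):
    """Map the summed score to a polarity label."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Toy lexicon using the two example words from this report.
toy_lexicon = {"excellent": 1.0, "evil": -1.0}
label = polarity("an excellent day", toy_lexicon)
```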
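def emotion_word_set(emotions):
    # Build a mapping from each emotion label to its list of affect words.
    # readfile() and affect_wordlist() are helpers assumed to be defined
    # elsewhere in the project.
    word_set = {}
    for e in emotions:
        representative_words = readfile(e)
        wordlist = affect_wordlist(representative_words)
        word_set[e] = wordlist
    return word_set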
attribute values for the dataset are known. This data is called training data. The training data consists of a set of training examples. To evaluate the performance of the learned model after training is finished, one applies it to a different set of data, known as test data. Sometimes, part of the whole dataset is also used as a validation dataset for model selection, i.e., to select from all the models obtained from the training process the one with the best performance on the validation dataset. Machine learning methods can be divided into classification, regression and clustering. When the class attribute is discrete, the task is called classification; when the class attribute is continuous, it is regression. In clustering, the data is often unlabeled. Thus, the label for each instance is not known to the clustering algorithm.
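A minimal sketch of splitting a labelled dataset into training, validation and test sets is given below. The 70/15/15 proportions, the fixed random seed and the function name `split_dataset` are assumptions chosen for illustration; the report does not state which proportions were used.

```python
import random

def split_dataset(examples, train=0.7, validation=0.15, seed=0):
    """Shuffle the examples and split them into train, validation and test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed for a reproducible split
    n_train = round(len(items) * train)
    n_val = round(len(items) * validation)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # everything that is left
    return train_set, val_set, test_set

train_set, val_set, test_set = split_dataset(range(100))
```

The model is fitted on the training set, alternative models are compared on the validation set, and the test set is used only once for the final evaluation.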
Since the task in this thesis is sentiment analysis in social media, classification is used. Classification methods such as decision trees, the naive Bayes classifier and others are considered in Section 6.1. Many researchers have used a combination of the two approaches: machine learning and the approach based on affective lexicons and rules. The reason is that the hybrid approach shows the best results in practice. In this thesis, machine learning classification methods are considered, since these methods show good results in the sentiment analysis of social media (blog sites and reviews). The approaches described above that are based on rules and affective lexicons are beyond the scope of this thesis.
need to learn a classifier or classification function $\Gamma(D) = \gamma$ that maps documents to classes:
$$\gamma : X \to C.$$
For the sentiment analysis of social media, the set C consists of three classes
In the Naive Bayes classifier, $Y$ represents the class variable and $X$ represents the instance features. Let $X$ be $(x_1, x_2, \ldots, x_m)$, where $x_i$ represents the value of feature $i$. Let $(y_1, y_2, \ldots, y_n)$ represent the values the class attribute $Y$ can take. Then the class attribute value of instance $X$ can be calculated by measuring
$$\arg\max_{y_i} P(y_i \mid X). \tag{2.2}$$
Note that $P(X)$ is constant and independent of $y_i$, so we can ignore the denominator of Equation 2.3 when maximizing Equation 2.2. The Naive Bayes classifier also assumes conditional independence to make the calculations easier; that is, given the class attribute value, the feature attributes become conditionally independent. This assumption, though unrealistic, performs well in practice and greatly simplifies calculation.
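The argument above can be written out as a short derivation. The first line is Bayes' theorem, which is presumably what the text refers to as Equation 2.3 (that equation does not appear in this section, so the correspondence is an assumption). The second line is the conditional independence assumption, and the third is the resulting decision rule once the constant denominator $P(X)$ is dropped.

```latex
\begin{align*}
P(y_i \mid X) &= \frac{P(X \mid y_i)\, P(y_i)}{P(X)} \\
P(X \mid y_i) &= \prod_{j=1}^{m} P(x_j \mid y_i) \\
\hat{y} &= \arg\max_{y_i} \; P(y_i) \prod_{j=1}^{m} P(x_j \mid y_i)
\end{align*}
```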
The Naive Bayes algorithm is also called the probabilistic method. The Naïve Bayes algorithm and binary keywords have been used together to produce a single-dimensional degree of sentiment embedded in tweets from the Twitter network. There are two different ways we can set up the Naive Bayes classifier: the multinomial model and the multivariate Bernoulli model. In this thesis the multinomial Naive Bayes model is considered.
1. Consider a training data set D consisting of documents which belong to different classes, say class A and class B.
2. The prior probability of each class is calculated as follows:
P(A) = number of objects of class A / total number of objects.
P(B) = number of objects of class B / total number of objects.
3. Now calculate the total word frequency n_i of each class:
n_a = the total word frequency of class A.
n_b = the total word frequency of class B.
…………………………………………
…………………………………………
6. Now a new document M, consisting of the words W = (word1, word2, …, wordn), is classified by calculating the probability of each class given W:
a) Find P(A | W) ∝ P(A) * P(word1 | class A) * P(word2 | class A) * … * P(wordn | class A).
b) Find P(B | W) ∝ P(B) * P(word1 | class B) * P(word2 | class B) * … * P(wordn | class B).
7. After calculating the probability for both classes A and B, the new document M is assigned to the class with the higher probability.
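The steps above can be sketched as a small multinomial Naive Bayes classifier. Steps 4 and 5 are elided in this report, so the word-probability estimate used here (Laplace add-one smoothing) is an assumption that fills that place, not necessarily the author's method. Working with log-probabilities instead of raw products is also a choice made for this sketch, to avoid numerical underflow on long documents. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(documents):
    """documents: list of (list_of_words, class_label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for words, label in documents:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_docs.values())
    priors = {c: n / total_docs for c, n in class_docs.items()}      # step 2
    totals = {c: sum(word_counts[c].values()) for c in class_docs}   # step 3
    return priors, word_counts, totals, vocab

def classify_nb(words, model):
    priors, word_counts, totals, vocab = model
    best_label, best_score = None, -math.inf
    for c, prior in priors.items():
        score = math.log(prior)
        for w in words:
            # Assumed estimate: Laplace (add-one) smoothing over the vocabulary.
            p = (word_counts[c][w] + 1) / (totals[c] + len(vocab))
            score += math.log(p)
        if score > best_score:          # step 7: keep the most probable class
            best_label, best_score = c, score
    return best_label

docs = [
    ("happy joy love".split(), "A"),
    ("joy fun".split(), "A"),
    ("sad angry hate".split(), "B"),
]
model = train_nb(docs)
prediction = classify_nb(["hate", "sad"], model)
```

Smoothing keeps a word that was never seen in a class from forcing that class's probability to zero, which would otherwise happen in the product of step 6.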
[Figure: diagram with the labels Learning Algorithm, Bayes Model, Bayes Testing, Test Set, Predicted Results]
7. SYSTEM DESIGN
The proposed methodology can be divided into different modules. The first module fetches the tweets. The second module extracts and stores the tweets in a MySQL database, and the third module extracts emotions from the tweets using a Naïve Bayes classifier.
In this module, tweets are fetched from the Twitter website. The package twitter is imported for this purpose.
There are two sub-modules within this module:
1 Twitter OAuth
Here the application must authenticate with Twitter in order to access Twitter data. For this, an account has to be created on Twitter. Twitter provides developers with four parameters with which they and their applications can be authenticated: consumer key, consumer secret, access token and access secret. After authentication, data can be fetched from Twitter.
2 Extraction of tweets
Tweets are extracted from the site using the searchTwitter() function. The function takes the following inputs:
The search for tweets is done using the hashtag of the respective phone name. The search can also be done using other Twitter handles.
In this module, the processed tweets are stored in the database. MongoDB [11] is used as the database.
This module can be viewed as two submodules.
1) Processing of tweets:
Here the tweets are subjected to a set of procedures which make them suitable for sentiment analysis in the next module. The following steps are involved in the processing of tweets:
• Removal of re-tweets.
• Removal of handles like # etc.
• Removal of punctuation.
• Removal of digits.
• Removal of web addresses like https:// etc.
• Removal of blank spaces.
• Conversion of the entire text to lower case.
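The processing steps listed above can be sketched with regular expressions as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and the sketch assumes that a handle or hashtag is removed together with the word attached to it and that re-tweets are recognised by a leading "RT".

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps to one tweet (illustrative sketch)."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # web addresses
    text = re.sub(r"[@#]\w+", " ", text)                # handles and hashtags (assumed: whole token)
    text = re.sub(r"\d+", " ", text)                    # digits
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation
    text = re.sub(r"\s+", " ", text).strip()            # blank spaces
    return text.lower()                                 # lower case

def preprocess(tweets):
    """Drop re-tweets, then clean the remaining tweets."""
    return [clean_tweet(t) for t in tweets if not t.startswith("RT ")]

cleaned = preprocess([
    "RT @a: hello",
    "Loving my #Phone 2day!!! https://t.co/abc  @bob",
])
```

Web addresses are removed first so that the later punctuation and digit steps do not leave fragments of a URL behind.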
2) Storing tweets
The processed tweets are then converted into a dataframe and inserted into a MySQL collection.
Tokenization:
It is a method that divides a document into small parts called tokens. These tokens may be words, numbers or punctuation marks.
Stop words:
These are common words that are ignored, which reduces the size of the dataset and the number of words (tokens). In Python we use the Natural Language Toolkit (NLTK), which includes lists of stop words in 16 different languages.
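The two steps can be sketched without external data as follows. The regular-expression tokenizer and the very small stop-word set are stand-ins chosen so that the example is self-contained; they are assumptions for illustration, whereas the project itself relies on NLTK's tokenizer and stop-word lists.

```python
import re

# Tiny illustrative stop-word set (assumption); NLTK's English list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "am", "are", "to", "of", "and", "it", "i"}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("It is going to rain today!")
content_tokens = remove_stop_words(tokens)
```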
The results show, for each tweet, the probability that the tweet is positive or negative.
• URLs and user references (identified by the tokens “http” and “@”) are removed if we are interested only in analyzing the text of the tweet.
import nltk

def removal(sentences):
    # Drop every token that contains one of the unwanted characters,
    # then rebuild each sentence from the tokens that remain.
    characters = ["á", "\xc3", "\xa1", "\n", ",", "."]
    sentence_list = []
    for count, sen in enumerate(sentences, start=1):
        # Progress/debug output.
        print(count)
        print(sen)
        print(type(sen))
        s = nltk.word_tokenize(sen)
        new = ' '.join(i for i in s if not any(e in i for e in characters))
        sentence_list.append(new)
    return sentence_list
• Lowercase Conversion:
A tweet may be normalized by converting it to lowercase, which makes its comparison with an English dictionary easier.
• Stemming:
It is the text-normalizing process of reducing a derived word to its root or stem [28]. For example, a stemmer would reduce the words “stemmer”, “stemmed” and “stemming” to the root word “stem”. The advantage of stemming is that it makes comparison between words simpler, as we do not need to deal with complex grammatical transformations of the word. In our case we applied the Porter stemming algorithm to both the tweets and the dictionary whenever a comparison was needed. The process of stemming is as follows:
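import nltk
from nltk.stem import PorterStemmer

def stemming(sentences):
    # Each element of `sentences` is expected to be a sequence of word tokens.
    sentence_list = []  # one nltk.Text object per sentence
    sen_string = []     # each stemmed sentence as a single string
    sen_token = []      # each stemmed sentence as a list of tokens
    stemmer = PorterStemmer()
    for i, sen in enumerate(sentences):
        print(i, end=" ")  # progress indicator
        st = ""
        for word in sen:
            word_l = word.lower()
            # Keep only words of at least three characters.
            if len(word_l) >= 3:
                st += stemmer.stem(word_l) + " "
        sen_string.append(st)
        w_set = nltk.word_tokenize(st)
        sen_token.append(w_set)
        w_text = nltk.Text(w_set)
        sentence_list.append(w_text)
    return sentence_list, sen_string, sen_token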
• Parts-of-Speech Tagging:
POS-Tagging is the process of assigning a tag to each word in the sentence as to which
grammatical part of speech that word belongs to, i.e. noun, verb, adjective, adverb, coordinating
conjunction etc.
8.1 Bag-of-words:
A bag-of-words is a representation of text that describes the occurrence of words within a
document. The occurrences are encoded as numerical features, which makes it a way of extracting
features from text for use in modelling, for example with machine learning algorithms. The
approach is simple and flexible and can be used for extracting features from documents. The
complexity lies in two places: designing the vocabulary of known words, and scoring the presence
of known words. Suppose there are two classes, a positive class and a negative class. Each class
has its own bag of words: the positive class contains positive words (slow, fine, good,
fantastic) and the negative class contains negative words (hate, terrible, heavy). Given an input
text/sentence, we count the frequency of each word in the document, and these counts determine
whether the text/sentence belongs to the positive class or the negative class.
We then maintain a table with two columns, word and count, which records the frequency of each
word in the document.
Word        Count
It          1
Is          1
So          2
Hot         1
Today       1
I           2
Am          2
Not         1
Outside     1
Watch       1
The         1
Season      1
Premiere    1
Once the table is complete, we know which words occur frequently and which occur rarely. For
this reason we do not want to consider every distinct word that appears across the documents; we
consider only a fixed set of words. According to the table above, there is a total of 14 words,
but keeping only the 10 most frequent words reduces the number of comparisons.
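The counting and vocabulary-pruning steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the function names `bag_of_words` and `top_k_vocabulary`, the regular-expression tokeniser and the example sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lower-case the text, split it into word tokens and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def top_k_vocabulary(counts, k):
    """Keep only the k most frequent words as the fixed vocabulary."""
    return [word for word, _ in counts.most_common(k)]

# Hypothetical sentence, not the one behind the table in the report.
sentence = "It is so hot today so I am not going outside"
counts = bag_of_words(sentence)
vocab = top_k_vocabulary(counts, 3)
```

Here `counts` plays the role of the word/count table, and `vocab` is the reduced list of words against which later comparisons are made.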
The Naïve Bayes classifier is a supervised classification technique that assigns a text/sentence
to a particular class. It is a probabilistic algorithm: it combines the probabilities of the
words in the text/sentence under each class, and the class with the highest probability is taken
as the output.
We train the classification algorithm on the training set to obtain a trained classifier. Using
this trained classifier we can classify a new document.
P(b | a) = [P(a | b) * P(b)] / P(a)
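Applied to a document, the rule above is usually written with the class in place of b and the words of the text in place of a. The following is a sketch of that standard form, not a formula taken from this project: the symbols c (class) and w_1, ..., w_n (words of the tweet) are introduced here for illustration, and the product relies on the naïve assumption that the words are independent of each other given the class.

```latex
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\,\prod_{i=1}^{n} P(w_i \mid c)}{P(w_1, \dots, w_n)}
\qquad\Rightarrow\qquad
\hat{c} = \arg\max_{c \in \{\text{positive},\,\text{negative}\}}
  P(c) \prod_{i=1}^{n} P(w_i \mid c)
```

The denominator is the same for every class, so it can be ignored when only the most probable class is needed.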
Example: consider a tweet or a sentence such as “It is going to rain today”. We apply the naïve
Bayes classifier to decide whether the sentence is positive or negative, using the bag of words
described above, which contains positive and negative words together with their frequency
counts. Each word in the sentence is compared with the positive and negative words in the bag of
words. A probability is calculated for the positive words and for the negative words, and the
class with the higher probability is chosen: if the positive words have the higher probability,
the tweet is considered positive, and vice versa.
Calculating the probabilities for the above sentence gives an overall probability for the
positive words.
In the same way, we obtain an overall probability for the negative words.
Comparing the two probabilities, the class with the higher probability is chosen. In this way we
can say whether a new input tweet or sentence is positive or negative.
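The procedure described above (count words per class, compute a probability for each class, pick the larger one) might be sketched as follows. This is an illustrative sketch under stated assumptions rather than the project's implementation: the training tweets are invented, the function names are hypothetical, and add-one (Laplace) smoothing and log-probabilities are added here so that unseen words do not produce a zero probability.

```python
import math
from collections import Counter

def train(labelled_docs):
    """labelled_docs: list of (tokens, label) pairs.
    Returns class priors, per-class word counts and the vocabulary."""
    class_docs = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in labelled_docs:
        class_docs[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    total_docs = sum(class_docs.values())
    priors = {c: n / total_docs for c, n in class_docs.items()}
    return priors, word_counts, vocab

def log_scores(tokens, priors, word_counts, vocab):
    """Unnormalised log-posterior of each class for the given tokens."""
    scores = {}
    for c, prior in priors.items():
        total_words = sum(word_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((word_counts[c][t] + 1) / (total_words + len(vocab)))
        scores[c] = score
    return scores

def classify(tokens, priors, word_counts, vocab):
    """Return the class with the highest score."""
    scores = log_scores(tokens, priors, word_counts, vocab)
    return max(scores, key=scores.get)

# Hypothetical training data built from the example word lists.
training = [
    (["good", "fine", "fantastic"], "positive"),
    (["fine", "good", "day"], "positive"),
    (["hate", "terrible"], "negative"),
    (["terrible", "heavy", "rain"], "negative"),
]
model = train(training)
label_a = classify(["good", "day"], *model)
label_b = classify(["terrible", "rain"], *model)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.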
9. EVALUATION METRICS:
The table depicts the precision, recall and accuracy corresponding to the emotions sad, joy
and surprise, extracted from the latest 500 tweets.
The table below depicts the total number of tweets for each emotion.
The graph below represents the total number of tweets posted for each emotion.
[Figure: number of tweets per emotion (Joy, Fear, Anger, Sadness, Disgust, Shame), with max, avg and min series; vertical axis from 0 to 14]
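For reference, the precision, recall and accuracy mentioned in this section can be computed from true and predicted labels as sketched below. The labels are invented for illustration and are not the project's results; the function names are likewise hypothetical.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one emotion class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def accuracy(true_labels, predicted_labels):
    """Fraction of tweets whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)

# Hypothetical labels for six tweets.
true_labels = ["joy", "joy", "sad", "sad", "surprise", "joy"]
predicted = ["joy", "sad", "sad", "sad", "surprise", "joy"]

joy_precision, joy_recall = precision_recall(true_labels, predicted, "joy")
sad_precision, sad_recall = precision_recall(true_labels, predicted, "sad")
overall_accuracy = accuracy(true_labels, predicted)
```

Precision answers "of the tweets labelled with this emotion, how many were right", while recall answers "of the tweets that truly carry this emotion, how many were found".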
In addition to the above, we impose a condition when reporting the results of polarity
classification (which differentiates between the positive and negative classes): only tweets
labelled subjective are used to calculate these results. In the final classification approach,
however, this condition is removed, and both objectivity and polarity classification are applied
to all tweets regardless of whether they are labelled objective or subjective. If we compare
these results to those provided by Wilson et al. (displayed in Table 1 and Table 2 of this
report), we see that the accuracy of the neutral class falls from 82.1% to 73% if we use our
classification instead of theirs. For all other classes, however, we report significantly better
results. Although the results presented by Wilson et al. are not from Twitter data, they are for
phrase-level sentiment analysis, which is very close in concept to Twitter sentiment analysis.
Next we compare our results with those presented by Go et al. The results presented in that
paper are as follows:
the best foundation for classifiers. To find the best data representation, all datasets were
represented as vector models: the Bag-of-Words model (unigrams) and the Bag-of-N-grams model
(bigrams and trigrams). The Bag-of-Words (unigram) representation of the datasets showed the
best results for all methods and improved the overall accuracy of the machine learning
techniques more than the Bag-of-N-grams model (bigrams and trigrams). The primary research
question concerns comparing machine learning models for sentiment classification of the datasets
collected from Twitter and then analysing the outcome. Several machine learning methods were
used during the experimentation session, including Naive Bayes. All the methods were compared in
terms of accuracy. As it turned out, all the machine learning models applied to these datasets
gave good performance results. Conclusion: The main point for further work is to use recurrent
neural networks, particularly the LSTM (Long Short-Term Memory) model, and deep learning models
for experimentation. The resulting vectors of the text data representation could be used, if
necessary, as part of deep learning models. For representing text as vector models, other
techniques could also be used that take into account the frequency with which a term appears in
the collection of tweets. Examples are the TF model (Term Frequency, the frequency of the term)
and the TF-IDF model (Term Frequency - Inverse Document Frequency).
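To make the representations named above concrete, the sketch below builds n-grams from a tokenised tweet and computes TF-IDF weights for a small collection. It is only an illustration: the tweets are invented, the function names are hypothetical, and the weighting shown (term count divided by document length, multiplied by the logarithm of the number of documents over the document frequency) is one common variant of TF-IDF among several.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list, e.g. n=2 for bigrams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dictionary per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenised tweets.
tweets = [
    ["i", "love", "rain"],
    ["i", "hate", "rain"],
    ["i", "love", "sun"],
]
weights = tf_idf(tweets)
bigrams = ngrams(tweets[0], 2)
```

A term that appears in every tweet (here "i") receives weight zero, while a term confined to a single tweet (here "hate") receives the largest weight, which is the behaviour that distinguishes TF-IDF from raw term frequency.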
(1) The collected sample data is limited. The streaming data may be large enough, but less than
5% of it has geographical information available, so the number of tweets that finally flowed
into our system is very small. A method to predict or deduce the location of a tweet from the
tweet's information and the user's information should be found in the future.
(2) The parallel Naive Bayes MapReduce implementation in this thesis can be applied not only to
text classification or sentiment analysis. It can also be extended to other areas such as image
processing or gene analysis.
(3) We have arrived at a performance prediction model related to the size of the input data and
the number of nodes in a cluster, and we also have a certain understanding of the relationship
between the performance, the global counters and the hardware configuration. For the latter, we
still need more data and experiments to fill in the details, such as how many global counters
should be used for a given use case, or in what situations upgrading the hardware is worth the
improvement it brings.
(4) When the MapReduce framework deals with algorithms from the data mining area that have a
high time complexity, resource consumption increases, since the exchange of data causes a lot of
I/O transfer and network communication during the iterations. Therefore, how to reduce and
optimize the data exchange in the MapReduce framework is also an issue that needs to be solved.
13. REFERENCES
Steven Bird, Ewan Klein and Edward Loper. Natural Language Processing with Python.
Taner Danisman and Adil Alpkocak. Feeler: Emotion Classification of Text Using Vector Space
Model.
Albert Bifet and Eibe Frank. Sentiment Knowledge Discovery in Twitter Streaming Data.
Discovery Science, Lecture Notes in Computer Science, 2010, Volume 6332/2010.
Alec Go, Richa Bhayani and Lei Huang. Twitter Sentiment Classification using Distant
Supervision. Project Technical Report, Stanford University, 2009.
Alexander Pak and Patrick Paroubek. Twitter as a Corpus for Sentiment Analysis and Opinion
Mining. In Proceedings of international conference on Language Resources and Evaluation
(LREC), 2010.
Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner and Isabell M. Welpe. Predicting
Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In Proceedings of
AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
Bo Pang, Lillian Lee and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification
using Machine Learning Techniques. In Proceedings of the Conference on Empirical Methods in
Natural Language Processing (EMNLP), 2002.
Chenhao Tan, Lillian Lee, Jie Tang, Long Jiang, Ming Zhou and Ping Li. User-Level Sentiment
Analysis Incorporating Social Networks. In Proceedings of ACM Special Interest Group on
Knowledge Discovery and Data Mining (SIGKDD), 2011.