A Seminar Report
on
Data Mining and Sentiment Analysis
by
Shraddha Shekhar
Under the guidance of
Prof. N. A. Mhetre
UNIVERSITY OF PUNE
2019-2020
Sinhgad Technical Education Society,
Department of Computer Engineering
Sinhgad College of Engineering, Pune-41
Date:
CERTIFICATE
This is to certify that Ms. Shraddha Shekhar has successfully completed her seminar work titled "Data
Mining and Sentiment Analysis" at the Department of Computer Engineering, SCOE, Pune, in partial
fulfillment of the Bachelor's Degree in Computer Engineering, Savitribai Phule Pune University, in
Semester II, Academic Year 2017-2018.
Dr. S.D.Lokhande
Principal
Sinhgad College of Engineering
Abstract
As a huge amount of data is generated every day and much of it is in unstructured format, Text
Mining is one of the fastest growing technologies for extracting information from this unstructured data.
Text Mining is simplified with the help of NLP (natural language processing). This report consists of a
study of text mining and its implementation using NLP with the help of nltk, the text toolkit provided
for Python. It also presents a study of Sentiment Analysis. Over the course of time, a large number of
reviews are created on the web about a product, individual, or place. Sentiment Analysis is a research area
which comprehends and extracts the opinion from a given review; the analysis process incorporates
natural language processing (NLP), computational linguistics, text analytics, and classification of the
polarity of the opinion. Numerous algorithms exist in the field of sentiment analysis to tackle this task.
Acknowledgements
I would like to express my sincere gratitude towards Prof. N. A. Mhetre for being such a
humble and helpful guide throughout. Under her guidance, I was easily able to solve all
my difficulties and complete my seminar.
I would also like to thank our respected H.O.D. (Computer Dept.) Prof. M. P. Wankhede
and Principal Dr. S. D. Lokhande for their generous support and for helping me out in every
way possible.
Last but not the least, I would like to thank all my friends and my family for being by my
side, and for their honest opinions and contributions.
Signature
Shraddha Shekhar
(Roll no-305109)
(PRN No-71813890C)
List of Tables

List of Figures

Acronyms
BI Business Intelligence
NLP Natural Language Processing
NLTK Natural Language Toolkit
SA Sentiment Analysis
Contents
Page No.
Certificate
Abstract i
Acknowledgement ii
List of Tables iii
List of Figures iv
Acronyms v
Chapter-1 Introduction------------------------------------------------- 1-3
1.1 Motivation 1
1.2 Timeline/Evolution 2
1.3 Organization of the report 3
Chapter-2 Literature Review----------------------------------------- 4-5
2.1 Fundamentals 4
2.2 Related Work 5
2.3 Survey Conclusion 5
Chapter-3 Methodology----------------------------------------------- 6-24
3.1 Text Mining 6
3.2 NLP (Natural Language Processing) 9
3.3 Applying Machine Learning 15
3.4 Applications of Text Mining 17
3.5 Sentiment Analysis 18
3.6 Sentiment Classification 21
3.7 Applications of Sentiment Analysis 23
3.8 Example (Twitter Sentiment Analysis) 24
Chapter-4 Discussion of Results------------------------------------- 25
Chapter-5 Conclusion and Future work--------------------------- 27-28
5.1 Conclusion 27
5.2 Future Work (as suggested in the referenced paper) 27
References-------------------------------------------------- 29
Chapter 1
Introduction
Retrieving information takes just a matter of seconds nowadays, but have you ever wondered
how this happens? How is this data retrieved? Around 43 zettabytes of data is generated
every day, and about 89% of this data is trivial for the analysis process. Extracting useful
information out of this raw data is what Text Mining is about. Ever got an advertisement pop-up
in the middle of surfing the internet, related to what you previously searched for? Or
wondered how YouTube selects "recommended" videos for you? This is an
application of Text Mining, more specifically of sentiment analysis. Text mining simply
means analyzing the huge amount of data stored in a database and extracting the relevant and
required information from it for further processing. Sentiment analysis means
evaluating text for its tone: whether it is positive, negative, or neutral.
Mining of data in any form (text, image, audio, video) is carried out solely to extract the
necessary information from the raw data. In this report you will find a study of Text
Mining and Sentiment Analysis: what they exactly are and how they are actually implemented,
from scratch.
Section 1.1 discusses the Motivation for the topic, Section 1.2 the Timeline/Evolution,
and Section 1.3 states the organization of the report.
1.1 Motivation
A huge amount of data is generated every day, and most of it is held in
unstructured format, much of which is trivial for the analysis process. Examining this large amount of data
and processing it to extract the necessary information is a very tedious and exhaustive
task. This is where Text Mining comes into the picture.
For the purpose of extracting valuable and non-trivial information from this
unstructured data, text mining has become a necessity today. By providing facilities for
almost all requirements in data processing, text mining makes the task a lot
easier and saves time. With the help of NLP (natural language processing), text mining
has reached a new level of simplicity.
Curiosity about how this conversion happens and how this huge amount of data is handled
motivated me to choose this topic.
1.2 Timeline/Evolution
The term text analytics describes a set of linguistic, statistical, and machine
learning techniques that model and structure the information content of textual sources
for business intelligence, exploratory data analysis, research, or investigation. The term is
roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description
of "text mining" in 2004 to describe "text analytics" [1]. The latter term is now used more
frequently in business settings, while "text mining" is used in some of the earliest
application areas, dating to the 1980s, notably life-sciences research and government
intelligence. Prof. Marti A. Hearst, in the paper Untangling Text Data Mining, described how the
emergence of text analytics in its current form stems from a refocusing of research in the
late 1990s from algorithm development to application [2].
As management information systems developed starting in the 1960s, and as BI
emerged in the '80s and '90s as a software category and field of practice, the emphasis was
on numerical data stored in relational databases. Later, the focus shifted to
analyzing text for BI and increasing productivity in business. [8]
Several papers have been published on sentiment analysis for the domain of blogs
and product reviews. Researchers have contributed work on detecting sentiment in
text by providing a simple algorithm, called semantic orientation, and by analyzing the utility of
linguistic features for detecting the sentiment of Twitter text. In the given hierarchical
method, text is first classified as containing sentiment and then categorized as positive or negative.
With the growth of blogs and social networking sites, opinion mining and sentiment analysis
grew to become an area of interest for a lot of researchers. The authors used a Twitter database
and trained models to perform a sentiment search, developed corpora by using
emoticons to gather positive as well as negative samples, and then worked with several classifiers.
The most suitable outcome was attained by the Naïve Bayes classifier with a traditional
information measure for feature selection, with which they succeeded in achieving around 81
percent accuracy on their test set. The techniques mentioned above were
not up to the mark for classifying the emoticons (negative, positive, and neutral). [3]
Chapter 2
Literature Review
Text mining and Sentiment Analysis have been handled as Natural Language Processing tasks at
many levels of granularity. Section 2.1 discusses the Fundamentals of the report, Section 2.2
gives information on Related Work in this field, and Section 2.3 states the Survey Conclusion.
2.1 Fundamentals
Understanding how raw unstructured data is converted to structured data for
knowledge extraction and further processing, and how text mining is implemented
using NLP.
Chapter 3
Methodology
This chapter explains all the topics and subtopics that the report comprises.
Section 3.1 discusses Text Mining and all its subtopics. Section 3.2 explains NLP and
its terminologies. Section 3.3 explains how machine learning can be applied, and Section 3.4
gives the applications of Text Mining. Section 3.5 introduces Sentiment Analysis
and its process, and Section 3.6 explains Sentiment Classification. Lastly, Section 3.7 lists
the applications of SA and Section 3.8 provides a simple example of Twitter Sentiment Analysis.
3.1.3 Summarization
Summarization is the process of collecting and producing a concise representation of a
document from its original text; this process is called text summarization. In summarization,
the raw text is first taken and pre-processing and processing operations are
performed on it. In pre-processing, three methods are applied: tokenization,
stemming, and stop-word removal. At the processing stage of text
summarization, lexicon lists are generated. In recent years, the performance of
automatic text summarization has been driven mainly by the rate of appearance of words or
phrases. Later, to increase the correctness of the results, more
methods were brought into practice along with the standard text mining procedures.
Text summarization techniques can be applied to multiple documents at the same time.
The subject of the documents depends on the quality and type of the classifiers.
In summarization, precise text is generated from a number of documents. It is
often not possible to encapsulate a huge textual file, nor is it feasible to read
all the documents being examined. Summarization therefore extracts the important
points and composes a summary of each document.
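As an illustration of the frequency-based idea described above, the following is a minimal sketch (not code from the report) of extractive summarization with nltk: sentences are scored by the frequency of their non-stop words and the highest-scoring sentences are kept. The function name and parameters are assumptions.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Requires: nltk.download('punkt') and nltk.download('stopwords')
def summarize(text, max_sentences=2):
    # Pre-processing: tokenize and drop stop words
    stop = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = nltk.FreqDist(words)                    # word frequencies over the document
    # Processing: score each sentence by the frequency of the words it contains
    scores = {sent: sum(freq[w.lower()] for w in word_tokenize(sent))
              for sent in sent_tokenize(text)}
    # Keep the highest-scoring sentences as the summary
    best = sorted(scores, key=scores.get, reverse=True)[:max_sentences]
    return " ".join(best)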
3.1.4 Clustering
The process of partitioning a group of objects or data into collections of relevant
and understandable subclasses is termed clustering. Clustering is mainly used
to form sets of similar documents and files. The advantage of clustering is that
a document or text file can belong to multiple subtopics, which reduces the chance of
important documents being missed in a search. The clustering technique
separates the records in a dataset into groups in such a way that themes within a cluster
are similar while themes across clusters are different. The main aim of cluster analysis
is to obtain groups that have some value with regard to the problem being addressed,
although this result is not always achieved. Clustering is mainly
classified into two types:
i. Hierarchical
ii. Non-hierarchical
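As a minimal sketch of non-hierarchical clustering (k-means over TF-IDF vectors), assuming scikit-learn is available; the documents are made-up examples, not data from the report:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["the movie was great", "an excellent film",
        "stock prices fell today", "the markets dropped sharply"]

# Represent each document as a TF-IDF vector
vectors = TfidfVectorizer().fit_transform(docs)

# Partition the documents into two groups of similar themes
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)   # documents sharing a label belong to the same cluster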
3.1.5 Categorization
In categorization, the important themes of a document are recognized. This is
done by assigning the document to a set of predefined topics. A document
being categorized can be treated as a 'bag of words'. Unlike
information extraction, categorization does not attempt to process the actual
meaning of the information; instead, the categorization process counts the
words of the document and, using these counts, recognizes the important
subjects of the document.
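The word-counting idea can be sketched roughly as follows; the predefined topics and their keyword lists are hypothetical and only illustrate assigning a document to the topic whose keywords occur most often:

from collections import Counter

# Hypothetical predefined topics and their keywords
topics = {
    "sports": {"match", "team", "score", "player"},
    "finance": {"stock", "market", "price", "bank"},
}

def categorize(document):
    counts = Counter(document.lower().split())                        # bag of words
    scores = {t: sum(counts[w] for w in kws) for t, kws in topics.items()}
    return max(scores, key=scores.get)                                # best-matching topic

print(categorize("the team won the match with a record score"))      # -> sports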
3.2 NLP (Natural Language Processing)
NLP is a field of AI in which a computer analyzes, understands, and derives meaning from
human language in a smart and useful way. The history of natural language
processing (NLP) generally started in the 1950s, although work can be found from earlier
periods. In 1950, Alan Turing published an article titled "Computing Machinery and
Intelligence" which proposed what is now called the Turing test as a criterion of
intelligence. Up to the 1980s, most natural language processing systems were based on
complex sets of hand-written rules. Starting in the late 1980s, however, there was a
revolution in natural language processing with the introduction of machine
learning algorithms for language processing. NLP is the finest approach for Text Mining and is
widely used with the Natural Language Toolkit (NLTK) provided for Python. [8]
The NLP pipeline consists of the following steps:
1. Tokenization
2. Stemming
3. Lemmatization
4. Data Cleaning
5. POS Tags
6. Named Entities
7. Chunking
3.2.1 Tokenization
It is the first step in NLP and consists of 3 steps:
• Break a complex sentence into words.
• Understand the importance of each word with respect to the sentence.
• Produce a structural description of an input sentence.
Fig. (3.2.1) Tokenization using nltk
A bigram gives a list of tuples containing 2 consecutive tokens from the tokenized list.
Similarly, a trigram produces tuples of 3 consecutive tokens, and an n-gram tuples of N tokens.
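A small sketch of what the tokenization and n-gram calls shown in the figure might look like; the sample sentence is an assumption:

from nltk.tokenize import word_tokenize
from nltk.util import bigrams, trigrams, ngrams

# Requires: nltk.download('punkt')
sentence = "Natural language processing makes text mining simpler"
tokens = word_tokenize(sentence)         # break the sentence into words
print(list(bigrams(tokens)))             # tuples of 2 consecutive tokens
print(list(trigrams(tokens)))            # tuples of 3 consecutive tokens
print(list(ngrams(tokens, 4)))           # tuples of N consecutive tokens (here N=4)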
3.2.2 Stemming
Stemming is normalizing words into their base or root form. For example,
Detected, Detection, Detecting, Detections
have the root form Detect.
A stemming algorithm works by cutting off the end or the beginning of the word, taking
into account a common set of suffixes and prefixes, and hence the actual root form of the
word may not be obtained every time.
nltk provides two main stemmers, the PorterStemmer and the LancasterStemmer; the
LancasterStemmer is more aggressive than the PorterStemmer.
PorterStemmer
LancasterStemmer
Fig (3.2.4) LancasterStemmer using nltk
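A hedged sketch of the stemmer calls shown in the figures; the word list is illustrative:

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
for w in ["detected", "detection", "detecting", "detections"]:
    # Porter cuts conservatively; Lancaster cuts more aggressively
    print(w, "->", porter.stem(w), "/", lancaster.stem(w))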
3.2.3 Lemmatization
This step groups together the different inflected forms of a word into its lemma.
It is somewhat similar to stemming, as it maps several words having the same meaning to a
common root, but the output is a proper word. A detailed dictionary is required for this
mapping, which is provided by WordNet through nltk.stem. Lemmatization also streamlines the
analysis process by removing redundant data, mapping variants to a common root.
e.g. Gone, Going, and Went are mapped to the same root Go.
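A minimal sketch of lemmatization with nltk's WordNet lemmatizer; the sample words follow the example above, and the part-of-speech argument is needed so that verbs map to their root:

from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
for w in ["gone", "going", "went"]:
    # pos='v' tells the lemmatizer these are verbs, so all map to 'go'
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))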
• Algorithm (data cleaning: removing punctuation and digits from the token list):

from nltk.corpus import stopwords   # imported for the later stop-word removal step
import re

# Pattern matching punctuation characters and digits
punctuations = re.compile(r'[-,?:;()|0-9]')

final = []
for words in AI_tokens:             # AI_tokens: the token list produced earlier
    word = punctuations.sub("", words)
    if len(word) > 0:               # keep only non-empty cleaned tokens
        final.append(word)
Fig (3.2.6) POS Tags assigned using nltk
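The POS tagging shown in the figure can be sketched as follows; the sentence is an assumption:

import nltk
from nltk.tokenize import word_tokenize

# Requires: nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize("Text mining extracts useful information from raw text")
print(nltk.pos_tag(tokens))    # each token paired with its part-of-speech tag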
3.2.7 Chunking
Chunking means picking up individual pieces of information and grouping them into
bigger pieces, known as chunks. It groups noun phrases according to a specified
grammar.
In the example we first specify a regular expression, and a parser is created from it.
The text/document is first tokenized, and these tokens are then
passed to the parser to form the tree.
Fig (3.2.7) Chunking using nltk
The output of chunking is a tree of noun phrases (as shown below) which clubs together
the noun phrases formed according to the regular expression specified in the code.
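A hedged sketch of the chunking steps described above; the grammar and the sentence are illustrative assumptions, not necessarily the exact ones used in the figure:

import nltk
from nltk.tokenize import word_tokenize

# Grammar: a noun phrase (NP) is an optional determiner, any adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)      # parser created from the regular expression

tokens = word_tokenize("The quick brown fox jumped over the lazy dog")
tagged = nltk.pos_tag(tokens)            # chunking works on POS-tagged tokens
tree = parser.parse(tagged)              # tree whose NP subtrees are the chunks
print(tree)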
but this assumption severely limits its applicability. Depending on the precise nature of
the probability model, naïve Bayes classifiers can be trained very efficiently,
requiring only a relatively small amount of training data to estimate the parameters necessary
for classification. Because the variables are assumed independent, only the variances of the
variables for each class need to be determined, and not the entire covariance matrix. Despite
its apparently over-simplified assumptions, the naïve Bayes classifier often works
much better in many complex real-world situations than one might expect. The naïve
Bayes classifier has been reported to perform surprisingly well for many real-world
classification applications under some specific conditions [11] [12] [13] [14] [15]. An
advantage of the naïve Bayes classifier is that it requires a small amount of training data
to estimate the parameters necessary for classification. The Bayesian classification approach
arrives at the correct classification as long as the correct category is more probable than
the others; the category probabilities do not have to be estimated very precisely. In other words,
the overall classifier is robust enough to ignore serious deficiencies in its underlying
naïve probability model. The main disadvantage of the naïve Bayes classification
approach is its relatively low classification performance compared to other discriminative
algorithms, such as the SVM, which often outperforms it in classification effectiveness.
Therefore, much active research has been carried out to clarify the reasons why the
naïve Bayes classifier fails in some classification tasks and to enhance the traditional approach
by implementing effective and efficient techniques [11] [13] [14] [15] [16].
Naïve Bayes has been one of the popular machine learning methods for many years. Its
simplicity makes the framework attractive for various tasks, and reasonable performance
is obtained even though the learning is based on an unrealistic independence
assumption. For this reason, there have also been many interesting works investigating
naïve Bayes. Recently, [17] showed very good results by combining Naïve Bayes with
SVM for text classification, and the authors in [18] show that Naïve Bayes with SOM
gives very good results for clustering documents.
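As an illustration of the discussion above, the following is a minimal naïve Bayes text classifier using nltk; the tiny training set is made up, and a real application would need far more data:

import nltk

def features(text):
    # Bag-of-words features: which words are present in the text
    return {word: True for word in text.lower().split()}

# Hypothetical labelled training examples
train = [
    (features("this movie was great and enjoyable"), "pos"),
    (features("what a wonderful performance"), "pos"),
    (features("the film was boring and bad"), "neg"),
    (features("a terrible waste of time"), "neg"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("a great and wonderful film")))   # -> pos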
content. It involves predicting or analyzing the hidden information present in the text.
This hidden information is very useful for gaining insights into users' likes and dislikes. The aim
of sentiment analysis is to determine the attitude of a writer or a speaker towards a given
topic. Sentiment analysis can also be applied to audio, images, and videos. [19] Today,
the internet has become a major part of our lives. Most people use blogging
sites or social networking sites to express their opinions on certain things, and they also use
these sites to learn other people's opinions. Thus, mining this data and extracting
sentiment from it has become an important field of research.
3) Stemming - It is the process of removing prefixes and suffixes. For example, 'playing'
and 'played' can be stemmed to 'play'. It helps in classification but sometimes leads to a
decrease in classification accuracy.
4) Stop words - Pronouns (he/she, it), articles (a, the), and prepositions (in, near, beside) are
stop words. They provide little or no information about sentiment. Lists of stop
words are available on the internet and can be used to remove them in the pre-processing step.
5) Conjunction handling - In general, each sentence expresses only one meaning at a time,
but certain conjunctions like 'but', 'while', 'although', and 'however' change the whole
meaning of the sentence, for example, 'Although the movie was good, it was not up to my
expectations.' By using rules for such conjunctions, throughput can be increased by 5%.
6) Negation handling - Negation words like 'not' invert the meaning of the whole sentence.
For example, 'The movie was not good' contains 'good', which is positive, but 'not' inverts
the polarity to negative. [20]
The polarity of the text is calculated according to the positive and negative values assigned to
each word/token; the sum is calculated and the final polarity is reported. An example
illustrating the format of this calculation is sketched below.
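The original worked example appears as a figure; the calculation format can be sketched roughly as follows, with illustrative (assumed) word scores:

# Hypothetical word-level scores: positive words > 0, negative words < 0
lexicon = {"good": 1, "great": 2, "bad": -1, "terrible": -2, "not": -1}

def polarity(sentence):
    tokens = sentence.lower().split()
    score = sum(lexicon.get(t, 0) for t in tokens)   # sum the per-token values
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("the movie was great"))       # positive (score +2)
print(polarity("the movie was not good"))    # neutral (-1 + 1 = 0): the naive sum misses negation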
3.6 Sentiment Classification
3.2.1.1 Dictionary-based approach - In this approach a set of opinion words is manually
collected and a seed list is prepared. Dictionaries and thesauri are then searched to find
synonyms and antonyms of these words, and the newly found synonyms are added to the seed list.
This process continues until no new words are found. Disadvantage: difficulty in finding
context- or domain-oriented opinion words.
3.2.1.2 Corpus-based approach - A corpus is a collection of writings, often on a specific topic.
In this approach, a seed list is prepared and is expanded with the help of corpus text; thus
it solves the problem of limited domain-oriented text. It can be done in two ways:
a) Statistical approach: This approach is used to find co-occurring words in the
corpus. The idea is that if a word appears mostly in positive text, then its polarity is
positive; if it mostly occurs in negative text, then its polarity is negative.
b) Semantic approach: This approach calculates sentiment values by using the
principle of similarity between words. WordNet can be used for this purpose: synonyms
and antonyms of a given word can be found and its sentiment value
calculated.
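A rough sketch of the WordNet-based seed-list expansion described above; the seed words and the fixed number of expansion rounds are assumptions made for illustration:

from nltk.corpus import wordnet

# Requires: nltk.download('wordnet')
seed = {"good", "bad"}                   # manually collected opinion words
expanded = set(seed)
for _ in range(2):                       # in practice: repeat until no new words appear
    new_words = set()
    for word in expanded:
        for syn in wordnet.synsets(word):
            new_words.update(l.replace("_", " ") for l in syn.lemma_names())
    expanded |= new_words
print(len(expanded), "opinion words grown from the seed list")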
Naïve Bayes: This classifier uses Bayes' theorem to predict the probability that
a given set of features belongs to a particular label. It uses the bag-of-words (BOW)
model for feature extraction and assumes that all the features are independent:
P(label | features) = P(label) * P(features | label) / P(features)
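3.8 Example (Twitter Sentiment Analysis)
The original code snippet appears as a figure in the report and is not reproduced here; the following is a rough reconstruction of what such a program might look like, assuming an older version of tweepy (where api.search() is available), the textblob library, and placeholder API credentials:

import tweepy
from textblob import TextBlob

# Placeholder credentials; real keys come from a Twitter developer account
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Search the dataset of tweets for a given word
for tweet in api.search(q="python", count=10):
    blob = TextBlob(tweet.text)               # TextBlob handles tokenization, etc.
    polarity = blob.sentiment.polarity        # -1 (negative) .. +1 (positive)
    subjectivity = blob.sentiment.subjectivity
    print(tweet.text[:60], polarity, subjectivity)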
The code sketch above provides a basic program to access the Twitter API and the
dataset and perform SA.
TextBlob performs all the tokenization, stemming, etc.
The api.search() call takes a word as a parameter, searches the dataset for tweets
containing that word, analyses them, and provides the sentiment of each tweet
by computing the polarity and subjectivity of each.
Chapter 4
Discussion of Results
A good understanding of the topics explained, namely TM, SA, and NLP, can be gained from this report.
All the information required to begin with Text Mining and all the terminologies of NLP are
explained in detail, along with the actual implementation in Python using nltk; the various
facilities provided for text processing are discussed with appropriate code. Code snippets for the
functions are provided so as to understand the working and output of each.
Text Mining and Sentiment Analysis and the need for them are introduced, and how these
technologies are evolving and the applications they are used in are stated. Text mining allows
processing of unstructured data, which contributes to many applications for information and
knowledge extraction and is increasingly being adopted in BI to expand business.
Furthermore, opinion mining and sentiment analysis are also reviewed. During the review it was
found that a large number of NLP techniques are available for opinion mining and sentiment
analysis. The fundamental point of opinion mining and sentiment analysis is extracting the sentiment
present in the given writings. To carry out this task, opinion mining can be
separated into three levels: document level, sentence level, and fine-grained level.
The use of SA spans almost every field and is still growing. As we discussed how it works and where
it is used, the results are prominent in analyzing the sentiment of a text and providing its
polarity: how positive, negative, or neutral it is.
With a small and widely used example, we discussed its application to Twitter sentiment
analysis. Many more applications are implementing Text Mining and Sentiment Analysis, making
them rapidly growing technologies.
Text Mining still has some challenges:
- Identification of spam, low-value content, and users with multiple accounts; there is a need to
segment and score users.
- Potential privacy or liability issues, e.g. if the data gathered is used to target people
individually, through marketing campaigns, fraud investigations, or to penalize users (e.g.
refusing a job to a candidate based on data mining of the user's posts on social networks). [22]
- Sarcasm detection is difficult, as it involves alternate meanings for some words, and the
correct interpretation of a word cannot be determined accurately all the time.
Chapter 5
Conclusion and Future Work
5.1 Conclusion
Text mining techniques help in deriving different traits from amorphous textual data.
Several methods and techniques lead to well-organized and accurate text mining. This
report is based on how mining should be performed on textual data. The process of text
mining, its applications, information retrieval, summarization, and various such methods
have been discussed. Based on these observations a convincing approach is identified,
the methods are examined, and upgrades to the methods are suggested.
The implementation using Python's nltk is discussed and explained, including how the
toolkit and its facilities can be utilized to perform text mining using NLP.
A general approach for Sentiment Analysis is discussed and different approaches are
explained.
An overall description of NLP, its terminologies, its application to text mining,
Sentiment Analysis, its classification, and its applications is covered and explained. This
is a descriptive report for getting introduced to NLP, TM, and SA.
Two basic approaches (with limited accuracy and depth):
– Statistical signature of a bag of words
– Dictionary of positive and negative words [21]
References
[1] https://thesai.org/Downloads/Volume7No11/Paper_53-Text_Mining_Techniques_Applications_and_Issues.pdf
[2] "Marti Hearst: What is Text Mining?"
[3] https://www.researchgate.net/publication/334167408_Machine_Learning_Based_Approach_To_Sentiment_Analysis
[4] https://www.semanticscholar.org/paper/Natural-Language-Processing-and-Text-Mining-to-for-Valdez-Almada-Rodr%C3%ADguez-Elias/70e46639f3f9a570797fd6adf337f722bf0330bf
[5] https://www.researchgate.net/publication/220355220_Recognizing_Contextual_Polarity_An_Exploration_of_Features_for_Phrase-Level_Sentiment_Analysis
[6] https://www.researchgate.net/publication/330796490_Review_on_Natural_Language_Processing_NLP_and_Its_Toolkits_for_Opinion_Mining_and_Sentiment_Analysis
[7] Hotho, A., Nürnberger, A. and Paaß, G. (2005). "A brief survey of text mining". In LDV Forum, Vol. 20(1), pp. 19-62.
[8] https://en.wikipedia.org/wiki/Natural_language_processing
[9] Heide Brücher, Gerhard Knolmayer, Marc-André Mittermayer; "Document Classification Methods for Organizing Explicit Knowledge", Research Group Information Engineering, Institute of Information Systems, University of Bern, Engehaldenstrasse 8, CH-3012 Bern, Switzerland. 2002.
[10] Andrew McCallum, Kamal Nigam; "A Comparison of Event Models for Naïve Bayes Text Classification", Journal of Machine Learning Research 3, pp. 1265-1287. 2003.
[11] Irina Rish; "An Empirical Study of the Naïve Bayes Classifier", In Proceedings of the IJCAI-01 Workshop on Empirical Methods in Artificial Intelligence. 2001.
[12] Irina Rish, Joseph Hellerstein, Jayram Thathachar; "An Analysis of Data Characteristics that Affect Naïve Bayes Performance", IBM T.J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, NY 10532, USA. 2001.
[13] Pedro Domingos, Michael Pazzani; "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss", Machine Learning, Vol. 29, No. 2-3, pp. 103-130. 1997.
[14] Sang-Bum Kim, Hue-Chang Rim, Dong-Suk Yook, Huei-Seok Lim; "Effective Methods for Improving Naïve Bayes Text Classification", 7th Pacific Rim International Conference on Artificial Intelligence, Vol. 2417. 2002.
[15] Susana Eyheramendy, Alexander Genkin, Wen-Hua Ju, David D. Lewis, and David Madigan; "Sparse Bayesian Classifiers for Text Categorization", Department of Statistics, Rutgers University. 2003.
[16] Dino Isa, Lam Hong Lee, V. P. Kallimani, R. RajKumar; "Text Document Preprocessing with the Bayes Formula for Classification using the Support Vector Machine", IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, pp. 1264-1272, 2008.
[17] Dino Isa, V. P. Kallimani, Lam Hong Lee; "Using the Self Organizing Map for Clustering of Text Documents", Elsevier, Expert Systems with Applications, 2008.
[18] Medhat, Walaa, Ahmed Hassan, and Hoda Korashy; "Sentiment analysis algorithms and applications: A survey", Ain Shams Engineering Journal 5.4, pp. 1093-1113, 2014.
[19] https://www.researchgate.net/publication/320250187_A_survey_of_sentiment_analysis_techniques
[20] http://www.textanalyticsworld.com/pdf/Future_directions.pdf
[21] https://www.analyticbridge.datasciencecentral.com/group/socialnetworkanalytics/forum/topics/what-are-the-main-4-challenges