Social Media Sentiment Analysis
We hereby declare that this project has been done by us under the supervision of
Dr. Hari Singh, Assistant Professor (SG), CSE & IT Department, Jaypee
University of Information Technology. We also declare that neither this project
nor any part of this project has been submitted elsewhere for the award of any
degree or diploma.
Supervised by:
Dr. Hari Singh
Assistant Professor (SG)
CSE & IT Department
Jaypee University of Information Technology
Submitted by:
Ayush Bari
(181379)
Final Year Student, B. Tech
Jaypee University of Information Technology
CERTIFICATE
This is to certify that the work which is being presented in the project report titled
“SOCIAL MEDIA SENTIMENT ANALYSIS” in partial fulfillment of the
requirements for the award of the degree of B. Tech in Computer Science And
Engineering and submitted to the Department of Computer Science And
Engineering, Jaypee University of Information Technology, Waknaghat is an
authentic record of work carried out by “Ayush Bari (181379)” during the period
from August 2021 to December 2021 under the supervision of Dr. Hari Singh,
Assistant Professor (SG), CSE & IT Department, Jaypee University of
Information Technology, Waknaghat.
Ayush Bari
(181379)
First, I express my heartfelt thanks and gratitude to Almighty God, whose divine
blessing made it possible to complete this project successfully.
I am grateful and deeply indebted to my supervisor Dr. Hari Singh,
Associate Prof, Department of CSE, Jaypee University of Information Technology,
Waknaghat. His deep knowledge of and keen interest in the field of
“Sentiment Analysis” helped me carry out this project. His endless patience, scholarly
guidance, continual encouragement, constant and energetic supervision,
constructive criticism, valuable advice, and willingness to read many inferior drafts
and correct them at all stages have made it possible to complete this project.
I would also like to thank everyone who has helped me, directly or indirectly,
in making this project a success. In particular, I thank the staff members, both
teaching and non-teaching, who extended their timely help and facilitated my
work.
Finally, I must acknowledge with due respect the constant support and patience
of my parents.
Ayush Bari
(181379)
ABSTRACT
Social media has become a popular place for people to express their opinions about
a brand, discuss it, and give feedback. It helps in understanding people's
sentiment about any topic or incident. Analyzing sentiment helps in understanding
how people feel and in classifying their opinions as negative, positive or
neutral.
Such data is available in large quantities, which makes it difficult to examine,
evaluate and classify manually. Instead of carrying out this time-consuming exercise
by hand, we use automated techniques to solve the problem. The dataset used here is a
collection of text and internet blogs. Several machine learning
classifiers are applied so that a person's sentiment can be identified, and the
classifier with the best result is chosen
to predict people's emotions.
With this analysis, professionals can evaluate people's emotions more accurately,
which can help them identify early symptoms of distress.
Chapter-1 INTRODUCTION
1.1 Introduction
As the world advances into the internet era, many young people have
developed a habit of starting their day with a good morning text. The intention of
these technologies was to make human life easier and friendlier, and they are useful in
every aspect of our life. Machines have become so advanced that they
can even make predictions about the future from current data with the help of
artificial intelligence.
As social media websites keep evolving and slowly become a source of all kinds of
information, people have started posting their opinions on various topics, discussions,
issues and complaints, expressing negative, positive, or neutral emotions in response to the
products they use or the conditions they go through. Many brands and companies even
conduct polls on these sites and blogs to understand public sentiment about and
demand for their various offerings. This creates a requirement for technology that can
identify and summarize people's overall sentiment.
Problem Statement -
Social media is the hub of public opinion, where millions of opinions are shared every
few minutes. More and more people are expressing their thoughts on social media,
and these thoughts can be both neutral and polarizing. The problem is to plot public
sentiment on a given topic and monitor it.
Objective -
The objective of the project is to extract tweets from Twitter and extract the sentiment
of each tweet. We will then plot graphs based on the subjectivity and polarity of the
tweets to examine that sentiment.
CHAPTER 2 - LITERATURE SURVEY
Machine Learning
Machine learning algorithms are used in a wide variety of applications, such as
recommender systems, which recommend the next thing people should interact with on
their devices, or filters that remove unwanted email, often called spam.
Fields like speech recognition and computer vision have become very convenient
and accessible through the use of machine learning. Many of today’s big tech companies,
like Facebook, Google and Apple, use machine learning as a central part of their
functioning.
Sentiment analysis is the process of judging whether the opinion expressed is positive,
negative or neutral. A sentiment analysis system for text uses natural language
processing (NLP) with machine learning techniques to assign scores to the
entities, topics, categories and themes in every sentence and phrase shared by the user.
It helps organizations and data scientists gauge public opinion and conduct nuanced
market research. It helps in monitoring complex things like a bank's reputation or likes
and dislikes about a certain product, and in identifying people’s emotions. People going
through mental illness, depression or suicidal thoughts can be approached early, which
can help in reducing already very high suicide rates.
There are many types of sentiment analysis, which are useful for different purposes.
The sentiment usually ranges from very positive to very negative. Words like
anger, sadness and worry are associated with negative sentiment, and words like
happiness, love and enthusiasm with positive sentiment.
Example – “Virat Kohli is killing it” is a positive comment, but the word “killing”
can confuse the algorithm.
● Aspect based
Aspect based sentiment analysis is usually used for a particular aspect of a service or
product. For example, a laptop vendor might use it for one aspect like processing
speed or graphics performance, to understand how customers feel about that
attribute. By using such specific details, opinions can be
aggregated to provide insightful feedback to the company. This makes it easier to
detect customer complaints and resolve them.
● Intent Analysis
It is often necessary to focus on the right customer. Doing so
helps a company save and better utilize its
limited resources. Intent analysis helps companies understand whether the customer
actually intends to use a service or not. It is highly important for marketing and
advertising purposes.
Sentiment Library – These are very large collections of words like
wonderful, horrible, bad, dirty and phrases like amazing show, horrible story, good
performance, wonderful game. These entries are pre-scored by humans and given a
sentiment weight. A basic sentiment analysis algorithm uses such a library to
determine the sentiment of the phrases it encounters.
This type of manual sentiment scoring can be tricky, as it involves deciding
how positive or negative each word or phrase should be rated. These libraries require
consistent maintenance: scores are tweaked based on trends and new phrases are added.
Such a library can be used to perform sentiment analysis without training machine
learning models, which is known as rule-based sentiment analysis.
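A rule-based scorer of this kind can be sketched in a few lines of Python. The library below is a hypothetical, hand-scored dictionary whose words come from the examples above; the weights are illustrative assumptions, not values used in this project.

```python
import re

# Hypothetical sentiment library: the weights are illustrative assumptions.
SENTIMENT_LIBRARY = {
    "wonderful": 0.9,
    "amazing": 0.8,
    "good": 0.5,
    "bad": -0.5,
    "dirty": -0.6,
    "horrible": -0.9,
}


def lexicon_score(text):
    """Average the weights of the known words; 0.0 if none are found."""
    words = re.findall(r"[a-z']+", text.lower())
    weights = [SENTIMENT_LIBRARY[w] for w in words if w in SENTIMENT_LIBRARY]
    return sum(weights) / len(weights) if weights else 0.0
```

A phrase such as "horrible story" receives a negative score, while a phrase with no library words scores 0.0, which shows why such libraries need regular maintenance.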
Sentiment analysis can be performed on both static data and streaming data. Static
data refers to a fixed data set that is collected beforehand and
processed later to get the required results. Streaming data, in contrast, is
continuously generated in real time and
needs to be processed constantly using suitable stream-processing techniques. While
static data is predefined and does not change, streaming data keeps arriving after it
is first recorded and hence can give results that better reflect current sentiment.
Related Works –
Sentiment analysis and emotion analysis have been the subject of a lot of research
in recent times, although most of the research classifies only positive and negative
emotions in textual data.
Zhao Jianqiang and Gui Xiaolin (2016) studied how various text pre-processing
methods affect the performance of sentiment classification. In their research they
compared six different pre-processing methods and their effect on sentiment polarity
classification on Twitter. Experimental results indicated that the performance
of classifiers is barely affected by the removal of URLs, stop words and
numbers, although these steps are helpful in reducing noise. On the other hand, replacing
negations and expanding acronyms improved classification performance. The study
concludes that the choice of pre-processing methods produces different results for
different classifiers.
Mondher Bouazizi and Tomoaki Ohtsuki (2016) used part-of-speech tags to extract
patterns that characterize the level of sarcasm of tweets. After performing
different natural language processing (NLP) tasks with the Apache OpenNLP tool, they
obtained good results with their approach, and noted that the results could be better
with a bigger training set covering a larger set of sarcastic tweets.
Bridianne O'Dea, Philip J. Batterham, Mark E. Larsen, Cecile Paris, Tjeerd W. Boonstra,
and Helen Christensen (2015) presented the “We Feel” system, a real-time emotional
sentiment analysis tool for Twitter. They were able to analyze
2.73 × 10^9 tweets over a 12-week period. They detected significant events while
analyzing the weekly variations in emotional expression. They also observed some
indices of anxiety, depression and suicidal thoughts, which indicates that a social media
tool that measures mental health could also be built.
Ping Feng Pai and Chia Hsin Liu (2018) [5] developed a framework consisting of time
series forecasting models and a multivariate regression technique to predict monthly
total vehicle sales.
The researchers obtained more accurate forecasting results after applying
deseasonalizing procedures to hybrid multivariate data.
El Alaoui et al. [4] performed sentiment analysis on tweets collected during the
2016 US elections. They presumed that every class has its own view of the US
elections. S. Kaur et al. [6] used N-grams for feature extraction and a KNN
classifier for classification. That model was able to achieve
an accuracy of over 86%.
NLP enables computer software to take text in one language and convert it to
another. It also enables software to respond to spoken human commands with the help of
statistics, machine learning, and deep learning neural models. All in all, these
technologies enable software to understand the meaning of human text and work
with human sentiment and intent.
We interact with NLP in voice assistants and GPS systems, and it helps us summarize large
volumes of text. Speech-to-text dictation programs, customer service
chatbots and many other consumer convenience services work with NLP. NLP is very useful to
enterprises, as it helps streamline operations, simplify mission-critical business
processes and increase employee productivity.
Human language is filled with nuances and ambiguity that make it
difficult for software to determine what customers mean from their text
or voice data. Human speech contains sarcasm, idioms, metaphors,
grammatical exceptions, homophones and many other variations. It is also full
of irregularities.
2. Part of Speech tagging - It is the problem of determining the part of speech of a
specific word in a piece of human-written text, based on its
use and context.
3. Named entity recognition – It identifies words or phrases as
useful entities. NER helps identify which word is a
location, which is a name, etc.
Python is a general-purpose high-level language and has been very popular
for a long time. It is heavily used in domains such as web development and machine
learning, as it brings easy-to-use advanced features to
programmers. Python supports both object-oriented and procedural paradigms.
In Python, programmers generally have to type less, and the mandatory indentation
makes code more readable. It is now used all across the IT sector; big tech
companies like Google, Microsoft and Apple all use
Python to develop cutting-edge technology and bring advanced customization to
their customers. It also supports dynamic typing (late binding), which binds
names to types while the program runs. Python has reference counting built in,
and owing to the less redundant nature of the language, programmers have
to write fewer lines of code than in other high-level languages like C, C++ and
Java.
Python applications contain fewer lines than programs written in other languages. That
is why we chose Python for artificial intelligence, machine learning, and dealing
with massive volumes of data. Python is an object-oriented programming language.
Classes, objects, polymorphism, encapsulation, inheritance, and reflection are all
concepts in Python.
● Textblob – It helps in accomplishing many tasks of NLP through its API which
includes sentiment analysis, classification, translation, part of speech tagging
and more.
Natural Language Toolkit (NLTK)
Pickle
Pickle is a Python module used for serializing and deserializing Python object
structures. We can pickle an object to save it on disk:
pickle serializes the object before writing it to a file.
It is thus a tool for converting Python objects such as lists and dicts into a
byte stream. The resulting byte stream contains all the
information needed to reconstruct the object in another Python script.
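As a minimal sketch of this round trip, the snippet below pickles a list of dictionaries to a temporary file and loads it back. The tweet records and the file name are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical data: a list of already-fetched tweets stored as dictionaries.
tweets = [
    {"user": "example_user", "text": "Great match today!", "polarity": 0.8},
    {"user": "example_user", "text": "Feeling tired.", "polarity": -0.4},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tweets.pkl"
    # Serialize: the file must be opened in binary mode.
    with path.open("wb") as fh:
        pickle.dump(tweets, fh)
    # Deserialize: rebuilds an equal, but separate, object.
    with path.open("rb") as fh:
        restored = pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.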
Twitter API
The Twitter API gives developers access to Twitter and most of its
functionality. We can use the API to read and write information
on the Twitter website such as tweets, trends and users. This includes:
● Tweets
● Retweets
● Trends
● Media
● Likes
To access the Twitter API, we first need to authenticate using the various keys and
tokens issued to us.
Tweepy
It is an open source Python package that provides an easy and reliable way to access the
Twitter API from Python. Tweepy consists of a range of classes and methods
that help in accessing Twitter’s models and API endpoints. It takes care of various
things like
● Streams
● OAuth authentication
● Rate limits
● HTTP requests
Image – Authentication details of Twitter API
Machine Learning Models –
Logistic Regression
The logistic (sigmoid) function is used to convert log odds to the corresponding
probability.
In this figure, logistic regression is shown, where we set the inverse of regularization
strength to 1.0 and the maximum number of iterations to 10000. All the remaining
parameters are left at their defaults.
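The conversion between log odds and probability can be illustrated with a short, self-contained sketch. The function names are chosen here for illustration and are not taken from the project code.

```python
import math


def sigmoid(log_odds):
    """Map log odds to a probability between 0 and 1."""
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    # For negative inputs, use the equivalent form that avoids overflow.
    e = math.exp(log_odds)
    return e / (1.0 + e)


def to_log_odds(p):
    """Inverse of sigmoid: log(p / (1 - p)) for 0 < p < 1."""
    return math.log(p / (1.0 - p))
```

Log odds of 0 correspond to a probability of 0.5, and applying `sigmoid` to `to_log_odds(p)` returns `p` again up to rounding.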
Naïve Bayes Classifier-
Naïve Bayes classification is a machine learning model based on
probability: it uses probabilities to complete the classification task.
This classifier is based on Bayes Theorem
Bayes Theorem
Bayes theorem is a probability theorem with which we can find the probability of A
happening, given that B has already occurred. Since B has already happened, B is the
evidence while A is the hypothesis we want to evaluate. The Naïve Bayes classifier
applies the theorem under the assumption that predictors are independent, which means
the presence of one feature does not affect the others; that is why it is called naïve.
This simple probabilistic approach makes the Naïve Bayes classifier one of the fastest
classifiers.
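Bayes theorem states that P(A|B) = P(B|A) P(A) / P(B). The sketch below works through one example; the probabilities are hypothetical numbers chosen for illustration, not measurements from this project.

```python
def bayes_posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) by the law of total probability.
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical numbers: 30% of tweets are negative (A); the word "sad" (B)
# appears in 20% of negative tweets and in 2% of all other tweets.
p_negative_given_sad = bayes_posterior(0.30, 0.20, 0.02)
```

With these assumed numbers the posterior is 0.06 / 0.074, roughly 0.81, so seeing the word "sad" raises the probability of a negative tweet from 30% to about 81%.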
Bernoulli Naïve Bayes is a variation of Naïve Bayes. It works on discrete data
following a Bernoulli distribution. The main characteristic of Bernoulli Naïve Bayes
is that it only works with binary values, such as true or false, yes or no,
success or failure, 0 or 1 and so on. So, this classifier is mainly used when the
feature values are binary.
Bernoulli distribution –
In this figure, Bernoulli Naïve Bayes is shown, where the highest accuracy was obtained
when we set the additive smoothing parameter to 0.25 and left all other parameters at
their defaults.
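For reference, the Bernoulli distribution gives P(X = x) = p^x (1 - p)^(1 - x) for x in {0, 1}. The sketch below shows how a Bernoulli Naïve Bayes model could combine such terms for one class; the function names and example probabilities are illustrative assumptions.

```python
def bernoulli_pmf(x, p):
    """P(X = x) = p**x * (1 - p)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p if x == 1 else 1.0 - p


def bernoulli_likelihood(features, word_probs):
    """Likelihood of a binary feature vector under one class."""
    likelihood = 1.0
    for x, p in zip(features, word_probs):
        likelihood *= bernoulli_pmf(x, p)
    return likelihood


# Hypothetical class: three words present with probability 0.8, 0.1 and 0.5.
example = bernoulli_likelihood([1, 0, 1], [0.8, 0.1, 0.5])
```

Unlike the multinomial variant, the absence of a word (x = 0) also contributes a factor, here 1 - 0.1 = 0.9.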
ii) Multinomial Naïve Bayes
Multinomial Naïve Bayes is another machine learning algorithm that is a variation
of Naïve Bayes. It is mostly used in Natural Language
Processing (NLP). The algorithm is derived from Naïve Bayes and uses a
probabilistic approach to classification. The Multinomial Naïve Bayes
classifier works with features that contain the frequency of specific words in a
document. It calculates the probability of every tag for a given sample, and whichever
tag has the highest probability is the answer.
Since it is based on Naïve Bayes, the presence or absence of one feature does not
affect the presence or absence of any other feature. Multinomial Naïve Bayes normally
requires integer feature counts, but it can also work with fractional counts such as
TF-IDF.
In this figure, Multinomial Naïve Bayes is shown, where we got the best result when the
additive smoothing parameter was 0.0 and all other parameters were left at their defaults.
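The idea of picking the tag with the highest probability can be sketched from scratch as follows. This is not the project's implementation: the training documents are made up, and the sketch assumes an additive smoothing value of 1.0 so that words unseen in a class do not produce log(0).

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods from tokenised docs."""
    vocab = {w for doc in docs for w in doc}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
        }
        model[label] = (math.log(n_docs / len(labels)), log_likelihood)
    return model


def predict(model, doc):
    """Return the class with the highest log posterior."""
    def score(label):
        log_prior, log_likelihood = model[label]
        # Words outside the training vocabulary are ignored.
        return log_prior + sum(
            log_likelihood[w] for w in doc if w in log_likelihood
        )
    return max(model, key=score)


# Made-up training data for illustration only.
docs = [["great", "match"], ["love", "this", "win"],
        ["bad", "loss"], ["sad", "bad", "day"]]
labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(docs, labels)
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.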
Random Forest Classifier
A random forest classifier consists of many decision trees, each built from a random
selection of the data. These trees are trained individually and operate as an ensemble.
Each decision tree seeks the best split of the data.
The trees are mainly trained as classification and regression trees (CART). In this
classification model, each tree provides its own class prediction, and
the class that gets the most votes becomes the model's prediction. The model works
well when there is little correlation between the predictions made by the individual
trees. It uses the majority vote for classification and the average for regression,
and builds decision trees from diverse samples. Since it incorporates feature
selection in its construction, it is able to accommodate a large number of features
with high prediction accuracy.
In this figure, Random Forest classifier is shown, where we set the measure of quality
of the split to ‘Gini’ and all other parameters to default.
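Two ingredients mentioned above, the Gini measure of split quality and the majority vote across trees, can be sketched as follows. The helper names are chosen for illustration and a real random forest involves considerably more than these two functions.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

A node containing only one class has impurity 0.0, while an even two-class mix has impurity 0.5; a tree prefers splits that lower this value.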
This chapter describes all the methods involved in this project, that is, the
methods we use to extract emotions from the opinions
we collected.
1. Fetching data – First, we authenticate our Twitter API credentials to fetch
tweets. After authentication, we provide the handle of the person whose tweets we
are going to analyze and fetch their tweets. We store the fetched tweets for
further use.
2. Data Pre-processing Phase – This is the processing performed on raw data
according to the needs of the algorithm. It involves cleaning, integration,
reduction and finally transformation of the data. In this project, the data is
initially a set of extracts taken from the Twitter API, which are stored in a text
file. Before storing them we clean them and remove all unnecessary items.
We use the stop-word list in the Natural Language
Toolkit (NLTK) to remove common words such as the, me, you, a, an, as and in.
These words contribute
nothing to our sentiment analysis, so removing them
helps to eliminate clutter and focus on drawing out the correct
emotions. We also use lemmatization, which reduces words to their root
form. Lemmatization uses a vocabulary and analyzes each word in order to remove
any redundant suffix or prefix and reduce the word to its core, that is, its lemma.
For example, it will reduce words like ran, runs and running to run, which is their
lemma. An alternative is stemming, which reduces words to stems, but
lemmatization is preferable as it stays closer to the real meaning of the text.
We also clean the tweets by removing components that do not assist
our sentiment analysis.
These include
● Emoticons
● Retweets
● Hyperlinks
● Advertisement
● Mentions
● Tags
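The cleaning step can be sketched with regular expressions from the Python standard library. This is a simplified illustration: the patterns are assumptions about what counts as a link, mention or tag, and the small stop-word set stands in for the NLTK list used in the project.

```python
import re

# Small illustrative stop-word set; the project uses NLTK's list instead.
STOP_WORDS = {"a", "an", "the", "as", "in", "me", "you", "is", "to", "of"}


def clean_tweet(text):
    """Strip retweet markers, links, mentions, tags and symbols."""
    text = re.sub(r"^RT\s+", "", text)         # leading retweet marker
    text = re.sub(r"https?://\S+", "", text)   # hyperlinks
    text = re.sub(r"@\w+:?", "", text)         # mentions
    text = re.sub(r"#\w+", "", text)           # tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # emoticons, digits, punctuation
    return re.sub(r"\s+", " ", text).strip().lower()


def remove_stop_words(text):
    """Split cleaned text into words and drop the stop words."""
    return [word for word in text.split() if word not in STOP_WORDS]


# Made-up tweet for illustration.
raw = "RT @example_user: What a win! :) #cricket https://t.co/abc123"
cleaned = clean_tweet(raw)
tokens = remove_stop_words(cleaned)
```

Links are removed before the generic symbol filter so that fragments of a URL do not survive as stray words.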
3. Term-Weighting Scheme – Term weighting schemes are useful
for collecting essential information from
unstructured data. They process all the words and assign each a numerical
value. This value is the weight of that particular word and
distinguishes it from the others. The method estimates the importance of different
words: it calculates
the frequency of each word and then uses that to calculate its relative
importance. Here TF-IDF has been used.
TF-IDF –
TF-IDF (term frequency-inverse document frequency) is used for scoring the
relative importance of words. It is a statistical method to determine the relevance and
importance of a given word in a given text, that is, how much
information it provides about the context. The score is directly
proportional to the number of times a word is present in a phrase, paragraph or
document, and decreases when the same word is common in other documents as well.
This has been shown with the help of below equations
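Several variants of the TF-IDF formula exist. The sketch below assumes one common formulation, tf(t, d) = count of t in d / number of words in d and idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. It may differ in detail from the variant used in this project.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dictionary per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents for illustration.
example_weights = tf_idf([["sad", "sad", "day"], ["good", "day"]])
```

In this example "day" appears in every document, so its weight is zero, while "sad" is frequent in one document only and receives the highest weight there.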
An emotionally distressed person often uses words like frustrated, sad, etc.
frequently, while a person with suicidal thoughts uses words like kill, death and
hopelessness in their writing. Their writing style contains more absolutist words, with
heavy usage of specific words like full, must, absolute, never, etc. They often see
the world as black and white with nothing in between, which is visible in their writing.
For the dataset, I have used Twitter, which contains opinions on all types of
topics and events; we extract them live through our code.
4.1. Authenticate Twitter API
consumer_key
consumer_secret
access_token
access_secret
These keys become available once we create a Twitter developer account. They
are private to each user.
Then, we enter the handle of the Twitter user whose tweets we want to extract. Here, I
am extracting 100 tweets from each Twitter user.
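The authentication and fetching steps can be sketched with Tweepy as follows. This is an illustration under assumptions, not the project's code: the environment variable names and helper functions are hypothetical, and the calls assume a Tweepy version that exposes `OAuthHandler` and `API.user_timeline` for the Twitter API v1.1.

```python
import os

CREDENTIAL_NAMES = (
    "consumer_key", "consumer_secret", "access_token", "access_secret",
)


def load_credentials(env=None):
    """Read the four keys from variables such as TWITTER_CONSUMER_KEY."""
    env = os.environ if env is None else env
    creds = {}
    for name in CREDENTIAL_NAMES:
        value = env.get("TWITTER_" + name.upper())
        if not value:
            raise KeyError("missing credential: " + name)
        creds[name] = value
    return creds


def build_api(creds):
    """Create an authenticated Tweepy client (needs the tweepy package)."""
    import tweepy  # imported lazily so load_credentials works without it
    auth = tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])
    auth.set_access_token(creds["access_token"], creds["access_secret"])
    return tweepy.API(auth)


def fetch_tweets(api, handle, count=100):
    """Return the text of the most recent tweets of one user."""
    timeline = api.user_timeline(
        screen_name=handle, count=count, tweet_mode="extended"
    )
    return [status.full_text for status in timeline]
```

Reading the keys from the environment keeps them out of the source code, which matters because they are private to each user.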
For example, these are the last 5 tweets of Virat Kohli (Twitter handle:
@imVkohli)
A tweet consists of many things which are not necessary for our analysis.
These things include
Tags (#) – Tags are related to the topic the user is commenting on. We will
remove them.
Polarity – The main task in sentiment analysis is to analyze a tweet and extract
the positive or negative sentiment from it. These sentiments are expressed as values:
positive values for positive emotions and negative values for negative emotions.
#Negative Tweets –
Here we can see that most tweets of Virat Kohli fall on the positive side of the graph
#Percentage of positive tweets –
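The grouping of tweets by polarity and the percentage calculation can be sketched as follows. The sketch assumes polarity scores between -1 and 1, such as those returned by TextBlob, and treats a score of exactly zero as neutral; the function names are illustrative.

```python
def classify_polarity(polarity):
    """Map a polarity score to a sentiment label."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


def sentiment_percentages(polarities):
    """Percentage of positive, negative and neutral scores."""
    names = ("positive", "negative", "neutral")
    labels = [classify_polarity(p) for p in polarities]
    total = len(labels)
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: 100.0 * labels.count(name) / total for name in names}
```

For four tweets with polarities 0.5, 0.2, -0.1 and 0.0, this gives 50% positive, 25% negative and 25% neutral.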
Conclusion
Emotional stress has become a leading concern all over the world. It has taken
many lives, and more and more people are being affected by it. We analyzed the
sentiment of various Twitter handles to find out what emotions they express and
what sentiments they point to. In our project, we were able to identify
positive, negative and neutral tweets and present them in graphical form.
Future Scope
In the future we can use various other classifiers to achieve better accuracy, and
this model can be further upgraded with a GUI to show the live sentiment of
people, delivering the same results with
better accessibility.
REFERENCES
[1] Haller DM, Sanci LA, Sawyer SM, Patton GC. The identification of young
people’s emotional distress: a study in primary care. Br J Gen Pract. 2009
Mar;59(560):e61-70. doi: 10.3399/bjgp09X419510. PMID: 19275825; PMCID:
PMC2648934.
[2] World Health Organization. (n.d.). SDG Target 3.4 Noncommunicable
diseases and mental health. World Health Organization.
https://www.who.int/data/gho/data/themes/topics/sdg-target-3_4-noncommunicable-diseases-and-mental-health.
[3] Praveen. (2020, April 16). Emotions dataset for NLP. Kaggle.
https://www.kaggle.com/praveengovi/emotions-dataset-for-nlp.
[4] El Alaoui, I., Gahi, Y., Messoussi, R. et al. A novel adaptable approach for
sentiment analysis on big social data. J Big Data 5, 12 (2018).
https://doi.org/10.1186/s40537-018-0120-0
[5] B. Seref and E. Bostanci, “Sentiment Analysis using Naive Bayes and
Complement Naive Bayes Classifier Algorithms on Hadoop Framework,” 2018
2nd International Symposium on Multidisciplinary Studies and Innovative
Technologies (ISMSIT), 2018, pp. 1-7, doi: 10.1109/ISMSIT.2018.8567243.
[6] S. Kaur, G. Sikka and L. K. Awasthi, “Sentiment Analysis Approach Based on
N-gram and KNN Classifier,” 2018 First International Conference on Secure
Cyber Computing and Communication (ICSCCC), 2018, pp. 1-4, doi:
10.1109/ICSCCC.2018.8703350
[7] Zainuddin, Nurulhuda & Selamat, Ali. (2014). Sentiment analysis using
Support Vector Machine. I4CT 2014 - 1st International Conference on Computer,
Communications, and Control Technology, Proceedings. 333- 337.
10.1109/I4CT.2014.6914200.
[8] Wankhade, Mayur & Chandra, A & Rao, Sekhara & Dara, Suresh & Kaushik,
Baij. (2017). A Sentiment Analysis of Food Review using Logistic Regression.
2456-3307.
[9] Ramosaco, Miftar & Hasani, Vjollca & Dumi, Alba. (2015). Application of
Logistic Regression in the Study of Students’ Performance Level (Case Study of
Vlora University). Journal of Educational and Social Research.
10.5901/jesr.2015.v5n3p239.
[10] Singh, Gurinder & Kumar, Bhawna & Gaur, Loveleen & Tyagi, Akriti.
(2019). Comparison between Multinomial and Bernoulli Naïve Bayes for Text
Classification. 593-596. 10.1109/ICACTM.2019.8776800.
[11] Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).
https://doi.org/10.1023/A:1010933404324
[12] The Bernoulli model -
https://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
[13] A. Verma and S. Mehta, “A comparative study of ensemble learning methods
for classification in bioinformatics,” 2017 7th International Conference on Cloud
Computing, Data Science & Engineering - Confluence, 2017, pp. 155-158, doi:
10.1109/CONFLUENCE.2017.7943141.
[14] Bauer, E. & Kohavi, R. (1999). An empirical comparison of voting
classification algorithms. Machine Learning, 36(1/2), 105–139.