
DEFENCE SERVICES ACADEMY

DEPARTMENT OF COMPUTER TECHNOLOGY

Twitter Sentiment Analysis Project using NLP

By
Captain Lin Htet Paing

SECOND SEMESTER
(COURSE PROJECT)

MARCH, 2023
PYIN OO LWIN
ABSTRACT
With the advancement of web technology and its growth, there is a huge
volume of data present in the web for internet users and a lot of data is generated
too. The Internet has become a platform for online learning, exchanging ideas,
and sharing opinions. Social networking sites like Twitter, Facebook, and
Google+ are rapidly gaining popularity as they allow people to share and express
their views about topics, have discussions with different communities, or post
messages across the world. There has been a lot of work in the field of sentiment
analysis of Twitter data. This survey focuses mainly on sentiment analysis of
Twitter data which is helpful to analyze the information in the tweets where
opinions are highly unstructured, heterogeneous and are either positive or
negative, or neutral in some cases. Sentiment analysis (also known as opinion
mining or emotion AI) is the use of natural language processing, text analysis,
computational linguistics, and biometrics to systematically identify, extract,
quantify, and study affective states and subjective information. In this paper,
we present research on Twitter data streams using various machine learning
algorithms: Naive Bayes, Logistic Regression, Random Forest, Decision Tree,
and Support Vector Machine. We also discuss general challenges and
applications of sentiment analysis on Twitter. Random Forest proved to be the
best model in this project, with an accuracy of 93.94%.

Keywords: Machine Learning, Support Vector Machine, Naïve Bayes, Logistic
Regression, Random Forest, Decision Tree
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
CHAPTER TITLE
1 INTRODUCTION
1.1 Problem Statement
1.2 Objectives
1.3 Summary
2 BACKGROUND THEORY
2.1 Sentiment Analysis
2.2 Twitter Sentiment Analysis
2.3 Natural Language Processing (NLP)
2.3.1 NLP Terminology
2.3.2 Application of NLP
2.3.3 Processes in NLP
2.4 Different Approaches for Sentiment Analysis
2.4.1 Lexicon-Based Approach
2.4.2 Machine Learning Approach
2.5 Types of Machine Learning Algorithms
2.5.1 Logistic Regression
2.5.2 Naive Bayes
2.5.3 Random Forest
2.5.4 Decision Tree
2.5.5 Support Vector Machine
2.6 Feature Extraction Method
2.7 Literature Review
2.8 Summary
3 PROPOSED SYSTEM AND IMPLEMENTATION
3.1 Dataset Description
3.2 Proposed System
3.3 Steps of Implementation
3.4 Data Collection
3.5 Data Exploration
3.6 Summary
4 EXPERIMENTAL RESULTS AND CONCLUSION
4.1 Implementation
4.1.1 Import Twitter Datasets in Python
4.1.2 Value Counts in the Sentiment Column
4.1.3 Removing Unused Columns from the Dataset
4.1.4 Entity vs Sentiment Relation
4.1.5 Visualizing the Sentiment Feature (Target)
4.1.6 Dropping Null and Duplicated Values
4.1.7 Dataset Description After Dropping Unused Columns
4.1.8 Convert to Lower Case
4.1.9 Remove Punctuation
4.1.10 Remove Stopwords
4.1.11 Tokenizing the Texts
4.1.12 Remove Words Shorter than Four Characters
4.1.13 Stemming
4.1.14 Lemmatize the Texts
4.1.15 Return Statements
4.1.16 Feature Extraction
4.1.17 Plot WordCloud Target
4.1.18 Train and Test Splitting
4.1.19 Convert Lemmatized Words to Tf-Idf Feature Vectors
4.1.20 Using Logistic Regression
4.1.21 Using Random Forest Classifier
4.1.22 Using Naïve Bayes Classifier
4.1.23 Using DecisionTreeClassifier
4.1.24 Using Support Vector Machine
4.1.25 Comparing Models and Predicting Texts
4.2 Limitation of the Sentiment Analysis System
4.3 Conclusion
REFERENCES
APPENDICES
LIST OF FIGURES
Figure 2. 1 Sentiment Analysis
Figure 2. 2 Twitter Sentiment Analysis in NLP
Figure 2. 3 Natural Language Processing (NLP)
Figure 2. 4 Processes in NLP
Figure 2. 5 Sentiment Analysis using machine learning
Figure 2. 6 Logistic Regression Classification System
Figure 2. 7 Naïve Bayes Classification System
Figure 2. 8 Random Forest Classification
Figure 2. 9 Schematic Diagram for Decision Tree Classification
Figure 3. 1 Training Dataset for Twitter
Figure 3. 2 The Proposed System of Sentiment Analysis
Figure 4. 1 Training Twitter Datasets Description
Figure 4. 2 Values in Sentiment Description
Figure 4. 3 Removing the Tweet ID and Entity Column
Figure 4. 4 Entity and Sentiment Relation
Figure 4. 5 Visualizing Sentiment Features
Figure 4. 6 Drop and Duplicate Description
Figure 4. 7 Dataset Description After Dropping Unused Columns
Figure 4. 8 Convert to Lower Case Description
Figure 4. 9 Remove Punctuation
Figure 4. 10 Remove Stopwords
Figure 4. 11 Tokenizing Description
Figure 4. 12 Remove Words Description
Figure 4. 13 Stemming Description
Figure 4. 14 Lemmatization Description
Figure 4. 15 Return Statement Description
Figure 4. 16 Feature Extraction
Figure 4. 17 WordCloud Description
Figure 4. 18 Train-Test Splitting Description
Figure 4. 19 Tf-Idf Feature Description
Figure 4. 20 Logistic Regression Accuracy Description
Figure 4. 21 Random Forest Accuracy Description
Figure 4. 22 Naïve Bayes Accuracy Description
Figure 4. 23 Decision Tree Classifier Accuracy Description
Figure 4. 24 Support Vector Machine Accuracy Description
Figure 4. 25 Models Comparing and Predicting the Texts
CHAPTER 1
INTRODUCTION
Sentiment analysis is a technique through which you can analyze a piece
of text to determine the sentiment behind it. It combines machine learning and
natural language processing (NLP) to achieve this. Using basic sentiment
analysis, a program can understand whether the sentiment behind a piece of text
is positive, negative, or neutral. It is a powerful technique in artificial
intelligence that has important business applications. For example, we can use
sentiment analysis to analyze customer feedback. After collecting that feedback
through various mediums like Twitter and Facebook, we can run sentiment
analysis algorithms on those text snippets to understand our customers' attitudes
towards our product. Sentiment Analysis is a type of classification where the
data is classified into different classes. With the help of sentiment analysis, we
can find out the nature of opinion that is reflected in documents, websites, social
media feeds, etc. These classes can be binary in nature (positive or negative) or,
they can have multiple classes (happy, sad, angry, etc.).
In this project, one such social networking site, Twitter, which is among the
largest of them, is taken into account. Looking at the statistics, monthly active
users number about 316 million, and on average about 500 million tweets are
sent daily (Twitter, 2016). Because these volumes are extremely high, tweet
content is restricted to a minimal length, and because the text has no uniform
structure, social networking sites such as Twitter and those similar to it pose
challenges for classifiers analyzing their data.

1.1 Problem Statement


We try to implement an NLP Twitter sentiment analysis model that helps
to overcome the challenges of identifying the sentiments of the tweets. The key
point is to find the sentiment of the text data. The task is to identify if the tweets
have a Negative, Positive, or Neutral sentiment towards the company or entity.
We want to compare sentiment from one quarter to the next to see if we need to
take action. Despite the availability of software to extract data regarding a
person’s sentiment on a specific product or service, organizations and other data
workers still face issues regarding data extraction.
1.2 Objectives
The Objectives of this project are as follows.
(a) To build a sentiment analysis model that categorizes words based
on their sentiments.
(b) To analyze Twitter data.
(c) To monitor and analyze social phenomena, predict potentially
dangerous situations, and determine the general mood of the
blogosphere.
1.3 Summary
The introduction and the basic concept of Sentiment analysis are briefly
described in this chapter. Sentiment analysis provides a way to understand the
attitudes and opinions expressed in texts. The background theory of sentiment
analysis is explained in the next chapter.

CHAPTER 2
BACKGROUND THEORY
In this chapter, a definition of sentiment analysis and its methodologies
are presented. Twitter is a good source of information for
individuals' opinions. Twitter receives about 500 million tweets a day, where
people share comments regarding a wide range of topics. Many consumers take
to Twitter to give their opinions on current events, including real-time affairs.
By performing sentiment analysis on these tweets, one can determine the
polarity and inclination of a population towards specific topics, items, or entities.
Retweeting is a largely used mechanism for information diffusion on Twitter. It
has emerged as a simple yet powerful way of circulating information in the
Twitter social realm.
2.1 Sentiment Analysis
Sentiment analysis is contextual mining of text that identifies and extracts
subjective information in the source material and helps a business to understand
the social sentiment of their brand, product, or service while monitoring online
conversations. However, analysis of social media streams is usually restricted to
just basic sentiment analysis and count-based metrics. This is akin to just
scratching the surface and missing out on those high-value insights that are
waiting to be discovered. With the recent advances in deep learning, the ability of
algorithms to analyze text has improved considerably. Creative use of advanced
artificial intelligence techniques can be an effective tool for doing in-depth
research. We believe it is important to classify incoming customer conversations
about a brand along two lines: the key aspects of a brand’s product and service
that customers care about, and users’ underlying intentions and reactions
concerning those aspects. These basic concepts, when used in

combination, become a very important tool for analyzing millions of brand


conversations with human-level accuracy.

Figure 2. 1 Sentiment Analysis


2.2 Twitter Sentiment Analysis
Twitter sentiment analysis, which is a subset of social media sentiment
analysis, helps firms to understand their audience on social channels, stay on top
of what’s being said about their brand – and their rivals – and uncover new
trends in the market by carefully listening to the voice of the consumer on
Twitter.
On Twitter, countless discussions are going on, and these talks have no
boundaries. Politicians use Twitter to speak about their agendas and progress
with the audience. Businesses and brands use Twitter to communicate with their
customers in real time.
Unfortunately, the amount of data generated is so voluminous that apart
from user data, there is a large amount of non-useful data, often known as noise.
This is why it’s difficult for businesses to know which tweets to respond to
initially. Twitter sentiment analysis fills this void.

The Automated Machine Learning Sentiment Analysis Model has been


developed to understand customer perception from the data collected from
Twitter. This is why sentiment analysis has become a crucial tool for social
media marketing strategies.

Figure 2. 2 Twitter Sentiment Analysis in NLP


2.3 Natural Language Processing (NLP)
NLP stands for Natural Language Processing, which is a part of Computer
Science, Human language, and Artificial Intelligence. It is the technology that is
used by machines to understand, analyze, manipulate, and interpret human
languages.

Figure 2. 3 Natural Language Processing (NLP)


NLP helps developers organize knowledge for performing tasks such
as translation, automatic summarization, Named Entity Recognition (NER),

speech recognition, relationship extraction, and topic segmentation. NLP


combines computational linguistics (rule-based modeling of human language)
with statistical, machine learning, and deep learning models. Together, these
technologies enable computers to process human language in the form of text or
voice data and to ‘understand’ its full meaning, complete with the speaker or
writer’s intent and sentiment.
2.3.1 NLP Terminology
(a) Phonology − It is the study of organizing sound systematically.
(b) Morphology − It is a study of the construction of words from
primitive meaningful units.
(c) Morpheme − It is a primitive unit of meaning in a language.
(d) Syntax − It refers to arranging words to make a sentence. It also
involves determining the structural role of words in the sentence
and in phrases.
(e) Semantics − It is concerned with the meaning of words and how to
combine words into meaningful phrases and sentences.
(f) Pragmatics − It deals with using and understanding sentences in
different situations and how the interpretation of the sentence is
affected.
(g) Discourse − It deals with how the immediately preceding sentence
can affect the interpretation of the next sentence.
(h) World Knowledge − It includes the general knowledge about the
world.
2.3.2 Application of NLP
The following are some applications of NLP:
(a) Question Answering
(b) Spam Detection
(c) Sentiment Analysis
(d) Machine Translation

(e) Spelling Correction


(f) Speech Recognition
(g) Chatbot
(h) Information Extraction
(i) Natural Language Understanding (NLU)
2.3.3 Processes in NLP
There are nine main processes in NLP:
(a) Sentence Segmentation
(b) Word Tokenization
(c) Stemming
(d) Lemmatization
(e) Identifying Stop Words
(f) Dependency Parsing
(g) Part of Speech (POS) Tags
(h) Named Entity Recognition (NER)
(i) Chunking

Figure 2. 4 Processes in NLP
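
As a concrete illustration of several of these processes, the following minimal sketch (assuming the NLTK library and its standard resource downloads) tokenizes a sentence, removes stop words, and then stems and lemmatizes the remaining tokens:

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # lemmatizer dictionary

sentence = "The players were running quickly across the fields"
tokens = word_tokenize(sentence.lower())       # word tokenization
stop = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stop]  # stop word removal
print([PorterStemmer().stem(t) for t in tokens])           # stems, e.g. 'running' -> 'run'
print([WordNetLemmatizer().lemmatize(t) for t in tokens])  # lemmas, e.g. 'fields' -> 'field'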


2.4 Different Approaches for Sentiment Analysis
There are many approaches used for sentiment analysis on linguistic data,
and the choice of approach depends on the nature of the data and the

platform you are working on. Most research carried out in the field of sentiment
analysis employs lexicon-based analysis or machine-learning techniques.
Machine learning techniques control the data processing by the use of machine
learning algorithms and by classifying the linguistic data by representing them in
vector form (Olsson et al. 2009). On the other hand, the Lexicon-based (also
called Dictionary-based) approach classifies the linguistic data using a dictionary
lookup database. During this classification, it computes sentence or document-
level sentiment polarity using lexicon databases for processing linguistic data
like WordNet, SentiWordNet, and Treebanks. In this section, a brief discussion
of lexicon-based and Machine Learning approaches has been outlined.
2.4.1 Lexicon-Based Approach
The lexicon-based approach predicts the sentiments by using lexical
databases like SentiWordNet and WordNet. It obtains a score for each word in
the sentence or document and annotates using the features from the lexicon
database that are present. It derives text polarity based on a set of words, each of
which is annotated with the weight and extracts information that contributes to
concluding the overall sentiments of the text. Also, it is necessary to pre-process
data before assigning the weight to the words.
Moreover, a lexicon dictionary or database contains opinionated words
classified as positive or negative, together with a description of the context in
which each word occurs. Each word in the document is assigned a numeric
score, an average score is computed by summing all the numeric scores, and a
sentiment polarity is assigned to the document. There are
three general-purpose lexicons. They are:
(a) AFINN
(b) Bing
(c) Loughran
These three lexicons make use of unigrams. A unigram is an n-gram
consisting of a sequence of one item, that is, a single word collected from the
given textual data. The AFINN lexicon scores words on a scale from -5 to +5,
where more negative scores correspond to negative sentiment and more positive
scores to positive sentiment. The Bing lexicon, on the other hand, classifies
sentiment into a binary category of negative or positive. Finally, the Loughran
lexicon is designed for analyzing shareholders’ reports. In this project, however,
we extract sentiments from our data with the machine learning approach
described next rather than with a lexicon.
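
To make the lexicon idea concrete, the sketch below scores a sentence against a tiny hand-made AFINN-style dictionary; the words and weights here are invented for illustration and are not the real AFINN entries:

# Hypothetical AFINN-style weights (illustrative, not the real lexicon)
lexicon = {"love": 3, "great": 3, "good": 2, "bad": -2, "terrible": -3, "hate": -3}

def lexicon_score(text):
    words = text.lower().split()
    score = sum(lexicon.get(w, 0) for w in words)  # unigram lookup; unknown words score 0
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_score("I love this great phone"))  # positive (3 + 3 = 6)
print(lexicon_score("the battery is terrible"))  # negative (-3)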
2.4.2 Machine Learning Approach
The machine learning approach is widely seen in the literature on sentiment
analysis. In this approach, the words in a sentence are represented as vectors
and analyzed using different machine learning algorithms such as Naïve Bayes,
SVM, and Maximum Entropy. The data is prepared so that it can be fed to
machine learning algorithms: each tweet is annotated with its total positive and
negative scores, and a sentiment label (‘POSITIVE’, ‘NEGATIVE’, or
‘NEUTRAL’) is assigned to it. These labeled datasets are then used to train
machine learning algorithms and to measure the accuracy, performance, and
reliability of the results obtained from lexicon-based sentiment analysis. Figure
2.5 shows the most abstract view of sentiment analysis using machine learning.

Figure 2. 5 Sentiment Analysis using machine learning


2.5 Types of Machine Learning Algorithms
The proposed system uses the following machine learning algorithms:
(a) Logistic Regression
(b) Naive Bayes
(c) Random Forest
(d) Decision Tree
(e) Support Vector Machine
2.5.1 Logistic Regression
Logistic regression is a statistical analysis technique that uses previous
observations from a data set to predict a binary outcome, such as yes or no. A
logistic regression model predicts a dependent data variable by analyzing the
correlation between one or more pre-existing independent variables.
For instance, logistic regression could be used to forecast whether a
candidate for office will win or lose, or whether a high school student will be
admitted to a particular institution. These binary outcomes allow
straightforward decisions between two alternatives. A logistic regression model
can take multiple input criteria into consideration.

Figure 2. 6 Logistic Regression Classification System


Logistic regression has become an important tool in the discipline of
machine learning. It allows algorithms used in machine learning applications to
classify incoming data based on historical data. As additional relevant data
comes in, the algorithms get better at predicting classifications within data sets.
Logistic regression can also play a role in data preparation activities by allowing
data sets to be put into specifically predefined buckets during the extract,
transform, and load (ETL) process to stage the information for analysis.
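
A minimal sketch of logistic regression for text classification with scikit-learn; the toy texts and labels below are invented for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["great product, love it", "worst purchase ever",
         "really happy with this", "terrible, do not buy"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

vec = TfidfVectorizer()
X = vec.fit_transform(texts)               # texts -> TF-IDF feature vectors
clf = LogisticRegression().fit(X, labels)  # learn a weight per vocabulary term

new = vec.transform(["love this, really great"])
print(clf.predict(new))        # predicted class
print(clf.predict_proba(new))  # class probabilities from the logistic function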
2.5.2 Naive Bayes
The Naïve Bayes algorithm is a supervised learning algorithm based on
the Bayes theorem and used for solving classification problems. It is mainly
used in text classification, which involves high-dimensional training datasets.
One of the most straightforward and efficient classification algorithms, the
Naive Bayes classifier aids in the development of fast machine learning models
capable of making prompt predictions. As a probabilistic classifier, it makes
predictions based on the likelihood that an object belongs to a class. Spam
filtration, sentiment analysis, and classification are a few common uses of the
Naive Bayes algorithm.

Bayes’ Theorem finds the probability of an event occurring given the
probability of another event that has already occurred. The formula of Bayes’
theorem is shown in the equation:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]

where,
A, B = events
P(A | B) = probability of A given B is true
P(B | A) = probability of B given A is true
P(A), P(B) = the independent probabilities of A and B
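
In practice, scikit-learn's MultinomialNB estimates these probabilities from word counts. A minimal sketch on invented toy data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["free prize click now", "meeting at noon tomorrow",
         "win free money now", "lunch with the team"]
labels = ["spam", "ham", "spam", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(texts)         # word-count features
nb = MultinomialNB().fit(X, labels)  # learns P(word | class) and P(class)

test = vec.transform(["free money tomorrow"])
print(nb.predict(test))  # the class with the highest posterior P(class | words)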

Figure 2. 7 Naive Bayes Classification System


2.5.3 Random Forest
Several decision trees are grown and combined to form a "forest" in the
Random Forest supervised machine learning technique. It is possible to employ
Random Forest to solve both classification and regression problems. For a more
precise forecast, Random Forest produces numerous decision trees that are then
combined.
A random forest is a machine learning method for tackling classification
and regression issues. It makes use of ensemble learning, a method for solving
complicated issues by combining a number of classifiers. A random forest
algorithm contains many different decision trees, and the "forest" it creates is
trained via bagging, or bootstrap aggregation, an ensemble meta-algorithm that
increases the accuracy of machine learning algorithms. Based on the predictions
of the decision trees, the random forest algorithm determines the result: by
majority vote for classification, or by averaging the outputs of the individual
trees for regression. The accuracy of the result grows as the number of trees
increases.
The Gini index formula is shown in equation 2.2, and the weighted Gini
index formula in equation 2.3:

\[ \text{Gini}(T) = 1 - \sum_{i=1}^{n} P_i^2 \tag{2.2} \]

For a binary node this is \( 1 - P_{(+)}^2 - P_{(-)}^2 \), and the weighted Gini
index of a split is the size-weighted average of the Gini indices of its branches:

\[ \text{Weighted Gini Index} = \frac{N_{(+)}\,\text{Gini}_{(+)} + N_{(-)}\,\text{Gini}_{(-)}}{N_{(+)} + N_{(-)}} \tag{2.3} \]

where,
P_i = the probability of class i
P(+) = the probability of the positive class
P(−) = the probability of the negative class
N(+) = the total number of the positive class
N(−) = the total number of the negative class

The algorithm computes the Gini index of all the possible splits and
chooses the feature with the lowest Gini index for the root node. Then, for each
decision tree, the result is decided. Finally, the random forest classifier uses
majority voting over the trees to choose the best result.
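
The following sketch computes the Gini index of a node and the weighted Gini index of a split directly from the formulas above:

def gini(labels):
    # Gini(T) = 1 - sum of P_i^2 over the classes present in the node
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(left, right):
    # Size-weighted average of the child nodes' Gini indices
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

node = ["+", "+", "+", "-", "-"]  # 3 positive, 2 negative
print(gini(node))                 # 1 - (0.6^2 + 0.4^2) = 0.48
print(weighted_gini(["+", "+", "+"], ["-", "-"]))  # pure children -> 0.0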

Figure 2. 8 Random Forest Classification


2.5.4 Decision Tree
A Decision Tree is a supervised Machine learning algorithm. It is used in
both classification and regression algorithms. The decision tree is like a tree with
nodes. The branches depend on a number of factors. It splits data into branches
like these till it achieves a threshold value. A decision tree consists of the root
nodes, children’s nodes, and leaf nodes.
Let’s understand decision tree methods by taking one real-life
scenario. Imagine that you play football every Sunday and you always invite
your friend to come and play with you. Sometimes your friend actually comes
and sometimes he doesn’t. Whether or not he comes depends on numerous
factors, like weather, temperature, wind, and fatigue. We start to take all of
these features into consideration and begin tracking them alongside your
friend’s decision whether to come to play or not.
You can use this data to predict whether or not your friend will come to
play football, and the technique you could use is a decision tree. Figure 2.9
shows what such a decision tree looks like after implementation.

Every decision tree consists of the following elements:
(a) Nodes: the points where the tree splits according to the value of
some attribute/feature of the dataset.
(b) Edges: they direct the outcome of a split to the next node. In the
figure, there are nodes for features like outlook, humidity, and
windy, and there is an edge for each potential value of each of
those attributes/features.
(c) Root: the node where the first split takes place.
(d) Leaves: the terminal nodes that predict the outcome of the
decision tree.

Figure 2. 9 Schematic Diagram for Decision Tree Classification
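
A minimal sketch of the football scenario with scikit-learn's DecisionTreeClassifier; the weather records below are invented for illustration:

from sklearn.tree import DecisionTreeClassifier

# Invented records: [outlook (0 = sunny, 1 = rain), windy (0/1), temperature in C]
X = [[0, 0, 25], [0, 1, 22], [1, 0, 18], [1, 1, 15], [0, 0, 30], [1, 1, 10]]
y = [1, 1, 0, 0, 1, 0]  # 1 = the friend comes to play, 0 = he stays home

tree = DecisionTreeClassifier(random_state=1).fit(X, y)
print(tree.predict([[0, 0, 27]]))  # sunny, calm, warm: the tree predicts he comes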


2.5.5 Support Vector Machine
SVM stands for Support Vector Machines. It is a popular machine
learning algorithm used for classification and regression analysis. SVM is a
powerful and versatile algorithm that can be used for a variety of applications,
including image classification, text classification, and bioinformatics.

The main idea behind SVM is to find a hyperplane that maximally


separates the data points of different classes in a high-dimensional space. The
hyperplane is defined by a set of weights and biases, which are optimized during
the training process. The hyperplane is chosen so that it maximizes the margin
between the two closest data points of different classes. The data points that lie
closest to the hyperplane are called support vectors, and they are used to define
the hyperplane.
SVM can handle both linearly separable and non-linearly separable
datasets by using different types of kernels, such as linear, polynomial, and
radial basis function (RBF) kernels. The choice of kernel depends on the nature
of the data and the complexity of the decision boundary that needs to be learned.
SVM has several advantages, including:
(a) High accuracy and performance on a wide range of datasets.
(b) Ability to handle high-dimensional data with many features.
(c) Robustness to outliers and noise in the data.
(d) Flexibility to handle both linearly and non-linearly separable data.
However, SVM also has some disadvantages, such as:
(a) Sensitivity to the choice of kernel and its parameters
(b) Difficulty in interpreting the learned model
(c) High computational and memory requirements for large datasets
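
A minimal sketch of trying the different kernels with scikit-learn's SVC on an invented two-class set of 2-D points:

from sklearn import svm

X = [[0, 0], [1, 1], [1, 0], [0, 1], [3, 3], [4, 4], [4, 3], [3, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # two well-separated groups

for kernel in ("linear", "poly", "rbf"):  # the kernel choice depends on the data
    clf = svm.SVC(kernel=kernel).fit(X, y)
    print(kernel, clf.predict([[0.5, 0.5], [3.5, 3.5]]))

print(clf.support_vectors_)  # the points closest to the separating hyperplane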
2.6 Feature Extraction Method
Feature extraction is a process of dimensionality reduction by which an
initial set of raw data is reduced to more manageable groups for processing. A
characteristic of these large data sets is a large number of variables that require a
lot of computing resources to process. Feature extraction is the name for
methods that select and/or combine variables into features, effectively reducing
the amount of data that must be processed, while still accurately and completely
describing the original data set. The process of feature extraction is useful when
you need to reduce the number of resources needed for processing without losing

important or relevant information. Feature extraction can also reduce the amount
of redundant data for a given analysis. Also, the reduction of the data and the
machine’s efforts in building variable combinations (features) facilitate the
speed of learning and generalization steps in the machine learning process.
Data scientists use many feature extraction methods to tap into the value
of raw data sources. Let’s look at three of the most common and how they’re
used to extract data useful for machine learning applications.
(a) Image Processing
(b) Bag of Words
(c) Autoencoder
In this project, we are using the TF-IDF feature extraction method.
TF-IDF (term frequency-inverse document frequency) is a statistical technique
that assesses how relevant a word is to a document within a collection of
documents. This is achieved by multiplying a word’s frequency in a document
by its inverse document frequency over a group of documents. It is highly
useful for scoring words in machine learning algorithms for Natural Language
Processing (NLP) and has a wide range of applications, with automated text
analysis being the most essential one. TF-IDF was developed for document
search and information retrieval. A word’s score increases with the frequency
with which it appears in a document, but is counterbalanced by the number of
documents in which the word appears. So, words that are common in every
document, such as this, what, and if, rank low even though they may appear
many times, since they don’t mean much to that document in particular.
However, if the word "Bug" appears frequently in one document but not
in another, it is presumably because it is highly relevant. For instance, if our goal
is to identify the subjects to which specific NPS replies belong, the term "bug" is
likely to be associated with the topic "reliability" because most responses
including that word will be related to that topic.
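
A minimal sketch of computing TF-IDF scores with scikit-learn over a tiny invented corpus, showing how a common word like 'the' scores lower than a discriminative word like 'bug':

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = ["the app has a bug", "the app is great", "fix the login bug"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # rows = documents, columns = vocabulary terms

scores = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
print(scores.round(2))  # 'the' appears everywhere, so its weight is relatively low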

2.7 Literature Review


Raksha Sharma and Yogesh Kumar Sharma [8] describe sentiment
analysis of Twitter data using machine learning techniques, analyzing the
effectiveness of different machine learning algorithms for this task. The authors
collected a dataset of tweets related to different topics and sentiments.
and sentiments. They preprocessed the data by removing stop words, stemming,
and converting to lowercase. They then used various machine learning
algorithms, including Naive Bayes, Support Vector Machines, and Decision
Trees, to classify the tweets as positive, negative, or neutral. They evaluated the
accuracy of each algorithm using metrics such as precision, recall, and F1 score.
The study found that the Support Vector Machines (SVM) algorithm achieved
the highest accuracy of 87.52% for sentiment analysis of Twitter data, followed
by the Naive Bayes algorithm with an accuracy of 84.64%. The Decision Trees
algorithm achieved the lowest accuracy of 73.53%.
Kalpana and Vasuki [9] conduct a comparative study to evaluate the
effectiveness of various sentiment analysis techniques for Twitter data. They
collect a dataset of tweets related to different domains and preprocess the data by
removing stop words, stemming, and converting to lowercase. The authors use
Naive Bayes, Maximum Entropy, and Support Vector Machines algorithms to
classify the tweets as positive, negative, or neutral and evaluate the performance
of each technique using accuracy, precision, recall, and F1 score. The study
found that Support Vector Machines achieved the highest accuracy of 87.3% for
sentiment analysis of Twitter data, followed by Maximum Entropy with an
accuracy of 85.2%. The study concludes that the choice of technique depends on
the specific domain and the characteristics of the data, and recommends
appropriate preprocessing techniques and feature selection methods to improve
the performance of sentiment analysis techniques. This article provides valuable
insights for researchers and practitioners working in the area of sentiment
analysis of social media data.

2.8 Summary
The background theories of sentiment analysis and natural language
processing are briefly described in this chapter. There are two different
approaches to sentiment analysis: the lexicon-based and the machine
learning-based approach. In our proposed system, we use Logistic Regression,
Naive Bayes, Random Forest, Support Vector Machine, and Decision Tree as
machine learning algorithms, and the TF-IDF feature for feature extraction.
Twitter sentiment analysis, sentiment analysis, machine learning, and feature
extraction methods are discussed. The implementation of Twitter sentiment
analysis will be described in the next chapter.

CHAPTER 3
PROPOSED SYSTEM AND IMPLEMENTATION
In this section, we will discuss the Sentiment Analysis Project using NLP
techniques.
3.1 Dataset Description
The Twitter sentiment analysis dataset is taken from the Kaggle
website. The training dataset contains 61,691 rows and 4 columns. We focused
our selection on datasets that are: (i) publicly available to the research
community, (ii) manually annotated, providing a reliable set of judgements over
the tweets, and (iii) usable to evaluate sentiment analysis models, giving for
each tweet the entity or company to which it refers and an assigned sentiment.
The columns present in the dataset are:
(a) Sentiment: the polarity of the tweet (positive, negative, neutral, or irrelevant)
(b) Tweet ID: the unique id of the tweet
(c) Entity: the name of the mentioned entity
(d) Tweet_content: the text of the tweet
Figure 3.1 displays a sample of the dataset used in the project:

Figure 3. 1 Training Dataset for Twitter



3.2 Proposed System


The proposed system for the sentiment analysis system is as follows. The
Twitter dataset is first preprocessed and cleaned, and the cleaned text is
converted into feature vectors with TF-IDF feature extraction. The data is then
split into training and testing sets, the training data is used to build models with
the machine learning algorithms, and the best model is selected. New data is
preprocessed in the same way and passed to the best model to produce the
prediction result.

Figure 3. 2 The Proposed System of Sentiment Analysis


3.3 Steps of Implementation
The main steps are as follows.
(a) Step 1 Data Collection
(b) Step 2 Data Exploration
(c) Step 3 Data Cleaning
(d) Step 4 Feature Extraction
(e) Step 5 Train-Test Splitting
(f) Step 6 Model Building
(g) Step 7 Model Evaluation
(h) Step 8 Prediction
(i) Step 9 Result

3.4 Data Collection


Data collection is one of the most important and crucial aspects of the
sentiment analysis application. Given the wide adoption of machine learning
models, simply having large datasets for a domain-specific task does not ensure
superior performance. The performance of the model depends on the quality of
the dataset and its labelling/annotation. As ML models learn from the data they
are trained with, automatic predictions are likely to mirror the human
disagreement identified during annotation. As a result, having a proper
guideline to annotate data is also of utmost importance (Mohammad, S. 2016).
Ways to collect data for sentiment analysis:
(a) Using an API provided by a social media platform, which allows
data to be collected in a streaming fashion (see the sketch after
this list). Example: the Twitter API to extract tweets by hashtags,
or the News API to extract news by category from different news
publishers.
(b) Using web scrapers that crawl web data and collect specified
information from web pages (HTML documents). Example:
Scrapy and Beautiful Soup are Python web-scraping packages that
extract information from the web, such as news articles or
comments from blogs, by parsing HTML tags.
(c) Using a web browser plugin with which users can extract
information from any public website and export the data to the
desired file format. Example: Webscraper.io is a free extension for
the Google Chrome web browser.
(d) Using existing open-source repositories of data that are cleaned
and compiled and can be used directly. Example: Rotten
Tomatoes, IMDB movie reviews, Yelp, Amazon product reviews,
and Twitter tweets on Kaggle and other websites.
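
As an illustration of option (a), here is a minimal sketch using the Tweepy library for the Twitter API. The bearer token and query are placeholders, and access to the search endpoint depends on the developer account tier:

import tweepy  # third-party Twitter API client

# Placeholder credential; obtain a real bearer token from the Twitter developer portal
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Pull recent tweets matching a query (availability depends on the API access level)
response = client.search_recent_tweets(query="#machinelearning -is:retweet",
                                       max_results=10)
for tweet in response.data or []:
    print(tweet.text)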

3.5 Data Exploration


Data exploration is an important first step in any data analysis
project, including sentiment analysis on a Twitter dataset. Here are some
key steps we can take to explore the Twitter sentiment analysis dataset:
(a) Understand the dataset: Begin by understanding the structure of the
dataset. This includes the number of tweets, the date range covered
by the dataset, and the attributes of each tweet, such as the text of
the tweet, the user who posted it, and any metadata like retweets,
hashtags, or mentions.
(b) Data cleaning: Before exploring the data, it is important to clean it.
This involves removing any duplicate tweets, removing tweets that
contain irrelevant information, and correcting any misspellings or
other errors in the text.
(c) Text preprocessing: Once the data is cleaned, we should preprocess
the text of the tweets. This involves removing stop words (common
words like "the" and "and"), stemming (reducing words to their
base form, such as "running" to "run"), and converting all text to
lowercase.
(d) Visualization: Visualizing the data can help us identify patterns
and trends in the dataset. We can use tools like word clouds, bar
charts, and scatterplots to visualize the frequency of certain words,
the sentiment of tweets over time, and other relevant metrics.
(e) Sentiment analysis: Finally, we can perform sentiment analysis on
the dataset to classify each tweet as positive, negative, or neutral.
This can help us identify the overall sentiment of the dataset, as
well as any patterns in sentiment over time or across different user
groups.

By following these steps, we can gain a deeper understanding of our


Twitter sentiment analysis dataset and uncover insights that can inform our
analysis.
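
A minimal sketch of the first exploration steps with pandas, assuming the Kaggle file name and the column layout from Section 3.1:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# File name assumed from the Kaggle Twitter entity sentiment dataset
df = pd.read_csv("twitter_training.csv",
                 names=["Tweet ID", "entity", "sentiment", "Tweet_content"])
print(df.shape)                        # number of tweets and columns
print(df["sentiment"].value_counts())  # class balance of the target

sns.countplot(x=df["sentiment"])       # visualize the sentiment distribution
plt.show()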
3.6 Summary
In this chapter, we described the implementation of our proposed system
design and the datasets used in our project. The process flow diagram showed
the steps used in our project. The experimental results will be described in the
next chapter.

CHAPTER 4
EXPERIMENTAL RESULTS AND CONCLUSION
In this chapter, the experimental results of the models and the limitations
of sentiment analysis will be described.
4.1 Implementation
Firstly, we download the Twitter dataset from Kaggle and import it into
Python.
4.1.1 Import Twitter Datasets in Python

Figure 4. 1 Training Twitter datasets Description


4.1.2 Value Counts in the Sentiment Column

Figure 4. 2 Values in Sentiment Description



4.1.3 Removing Unused Columns from the Dataset

Figure 4. 3 Removing the Tweet ID and Entity Column


4.1.4 Entity vs Sentiment Relation

Figure 4. 4 Entity and Sentiment Relation



4.1.5 Visualizing the Sentiment Feature (Target)

Figure 4. 5 Visualizing Sentiment Features


4.1.6 Dropping Null and Duplicated Values

Figure 4. 6 Drop and Duplicate Description



4.1.7 Dataset Description After Dropping Unused Columns

Figure 4. 7 Dataset Description After Dropping Unused Columns


4.1.8 Convert to Lower Case

Figure 4. 8 Convert to Lower Case Description



4.1.9 Remove Punctuation

Figure 4. 9 Remove punctuation


4.1.10 Remove Stopwords

Figure 4. 10 Remove Stopwords


4.1.11 Tokenizing the Texts

Figure 4. 11 Tokenizing Description



4.1.12 Remove Words Shorter than Four Characters

Figure 4. 12 Remove Words Description


4.1.13 Stemming

Figure 4. 13 Stemming Description


4.1.14 Lemmatize the Texts

Figure 4. 14 Lemmatization Description



4.1.15 Return Statements

Figure 4. 15 Return Statement Description


4.1.16 Feature Extraction

Figure 4. 16 Feature Extraction



4.1.17 Plot WordCloud Target

Figure 4. 17 WordCloud Description


4.1.18 Train and Test Splitting

Figure 4. 18 Train-Test Splitting Description



4.1.19 Convert Lemmatized Words to Tf-Idf Feature Vectors

Figure 4. 19 Tf-Idf Feature Description


4.1.20 Using Logistic Regression

Figure 4. 20 Logistic Regression Accuracy Description



4.1.21 Using Random Forest Classifier

Figure 4. 21 Random Forest Accuracy Description



4.1.22 Using Naïve Bayes Classifier

Figure 4. 22 Naïve Bayes Accuracy Description


4.1.23 Using DecisionTreeClassifier

Figure 4. 23 Decision Tree Classifier Accuracy Description



4.1.24 Using Support Vector Machine

Figure 4. 24 Support Vector Machine Accuracy Description


4.1.25 Comparing Models and Predicting Texts

Figure 4. 25 Models Comparing and Predicting the Texts



4.2 Limitation of the Sentiment Analysis System

In this project, the dataset comes from an existing Kaggle repository
rather than from surveys or reviews collected directly from social platforms.
Only analysis and visualization are presented in this project; in the future, the
work can be carried forward to decision making and recommendations.
4.3 Conclusion
Sentiment analysis provides a way to understand the attitudes and
opinions expressed in texts. This project explored how to approach Twitter
sentiment analysis with NLP preprocessing, TF-IDF feature extraction, and five
machine learning classifiers: Logistic Regression, Naïve Bayes, Random Forest,
Decision Tree, and Support Vector Machine. Among these, Random Forest
achieved the best accuracy of 93.94%. Sentiment analysis of this kind can be
used to understand public attitudes towards an entity and to identify which
words with emotional and opinion content are important for a particular text.
We will continue to develop and apply sentiment analysis to different kinds of
text in our case studies.

REFERENCES
[1] M. Shivanandhan, "What is Sentiment Analysis?".
[2] A. Rajput ([email protected]), "Natural Language Processing, Sentiment
Analysis and Clinical Analytics," Information System Department, Effat
University, Jeddah, Saudi Arabia.
[3] https://data-flair.training/blogs/data-science-r-sentiment-analysis-project/
[4] https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17
[5] https://www.javatpoint.com/nlp
[6] https://www.tidytextmining.com/sentiment.html
[7] https://lazarinastoy.com/sentiment-analysis-theory-methods-applications/
[8] R. Sharma and Y. K. Sharma, "Sentiment analysis of Twitter data using
machine learning techniques," International Journal of Advanced Computer
Science and Applications, 11(8), 174-180, 2020.
https://doi.org/10.14569/IJACSA.2020.0110824
[9] T. Kalpana and V. Vasuki, "A comparative study on sentiment analysis
techniques in Twitter data," Journal of Advanced Research in Dynamical and
Control Systems, 10(5), 273-282, 2018.
APPENDIX
import pandas as pd
import numpy as np
import string as st
from wordcloud import WordCloud #Word visualization
import matplotlib.pyplot as plt #Plotting properties
import seaborn as sns #Plotting properties
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer #Data transformation
from sklearn.model_selection import train_test_split #Data testing
from sklearn.linear_model import LogisticRegression #Prediction Model
import xgboost
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, confusion_matrix #Comparison between
real and predicted
import re #Regular expressions
import nltk
from nltk import word_tokenize
from nltk import PorterStemmer, WordNetLemmatizer
nltk.download('stopwords')
import warnings

# Load the training and validation sets (file names assumed from the Kaggle
# Twitter entity sentiment dataset)
train_df = pd.read_csv('twitter_training.csv',
                       names=['Tweet ID', 'entity', 'sentiment', 'Tweet_content'])
valid_df = pd.read_csv('twitter_validation.csv',
                       names=['Tweet ID', 'entity', 'sentiment', 'Tweet_content'])
train_df.head()
valid_df.head()
train_df.shape
valid_df.shape
data = pd.concat([train_df, valid_df]).reset_index(drop = True)
data.shape
from sklearn.utils import shuffle
data = shuffle(data)
data = data.reset_index(drop=True)
data.head()
data.drop(["Tweet ID"],axis=1,inplace=True)
data.head()
Twitter_sentiment = data.groupby(['entity', 'sentiment']).sentiment.count().unstack()
Twitter_sentiment.plot(kind='bar',figsize=(10,5))
data.drop(["entity"],axis=1,inplace=True)
data.head()
data['sentiment'].value_counts()
plt.figure(figsize=(10,10))#Visualizing Sentiment feature - Target feature
plt.subplot(1,2,1)
sns.set(style ='whitegrid')
pd.set_option('display.max_columns',None)
sns.countplot(x=data['sentiment'])
plt.subplot(1,2,2)#plot countplot
data.sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%',figsize=(10,5))
plt.show()
data.isnull().sum()# check null
data =data.dropna()
data.isnull().sum()# check null
data.duplicated().sum()
data = data.drop_duplicates()
data.duplicated().sum()
data['sentiment'].value_counts()
data["sentiment"] =data["sentiment"].map({"Positive": 3, "Neutral": 2, "Negative":
1,"Irrelevant":0})
data
from nltk.stem.snowball import SnowballStemmer
import re ,string, unicodedata, html
import nltk
from nltk import word_tokenize , WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
data['Tweet_content'] = data['Tweet_content'].apply(lambda x: x.lower())
data.head()
import string
def punctuation_removal(text):
all_list = [char for char in text if char not in string.punctuation]
clean_str = ''.join(all_list)
return clean_str
data['Tweet_content'] = data['Tweet_content'].apply(punctuation_removal)
data.head()
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['Tweet_content'] = data['Tweet_content'].apply(lambda x: ' '.join([word for word in
x.split() if word not in (stop)]))
data.head()
def tokenize(text):
text = re.split(r'\s+', text)
return [x.lower() for x in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda msg : tokenize(msg))
data.head()
def remove_small_words(text):
return [x for x in text if len(x) > 3 ]
data['Tweet_content'] = data['Tweet_content'].apply(lambda x :
remove_small_words(x))
data.head()
def stemming(text):
ps = PorterStemmer()
return [ps.stem(word) for word in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda wrd: stemming(wrd))
data.head()
def lemmatize(text):
word_net = WordNetLemmatizer()
return [word_net.lemmatize(word) for word in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda x : lemmatize(x))
data.head()
def return_sentences(lemmatize):
return " ".join([word for word in lemmatize])
data['Tweet_content'] = data['Tweet_content'].apply(lambda x : return_sentences(x))
data.head()
import re
def remove_numbers(text):
return re.sub(r'\d+', '', text)
data['Tweet_content']=data['Tweet_content'].apply(lambda x : remove_numbers(x))
data.head()
data.Tweet_content[0]  # inspect a sample tweet after cleaning

# WordClouds per sentiment class
plt.figure(figsize=(16, 12))
plt.subplot(2,2,1)
# Positive
plt.grid(False)
plt.title("for positive")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 3].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,2)
# Negative
plt.grid(False)
plt.title("for negative")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 1].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,3)
# Neutral
plt.grid(False)
plt.title("for neutral")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 2].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,4)
# Irrelevant
plt.grid(False)
plt.title("for irrelevant")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 0].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.show()
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
print(x.shape)
print(y.shape)
x
tr_idf_model = TfidfVectorizer()
X = tr_idf_model.fit_transform(data['Tweet_content'])
#X_test = tr_idf_model.transform(y)
tf_idf_array = X.toarray()
tfidf_tokens = tr_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame(tf_idf_array, columns = tfidf_tokens)
df_tf_idf
X_train, X_test, y_train, y_test = train_test_split(X, data['sentiment'],
                                                    test_size=0.1, random_state=1)
print(X_train.shape)
print(X_test.shape)
train=0.90*100  # matches test_size=0.1 in the split above
test=0.10*100
labels=['Train','Test']
sizes=[train,test]
colors=['yellow','green']
fig1,ax1=plt.subplots()
ax1.pie(sizes, colors=colors, labels=labels,autopct='%1.1f%%',startangle=90)
ax1.axis('equal')
plt.title('Train Vs Test')
plt.show()
y_train.value_counts()
dct=dict()
lr = LogisticRegression(max_iter = 500)
lr.fit(X_train, y_train)
print('Logistic Regression model fitted..')
pred = lr.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['LogisticRegression']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print('Random Forest Classifier model fitted..')
pred = rf.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test, pred)))
dct['RdmFC']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d") nb =
MultinomialNB()
nb.fit(X_train,y_train)
pred = nb.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['MultinomialNB']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn.tree import DecisionTreeClassifier
#Decision Tree
dtc=DecisionTreeClassifier(random_state=1)
dtc.fit(X_train,y_train)
pred = dtc.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['DecisionTreeClassifier']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
model = clf.fit(X_train, y_train)
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
dct['SVM'] = round(accuracy_score(y_test, prediction)*100,2)
accuracy_score(y_test,prediction)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,prediction),annot = True,fmt = "d")  # SVM predictions
colors = ["green", "orange", "blue" , "red" ,"yellow"]
sns.set_style("whitegrid")
plt.figure(figsize=(7,4))
plt.title('Comparing Models',fontsize=20)
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy %",fontsize=18)
plt.xlabel("Algorithms",fontsize=18)
sns.barplot(x=list(dct.keys()), y=list(dct.values()), palette=colors)
plt.show()
text = ["I love the musics."]
texts = tr_idf_model.transform(text)
print(texts)
test1=lr.predict(texts)
test2=nb.predict(texts)
test3=dtc.predict(texts)
test4=rf.predict(texts)  # Random Forest (xgboost was imported but never fitted)
print(test1,test2,test3,test4)
text = ["I hate you fucking."]
texts = tr_idf_model.transform(text)
print(texts)
test1=lr.predict(texts)
test2=nb.predict(texts)
test3=dtc.predict(texts)
test4=rf.predict(texts)
print(test1,test2,test3,test4)
