Twitter Analysis
By
Captain Lin Htet Paing
SECOND SEMESTER
(COURSE PROJECT)
MARCH, 2023
PYIN OO LWIN
ABSTRACT
With the growth of web technology, a huge volume of data is available on the
web to internet users, and more is generated every day. The Internet has become
a platform for online learning, exchanging ideas, and sharing opinions. Social
networking sites like Twitter, Facebook, and Google+ are rapidly gaining
popularity as they allow people to share and express their views about topics,
have discussions with different communities, or post messages across the world.
There has been a lot of work in the field of sentiment analysis of Twitter
data. This survey focuses mainly on sentiment analysis of Twitter data, which
helps to analyze the information in tweets, where opinions are highly
unstructured, heterogeneous, and either positive, negative, or in some cases
neutral. Sentiment analysis (also known as opinion mining or emotion AI) is the
use of natural language processing, text analysis, computational linguistics,
and biometrics to systematically identify, extract, quantify, and study
affective states and subjective information. In this paper, we study Twitter
data streams using several machine learning algorithms: Naive Bayes, Logistic
Regression, Random Forest, Decision Tree, and Support Vector Machine. We also
discuss general challenges and applications of sentiment analysis on Twitter.
Random Forest was the best model in this project, with an accuracy of 93.94%.
CHAPTER 2
BACKGROUND THEORY
In this chapter, a definition of sentiment analysis, its methodologies, and a
summary are presented. Twitter is a good source of information for
individuals' opinions. Twitter receives about 500 million tweets a day, where
people share comments regarding a wide range of topics. Many consumers take
to Twitter to give their opinions on current events, including real-time affairs.
By performing sentiment analysis on these tweets, one can determine the
polarity and inclination of a population towards specific topics, items, or entities.
Retweeting is a largely used mechanism for information diffusion on Twitter. It
has emerged as a simple yet powerful way of circulating information in the
Twitter social realm.
2.1 Sentiment Analysis
Sentiment analysis is contextual mining of text that identifies and extracts
subjective information in the source material and helps a business to understand
the social sentiment of their brand, product, or service while monitoring online
conversations. However, analysis of social media streams is usually restricted to
just basic sentiment analysis and count-based metrics. This is akin to just
scratching the surface and missing out on those high-value insights that are
waiting to be discovered. With the recent advances in deep learning, the ability of
algorithms to analyze text has improved considerably. Creative use of advanced
artificial intelligence techniques can be an effective tool for doing in-depth
research. We believe it is important to classify incoming customer conversations
about a brand along the following lines: the key aspects of a brand's
product and service that customers care about, and users' underlying intentions
and reactions concerning those aspects. These basic concepts when used in
platform you are working on. Most research carried out in the field of sentiment
analysis employs lexicon-based analysis or machine-learning techniques.
Machine learning techniques process the data with machine learning algorithms
and classify linguistic data by representing it in vector form (Olsson et al.
2009). The lexicon-based (also called dictionary-based) approach, on the other
hand, classifies linguistic data using a dictionary lookup database. During
this classification, it computes sentence-level or document-level sentiment
polarity using lexicon databases for processing linguistic data, such as
WordNet, SentiWordNet, and Treebanks. This section briefly outlines the
lexicon-based and machine learning approaches.
2.4.1 Lexicon-Based Approach
The lexicon-based approach predicts sentiment by using lexical
databases like SentiWordNet and WordNet. It obtains a score for each word in
the sentence or document and annotates it using the features present in the
lexicon database. It derives text polarity from a set of words, each annotated
with a weight, and extracts information that contributes to the overall
sentiment of the text. The data must also be pre-processed before weights are
assigned to the words.
Moreover, the lexicon dictionary or database contains opinionated words
classified as positive or negative, along with a description of the word as it
occurs in the current context. Each word in the document is assigned a numeric
score, an average score is computed from the sum of all the numeric scores, and
a sentiment polarity is assigned to the document. There are three
general-purpose lexicons:
(a) AFINN
(b) Bing
(c) Loughran
These three lexicons are based on unigrams. A unigram is an n-gram consisting
of a single item, that is, a single word taken from the given textual data. The
AFINN lexicon scores words in a range from -5 to 5: lower scores correspond to
more negative sentiment, whereas higher scores correspond to more positive
sentiment. The Bing lexicon, on the other hand, classifies words into the
binary categories of negative or positive. Finally, the Loughran lexicon is
intended for the analysis of shareholder reports. In this project, we make use
of the Bing lexicon to extract the sentiments from our data.
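The two scoring styles described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists `TOY_AFINN` and `TOY_BING` and both function names are invented here and are not the real AFINN or Bing lexicons, which contain thousands of entries.

```python
# Toy lexicons invented for this illustration; they are NOT the real AFINN or
# Bing word lists.
TOY_AFINN = {"love": 3, "great": 3, "good": 2, "bad": -3, "hate": -3, "terrible": -3}
TOY_BING = {
    "love": "positive", "great": "positive", "good": "positive",
    "bad": "negative", "hate": "negative", "terrible": "negative",
}


def afinn_style_score(text):
    """Sum the integer scores (range -5..5) of the words found in the lexicon."""
    words = text.lower().split()
    return sum(TOY_AFINN.get(word, 0) for word in words)


def bing_style_label(text):
    """Count positive and negative words and return the majority label."""
    words = text.lower().split()
    positives = sum(1 for word in words if TOY_BING.get(word) == "positive")
    negatives = sum(1 for word in words if TOY_BING.get(word) == "negative")
    if positives > negatives:
        return "positive"
    if negatives > positives:
        return "negative"
    return "neutral"  # no opinionated words, or a tie


print(afinn_style_score("I love this great game"))  # 3 + 3 = 6
print(bing_style_label("I hate this bad update"))   # negative
```

Words missing from the lexicon contribute nothing, which is why pre-processing (lowercasing, removing punctuation) matters before scoring.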
2.4.2 Machine Learning Approach
The machine learning approach is widely used in the literature on sentiment
analysis. In this approach, the words in a sentence are represented as
vectors and analyzed using machine learning algorithms such as Naïve
Bayes, SVM, and Maximum Entropy. The data is prepared accordingly so that it
can be supplied to the machine learning algorithms. The data is appended with
the total positive and negative scores of each tweet, and a sentiment label
('POSITIVE', 'NEGATIVE', or 'NEUTRAL') is assigned to each tweet
in the dataset. The dataset labeled with tweet sentiment is then used to
train machine learning algorithms in order to measure the accuracy,
performance, and reliability of the results obtained from lexicon-based sentiment
analysis. The most abstract view that performs sentiment analysis using machine
learning .
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where,
A, B = events
P(A | B) = probability of A given B is true
P(B | A) = probability of B given A is true and
P(A), P(B) = the independent probabilities of A and B
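As a small worked example of Bayes' theorem in a sentiment setting, the sketch below computes the probability that a tweet is positive given that it contains the word "love". All probabilities are made-up values chosen for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Illustrative (invented) probabilities:
p_positive = 0.4               # P(A): a tweet is positive
p_love_given_positive = 0.10   # P(B | A): "love" appears in a positive tweet
p_love_given_other = 0.01      # "love" appears in a non-positive tweet

# P(B) by the law of total probability
p_love = (p_love_given_positive * p_positive
          + p_love_given_other * (1 - p_positive))

p_positive_given_love = posterior(p_love_given_positive, p_positive, p_love)
print(round(p_positive_given_love, 4))
```

Here P(B) = 0.04 + 0.006 = 0.046, so the posterior is 0.04 / 0.046, roughly 0.87: seeing the word raises the probability of the positive class well above its prior of 0.4.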
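$$\text{Gini Index} = 1 - \sum_{i=1}^{n} (P_i)^2$$
$$\text{Weighted Gini Index} = \sum_{k} \frac{N_k}{N}\,\text{Gini Index}_k$$
where,
P_i = the probability of class i in a node,
N_k = the number of samples in child node k of a split, and
N = the number of samples in the node being split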
Similarly, the algorithm computes the Gini index of every possible split and
chooses the feature with the lowest value for the root node. Then, each
decision tree produces its own result. Finally, we used the majority voting of
the random forest classifier to choose the best result.
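The Gini computation used to compare splits can be sketched as follows. This is a minimal illustration, assuming the standard definition of Gini impurity (one minus the sum of squared class probabilities) and a size-weighted average over the child nodes of a split; the function names are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """Gini impurity of one node: 1 - sum of squared class probabilities."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(children):
    """Size-weighted average of the Gini index over the child nodes of a split."""
    total = sum(len(child) for child in children)
    return sum(len(child) / total * gini_index(child) for child in children)


# A split that separates the classes perfectly scores 0.0 ...
perfect_split = [["pos", "pos"], ["neg", "neg"]]
# ... while a split that leaves both children mixed scores 0.5.
useless_split = [["pos", "neg"], ["pos", "neg"]]

print(weighted_gini(perfect_split))
print(weighted_gini(useless_split))
```

A tree builder would evaluate `weighted_gini` for each candidate feature and threshold and keep the split with the lowest value.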
(d) Leaves: These are the terminal nodes that predict the outcome of
the decision tree.
important or relevant information. Feature extraction can also reduce the amount
of redundant data for a given analysis. Also, the reduction of the data and the
machine’s efforts in building variable combinations (features) facilitate the
speed of learning and generalization steps in the machine learning process.
Data scientists use many feature extraction methods to tap into the value
of raw data sources. Let’s look at three of the most common and how they’re
used to extract data useful for machine learning applications.
(a) Image Processing
(b) Bag of Words
(c) Autoencoder
In this project, we use the TF-IDF feature extraction method. TF-IDF (term
frequency-inverse document frequency) is a statistical technique that assesses
how relevant a word is to a document within a collection of documents. It is
computed by multiplying a word's frequency in a document by its inverse
document frequency over the collection of documents. It is highly useful for
scoring words in machine learning algorithms for Natural Language Processing
(NLP) and has a wide range of applications, with automated text analysis being
the most important one. TF-IDF was originally developed for document search and
information retrieval. The score increases with the frequency with which a word
appears in a document, but is counterbalanced by the number of documents in
which the word appears. So, words that are common in every document, such as
"this", "what", and "if", rank low even though they may appear many times,
since they don't mean much to that document in particular.
However, if the word "bug" appears frequently in one document but not in the
others, it is presumably highly relevant to that document. For instance, if our
goal is to identify the subjects to which specific NPS replies belong, the term
"bug" is likely to be associated with the topic "reliability" because most
responses including that word will be related to that topic.
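The behaviour described above can be checked with a short pure-Python sketch. It assumes the textbook definition, tf = count / document length and idf = ln(N / df), which differs from the smoothed and normalized variant that scikit-learn's `TfidfVectorizer` applies by default. The three example documents and the function name `tf_idf` are invented for illustration.

```python
import math

# Three tiny example documents (invented for illustration)
documents = [
    "this game has a bug",
    "this game is fun",
    "this update is great",
]
tokenized = [doc.split() for doc in documents]


def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for tokens in all_docs if term in tokens)
    if df == 0:
        return 0.0  # term never occurs in the collection
    idf = math.log(len(all_docs) / df)
    return tf * idf


# "this" occurs in every document, so its idf is ln(3/3) = 0
print(tf_idf("this", tokenized[0], tokenized))
# "bug" occurs only in the first document, so it scores higher there
print(tf_idf("bug", tokenized[0], tokenized))
```

The common word "this" scores exactly zero, while "bug" gets a positive score in the only document that contains it.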
2.8 Summary
The background theory of sentiment analysis and natural language
processing is briefly described in this chapter. There are two approaches to
analyzing sentiment: the lexicon-based approach and the machine learning
approach. For our proposed system, we discussed Logistic Regression,
Naive Bayes, Random Forest, Support Vector Machine, and Decision Tree as
machine learning algorithms, and TF-IDF as the feature extraction method. The
implementation of Twitter Sentiment Analysis will be described in the next
chapter.
CHAPTER 3
PROPOSED SYSTEM AND IMPLEMENTATION
In this chapter, we discuss the sentiment analysis project built using NLP
techniques.
3.1 Dataset Description
The Twitter Sentiment Analysis dataset is taken from the Kaggle
website. The training dataset contains 61691 rows and 4 columns. We have
focused our selection on datasets that are: (i) publicly available to the
research community, (ii) manually annotated, providing a reliable set of
judgements over the tweets, and (iii) used to evaluate several sentiment
analysis models. Each record contains a tweet, the entity or company to which
it refers, and an assigned sentiment. The columns present in the dataset are:
Sentiment: the polarity of the tweet (positive, negative, neutral)
Tweet ID: the unique ID of the tweet
Entity: the name of the mentioned entity
Tweet_content: the text of the tweet
Figure 3.1 displays a sample of the dataset used in the project.
[Figure: system flow diagram with the stages Twitter Dataset, Preprocessing, Feature Extraction (TF-IDF), Training Data, Testing Data, Machine Learning Algorithm, and Best Model]
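The stages of the proposed system (TF-IDF feature extraction, a train/test split, and a machine learning algorithm) can be sketched end to end with scikit-learn. This is a minimal sketch, not the project's actual implementation: the eight tweets and their labels are invented, Logistic Regression stands in for any of the compared classifiers, and the split ratio is chosen only so that the toy data works.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data invented for illustration (not taken from the Kaggle dataset)
tweets = [
    "I love this game", "What a great update", "Best match ever",
    "Really enjoying the new map", "I hate this bug", "The servers are terrible",
    "Worst patch so far", "This update is awful",
]
labels = ["Positive"] * 4 + ["Negative"] * 4

# Split into training and testing data, keeping both classes in each part
X_train, X_test, y_train, y_test = train_test_split(
    tweets, labels, test_size=0.25, random_state=1, stratify=labels
)

# Feature extraction (TF-IDF) followed by the machine learning algorithm
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=500),
)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(list(zip(X_test, predictions)))
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is learned from the training data only, so the test data stays unseen until prediction.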
CHAPTER 4
EXPERIMENTAL RESULT AND CONCLUSION
In this chapter, the experimental results of the models and the limitations of
sentiment analysis are described.
4.1 Implementation
First, the Twitter dataset is downloaded from Kaggle and imported into
Python.
4.1.1 Import Twitter Datasets in Python
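A minimal sketch of the import step with pandas is shown below. The column names follow those used in the appendix code; the assumption that the CSV files have no header row, the helper name `load_tweets`, and the two sample rows are not from the report and are used only for illustration.

```python
import io

import pandas as pd

# Column names as used in the appendix code
COLUMNS = ["Tweet ID", "entity", "sentiment", "Tweet_content"]


def load_tweets(source):
    """Read a header-less CSV (assumption) and attach the column names."""
    return pd.read_csv(source, header=None, names=COLUMNS)


# In the project, `source` would be the path of the downloaded Kaggle file.
# Here an in-memory sample with invented rows stands in for it.
sample_csv = io.StringIO(
    '1,ExampleGame,Positive,"I love this game"\n'
    '2,ExampleGame,Negative,"The new update is terrible"\n'
)
train_df = load_tweets(sample_csv)
print(train_df.shape)
print(train_df.head())
```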
REFERENCES
[1] Manish Shivanandhan. What is Sentiment Analysis?
[2] Adil Rajput. Natural Language Processing, Sentiment Analysis and Clinical
Analytics. Assistant Professor, Information System Department, Effat
University, An Nazlah Al Yamaniyyah, Jeddah 22332, Saudi Arabia.
[3] https://data-flair.training/blogs/data-science-r-sentiment-analysis-project/
[4] https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17
[5] https://www.javatpoint.com/nlp
[6] https://www.tidytextmining.com/sentiment.html
[7] https://lazarinastoy.com/sentiment-analysis-theory-methods-applications/
[8] Sharma, R., & Sharma, Y. K. (2020). Sentiment analysis of Twitter data
using machine learning techniques. International Journal of Advanced
Computer Science and Applications, 11(8), 174-180.
https://doi.org/10.14569/IJACSA.2020.0110824
[9] Kalpana, T., & Vasuki, V. (2018). A comparative study on sentiment
analysis techniques in Twitter data. Journal of Advanced Research in
Dynamical and Control Systems, 10(5), 273-282.
APPENDIX
import pandas as pd
import numpy as np
import string as st
from wordcloud import WordCloud #Word visualization
import matplotlib.pyplot as plt #Plotting properties
import seaborn as sns #Plotting properties
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer #Data transformation
from sklearn.model_selection import train_test_split #Data testing
from sklearn.linear_model import LogisticRegression #Prediction Model
import xgboost
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score, confusion_matrix #Comparison between real and predicted
import re #Regular expressions
import nltk
from nltk import word_tokenize
from nltk import PorterStemmer, WordNetLemmatizer
nltk.download('stopwords')
import warnings
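# Load the training and validation sets. The file names and the header-less
# layout are assumptions (the report does not state them).
columns = ["Tweet ID", "entity", "sentiment", "Tweet_content"]
train_df = pd.read_csv("twitter_training.csv", header=None, names=columns)
valid_df = pd.read_csv("twitter_validation.csv", header=None, names=columns)
train_df.head()
valid_df.head()
train_df.shape
valid_df.shape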
data = pd.concat([train_df, valid_df]).reset_index(drop = True)
data.shape
from sklearn.utils import shuffle
data = shuffle(data)
data = data.reset_index(drop=True)
data.head()
data.drop(["Tweet ID"],axis=1,inplace=True)
data.head()
Twitter_sentiment = data.groupby(['entity', 'sentiment']).sentiment.count().unstack()
Twitter_sentiment.plot(kind='bar',figsize=(10,5))
data.drop(["entity"],axis=1,inplace=True)
data.head()
data['sentiment'].value_counts()
plt.figure(figsize=(10,10))#Visualizing Sentiment feature - Target feature
plt.subplot(1,2,1)
sns.set(style ='whitegrid')
pd.set_option('display.max_columns',None)
sns.countplot(x=data['sentiment'])
plt.subplot(1,2,2)#plot countplot
data.sentiment.value_counts().plot(kind='pie', autopct='%1.0f%%',figsize=(10,5))
plt.show()
data.isnull().sum()# check null
data =data.dropna()
data.isnull().sum()# check null
data.duplicated().sum()
data = data.drop_duplicates()
data.duplicated().sum()
data['sentiment'].value_counts()
data["sentiment"] =data["sentiment"].map({"Positive": 3, "Neutral": 2, "Negative":
1,"Irrelevant":0})
data
from nltk.stem.snowball import SnowballStemmer
import re ,string, unicodedata, html
import nltk
from nltk import word_tokenize , WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
data['Tweet_content'] = data['Tweet_content'].apply(lambda x: x.lower())
data.head()
import string
def punctuation_removal(text):
    # Keep only the characters that are not punctuation
    all_list = [char for char in text if char not in string.punctuation]
    clean_str = ''.join(all_list)
    return clean_str
data['Tweet_content'] = data['Tweet_content'].apply(punctuation_removal)
data.head()
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
data['Tweet_content'] = data['Tweet_content'].apply(lambda x: ' '.join([word for word in
x.split() if word not in (stop)]))
data.head()
def tokenize(text):
    # Split on runs of whitespace (raw string avoids an invalid-escape warning)
    text = re.split(r'\s+', text)
    return [x.lower() for x in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda msg : tokenize(msg))
data.head()
def remove_small_words(text):
    # Drop tokens of three characters or fewer
    return [x for x in text if len(x) > 3]
data['Tweet_content'] = data['Tweet_content'].apply(lambda x : remove_small_words(x))
data.head()
def stemming(text):
    # Reduce each token to its stem with the Porter stemmer
    ps = PorterStemmer()
    return [ps.stem(word) for word in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda wrd: stemming(wrd))
data.head()
def lemmatize(text):
    # Map each token to its WordNet lemma
    word_net = WordNetLemmatizer()
    return [word_net.lemmatize(word) for word in text]
data['Tweet_content'] = data['Tweet_content'].apply(lambda x : lemmatize(x))
data.head()
def return_sentences(tokens):
    # Join the token list back into a single string
    return " ".join(tokens)
data['Tweet_content'] = data['Tweet_content'].apply(lambda x : return_sentences(x))
data.head()
import re
def remove_numbers(text):
    # Strip all runs of digits
    return re.sub(r'\d+', '', text)
data['Tweet_content']=data['Tweet_content'].apply(lambda x : remove_numbers(x))
data.head()
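data.Tweet_content[0]
# Word clouds per sentiment class on a 2x2 grid (figure size is an assumption)
plt.figure(figsize=(20,20))
plt.subplot(2,2,1)
# Positive
plt.grid(False)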
plt.title("for positive")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 3].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,2)
# Negative
plt. grid(False)
plt.title("for negative")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 1].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,3)
# Neutral
plt. grid(False)
plt.title("for neutral")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 2].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
plt.subplot(2,2,4)
# Irrelevant
plt. grid(False)
plt.title("for irrelevant")
wc = WordCloud(max_words = 2000 , width = 1600 , height =
800).generate(data[data.sentiment == 0].Tweet_content.str.cat(sep=' '))
plt.imshow(wc , interpolation = 'bilinear')
x=data.iloc[:,:-1].values
y=data.iloc[:,-1].values
print(x.shape)
print(y.shape)
x
tr_idf_model = TfidfVectorizer()
X = tr_idf_model.fit_transform(data['Tweet_content'])
#X_test = tr_idf_model.transform(y)
tf_idf_array = X.toarray()
tfidf_tokens = tr_idf_model.get_feature_names_out()
df_tf_idf = pd.DataFrame(tf_idf_array, columns = tfidf_tokens)
df_tf_idf
# Hold out 10% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, data['sentiment'], test_size=0.1, random_state=1)
print(X_train.shape)
print(X_test.shape)
# Derive the chart from the actual split sizes instead of hard-coded 80/20 values
train=X_train.shape[0]
test=X_test.shape[0]
labels=['Train','Test']
sizes=[train,test]
colors=['yellow','green']
fig1,ax1=plt.subplots()
ax1.pie(sizes, colors=colors, labels=labels,autopct='%1.1f%%',startangle=90)
ax1.axis('equal')
plt.title('Train Vs Test')
plt.show()
y_train.value_counts()
dct=dict()
lr = LogisticRegression(max_iter = 500)
lr.fit(X_train, y_train)
print('Logistic Regression model fitted..')
pred = lr.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['LogisticRegression']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
print('Random Forest Classifier model fitted..')
pred = rf.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
print("Confusion matrix : \n {}".format(confusion_matrix(y_test, pred)))
dct['RdmFC']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
# Multinomial Naive Bayes
nb = MultinomialNB()
nb.fit(X_train,y_train)
pred = nb.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['MultinomialNB']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn.tree import DecisionTreeClassifier
#Decision Tree
dtc=DecisionTreeClassifier(random_state=1)
dtc.fit(X_train,y_train)
pred = dtc.predict(X_test)
print("Accuracy score : {}".format(accuracy_score(y_test, pred)))
dct['DecisionTreeClassifier']=round(accuracy_score(y_test, pred)*100,2)
accuracy_score(y_test,pred)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,pred),annot = True,fmt = "d")
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
model = clf.fit(X_train, y_train)
prediction = model.predict(X_test)
print("accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
dct['SVM'] = round(accuracy_score(y_test, prediction)*100,2)
accuracy_score(y_test,prediction)
plt.figure(figsize=(8,8))
sns.heatmap(confusion_matrix(y_test,prediction),annot = True,fmt = "d")
colors = ["green", "orange", "blue" , "red" ,"yellow"]
sns.set_style("whitegrid")
plt.figure(figsize=(7,4))
plt.title('Comparing Models',fontsize=20)
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy %",fontsize=18)
plt.xlabel("Algorithms",fontsize=18)
sns.barplot(x=list(dct.keys()), y=list(dct.values()), palette=colors)
plt.show()
text = ["I love the musics."]
texts=tr_idf_model.transform(text)
print(texts)
test1=lr.predict(texts)
test2=nb.predict(texts)
test3=dtc.predict(texts)
# `xgb` is never defined or trained in this script; use the fitted random forest instead
test4=rf.predict(texts)
print(test1,test2,test3,test4)
text = ["I hate you fucking."]
texts=tr_idf_model.transform(text)
print(texts)
test1=lr.predict(texts)
test2=nb.predict(texts)
test3=dtc.predict(texts)
# `xgb` is never defined or trained in this script; use the fitted random forest instead
test4=rf.predict(texts)
print(test1,test2,test3,test4)