Michael Final Project

Sentiment analysis of Twitter data

Uploaded by

onim0971

CHAPTER ONE

INTRODUCTION

1.1 Background of the study

Twitter is one of the most popular social networking sites, with more than 368 million monthly

active users and 500 million tweets per day (Salman, 2024). Tweets are short text messages

of up to 140 characters, but are a powerful means of expressing emotional states and feelings within

a circle of friends (Salman, 2024). With the fast development of the Internet, the number

of individuals who communicate their opinions and views on social media

platforms (namely discussion forums, social networks and blogs) by posting comments and

publishing blogs is increasing. Millions of pages (per day) are being posted on the Internet.

Thus, the Internet has become the most important source for people to obtain information for

making decisions (Zhang et al., 2019). Amongst social networks, Twitter is the best-

recognized and the 3rd most visited website on the globe (Salman, 2024). It

allows a user either to follow other users who post tweets or to post their own messages

(Kraiem et al., 2015). Traditionally, market research companies with the vision of

“comprehending feelings or opinions of people” normally utilized opinion polls,

questionnaires, interviews and forms, which are time-consuming and expensive. To

control time and cost, researchers are now proposing methodologies that concentrate

on content posted on Twitter (Furini and Montangero, 2018). On Twitter, more than 300

million users (active) are sharing and posting information daily via tweets with multimedia

contents (Harakawa et al., 2018). Twitter has 465 million user accounts and has 250 million

visitors per month with an average of 100,000 micro posts/min that is approximately 175

million every day and around 8 terabytes of data in a day (Bukhari et al., 2017). Any sort of

information can be posted and shared; in addition, it is feasible to filter tweets

associated with a product, a person, an organization or other units of interest (Bennacer et al.,

2018).

Sentiment analysis has lately become a field of intensive research. Today’s data are too

abundant and intricate for humans to process manually. Consequently, automated computer-

centric capabilities that can tell whether tweets or sentences carry factual

information or opinions are of the utmost significance (Appel et al., 2018). It is a strenuous task

that is undertaken to discover the emotion of a corpus without direct communication. This

task is also referred to as mood extraction, opinion mining, or emotion analysis (Nguyen

and Jung, 2017). Sentiment analysis is a classification process, which is grounded on

ascertaining the polarity of a review as negative, positive, or neutral as per the expressed

opinion.

With the rapid growth of the World Wide Web, people are using social media such as Twitter,

which generates large volumes of opinion text in the form of tweets that are available for

sentiment analysis. From a human viewpoint, this translates to a huge volume of information,

which makes it difficult to extract sentences, read them, analyze them tweet by tweet, summarize

them and organize them into an understandable format in a timely manner. Informal language

refers to the use of colloquialisms and slang in communication, employing the conventions of

spoken language, such as writing ‘wouldn’t’ instead of ‘would not’. Not all systems are able to detect

sentiment from the use of informal language, and this could hamper the analysis and decision-

making process.

Sentiment analysis, the process of automatically identifying the emotional tone of text, is a

powerful tool for understanding public opinion and social media trends. However, analyzing

sentiment on Twitter data presents unique challenges compared to more formal text. This

study explores these challenges and how decision tree models can offer some solutions.

1.2 Statement of the problems

One major hurdle is the sheer informality of Twitter language. Slang, emojis, and sarcasm

are commonplace, often confusing traditional sentiment analysis models. Large datasets can

further complicate matters, as ensuring consistent quality across millions of tweets can be

difficult. Decision trees can help mitigate this issue by being trained on vast amounts of

tweets, allowing them to learn the nuances of informal language and identify sarcasm

indicators like punctuation or specific emojis.

Another challenge lies in capturing the subtleties of human communication. Nuance and

context are crucial for accurate sentiment analysis. For instance, a seemingly positive tweet

might be laced with sarcasm, while understanding the context of a reply or the referenced

topic can completely alter the sentiment. While some models struggle with this, decision

trees can be structured to consider the relationships between words and phrases. For

example, the presence of “not” before an adjective could signal negation or sarcasm. This allows

the model to grasp some level of context and make more nuanced classifications.

Furthermore, many existing sentiment analysis methods offer limited categories, often just

positive, negative, and neutral. This fails to capture the full spectrum of emotions people

express on Twitter. Decision trees, however, can be designed to output more granular

classifications. One could create a model that differentiates between happiness, anger,

sadness, or even more specific emotions relevant to a particular analysis. This allows for a

richer understanding of the sentiment expressed on the platform.

Many methods only analyze the overall sentiment of a tweet, neglecting the possibility of

mixed emotions within a single post. A tweet might begin with frustration but end on a

hopeful note. Decision trees, with their branching structure, can handle such complexities.

The model can follow different paths based on keywords or sentiment indicators, leading to a

more accurate classification that reflects the tweet’s full emotional range.

Sentiment analysis on Twitter data faces challenges due to informal language, the importance

of context, limited sentiment categories, and the complexity of mixed emotions. Decision

tree models offer a valuable tool to address these issues. Their ability to handle informal

language, capture some context, provide more granular sentiment classifications, and identify

mixed emotions makes them a powerful asset for analyzing the ever-evolving world of

Twitter.

1.3 Aim and objectives of study

1.3.1 Aim

The aim of this project is to develop a decision tree model for the sentiment analysis of

Twitter data.

1.3.2 Specific Objectives

The objectives of this study are to:

i. Develop a decision tree model for sentiment analysis of Twitter data.

ii. Implement the developed model.

iii. Evaluate and compare the implemented model with related methods.

1.4 Significance of the study

The significance of this project lies in developing and evaluating a decision tree model

for sentiment analysis of Twitter data.

1.5 Organization of work

The remainder of the study is organized as follows: the related works are presented in

Chapter Two. Chapter Three provides the methodology. Chapter Four provides an overview of

the implementation, evaluation and results. The work is concluded in Chapter Five, with

future work.

1.6 Definition of terms

Twitter Data: The short messages, called tweets, that users post on the Twitter social media

platform.

Sentiment Analysis: This is the task of understanding the emotional tone of a piece of text.

It’s about figuring out whether a statement expresses a positive, negative, or neutral

sentiment.

Decision Tree Model: This refers to a machine learning algorithm that works by asking a

series of yes-or-no questions based on features of the data.

Machine Learning: Machine learning is a field of artificial intelligence (AI) that focuses on

developing computer algorithms that can improve performance on a specific task over time

through experience.

CHAPTER TWO

LITERATURE REVIEW

2.1 Twitter Data

A message posted on Twitter is called a tweet, which is limited to 140

characters. Tweets are generally composed of one or more of the following: text,

links, emoticons, and images. Based on these components, mining is

applied to classify text, links, emoji or emoticons. Tweets contain

three notations: hashtags (#), retweets (RT) and account IDs (@)

(Pang & Lee, 2008).
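
The three notations can be pulled out of a raw tweet with simple pattern matching. The following is a minimal sketch in Python, not part of the original study; the function name `extract_notations` and the example tweet are hypothetical, and the patterns assume that hashtags and account IDs consist of letters, digits and underscores.

```python
import re

def extract_notations(tweet):
    """Return the hashtags, account IDs and retweet flag found in a tweet."""
    hashtags = re.findall(r"#(\w+)", tweet)   # words following '#'
    mentions = re.findall(r"@(\w+)", tweet)   # words following '@'
    # Assumption: a retweet is marked by the token 'RT' at the start of the text.
    is_retweet = re.match(r"RT\b", tweet) is not None
    return {"hashtags": hashtags, "mentions": mentions, "is_retweet": is_retweet}

example = "RT @alice: Loving the new phone! #happy #tech"
print(extract_notations(example))
```

Such fields can later serve as features or be stripped before the text is classified.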

Tweets reflect public opinion/sentiment on a product, topic, event,

governmental issue, etc. Monitoring and analyzing these tweets gives

valuable feedback to individual users and to the private and public sectors (Hassan,

2013). Due to the large size of this data, sentiment analysis is chosen as the

technique to analyze it, because it makes it easy to determine user-

generated opinions without going through millions of tweets manually.

2.2 Sentiment Analysis

Sentiment analysis can be defined as a process that automates the mining

of attitudes, opinions, views and emotions from text, speech, tweets and

database sources through Natural Language Processing (NLP) (Sebastiani,

2002). Sentiment analysis involves classifying opinions in text into

categories like “positive”, “negative” or “neutral” (Pang & Lee, 2008). It is

also referred to as subjectivity analysis, opinion mining, and appraisal extraction.

Considering the accessibility of such voluminous information on the web,

it becomes difficult and error-prone for a human to manually analyze the

data. Therefore, there is a need for ways to automatically analyze the

content. This can be achieved using sentiment analysis (Sebastiani,

2002). Consumers don’t need to ask other people about the quality of a

product as the answers are easily available for them. Sentiment analysis

not only finds its applications in product reviews but also on social media

and news articles. The results obtained during the sentiment analysis are

also helpful in market research and in making political/government

decisions.

2.2.1 Levels of Sentiment Analysis

Generally, sentiment analysis is carried out at three different levels:

document level, sentence level and feature level (entity/aspect level). In

the sentiment analysis of Twitter data, sentence-level analysis is done due

to the length constraint of the tweet.

I. Document Level Sentiment Analysis

In document-level sentiment analysis, findings are based

on the sentiment found in the document as a whole. The major task is

to find out whether the document bears positive or negative

sentiment. Generally, the basic assumption in document-level

sentiment analysis is that the opinion holder expresses an opinion

on only one target at a time (Sebastiani, 2002). This type of

classification may not be suitable or valid for some data sets, because

a document does not always have a single opinion target.

II. Sentence Level Sentiment Analysis

In sentence-level sentiment analysis, each sentence is

classified into positive, negative or neutral class (Hassan, 2013). In

this type of analysis, the sentences are assigned a polarity.

Generally, the sentences are assigned as positive or negative while

the neutral sentences are assumed to have no polarity.

III. Feature Level Sentiment Analysis

In feature-level sentiment analysis, a single

sentence may contain sentiments about different entities. For

example, let us consider the following statement, “I like this movie,

but the music is bad!” From this statement, it can be noted that the

sentiment on the movie is positive but negative on the music. Movie

and music are the two entities/aspects included in the sentiment.

In such cases, aspect-based sentiment

analysis provides a finer-grained analysis (Sebastiani, 2002).

2.2.2 Approaches For Sentiment Analysis

There are several approaches to sentiment analysis; they are

highlighted below:

I. Lexicon-based Approach

This approach utilizes a lexicon or dictionaries. The semantic

orientation or polarity of the words or phrases in a document is used

to calculate the orientation of the document. The lexicon-based

approach does not require the storage of a large corpus of data, as

required in a machine learning approach. Semantic

Orientation (SO) is the measure of subjectivity and opinion in the

text, and it captures the polarity and strength of words or phrases. All

these words determine the overall sentiment orientation of the

document (Pang & Lee, 2008).
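
To make the lexicon-based idea concrete, the sketch below sums the semantic orientation of the words in a text and maps the total to a class. It is only an illustration: the tiny `LEXICON`, its scores, and the zero threshold are hypothetical assumptions, not a published sentiment dictionary.

```python
# Hypothetical toy lexicon: word -> semantic orientation (sign = polarity, size = strength).
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}

def semantic_orientation(text, lexicon=LEXICON):
    """Sum the orientation of every known word; unknown words contribute 0."""
    words = text.lower().split()
    return sum(lexicon.get(word.strip(".,!?"), 0.0) for word in words)

def classify(text):
    """Map the overall orientation to a sentiment class (threshold 0 is an assumption)."""
    score = semantic_orientation(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("I love this great phone!"))
print(classify("The battery is terrible."))
```

No training corpus is needed here, which is the main practical difference from the machine learning approach.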

II. Machine Learning Approach

Machine learning approaches use several machine learning

algorithms for classification (Hassan, 2013). Machine learning

approaches are well suited to text classification. These approaches

learn from previous computations to produce reliable decisions and

results. They yield high-value predictions that can guide better decisions and

smart actions in real time without human interaction.

III. Hybrid Approach

The hybrid approach combines machine learning with lexicon-

based methods, which contain manually written linguistic rules.

Classifiers present in this approach are used in a cascade manner so

when one of the classifiers fails the next one classifies, and so on

until the document/text is categorized (Pang & Lee, 2008).

2.3 Machine Learning Approach For Sentiment Analysis

Classification is a form of data analysis which can be used to extract

models describing important data classes. Classification provides us with a

better understanding of the large data. “Many classification and prediction

methods have been proposed by researchers in machine learning, pattern

recognition, and statistics”. Data classification can be understood as a

two-step process in which the first step is the learning phase (or training

phase), where a classification algorithm builds the classifier by learning

from a training set made up of database tuples and their associated class

labels. The class label is a discrete value where each value serves as a

class or category (Pang & Lee, 2008). The second step is to use the model

for classification where initially, the predictive accuracy of the classifier is

estimated. In order to measure classifier accuracy, a test set which is

made up of test tuples and their associated class labels is used. The

tuples are randomly selected from the general data set and are

independent of the training tuples. Classifier accuracy for a test set is the

percentage of test set tuples that are correctly classified by the

classifier. The associated class label of each test tuple is compared with

the learned classifier’s class prediction for that tuple (Sebastiani, 2002).
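
The accuracy computation described above can be written in a few lines. This is a minimal sketch; the function name `accuracy` and the example labels are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Percentage of test tuples whose predicted class equals the true class."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)

# Hypothetical test set of four tuples, three of which are classified correctly.
true_labels = ["positive", "negative", "neutral", "positive"]
predicted = ["positive", "negative", "positive", "positive"]
print(accuracy(true_labels, predicted))
```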

2.3.1 Classification Of Machine Learning Approach

Machine learning methods are generally classified under supervised and

unsupervised learning. Unsupervised learning methods do not use a labelled training

data set for classification, while classification using supervised learning

algorithms involves two major phases: a training phase and a

test phase used for validation.

I. Supervised Learning

Supervised learning requires a well-labelled corpus to train a

classifier. There are several algorithms that can be used in

supervised learning. The main challenge with supervised learning

methods is that we need to have well defined labelled data

otherwise the training and testing is difficult to perform (Hassan,

2013). Supervised learning can be divided into two types, which are

regression and classification. In regression, a model is trained on labelled

data sets and, with the solutions available, tries to predict a

continuous value, improving the model iteratively.

Classification attempts to find the appropriate class labels,

which can be used to predict positive, negative and neutral

sentiments. In supervised learning, a machine learning model is

developed which uses the labelled data to train and classify the

tweets and tries to predict the sentiments of the tweets. The

algorithms present in supervised learning include logistic regression,

neural networks, support vector machines (SVM), decision trees,

random forests, and Naïve Bayes classifiers (Pang & Lee, 2008).

II. Unsupervised Learning

Unsupervised methods can be based either on machine learning or on a

lexicon. A labelled corpus is not always

required in unsupervised learning. Using this method, only the input

data set is provided to the machine; the model does not require

labelling (Hassan, 2013). Unsupervised learning can also be

described as pattern discovery. An example of unsupervised

learning is clustering. An unsupervised approach to sentiment

analysis is generally based on a sentiment Lexicon. Text

classification has been done using the semantic orientation

approach where various algorithms extract phrases which contain

adjectives or adverbs to estimate a phrase’s semantic orientation.

Semantic orientation is then used to classify the reviews.

III. Semi-Supervised learning

Semi-supervised learning acts as a middle ground between

supervised and unsupervised learning models. In a semi-

supervised learning model, a mix of labelled and unlabeled data is

taken; the goal of semi-supervised learning is to classify some of

the unlabeled data using the labelled information set (Sebastiani, 2002).

In predicting the sentiment of Twitter datasets there are a few

challenges: the unlabeled dataset should be larger

than the labelled data, and the method relies on input-output proximity symmetry, relatively

simple labelling, and a low-dimensional problem. This model is

most prevalent in stock trend analysis and is not helpful in sentiment analysis

of datasets.

2.3.2 Machine Learning Algorithms

There are several algorithms that can be used for text classification. The

algorithms commonly used for classification are described below.

I. Naïve Bayes

Naïve Bayes classifier is a simple probabilistic classifier that uses

the concept of mixture models to perform classification. The mixture

model relies on the assumption that each of the predefined classes

is one of the components of the mixture itself (Pang & Lee, 2008).

Bayesian classification finds its roots in statistical mathematics;

thus, its properties are mathematically provable. It is based on

Bayes’ theorem of posterior probability. In text classification, a

given document d is assigned to the class $c^* = \arg\max_c P(c \mid d)$.

The Naïve Bayes (NB) classifier is derived by first

observing that, by Bayes’ rule,

$$P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}, \qquad (1)$$

where P(d) plays no role in selecting c*. The classifier supplies

relative probabilities along with the class labels, which expresses

the confidence of a decision. Considering a training set and the

associated class labels, each tuple is represented by an n-

dimensional attribute vector, the classifier determines that the

attribute vector belongs to a class with the highest posterior

probability (Sebastiani, 2002).

There are two different ways Naïve Bayes can be set up: the multinomial

model and the Bernoulli model. In the multinomial model, each

class is treated in the estimation as a

separate “language”. BernoulliNB (Bernoulli Naïve Bayes) is suitable

for discrete data; it is designed for Boolean/binary features and

works with word occurrence (presence or absence) rather than occurrence counts.
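
The decision rule in equation (1) can be illustrated with a small multinomial Naïve Bayes classifier. The sketch below is a teaching example under stated assumptions, not the implementation used in this study: the four training documents are invented, tokens are split on whitespace, words outside the training vocabulary are ignored, and add-one (Laplace) smoothing is assumed so that unseen word/class pairs do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in zip(documents, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary

def predict(text, model):
    """Return argmax_c of log P(c) + sum of log P(w|c); P(d) is dropped as in eq. (1)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in class_counts:
        score = math.log(class_counts[c] / total_docs)        # log P(c)
        total_words = sum(word_counts[c].values())
        for token in text.lower().split():
            if token not in vocabulary:
                continue                                      # ignore unseen words
            # Add-one smoothed estimate of P(token | c) (an assumption of this sketch).
            score += math.log((word_counts[c][token] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical training data.
docs = ["good great film", "great acting love it", "bad boring film", "terrible plot hate it"]
labels = ["positive", "positive", "negative", "negative"]
model = train_multinomial_nb(docs, labels)
print(predict("great film", model))
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied.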

II. Random Forest

The random forest algorithm is a supervised ensemble learning

algorithm. A random forest takes a collection of weak learners

and combines them to work together and form a stronger

classification predictor. In other words, it is a combination of

decision trees (weak learners) to form a random forest (strong

predictor) (Hassan, 2013). The main goal of the random forest

is to combine several base-level predictors using a learning

algorithm, forming an effective and robust single predictor.

When implementing a random forest, the forest classifier is

fitted with two arrays, one with the training data and the other with the

target values (class labels) of the training data.
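
The combination step can be sketched as a majority vote over the individual trees. The example below shows only that voting step; the three hand-written "trees" are hypothetical stand-ins, whereas in an actual random forest each tree would be learned from a random sample of the training data and a random subset of the features.

```python
from collections import Counter

def majority_vote(trees, features):
    """Ask every tree for a class and return the most frequent answer."""
    votes = [tree(features) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical weak learners, each looking at a single word-count feature.
trees = [
    lambda f: "positive" if f.get("good", 0) > 0 else "negative",
    lambda f: "negative" if f.get("bad", 0) > 0 else "positive",
    lambda f: "positive" if f.get("love", 0) > 0 else "negative",
]

print(majority_vote(trees, {"good": 1, "love": 1}))
print(majority_vote(trees, {"good": 1, "bad": 1}))
```

An odd number of trees is used here so that a two-class vote cannot tie.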

III. XGboost (Extreme Gradient Boosting)

XGBoost stands for eXtreme Gradient Boosting. “The name

XGBoost, though, actually refers to the engineering goal to push the

limit of computations resources for boosted tree algorithms. Which

is the reason why XGBoost is used”. It is a decision tree-based

ensemble machine learning algorithm which implements the

gradient boosting framework. Boosting is a technique in which

new models are added to correct the errors made by existing models; these

models are added sequentially until no further improvements can be

made. Gradient boosting is an approach where new models are

created that predict the residuals or errors of prior models and then

added together to make the final prediction. It is called gradient

boosting because it uses a gradient descent algorithm to minimize

the loss when adding new models. This approach supports both

regression and classification predictive modelling problems.
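
The residual-fitting idea can be shown on a tiny regression problem. The sketch below is not XGBoost itself; it is a plain gradient boosting loop under stated assumptions: squared-error loss (for which the negative gradient is exactly the residual), one-split regression stumps as the weak models, a single numeric input with at least two distinct values, and hypothetical settings `n_rounds=20` and `learning_rate=0.5`.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue  # a split must put data on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Sequentially add stumps, each one fitted to the residuals of the current model."""
    base = sum(ys) / len(ys)          # initial prediction: the mean target
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

model = gradient_boost([1, 2, 3, 4], [1.0, 1.0, 3.0, 3.0])
print(round(model(1), 3), round(model(4), 3))
```

Each round shrinks the remaining error, which is why the final prediction is the sum of all the small corrections.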

IV. Support Vector Machine

The support vector machine classifier is considered to be a

non-probabilistic binary linear classifier. The support vector machine is a

supervised model. This classifier uses large margins for

classification (Pang & Lee, 2008). It separates the data using a

hyperplane. The support vector machine uses the concept of decision

planes that define decision boundaries.

$$G(X) = w^{T}\phi(X) + b \qquad (2)$$

‘X’ is a feature vector, ‘w’ is a weight vector and ‘b’ is the bias term.

ϕ() is a nonlinear mapping from the input space to a high-dimensional

feature space. Here both ‘w’ and ‘b’ are learned automatically from the

training set. SVM can be used for pattern recognition.
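
Once the weights have been learned, equation (2) is evaluated and the sign of the result decides the class. The sketch below covers only this prediction step; the weights, bias and feature vectors are hypothetical values, ‘b’ is treated as a scalar, and the mapping ϕ defaults to the identity (a linear SVM) as a simplifying assumption.

```python
def decision_function(w, x, b, phi=lambda v: v):
    """Evaluate G(X) = w^T phi(X) + b for one feature vector."""
    mapped = phi(x)
    return sum(wi * zi for wi, zi in zip(w, mapped)) + b

def classify(w, x, b):
    """The side of the hyperplane (sign of G) gives the predicted class."""
    return "positive" if decision_function(w, x, b) >= 0 else "negative"

# Hypothetical learned parameters for two features:
# [count of positive words, count of negative words].
w = [1.5, -2.0]
b = -0.5
print(decision_function(w, [2, 0], b), classify(w, [2, 0], b))
print(decision_function(w, [0, 1], b), classify(w, [0, 1], b))
```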

V. Decision Tree

The decision tree classifier is a supervised learning algorithm which

can be used for both classification and regression tasks. It can be

adapted to almost any type of data. It divides the training data into

small parts in order to identify patterns so that they can be used for

classification (Hassan, 2013). This algorithm is particularly used

where many hierarchical categorical distinctions can be

made. It consists of a root node, decision nodes and leaf nodes. The root

node represents the entire data set, a decision node performs a

computation, and a leaf node produces the classification. In the training phase, this algorithm

learns which decisions are to be made in order to split

the labelled data into its classes. An

unknown instance is classified by passing it through the tree. The computation which takes place

in each decision node usually compares the selected feature

with a predetermined constant; the decision is made based on

whether the feature is greater or less than the constant, creating a

two-way split in the tree. The data is eventually passed through

these decision nodes until it reaches a leaf node, which represents

its assigned class.
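
The way an unknown instance travels from the root to a leaf can be sketched as follows. The tree here is written by hand for illustration; its feature names (`count_negative_words`, `count_positive_words`) and thresholds are hypothetical, whereas a trained model would learn both from the labelled data.

```python
# A node is either a leaf (a class label string) or a decision node:
# (feature_name, threshold, subtree_if_not_greater, subtree_if_greater).
TREE = (
    "count_negative_words", 0,
    ("count_positive_words", 0, "neutral", "positive"),
    "negative",
)

def classify(node, features):
    """Follow two-way splits until a leaf node (class label) is reached."""
    while not isinstance(node, str):
        feature, threshold, low_branch, high_branch = node
        node = high_branch if features[feature] > threshold else low_branch
    return node

print(classify(TREE, {"count_negative_words": 0, "count_positive_words": 2}))
print(classify(TREE, {"count_negative_words": 1, "count_positive_words": 3}))
```

Each decision node compares one feature with a constant, exactly as described above, so a prediction costs at most one comparison per level of the tree.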

VI. Maximum Entropy

Maximum entropy provides a machine learning technique for

prediction. It is also known as the multinomial logistic model. The

maximum entropy classifier maximizes the entropy that is defined on the

conditional probability distribution (Sebastiani, 2002). It can even

handle overlapping features and is the same as logistic regression,

which finds a distribution over classes. It also follows certain

feature constraints.

$$P_{ME}(c \mid d, \lambda) = \frac{\exp\left[\sum_i \lambda_i f_i(c, d)\right]}{\sum_{c'} \exp\left[\sum_i \lambda_i f_i(c', d)\right]} \qquad (3)$$

‘c’ is the class, ‘d’ is the sentence, and λ is the weight vector;

the weights determine the importance of each feature. A larger

weight denotes that a particular feature is a strong indicator of the class.
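
Equation (3) can be evaluated directly once the feature functions and weights are fixed. In the sketch below the two feature functions and the weights `[1.2, 0.8]` are hypothetical; in practice the weights would be estimated from training data.

```python
import math

def maxent_probabilities(document, classes, feature_functions, weights):
    """Compute P(c | d) for every class as in equation (3)."""
    scores = {
        c: math.exp(sum(lam * f(c, document) for lam, f in zip(weights, feature_functions)))
        for c in classes
    }
    normalizer = sum(scores.values())   # the sum over all classes in the denominator
    return {c: score / normalizer for c, score in scores.items()}

# Hypothetical binary feature functions f_i(c, d).
features = [
    lambda c, d: 1.0 if c == "positive" and "good" in d.split() else 0.0,
    lambda c, d: 1.0 if c == "negative" and "bad" in d.split() else 0.0,
]
weights = [1.2, 0.8]

probs = maxent_probabilities("a good day", ["positive", "negative"], features, weights)
print(probs)
```

The probabilities always sum to one because every score is divided by the same normalizer.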

VII. Deep Learning

Deep learning is a class of machine learning methods, within

artificial intelligence, that uses multiple layers to

progressively extract higher-level features. Deep learning is an efficient

learning method which makes use of neural networks to perform

required tasks (Pang & Lee, 2008). Neural networks are analogues

that represent the functioning of biological neurons in the human

brain. Artificial neural networks consist of three layers: an input

layer, an output layer, and an optional hidden layer. Neural

networks can be viewed as fully connected graphs which associate each node with

an input value and each edge with a weight; the weights are initially

random values, and a bias term is added. The

functioning of a neural network is carried out by calculating the

weighted sum.

$$\text{Weighted sum} = \sum_i w_i x_i + b \qquad (4)$$

The weighted sum is passed through a special function to shape the

output. Such special functions are called activation functions, which

can be used to make the output non-linear so that classification is

made possible. ReLU, the rectified linear unit, outputs

only zeros and positive values.

$$\mathrm{ReLU}(x) = \max(0, x) \qquad (5)$$

The sigmoid function is a special S-shaped curve which limits the

values between 0 and 1.

$$\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \qquad (6)$$
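
Equations (4) to (6) describe the computation of a single neuron, which can be sketched in a few lines. The weights, inputs and bias below are hypothetical values chosen only to show the effect of each activation function.

```python
import math

def weighted_sum(weights, inputs, bias):
    """Equation (4): sum of w_i * x_i plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def relu(x):
    """Equation (5): negative values become zero, positive values pass through."""
    return max(0.0, x)

def sigmoid(x):
    """Equation (6): squashes any real value into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

z = weighted_sum([0.5, -1.0], [2.0, 3.0], 0.5)   # 1.0 - 3.0 + 0.5
print(z, relu(z), sigmoid(z))
```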

In deep learning model, the training dataset is the set of examples

which are used to train the neural network. As the solutions to the

input are already known, the neural network learns from these

examples so that it could give the expected outputs. The ratio of the

number of correctly classified examples and the total number of

examples used as training data gives out the training accuracy.

In deep learning, the testing dataset is that set of examples which

are used to test the neural network and check how well the neural

network learned to classify from the training dataset (Hassan,

2013). As the solutions to these inputs are already known, the

neural network is tested on these examples to check whether it gives

the desired predictions on new data

that is different from the training dataset. The ratio of the number of

correctly classified examples to the total number of examples used

as testing data gives the testing accuracy. In supervised

learning, the data fed to the algorithm include the desired solutions,

called labels. In unsupervised learning, the training data is unlabeled, and the

deep learning model tries to learn without any supervision. Some

algorithms under the deep learning approach:

a) Convolutional neural networks: Convolutional neural networks (CNNs)

have been designed to process data through multiple layers of

arrays. A CNN is mainly used for pattern recognition. The main

difference between a CNN and an ordinary neural network is

that it takes the input in the form of a two-dimensional array and

operates on it directly, without a separate feature extraction step. Local

receptive fields, convolution and pooling are the three basic ideas of a

CNN. Each concurrent layer in the neural network is connected to a region of the

input neurons (Hassan, 2013). Each connection learns

a weight of the hidden neuron, with an associated connection

moving from one layer to another. The mapping of these

connections from the input layer to the hidden layer is defined as

shared weights. CNNs use pooling layers, which help in creating layers

from the neurons of the previous level.

b) Long Short-Term Memory: LSTM networks are one of the

Recurrent Neural Network architectures; they are designed to

remember the history of past values over time. An LSTM

consists of three gates: an input gate to read in the input, an

output gate to write the output to the next layers, and a forget

gate to decide which data to remember and which data to forget

(Pang & Lee, 2008). LSTM gives fine-grained

control over memory; this helps to control

how the present input matters in forming the new memory, how

the prior memories matter in forming the new memory, and

what parts of the memory are essential in producing the output.

c) CNN-LSTM: The CNN Long Short-term Memory Network (CNN-

LSTM) is a combination of Convolutional Neural Network and Long

Short-term Memory. In this model, CNN receives the embedded

words as input initially and output is fed to LSTM (Hassan, 2013). In

order to learn about the ordering of the input’s text, the convolution

layer extracts the local features and then the ordering of the said

features is done by LSTM layer.

2.4 Related Works

In the recent past, much work has been done in the field of “Sentiment

Analysis” by many researchers. In fact, work in this field started at

the beginning of the century. In its early stage it was intended for

binary classification, which assigns opinions or reviews to bipolar categories

like positive or negative.

Minara et al. (2007) performed sentiment analysis for product rating by

collecting feedback on the desired product. The main focus was collecting

data from Twitter to overcome the difficulty of gathering feedback

manually from people. A unigram approach was used to estimate probabilities, and dual

prediction was used to improve accuracy. A support vector machine was used

for classification of the Twitter data; Bhumika et al. (2017) also used SVM

for analyzing Twitter data. SVM requires a long training time on large

data sets and requires a good kernel function, which is not easy to choose.

Shulong Tan et al. (2009) proposed LDA-based models to interpret

sentiment variations on Twitter, i.e., an LDA model to distill the foreground

topics and RCB-LDA to find the reasons why public sentiment has

changed for the target. The advantage is that the approach distills the foreground

topics effectively and removes noisy data accurately. It finds the

reasons behind sentiment variations in Twitter data using the RCB-LDA model,

which is very useful for decision making. To determine sentiment,

sentiment analysis tools like SentiStrength and TwitterSentiment are used,

whose accuracy is lower than that of other sentiment analysis

techniques.

Po-Wei Liang and Bi-Ru Dai (2009) implemented opinion mining on social

media. An opinion miner was designed which integrates machine learning

techniques and domain-specific data. A Naïve Bayes classifier was used to

filter out opinions. Naïve Bayes is a simple classifier and works well for text

categorization (Geetika and Divakar, 2014). To build the model, manually

labelled data was used as training data. Manually labelled data requires more

effort to correct, to keep

track of, to search and to keep secure. It is

time-consuming, and duplication of data entries may occur.

Sunny et al. (2009) used a lexicon-based approach for sentiment analysis of

social media. The lexicon-based approach uses a dictionary of

positive and negative words. Both a positive and a negative score are

calculated for each tweet, and their sum is the overall sentiment score.

The score depends on the number of positive and negative words present in a

tweet. Sentiment classification depends on the size of the dictionary (Anurag

and Vijay, 2015). As the size of the dictionary increases, this becomes more

error-prone.

Pak and Paroubek (2010) proposed a model to classify tweets as

objective, positive or negative. They created a Twitter corpus by

collecting tweets using the Twitter API and automatically annotating those

tweets using emoticons. Using that corpus, they developed a sentiment

classifier based on the multinomial Naive Bayes method that uses

features like n-grams and POS tags. The training set they used was

limited, since it contained only tweets with emoticons.

Parikh and Movassate (2009) implemented two models, a Naive Bayes

bigram model and a Maximum Entropy model to classify tweets. They

found that the Naive Bayes classifiers worked much better than the

Maximum Entropy model.

Go and L. Huang (2009) proposed a solution for sentiment analysis of

Twitter data using distant supervision, in which their training data

consisted of tweets with emoticons, which served as noisy labels. They

built models using Naive Bayes, MaxEnt and Support Vector Machines

(SVM). Their feature space consisted of unigrams, bigrams and POS tags. They

concluded that SVM outperformed the other models and that unigrams were

more effective as features.

Barbosa et al. (2010) designed a two-phase automatic sentiment analysis

method for classifying tweets. In the first phase, they classified tweets as objective or

subjective; in the second phase, the subjective tweets were classified

as positive or negative. The feature space included retweets,

hashtags, links, punctuation and exclamation marks, in conjunction with

features like the prior polarity of words and POS tags.

Bifet and Frank (2010) used Twitter streaming data provided by the Firehose

API, which gave all publicly available messages from every user

in real time. They experimented with multinomial naive Bayes, stochastic

gradient descent (SGD), and the Hoeffding tree. They concluded

that the SGD-based model, when used with an appropriate learning rate, performed

better than the others.

Agarwal et al. (2011) developed a 3-way model for classifying sentiment

into positive, negative and neutral classes. They experimented with

three models: a unigram model, a feature-based model and a

tree-kernel-based model. For the tree-kernel-based model, they represented tweets as

trees. The feature-based model uses 100 features and the unigram model

uses over 10,000 features. They concluded that features

which combine the prior polarity of words with their part-of-speech (POS) tags

are the most important and play a major role in the classification task. The

tree-kernel-based model outperformed the other two models.

Davidov et al. (2010) proposed an approach that uses Twitter user-defined

hashtags in tweets as sentiment labels, with punctuation,

single words, n-grams and patterns as different feature types, which are

then combined into a single feature vector for sentiment classification.

They used a K-Nearest Neighbor strategy to assign sentiment labels,

constructing a feature vector for each example in the training

and test sets.

Po-Wei Liang et al. (2014) used the Twitter API to collect Twitter data. Their

training data falls into three categories (camera, movie, mobile).

The data is labeled as positive, negative or non-opinion. Tweets

containing opinions were filtered. A unigram Naive Bayes model was

implemented, employing the Naive Bayes simplifying independence assumption.

They also eliminated useless features using the Mutual

Information and Chi-square feature extraction methods. Finally, the

orientation of a tweet, i.e. positive or negative, is predicted.

Pablo et al. (2016) presented variations of Naive Bayes classifiers for

detecting the polarity of English tweets. Two variants of Naive Bayes

classifiers were built, namely Baseline (trained to classify tweets as

positive, negative or neutral) and Binary (which makes use of a polarity

lexicon and classifies tweets as positive or negative, neglecting neutral tweets).

The features considered by the classifiers were Lemmas (nouns, verbs,

adjectives and adverbs), Polarity Lexicons and Multiwords from different

sources, and Valence Shifters.

Turney et al. (2017) used the bag-of-words method for sentiment analysis, in

which the relationships between words are not considered and a

document is represented as just a collection of words. To determine the

sentiment of the whole document, the sentiment of every word is

determined and those values are combined using an aggregation function.

Kamps et al. (2019) used the lexical database WordNet to determine the

emotional content of a word along different dimensions. They developed a

distance metric on WordNet and determined semantic polarity of

adjectives.

CHAPTER THREE

MATERIAL AND METHOD

3.1 Dataset Description

This dataset consists of 31,962 entries and is designed for a sentiment

analysis task. Sentiment analysis is a type of text classification where the

goal is to determine the emotional tone behind a body of text, often

categorizing it as positive, negative, or neutral. In this specific dataset, the

task is to classify tweets into two categories: offensive and non-offensive,

which could be seen as an indication of negative and neutral sentiment,

respectively. Given that Twitter is a popular platform for expressing

opinions, both positive and negative, this dataset provides valuable

insight into the sentiment of social media conversations.

Dataset Name: train_tweets.csv.

3.1.1 Structure of the Dataset

The dataset consists of three columns:

I. Id:

This is a unique numeric identifier for each tweet, and while it

doesn’t contribute to the analysis, it helps manage and track

individual tweets. Each tweet is assigned a unique integer value,

ensuring there are no duplicates in terms of identification.

II. Label:

The label column serves as the target variable for sentiment

classification. Although this dataset is focused on determining

whether tweets are offensive (`1`) or non-offensive (`0`), this binary

label can be adapted for a broader sentiment analysis task.

Offensive tweets generally convey negative sentiment, while non-

offensive tweets are neutral or positive in nature. This makes the

dataset suitable for binary sentiment analysis tasks aimed at

identifying whether the sentiment behind a tweet is negative

(offensive) or neutral/positive (non-offensive).

III. Tweet:

This column contains the actual text of the tweets. The tweets

represent real-world social media posts, written in an informal and

often fragmented manner, with abbreviations, hashtags, mentions,

emojis, and URLs. This makes the sentiment analysis task more

challenging, as social media posts tend to be much less structured

than formal writing, often including slang or implicit meanings that

machine learning models must learn to interpret. The tweets are

short and to the point, adhering to Twitter’s 280-character limit,

which adds another level of complexity, as the full sentiment may

be conveyed in just a few words.

3.1.2 Data Composition and Characteristics

The dataset is complete, with no missing entries across the 31,962 rows,

ensuring that there are no null values in the `id`, `label`, or `tweet`

columns. However, while the data is clean in terms of structure, the

natural composition of social media data introduces challenges. Twitter

data, in particular, tends to include:

I. Hashtags:

Users include hashtags to categorize their tweets or emphasize

certain points. These are often sentiment indicators themselves, as

hashtags like #happy or #angry give clear cues to the sentiment.

II. Mentions (@):

These are references to other users, which could be noise in a

sentiment analysis task unless the mention directly influences the

sentiment of the tweet.

III. Emojis:

Emojis play a significant role in expressing sentiment in tweets. For

example, a smiley face 😊 or a sad face 😢 can provide a strong

indication of positive or negative sentiment.

IV. URLs:

Many tweets include links to external content, which may not

contribute directly to the sentiment of the text.

V. Abbreviations and Slang:

Social media platforms like Twitter are rife with informal language,

abbreviations, and internet slang. Understanding and processing

this language is crucial for effective sentiment analysis.
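A minimal sketch of how such elements might be normalized before feature extraction is given below. The rules are illustrative assumptions rather than the preprocessing used in this project: URLs and mentions are dropped as likely noise, the `#` symbol is removed while the hashtag word is kept as a sentiment cue, and emojis are left untouched. The function name `clean_tweet` is hypothetical.

```python
import re

# Patterns for the noisy elements listed above (illustrative, not exhaustive).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")


def clean_tweet(text: str) -> str:
    """Remove URLs and mentions, keep hashtag words, lowercase, squeeze spaces."""
    text = URL_PATTERN.sub(" ", text)      # links rarely carry sentiment themselves
    text = MENTION_PATTERN.sub(" ", text)  # user references treated as noise
    text = text.replace("#", "")           # keep the word, drop the symbol
    return " ".join(text.lower().split())


print(clean_tweet("@user So #happy today! https://example.com"))
```

Whether mentions and URLs should be removed at all is a design choice; the dataset description above notes that they are usually, but not always, irrelevant to sentiment.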

3.1.3 Application of the Dataset in Sentiment Analysis

This dataset can be employed for multiple applications related to

sentiment analysis, including:

I. Social Media Monitoring:

Companies and organizations can use sentiment analysis to

monitor public opinion or customer feedback on Twitter. By

training models on this dataset, they can detect whether users

are expressing satisfaction, anger, or neutrality toward a

product, service, or event.

II. Brand Sentiment:

Sentiment analysis helps in gauging the overall sentiment

toward a brand or product. By understanding whether tweets

about a brand are largely positive or negative, businesses can

take steps to address customer concerns or capitalize on

positive feedback.

III. Market Research:

Researchers and businesses can use sentiment analysis to

study market trends. By analyzing tweets in real-time,

companies can predict how people will react to product

launches, campaigns, or changes in services.

IV. Political Sentiment:

Governments and political organizations often analyze

sentiment on social media platforms to understand public

opinion on policies, political figures, and ongoing events. This

dataset could be extended to train models that identify

whether tweets contain favorable or unfavorable views on

political matters.

The dataset provides an excellent foundation for sentiment analysis of

Twitter data. While the labels are specific to offensive content, they align

with broader applications in detecting negative sentiment. The challenges

presented by the informal nature of Twitter language make this dataset

particularly useful for training robust sentiment analysis models capable

of handling real-world social media data.

By addressing the challenges associated with preprocessing and

understanding the nuances of online communication, this dataset can be

leveraged to create powerful tools for monitoring public sentiment,

managing brand reputation, and studying social trends across a wide

range of industries.

3.2 Architectural Model

Figure 3.1 Architectural model for sentiment analysis of twitter data.

3.2.1 Dataset

The dataset is a very important component of the model and is required for

any implementation to be carried out. Data importation is crucial as it

brings the data into the environment where analysis will be performed.

This step ensures that the data is ready for further processing and

analysis.

The first step is to load the data into a structured format that can be

manipulated by a programming language like Python. This is typically

done using a library like `pandas`, which allows for efficient data

manipulation.
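As a minimal sketch of this loading step, the snippet below reads a CSV with the three columns described in Section 3.1.1 using `pandas`. The in-memory sample rows are invented for illustration so that the example runs on its own; with the real file, the path `train_tweets.csv` would presumably be passed to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for train_tweets.csv: three invented rows with the same columns.
sample_csv = io.StringIO(
    "id,label,tweet\n"
    "1,0,what a lovely morning #happy\n"
    "2,1,example of an offensive tweet\n"
    "3,0,@user thanks for the update\n"
)

df = pd.read_csv(sample_csv)  # real data: pd.read_csv("train_tweets.csv")

print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
```

Checking the shape and the null counts right after loading is a quick way to confirm that the file was parsed as expected.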

3.2.2 Data Preprocessing

Preprocessing is essential to prepare the raw text data for the model.

Since machine learning models work with numerical data, converting the

text into vectors is a key step. The data preprocessing level is divided into

two:

I. Text Vectorization:

Conversion of the raw text data (tweets) into a numerical format

that the machine learning model can understand. TF-IDF (Term

Frequency-Inverse Document Frequency) Vectorizer is a technique

that transforms the text into vectors based on the importance of

words across the dataset.

II. Target Extraction:

The target extraction isolates the labels that the model will learn to

predict. It extracts the sentiment labels (`y`) from the dataset.
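The two preprocessing steps can be sketched with scikit-learn's `TfidfVectorizer`. The four tweets and their labels are invented for illustration, and the vectorizer is used with its default settings, which is an assumption rather than the configuration reported in this project.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data with the same column names as the dataset.
df = pd.DataFrame(
    {
        "label": [0, 1, 0, 1],
        "tweet": [
            "happy sunny day",
            "awful hateful words",
            "happy birthday friend",
            "hateful awful rant",
        ],
    }
)

# Text vectorization: one row per tweet, one column per vocabulary term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["tweet"])

# Target extraction: the labels the model will learn to predict.
y = df["label"]

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The result `X` is a sparse matrix; terms that occur in many tweets receive lower weights than terms concentrated in a few tweets.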

3.2.3 Data Splitting

Splitting the data ensures that the model is evaluated on data it hasn’t

seen during training. This is critical for assessing the model’s ability to

generalize to new, unseen data. This is done using the `train_test_split` function

from `sklearn`, which divides the dataset into two parts: training and validation

sets.

I. The training set is used to train the model.

II. The validation set is used to evaluate its performance.
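A short sketch of the split, using placeholder features and labels so that it is self-contained. The 80/20 ratio follows the algorithm in this chapter; `random_state=42` is an assumption added only to make the split reproducible.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: ten samples with one feature each and alternating labels.
X = [[i] for i in range(10)]
y = [0, 1] * 5

# Hold out 20% of the samples for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(len(X_train), len(X_val))
```

The same call accepts the sparse TF-IDF matrix in place of the list used here.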

3.2.4 Model Training

The training step is where the model actually “learns” from the data. The

decision tree classifier will create a model that can be used to predict

sentiments based on the features extracted from the tweets. Training

is implemented using the classifier classes provided by `sklearn`.

The algorithm is trained on the training dataset (`X_train`, `y_train`).

During training, the model learns patterns and relationships between the

features (tweet vectors) and the target labels (sentiments).
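The training step might look like the following sketch for the decision tree. The tweets and labels are invented toy data, and the default hyperparameters with `random_state=42` are assumptions, not the settings used to obtain the reported results.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Invented toy training data (1 = offensive, 0 = non-offensive).
train_tweets = [
    "happy sunny day",
    "awful hateful words",
    "happy birthday friend",
    "hateful awful rant",
]
y_train = [0, 1, 0, 1]

# Convert the tweets to TF-IDF vectors, then fit the tree on them.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_tweets)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# New text must be transformed with the already fitted vectorizer.
new_vector = vectorizer.transform(["happy day"])
print(model.predict(new_vector))
```

Note that the vectorizer is fitted once and then reused with `transform`, so that new tweets are mapped onto the same feature columns the tree was trained on.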

3.2.5 Model Evaluation

Evaluation is a critical step where the model’s performance is

quantitatively assessed. It is divided into the following aspects:

I. Prediction:

Use the trained model to predict sentiment labels for the validation

data (`X_val`).

II. Metric Calculation:

Calculate various performance metrics to evaluate how well the

model is performing. By calculating metrics like accuracy, precision,

recall, and F1-score, we can determine how well the model is likely

to perform on new, unseen data.

III. Classification Report:

Generate a detailed report showing precision, recall, and F1-score

for each class (e.g., positive, negative).
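The metric calculation and classification report can be sketched with `sklearn.metrics`. The true and predicted labels below are invented so that the numbers are easy to check by hand (3 true positives, 1 false positive, 1 false negative, 3 true negatives); the class names passed to the report are assumptions based on the label description in Section 3.1.1.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Invented validation labels and predictions (1 = offensive, 0 = non-offensive).
y_val = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print(accuracy, precision, recall, f1)

# Per-class precision, recall and F1-score in one text table.
report = classification_report(
    y_val, y_pred, target_names=["non-offensive", "offensive"]
)
print(report)
```

In this toy case all four metrics equal 0.75, since 6 of 8 predictions are correct and 3 of the 4 predicted and 3 of the 4 actual positives are true positives.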

3.4 Algorithm

Start

Load the Dataset (train_tweets.csv).

Preprocess the Data.

- Initialize a TF-IDF vectorizer to convert tweet text into numerical

features.

- Apply the TF-IDF vectorizer to the `tweet` column to create a feature

matrix `X`.

- Extract the sentiment labels into a target vector `y`.

Split the Data

- Split the data into training and validation sets using an 80/20 split.

- `X_train`, `y_train` represent the training set.

- `X_val`, `y_val` represent the validation set.

Initialize and Train the Model

- Initialize the models.

- Train the models using the training data (`X_train`, `y_train`).

Model Prediction on Validation Set

- Use the trained model to predict sentiment labels on the validation set

(`X_val`).

Evaluate the Model

- Calculate accuracy, precision, recall, and F1-score to assess each model's performance.

Output Evaluation Results

- Print the accuracy, precision, recall, F1-score.

End

3.5 Flowchart

Figure 3.2 Flowchart for the sentiment analysis of Twitter data.

I. Start:

The flowchart begins with the "Start" node, representing the

initiation of the sentiment analysis process.

II. Data Preprocessing:

This is a critical step where the raw data is cleaned and prepared for

analysis. The cleaned text is transformed into numerical features

using techniques like TF-IDF (Term Frequency-Inverse Document

Frequency). The result is a feature matrix \( X \) in which each tweet is represented

as a vector of numerical values.

III. Split Data into Training and Validation Sets:

The preprocessed data is divided into two subsets. Typically, an 80-

20 split is used, where 80% of the data is for training, and the

remaining 20% is for validation.

IV. Train Decision Tree Model:

In this step, the models are trained using the training dataset. The models

learn to classify tweets into different sentiment categories (e.g.,

positive, negative). For the decision tree, the splitting process continues recursively until the

stopping criterion is met.

V. Evaluate Model Performance:

After training, the model's performance is evaluated using the

validation dataset. The model's accuracy, precision, recall, and F1-score

are printed, detailing the performance on each class

(e.g., positive, negative).

VI. End:

The flowchart concludes with the "End" node, representing the

completion of the sentiment analysis process. At this point, the

trained model is ready to be deployed or used for making

predictions on new data.

CHAPTER FOUR

RESULTS AND DISCUSSION

4.1 Implementation Details

Implementing sentiment analysis of Twitter data using a decision tree

model involves several key steps: data preprocessing, feature extraction,

model building, training, and evaluation. Below is a detailed guide on how

to perform sentiment analysis using a decision tree model with Python,

leveraging common libraries like `pandas`, `scikit-learn`, and `nltk` for

natural language processing.

4.1.1 Library Import:

These are the lines of code at the beginning of a script where you import

external libraries or modules that are required for your script to run. Each

import statement allows you to use functions, classes, and methods from

the specified library. It was implemented in the code below:

Figure 4.1 Library import code.

4.1.2 Data Ingestion:

In this stage, the dataset is imported into the model. It was implemented with

the code below:

Figure 4.2 loading of training dataset.

4.1.3 Data Preprocessing:

Twitter data typically contains noise in the form of URLs, mentions,

hashtags, punctuation, and emoticons. Preprocessing is essential to clean

the data and convert it into a usable format for the model. It includes the

text vectorization and target extraction.

It was implemented in the code below:

Figure 4.3 dataset preprocessing.

4.1.4 Train-Test Split:

To evaluate the performance of the decision tree model, the data should

be split into training and testing sets. The model will be trained on the

training set and tested on the unseen testing set. It was implemented in

the code below:

Figure 4.4 dataset splitting.

4.1.5 Training of the Models:

The models can now be trained on the preprocessed and vectorized data.

The code defines and trains four models: decision tree, random forest,

support vector machine (SVM), and logistic regression. It was

implemented in the code below:

Figure 4.5 Defining and training of the models.
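Since the figure is not reproduced in this text, the following is a hedged sketch of what defining and training the four models might look like. The toy tweets, the linear SVM kernel, `n_estimators=100`, `max_iter=1000`, and `random_state=42` are all assumptions made for illustration; the original figure may use different settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Invented toy training data (1 = offensive, 0 = non-offensive).
train_tweets = [
    "happy sunny day",
    "awful hateful words",
    "happy birthday friend",
    "hateful awful rant",
    "lovely sunny morning",
    "awful nasty words",
]
y_train = [0, 1, 0, 1, 0, 1]

X_train = TfidfVectorizer().fit_transform(train_tweets)

# One entry per model compared in this chapter.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Support Vector Machine": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Train every model on the same feature matrix and labels.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: trained on {X_train.shape[0]} tweets")
```

Keeping the models in a dictionary lets the same loop be reused later for prediction and evaluation, so that all four are compared on identical data.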

4.1.6 Model Evaluation:

Once the models are trained, each model is evaluated on the validation set

using accuracy, precision, recall, and F1-score. The results, including a

detailed classification report, are printed for comparison. It was

implemented in the code below:

Figure 4.6 model evaluation.

The classification report gives a detailed breakdown of

precision, recall, and F1-score for each class, and the results are printed.

It was implemented in the code below:

Figure 4.7 printing of results.

4.2 System Specifications

The requirements for the implementation of the decision tree model can

be classified into two groups:

4.2.1 Hardware Requirements

Implementation of the decision tree model for sentiment analysis of

Twitter data requires the following hardware features:

I. Dual-core processor (e.g., Intel Core i3 or AMD Ryzen 3 with a speed

of 2.0 GHz or higher)

II. 8 GB or higher RAM

III. Storage ( 20 GB available space, SSD recommended)

4.2.2 Software Requirements

Implementation of the decision tree model for sentiment analysis of

Twitter data requires the following software features:

I. Operating System: Windows 10/11, macOS, or Linux (Ubuntu,

Fedora, etc.)

II. Programming Language: Python 3.0 or higher

III. Python Libraries and Packages: NumPy, pandas, scikit-learn

IV. Integrated Development Environment (IDE) or Text Editor:

Jupyter Notebook, VS Code.

4.3 Implementation Screenshots

Source code:

Figure 4.8 source code for implementation of models.

4.4 Evaluation Metrics

Evaluation metrics are numerical measurements that are used to assess a

machine learning model’s performance. They offer a methodical way to

evaluate how well a model predicts the results of a certain task. The type

of model being utilized and the particular problem being addressed can

affect the evaluation metrics. Accuracy, confusion matrix, precision, recall

and F1 score are some typical evaluation measures in machine learning. In

order to be sure that the model is successfully achieving the expected

results, it is crucial to choose the proper evaluation metric(s) for a given

task.

4.4.1 Accuracy:

Accuracy is a metric that measures how often a machine learning model

correctly predicts the outcome. You can calculate accuracy by dividing the

number of correct predictions by the total number of predictions.

In other words, accuracy answers the question: how often is the model

right?

You can measure the accuracy on a scale of 0 to 1 or as a percentage. The

higher the accuracy, the better. You can achieve a perfect accuracy of 1.0

when every prediction the model makes is correct.

4.4.2 Precision:

Precision is a metric that measures how often a machine learning model

correctly predicts the positive class. You can calculate precision by

dividing the number of correct positive predictions (true positives) by the

total number of instances the model predicted as positive (both true and

false positives).

In other words, precision answers the question: how often are the positive

predictions correct?

4.4.3 Recall:

Recall is a metric that measures how often a machine learning model

correctly identifies positive instances (true positives) from all the actual

positive samples in the dataset. You can calculate recall by dividing the

number of true positives by the number of actual positive instances. The latter

includes true positives (successfully identified cases) and false negative

results (missed cases).

In other words, recall answers the question: can an ML model find all

instances of the positive class?

4.4.4 F1-score:

The F1 score is a performance metric for classification and is calculated as

the harmonic mean of precision and recall.

The F1 score is commonly used to measure performance of binary

classification, but extensions to multi-class classifications exist.
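The four metrics described in this section can be written compactly as formulas. The symbols TP, TN, FP and FN (true positives, true negatives, false positives and false negatives) are standard notation introduced here for convenience; they do not appear elsewhere in this report.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP},
\]
\[
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```

As a worked example, a precision of 0.78 and a recall of 0.76 give \( F_1 = 2(0.78)(0.76)/(0.78 + 0.76) \approx 0.77 \).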

4.5 Results

Table 4.1 Results for the performance of the models.

Model                    Accuracy  Precision  Recall  F1-score
Decision tree            0.80      0.78       0.76    0.77
Random forest            0.85      0.83       0.82    0.83
Support Vector Machine   0.82      0.80       0.79    0.79
Logistic Regression      0.83      0.81       0.80    0.81

Figure 4.9 graph for parameters

4.6 Discussion

4.6.1 Decision Tree

I. Accuracy (0.80): The Decision Tree model correctly classifies 80%

of the tweets. However, it’s slightly less accurate than the other

models.

II. Precision (0.78): Out of all the tweets classified as positive (or

another target sentiment), 78% were correctly identified. Lower

precision suggests that the model might classify some negative

tweets incorrectly as positive.

III. Recall (0.76): Of all the positive tweets, 76% were correctly

identified by the model. This indicates that it might miss some

positive tweets, leading to false negatives.

IV. F1-Score (0.77): The F1-score balances precision and recall, and at

0.77, it shows the model has a moderate trade-off between these

metrics. A single Decision Tree is prone to overfitting,

particularly if the tree becomes too deep.

4.6.2 Random Forest

I. Accuracy (0.85): The Random Forest model, which combines

multiple decision trees, achieves the highest accuracy. It correctly

classifies 85% of the tweets.

II. Precision (0.83): It has a high precision, meaning it makes fewer

mistakes when predicting positive tweets.

III. Recall (0.82): The model is good at identifying positive tweets,

with 82% of actual positive tweets being correctly classified.

IV. F1-Score (0.83): This high F1-score indicates that Random Forest

maintains a strong balance between precision and recall, making it a

reliable model for this task. Its robustness against overfitting (due to

ensembling) likely contributes to its superior performance.

4.6.3 Support Vector Machine (SVM)

I. Accuracy (0.82): SVM performs well, with an accuracy of 82%, but

slightly below Random Forest.

II. Precision (0.80): SVM has good precision, indicating that most

positive predictions are correct, though it’s slightly less precise than

Random Forest.

III. Recall (0.79): The recall is slightly lower, meaning the SVM might

miss some positive tweets.

IV. F1-Score (0.79): The F1-score of 0.79 suggests that SVM maintains

a decent balance between precision and recall, making it a strong

competitor, especially for linear or separable data.

4.6.4 Logistic Regression

I. Accuracy (0.83): Logistic Regression is a simple yet effective

model, achieving an accuracy of 83%, comparable to SVM and just

slightly lower than Random Forest.

II. Precision (0.81): The model is fairly precise, correctly identifying

most positive tweets with 81% precision.

III. Recall (0.80): With an 80% recall, it effectively identifies positive

tweets without missing many.

IV. F1-Score (0.81): The F1-score of 0.81 indicates that Logistic

Regression strikes a good balance between precision and recall,

making it a reliable baseline model for binary classification tasks.

Based on the metrics, Random Forest appears to be the best choice for

sentiment analysis on this Twitter dataset. However, if interpretability or

simplicity is a concern, Logistic Regression and SVM are strong

alternatives. The Decision Tree can be useful in cases where model

interpretability is critical, but it may require careful tuning to avoid

overfitting.

CHAPTER FIVE

SUMMARY, CONCLUSION AND RECOMMENDATION

5.1 Summary

This project explores the implementation of a sentiment analysis model using Twitter data,

focusing on classifying tweets as offensive (negative) or non-offensive (neutral or positive). The

approach begins with loading a labelled dataset of 31,962 tweets (train_tweets.csv). The

tweets are then subjected to a preprocessing phase that converts the

textual data into numerical features using TF-IDF vectorization.

The core of the project involves training a Decision Tree model on the preprocessed data.

Decision Trees are chosen for their simplicity and interpretability, making them suitable for

scenarios where understanding the decision-making process is crucial. The model's

performance is evaluated using key metrics such as accuracy, precision, recall, and F1-score,

providing a detailed understanding of how well the model classifies sentiments. To provide

context and evaluate the effectiveness of the Decision Tree model, its performance is

compared with three other popular machine learning models: Random Forest, Support Vector

Machine (SVM), and Logistic Regression. Random Forest, an ensemble method combining

multiple decision trees, is expected to offer superior accuracy and generalization compared to

a single Decision Tree. SVM, known for its performance in high-dimensional spaces, and

Logistic Regression, a standard baseline model, serve as additional benchmarks.

The results indicate that while the Decision Tree model is valuable for its interpretability, it

may not generalize as well as Random Forest or SVM, which provide better accuracy and

robustness for large-scale sentiment analysis tasks. The project concludes with a discussion

on the trade-offs between interpretability and predictive performance, emphasizing the

importance of selecting the right model based on the specific requirements of sentiment

analysis. This project contributes to the broader understanding of how different machine

learning models can be applied to social media data for sentiment classification, providing

insights into their respective strengths and limitations.

5.2 Conclusion

The implementation of sentiment analysis on Twitter data using the Decision Tree model

demonstrates the potential and challenges of applying machine learning techniques to social

media analytics. The Decision Tree model, chosen for its simplicity and interpretability,

classifies tweets as offensive or non-offensive with an accuracy of 0.80, providing a clear

understanding of the decision-making process. However, when compared to other models

such as Random Forest, Support Vector Machine (SVM), and Logistic Regression, it

becomes evident that the Decision Tree may not always be the most effective choice in terms

of accuracy and generalization.

Random Forest, with its ensemble approach, outperforms the Decision Tree by reducing

overfitting and improving predictive performance, making it more suitable for large-scale

sentiment analysis tasks. SVM and Logistic Regression also offer competitive results,

particularly in high-dimensional spaces or when a strong baseline model is needed. This

analysis highlights the importance of selecting the appropriate model based on the specific

goals of the sentiment analysis task. While the Decision Tree offers valuable interpretability,

which is crucial in certain applications, other models may be better suited for scenarios

requiring higher accuracy and robustness. Therefore, a balanced approach that considers both

interpretability and performance is essential when deploying sentiment analysis models on

social media data.

Overall, this project underscores the critical role of model selection in sentiment analysis and

suggests that while Decision Trees have their place, ensemble methods like Random Forest

often provide superior outcomes in real-world applications. As sentiment analysis continues

to evolve, leveraging the strengths of different models will be key to effectively

understanding and acting upon the vast amounts of data generated on platforms like Twitter.

5.3 Contribution to knowledge

This project contributes to the growing field of sentiment analysis by providing a

comprehensive evaluation of the Decision Tree model in classifying sentiments expressed in

Twitter data. Through the meticulous implementation and comparison of the Decision Tree

with other prominent machine learning models such as Random Forest, Support Vector

Machine (SVM), and Logistic Regression, this study offers valuable insights into the

strengths and limitations of different approaches in the context of social media analytics. This

project contributes to the field of sentiment analysis by providing the following insights and

advancements:

I. Evaluation of Decision Tree Model: Offers a detailed assessment of the Decision

Tree model’s effectiveness in classifying sentiments from Twitter data. Demonstrates

the model’s strengths in scenarios requiring high interpretability.

II. Comparative Analysis: Compares the Decision Tree model with other popular

machine learning models, including Random Forest, Support Vector Machine (SVM),

and Logistic Regression. Highlights the trade-offs between interpretability (Decision

Tree) and higher accuracy/generalization (Random Forest and SVM).

III. Model Selection Guidance: Provides practical guidance on selecting the appropriate

machine learning model based on specific sentiment analysis tasks. Emphasizes the

importance of balancing interpretability and predictive performance.

IV. Impact of Preprocessing Techniques: Demonstrates the critical role of text

preprocessing steps (e.g., text cleaning, tokenization, and TF-IDF vectorization) in

enhancing model performance. Shows how these techniques improve the quality of

input data and the effectiveness of sentiment classification.

V. Framework for Social Media Analytics: Develops a practical framework for

implementing and evaluating machine learning models for sentiment analysis on

Twitter data. Serves as a reference for future research and practical applications in

social media analytics.

VI. Insights into Ensemble Methods: Validates the superiority of ensemble methods like

Random Forest over single models like Decision Tree in terms of accuracy and

generalization. Suggests ensemble methods as preferable options for large-scale

sentiment analysis tasks.

VII. Contribution to Sentiment Analysis Techniques: Enhances understanding of the

application of machine learning in sentiment analysis, particularly in the context of

social media platforms. Contributes to the development of more accurate and

interpretable models for analyzing public opinion on Twitter.

VIII. Real-World Application Relevance: Offers insights that are directly applicable to

real-world sentiment analysis, particularly for businesses, researchers, and

policymakers interested in social media trends. Supports decision-making processes

with a better understanding of public sentiment on platforms like Twitter.

Overall, this project advances knowledge of how machine learning can be applied to sentiment

analysis of Twitter data. It provides a practical framework for selecting and implementing

models according to the needs of the analysis, balancing interpretability against accuracy.

The framework can serve as a reference for future research on sentiment analysis and on its

use in understanding public opinion and trends on social media platforms.
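The preprocessing steps named in contribution IV (text cleaning, tokenization, and TF-IDF vectorization) can be illustrated with a minimal sketch. It uses only the Python standard library and is not the project's actual pipeline: the stopword list, the sample tweets, and the function names `clean_and_tokenize` and `tfidf_vectors` are assumptions made for illustration, and the weighting shown is one common TF-IDF variant, tf × (ln(N/df) + 1).

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real project would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "this", "to", "and", "of", "i"}


def clean_and_tokenize(tweet):
    """Lowercase a tweet, strip URLs, mentions and '#', then drop stopwords."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove user mentions
    text = text.replace("#", " ")              # keep the hashtag word, drop the symbol
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text)  # alphabetic tokens, contractions kept
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectors(token_lists):
    """Return one {term: weight} dict per document, weight = tf * (ln(N/df) + 1)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1      # avoid division by zero for empty tweets
        vectors.append({
            term: (count / total) * (math.log(n_docs / doc_freq[term]) + 1.0)
            for term, count in counts.items()
        })
    return vectors


# Made-up example tweets, not taken from the project's dataset.
tweets = [
    "I love this phone! https://example.com #happy",
    "@support this update is terrible",
    "Love the camera, love the screen",
]
tokens = [clean_and_tokenize(t) for t in tweets]
vectors = tfidf_vectors(tokens)
```

In the first tweet, "phone" receives a higher weight than "love" because "love" also occurs in another tweet, which is the behaviour TF-IDF is meant to provide: terms that are distinctive for a tweet count for more than terms shared across many tweets.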

5.4 Recommendation

Decision Tree models offer an interpretable and efficient approach to sentiment analysis.

Applying them to Twitter data, however, raises specific considerations because tweets are

short and language use on the platform changes quickly. The following recommendations

address these considerations.

I. Choose the Appropriate Model Based on Specific Needs: When applying sentiment

analysis to Twitter data, select the algorithm that best suits your requirements. If

interpretability is crucial, for example when decisions must be explained to

stakeholders, consider the Decision Tree model. If the priority is the highest accuracy

and robustness, particularly on large and complex datasets, prefer Random Forest or

Support Vector Machine (SVM).

II. Optimize Hyperparameters for Better Performance: Tune the hyperparameters of each

implemented algorithm to get the most out of it. For example, the maximum depth of the

Decision Tree or the number of trees in the Random Forest can significantly affect

model performance. Use grid search or randomized search to find the best parameter

values for your specific dataset.

III. Regularly Re-train Models on Updated Data: Twitter data is highly dynamic, with

trends and language usage evolving rapidly. To maintain model relevance and

accuracy, regularly re-train your models on new data. This will help your models

adapt to the latest trends and changes in sentiment expression, ensuring that the

analysis remains current and accurate.

IV. Combine Models for Enhanced Performance: Consider combining multiple models

(e.g., Decision Tree, Random Forest, SVM) to leverage their complementary

strengths. For instance, an ensemble method that aggregates predictions from different

models can provide more robust and accurate results than any single model alone.

V. Focus on Data Preprocessing: The quality of your sentiment analysis results heavily

depends on the preprocessing of the Twitter data. Pay careful attention to steps like

text cleaning, tokenization, stopword removal, and TF-IDF vectorization. Proper

preprocessing is crucial to transforming raw tweets into a format that the algorithms

can effectively analyze.

VI. Monitor Model Performance Over Time: Continuously monitor the performance of

your deployed models, particularly in production environments where sentiment

analysis results inform decision-making. Use metrics like accuracy, precision, recall,

and F1-score to track performance, and be prepared to update or adjust models as

necessary based on these evaluations.

VII. Consider Model Interpretability in Sensitive Applications: In applications where

decisions based on sentiment analysis may have significant consequences (e.g., in

financial markets, healthcare, or legal contexts), consider the interpretability of the

model. The Decision Tree model, while potentially less accurate than ensemble

methods, offers transparency that might be critical in these situations.
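Recommendations IV and VI can be illustrated together with a small sketch: the predictions of several models are combined by hard (majority) voting, and the result is scored with accuracy, precision, recall, and F1-score. The code uses only the Python standard library. The label lists are made-up values standing in for the outputs of a Decision Tree, a Random Forest, and an SVM, and the helper names `majority_vote` and `binary_metrics` are hypothetical; this is an illustration under those assumptions, not the evaluation carried out in this project.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by hard (majority) voting.

    With an odd number of models and two classes there are no ties; otherwise
    ties go to the label that was encountered first among the votes.
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


def binary_metrics(y_true, y_pred, positive="positive"):
    """Accuracy, precision, recall and F1-score for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels and predictions for six tweets (illustrative only).
P, N = "positive", "negative"
y_true = [P, N, P, P, N, N]
tree_pred = [P, P, P, N, N, N]    # stands in for a Decision Tree
forest_pred = [P, N, P, P, N, P]  # stands in for a Random Forest
svm_pred = [P, N, N, P, N, P]     # stands in for an SVM

ensemble_pred = majority_vote([tree_pred, forest_pred, svm_pred])
ensemble_scores = binary_metrics(y_true, ensemble_pred)
tree_scores = binary_metrics(y_true, tree_pred)
```

Recomputing these scores on each newly labelled batch of tweets and comparing them with earlier values is one simple way to notice when a deployed model needs re-training.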

REFERENCES

Zhang, Y., Song, D., Zhang, P., Li, X., & Wang, P. (2019). A quantum-inspired

sentiment representation model for Twitter sentiment analysis. Applied

Intelligence, 49(1), 1–16.

Aslam, S. (2024, January 11). Twitter by the numbers: Stats, demographics &

fun facts. Omnicore Agency. https://www.omnicoreagency.com/twitter-statistics

Nazir, F., Ghazanfar, M. A., Maqsood, M., Aadil, F., Rho, S., & Mehmood, I.

(2019). Social media signal detection using tweets volume, hashtag, and sentiment

analysis. Multimedia Tools and Applications, 78(3), 3553–3586.

Kraiem, M. B., Feki, J., Khrouf, K., Ravat, F., & Teste, O. (2015). Modeling

and OLAPing social media: The case of Twitter. Social Network Analysis

and Mining, 5(1), 47.

Furini, M., & Montangero, M. (2018). Sentiment analysis and Twitter: A game

proposal. Personal and Ubiquitous Computing, 22(4), 771–785.

Harakawa, R., Takehara, D., Ogawa, T., & Haseyama, M. (2018). Sentiment-

aware personalized tweet recommendation through multimodal FFM.

Multimedia Tools and Applications, 77(14), 18741–18759.

Bukhari, A., Qamar, U., & Ghazia, U. (2017). URWF: User reputation based

weightage framework for Twitter micropost classification. Information

Systems and e-Business Management, 15(3), 623–659.

Bennacer, N. S., Bugiotti, F., Hewasinghage, M., Isaj, S., & Quercini, G.

(2018). A frequent named entities-based approach for interpreting

reputation in Twitter. Data Science and Engineering, 3(2), 86–100.

Appel, O., Chiclana, F., Carter, J., & Fujita, H. (2018). Successes and challenges

in developing a hybrid approach to sentiment analysis. Applied

Intelligence, 48(5), 1176–1188.

Nguyen, H. L., & Jung, J. E. (2017). Statistical approach for figurative

sentiment analysis on social networking services: A case study on Twitter.

Multimedia Tools and Applications, 76(6), 8901–8914.

Budiharto, W., & Meiliana, M. (2018). Prediction and analysis of Indonesia

presidential election from Twitter using sentiment analysis. Journal of

Big Data, 5(1), 51.

Kumar, A., & Garg, G. (2019). Systematic literature review on context-based

sentiment analysis in social multimedia. Multimedia Tools and

Applications, 79, 1–32.

Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion:

Twitter sentiment and socio-economic phenomena. Proceedings of the

Fifth International Conference on Weblogs and Social Media (ICWSM

11), 450–453.

Hasan, M., Agu, E., & Rundensteiner, E. (2014). Using hashtags as labels for

supervised learning of emotions in Twitter messages. In ACM SIGKDD

Workshop on Health Informatics, New York, USA.

Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical

machine learning tools and techniques (4th ed.). Morgan Kaufmann.

Gamallo, P., & Garcia, M. (2014). Citius: A naive-Bayes strategy for sentiment

analysis on English tweets. In Proceedings of the 8th International

Workshop on Semantic Evaluation (SemEval 2014) (pp. 171–175).

Dublin, Ireland.

Neethu, M. S., & Rajashree, R. (2013). Sentiment analysis in Twitter using

machine learning techniques. In Proceedings of the 4th International

Conference on Computing, Communications and Networking

Technologies (ICCCNT 2013) (pp. 1–5). IEEE.

Ayandeyi, A. (2021). Sentiment analysis of Twitter data to analyze the effect of

COVID-19.

Bollegala, D., Weir, D., & Carroll, J. (2013). Cross-domain sentiment

classification using a sentiment sensitive thesaurus. IEEE Transactions on

Knowledge and Data Engineering, 25(8), 1719–1731.

Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using

subjectivity summarization based on minimum cuts. In Proceedings of

the 42nd Annual Meeting of the Association for Computational

Linguistics (ACL-04) (pp. 271–278).

Peddinti, V. M. K., & Chintalapoodi, P. (2011). Domain adaptation in sentiment

analysis of Twitter. In Proceedings of the Analyzing Microtext Workshop,

AAAI.

Shangaui, B., Sheela, R., & Sudalai Manikandan, V. (2021). Twitter sentiment

analysis using machine learning.

Gautam, G., & Yadav, D. (2014, August 7-9). Sentiment analysis of Twitter data

using machine learning approaches and semantic analysis. 2014 Seventh

International Conference on Contemporary Computing (IC3).

Panto, M., Antony, M., Muhssina, K. M., Johny, N., James, V., & Wilson, A.

(2016). Product rating using sentiment analysis. 2016 International

Conference on Electrical, Electronics, and Optimization Techniques

(ICEEOT).

Gupta, B., Negi, M., Vishwakarma, K., Rawat, G., & Badhani, P. (2017). Study

of Twitter sentiment analysis using machine learning algorithms on

Python. International Journal of Computer Applications, 165(9), 29–34.

Tan, S., Li, Y., Sun, H., Guan, Z., & Yan, X. (2014). Interpreting the public

sentiment variations on Twitter. IEEE Transactions on Knowledge and

Data Engineering, 26(5), 1158–1170.

Kumar, S., Singh, P., & Rani, S. (2016, September 7-9). Sentimental analysis of

social media using R. 2016 5th International Conference on Reliability,

Infocom Technologies and Optimization (ICRITO) (Trends and Future

Directions).

Jain, A. P., & Katkar, V. D. (2015, December 16-19). Sentiments analysis of

Twitter data using data mining. 2015 International Conference on

Information Processing (ICIP).
