Michael Final Project
CHAPTER ONE
INTRODUCTION
Twitter is one of the most popular social networking sites, with more than 368 million monthly
active users and 500 million tweets per day (Aslam, 2024). Tweets are short text messages of at
most 280 characters (originally 140), but they are a powerful means of expressing emotional
states and feelings to a community of friends (Aslam, 2024). With the fast development of the
Internet, the number of people expressing their views on online platforms (namely discussion
forums, social networks and blogs) by posting comments and publishing blogs is increasing, and
millions of pages are posted on the Internet every day. Thus, the Internet has become a most
important source from which people obtain information for making decisions (Zhang et al., 2019).
Amongst social networks, Twitter is the best recognized and the third most visited website on
the globe (Aslam, 2024). It allows a user to follow other users, read the tweets they post and
post messages of their own (Kraiem et al., 2015). Traditionally, market research companies have
gathered opinions through questionnaires, interviews and forms, which consume much time and are
expensive. To control time and cost, researchers nowadays propose methodologies that concentrate
on content posted on Twitter (Furini and Montangero, 2018). On Twitter, more than 300 million
active users share and post information daily via tweets with multimedia content (Harakawa et
al., 2018). Twitter has 465 million user accounts and 250 million visitors per month, with an
average of 100,000 micro-posts per minute, approximately 175 million tweets and around 8
terabytes of data every day (Bukhari et al., 2017). Any sort of information can be posted and
shared; in addition, it is feasible to filter tweets
associated with a product, a person, an organization or other units of interest (Bennacer et al.,
2018).
Sentiment analysis has lately become a field of intensive research. Today's data are too
voluminous and intricate for purely manual analysis. Consequently, there is a need for automated,
computer-based capabilities that can tell whether tweets or sentences carry factual information
or opinions. Sentiment analysis is the task undertaken to discover the emotion of a corpus
without direct communication. It is also referred to as mood extraction, opinion mining, or
emotion analysis, and it typically involves ascertaining the polarity of a review as negative,
positive, or neutral according to the expressed opinion.
With the rapid growth of the World Wide Web, people increasingly use social media such as Twitter,
which generates large volumes of opinion text in the form of tweets that are available for
sentiment analysis. From a human viewpoint this is a huge volume of information, which makes it
difficult to read the tweets, analyze them tweet by tweet, summarize them and organize them into
an understandable format in a timely manner. Informal language refers to the use of colloquialisms
and slang in communication, employing the conventions of spoken language such as contracting
'would not' to 'wouldn't'. Not all systems are able to detect sentiment in informal language, and
this can hamper the analysis and decision-making process.
Sentiment analysis, the process of automatically identifying the emotional tone of text, is a
powerful tool for understanding public opinion and social media trends. However, analyzing
sentiment on Twitter data presents unique challenges compared to more formal text. This
project explores these challenges and how decision tree models can offer some solutions.
1.2 Statement of the Problem
One major hurdle is the sheer informality of Twitter language. Slang, emojis, and sarcasm
are commonplace, often confusing traditional sentiment analysis models. Large datasets can
further complicate matters, as ensuring consistent quality across millions of tweets can be
difficult. Decision trees can help mitigate this issue by being trained on vast amounts of
tweets, allowing them to learn the nuances of informal language and to identify sarcasm more reliably.
Another challenge lies in capturing the subtleties of human communication. Nuance and
context are crucial for accurate sentiment analysis. For instance, a seemingly positive tweet
might be laced with sarcasm, while understanding the context of a reply or the referenced
topic can completely alter the sentiment. While some models struggle with this, decision
trees can be structured to consider the relationships between words and phrases. For
example, the presence of “not” before an adjective could be a clue for sarcasm. This allows
the model to grasp some level of context and make more nuanced classifications.
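To make this concrete, the following is a minimal, illustrative sketch (not the project's actual implementation) of how a hand-crafted negation cue such as the word "not" can be added as an extra feature alongside word counts before fitting a decision tree. It assumes scikit-learn and NumPy, and the toy tweets and labels are made up.

```python
# Illustrative sketch: a tiny decision tree fed with bag-of-words counts plus a
# hand-crafted "contains 'not'" feature (all data here is made up).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier
import numpy as np

tweets = ["I am happy with this phone", "not happy with this phone at all",
          "great service, really great", "this is not great, honestly"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

vec = CountVectorizer()
X_words = vec.fit_transform(tweets).toarray()
# extra column flagging whether the tweet contains the word "not"
has_not = np.array([[1 if " not " in f" {t.lower()} " else 0] for t in tweets])
X = np.hstack([X_words, has_not])

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, labels)
print(clf.predict(X))  # reproduces the toy labels
```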
Furthermore, many existing sentiment analysis methods offer limited categories, often just
positive, negative, and neutral. This fails to capture the full spectrum of emotions people
express on Twitter. Decision trees, however, can be designed to output more granular
classifications. One could create a model that differentiates between happiness, anger,
sadness, or even more specific emotions relevant to a particular analysis. This allows for a
richer and more detailed picture of public opinion.
Many methods only analyze the overall sentiment of a tweet, neglecting the possibility of
mixed emotions within a single post. A tweet might begin with frustration but end on a
hopeful note. Decision trees, with their branching structure, can handle such complexities.
The model can follow different paths based on keywords or sentiment indicators, leading to a
more accurate classification that reflects the tweet’s full emotional range.
Sentiment analysis on Twitter data faces challenges due to informal language, the importance
of context, limited sentiment categories, and the complexity of mixed emotions. Decision
tree models offer a valuable tool to address these issues. Their ability to handle informal
language, capture some context, provide more granular sentiment classifications, and identify
mixed emotions makes them a powerful asset for analyzing the ever-evolving world of
Twitter.
1.3.1 Aim
The aim of this project is to develop a decision tree model for the sentiment analysis of
Twitter data.
iii. Evaluate and compare the implemented model with related methods.
The significance of this project lies in developing and evaluating a decision tree model for the
sentiment analysis of Twitter data and comparing its performance with related methods.
For the remainder of the study, the following is planned: the related works are presented in
Chapter Two. Chapter Three presents the methodology. Chapter Four provides an overview of
the implementation, evaluation and results. The work is concluded in Chapter Five, together with
suggestions for future work.
Twitter Data: Twitter is a social media platform where users post short messages called
tweets.
Sentiment Analysis: This is the task of understanding the emotional tone of a piece of text.
It’s about figuring out whether a statement expresses a positive, negative, or neutral
sentiment.
Decision Tree Model: This refers to a machine learning algorithm that works by asking a series of
questions about the data and following the answers down branches until it reaches a prediction.
Machine Learning: Machine learning is a field of artificial intelligence (AI) that focuses on
developing computer algorithms that can improve performance on a specific task over time
through experience.
CHAPTER TWO
LITERATURE REVIEW
Tweets commonly make use of three notations: hashtags (#), retweets (RT) and account Ids (@).
Due to the large size of this data, sentiment analysis is chosen as a
technique to analyze this data due to the ease with which users' opinions can be determined.
Consumers do not need to ask other people about the quality of a product, as the answers are
readily available to them. Sentiment analysis finds its applications not only in product reviews
but also in social media and news articles. The results obtained from sentiment analysis are
useful for making informed decisions.
Generally, sentiment analysis is carried out at three different levels: document level, sentence
level and aspect (entity) level. For the sentiment analysis of Twitter data, sentence-level
analysis is usually done because tweets are short, and document-level classification may not be
suitable and valid for some data sets. Aspect-level analysis deals with the individual entities
mentioned in a statement such as "... but the music is bad!"; from such a statement, it can be
noted that the item being reviewed and the music are the two entities/aspects included in the
sentiment.
There are many approaches to sentiment analysis; the main approaches are highlighted below:
I. Lexicon-based Approach
The lexicon-based approach scores a text against predefined lists of opinion words; its results
and high-value predictions can guide better decisions. In cascaded combinations of classifiers,
when one of the classifiers fails, the next one classifies, and so on.
II. Machine Learning Approach

Classification by machine learning is a two-step process in which the first step is the learning
phase (or training phase), where a classification model is built
from a training set made up of database tuples and their associated class
labels. The class label is a discrete value where each value serves as a
class or category (Pang & Lee, 2008). The second step is to use the model for classification,
for which a test set made up of test tuples and their associated class labels is used. The
test tuples are randomly selected from the general data set and are
independent of the training tuples. Classifier accuracy for a test set is the
percentage of test set tuples that are correctly classified by the
classifier. The associated class label of each test tuple is compared with
the learned classifier's class prediction for that tuple (Sebastiani, 2002).
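As a tiny worked illustration of this second, evaluation step (the labels and predictions below are made up), accuracy is simply the fraction of test tuples whose predicted class matches the known class:

```python
# Minimal sketch of the evaluation step described above, with made-up labels.
y_true = ["pos", "neg", "pos", "neg", "pos"]   # known labels of the held-out test tuples
y_pred = ["pos", "neg", "neg", "neg", "pos"]   # the learned classifier's predictions

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = 100.0 * correct / len(y_true)
print(f"Classifier accuracy on the test set: {accuracy:.1f}%")  # 80.0%
```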
The machine learning approach with classification algorithms involves two major phases, which are
a training phase and a testing phase.
I. Supervised Learning
In supervised learning, labelled examples are provided; the model is trained on these examples
and, with the solutions available, learns the mapping from inputs to outputs. A classifier is
developed which uses the labelled data to train and then to classify new, unseen data. Common
algorithms present in supervised learning are logistic regression,
random forests, and Naïve Bayes classifiers (Pang & Lee, 2008).
simple labelling and the low dimension of the problem; this model is suitable only for certain
kinds of datasets.
There are several algorithms that can be used for text classification. The algorithms used for
classification in this work are described below.
I. Naïve Bayes
The Naïve Bayes classifier can be viewed as a mixture model in which each class
is one of the components of the mixture itself (Pang & Lee, 2008).
Given a tweet, the classifier assigns its attribute vector to the class with the highest posterior
probability. There are two different ways Naïve Bayes can be set up for text: the multinomial
model and the multivariate Bernoulli model.
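As an illustration of the multinomial setup just described, the following minimal sketch (scikit-learn assumed; the tweets and labels are made up) builds word-count vectors and fits a Multinomial Naïve Bayes classifier, which then assigns the class with the highest posterior probability:

```python
# Hedged sketch: Multinomial Naive Bayes on toy tweets (not the project's data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["love this movie", "hate this movie", "love love love it", "awful, hate it"]
labels = ["positive", "negative", "positive", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(tweets)          # word-count feature vectors
clf = MultinomialNB().fit(X, labels)   # assigns the class with the highest posterior
print(clf.predict(vec.transform(["i love it"])))  # -> ['positive']
```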
II. Random Forest

The random forest is an ensemble of decision trees, each acting as a predictor (Hassan, 2013).
The main goal of the random forest is to combine many trees so that the overall prediction is
more accurate and less prone to overfitting than that of a single tree.
While implementing the random forest, the forest classifier is
fitted with two arrays: one holding the training data and the other holding the target labels of
the training samples.
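A minimal sketch of fitting a random forest with those two arrays (scikit-learn assumed; the feature vectors and labels below are toy values, not the project's data):

```python
# Sketch of fitting a random forest with two arrays: training samples and target labels.
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X_train = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])  # toy feature vectors
y_train = np.array([1, 0, 1, 0])                      # toy target labels

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)        # each tree votes; the majority class wins
print(forest.predict([[1, 0]]))     # -> [1]
```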
III. Gradient Boosting

Gradient boosting is an ensemble machine learning algorithm which implements the boosting
technique: new models are added to correct the errors made by existing models. These new models
are created to predict the residuals or errors of prior models, and gradient descent is used to
minimise the loss when adding new models. This approach supports both regression and
classification problems.
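A minimal sketch of this boosting idea (scikit-learn's GradientBoostingClassifier assumed; the data and hyperparameters are illustrative only):

```python
# Sketch of gradient boosting: trees are added sequentially, each fitted to the
# errors (residuals) of the current ensemble while minimising a loss function.
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np

X = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y = np.array([0, 1, 1, 0])

gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=0)
gb.fit(X, y)
print(gb.predict([[1, 1]]))  # -> [1]
```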
IV. Support Vector Machine

The support vector machine (SVM) classifies a sample X with a decision function of the form

G(X) = wᵀφ(X) + b    (2)

where 'X' is a feature vector, 'w' is a weight vector and 'b' is a bias term. The function φ(X)
maps the input into a higher-dimensional feature space, and both 'w' and 'b' are learned
automatically from the training data.
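The sketch below (scikit-learn assumed, toy data made up) fits a linear SVM; with a linear kernel, the learned coefficients and intercept play the roles of w and b in equation (2), and decision_function returns the value of G(X):

```python
# Sketch of a linear SVM; decision_function(X) returns w·x + b, i.e. G(X) in (2).
from sklearn.svm import LinearSVC
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0, 1, 1, 0])

svm = LinearSVC(C=1.0).fit(X, y)
print(svm.coef_, svm.intercept_)            # learned w and b
print(svm.decision_function([[1.0, 1.0]]))  # signed value of G(X)
```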
V. Decision Tree
The decision tree classifier is a supervised learning algorithm which can be
adapted to almost any type of data. It divides the training data into
small parts in order to identify patterns that can then be used for classification when new
predictions are made. It consists of a root node, decision nodes and leaf nodes. The root
node represents the entire data set, and each decision node performs a test on a feature. The
tree learns which decisions are to be made in order to split
the labelled data into its classes. Passing the data through the tree, each decision node
produces a two-way split in the tree. The data will eventually be passed through
these decision nodes until it reaches a leaf node, which represents the predicted class.
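To illustrate the root/decision/leaf structure just described, the following minimal sketch (scikit-learn assumed; the tweets and labels are made up) trains a small decision tree on count vectors and prints the learned splits:

```python
# Hedged sketch of a decision tree on toy tweet vectors; export_text prints the
# learned root node, decision nodes and leaf nodes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

tweets = ["good phone", "bad phone", "good camera good price", "bad battery"]
labels = ["positive", "negative", "positive", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(tweets)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=vec.get_feature_names_out().tolist()))
```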
VI. Maximum Entropy

The maximum entropy (MaxEnt) classifier handles overlapping features and is similar to logistic
regression in that it finds a distribution over classes. It also follows certain feature
constraints. In its formulation, 'C' is the class, 'd' is the sentence and λ is the weight vector
of the feature functions.
VII. Neural Networks

Neural networks can also be applied to the required classification tasks (Pang & Lee, 2008). They
are analogous to the neurons of the human brain: the networks are fully connected graphs which
associate each node with an input value and each edge with a weight, both of which are initially
set at random, and each node computes a weighted sum of its inputs.
Non-linear behaviour is made possible by activation functions; ReLU (the rectified linear unit)
is used to obtain non-linear activations by outputting the input when it is positive and zero
otherwise. Labelled examples are used to train the neural network. As the solutions to the inputs
are already known, the neural network learns from these examples so that it can give the expected
outputs. The remaining examples are used to test the neural network and check how well it gives
out the desired predictions when tested on new data which is different from the training dataset.
The ratio of the number of correct predictions to the total number of test examples indicates how
well the network generalizes.
Several neural network architectures are used for text classification:
a) Convolutional neural networks: Convolutional neural networks (CNNs) differ from standard
networks in that they take the input in the form of a two-dimensional array and use
shared weights. CNNs use pooling layers which help in creating reduced layers that summarise the
extracted features.
b) Long short-term memory networks: An LSTM cell consists of three gates, which are an input gate
to read in the input, an output gate to write out the output to the next layers, and a forget
gate to discard information that is no longer needed. The gates control how the present input
matters for forming the new memory and how the prior memories matter in designing the new memory
along with the current input.
c) CNN-LSTM: The CNN Long Short-term Memory Network (CNN-LSTM) combines convolutional and LSTM
layers. In order to learn about the ordering of the input text, the convolution layer extracts
the local features and the LSTM layer then learns the ordering of the said features.
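As an illustrative sketch of such a CNN-LSTM stack (TensorFlow/Keras is assumed here, and all layer sizes and the vocabulary size are arbitrary choices, not values from this work):

```python
# Illustrative CNN-LSTM sketch: Conv1D extracts local n-gram features and the
# LSTM learns their ordering. All hyperparameters below are placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

vocab_size = 10000   # hypothetical vocabulary size
model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64),        # word embeddings
    Conv1D(filters=64, kernel_size=3, activation="relu"),  # local feature extraction
    MaxPooling1D(pool_size=2),
    LSTM(64),                                              # learns the ordering of features
    Dense(1, activation="sigmoid"),                        # positive / negative output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```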
In the recent past, a lot of work has been done in the field of sentiment analysis by many
researchers; in fact, work in this field started at the beginning of the century and has
increasingly turned to data from Twitter to overcome the difficulty of getting feedback by
traditional means and to improve prediction accuracy. Bhumika et al. (2017) used a support vector
machine (SVM) for the classification and analysis of Twitter data; however, SVM requires a long
training time on large data sets and requires a good kernel function, which is not easy to
choose.
Tan et al. (2014) proposed LDA-based models to interpret topics and RCB-LDA to find out the
reasons why public sentiment has changed towards a target. The advantage is that the approach
processes the foreground topics effectively and removes noisy data accurately, and it finds the
reasons for sentiment variation more effectively than baseline techniques.
Some approaches first filter out opinionated tweets. Naïve Bayes is a simple classifier and works
well for text classification when manually labelled data is used as training data; however,
manually labelled data requires more effort to prepare and correct, and more physical space to
keep, which makes it costly for large-scale social media. The lexicon-based approach uses a
dictionary of opinion words: a sentiment score is calculated for each word in a tweet and the sum
of them is the overall sentiment score (and Vijay, 2015). As the size of the dictionary increases,
this approach becomes more erroneous.
Pak and Paroubek (2010) proposed a model to classify tweets as objective, positive or negative,
using features such as N-grams and POS-tags. The training set they used was limited, and they
found that the Naive Bayes classifier worked much better than the other classifiers they
considered. Other researchers built models using Naive Bayes, MaxEnt and Support Vector Machines
(SVM); their feature space consisted of unigrams, bigrams and POS tags, and they concluded that
SVM outperformed the other models and that unigrams were the most effective features. A related
two-phase approach first separated tweets into objective and subjective, and then in the second
phase the subjective tweets were classified as positive or negative.
Bifet and Frank (2010) used Twitter streaming data provided by the Firehose API, which gave all
publicly available messages from every user, and found that an SGD-based model, when used with an
appropriate learning rate, performed well. Another study compared three models: a unigram model,
a feature-based model and a tree kernel-based model. For the tree kernel-based model, tweets were
represented as a tree; the feature-based model used 100 features and the unigram model used a
much larger feature set. They found that features combining prior word polarity with POS tags are
the most important and play a major role in the classification task. Other work has used single
words, n-grams and patterns as different feature types, which are combined into a single feature
vector for classification.
Po-Wei Liang et al. (2014) used the Twitter API to collect Twitter data. Their data is labeled as
positive, negative and non-opinions, and the tweets containing opinions were filtered before a
Naive Bayes classifier was employed. They also eliminated useless features by using mutual
information-based feature selection.
Pablo et al. (2016) presented variations of Naive Bayes classifiers for detecting the polarity of
English tweets: a baseline variant in which the relationships between words were not considered
at all, and a second variant in which word-level polarity values are determined and those values
are united with some aggregation functions.
Kamps et al. (2019) used the lexical database WordNet to determine the semantic orientation of
adjectives.
CHAPTER THREE
METHODOLOGY

3.1 Dataset Description
The dataset consists of three columns:
I. Id: A unique identifier for each tweet.
II. Label: The sentiment label assigned to each tweet, which the model learns to predict.
III. Tweet:
This column contains the actual text of the tweets. The tweets contain informal language, hashtags,
user mentions, emojis, and URLs. This makes the sentiment analysis task more challenging. The
tweets are generally
short and to the point, adhering to Twitter's 280-character limit, which keeps the texts concise.
The dataset is complete, with no missing entries across the 31,962 rows,
ensuring that there are no null values in the `id`, `label`, or `tweet` columns.
I. Hashtags: Keywords prefixed with the '#' symbol.
II. Mentions: References to other users prefixed with the '@' symbol.
III. Emojis: Pictorial symbols used to convey emotion.
IV. URLs: Links to external content.
V. Abbreviations and Slang:
Social media platforms like Twitter are rife with informal language, abbreviations and slang,
which the analysis has to cope with. Sentiment analysis of such data has many practical uses:
businesses can identify positive feedback, companies can predict how people will react to product
releases, and analysts can gauge public opinion on political matters.
While the labels in this dataset are specific to offensive content, they align with the broader
task of classifying sentiment in Twitter data, and the same approach can be applied across a
range of industries.
Figure 3.1 Architectural model for sentiment analysis of Twitter data.
3.2.1 Dataset
The dataset is a very important component of the model and is required for training and
evaluation. Loading the dataset brings the data into the environment where the analysis will be
performed.
This step ensures that the data is ready for further processing and
analysis.
The first step is to load the data into a structured format; this can be
done using a library like `pandas`, which allows for efficient data
manipulation.
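A minimal loading sketch is shown below. The file name `tweets.csv` is hypothetical; the column names follow the dataset description in 3.1.

```python
# Sketch: load the dataset into a pandas DataFrame and sanity-check it.
import pandas as pd

df = pd.read_csv("tweets.csv")    # hypothetical path; expected columns: id, label, tweet
print(df.shape)                   # e.g. (31962, 3)
print(df.isnull().sum())          # confirm there are no missing values
print(df.head())
```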
3.2.2 Data Preprocessing
Preprocessing is essential to prepare the raw text data for the model.
Since machine learning models work with numerical data, converting the
text into vectors is a key step. The data preprocessing level is divided into
two:
I. Text Vectorization: The raw tweet text is converted into numerical feature vectors that the
model can process.
II. Target Extraction: The target extraction isolates the labels that the model will learn to
predict.
Splitting the data ensures that the model is evaluated on data it hasn't
seen during training. This is critical for assessing the model's ability to generalize, so the
data is divided into training and validation sets.
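The sketch below shows one way these preprocessing steps could be wired together, assuming scikit-learn, the DataFrame `df` from the loading sketch above, and TF-IDF as the vectorization scheme (the exact vectorizer used in the original work is not specified):

```python
# Sketch: text vectorization, target extraction and an 80/20 train/validation split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(max_features=5000)   # text vectorization (TF-IDF assumed)
X = vectorizer.fit_transform(df["tweet"])
y = df["label"]                                   # target extraction

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # hold out 20% for validation
```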
The training step is where the model actually "learns" from the data. The
decision tree classifier will create a model that can be used to predict
sentiments based on the features extracted from the tweets. This training step produces the
fitted model. During training, the model learns patterns and relationships between the features
and the sentiment labels.
I. Prediction:
Use the trained model to predict sentiment labels for the validation
data (`X_val`).
II. Metric Calculation: By computing metrics such as accuracy, precision,
recall, and F1-score, we can determine how well the model is likely to perform on unseen data.
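Continuing from the split sketched above (scikit-learn assumed), training, prediction and metric calculation can be illustrated as follows:

```python
# Sketch: train the decision tree, predict on the validation set, compute metrics.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                  # the model "learns" from the training data

y_pred = model.predict(X_val)                # predict labels for the validation data
print("Accuracy:", accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred))  # precision, recall and F1-score per class
```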
3.4 Algorithm
Start
- Load the dataset and clean the tweets to extract features.
- Vectorize the tweet text into a feature matrix `X` and extract the label vector.
- Split the data into training and validation sets using an 80/20 split.
Initialize and Train the Model
- Fit the decision tree classifier on the training set.
- Use the trained model to predict sentiment labels on the validation set
(`X_val`).
- Compute the evaluation metrics (accuracy, precision, recall and F1-score).
End
3.5 Flowchart
Figure 3.2 Flowchart for the sentiment analysis of Twitter data.
I. Start: The process begins by loading the Twitter dataset.
II. Data Preprocessing: This is a critical step where the raw data is cleaned and prepared for
vectorization.
III. Data Splitting: An 80/20 split is used, where 80% of the data is for training and the
remaining 20% is for validation.
IV. Model Training: In this step, the models are trained using the training dataset; the models
learn the patterns that link the extracted features to the sentiment labels.
V. Evaluation: A classification report with precision, recall and F1-score is printed, detailing
the performance on each class.
VI. End: The process ends once the results have been reported.
CHAPTER FOUR
IMPLEMENTATION, EVALUATION AND RESULTS
These are the lines of code at the beginning of a script where you import
external libraries or modules that are required for your script to run. Each
import statement allows you to use functions, classes, and methods from those libraries.
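A representative set of imports for this kind of script is shown below; the exact libraries in the original source code are not reproduced here, so this is an assumption based on the tools referenced elsewhere in the report (pandas and scikit-learn-style APIs):

```python
# Illustrative imports for a tweet sentiment-analysis script (assumed, not the
# project's verbatim source code).
import re                                   # regular expressions for cleaning tweets
import pandas as pd                         # loading and manipulating the dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
```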
In this stage the dataset is imported into the modelling environment; this was implemented by
loading the data with `pandas`.
Twitter data typically contains noise in the form of URLs, mentions, hashtags and special
characters, so preprocessing is applied to clean the data and convert it into a usable format for
the model. It includes the cleaning and vectorization steps described in Chapter Three.
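A hedged sketch of such a cleaning step is shown below; it assumes the DataFrame `df` from the loading stage, and the regular expressions are one reasonable choice rather than the project's exact rules:

```python
# Sketch: remove URLs, @mentions, the '#' symbol and non-letters, then lowercase.
import re

def clean_tweet(text: str) -> str:
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)               # remove @mentions
    text = re.sub(r"#", "", text)                   # keep hashtag words, drop '#'
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # drop digits, punctuation, emojis
    return re.sub(r"\s+", " ", text).strip().lower()

df["clean_tweet"] = df["tweet"].apply(clean_tweet)  # assumes the DataFrame `df` above
```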
To evaluate the performance of the decision tree model, the data should
be split into training and testing sets. The model will be trained on the
training set and tested on the unseen testing set. This was implemented with an 80/20
train/validation split.
The models can now be trained on the preprocessed and vectorized data.
The code defines and trains four models: decision tree, random forest,
support vector machine (SVM), and logistic regression.
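The sketch below shows one way the four models could be defined and trained in a single loop. It assumes scikit-learn, the `X_train`/`X_val`/`y_train`/`y_val` arrays from the earlier split, a linear SVM, and default-style hyperparameters; the original implementation may differ.

```python
# Sketch: define and train the four models compared in this work.
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    model.fit(X_train, y_train)              # train on the training split
    y_pred = model.predict(X_val)            # evaluate on the validation split
    print(f"=== {name} ===")
    print(classification_report(y_val, y_pred))
```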
Once training is complete, each model is evaluated on the validation set, and a classification
report showing precision, recall, and F1-score for each class is printed.
The requirements for the implementation of the decision tree model can be summarized as a Python
environment with the necessary libraries, running on any common operating system (Windows or a
Linux distribution such as Ubuntu, Fedora, etc.).
Source code:
4.4 Evaluation Metrics
Evaluation metrics are numerical measurements that are used to assess a model's performance; they
evaluate how well a model predicts the results of a certain task. The type
of model being utilized and the particular problem being addressed determine which metrics are
appropriate for the task.
4.4.1 Accuracy:
Accuracy is a metric that measures how often a machine learning model correctly predicts the
outcome. You can calculate accuracy by dividing the number of correct predictions by the total
number of predictions. In other words, accuracy answers the question: how often is the model
right? The higher the accuracy, the better; a perfect accuracy of 1.0 is achieved when every
instance is classified correctly.
4.4.2 Precision:
Precision is a metric that measures how often a machine learning model correctly predicts the
positive class. You can calculate precision by dividing the number of true positives by the
total number of instances the model predicted as positive (both true and
false positives). In other words, precision answers the question: how often are the positive
predictions correct?
4.4.3 Recall:
Recall is a metric that measures how often a machine learning model correctly identifies positive
instances (true positives) from all the actual positive samples in the dataset. You can calculate
recall by dividing the number of true positives by the total number of actual positive instances.
In other words, recall answers the question: can the model find all instances of the positive
class?
4.4.4 F1-score:
The F1-score is a performance metric for classification and is calculated as the harmonic mean of
precision and recall.
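The short worked sketch below computes all four metrics from hypothetical confusion-matrix counts (the TP/FP/FN/TN values are made up for illustration and are not the project's results):

```python
# Worked sketch of the four metrics from made-up confusion-matrix counts.
TP, FP, FN, TN = 70, 20, 23, 87

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # correct / all predictions
precision = TP / (TP + FP)                                  # correct positives / predicted positives
recall    = TP / (TP + FN)                                  # correct positives / actual positives
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```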
4.5 Results
Figure 4.9 Graph of the evaluation parameters for each model.
4.6 Discussion
The Decision Tree correctly classifies a good share of the tweets. However, it's slightly less
accurate than the other models.
II. Precision (0.78): Out of all the tweets classified as positive (or negative), 78% were
correct; this level of precision suggests that the model might classify some negative tweets as
positive.
III. Recall (0.76): Of all the positive tweets, 76% were correctly identified.
IV. F1-Score (0.77): The F1-score balances precision and recall, and at 0.77 it indicates
reasonably balanced but modest performance.
IV. F1-Score (0.83): This high F1-score indicates that Random Forest is the most
reliable model for this task. Its robustness against overfitting (due to its ensemble of many
trees) contributes to this strong result.
II. Precision (0.80): SVM has good precision, indicating that most positive predictions are
correct, though it's slightly less precise than Random Forest.
III. Recall (0.79): The recall is slightly lower, meaning the SVM might miss some positive tweets.
IV. F1-Score (0.79): The F1-score of 0.79 suggests that SVM maintains a good balance between
precision and recall.
Based on the metrics, Random Forest appears to be the best choice for this task. The Decision
Tree remains attractive where interpretability is critical, but it may require careful tuning to
avoid overfitting.
CHAPTER FIVE
5.1 Summary
This project explores the implementation of a sentiment analysis model using Twitter data,
focusing on classifying tweets into positive, negative, or neutral sentiments. The approach
begins with data collection through the Twitter API, gathering a variety of tweets based on
relevant keywords and hashtags. The collected tweets are then subjected to a comprehensive
preprocessing phase, which includes cleaning the text, tokenization, and converting the text into
numerical feature vectors.
The core of the project involves training a Decision Tree model on the preprocessed data.
Decision Trees are chosen for their simplicity and interpretability, making them suitable for
this task. The model's performance is evaluated using key metrics such as accuracy, precision,
recall, and F1-score,
providing a detailed understanding of how well the model classifies sentiments. To provide
context and evaluate the effectiveness of the Decision Tree model, its performance is
compared with three other popular machine learning models: Random Forest, Support Vector
Machine (SVM), and Logistic Regression. Random Forest, an ensemble method combining
multiple decision trees, is expected to offer superior accuracy and generalization compared to
a single Decision Tree. SVM, known for its performance in high-dimensional spaces, and Logistic
Regression, a widely used linear classifier, serve as further points of comparison.
The results indicate that while the Decision Tree model is valuable for its interpretability, it
may not generalize as well as Random Forest or SVM, which provide better accuracy and
robustness for large-scale sentiment analysis tasks. The project concludes with a discussion of
these findings, emphasizing the importance of selecting the right model based on the specific
requirements of sentiment
analysis. This project contributes to the broader understanding of how different machine learning
models can be applied to social media data for sentiment classification, providing a practical
reference for researchers and practitioners.
5.2 Conclusion
The implementation of sentiment analysis on Twitter data using the Decision Tree model
demonstrates the potential and challenges of applying machine learning techniques to social
media analytics. The Decision Tree model, chosen for its simplicity and interpretability,
successfully classifies tweets into positive, negative, or neutral sentiments, providing a clear
and interpretable set of decision rules. When compared with other models
such as Random Forest, Support Vector Machine (SVM), and Logistic Regression, it
becomes evident that the Decision Tree may not always be the most effective choice in terms of
accuracy and robustness.
Random Forest, with its ensemble approach, outperforms the Decision Tree by reducing
overfitting and improving predictive performance, making it more suitable for large-scale
sentiment analysis tasks. SVM and Logistic Regression also offer competitive results, each with
its own strengths. This comparative analysis highlights the importance of selecting the
appropriate model based on the specific
goals of the sentiment analysis task. While the Decision Tree offers valuable interpretability,
which is crucial in certain applications, other models may be better suited for scenarios
requiring higher accuracy and robustness. Therefore, a balanced approach that considers both
interpretability and predictive performance is recommended.
Overall, this project underscores the critical role of model selection in sentiment analysis and
suggests that while Decision Trees have their place, ensemble methods like Random Forest may
offer superior performance for understanding and acting upon the vast amounts of data generated
on platforms like Twitter.
5.3 Contributions
This project set out to develop a decision tree model for the sentiment analysis of Twitter data.
Through the meticulous implementation and comparison of the Decision Tree
with other prominent machine learning models such as Random Forest, Support Vector
Machine (SVM), and Logistic Regression, this study offers valuable insights into the
strengths and limitations of different approaches in the context of social media analytics. This
project contributes to the field of sentiment analysis by providing the following insights and
advancements:
II. Comparative Analysis: Compares the Decision Tree model with other popular
machine learning models, including Random Forest, Support Vector Machine (SVM),
and Logistic Regression. Highlights the trade-offs between interpretability (Decision Tree) and
predictive accuracy (ensemble methods).
III. Model Selection Guidance: Provides practical guidance on selecting the appropriate
machine learning model based on specific sentiment analysis tasks. Emphasizes the role of
preprocessing and feature extraction in enhancing model performance, and shows how these
techniques improve the quality of the input Twitter data. Serves as a reference for future
research and practical applications in sentiment analysis.
VI. Insights into Ensemble Methods: Validates the superiority of ensemble methods like
Random Forest over single models like the Decision Tree in terms of accuracy and robustness.
VIII. Real-World Application Relevance: Offers insights that are directly applicable to
real-world sentiment analysis tasks on social media platforms.
Overall, this project advances knowledge in the application of machine learning to social
media data, particularly in the context of sentiment analysis on Twitter. It provides a practical
framework for selecting and implementing models based on the needs of the analysis,
balancing the demands for interpretability and accuracy. This contribution serves as a
valuable reference for future research and development in the field of sentiment analysis and
its applications in understanding public opinion and trends on social media platforms.
5.4 Recommendation
Decision tree models offer a powerful approach to sentiment analysis, providing interpretable
and efficient solutions. However, their application to Twitter data presents unique challenges,
and the following recommendations are therefore made:
I. Choose the Appropriate Model Based on Specific Needs: When applying sentiment
analysis to Twitter data, select the algorithm that best suits your specific requirements.
Random Forest is recommended for achieving the highest accuracy and robustness, particularly in
large and complex datasets, while the Decision Tree is preferable when interpretability matters
most.
II. Tune Hyperparameters: Optimize each model's performance through
hyperparameter tuning. For example, adjusting the depth of the Decision Tree or the
number of trees in the Random Forest can significantly impact the model's
performance. Use techniques such as grid search or randomized search to find the optimal settings.
III. Regularly Re-train Models on Updated Data: Twitter data is highly dynamic, with
trends and language usage evolving rapidly. To maintain model relevance and
accuracy, regularly re-train your models on new data. This will help your models
adapt to the latest trends and changes in sentiment expression, ensuring that the analysis
remains relevant.
IV. Combine Models for Enhanced Performance: Consider combining multiple models to leverage their
complementary strengths. For instance, an ensemble method that aggregates predictions from different
models can provide more robust and accurate results than any single model alone.
V. Focus on Data Preprocessing: The quality of your sentiment analysis results heavily
depends on the preprocessing of the Twitter data. Pay careful attention to steps like cleaning,
tokenization, and vectorization, since thorough preprocessing is crucial to transforming raw
tweets into a format that the algorithms can learn from effectively.
VI. Monitor Model Performance Over Time: Continuously monitor the performance of your models,
especially when the sentiment analysis results inform decision-making. Use metrics like accuracy,
precision, recall, and F1-score to track performance and re-train or replace the model when it
degrades. The Decision Tree model, while potentially less accurate than ensemble methods, remains
a valuable choice where interpretability is essential.
REFERENCES
Zhang, Y., Song, D., Zhang, P., Li, X., & Wang, P. (2019). A quantum-inspired
Aslam, S. (2024, January 11). Twitter by the numbers: Stats, demographics &
statistics.
Nazir, F., Ghazanfar, M. A., Maqsood, M., Aadil, F., Rho, S., & Mehmood, I.
(2019). Social media signal detection using tweets volume, hashtag, and sentiment
Kraiem, M. B., Feki, J., Khrouf, K., Ravat, F., & Teste, O. (2015). Modeling
and OLAPing social media: The case of Twitter. Social Network Analysis
Furini, M., & Montangero, M. (2018). Sentiment analysis and Twitter: A game
Harakawa, R., Takehara, D., Ogawa, T., & Haseyama, M. (2018). Sentiment-
Bukhari, A., Qamar, U., & Ghazia, U. (2017). URWF: User reputation based
Bennacer, N. S., Bugiotti, F., Hewasinghage, M., Isaj, S., & Quercini, G.
Appel, O., Chiclana, F., Carter, J., & Fujita, H. (2018). Successes and challenges
Intelligence, 48 (5), 1176–1188.
Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion:
11), 450–453.
Hasan, M., Agu, E., & Rundensteiner, E. (2014). Using hashtags as labels for
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical
Gamallo, P., & Garcia, M. (2014). Citius: A naive-Bayes strategy for sentiment
analysis on English tweets. In Proceedings of the 8th International
Dublin, Ireland.
COVID-19.
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using
AAAI.
Shangaui, B., Sheela, R., & Sudalai Manikandan, V. (2021). Twitter sentiment
Gautam, G., & Yadav, D. (2014, August 7-9). Sentiment analysis of Twitter data
Panto, M., Antony, M., Muhssina, K. M., Johny, N., James, V., & Wilson, A.
(ICEEOT).
Gupta, B., Negi, M., Vishwakarma, K., Rawat, G., & Badhani, P. (2017). Study
Tan, S., Li, Y., Sun, H., Guan, Z., & Yan, X. (2014). Interpreting the public
Kumar, S., Singh, P., & Rani, S. (2016, September 7-9). Sentimental analysis of
Directions).
Twitter data using data mining. 2015 International Conference on