Introduction
In today's digital landscape, individuals communicate their thoughts primarily through blogs,
online forums, and social media platforms like Facebook and Twitter. Millions share their
experiences, forming dynamic online communities where consumer feedback influences
decisions. This surge of emotional data, from tweets to reviews, presents significant opportunities
for businesses to engage effectively with customers. As consumers increasingly rely on user-
generated content for purchasing decisions, the need for automated sentiment analysis (SA)
becomes clear. SA helps determine whether products meet expectations and allows businesses to
tailor offerings to customer needs. While traditional data analysis focuses on objective facts,
sentiment analysis explores subjective opinions and emotions, paving the way for new
applications. For example, recommendation systems can improve accuracy by incorporating SA
to predict product recommendations based on user sentiment.
Problem Statement
Objectives
Data Collection and Preprocessing: To gather a diverse dataset of tweets and preprocess the
data to remove noise, such as irrelevant content, hashtags, and links, ensuring a clean dataset for
analysis (a small preprocessing sketch follows these objectives).
Sentiment Classification: To develop and implement algorithms for classifying tweets into
distinct sentiment categories (e.g., positive, negative, neutral) using machine learning and natural
language processing techniques.
Feature Extraction: To identify and extract relevant linguistic features from tweets, such as
sentiment-laden words, phrases, and contextual cues that contribute to accurate sentiment
detection.
Model Evaluation and Optimization: To evaluate the performance of various sentiment
analysis models and optimize them for higher accuracy, precision, and recall in classifying
sentiments.
Insights and Visualizations: To generate meaningful insights from the analyzed data and
present the findings through visualizations that illustrate trends, patterns, and public sentiment
over time.
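As a concrete illustration of the first objective, the noise-removal step could be approached with simple regular expressions. The following is a minimal sketch only; the patterns and the example tweet are illustrative assumptions, not part of the collected dataset.

import re

def clean_tweet(text: str) -> str:
    # Remove common Twitter noise: links, @mentions, the '#' symbol, digits and punctuation.
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # drop links
    text = re.sub(r"@\w+", " ", text)               # drop @mentions
    text = text.replace("#", " ")                   # keep the hashtag word, drop the symbol
    text = re.sub(r"[^A-Za-z' ]+", " ", text)       # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_tweet("Loving the new phone!! #happy http://example.com @shop"))
# -> "loving the new phone happy"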
Methodology
Requirement Identification
Literature Review
The paper [1] (Kaur, 2015) describes a dataset of flood-related tweets collected from Twitter for a
particular geographic area in order to determine the opinions of individuals. They used the Naive
Bayes algorithm to classify the data and obtained an accuracy of 67%. They also collected several
suggestions from individuals that are useful for both government and non-government
organizations to handle such a scenario in a better manner. These methods are simpler than
lexicon-based approaches.
In 2019, Saad and Yang [2] aimed to provide a complete tweet sentiment analysis based on
ordinal regression with machine learning algorithms. The suggested model included preprocessing
the tweets as the first step, and an effective feature set was generated with the feature extraction
model. Methods such as SVR, RF, Multinomial Logistic Regression (SoftMax), and DTs were
employed for sentiment classification. Moreover, a Twitter dataset was used to test the suggested
model. The test results showed that the suggested model attained the best accuracy and that DTs
performed well compared with the other methods.
In 2018, Fang et al. [3] suggested multi-strategy sentiment analysis models that use semantic
fuzziness to resolve these issues. The outcomes demonstrated that the proposed model attains
high efficiency.
Requirement Analysis
Feasibility study
i) Technical
Data Access: Twitter provides APIs for data collection, enabling access to large volumes of
tweets for sentiment analysis.
Tools and Technologies: Utilize established libraries and frameworks (e.g., Python, NLTK,
TensorFlow) for natural language processing and machine learning.
Infrastructure: Assess the computational requirements for data processing and model training;
cloud-based solutions can provide scalable resources.
Algorithm Development: Existing machine learning algorithms (e.g., SVM, LSTM) are well-
suited for sentiment classification, ensuring a solid foundation for development.
ii) Operational
Ensure the availability of team members with expertise in data science, machine learning, and
natural language processing, and develop workflows for data collection, preprocessing, analysis,
and reporting that can be integrated into existing business operations. Establish processes for
continuous monitoring and updating of models to adapt to evolving language and sentiment
trends on Twitter.
iii) Economic
Sentiment analysis is a term that encompasses many tasks such as emotion mining, sentiment
classification, subjectivity classification, opinion summarization, and opinion spam detection,
among others. It aims to analyze the feelings, attitudes, opinions, and emotions of people toward
items such as products, individuals, objects, organizations, and services. Formally, we can
represent an opinion as a quintuple (o, f, so, h, t), where: o = object; f = feature of object o; so =
orientation or polarity of the opinion about feature f of object o; h = opinion holder; t =
time the opinion was expressed.
Object: An entity such as a person, event, product, organization, or topic.
Feature: An attribute (or part) of the object on which the evaluation is performed.
Opinion orientation or polarity: The orientation of an opinion about a feature f indicates whether
the opinion is positive, negative, or neutral.
Opinion holder: The person, organization, or entity expressing the opinion.
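As a small, purely illustrative sketch, the quintuple can be represented as a simple data structure; the example values below are hypothetical and only show the shape of the representation.

from collections import namedtuple

# Opinion quintuple (o, f, so, h, t) as defined above.
Opinion = namedtuple("Opinion", ["object", "feature", "orientation", "holder", "time"])

# Hypothetical example: "The battery life of this phone is great." posted by @user1.
example = Opinion(object="phone", feature="battery life",
                  orientation="positive", holder="@user1", time="2024-01-10")
print(example.orientation)  # -> positive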
In recent years, several researchers have done a lot of work in the field of Twitter sentiment
analysis. In its early days, it was used for binary classification, assigning opinions or ratings to
two opposing classes, such as positive or negative.
[1] proposed a model to classify tweets as objective, positive, or negative. They created a
Twitter corpus by collecting tweets using the Twitter API and automatically labeling those
tweets with emoticons. Using this dataset, they developed a sentiment classifier based on the
multinomial Naive Bayes method using features such as N-grams and POS tags. The training set
they used was less effective because it contained only tweets with emoticons.
[2] deployed two models, a Naive Bayes bigram model and a Maximum Entropy model to
classify tweets. They found that the Naive Bayes classifier performed much better than the
Maximum Entropy model.
[3] proposed a sentiment analysis solution for Twitter data based on distant supervision, in which
their training data consisted of tweets with emoticons acting as noisy labels. They built models
using Naive Bayes, MaxEnt, and Support Vector Machines (SVM). Their feature spaces included
unigrams, bigrams, and POS tags. They concluded that SVM outperformed the other models and
that unigrams were the most effective features.
[4] designed a two-stage automated sentiment analysis method for classifying tweets. They
first classify tweets as objective or subjective; then, in the second stage, subjective tweets are
classified as positive or negative. The feature space used includes retweets, hashtags, links,
punctuation, and exclamation marks, along with features such as prior word polarity and POS
tags.
[5] used Twitter streaming data provided by the Firehose API, which makes every public user
post available in real time. They experimented with multinomial Naive Bayes, stochastic gradient
descent, and Hoeffding trees. They concluded that the SGD-based model, when used with an
appropriate learning rate, was better than the others.
[6] developed a three-way classification model to classify sentiments into positive, negative, and
neutral classes. They experimented with a unigram model, a feature-based model, and a tree
kernel model. For the tree kernel model, they represent tweets as trees. The feature-based model
uses 100 features, and the unigram model uses more than 10,000 features. They concluded that
features combining the prior polarity of words with their part-of-speech (POS) tags are the most
important and play a key role in the classification task. The tree kernel model outperforms the
other two.
[7] proposed an approach that uses user-defined hashtags in tweets as labels for sentiment
classification, with punctuation, single words, n-grams, and patterns as feature types, which are
then combined into a single feature vector for sentiment classification. They used the K-Nearest
Neighbor strategy to assign sentiment labels by constructing a feature vector for each example in
the training and test sets.
[8] used the Twitter API to collect Twitter data. Their training data falls into three different
categories (camera, film, mobile). The data is labeled as positive, negative, and non-opinionated,
and posts containing opinions were filtered. A unigram Naive Bayes model was implemented,
using the simplifying Naive Bayes independence assumption. They also removed unnecessary
features using the mutual information and chi-square feature extraction methods. Finally, the
orientation of a tweet is predicted, i.e., positive or negative.
[9] presented variations of the Naive Bayes classifier for detecting the polarity of English tweets.
Two different variants of the Naive Bayes classifier were built, namely Baseline (trained to
classify tweets as positive, negative, and neutral) and Binary (which uses a polarity lexicon and
classifies tweets as positive or negative, ignoring neutral tweets). The features taken into account
by the classifiers are lemmas (nouns, verbs, adjectives, and adverbs), polarity lexicons from
different sources, and valence shifters.
[10] used the bag-of-words method for sentiment analysis, in which the relationships between
words are not considered at all and a document is represented as a simple collection of words. To
determine the sentiment of the entire document, the sentiment of each word was determined,
and these values were combined using several aggregation functions.
[11] used the WordNet lexical database to determine the emotional content of a word along
different dimensions. They developed a distance measure on WordNet and determined the
semantic polarity of adjectives.
[12] used an ensemble framework for sentiment classification, built by combining different
feature sets and classification techniques. In their work, they used two types of feature sets
(part-of-speech information and word relations) and three base classifiers (Naive Bayes,
Maximum Entropy, and Support Vector Machines). They applied ensemble approaches such as
fixed combination, weighted combination, and meta-classifier combination to classify sentiments
and achieve better accuracy.
[13] highlights the main challenges and effective techniques for mining opinions from Twitter
tweets. Spam and the extremely diverse language used make it difficult to extract opinions from
Twitter.
A general model for sentiment analysis is as follows:
Figure – 1: Sentiment Analysis Architecture
Feature Extraction
The preprocessed dataset has many distinctive properties. In the feature extraction step, we
extract aspects from the processed dataset. These aspects are then used to calculate the negative
and positive polarity of a sentence, which is useful for determining the opinions of individuals,
using patterns such as unigrams and bigrams [15].
Machine learning techniques require representing the key features of a text or document in order
to process it. These key features are treated as feature vectors used for the classification task.
Some examples of features that have been reported in the literature are:
1. Words And Their Frequencies: Unigram, bigram, and n-gram features with their frequency
   counts are considered as features. There have been many studies on using word presence
   rather than frequency to better describe this characteristic; [16] showed better results when
   using presence instead of frequency (a brief sketch follows this list).
2. Parts Of Speech Tags: Parts of speech such as adjectives, adverbs, and certain groups of
   verbs and nouns are good indicators of subjectivity and sentiment. Syntactic dependency
   patterns can also be generated by parsing or from dependency trees.
3. Opinion Words and Phrases: Besides specific words, a number of sentiment-bearing phrases
   and idioms can be used as features. Example: "costs someone an arm and a leg".
4. Position Of Terms: The position of a term in the text can affect how much the term
   contributes to the overall sentiment of the text.
5. Negation: Negation is an important but difficult feature to interpret. The presence of a
   negation often reverses the polarity of the opinion. For example: "I am not happy".
6. Syntax: Syntactic patterns such as phrase structures are used by many researchers as features
   to learn subjective patterns.
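As a brief sketch of point 1, the snippet below builds unigram/bigram features and records word presence rather than frequency. It assumes scikit-learn (one of the Python libraries this project could use) and a toy list of tweets.

from sklearn.feature_extraction.text import CountVectorizer

tweets = ["the battery is great", "the battery is not great", "terrible battery life"]

# binary=True records presence instead of raw frequency, in line with point 1 above
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(tweets)

print(vectorizer.get_feature_names_out()[:5])  # first few unigram/bigram features
print(X.toarray().shape)                       # (number of tweets, number of features)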
Supervised learning is an important technique for solving classification problems. Classifier
training facilitates future predictions for unknown data.
1. Naive Bayes Classification: It is a probabilistic classifier that can learn patterns by
   examining a set of labeled documents. It compares the content with word lists to assign
   documents to their correct category or class.
2. Support Vector Machine: Support vector machines are used for data analysis, determining
   decision boundaries, and performing computations in the input space using kernels. The input
   data are two sets of vectors of size m each. Each data point, represented by a vector, is
   assigned to a class, and we then find the margin between the two classes that is farthest from
   any document. This distance defines the classifier's margin, and maximizing the margin
   reduces indecisive classifications. SVMs support both classification and regression, which is
   very useful for statistical learning theory, and they also help identify the factors that need to
   be considered to interpret the data successfully (a brief sketch of these classifiers follows this
   list).
3. Supervised Learning: It is based on a labeled dataset, so the labels are fed to the model
   during training. These labeled datasets are used for training so that the model yields
   meaningful outputs when similar data is encountered during decision making.
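The sketch below shows how the classifiers described above could be trained with scikit-learn on bag-of-words features. The tiny labeled sample and the choice of MultinomialNB and LinearSVC are assumptions for illustration, not the proposal's final design.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical labeled tweets, just to show the training interface.
tweets = ["love this phone", "worst service ever", "battery life is great", "not happy at all"]
labels = ["positive", "negative", "positive", "negative"]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(CountVectorizer(binary=True), clf)
    model.fit(tweets, labels)                            # train on bag-of-words features
    print(type(clf).__name__, model.predict(["this phone is great"]))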
The success of these learning methods depends mainly on the selection and extraction of the
specific set of features used for sentiment detection. The machine learning methods applied to
sentiment analysis mostly belong to supervised classification. In the machine learning approach,
two datasets are required: a training set and a test set.
Several machine learning techniques have been built to classify tweets into classes.
Machine learning techniques such as Naive Bayes (NB), Maximum Entropy (ME), and
Support Vector Machines (SVM) have achieved great success in sentiment analysis.
Machine learning starts with collecting a set of training data. Then we train a classifier on the
training data. Once a supervised classification technique has been chosen, an important
decision is the choice of features, since they determine how the documents are represented. The
most common features used in sentiment classification are those listed in the feature extraction
section above.
For supervised techniques, Support Vector Machines (SVM), Naive Bayes, Maximum Entropy
are some of the most commonly used techniques.
Semi-supervised and unsupervised techniques are introduced when it is not possible to obtain an
initial set of labeled documents/reviews with which to classify the remaining items.
Document level refers to labeling individual documents with their sentiment. At the document
level, the entire document is classified as either positive or negative as a whole. The general
approach is to find the polarity of each word or phrase in the document and combine them to
find the polarity of the document; another approach relies on complex linguistic phenomena such
as co-reference resolution, pragmatics, etc. The tasks involved are: Task: categorize the sentiment
of the entire document. Categories: positive, negative, and neutral. Assumption: each document
focuses on a single object (this is not true of discussion posts, blogs, etc.) and contains comments
from a single opinion holder.
Sentence-level analysis deals with tagging individual sentences with their respective sentiment
polarity. Sentence-level classification assigns sentences to the positive, negative, or neutral class.
The general approach is to find the sentiment orientation of each word in the phrase/sentence and
then combine them to determine the sentiment of the phrase or the whole sentence; another
approach examines the discourse structure of the text. The tasks involved are: Task 1: determine
subjective/objective sentences (types: objective and subjective). Task 2: categorize the sentiment
of subjective sentences (types: positive and negative). Assumption: a sentence contains only one
opinion, which may not always be true.
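To illustrate the general sentence-level approach of combining word-level orientations, a lexicon- and rule-based scorer such as NLTK's VADER can be used. This is only a sketch of one possible tool, not the method used in the works cited above.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

for sentence in ["I am not happy", "The new update is fantastic"]:
    scores = sia.polarity_scores(sentence)   # neg/neu/pos proportions plus a compound score
    print(sentence, "->", scores["compound"])

VADER handles negation and valence shifters internally, which is why a sentence like "I am not happy" is expected to receive a negative compound score.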
Aspect Level or Feature Level: This involves labeling each word with its sentiment and
identifying the entity toward which the sentiment is directed. Sentiment classification at the
aspect or feature level deals with identifying and extracting product features from the source
data. Techniques such as dependency parsing and discourse structures are used for this purpose.
The tasks involved are: Task 1: identify and extract the features of an object that have been
commented on by an opinion holder (e.g., a reviewer). Task 2: determine whether the opinion
about the object is negative, positive, or neutral. Task 3: group synonyms of the object's features.
Most recent works have used prior word-level polarity to classify sentiments at the sentence level
and the document level. Sentiment classification has mainly used adjectives as features, although
adverbs are also used. The two automatic word-level sentiment annotation approaches are the
dictionary-based approach and the corpus-based approach.
Data Preparation
Since this dataset is quite large, during exploration we start by generating smaller sub-datasets
for training, to speed up the testing steps. We also re-index the dataframe to make it easier to add
columns later in the process.
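A minimal pandas sketch of this step is shown below; the file name and sample size are assumptions used only for illustration.

import pandas as pd

df = pd.read_csv("tweets.csv")   # assumed CSV with Tweet_Content and Sentiment columns

# Work on a smaller random subset while prototyping, then re-index it so new columns align cleanly.
sample_df = df.sample(n=min(10_000, len(df)), random_state=42).reset_index(drop=True)
print(sample_df.shape)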
Here we explore the data, check whether it is balanced, and check for patterns in missing rows.
Much of this can be automated with pandas-profiling. Sections under this heading include:
• Basic visualization - We can display some basic statistics about the data using pandas and
  also display a few entries from the dataset, to see the sample points we will be working with.
• Automated data exploration with pandas-profiling - Pandas-profiling is a library used to
  automate data exploration. It gives us a good overview of the dataset, which we can use to
  inform our further work.
• Checking the balance of the output categories - We want to check the balance of the output
  column (Sentiment), so we do not train a model that always predicts the same output. Such a
  model can be highly accurate, but we would learn nothing about trends in the data beyond the
  overall class frequencies. We could also consider balancing the test data, but keep in mind
  that real-world data is rarely well balanced, and the test data should resemble real-world data.
  The sentiment ratios range from about 0.15 to 0.3, which is usually a good balance, so we
  are unlikely to see a scenario where only one class is predicted. However, we will watch
  whether our training accuracy gets stuck around 0.3, which could be a sign of this problem
  (a short check is sketched after this list).
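A short sketch of the balance check, continuing from the sample_df assumed above (the column name is also an assumption):

# Relative frequency of each sentiment class in the training data
print(sample_df["Sentiment"].value_counts(normalize=True))

# If one class dominated, a model that always predicted that class could still look accurate,
# so we compare the ratios against that baseline before trusting raw accuracy.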
We can create an indexer to convert the sentiment from label to index and vice versa. This is
very useful for interpreting our predictions later. Then we convert the sentiment column in the
training data into the labels we will learn to predict. After that, we work with the text in the
"Tweet_Content" column, extract all the information we can, and convert it into a format usable
by the neural network that we will train later. This involves removing words with little meaning
(stop word removal) and grouping words with the same meaning regardless of details such as
tense (lemmatization). Next, we use a tokenizer to encode the presence of words in a matrix,
similar to one-hot encoding. This is called the "bag of words" method. Sections under this
heading include:
• Stop Word Removal and Lemmatization with NLTK - Here, we first split each string into its
  individual words and then check each word: if it contains no text (i.e., only digits,
  punctuation, or other characters), or if it is a stop word (words like "with", "a", "the"), the
  word is removed from the string. We also perform lemmatization in this step, converting
  words back to their base form so that tenses and other details can be ignored by our final
  model (a negative statement in the past tense is still negative). A condensed sketch of these
  steps follows this list.
• Tokenization - We generate tokens for the most common words in the dataset, so that we can
  represent the presence of words from our generated corpus (the n most common words) with
  a list of integers.
• Adding the Tokenized Strings to the DataFrame
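A condensed sketch of these three steps is given below. It assumes NLTK for stop words and lemmatization and the Keras Tokenizer (consistent with the tools named in the feasibility study); the column names continue the earlier assumptions.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.preprocessing.text import Tokenizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> str:
    # Keep alphabetic, non-stop-word tokens and reduce each one to its lemma.
    tokens = [lemmatizer.lemmatize(w) for w in text.lower().split()
              if w.isalpha() and w not in stop_words]
    return " ".join(tokens)

sample_df["clean_text"] = sample_df["Tweet_Content"].astype(str).map(normalize)

# Keep the n most common words and encode each tweet as a sequence of integer tokens.
tokenizer = Tokenizer(num_words=10_000)
tokenizer.fit_on_texts(sample_df["clean_text"])
sample_df["tokens"] = tokenizer.texts_to_sequences(sample_df["clean_text"])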
The data is now almost ready for model training, but some final preparation needs to take place.
For example, we need to remove columns that we do not intend to use, such as the
"Tweet_Content" column, from which the useful information has already been extracted. We also
split the data into a training and test set so that we can evaluate the performance of our model
without touching the held-out data. We do this because if we constantly examine this held-out
data, it loses its usefulness as unseen "real world" data. Sections under this heading include
Dropping Unused Data and Test-Train Split.
• Dropping Unused Data - Here we remove unnecessary columns from the DataFrame. They
  either carry no useful information (Tweet ID) or their useful information has already been
  extracted (Tweet Content). We also separate out "y", the dependent variable, here, so we
  don't accidentally train on it.
• Test-Train Split - Here, we use SciKit-Learn's built-in functionality to split our data into a
  test set and a train set, with the corresponding labels. We use a constant random state to make
  this repeatable (sketched below).
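A brief sketch of these two steps, continuing the assumed column names from above:

from sklearn.model_selection import train_test_split

# Drop columns whose useful information has already been extracted, and separate the labels.
X = sample_df.drop(columns=["Tweet_ID", "Tweet_Content", "Sentiment"], errors="ignore")
y = sample_df["Sentiment"]

# Hold out a test set used only for final evaluation; a fixed random_state keeps the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)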