ML Termwork Report
Classification Techniques
Several classifiers are used in sentiment analysis, including:
Naive Bayes: Applies Bayes' theorem under the assumption that
features are conditionally independent given the class.
Maximum Entropy (MaxEnt): Estimates the conditional
distribution of the class given the features with an exponential
(log-linear) model.
Support Vector Machine (SVM): Classifies data by maximizing
the margin between classes.
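To make the independence assumption concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one (Laplace) smoothing. It is an illustration only, not the model used in this report: the function names `train_naive_bayes` and `predict_naive_bayes` and the four toy training reviews are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count classes and per-class word occurrences from tokenized docs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total_docs)  # log prior of the class
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # skip words never seen during training
            # Each word contributes independently (the "naive" assumption);
            # add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training data (hypothetical, for illustration only).
train_docs = [
    ["good", "phone", "great", "battery"],
    ["great", "value", "good", "camera"],
    ["bad", "battery", "poor", "screen"],
    ["poor", "quality", "bad", "service"],
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict_naive_bayes(model, ["good", "camera"])
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once reviews contain more than a few words.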
Lexicon Techniques
Lexicon-based sentiment analysis assigns predefined scores to words
based on their polarity (positive, neutral, or negative). The total
polarity of a text is calculated by summing individual token scores.
However, the approach struggles in domain-specific contexts, where
a word's polarity can differ from its predefined lexicon score.
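The summation described above can be sketched as follows. The lexicon here is a tiny hypothetical dictionary invented for the example; a real lexicon would contain thousands of scored words, and the name `lexicon_polarity` is likewise an assumption.

```python
import re

# Hypothetical toy lexicon: word -> predefined polarity score.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}


def lexicon_polarity(text, lexicon=LEXICON):
    """Sum the lexicon scores of all tokens; unknown words score 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(tok, 0) for tok in tokens)


positive_score = lexicon_polarity("Good camera and great battery")
negative_score = lexicon_polarity("bad service")
```

Because every word carries one fixed score, a word that is negative in general but positive in a particular product domain would be scored incorrectly, which is the limitation noted above.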
Hybrid Techniques
Hybrid methods combine multiple strategies to improve sentiment
analysis. These can include:
Contextual Opinion Mining: Uses semantic similarity to extract
contextual sentiment.
Unsupervised Learning with POS Patterns: Extracts sentiment
phrases using part-of-speech rules.
Each technique has its strengths and is chosen based on the dataset
and analysis goals.
Research Design
Descriptive and exploratory research was chosen with the
expectation that it would give marketers a clear understanding of
the millennial mindset. The method uses the Naive Bayes algorithm
to classify the emotion expressed in each tweet in the dataset,
assigning scores based on the emotions associated with that tweet.
A graph is then plotted from these scores. Figure 3.1 shows the
research design for the sentiment analysis model, and Figure 3.2
shows the sentiment polarity model design.
A Sentiment Polarity Model classifies text into sentiment categories:
positive, negative, or neutral. It determines the direction (polarity)
of the sentiment expressed in the text.
Positive indicates favorable sentiment, negative indicates
unfavorable sentiment, and neutral means no clear sentiment.
The model often uses feature extraction techniques like
tokenization, TF-IDF, or word embeddings to process text.
Some models use polarity scores: +1 for positive, -1 for negative, and
0 for neutral.
Research Methodology for Sentiment Analysis
1. Data Collection:
The first step in opinion mining is gathering substantial volumes
of data.
Sources: Data can be obtained from:
o Social media platforms (Facebook, Twitter)
o Reviews and comments from various sources (blogs,
ratings, forums)
o Datasets from Kaggle (e.g., product reviews from Flipkart)
Example Dataset: A Flipkart product review dataset from
Kaggle with 205,053 rows and 5 columns containing product
names, prices, ratings, reviews, and summaries.
2. Data Preprocessing:
Preprocessing cleans and prepares the raw data for analysis.
Steps:
o Tokenization: Splitting text into words, phrases, symbols,
or characters.
o Stopword Removal: Removing common words (e.g.,
"the", "is", "in") that don't contribute to sentiment
analysis.
o Filling Missing Values: Replacing missing values with a
global constant.
Tools Used: NLTK (Natural Language Toolkit) for tokenization,
stopword removal, and stemming.
Goal: Ensure the data is clean and properly tokenized, reducing
noise for better model accuracy.
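The three preprocessing steps above can be sketched in plain Python as follows. This is not the NLTK pipeline itself: to stay self-contained, the sketch substitutes a regular expression for NLTK's tokenizer and a small hand-picked stopword set for NLTK's stopword corpus. The constant `MISSING` and all function names are assumptions made for the example.

```python
import re

# Small hand-picked stopword set (assumption); NLTK's corpus is far larger.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "of", "to", "it"}

# Global constant used to fill missing reviews (assumption).
MISSING = "unknown"


def fill_missing(reviews, constant=MISSING):
    """Replace None or blank reviews with a global constant."""
    return [constant if r is None or not r.strip() else r for r in reviews]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop common words that carry little sentiment."""
    return [t for t in tokens if t not in stopwords]


def preprocess(reviews):
    """Fill missing values, then tokenize and filter each review."""
    return [remove_stopwords(tokenize(r)) for r in fill_missing(reviews)]


cleaned = preprocess(["The battery is great", None, "It is in the box"])
```

Filling missing values first means the later steps never have to handle `None`, so each function stays small and can be tested on its own.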
3. Feature Extraction:
This step involves identifying key features to build the
sentiment analysis model.
Methods:
o Skip N-Gram Model: A generalization of n-grams that
allows gaps between tokens, addressing data sparsity.
o TF-IDF Hybrid Method: Combines Term Frequency (TF)
and Inverse Document Frequency (IDF) to calculate word
importance in the text.
Feature extraction reduces model complexity and can improve
accuracy by keeping only the relevant features.
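Both methods can be illustrated with a short sketch. The skip-gram function below is restricted to pairs of tokens (skip-bigrams), and the TF-IDF function uses the unsmoothed form tf × log(N / df), which is one common variant among several. The function names and the `max_skip` parameter are assumptions made for the example.

```python
import math
from collections import Counter


def skip_bigrams(tokens, max_skip=1):
    """Token pairs with at most `max_skip` tokens skipped between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # j = i + 1 is an ordinary bigram; larger j skips tokens in between.
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            # term frequency times inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


pairs = skip_bigrams(["battery", "life", "is", "great"], max_skip=1)
weights = tf_idf([["good", "phone"], ["bad", "phone"]])
```

In this toy corpus "phone" appears in every document, so its weight is zero, while "good" and "bad" receive positive weights because each appears in only one document.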
4. Classification:
Training Stage: A classification model is built using a training
dataset.
Testing Stage: The accuracy of the model is tested using test
data.
Classifier Used: A Naïve Bayes classifier assigns the sentiment class.
Polarity Determination: Sentiment values are calculated using
subjectivity (0-1) and polarity (-1 to +1).
The text is positive if the polarity is greater than 0, negative if the
polarity is less than 0, and neutral if the polarity equals 0.
Subjectivity ranges from 0.0 to 1.0; a higher score indicates that the
text is more subjective. Table 3.4 shows the value counts for positive,
negative, and neutral reviews.
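The thresholding rule above can be written as a small function. The polarity values in the example are hypothetical and are not taken from Table 3.4; the name `polarity_label` is an assumption.

```python
from collections import Counter


def polarity_label(polarity):
    """Map a polarity in [-1, +1] to a sentiment class."""
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


# Hypothetical polarity values for four reviews.
polarities = [0.8, -0.4, 0.0, 0.25]
labels = [polarity_label(p) for p in polarities]

# Value counts per class, analogous in form to a value-count table.
counts = Counter(labels)
```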
5. Sentiment Calculation:
Sentiment Values: For each review, subjectivity and polarity
values are computed.
Final Sentiment Score: The overall sentiment score is the sum,
over all reviews, of each review's subjectivity multiplied by its
polarity.
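As a worked illustration of this calculation, the sketch below sums subjectivity × polarity over a list of reviews. The three (subjectivity, polarity) pairs are hypothetical values chosen for the example, and the name `overall_sentiment` is an assumption.

```python
def overall_sentiment(scores):
    """Sum subjectivity * polarity over (subjectivity, polarity) pairs."""
    return sum(subjectivity * polarity for subjectivity, polarity in scores)


# Hypothetical (subjectivity, polarity) pairs for three reviews.
review_scores = [(0.5, 0.5), (1.0, -0.75), (0.25, 1.0)]

# 0.5*0.5 + 1.0*(-0.75) + 0.25*1.0 = 0.25 - 0.75 + 0.25 = -0.25
total = overall_sentiment(review_scores)
```

Weighting polarity by subjectivity means that strongly opinionated reviews move the total more than objective ones, and a fully objective review (subjectivity 0) contributes nothing.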
6. Visualization:
After developing a machine learning model, visualizing the
classification and its outcomes is strongly advised. Any dataset used
to train the model should be plotted for sentiment analysis so that
the distribution of the data is visible. The results are displayed
using graphs and charts. The word cloud was built from the frequency
of occurrence of words. Figure 3.4 shows the visualization, and
Figure 3.5 shows the count plot of the sentiments.
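The word frequencies that drive a word cloud can be computed as in the sketch below. It covers only the counting step; drawing the cloud itself is left to a plotting library. The function name `word_frequencies` and the toy reviews are assumptions made for the example.

```python
from collections import Counter


def word_frequencies(token_lists, top_n=5):
    """Count words across all tokenized reviews; return the top_n."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(top_n)


# Hypothetical tokenized reviews.
reviews = [
    ["good", "phone"],
    ["good", "battery"],
    ["bad", "battery", "good"],
]
top_words = word_frequencies(reviews, top_n=2)
```

A word cloud then sizes each word in proportion to its count, so the most frequent words stand out.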
Testing:
Testing confirms whether the developed model accurately predicts the
intended result; it is also called the validation step. It is crucial
because the process must work well when applied to large-scale
applications. In the final stage, a user enters a text at runtime and
the model predicts whether the statement is positive, negative, or
neutral.