VG Computer Science AI Recommender
1. Abstract
E-commerce plays an important role in economic development and has become an indispensable part of
daily life in India because of the rapid growth of the country’s information technology and knowledge
systems. E-commerce relies heavily on customer reviews, which serve as a testament to the credibility
of a product. They play an essential part in shaping consumer behaviour and have a notable impact on
a product’s sales. Reviews can indicate how suitable different products and their aspects are for a
particular customer, which in turn reflects the effectiveness of a recommendation. Analysing
e-commerce reviews can help producers gain insight into customer expectations and programme
recommender systems accordingly. In our research we analyse the overall sentiment of online reviews
and the aspects of various products that appealed to consumers, giving us a brief idea of consumer
choices and behaviour. Natural Language Processing (NLP) and Sentiment Analysis (SA) have been used
to perform this analysis by detecting the emotion behind online texts. Sentiment analysis is widely
applied with the goal of examining opinions, evaluations, attitudes, judgments, and emotions toward a
product. This method is used to determine the number of positive and negative reviews that a product
has received and to further identify the opinions of consumers towards its various aspects.
Aspect-based sentiment analysis (ABSA) can identify different aspects of a product and determine the
sentiment corresponding to each aspect based on how they are related.
In our research, machine learning classifiers such as logistic regression, SVM, decision tree and
XGBoost were applied to the Flipkart review dataset and each model’s accuracy was then studied. We
applied ABSA to detect the aspects associated with different products. This helped us identify which
aspects of a product carry negative sentiment and therefore need improvement by the company. We also
ran the same classifiers after ABSA and concluded that the accuracy of the models increases after
aspect detection.
Keywords: Online Reviews, Natural Language Processing, Consumer Behaviour, Sentiment Analysis,
Aspect Based Sentiment Analysis
2. Introduction
Sentiment Analysis (SA) or Opinion Mining (OM) is the computational study of people’s emotions,
attitudes and opinions toward an item [1], deciphering whether the tone of the emotion is negative,
positive or neutral. The item can represent individuals, events or topics. Sentiment analysis allows
organisations to gain insights from the vast volumes of unstructured data in online sources such as
social media, emails, chats, blogs, and forums. The terms SA and OM are often used interchangeably
and share a common meaning [2]. Some scholars, however, hold that they differ: opinion mining
extracts and analyses people's opinions on a topic, while sentiment analysis detects and analyses the
sentiment expressed in a text. From a data mining perspective, sentiment analysis or sentiment mining
can be seen as a multi-level classification problem.
Research in sentiment analysis has grown with the increase in the volume of online text. Numerous
real-world applications require sentiment analysis for in-depth investigation. For example, product
analysis discovers which aspects of a product appeal to customers in terms of quality [3]. Consumers
frequently go through reviews of products, restaurants and hotels before making a purchase [4].
Through sentiment analysis the overall tone of these reviews can be identified. User-generated
content posted on online platforms such as forums, micro-blogs, or social networking sites may hinder
the process of sentiment analysis because of online spammers and low-quality opinions [5]. This
problem is addressed in this research in two ways: first, the Flipkart reviews in the dataset go
through moderation before they are visible to other consumers; second, each review must be submitted
with a rating, which acts as a basis for comparison.
Sentiment analysis often focuses on classifying the overall sentiment of a text without clarifying
what the sentiment is about [6]. This may not be enough if the text refers simultaneously to
different topics or entities (also known as aspects), possibly expressing different sentiments
towards different aspects. Identifying the sentiment relating to specific aspects in a text is a more
complex task known as aspect-based sentiment analysis (ABSA). It is an advanced form of sentiment
analysis that digs deep into text to identify specific aspects or features of a product or service
and determines the sentiment around each of them. This approach can give more insight into the
specific content that users find positive or negative. Summarising product reviews using sentiment
analysis helps in determining the product features that need improvement. Extracting sentiment from
product reviews makes it easier for the brand marketing team to reach customers who need extra care
and hence benefits the business.
3. Material and Methodology
In this research paper we have used sentiment analysis to analyse a dataset from Kaggle consisting of
product reviews on Flipkart
(https://www.kaggle.com/datasets/mansithummar67/flipkart-product-review-dataset). This dataset
contains 189874 rows and 5 columns of product information, namely product name, product price, rate,
review and summary, covering 104 different types of products.
The Python libraries used in this research are PIL, nltk, sklearn, textblob and wordcloud. Python
provides a favourable setting for carrying out this analysis due to its numerous libraries and
frameworks.
[Figure 1. Flowchart of the methodology, with stages labelled Review Data, Processing, Sentiment
Analysis and Results.]
Figure 1 shows the methodology used in this research, in which data processing and cleaning is a
crucial step before carrying out sentiment analysis. This step matters because classifiers can only
produce accurate findings when their input data is clean, and it also facilitates tokenization. The
right data set must be used to train a classifier properly. Sentiment analysis was then conducted
using different Machine Learning (ML) algorithms, which were compared to find the most accurate
results.
a. Data Cleaning: The reviews dataset consisted of unstructured and unformatted data which had to
be converted into a structured format. Data without a proper framework and structure is difficult
to work with and causes unnecessary errors while running programmes. Structured data draws
attention to the characteristics in reviews that we want the algorithm to identify, as tokenization
becomes easier. Tokenization is the process of breaking text down into smaller units that are
meaningful to the machine without losing the text’s original essence. We cleaned the data set by
carrying out the following steps:
1) All text was converted to lower case to maintain uniformity, and punctuation
marks were replaced with spaces to ease tokenization.
2) Any missing values (NaN) were replaced with spaces.
3) Non-alphanumeric characters were removed.
4) Common stop words were filtered out using NLTK's English stopwords list. Stop
words are words that are frequently used in spoken or written language but do not
carry useful information, e.g. a, is, are, that.
5) Phrases like "mind blowing" were also removed to prevent bias in sentiment
analysis. Such phrases are hyperboles used to express emotion with exaggeration;
the algorithm can often interpret them with the opposite meaning, which makes it
important to filter them from the text.
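The cleaning steps above can be illustrated with a minimal Python sketch. It is not the code used in this research: the function name clean_review, the short STOP_WORDS set (a stand-in for NLTK's much longer English stopwords list) and the HYPERBOLIC_PHRASES list are assumptions made for illustration only.

```python
import math
import re

# Assumption: a tiny stand-in for NLTK's English stop-word list, which is much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "that", "this", "and", "it"}

# Assumption: hypothetical phrase list; the text names only "mind blowing".
HYPERBOLIC_PHRASES = ["mind blowing"]


def clean_review(text):
    """Apply the five cleaning steps to a single review and return the cleaned string."""
    # Step 2: treat missing values (None or float NaN) as blank text.
    if text is None or (isinstance(text, float) and math.isnan(text)):
        return ""
    # Step 1: convert to lower case.
    text = str(text).lower()
    # Steps 1 and 3: replace punctuation and other non-alphanumeric characters with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse repeated whitespace so multi-word phrases match reliably.
    text = " ".join(text.split())
    # Step 5: remove hyperbolic phrases.
    for phrase in HYPERBOLIC_PHRASES:
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", text)
    # Step 4: drop stop words.
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return " ".join(tokens)


print(clean_review("Mind-blowing purchase! The camera is GOOD."))
```

Removing punctuation before the phrase filter means that hyphenated variants such as "mind-blowing" are caught by the same rule as "mind blowing".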
b. Sentiment Analysis
VADER (Valence Aware Dictionary and sEntiment Reasoner) was used to classify the reviews as negative
or positive, and a polarity score was then assigned to every review. Polarity scores are numerical
scores ranging from -1 (most negative) to 1 (most positive) that indicate the overall sentiment and
tone of a phrase or word. A column containing the polarity score was added to the dataset, and each
review was categorised as positive, negative or neutral based on this score.
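The mapping from polarity score to category can be sketched as follows. The cut-off of 0.05 is an assumption: it is the convention commonly suggested for VADER's compound score, and the thresholds actually used in this research are not stated. The function names are hypothetical, and vader_compound requires nltk with the vader_lexicon resource downloaded.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a polarity score in [-1, 1] to a sentiment category.

    Assumption: the +/-0.05 cut-off is a common convention, not a value from this paper.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def vader_compound(text):
    """Return VADER's compound polarity score for a review.

    Requires nltk and a prior nltk.download("vader_lexicon").
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]


# Labelling precomputed scores; in practice each score would come from vader_compound(review).
for score in (0.62, -0.31, 0.0):
    print(score, label_from_compound(score))
```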
Figure 2 depicts how, in aspect-based analysis, a phrase is broken into discrete aspects and the
sentiment attached to each one is identified. The sentence “Great car cover and has long range” is
broken into smaller aspects, which are later used to reach a conclusion about the nature of the phrase.
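To make the idea concrete, the sketch below pairs each aspect in the example sentence with its nearest opinion word. It is a deliberately simple rule-based illustration, not the aspect extraction method used in this research: the aspect vocabulary, the toy opinion lexicon and the nearest-word rule are all assumptions.

```python
# Assumption: hypothetical aspect vocabulary and toy opinion lexicon for illustration.
ASPECT_TERMS = ["car cover", "range"]
OPINION_WORDS = {"great": 1, "long": 1, "poor": -1, "short": -1}


def extract_aspect_sentiments(sentence):
    """Return {aspect: sentiment} by linking each aspect to the nearest opinion word."""
    tokens = sentence.lower().split()
    # Positions and polarities of all opinion words in the sentence.
    opinions = [(i, OPINION_WORDS[t]) for i, t in enumerate(tokens) if t in OPINION_WORDS]
    results = {}
    for aspect in ASPECT_TERMS:
        parts = aspect.split()
        for start in range(len(tokens) - len(parts) + 1):
            if tokens[start:start + len(parts)] == parts:
                if opinions:
                    # Choose the opinion word closest to the start of the aspect.
                    _, polarity = min(opinions, key=lambda o: abs(o[0] - start))
                    results[aspect] = "positive" if polarity > 0 else "negative"
                break
    return results


print(extract_aspect_sentiments("Great car cover and has long range"))
```

A single sentence can therefore yield one sentiment per aspect, which is the property that distinguishes aspect-based analysis from a single overall score.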
Figure 3 depicts the frequency of the words that occurred in the reviews, in descending order. These
words capture the main essence of the reviews and helped us identify the aspects that consumers found
appealing. The words “good”, “product” and “nice” have the highest frequencies, whereas “much” and
“amazing” have lower frequencies.
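A word-frequency ranking of this kind can be computed with Python's standard library. The sketch below is illustrative: the function name top_words and the three sample reviews are assumptions, and the input is expected to be already cleaned text.

```python
from collections import Counter


def top_words(reviews, n=10):
    """Return the n most frequent words across cleaned reviews, in descending order."""
    counts = Counter()
    for review in reviews:
        counts.update(review.split())
    return counts.most_common(n)


# Hypothetical sample of cleaned reviews.
sample = ["good product", "nice product good", "good"]
print(top_words(sample, n=3))
```

The same counts can be passed to a plotting library to draw a bar graph, or used to size the words in a word cloud.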
Figure 4. Bar graph for words used most frequently in positive reviews in descending order.
Figure 4 illustrates the words that consumers used most frequently in positive reviews. “Good” is the
word with the highest frequency, which suggests that most consumers convey their high satisfaction
with a product using this word.
Figure 5 is a word cloud of the words that occur most often in positive reviews. It corresponds to
Figure 4, which shows the same information as a bar graph. A word cloud is a data visualisation tool
in which the size of each word corresponds to its frequency: the larger and more prominent a word,
the more frequently it appears in the text.
Figure 6. Bar graph for words used most frequently in negative reviews in descending order.
Figure 6 depicts the words used most frequently in negative reviews, with “very” having the highest
frequency. This suggests that consumers who are dissatisfied with a product tend to express a high
degree of dissatisfaction.
Figure 7. Word Cloud for words used most frequently in negative reviews.
The word cloud in Figure 7 is made up of the words that had the highest occurrence in reviews that were
negative. The most common words in the text are the ones that are larger and more noticeable.
Figure 10. Bar Graph for aspects mentioned most frequently in Negative Reviews (Top 10)
Figures 9 and 10 illustrate the frequency of the most common aspects mentioned in positive and
negative reviews respectively. Consumers point out aspects of these products while writing their
reviews; identifying these aspects gives us a brief understanding of the scope for improvement that
products may have.
Classification Models
Machine learning models including Logistic Regression (word-level and n-gram), Decision Tree,
Support Vector Machines (SVM), Random Forest and XGBoost were used to predict sentiment labels.
The accuracies of these models were compared, and the results are shown in Table 1.
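A comparison of this kind can be sketched with sklearn as follows. This is an illustration under stated assumptions rather than the pipeline used in this research: the TF-IDF features, the 75/25 train/test split, the toy reviews and the function name compare_models are all hypothetical, and only three of the models are shown for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the real study used the Flipkart review dataset.
reviews = [
    "good product nice quality", "very bad waste of money",
    "nice phone good battery", "poor quality bad sound",
    "awesome value good camera", "worst product very poor",
    "good sound nice design", "bad battery poor service",
]
labels = ["positive", "negative"] * 4


def compare_models(texts, targets, seed=0):
    """Fit several text classifiers and return {model name: test accuracy}."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, targets, test_size=0.25, random_state=seed, stratify=targets
    )
    models = {
        # Word-level features: single words only.
        "logreg_word": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        # N-gram features: single words plus two-word sequences.
        "logreg_ngram": make_pipeline(
            TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": make_pipeline(
            TfidfVectorizer(), DecisionTreeClassifier(random_state=seed)
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(x_test))
    return scores


print(compare_models(reviews, labels))
```

Accuracy here is the fraction of held-out reviews whose predicted label matches the true label; scores on a toy sample this small say nothing about performance on the real dataset.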
Figure 11 compares the accuracy of all 5 models used in sentiment-based and aspect-based analysis.
The Decision Tree, Random Forest and XGBoost models show the highest accuracy in the aspect-based
analysis, with an accuracy score of 1.00. The SVM, Random Forest and XGBoost models have the highest
accuracy in the sentiment-based analysis. The Logistic Regression (n-gram) model showed the lowest
accuracy in both. All models displayed higher accuracy scores in the aspect-based analysis than in
the sentiment-based analysis.
Conclusions
Aspect-based sentiment analysis provides a more detailed and precise knowledge of sentiments than
typical sentiment-based analysis. While sentiment-based analysis calculates an overall sentiment
score for a given text, aspect-based analysis breaks down sentiment by the particular aspects or
qualities mentioned in the text. This enables a more precise and context-specific interpretation of
feelings, which is especially important for understanding customer feedback on specific product or
service characteristics.
Because aspect-based sentiment analysis assesses sentiments more thoroughly, it frequently produces
results with higher precision and accuracy. By concentrating on specific elements, like "battery
life" or "customer service", it lessens the uncertainty that can arise from generic sentiment
analysis. As a result, aspect-based sentiment analysis models typically produce more actionable
insights, since they are able to capture and distinguish the sentiment associated with specific
variables more effectively.
In conclusion, compared to traditional sentiment-based analysis, aspect-based sentiment analysis
improves model performance by isolating and evaluating particular elements, allowing for a deeper and
more precise assessment of sentiments.
References
1. Walaa Medhat, Ahmed Hassan, Hoda Korashy, Sentiment analysis algorithms and applications: A
survey, Ain Shams Engineering Journal, Volume 5, Issue 4, 2014, Pages 1093-1113, ISSN 2090-4479,
https://doi.org/10.1016/j.asej.2014.04.011.
2. Naveen Kumar Laskari, Suresh Kumar Sanampudi, Aspect Based Sentiment Analysis Survey,
IOSR Journal of Computer Engineering (IOSR-JCE), e-ISSN: 2278-0661, p-ISSN: 2278-8727,
Volume 18, Issue 2, Ver. I (Mar-Apr. 2016), PP 24-28, www.iosrjournals.org
3. Wankhade, M., Rao, A.C.S. & Kulkarni, C. A survey on sentiment analysis methods,
applications, and challenges. Artif Intell Rev 55, 5731–5780 (2022).
https://doi.org/10.1007/s10462-022-10144-1
4. Stine, Robert A, Sentiment Analysis, Annual Review of Statistics and Its Application,
Volume 6, 2019, P 287-308, 2326-831, https://doi.org/10.1146/annurev-statistics-030718-105242
5. Fang, X., Zhan, J. Sentiment analysis using product review data. Journal of Big Data 2, 5 (2015).
https://doi.org/10.1186/s40537-015-0015-2
6. Mickel Hoang, Oskar Alija Bihorac, and Jacobo Rouces. 2019. Aspect-Based Sentiment Analysis
using BERT. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages
187–196, Turku, Finland. Linköping University Electronic Press.
7. A. Poornima and K. S. Priya, “A Comparative Sentiment Analysis of Sentence Embedding
Using Machine Learning Techniques,” 2020 6th Int. Conf. Adv. Comput. Commun. Syst.
ICACCS 2020, pp. 493–496, 2020, doi: 10.1109/ICACCS48705.2020.9074312.
8. Asha J and Meenakowshalya A, Fake News Detection Using N-Gram Analysis and Machine
Learning Algorithms, ISSN: 2349-901X, Volume 8, Issue 1, 2021, DOI (Journal):
10.37591/JoMCCMN
9. Rokach, L., Maimon, O. (2005). Decision Trees. In: Maimon, O., Rokach, L. (eds) Data Mining
and Knowledge Discovery Handbook. Springer, Boston, MA.
https://doi.org/10.1007/0-387-25465-X_9
10. Ahmedbahaaaldin Ibrahem Ahmed Osman, Ali Najah Ahmed, Ming Fai Chow, Yuk Feng Huang,
Ahmed El-Shafie, Extreme gradient boosting (Xgboost) model to predict the groundwater levels
in Selangor Malaysia, Ain Shams Engineering Journal, Volume 12, Issue 2, 2021, Pages 1545-1556,
ISSN 2090-4479, https://doi.org/10.1016/j.asej.2020.11.011.