
MONITORING FAKE REVIEW OF PRODUCTS: ENSURING

AUTHENTICITY IN CONSUMER FEEDBACK

by

KAVIYASRI R - 71382106030
SHAKTHI R - 71382106046
SWATHY M - 71382106055

Report submitted in partial fulfilment


of the requirements for the Degree
of Bachelor of Technology in
Information Technology

Sri Ramakrishna Institute of Technology

Coimbatore – 641010

April 2025
CERTIFICATE

Certified that the project titled MONITORING FAKE REVIEW OF PRODUCTS:


ENSURING AUTHENTICITY IN CONSUMER FEEDBACK is the bonafide work done by
Kaviyasri R (71382106030), Shakthi R (71382106046), Swathy M (71382106055), in the
FINAL YEAR PROJECT (Phase - I) of this institution, as prescribed by Sri Ramakrishna Institute
of Technology for the SEVENTH Semester of the B. Tech. Programme during the year 2024 - 2025.

Project Supervisor Head of the Department

Submitted to the project viva-voce held on ____________________

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

First of all, we record our grateful thanks to the Almighty for the blessings that made this project a
grand success.

We would like to express our sincere gratitude and heartfelt thanks to our Principal, Dr. J. David
Rathnaraj, Ph.D., for providing all kinds of technological resources and enabling us to strive for excellence.

We express our profound thanks to Dr. J. J. Adri Jovin, Ph.D., Head of the Department,
Information Technology for extending his help in using all the lab facilities.

We thank our Project Coordinator Dr. T. C. Ezhil Selvan, Ph.D., Assistant Professor (Selection
Grade), Department of Information Technology, Sri Ramakrishna Institute of Technology, for her
guidance throughout the entire duration of the project.

We take immense pleasure in thanking our Project Supervisor, Ms. U. Elakkiya, Assistant Professor
(Selection Grade), Department of Information Technology, for spending her valuable time
guiding us and for her constant encouragement throughout this project.

Finally, we would like to express our heartfelt thanks to our beloved parents for their blessings, and
to our friends for their help and wishes for the successful completion of this project.

APPROVAL AND DECLARATION

This project report titled “MONITORING FAKE REVIEW OF PRODUCTS: ENSURING


AUTHENTICITY IN CONSUMER FEEDBACK” was prepared and submitted by Kaviyasri R
(71382106030), Shakthi R (71382106046), Swathy M (71382106055) and has been found satisfactory
in terms of scope, quality and presentation as partial fulfilment of the requirement for the Bachelor of
Technology in Sri Ramakrishna Institute of Technology, Coimbatore (SRIT).

Checked and Approved by

Ms. U. Elakkiya

Project Supervisor

Assistant Professor

Department of Information Technology


Sri Ramakrishna Institute of Technology,
Coimbatore -10.

April 2025

MONITORING FAKE REVIEW OF PRODUCTS: ENSURING
AUTHENTICITY IN CONSUMER FEEDBACK

ABSTRACT

In the digital age, customer reviews play a pivotal role in shaping consumer decisions and building
brand credibility. However, the increasing prevalence of fake reviews—those artificially
generated to either boost or tarnish product ratings—threatens the authenticity of online feedback,
leading to mistrust among consumers. This study presents an approach to monitoring and detecting
fake reviews using both the Logistic Regression algorithm and Long Short-Term Memory (LSTM)
networks. The method involves analyzing text-based review features, user behavior, and product
metadata to identify patterns characteristic of inauthentic feedback. Logistic Regression, a robust
and interpretable machine learning algorithm, models the probability of a review being fake, while
LSTM captures deep semantic patterns and contextual dependencies in the review text. The
models are trained on a labeled dataset of verified and unverified reviews to enhance prediction
accuracy, with results showing that combining Logistic Regression and LSTM achieves high
precision and recall in distinguishing fake from real reviews. Performance is further optimized
through feature extraction and sequence modeling, resulting in a balanced approach that minimizes
false positives and negatives. This research offers a scalable solution for e-commerce platforms,
promoting transparency in customer feedback. Future work includes enhancing NLP integration
and adapting the models to evolving review manipulation tactics.

TABLE OF CONTENTS

CERTIFICATE i
ACKNOWLEDGEMENT ii
APPROVAL AND DECLARATION iii
ABSTRACT iv
TABLE OF CONTENTS v
LIST OF FIGURES viii
LIST OF ABBREVIATIONS ix

CHAPTER 1 INTRODUCTION
1.1 Historical Background 1
1.2 Overview 2

1.3 Monitoring fake review of products 3


1.4 Proposed system 4
1.5 Necessity for monitoring fake review of product 5
1.6 Advantages of the system 6
1.7 Applications 7

CHAPTER 2 LITERATURE REVIEW 9

CHAPTER 3 METHODOLOGY
3.1 Existing system 14
3.2 Architecture overview 15
3.3 Data collection and preprocessing 15
3.4 Feature Extraction 16
3.5 Word embeddings in text classification 17
3.6 Model Development 18
3.7 Long short term memory 19

3.8 Data Flow Diagram 20
3.9 Integration and Deployment 20
3.10 Monitoring and Maintenance 21
3.11 Requirement Specification 21
3.11.1 Hardware Requirements 21
3.11.2 Software Requirements 21
3.11.3 Anaconda 22
3.11.4 Python 22
3.11.5 Scikit-learn 22
3.11.6 Natural Language Toolkit 23
3.11.7 Pandas 23
3.11.8 TensorFlow 24

CHAPTER 4 RESULTS AND DISCUSSION


4.1 Data overview 25
4.2 Text Preprocessing 26
4.3 Model training and evaluation 26
4.4 Model accuracy comparison 27
4.5 Data distribution visualization 28
4.6 User interaction and real-time classification 29

CHAPTER 5 CONCLUSION
5.1 Future work 31

REFERENCES 33

APPENDIX 34

LIST OF FIGURES

Fig No    Name of the Figure    Page No

3.2.1 Architecture Overview 16

3.7.1 Logistic regression graph model 20

3.8.1 Data flow diagram 21

4.1.1 Data overview 26

4.1.2 Number of Reviews 26

4.3.1 Evaluation 27

4.3.2 Classification Report 28

4.4.1 Visualization of review type 28

4.4.2 Confusion matrix 29

4.5.1 User input 29

4.5.2 Final Result 29

4.5.3 Algorithm choosing 29

4.5.4 User Input and Result 30

LIST OF ABBREVIATIONS

Acronym    Expansion

TF-IDF Term Frequency-Inverse Document Frequency

NLP Natural Language Processing

LR Logistic Regression

SCIKIT SciPy Toolkit

OR Original review

NLTK Natural Language Toolkit

LSTM Long Short-Term Memory

CHAPTER 1

INTRODUCTION

1.1 Historical Background

The history of fake reviews is tied to the rise of e-commerce and the increasing influence
of user-generated content on consumer choices. As platforms like Amazon and eBay introduced
product reviews in the 1990s and early 2000s, consumers began to rely on these reviews when
making purchasing decisions. In the early days, fake reviews were uncommon and were not
treated as a serious concern by platforms. However, as the impact of reviews on sales became clear,
companies began to publish fake reviews to promote their own products or to tarnish the reputations of
their competitors. By 2010, as e-commerce continued to grow, fake reviews had become a major
problem, undermining the review process on many platforms.

To address this problem, platforms began exploring machine learning approaches, among which
logistic regression became a popular choice due to its simplicity and efficiency in determining whether
a review is genuine. Advances in natural language processing (NLP) have since led to more advanced
detection models, including decision trees and support vector machines, which increase accuracy
by identifying complex patterns in review data. Despite the emergence of these models,
logistic regression is still widely used due to its balance between performance and interpretability,
and it is especially useful for platforms that require fast, reliable solutions. Detecting fake reviews
is important for platforms and regulators alike in keeping online reviews trustworthy. As fraud strategies
evolve, such models continue to play a significant role, often supplemented by more sophisticated
methods to ensure the trust and credibility of online reviews.

1.2 Overview

In today’s digital marketplace, online product reviews are vital for consumers making
purchasing decisions and for businesses seeking to maintain a positive brand reputation. Reviews
provide valuable feedback, allowing potential customers to gauge the quality, usability, and
trustworthiness of a product based on the experiences of others. Unfortunately, the rise of fake
reviews, artificially generated or manipulated feedback aimed at deceptively influencing a
product’s ratings, has compromised the integrity of these systems. Fake reviews may be fabricated
to inflate the ratings of low-quality products or to sabotage competitors, making it increasingly
difficult for consumers to discern authentic feedback from fraudulent input. This manipulation
undermines consumer trust, distorts competition, and ultimately affects the entire e-commerce
ecosystem.

Traditional detection techniques, such as manual moderation or keyword-based filtering, fall short
of adequately addressing this issue, as they lack the scalability and adaptability required for
complex review patterns. As a result, machine learning (ML) approaches have become essential
for detecting fake reviews. Logistic Regression is a strong choice for fake review detection due to
its interpretability and effectiveness in binary classification. It captures key patterns like abnormal
word usage, extreme sentiment, and suspicious user behavior. To enhance this, LSTM networks
are used to understand contextual and sequential relationships in the review text. LSTM helps
detect subtle patterns that Logistic Regression may miss. Together, they provide a balanced and
accurate approach to identifying fake reviews.

This study explores the use of Logistic Regression to classify product reviews as genuine or
computer-generated using textual, behavioral, and temporal features. The aim is to develop a
scalable and interpretable model for detecting fake reviews in online platforms. Logistic
Regression captures key indicators like sentiment polarity, word frequency, and suspicious user
behavior. To enhance performance, Long Short-Term Memory (LSTM) networks are incorporated
to understand sequential and contextual patterns in review text. LSTM helps detect subtle linguistic
cues that may go unnoticed by traditional models. The combined approach improves accuracy and
balances interpretability with deep learning. This contributes to safer e-commerce environments
by ensuring review authenticity. It also lays the groundwork for further research in NLP and online
fraud detection.

1.3 Monitoring fake Review of Products

Online reviews significantly influence consumer purchasing behaviour and brand reputation. With
the growing reliance on e-commerce, the emergence of fake reviews, either generated by bots or
posted by paid individuals, poses a critical challenge by manipulating product ratings and
misleading consumers. Therefore, detecting fake reviews is vital for maintaining the integrity and
trustworthiness of online platforms.

This project introduces a hybrid approach combining Logistic Regression and LSTM-based neural
networks to classify product reviews as genuine or fake. Logistic Regression serves as a strong
baseline for binary classification tasks, leveraging TF-IDF (Term Frequency–Inverse Document
Frequency) to convert textual reviews into numerical features that highlight significant words and
patterns common in deceptive content.
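As an illustrative sketch of this baseline (the four reviews and their labels below are invented, not taken from the project's dataset), TF-IDF features can be fed into Logistic Regression with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled reviews: 0 = genuine, 1 = fake (illustrative only)
reviews = [
    "Great product, works exactly as described",
    "BEST BEST BEST product buy now amazing deal!!!",
    "Battery life is average but acceptable for the price",
    "Amazing amazing item best best seller buy today!!!",
]
labels = [0, 1, 0, 1]

# TF-IDF turns each review into a weighted word-frequency vector;
# Logistic Regression then models the probability that a review is fake.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["best best amazing product buy now!!!"]))
```

The learned coefficients (`clf[-1].coef_`) show which words and bigrams push a review toward the fake class, which is the interpretability this baseline is valued for.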

To enhance the model’s performance in capturing semantic and contextual cues, a deep learning
pipeline is also implemented. This begins with an Embedding layer, which transforms tokenized
words into dense vector representations that retain semantic meaning. Padding ensures uniform
input sequence length, which is essential for batch processing in LSTM layers. The LSTM (Long
Short-Term Memory) component is capable of learning long-range dependencies and
understanding the sequence of words, which helps detect subtle inconsistencies in fake reviews.

Following the LSTM, one or more Dense layers process the extracted features and output the
final classification label: genuine or fake. Dropout regularization may be applied in the
dense layers to prevent overfitting, and activation functions such as ReLU and Sigmoid are used for
non-linearity and binary classification, respectively.
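A minimal Keras sketch of this pipeline is given below; the vocabulary size (1,000), sequence length (20), layer sizes, and the tokenized inputs are all illustrative choices, not values taken from the report:

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy tokenized reviews (integer word IDs), pre-padded to a fixed length.
# Real input would come from a tokenizer fitted on the review corpus.
maxlen, vocab = 20, 1000
seqs = [[12, 7, 45, 3], [9, 9, 31, 2, 17]]
X = np.array([[0] * (maxlen - len(s)) + s for s in seqs])  # pre-padding with 0s

model = models.Sequential([
    layers.Embedding(input_dim=vocab, output_dim=32),  # dense word vectors
    layers.LSTM(64),                                   # long-range sequence context
    layers.Dropout(0.5),                               # regularization
    layers.Dense(16, activation="relu"),
    layers.Dense(1, activation="sigmoid"),             # P(review is fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
print(model.predict(X, verbose=0).shape)  # one probability per review
```

Pre-padding keeps the informative tokens nearest the LSTM's final state, which is the state the dense classifier reads.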

The system is evaluated using key performance metrics including accuracy, precision, recall, and
F1-score. These help in understanding the model's ability to correctly identify fake reviews while
minimizing both false positives (genuine reviews marked fake) and false negatives (fake reviews
marked genuine).
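These metrics follow directly from the confusion counts; a toy example with invented labels (1 = fake) shows the scikit-learn calls:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 0, 1, 1, 1, 0]  # ground truth (1 = fake)
y_pred = [0, 1, 1, 1, 0, 0]  # model output: one false positive, one false negative

print(accuracy_score(y_true, y_pred))   # correct predictions / all predictions
print(precision_score(y_true, y_pred))  # flagged-as-fake reviews that really were fake
print(recall_score(y_true, y_pred))     # fake reviews that were actually caught
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```

In this toy case all four scores happen to equal 2/3; in practice precision and recall trade off against each other, which is why the F1-score is reported alongside them.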

This approach supports real-time integration into online review systems where it can automatically
flag or remove potentially deceptive reviews. By automating this process, platforms can reduce
manual moderation and maintain the credibility of user-generated content.

The model’s scalability allows it to adapt across various domains, such as product reviews, travel
feedback, and app ratings. With proper retraining, it can also be customized for specific languages,
regions, or review formats, expanding its usability and impact.

1.4 Proposed System

In an increasingly digital market where online reviews significantly influence consumer decisions,
the need for an effective system to monitor and manage fake reviews is more crucial than ever.
The proposed system combines traditional machine learning techniques like logistic regression
with advanced deep learning methods, specifically LSTM (Long Short-Term Memory) networks,
to enhance the accuracy and reliability of fake review detection. This hybrid system utilizes
Natural Language Processing (NLP) and user behavior analysis to uncover deceptive patterns
embedded in textual reviews and reviewer activity.

The logistic regression component serves as a robust and interpretable binary classifier that
analyzes historical data of both genuine and fake reviews. It captures behavioral and textual
patterns such as review frequency, sentiment polarity, and unusual activity timelines (for example, a
user posting numerous reviews in a short span). Additionally, TF-IDF (Term Frequency-Inverse
Document Frequency) is used to vectorize review texts, helping the model identify linguistic cues
like repetitive or promotional language often found in fake reviews.

To deepen the system's understanding of textual context, an LSTM-based neural network is
introduced. The review text is pre-processed through tokenization and padding, followed by an
embedding layer that converts each word into a dense vector representation. The LSTM
component captures the sequential dependencies in the review text, learning the flow and tone,
which are key in identifying subtle inconsistencies and fabricated narratives. This layer is followed
by dense layers for classification, with dropout regularization to prevent overfitting and sigmoid
activation for binary output.

Furthermore, the system incorporates sentiment consistency checks across multiple reviews and
flags anomalies in overly positive or negative feedback. Behavioral signals like changes in posting
patterns or coordinated review bursts are used as additional flags. Reviews flagged by the model
are scored and prioritized for further manual or automated scrutiny, ensuring rapid action on highly
suspicious entries.
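As one sketch of such a behavioral flag (the review log, column names, and the threshold of three reviews per day are all invented for illustration), a posting burst can be detected with pandas:

```python
import pandas as pd

# Hypothetical review log; in a real deployment this would come from the platform.
log = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "timestamp": pd.to_datetime([
        "2024-05-01 10:00", "2024-05-01 10:05", "2024-05-01 10:09",
        "2024-05-01 09:00", "2024-06-01 09:00",
    ]),
})

# Count reviews per user per calendar day; a burst of posts in a short
# span is treated as a suspicious behavioral signal.
daily = log.groupby(["user", log["timestamp"].dt.date]).size()
flagged = daily[daily >= 3].reset_index()["user"].tolist()
print(flagged)
```

Signals like this would be combined with the textual scores rather than used alone, since legitimate users occasionally post several reviews in one day.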

A feedback mechanism is integrated into the platform, allowing users to flag reviews they find
suspicious. These inputs are used to retrain the model periodically, ensuring it adapts to evolving
deceptive strategies. This continuous learning loop, coupled with multi-layered analysis, allows
the system to scale effectively and remain accurate over time.

Overall, this proposed system offers a scalable, adaptive, and multi-dimensional solution for fake
review detection—enhancing trust in online platforms, protecting consumers, and helping
maintain a transparent digital marketplace.

1.5 Necessity for Monitoring Fake Reviews of Products

The Power of Online Reviews in Consumer Decision-Making


In the digital marketplace, product reviews serve as a significant influence on consumer
choices. A product's reputation is frequently improved by positive reviews, which increase sales
and foster trust among prospective customers. On the other hand, negative reviews can discourage
purchases, impacting both a brand’s image and its revenue. As a result, reviews have become a
powerful tool that can shape the success of products and businesses in the online ecosystem.

The Rise of Fake Reviews


With the impact of reviews so profound, the phenomenon of fake reviews has increased,
creating challenges for both consumers and platforms. These fake reviews can be generated by
competitors looking to discredit a product, or by businesses attempting to artificially inflate their
ratings to appear more favorable. Such deceptive practices undermine the integrity of online
review systems, often leading consumers to make misguided purchasing decisions.

The Need for Monitoring and Detecting Fake Reviews


Monitoring fake reviews has become essential for maintaining fair and trustworthy online
marketplaces. Fake reviews not only erode consumer trust but also create an unfair competitive
landscape where genuine businesses struggle to compete against artificially boosted competitors.
The presence of deceptive reviews can also degrade the quality of a platform, as users become
skeptical of the authenticity of all reviews, even legitimate ones.

Benefits of Monitoring Fake Reviews
By implementing systems to detect and filter fake reviews, platforms can uphold an honest
review ecosystem that benefits consumers and businesses alike. Monitoring fake reviews helps
protect consumers from misinformation, fosters a fair business environment, and contributes to
customer satisfaction by ensuring that the reviews they rely on reflect genuine feedback. In this
way, detecting fake reviews is not only a tool for preventing fraud but a step towards promoting
transparent and ethical business practices online.

1.6 Advantages of the system

The proposed system using the Logistic Regression algorithm offers several advantages in
detecting fake reviews, making it a robust choice for e-commerce and review-driven platforms.
Here are some key benefits:

• Scalability: The Logistic Regression model is computationally efficient, allowing it to


handle large volumes of incoming reviews in real time. This makes it ideal for high-traffic
platforms where thousands of reviews are posted daily, ensuring timely and continuous
monitoring.

• Interpretability: One of the primary strengths of Logistic Regression is its interpretability.


The model’s coefficients provide insights into how each feature impacts the likelihood of
a review being fake, which is valuable for platform administrators who want transparency
into the decision-making process. This interpretability builds trust among users and
stakeholders, as the system can explain its classifications clearly.

• Adaptability: The system is designed to evolve with feedback loops, allowing it to adapt
to new patterns in fake review tactics. By continuously updating the model with newly
identified fake reviews, it remains relevant and effective, even as fake review methods
become more sophisticated.

• Real-Time Monitoring: The system is capable of monitoring reviews in real time,


providing immediate insights into review authenticity. This helps platforms maintain up-to-date
information and respond quickly to deceptive practices, protecting consumer trust
and ensuring review quality.

• Flexibility for Feature Updates: The model can easily incorporate new features as
needed, such as additional behavioral signals or advanced text processing techniques, to
improve its accuracy and adapt to platform-specific requirements.

• User Trust and Platform Credibility: By removing fake reviews, users can rely on
genuine feedback, which builds trust and loyalty among customers. A credible review
system attracts more users and businesses, as customers know they are getting honest
opinions.

• Easy to Maintain and Update: Once deployed, the model requires minimal maintenance,
and updates can be performed quickly with new data. This ensures that platforms can keep
up with evolving fake review tactics without extensive downtime or costs.

1.7 Applications

• E-commerce Marketplaces: Platforms like Amazon, eBay, and Walmart rely heavily on
customer reviews to influence purchasing decisions. A fake review monitoring system can
help these platforms maintain high levels of credibility by filtering out deceptive reviews.
This enhances customer trust and ensures that product ratings accurately reflect user
experiences, improving the overall shopping experience.
• Travel and Hospitality Platforms: Websites like TripAdvisor, Booking.com, and Airbnb
use reviews as primary indicators of quality. In this sector, fake reviews can harm both
customers and hosts. A monitoring system can identify and remove inauthentic reviews,
protecting guests from misleading listings and helping hosts and service providers build
authentic reputations.
• Mobile Apps and Digital Goods: App stores like Google Play and Apple’s App Store rely
on user reviews for app visibility and user trust. Fake review detection is essential here, as
app developers may try to manipulate ratings to increase downloads or discredit
competitors. Monitoring ensures that users have access to accurate information when
choosing apps or digital services.
• Product Review Aggregators and Comparison Sites: Platforms like Consumer Reports,
Wirecutter, and CNET aggregate reviews and make product recommendations. A fake
review monitoring system helps these platforms maintain the integrity of their
recommendations, as they can filter out misleading reviews before including them in
product comparisons or rankings.
• Social media and Influencer Marketing: With the rise of social commerce on platforms
like Instagram, Facebook, and TikTok, reviews and testimonials play a significant role in
influencing purchases. Monitoring fake reviews in this context prevents misleading
promotional content from skewing consumer perceptions and protects users from
potentially unreliable endorsements.
• Online Grocery and Food Delivery Services: Apps like Instacart, DoorDash, and
UberEats use reviews to influence customers’ choices of restaurants and groceries. A
monitoring system for fake reviews can help ensure that consumers are not misled by false
praise or criticism, allowing them to make decisions based on genuine feedback.

CHAPTER 2

LITERATURE REVIEW

[1] JINGDONG WANG, et al. (2020) Fake Review Detection Based on Multiple Feature
Fusion and Rolling Collaborative Training.

Fake reviews can easily mislead customers and damage a company’s reputation, sometimes even
causing significant financial losses. Therefore, it is essential to identify and filter out fake reviews
effectively. Most traditional methods suffer from low accuracy because they rely on a single
feature type and lack sufficient labeled data for training. To solve this problem, the authors
proposed a novel method combining multiple feature fusion and rolling collaborative training to
detect fake reviews more accurately. The method starts by building an initial index system that
includes a wide range of features such as textual content, reviewer behavior, and sentiment
analysis. An initial training dataset is then prepared, where features from the reviews are extracted
and corresponding algorithms are applied. The reviews in this dataset are manually labeled to
ensure accuracy. Seven different classifiers are trained on this labeled dataset. Among them, the
one with the highest accuracy is selected to classify new reviews. As more reviews are classified,
the training sample size automatically increases, improving the model’s performance over time.
This rolling collaborative training approach allows the system to learn from both labeled and high-
confidence unlabeled data, making it more adaptive and effective in identifying fake reviews
across various platforms.

[2] SONGKAI TANG, et al. (2021) Fraud Detection in Online Product Review
Systems via Heterogeneous Graph Transformer.

With the rapid growth of e-commerce, fake reviews have become a serious threat to online product
review systems. To effectively detect fraudulent behavior, the authors proposed a novel method using a
Heterogeneous Graph Transformer (HGT). Traditional approaches often fail to capture complex
relationships between users, products, and reviews. This method overcomes that limitation by
modeling the review ecosystem as a heterogeneous graph, where different types of nodes (users,
products, reviews) and edges (writing, rating, purchasing) are considered. The key idea of the
proposed approach is to capture the semantic and structural information embedded within these
heterogeneous interactions. This helps the model to focus on important connections, such as
suspicious behavior patterns between users and products. The use of graph-based learning also
makes the model scalable and adaptable to new data without needing complete retraining. This
work is significant as it introduces graph neural networks into the domain of fake review detection,
offering a powerful way to model real-world relationships and detect anomalies with greater
accuracy.

[3] RAMI MOHAWESH, et al. (2021) A Survey of Fake Review Detection Techniques in E-
Commerce.

As online shopping continues to dominate the retail space, e-commerce platforms face increasing
challenges due to the rise of fake reviews. These reviews can manipulate public perception,
mislead consumers, and cause financial damage. The authors conducted a comprehensive survey of various
fake review detection techniques used in e-commerce environments. The paper categorizes
existing methods into several groups based on the type of data used, such as text-based, behavior-
based, and hybrid approaches. Text-based methods analyze the linguistic content of reviews using
techniques like natural language processing (NLP), sentiment analysis, and keyword frequency.
Behavior-based methods focus on reviewer activities, such as review timing, rating patterns, and
user-review-product relationships. Additionally, the authors discuss the application of various
machine learning and deep learning algorithms in this domain, including Support Vector Machines
(SVM), Decision Trees, Neural Networks, and more recently, Graph Neural Networks (GNNs).
The survey also highlights challenges such as data imbalance, label scarcity, and adaptive fraud
strategies used by fake reviewers.

[4] MEILING LIU, et al. (2021) Detecting Fake Reviews Using Multidimensional
Representations with Fine-Grained Aspects Plan.

Fake reviews have become a major concern in e-commerce, often misleading buyers and damaging
brand trust. Earlier work combined multiple feature fusion with rolling collaborative training,
using text, behavior, and sentiment features to improve classification accuracy over time,
and introduced a heterogeneous graph transformer model that captures complex relationships
among users, products, and reviews for more accurate fraud detection. Building on this, the authors
developed a model using multidimensional representations and fine-grained aspect analysis to detect
inconsistencies between review content and ratings. These studies emphasize the need for advanced
models that integrate diverse features and leverage deep semantic analysis to improve fake review detection.

[5] HINA TUFAIL, et al. (2022) The Effect of Fake Reviews on e-Commerce During and
After Covid-19 Pandemic: SKL-Based Fake Reviews Detection.

A study focused on the effect of fake reviews on e-commerce during and after the COVID-19
pandemic introduced a detection method based on Sequential K-means Learning (SKL). The
increase in online shopping during the pandemic led to a surge in fake reviews, affecting consumer
trust and buying behavior. The proposed SKL model clusters reviews based on content similarity
and identifies outliers as potential fake reviews. This method is efficient for processing large
volumes of data and supports real-time detection. It also addresses both behavioral and textual
features, improving overall accuracy. The study highlights the importance of adaptive systems that
can detect fake content in evolving online environments.

[6] MUJAHED ABULQADER, et al. (2022) Unified Fake Review Detection Using Deception
Theories.

A unified fake review detection method was introduced using deception theories to enhance the
accuracy of identifying fraudulent content. This approach goes beyond standard text or behavior-
based techniques by incorporating psychological and linguistic cues rooted in how people naturally
lie or manipulate information. It analyzes reviews for signs of exaggeration, emotional imbalance,
and inconsistencies in tone or structure—patterns often linked with deceptive intent. The system
also evaluates the coherence between the review content and the given rating, a mismatch that
frequently occurs in fake reviews. Deception theories provide a theoretical foundation to detect
subtle dishonesty, allowing the model to distinguish between genuine and fabricated reviews more
effectively. Machine learning algorithms are trained using these deception-based features, which
improves classification performance. The model performs well across different datasets and can
adapt to various e-commerce scenarios. By integrating multiple types of deceptive signals, the
system offers a more comprehensive and reliable method for fake review detection in online
platforms.

[7] MOHAMMED ENNAOURI, et al. (2023) A Comprehensive Review of Sentiment
Analysis Techniques in Fake Review Detection.

A comprehensive review was conducted on sentiment analysis techniques used in fake review
detection, highlighting their effectiveness in identifying deceptive content in online platforms.
Sentiment analysis focuses on understanding the emotional tone of a review and identifying
inconsistencies between expressed sentiment and actual rating or context. Rule-based techniques
rely on predefined sentiment dictionaries, while machine learning approaches use classifiers like
SVM and Naive Bayes to detect sentiment-driven anomalies. Deep learning models, including
RNNs, CNNs, and transformers, are capable of capturing complex linguistic features and hidden
patterns within reviews. The study also explores aspect-based sentiment analysis, which evaluates
sentiments related to specific product features, revealing mismatches that may suggest
manipulation. By analyzing these irregularities, sentiment analysis serves as a powerful tool to
uncover deceptive behavior. The review concludes that combining sentiment analysis with
behavioral and contextual data significantly improves detection accuracy, and recommends hybrid
models for robust performance in real-world scenarios.

[8] SHUNXIANG ZHANG, et al., (2023) Building Fake Review Detection Model Based on Sentiment Intensity and PU Learning.

A fake review detection model was developed using sentiment intensity analysis combined with
Positive-Unlabeled (PU) learning to improve accuracy in scenarios with limited labeled data. The
method focuses on identifying emotional extremes and inconsistencies in reviews, as fake reviews
often contain exaggerated sentiments to manipulate consumer perception. Sentiment intensity is
measured to evaluate the strength and polarity of the review content, helping detect unnatural
emotional expressions. Since manually labeling all fake reviews is impractical, PU learning is
applied to work with only a small set of positively labeled fake samples and a larger set of
unlabeled data. This approach reduces dependence on large labeled datasets, making it scalable
and more practical for real-world applications. Experimental results show improved performance
compared to traditional supervised learning models. This method offers a flexible and efficient
solution for platforms facing challenges with unbalanced or incomplete review datasets,
emphasizing the potential of semi-supervised learning in fake review detection.

[9] HINA TUFAIL, et al., (2024) Fake Reviews Detection in Online Shopping Platforms During the COVID-19 Era.

Fake review detection during the COVID-19 era focused on the surge of deceptive activities in
online shopping platforms caused by the pandemic-driven e-commerce boom. The approach
explores how behavioral changes in consumers and sellers during this period led to an increase in
fake reviews aiming to exploit market trends. The detection model integrates temporal features,
user behavior, and review content to identify fraudulent patterns that emerged specifically during
the pandemic. The system uses machine learning techniques to classify reviews based on these
combined features, showing improved adaptability to the changing online landscape. Special
attention is given to the context of reviews, as many fake entries used pandemic-related keywords
or sentiments to gain credibility. This research emphasizes the importance of context-aware
systems and adaptive models that can respond to external events influencing online behavior. It
contributes to the development of more resilient fake review detection systems suited for dynamic
and crisis-affected e-commerce environments.

[10] EHSAN ABEDIN, et al., (2024) Understanding the Credibility of Online Drug Reviews.

A study was conducted to understand the credibility of online drug reviews, focusing on the unique
challenges posed by the healthcare sector. Unlike typical e-commerce products, drug reviews carry
critical implications for patient safety and treatment decisions, making the detection of fake or
misleading content especially important. The model evaluates reviews based on linguistic patterns,
user expertise indicators, and consistency with known medical information. Features such as
overly emotional language, unsupported health claims, and mismatched symptoms are flagged as
potential signs of inauthenticity. Advanced natural language processing techniques are used to
assess the trustworthiness of review content and its alignment with clinical knowledge. The study
highlights the difficulty in distinguishing subjective experiences from fake information in drug
reviews. By combining domain-specific knowledge with machine learning, the model aims to
improve the accuracy of credibility assessment. This research stresses the need for specialized fake
review detection frameworks tailored to sensitive fields like healthcare, where the cost of
misinformation is significantly higher than in other domains.

CHAPTER 3

METHODOLOGY

3.1 Existing system

The existing system employs Logistic Regression as the initial model for detecting
computer-generated (fake) reviews. Logistic Regression is a well-established and interpretable
algorithm, particularly effective for binary classification problems. In this project, the model uses
TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text data into
numerical features that capture the significance of words within each review. This representation
allows the model to distinguish between fake and genuine reviews based on term importance and
frequency patterns. Once the text data is vectorized, Logistic Regression learns to associate
specific word patterns and combinations with the likelihood of a review being computer-generated.
The sigmoid activation function is used to output probabilities between 0 and 1, enabling the
system to classify reviews as fake or original based on a defined threshold. Regularization
techniques are applied to prevent overfitting, ensuring that the model generalizes well to unseen
data. The simplicity and transparency of Logistic Regression also allow for interpretability. It
becomes easier to analyse which terms influence the classification most strongly, helping to
understand the language patterns typical in fake reviews. This phase of the system offers a solid
baseline and serves as the foundation for more complex models introduced in subsequent phases.
While effective and computationally efficient, Logistic Regression may struggle to capture deeper
contextual relationships or sequential dependencies between words, which can limit its
performance in detecting more sophisticated or subtly crafted fake reviews. This limitation
motivates the transition to more advanced models like LSTM in the next phase.
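The sigmoid mapping and threshold rule described above can be sketched as follows. This is a minimal illustration with a hypothetical raw score, not the project's trained model; in the actual system the score z is the learned weighted sum of a review's TF-IDF features.

```python
import numpy as np

def sigmoid(z):
    """Map a raw linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(score, threshold=0.5):
    """Apply the decision threshold to the predicted probability."""
    return "fake" if sigmoid(score) > threshold else "original"

print(sigmoid(0.0))    # 0.5, exactly on the decision boundary
print(classify(2.0))   # a strongly positive score -> "fake"
print(classify(-2.0))  # a strongly negative score -> "original"
```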

3.2 Architecture Overview:

Figure No 3.2.1 Architecture Overview

3.3 Data Collection and Preprocessing


For the effective detection of fake reviews, reliable data collection and thorough
preprocessing are critical steps. This project utilizes publicly available datasets, especially those
sourced from platforms like Kaggle, which provide labeled reviews marked as either genuine or
fake. These datasets offer a rich source of training and evaluation data, allowing the model to learn
and generalize patterns associated with authentic and deceptive reviews. Given the variety of
expressions and writing styles in review text, preprocessing is essential to clean and standardize
the data for model training.

The preprocessing phase involves several key steps. Initially, text cleaning is performed to remove
any irrelevant information, such as HTML tags and special characters, which could introduce noise
and reduce the model’s accuracy. Following this, normalization ensures that all text is converted
to lowercase, providing uniformity and preventing the model from treating the same word
differently based on case. Tokenization is then employed to break down the text into individual
tokens or words, facilitating further processing and feature extraction. For tokenization, NLTK, a

widely used Python toolkit installed through Anaconda, serves as the primary tool, ensuring
precise separation of words and phrases.

The next steps focus on refining the text data by removing common stop words, such as “the” or
“and,” which hold little meaning in identifying fake reviews. Using NLTK’s list of stop words,
this process helps reduce noise and focus the model on more informative words. Additionally,
lemmatization and stemming are applied to reduce words to their base forms, making the text more
uniform. For example, “running,” “runs,” and “ran” are all converted to their base form “run,”
aiding in normalization and helping the model better recognize patterns across variations of the
same word.
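The cleaning, normalization, tokenization, and stop-word steps above can be sketched in a few lines. To keep the example self-contained, a tiny inline stop-word list and a regex tokenizer stand in for NLTK's resources, which require separate downloads; in the project itself, NLTK's tokenizer, stop-word corpus, and lemmatizer replace these stand-ins and additionally reduce forms like "running" and "ran" to "run".

```python
import re

# Small stand-in for NLTK's English stop-word list.
STOP_WORDS = {"the", "and", "is", "a", "an", "of", "to", "it"}

def preprocess(review):
    """Clean, lowercase, tokenize, and filter a raw review string."""
    text = re.sub(r"<[^>]+>", " ", review)         # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # lowercase, drop special characters
    tokens = text.split()                          # naive whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<b>The</b> product is GREAT and works!"))
# ['product', 'great', 'works']
```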

3.4 Feature Extraction


Once the data is pre-processed, the next step involves extracting features that the model can use
for classification. In the case of deep learning models like LSTM (Long Short-Term Memory), the
feature extraction process differs significantly from traditional methods like TF-IDF. Instead of
manually computing term frequencies, LSTM relies on sequential learning and word embeddings
to extract contextual patterns from text.

The LSTM model does not require manual feature engineering. Instead, it uses a Tokenizer to
convert words into integers, followed by sequence padding to ensure uniform input length. These
steps convert raw text into a format suitable for input into the model. The numerical sequences are
then passed through an Embedding layer, which transforms each word index into a dense vector
that captures the word’s semantic meaning. These embeddings serve as the primary features for
the LSTM network.

In addition to word embeddings, sequence modelling allows the LSTM to extract temporal and
contextual features, such as the order in which words appear, recurring sentence structures, or
unusual word combinations. This enables the model to detect patterns often found in computer-
generated (fake) reviews, such as repetitive phrasing, unnatural transitions, or overuse of
sentiment-heavy words.

By representing reviews as sequences of dense vectors and learning features directly from the data,
LSTM provides a powerful mechanism for identifying subtle cues that distinguish genuine reviews
from deceptive ones.
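The integer-encoding and padding steps can be illustrated without Keras. The two helper functions below mirror what Keras's Tokenizer and pad_sequences do (index 0 reserved for padding, sequences left-padded to a fixed length); the resulting integer matrix is what the Embedding layer would then consume.

```python
def build_vocab(reviews):
    """Assign each distinct word an integer index, starting at 1."""
    vocab = {}
    for review in reviews:
        for word in review.lower().split():
            vocab.setdefault(word, len(vocab) + 1)  # 0 is reserved for padding
    return vocab

def texts_to_padded(reviews, vocab, maxlen):
    """Integer-encode reviews and left-pad with zeros to a uniform length."""
    padded = []
    for review in reviews:
        seq = [vocab[w] for w in review.lower().split() if w in vocab]
        seq = seq[-maxlen:]                              # truncate overly long reviews
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

reviews = ["great product", "terrible fake product"]
vocab = build_vocab(reviews)
print(texts_to_padded(reviews, vocab, maxlen=4))
# [[0, 0, 1, 2], [0, 3, 4, 2]]
```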

3.5 Word Embeddings in Text Classification

In deep learning-based text classification, word embeddings play a crucial role by transforming
words into dense, low-dimensional vectors that carry semantic meaning. Unlike traditional
representations such as one-hot encoding or TF-IDF, embeddings are capable of capturing the
contextual similarity between words. This makes them more effective for understanding the deeper
patterns in natural language. The process begins with tokenization, where each word in the input
text is converted into an integer index. These integer sequences are then padded to ensure that all
inputs have the same length, allowing for efficient batch processing during training. The padded
sequences are passed through an embedding layer. This layer is essentially a matrix of learnable
parameters where each row corresponds to a vector representation of a word. During training, the
model adjusts these vectors so that words used in similar contexts are mapped to vectors that are
close together in the embedding space.

To quantify the similarity between word vectors, cosine similarity is often used. It measures the
angle between two vectors and is given by the formula:

cosine_similarity(v1, v2) = (v1 · v2) / (||v1|| * ||v2||)

Where:

➢ v1 · v2 = Dot product of vectors.


➢ ||v1||, ||v2|| = Magnitudes (Euclidean norms) of the vectors.

A higher cosine similarity (closer to 1) indicates greater semantic similarity.
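The formula can be checked directly with NumPy. The three-dimensional vectors below are hypothetical embeddings chosen for illustration only; real embedding vectors are learned during training and typically have tens to hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Dot product of the vectors divided by the product of their magnitudes."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

excellent = np.array([0.9, 0.8, 0.1])   # hypothetical embedding
great     = np.array([0.8, 0.9, 0.2])   # points in a similar direction
broken    = np.array([-0.7, 0.1, 0.9])  # points elsewhere

print(cosine_similarity(excellent, great))   # close to 1: similar meaning
print(cosine_similarity(excellent, broken))  # negative: dissimilar
```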

After the embedding step, the resulting vectors are fed into an LSTM layer. The LSTM is designed
to capture the sequential nature of language, preserving both short and long-term dependencies
between words in a sentence. This is important because the meaning of a word often depends on
the words that come before or after it. By combining embeddings with LSTM, the model is able
to interpret context, identify unusual patterns, and detect nuances in language usage that may
indicate a review is computer-generated. This approach enables the model to go beyond surface-
level word matching and analyse the underlying structure and sentiment of the review, which is
essential for distinguishing between genuine and deceptive content.

3.6 Model Development
The development of the model begins with the implementation of two different classification
techniques: Logistic Regression and Long Short-Term Memory (LSTM) neural networks. Logistic
Regression was selected as the initial model due to its simplicity, interpretability, and strong
performance in binary classification tasks. It serves as an effective baseline, allowing insight into
how individual features such as the frequency of specific terms contribute to the classification of
reviews as either computer-generated (fake) or original (OR). Logistic Regression is particularly
useful for understanding the importance of specific keywords in the decision-making process.

Following the baseline, an LSTM model was introduced to leverage sequential dependencies
within the text data. LSTMs are especially effective for handling natural language because they
are capable of capturing both short and long-range patterns across sequences. This makes them
well suited for detecting subtle language cues or unnatural phrasing often found in fake reviews.

For both models, the dataset was split into training and testing subsets using an 80 to 20 ratio. This
ensures that the models are trained on the majority of the data while reserving a portion for
evaluating performance on unseen inputs. The Logistic Regression model uses TF-IDF vectorized
features which represent the importance of words across the dataset. In contrast, the LSTM model
uses word embeddings generated through a neural embedding layer, allowing the model to learn
semantic relationships between words.

To assess performance, standard evaluation metrics are applied including accuracy, precision,
recall, and F1 score. These metrics offer comprehensive insights into the strengths and weaknesses
of each model. Accuracy represents the overall proportion of correctly classified reviews while
precision highlights the ability of the model to correctly identify fake reviews among those it
predicts as fake. Recall focuses on the model’s capability to detect all true fake reviews and the F1
score balances both precision and recall to give a holistic measure of model effectiveness.

Through this two-model setup, the project benefits from both interpretable and deep learning
approaches. Logistic Regression offers speed and transparency while the LSTM model provides a
more nuanced understanding of sequential patterns in language. This combination ensures a
balanced and well-rounded detection system capable of identifying deceptive reviews with a
higher degree of reliability.
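The Logistic Regression half of this setup can be sketched end to end with scikit-learn. The six-sentence corpus below is a made-up stand-in for the real labeled dataset (1 = computer-generated, 0 = original); the 80/20 split, TF-IDF features, and the four evaluation metrics follow the description above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy corpus, repeated so the split has enough samples to work with.
texts = ["amazing amazing best best product ever buy now",
         "works fine for daily use, battery could be better",
         "best best best deal incredible amazing buy buy",
         "arrived on time, packaging slightly dented but item ok",
         "incredible amazing best product buy now now now",
         "decent quality for the price, returned one unit"] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)   # 80/20 split

vectorizer = TfidfVectorizer(max_features=3000)      # cap the vocabulary
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)

y_pred = model.predict(vectorizer.transform(X_test))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall   :", recall_score(y_test, y_pred, zero_division=0))
print("f1       :", f1_score(y_test, y_pred, zero_division=0))
```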

3.7 Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a type of recurrent neural network designed to
effectively capture dependencies in sequential data. It is especially suitable for text classification
tasks like detecting computer-generated reviews, where understanding the order and context of
words is crucial. Unlike traditional models that may ignore the sequence in which words appear,
LSTM can retain information across long distances in a text, allowing it to identify patterns that
span multiple words or sentences.

In this system, LSTM is employed to classify reviews by learning the underlying structure and
emotional tone of the text. The model begins with an embedding layer, which transforms each
word into a dense vector that captures its meaning in relation to other words. These vectors allow
the model to interpret words in context, rather than treating them as isolated terms. For instance,
words like “excellent” and “great” may be mapped to vectors that are close to each other in the
embedding space, helping the model understand that they convey similar sentiment.

To improve generalization and avoid overfitting, a SpatialDropout1D layer is used after the
embeddings. This randomly drops entire features across sequences during training, ensuring the
model doesn’t rely too heavily on specific words or positions. The processed vectors are then
passed to the LSTM layer, which has 50 memory units. This layer is responsible for capturing the
temporal dependencies between words, determining which information to retain and which to
discard as it processes each word in the sequence. Dropout and recurrent dropout values of 0.2 are
applied here to further enhance robustness.

The output is passed through a dense layer with a sigmoid activation function that produces a
probability score, representing how likely the review is computer-generated. If the score is above
0.5, the review is classified as fake; otherwise, it is marked as original. This probability-based
decision makes the classification smooth and flexible, especially useful in borderline cases where
reviews may show mixed characteristics.
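The layer stack described above (embedding, SpatialDropout1D, a 50-unit LSTM with dropout and recurrent dropout of 0.2, and a sigmoid dense output) can be sketched in Keras. The vocabulary size, sequence length, and embedding dimension are illustrative values not specified in the text.

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input, SpatialDropout1D
from tensorflow.keras.models import Sequential

VOCAB_SIZE = 5000  # assumed vocabulary cap
MAXLEN = 100       # assumed padded sequence length
EMBED_DIM = 100    # assumed embedding dimension

model = Sequential([
    Input(shape=(MAXLEN,), dtype="int32"),
    Embedding(VOCAB_SIZE, EMBED_DIM),               # word index -> dense vector
    SpatialDropout1D(0.2),                          # drop whole embedding channels
    LSTM(50, dropout=0.2, recurrent_dropout=0.2),   # 50 memory units
    Dense(1, activation="sigmoid"),                 # probability the review is fake
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# One forward pass on a dummy padded batch: one probability per review.
probs = model.predict(np.zeros((2, MAXLEN), dtype="int32"), verbose=0)
print(probs.shape)  # (2, 1)
```

A review whose probability exceeds 0.5 would be labeled fake, otherwise original.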

3.8 Data Flow diagram

Figure No 3.8.1 Dataflow Diagram

3.9 Integration and Deployment

To implement this system effectively, the integration and deployment phases are critical
for enabling user access and ensuring real-time functionality. The system includes a user-friendly
web interface, which can be developed using web frameworks such as Flask or Django. This
interface allows users to submit reviews for authenticity analysis in an intuitive manner. By
integrating the system within the Anaconda environment, developers can leverage a stable,
cohesive environment for managing dependencies and updates. This integration ensures that users
can seamlessly interact with the model without dealing with technical complexities. In addition,
deploying the model for real-time analysis enables the system to immediately evaluate and classify
incoming reviews as either genuine or fake. Real-time processing ensures that users receive instant
feedback on the authenticity of reviews, which is particularly valuable for high-traffic platforms
where immediate insights are necessary. To optimize response times, caching mechanisms are
implemented to store recent analysis results temporarily, improving the overall speed and
efficiency of the system’s response.
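A minimal version of such a web interface can be sketched with Flask. The /classify route and the keyword-based classify_review stub are hypothetical placeholders; the deployed system would load the trained Logistic Regression or LSTM model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def classify_review(text):
    """Stand-in for the trained model: a trivial keyword heuristic."""
    return "fake" if "buy now" in text.lower() else "original"

@app.route("/classify", methods=["POST"])
def classify():
    # Expect a JSON body such as {"review": "..."}.
    review = request.get_json(force=True).get("review", "")
    return jsonify({"review": review, "label": classify_review(review)})

# During development the app would be served with: app.run(debug=True)
# Here the endpoint is exercised without starting a server:
client = app.test_client()
print(client.post("/classify", json={"review": "Best product, buy now!"}).get_json())
```

The caching described above could be layered on by memoizing classify_review on the review text.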

3.10 Monitoring and Maintenance

For the system to remain effective over time, ongoing monitoring and maintenance are
essential. A feedback loop is incorporated, enabling the model to learn continuously from new data
and adapt to evolving patterns in review behavior. As fake review tactics change, the model can
incorporate these new patterns, maintaining high accuracy and relevance. Performance monitoring
is also a key component, with regular assessments conducted to detect any potential declines in
model performance. This proactive approach ensures that the system stays reliable and effective,
especially as the volume of data and user interactions grows. Additionally, the system benefits
from a user feedback loop, where input from both users and stakeholders is gathered to refine both
the model and the user interface. By listening to end-users, the system can be enhanced to meet
user needs and improve overall satisfaction, which contributes to the model’s longevity and
effectiveness in a real-world setting.

3.11 Requirement Specification

3.11.1 Hardware Requirements

Processor : Any Intel or AMD x86-64 processor
RAM : Minimum 8 GB and Recommended 16 GB
GPU : Minimum GTX 1650 and Recommended GTX 3750
Storage : An SSD of 256 GB is required
Input Device : Standard Keyboard, Mouse and Camera
Output Device : High-Resolution Monitor
3.11.2 Software Requirements

Operating System : Windows 11
Simulator Tool : Anaconda
Programming language : Python

3.11.3 Anaconda

Anaconda is an open-source distribution of Python and R specifically designed for data
science, scientific computing, and machine learning projects. It offers a comprehensive ecosystem
that simplifies package management and deployment, which is essential for a project involving
complex dependencies across different libraries. Anaconda includes a variety of pre-installed
libraries frequently used in data science, such as Scikit-learn, Pandas, NumPy, and Matplotlib,
allowing developers to focus on coding rather than package installations and compatibility issues.

In this project, Anaconda provides a unified environment where all dependencies are managed
efficiently, preventing version conflicts and simplifying the installation of machine learning
libraries. Additionally, Anaconda comes with Jupyter Notebook, an interactive tool ideal for
iterative development, visualization, and testing. This makes Anaconda not only a convenient but
also a powerful choice for creating a stable development environment for building and testing this
fake review detection system.

3.11.4 Python

Python is a high-level, versatile programming language renowned for its readability and
flexibility, making it a preferred choice in data science and machine learning. Python’s syntax
emphasizes code readability, allowing developers to write and understand code more efficiently.
It offers a broad range of libraries and frameworks that support data analysis, machine learning,
and NLP tasks, which are all crucial components of this project.

The large community of developers contributing to Python ensures continuous updates and
improvements in the language’s libraries, which is beneficial when working on projects like fake
review detection that require up-to-date tools and techniques. Additionally, Python’s compatibility
with other technologies and ability to handle large datasets make it highly suitable for building
scalable solutions like this. Python’s modularity also facilitates easy integration of the logistic
regression model, making the development process more efficient and accessible.

3.11.5 Scikit-learn

Scikit-learn is one of the most widely used libraries for machine learning in Python,
offering tools for classification, regression, clustering, and model evaluation. It provides an

extensive range of algorithms, including Logistic Regression, Support Vector Machines, and
Random Forests, along with utilities for data preprocessing and model evaluation. Scikit-learn’s
user-friendly API and clear documentation make it accessible even for those new to machine
learning.

Scikit-learn plays a critical role in implementing the logistic regression model, which is used to
classify product reviews as either fake or genuine. Logistic Regression is a particularly suitable
algorithm due to its simplicity, interpretability, and effectiveness in binary classification tasks.
Scikit-learn also provides essential tools like the TF-IDF vectorizer, which converts text data into
numerical features, allowing this model to process and learn from textual data. Furthermore,
Scikit-learn's performance metrics, such as accuracy, precision, recall, and F1 score, will be
invaluable for evaluating and improving the model’s ability to correctly identify fake reviews.

3.11.6 Natural Language Toolkit (NLTK)

The NLTK is a robust library for working with human language data (text). It includes a
collection of text processing libraries that support a range of NLP tasks, such as tokenization,
stemming, lemmatization, and part-of-speech tagging. These tools allow developers to break down
and analyze text, making it possible to extract meaningful features that a machine learning model
can use to make predictions.

In this fake review detection system, NLTK is essential for preprocessing the text data within
reviews. Before feeding the text into the model, reviews must be tokenized (split into words),
cleaned (removing punctuation and stop words), and transformed into a format that emphasizes the
significant words or phrases. For instance, words with excessive sentiment or repetition might
indicate fake content. By using NLTK, the system can efficiently preprocess the text, allowing the
model to capture nuances in the language that are indicative of deceptive reviews.

3.11.7 Pandas

Pandas is a highly efficient library for data manipulation and analysis in Python. It
provides data structures like DataFrames, which are ideal for storing, managing, and analyzing
structured data. Pandas is particularly useful for handling and cleaning datasets, which is crucial
when working with large volumes of review data in this project.

In this project, Pandas enables us to load, inspect, and clean the dataset of product reviews,
preparing it for analysis and model training. DataFrames allow us to easily handle missing values,
filter relevant information, and perform transformations needed for feature extraction. For
example, Pandas can be used to group and summarize reviews, spot outliers, and even analyze the
distribution of fake and genuine reviews in the dataset. Additionally, Pandas integrates seamlessly
with other libraries such as NLTK and Scikit-learn, creating an efficient workflow where data can
be processed, analyzed, and fed into the model with minimal friction.

3.11.8 TensorFlow

TensorFlow is an open-source library created by Google for deep learning and machine
learning applications. Renowned for its adaptability and scalability, it supports intricate
computations and provides tools for building and deploying deep learning models. Although
frequently associated with neural networks, TensorFlow also offers a range of tools for general
machine learning, making it a flexible addition to any data science project. In this fake review
detection project, TensorFlow is used to build and test models that go beyond standard machine
learning methods such as Logistic Regression, most notably the LSTM network.

CHAPTER 4

RESULTS AND DISCUSSION

4.1. Data Overview:

The dataset consists of product reviews labeled as either original (OR) or computer-generated
(fake). The labels were mapped to binary values: 0 for original reviews and 1 for fake ones. Each
record includes the user-generated review text, the corresponding product identifier, and a rating.

Figure No 4.1.1 Data overview

The dataset contains several columns:


• Category: Identifier for the reviewed product.
• Text: The review's textual content.
• Rating: The rating assigned to the product (ranging from 1 to 5).

The dataset contains a mix of original reviews and computer-generated (fake) reviews. By
analyzing the target column (label), the number of reviews in each category was counted. This step
is essential for understanding the distribution of review types and for preparing the dataset for
further analysis and model training.

Figure No 4.1.2 Number of Reviews

4.2 Text Preprocessing:

To prepare the text data for machine learning, TF-IDF was used to convert the textual data into
numerical features. TF-IDF assigns higher weights to words that are frequent in a document but
rare across the entire dataset. This transformation is particularly useful for identifying important
words in reviews that can distinguish between original and computer-generated content.

1. Max Features: This limited the number of features (terms) to 3000, ensuring that only the
most relevant words are considered for training the model.
2. TF-IDF Calculation: For example, a term like "genuine" might have a high TF-IDF score
in an original review, indicating its importance in distinguishing original reviews from
computer-generated ones. In contrast, common filler words like "the" or "is" would have
lower TF-IDF scores, as they appear in most reviews and are less informative.
3. Tokenization & Padding: Before feeding data into the LSTM model, each review was
tokenized into sequences of integers. To ensure uniform input lengths, sequences were
padded with zeros, allowing consistent matrix dimensions for batch processing.
4. Embedding Layer: A trainable embedding layer was added at the beginning of the LSTM
model. This layer maps each word index to a dense vector of fixed size, capturing semantic
relationships between words and enhancing the model's ability to understand contextual
meaning.
5. Dense Layer: Following the LSTM output, dense (fully connected) layers were applied
for further abstraction and classification. A sigmoid activation function was used in the
final dense layer to predict whether a review is genuine or computer-generated.
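Step 2 can be verified on a toy corpus: a word that appears in only one review receives a higher TF-IDF weight than a filler word that appears in every review. The three sentences below are illustrative, not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = ["the battery life is genuine value",
           "the screen is bright",
           "the delivery is fast"]

vectorizer = TfidfVectorizer(max_features=3000)  # same cap as in step 1
tfidf = vectorizer.fit_transform(reviews).toarray()

vocab = vectorizer.vocabulary_
first_row = tfidf[0]  # weights for the first review
print("tf-idf('genuine'):", first_row[vocab["genuine"]])  # rare term, higher weight
print("tf-idf('the')    :", first_row[vocab["the"]])      # filler term, lower weight
```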

4.3 Model Training and Evaluation:

The logistic regression model was trained on TF-IDF features to classify reviews as fake
or genuine based on word importance. While logistic regression efficiently handles linear
relationships, an LSTM model was also implemented to capture the sequential and contextual flow
of words in reviews. The LSTM uses an embedding layer for word vectorization, followed by
padded sequences to handle variable-length inputs. Dense layers are added after the LSTM to

perform final binary classification. Both models are evaluated using accuracy, precision, recall,
and F1-score to ensure reliable fake review detection.

Key Metrics:

• Precision: Indicates the proportion of positive identifications (fake reviews) that were
actually correct.
• Recall: Shows the percentage of real fake reviews that the model accurately identified.
• F1-Score: Balances recall and precision to provide a single performance score.

4.4 Model Accuracy Comparison


The system displays the accuracy of each model:

Figure No 4.3.1 Evaluation

This comparison highlights the performance advantage of LSTM, which better captures word
dependencies and context, especially in longer or more complex reviews.

Figure No 4.3.2 Classification Report

4.5 Data Distribution Visualization:

The distribution of original and computer-generated reviews was visualized to understand class balance.

Figure No 4.4.1 Visualization of review type

This bar chart illustrates the relative number of original vs. computer-generated reviews in the
dataset.

Figure No 4.4.2 Confusion matrix

The confusion matrix is structured with:
• Actual Values (True Labels) on the vertical axis (Actual: Original Review, fake).
• Predicted Values on the horizontal axis (Predicted: Original Review, fake).

4.6 User Interaction and Real-time Classification:

The model allows users to input a new review and classify it as either original or computer-
generated. This feature simulates a real-world application where incoming reviews could be
monitored.

4.6.1 Review Classification Output:

Figure No 4.5.3 Algorithm choosing

Figure No 4.5.4 User Input and Result(fake)

Figure No 4.5.5 User Input and Result (OR)

CHAPTER 5

CONCLUSION

The proposed system for detecting fake reviews using a hybrid approach of Logistic
Regression and LSTM represents a significant advancement in ensuring the authenticity of
customer feedback in online platforms. By integrating efficient data collection, advanced
preprocessing using TF-IDF, and robust feature extraction with both interpretable and deep
learning models, this methodology effectively addresses the challenges of identifying deceptive
reviews. The advantages of the system, including scalability, real-time analysis, and deep
contextual understanding via LSTM with embedding and padding layers, make it well-suited for
deployment across diverse review-driven environments. The inclusion of TF-IDF helps highlight
critical textual patterns, while the LSTM model captures sequential and linguistic nuances often
missed by traditional models. This not only enhances review credibility but also supports
consumers in making informed purchasing decisions. In conclusion, the proposed methodology
provides a strong foundation for fighting fake reviews, ensuring transparency, and reinforcing
consumer trust in digital ecosystems.

Future Enhancement

Future improvements for fake review detection systems can focus on enhancing accuracy,
adaptability, and transparency. One key advancement involves integrating deep learning
techniques such as LSTM and attention mechanisms to better capture complex linguistic patterns
and context, improving the system’s ability to distinguish between genuine and deceptive reviews.
Real-time monitoring capabilities can also be added, allowing platforms to detect and flag fake
reviews as they are posted, minimizing their impact on consumer perception. Another important
direction includes expanding the system’s reach through multi-language support and cross-
platform integration, ensuring global scalability and consistency in review analysis. To ensure
fairness and trust, implementing explainable AI techniques will allow users and platform
moderators to understand why certain reviews are flagged. By incorporating these enhancements,
the system can evolve into a more robust and intelligent solution, staying ahead of emerging fraud
tactics while promoting credibility and trust in online review ecosystems.
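As a minimal illustration of the explainability idea, a linear model such as the Logistic Regression used in this project already exposes a weight per TF-IDF term. The sketch below uses made-up example weights (not values from the trained model) to show how the terms most responsible for a "Fake" flag could be surfaced to a moderator.

```python
def top_terms(weights, k=3):
    """Return the k terms with the largest absolute weight,
    i.e. those that influence the decision most."""
    return sorted(weights, key=lambda t: abs(weights[t]), reverse=True)[:k]

# Hypothetical TF-IDF term weights from a linear classifier
# (positive pushes toward 'Fake', negative toward 'Original Review')
example = {'free': 1.8, 'amazing': 1.2, 'delivery': -0.4, 'quality': -0.9}
print(top_terms(example))  # ['free', 'amazing', 'quality']
```

With the real model, the same idea would pair `logistic_model.coef_` with `vectorizer.get_feature_names_out()`; deep models like the LSTM would need dedicated attribution techniques instead.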

REFERENCES

[1] Wang, J., Tang, S., & Zhang, H. Fake Review Detection Based on Multiple Feature Fusion and
Rolling Collaborative Training. IEEE Access, Vol. 15, Issue 8, pp. 2075–2089, 2020.

[2] Tang, S., Liu, Y., & Zhang, J. Fraud Detection in Online Product Review Systems via
Heterogeneous Graph Transformer. IEEE Access, Vol. 32, Issue 3, pp. 1234–1245, 2021.

[3] Mohawesh, R., Mulyana, T., & Abduallah, M. A Survey of Fake Review Detection Techniques
in E-Commerce. Journal of Electronic Commerce Research, Vol. 22, Issue 4, pp. 45–61, 2021.

[4] Tufail, H. Fake Reviews Detection in Online Shopping Platforms During the COVID-19
Era. Journal of Computer Science and Technology, Vol. 39, Issue 11, pp. 302–315, 2024.

[5] Liu, M., Zhang, L., & Wu, Q. Detecting Fake Reviews Using Multidimensional Representations
with Fine-Grained Aspects Plan. IEEE Transactions on Neural Networks and Learning Systems,
Vol. 33, Issue 9, pp. 2765–2779, 2021.

[6] Tufail, H. The Effect of Fake Reviews on e-Commerce During and After the COVID-19 Pandemic:
SKL-Based Fake Reviews Detection. IEEE Access, Vol. 10, pp. 25555–25564, 2022.

[7] Abulqader, M., Sadeghi, M., & Abdullah, S. Unified Fake Review Detection Using Deception
Theories. Springer Journal of Artificial Intelligence Research, Vol. 70, Issue 5, pp. 599–613, 2022.

[8] Ennaouri, M., Benazzouz, A., & Boudraa, R. A Comprehensive Review of Sentiment Analysis
Techniques in Fake Review Detection. Springer Journal of Computational Intelligence, Vol. 21,
Issue 4, pp. 1482–1495, 2023.

[9] Zhang, S., Wei, Z., & Li, K.-C. Building Fake Review Detection Model Based on Sentiment
Intensity and PU Learning. IEEE Transactions on Neural Networks and Learning Systems, Vol.
34, Issue 10, pp. 6926–6939, 2023.

[10] Abedin, E., & Boudraa, R. Understanding the Credibility of Online Drug Reviews. Springer
Journal of Health Informatics, Vol. 25, Issue 6, pp. 1324–1342, 2024.

APPENDIX

Program Code:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

import matplotlib.pyplot as plt

import seaborn as sns

df = pd.read_csv('fake reviews dataset.csv')

df['label'] = df['label'].map({'fake': 1, 'OR': 0}) # Mapping labels 'fake' to 1, 'OR' to 0

print(df.head())

X = df['text_'] # The review text

y = df['label'] # The labels (0: Original Review, 1: Fake)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=3000) # Limit the features to 3000 terms

X_train_tfidf = vectorizer.fit_transform(X_train)

X_test_tfidf = vectorizer.transform(X_test)

logistic_model = LogisticRegression(max_iter=500)

logistic_model.fit(X_train_tfidf, y_train)

logistic_pred = logistic_model.predict(X_test_tfidf)

logistic_accuracy = accuracy_score(y_test, logistic_pred)

print(f"Logistic Regression Model Accuracy: {logistic_accuracy * 100:.2f}%")

print("Classification Report:")

print(classification_report(y_test, logistic_pred))

conf_matrix = confusion_matrix(y_test, logistic_pred)

print("Confusion Matrix:")

print(conf_matrix)

plt.figure(figsize=(6, 4))

sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues",
            xticklabels=['Original Review', 'Fake'],
            yticklabels=['Original Review', 'Fake'])

plt.title('Confusion Matrix')

plt.xlabel('Predicted')

plt.ylabel('Actual')

plt.show()

original_reviews = df[df['label'] == 0].shape[0]

fake_reviews = df[df['label'] == 1].shape[0]

print(f"Number of Original Reviews: {original_reviews}")

print(f"Number of Fake Reviews: {fake_reviews}")

labels = ['Original Review', 'Fake']

counts = [original_reviews, fake_reviews]

plt.figure(figsize=(6, 4))

plt.bar(labels, counts, color=['green', 'red'])

plt.title('Distribution of Reviews')

plt.xlabel('Review Type')

plt.ylabel('Count')

plt.show()

def predict_with_logistic(review_text):
    review_tfidf = vectorizer.transform([review_text])
    prediction = logistic_model.predict(review_tfidf)
    label = 'Fake' if prediction[0] == 1 else 'Original Review'
    return label

user_review = input("Please enter a review to classify: ")

logistic_result = predict_with_logistic(user_review)

print(f"Logistic Regression Prediction: {logistic_result}")

import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

from tensorflow.keras.preprocessing.text import Tokenizer

from tensorflow.keras.preprocessing.sequence import pad_sequences

from tensorflow.keras.models import Sequential, load_model

from tensorflow.keras.layers import Embedding, LSTM, Dense, SpatialDropout1D

df = pd.read_csv('fake reviews dataset.csv')

df['label'] = df['label'].map({'fake': 1, 'OR': 0})

X = df['text_'] # The review text

y = df['label'] # The labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer(max_features=3000) # Reduced features

X_train_tfidf = vectorizer.fit_transform(X_train)

X_test_tfidf = vectorizer.transform(X_test)

logistic_model = LogisticRegression(max_iter=500) # Reduced iterations

logistic_model.fit(X_train_tfidf, y_train)

logistic_pred = logistic_model.predict(X_test_tfidf)

logistic_accuracy = accuracy_score(y_test, logistic_pred)

print(f"Logistic Regression Model Accuracy: {logistic_accuracy:.2f}")

max_words = 3000 # Reduced words

max_len = 100

tokenizer = Tokenizer(num_words=max_words)

tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)

X_test_seq = tokenizer.texts_to_sequences(X_test)

X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)

X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

try:
    # Reuse the trained model if it has already been saved
    lstm_model = load_model('lstm_model.h5')
    print("LSTM model loaded from disk.")
except (OSError, IOError):
    lstm_model = Sequential()
    lstm_model.add(Embedding(max_words, 100, input_length=max_len))
    lstm_model.add(SpatialDropout1D(0.2))
    lstm_model.add(LSTM(50, dropout=0.2, recurrent_dropout=0.2))  # Reduced LSTM units
    lstm_model.add(Dense(1, activation='sigmoid'))
    lstm_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    lstm_model.fit(X_train_pad, y_train, epochs=2, batch_size=32,
                   validation_data=(X_test_pad, y_test))
    lstm_model.save('lstm_model.h5')
    print("LSTM model trained and saved to disk.")

lstm_pred = lstm_model.predict(X_test_pad)

lstm_pred_binary = (lstm_pred > 0.5).astype(int) # Convert probabilities to binary output

lstm_accuracy = accuracy_score(y_test, lstm_pred_binary)

print(f"LSTM Model Accuracy: {lstm_accuracy:.2f}")

def predict_with_logistic(review_text):
    review_tfidf = vectorizer.transform([review_text])
    prediction = logistic_model.predict(review_tfidf)
    label = 'Fake' if prediction[0] == 1 else 'Original Review'
    return label

def predict_with_lstm(review_text):
    review_seq = tokenizer.texts_to_sequences([review_text])
    review_pad = pad_sequences(review_seq, maxlen=max_len)
    prediction = lstm_model.predict(review_pad)
    label = 'Fake' if prediction[0][0] > 0.5 else 'Original Review'
    return label

user_review = input("Please enter a review to classify: ")

logistic_result = predict_with_logistic(user_review)

lstm_result = predict_with_lstm(user_review)

print(f"Logistic Regression Prediction: {logistic_result}")

print(f"LSTM Prediction: {lstm_result}")
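The script above reports the two models' verdicts independently. One simple way to fuse them into a single hybrid decision is to average their fake-probabilities and apply one threshold. This is a hedged sketch, not part of the original program: it assumes each model can expose a probability (e.g. `predict_proba` for Logistic Regression and the sigmoid output of the LSTM), and the sample values below are invented.

```python
def combined_prediction(lr_prob, lstm_prob, threshold=0.5):
    """Average the fake-probabilities of the two models and
    apply a single decision threshold."""
    avg = (lr_prob + lstm_prob) / 2.0
    return 'Fake' if avg > threshold else 'Original Review'

print(combined_prediction(0.9, 0.7))  # Fake
print(combined_prediction(0.2, 0.4))  # Original Review
```

Weighted averaging (giving the stronger model a larger share) or requiring both models to agree are equally simple variants of the same idea.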

Accuracy:

Final Outcome:

