
International Journal of Scientific Research and Engineering Development - Volume 3, Issue 2, Mar-Apr 2020
Available at www.ijsred.com
RESEARCH ARTICLE    OPEN ACCESS

Predicting Fake Online Reviews Using Machine Learning

Sabira Karim*, Dr. Kiruthiga G**


*Department of Computer Science and Engineering, I.E.S College of Engineering, Thrissur, India
[email protected]
**Department of Computer Science and Engineering, I.E.S College of Engineering, Thrissur, India
[email protected]

Abstract: Online reviews play an important role in a customer's decision on whether to purchase a product or service. They are the main source of information about past customers' experience with the features of the service one is about to purchase. This paper applies machine learning techniques such as Naïve Bayes, Support Vector Machine and Decision Tree to the sentiment classification of reviews and to the detection of fake online reviews, using a dataset of hotel reviews. Sentiment analysis has become one of the most interesting areas of text analysis; using it, we can also separate negative and positive reviews.

Index Terms – Spam reviews, machine learning, Naïve-Bayes, Support Vector Machine, Decision
Tree algorithm.

I. INTRODUCTION

A fake review is a misuse of the user review system by fake identities; fake reviews are also generated by bots. Fake reviews mislead customers into deciding on the wrong product, and the customer then spends money on that product. The reviews can be either positive or negative, written to boost promotion and sales or to bring down a competing company's products. Many people look at online reviews before deciding whether or not to purchase, and many companies depend on applications that detect fake reviews using machine learning.

In this paper we use sentiment analysis to formulate the data. The sentiment is usually formulated as a two-class classification problem, positive and negative, and the basis of sentiment analysis is detecting the polarity of a given text or document. In this project we set the polarity as negative or positive.

Recent developments in fields like Natural Language Processing (NLP) have paved the way for accurately understanding people's sentiments, emotions, and behavioral patterns. Emotions such as joy, anger, surprise and disgust can be extracted from reviews. For example, if we want to book a hotel, we check the reviews on that hotel's website to learn about past customers' experiences. Online reviews have a great impact on customers, and this application can detect potential fake reviews in order to reduce the misguidance that follows from them.

In machine learning based techniques, many algorithms can be applied for classification and prediction. Here we use a Naïve Bayes classifier, Support Vector Machine (SVM), Random Forest classifier and Decision Tree to predict the reviews. We detect fake positive, fake negative, true positive and true negative reviews, and finally we compare the accuracy of each algorithm. The main objective of this paper is to classify the dataset of reviews into true and fake ones using machine learning techniques.
II. RELATED WORKS

A. Detecting fake reviews through sentiment analysis

A number of studies have been conducted and experiments run on several sample datasets. Many product reviews are scraped from product webpages and studied. In our work we have decided to use the Deceptive Opinion Spam dataset.

We extracted the data from the dataset and stored it in a list, then created the data frame with the corresponding labels. Using sentiment analysis, all the reviews are analyzed: the polarity is determined as positive or negative, and the spamity is classified as true or deceptive. Later, the polarity class and spamity class are converted into 0s and 1s, and then the algorithms can be applied. Mainly we used Naïve Bayes classification, Support Vector Machine and Decision Tree.

Naïve Bayes is a classification algorithm for binary (two-class) and multi-class classification problems.

III. PROPOSED WORKS

We are using the deceptive dataset. The Deceptive Opinion Spam dataset is a corpus consisting of truthful and deceptive hotel reviews of 20 Chicago hotels. The corpus contains 400 truthful positive reviews from TripAdvisor, 400 deceptive positive reviews from Mechanical Turk, 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor, and Yelp, and 400 deceptive negative reviews from Mechanical Turk. In total we have 1600 reviews. Our task is to classify the truthful and deceptive hotel reviews using a machine learning algorithm.

Fig. 1: Structure of the reviews set on which we are going to work

A. Data preprocessing

We have created the data frame with three columns named review, polarity class and spamity class. The column named review gives the text of the review posted by the customer, the spamity class shows whether the review is deceptive or true, and the polarity class shows whether the polarity is positive or negative.

The extracted reviews look like the table below:

Table 1

     polarity_class   review            spamity_class
0    negative         "My $200..."      t
1    negative         "This was a ..."  t
2    negative         "The hotel..."    t
3    negative         "Going to..."     t
...  ...               ...              ...

[1600 rows x 3 columns]
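The paper does not list its loading code, so the following is only a minimal sketch of how such a three-column data frame could be built with pandas and how the label columns could be converted to 0s and 1s; the sample texts, labels and column names used here are illustrative placeholders, not the actual corpus files.

    import pandas as pd

    # Illustrative sketch only: placeholder reviews and labels, not the real corpus.
    data = {
        "review":         ["My $200 ...", "This was a ...", "The hotel ...", "Going to ..."],
        "polarity_class": ["negative", "negative", "negative", "negative"],
        "spamity_class":  ["t", "t", "t", "t"],
    }
    df = pd.DataFrame(data)

    # Convert the two label columns from strings to 0/1 so the classifiers can use them.
    df["polarity_label"] = (df["polarity_class"] == "positive").astype(int)
    df["spamity_label"]  = (df["spamity_class"] == "t").astype(int)
    print(df.head())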



First of all, we need to remove the stopwords from the reviews; for this we used the stop word list from the nltk package. Text mining techniques are then applied and the strings are converted into numbers. Parts of speech extracted from the reviews are fed as feature input to the model, and the reviews are stored in array format. The spamity class and polarity class of the data frame are converted into 0s and 1s instead of True or False so that the model can be created. Since the dataset contains only 1600 rows, we split the data in an 80:20 ratio for training and testing, stored as arrays.
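The paper does not name the exact vectorizer, so the sketch below assumes a simple bag-of-words CountVectorizer with sklearn's built-in English stop word list; it reuses the df data frame from the earlier sketch and performs the 80:20 split described above.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split

    # Bag-of-words features with English stop words removed (vectorizer choice is an assumption).
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(df["review"])   # strings converted into numbers
    y = df["spamity_label"]                      # 1 = truthful, 0 = deceptive

    # 80:20 split into training and test sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)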
B. Model selection and Prediction

To fit the models we used sklearn; the Python programming language provides the needed libraries for the classifiers. We have different classification techniques in machine learning, such as Naïve Bayes, Support Vector Machine, Decision Tree and Random Forest classifiers, and we applied these different prediction methods to reach the most accurate model.

The Random Forest algorithm is an ensemble model which creates decision trees on data samples, gets a prediction from each tree, and finally selects the best solution by means of voting. This produces the highest accuracy here. Random Forest can be used for classification as well as regression analysis.

Naïve Bayes is popularly used for text categorization, predicting the class of a text with word frequencies as the features. Naïve Bayes typically uses bag-of-words features from NLP to identify fakes in text categorization.

SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. For the SVM classifier we keep the gamma parameter constant for a well-fitted model.
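As a sketch of this model selection step, the snippet below fits the four classifiers discussed above on the training split and compares their accuracy on the held-out test split. All hyperparameters are sklearn defaults and are assumptions; the paper only states that the SVM gamma parameter was kept constant.

    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    # Fit each classifier on the 80% training split and report accuracy on the 20% test split.
    models = {
        "NB":  MultinomialNB(),
        "SVM": SVC(kernel="rbf", gamma="scale"),
        "DT":  DecisionTreeClassifier(),
        "RFC": RandomForestClassifier(n_estimators=100),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(name, accuracy_score(y_test, model.predict(X_test)))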
IV. RESULTS AND PERFORMANCE ANALYSIS

A. Experimental Environment

We ran our experiments on a machine with an Intel Core i3-2330M processor at 2 GHz, 4 GB RAM and a 64-bit Windows operating system. We used Python as the programming language, with the sklearn, numpy and pandas packages, and Spyder 4.1.1 as the IDE.

B. Results

We used the Naïve Bayes, Support Vector Machine (SVM), Decision Tree and Random Forest classifiers to classify the reviews dataset. For each classification process we used the dataset of 1600 rows with the three columns named review, polarity and spamity, and split the data into training and test sets in the ratio 80:20.

For Naïve Bayes classification we applied multinomial NB; after fitting the model, NB gave an accuracy of 90.31% on the predicted data. SVM gave an accuracy of 83.75%, and using the Decision Tree algorithm we got an accuracy of 66.56%. Comparing the algorithms, we found that Random Forest gives the highest accuracy, Naïve Bayes comes second, and the Decision Tree classifier gives the lowest accuracy.

TABLE 2: Accuracy Comparison (%)

NB       SVM      DT       RFC
90.31    84.166   66.56    92.7279


C. Performance Analysis

We can choose the Random Forest classifier as well as Naïve Bayes as our model, since both give the highest accuracy. By choosing the Random Forest classifier we could improve the performance accuracy to 92.7 percent, the highest accuracy compared to the other techniques. By importing metrics from the sklearn package we can obtain the confusion matrix for the same predictions; the confusion matrix confirms the accuracy reported for each algorithm.
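A minimal sketch of obtaining the confusion matrix with sklearn.metrics is shown below; the models dictionary is the one fitted in the earlier sketch and is an assumption, not the paper's actual code.

    from sklearn.metrics import confusion_matrix

    # Confusion matrix for the Random Forest predictions on the test set
    # (rows = actual class, columns = predicted class).
    y_pred = models["RFC"].predict(X_test)
    print(confusion_matrix(y_test, y_pred))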

i. The accuracy of each algorithm is represented graphically in Fig. 2.

Fig. 2: Accuracy of each algorithm

ii. ROC curve

The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive. The true positive rate is also referred to as sensitivity.

The ROC curve is a graph whose x-axis shows 1 - specificity (the false positive fraction, FP/(FP+TN)) and whose y-axis shows sensitivity (the true positive fraction, TP/(TP+FN)). In a Receiver Operating Characteristic (ROC) curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (100 - specificity) for different cut-off points.

Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the curve (AUC) is a summary measure of the accuracy of a quantitative diagnostic test; here it is 89.01.

Fig. 3: ROC curve
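The ROC curve and AUC described above can be computed with sklearn.metrics. The sketch below assumes the fitted Random Forest from the earlier sketch and uses matplotlib for plotting, which is not among the packages the paper lists.

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_curve, auc

    # Score each test review with the probability of the positive class, then plot the ROC curve.
    scores = models["RFC"].predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, scores)
    print("AUC:", auc(fpr, tpr))

    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], linestyle="--")        # chance line
    plt.xlabel("False positive rate (1 - specificity)")
    plt.ylabel("True positive rate (sensitivity)")
    plt.show()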

V. CONCLUSIONS AND FUTURE WORK

In this paper we proposed several methods to analyze a dataset of hotel reviews. We also presented sentiment classification algorithms to apply supervised learning to the hotel reviews dataset.



For future work, we would like to extend this study to other datasets, such as an Amazon or eBay dataset, and to use different feature selection methods. Furthermore, we may apply sentiment classification algorithms to detect fake reviews using various tools such as Python, R or RStudio, Statistical Analysis System (SAS), and Stata, and then evaluate the performance of our work with some of these tools.

ACKNOWLEDGEMENT

This research was supported by Technical University of Kerala. We are thankful to our colleagues who provided expertise that greatly assisted the research, although they may not agree with all of the interpretations provided in this paper.

