Project Report
Team Members:
Abhijeet Prakash 01FE16BCS003
KLETECH/SoCSE(2018-19)/DMA/Course Project/5ADMACP10
Page 1 of 16
Earlier known as
B. V. B. College of Engineering & Technology
1. Introduction
Today, customer reviews on social media have a deep impact on the chances of success of any
business. Restaurant customers look for a complete and satisfactory experience in terms of food
quality, service and ambience, and they often seek the opinion of other patrons when choosing a
place for their next meal. Yelp offers this information to its users. When users look for a place to
eat, they can ask the service for a list of nearby restaurants in a cuisine category. Users also get
the overall rating that other customers gave to the restaurant, as well as some reviews of the
restaurant.
Review content is very diverse. Reviews can talk about the food, the service or the ambience; they
can reflect a positive experience or a complaint about some specific aspect of a visit.
Reviews are therefore a wealth of information and are usually more informative than a numeric
rating. On the other hand, a service like Yelp receives thousands of reviews each day from every
corner of the world, and summarizing or extracting specific pieces of information from such a big
corpus is a challenging task.
Data mining techniques, and text mining techniques in particular, allow us to explore a massive
corpus like the Yelp reviews. We can obtain new insights about the text content that may be
helpful for customers, restaurant owners, government or even Yelp itself.
In this project, we mine a corpus of Yelp restaurant reviews to explore the following questions:
What are the best restaurants in a city? How many different cuisines do restaurants serve? Can we
recommend dishes for a cuisine, and which restaurant is best for trying them? Recommendation
systems make this problem easier for users by using their personal preferences to suggest the
best restaurants for their preferred cuisine.
2. Problem Statement
The project searches for and recommends the best restaurants in a city for different kinds of
cuisine, based on the reviews given by customers.
3. Objectives
● To predict the rating of restaurants listed in the Yelp dataset based on the reviews given by
the users. Classification techniques such as Support Vector Machines are used.
● To recommend restaurants to users using the predicted stars and sentiment polarity
values.
● To provide a graphical user interface that takes two inputs from the user, a city and a
cuisine, and lists the ten best restaurants available in that city for that cuisine.
4. Data Description
The data used in this project is part of the Yelp Dataset Challenge (Round 12). The dataset
consists of a set of JSON files that include business information, reviews, tips (shorter reviews),
user information and check-ins. Business objects list name, location, opening hours, category,
average star rating, the number of reviews about the business and a series of attributes like noise
level or reservations policy. Review objects list a star rating, the review text, the review date, and
the number of votes that the review has received. In this project, we have focused on these two
types of objects. In total, the data consists of six sub-datasets.
5. Related Work:
Because the Yelp dataset contains such rich information, many past research efforts and projects
have tried to use it to predict restaurant ratings and to evaluate future development. For example,
Kong, Nguyen and Xu classified restaurants by cultural category and analyzed the success of
international restaurants, mostly with Gaussian Discriminant Analysis (GDA). Several
other papers focused on sentiment analysis of text content from Yelp. Xu, Wu
and Wang combined customer reviews and ratings to conduct sentiment analysis,
while Gingerich and Bochkov mainly used matrix factorization to analyze text information and
predict Yelp ratings. Linshi worked on user-based text analysis for Yelp rating prediction and
showed how the Yelp user experience can be improved through rating prediction. Beyond Yelp
reviews, Tang, Qin, Liu and Yang introduced a neural network to predict ratings of movie reviews.
They claimed that matrix-vector multiplication is more effective than vector concatenation
for text analysis. So far, most research has focused on text analysis of customer
reviews and has left out the other features in the Yelp Dataset Challenge. In this project, we apply
non-text features to predict restaurant ratings and aim to work on a region-based analysis instead
of a user-based analysis, in order to provide suggestions to Yelp restaurants.
6. Methodology:
We aim to build a recommendation system that will enable us to make sophisticated restaurant
recommendations for Yelp users. We begin by providing a brief explanation of the dataset we
used while creating our recommendation system. We follow this with a relevant exploratory
analysis of data.
6.4 Predicting ratings from the review text
We implemented the following model:
● Linear Support Vector Machine Classifier
To build a Linear SVM classifier from the review text, we carried out the following
preprocessing steps:
The classifier needs a feature vector for each review in order to perform the classification task.
We used TF-IDF features to convert the review text into vector format, so each review is
represented as a point in a high-dimensional space. During training, the SVM tries to find
hyperplanes that separate the training examples of different ratings. When we feed it the test
data, it uses the boundaries it learned during training to predict the rating of each test review.
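The pipeline described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the toy reviews, the two-class star labels and the variable names are invented for the example, and it assumes scikit-learn is installed. It uses the TfidfVectorizer and the 80:20 split that the report mentions, with scikit-learn's LinearSVC as the linear SVM.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Toy reviews and star labels, invented for illustration only.
reviews = [
    "great food and friendly service",
    "delicious pasta, great ambience",
    "amazing tacos and great staff",
    "friendly staff and delicious dessert",
    "great pizza, will come again",
    "terrible service and cold food",
    "awful pasta, rude staff",
    "cold pizza and terrible ambience",
    "rude waiter, awful experience",
    "terrible tacos, never again",
]
stars = [5, 5, 5, 5, 5, 1, 1, 1, 1, 1]

# 80:20 train/test split; stratify keeps both classes in each part.
train_text, test_text, train_y, test_y = train_test_split(
    reviews, stars, test_size=0.2, random_state=0, stratify=stars
)

# Learn the vocabulary and IDF weights on the training text only,
# then reuse them to vectorize the test text.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
X_test = vectorizer.transform(test_text)

# Fit a linear SVM on the TF-IDF vectors and predict the held-out reviews.
clf = LinearSVC()
clf.fit(X_train, train_y)
predicted = clf.predict(X_test)
```

Fitting the vectorizer on the training split only keeps vocabulary and IDF statistics from the test reviews out of the model, which is the usual way to get an honest test score.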
8. Code:
Our code is divided into three parts:
3) Recommendation of restaurants.
● Pre-processed the review text by removing stop words using NLTK and removing
punctuation.
● Converted the review text into vector format with the TF-IDF approach, using the
TfidfVectorizer in sklearn.
● Split the dataset into train and test sets (80:20) using the train-test split of sklearn.
● Considered only the rows with a stars value greater than 3.5 and a sentiment polarity value
greater than 0.
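As a rough sketch of the text cleaning and the final filtering step listed above, the example below uses plain Python. Several things here are assumptions made for illustration: the small STOP_WORDS set stands in for NLTK's English stop word list, the row fields (name, city, cuisine, stars, polarity) and the sample rows are hypothetical, and ranking by stars and then polarity is one plausible ordering that the report does not specify.

```python
import string

# Tiny stand-in stop word list (assumption); the report uses NLTK's list.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "of", "to", "in"}


def preprocess(text):
    """Lowercase the text, strip punctuation and drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def recommend(rows, city, cuisine, top_n=10):
    """Return up to top_n restaurant names for a city and cuisine.

    Keeps only rows with stars > 3.5 and polarity > 0, as in the report.
    Sorting by (stars, polarity) is an assumed ranking rule.
    """
    matches = [
        r for r in rows
        if r["city"] == city and r["cuisine"] == cuisine
        and r["stars"] > 3.5 and r["polarity"] > 0
    ]
    matches.sort(key=lambda r: (r["stars"], r["polarity"]), reverse=True)
    return [r["name"] for r in matches[:top_n]]


# Hypothetical rows: predicted stars plus a sentiment polarity per restaurant.
rows = [
    {"name": "A", "city": "Las Vegas", "cuisine": "Italian", "stars": 4.5, "polarity": 0.6},
    {"name": "B", "city": "Las Vegas", "cuisine": "Italian", "stars": 3.0, "polarity": 0.4},
    {"name": "C", "city": "Las Vegas", "cuisine": "Italian", "stars": 4.0, "polarity": -0.2},
    {"name": "D", "city": "Las Vegas", "cuisine": "Italian", "stars": 5.0, "polarity": 0.3},
    {"name": "E", "city": "Phoenix", "cuisine": "Italian", "stars": 5.0, "polarity": 0.9},
]

cleaned = preprocess("The food was great, and the service is fast!")
top = recommend(rows, "Las Vegas", "Italian")
```

With these sample rows, B is dropped for its low star value, C for its negative polarity and E for being in another city, so only D and A remain. The two arguments to recommend correspond to the city and cuisine inputs that the graphical user interface collects.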
8.3 Result:
Model | Feature | Precision | Recall | Accuracy | Number of Classes
9. References:
[1] Yelp Challenge Presentation: http://www.ics.uci.edu/~vpsaini
[2] http://www.ics.uci.edu/~vpsaini/files/technical_report.pdf
[3] Scaria, Aju Thalappillil, Rose Marie Philip, and Sagar V. Mehta. “Predicting Star Ratings of
Movie Review Comments.”
[4] https://cseweb.ucsd.edu/~jmcauley/cse255/reports/fa15/017.pdf
[5] Chada, Rakesh, and Chetan Naik. “Data Mining Yelp Data Predicting rating stars from
review text.”
[6] Li, Chen, and Jin Zhang. “Prediction of Yelp Review Star Rating using Sentiment Analysis.”
[7] https://nycdatascience.com/blog/student-works/yelp-recommender-part-1/
[8] https://cambridgespark.com/content/tutorials/implementing-your-own-recommender-systemsin-Python/index.html
[9] Babu, Arun, Rahool Arun Paliwal, and Syamsankar Kottukkal. “Content-Aware Collaborative
Filtering for Yelp Restaurant Recommendation.”