
Earlier known as

B. V. B. College of Engineering & Technology

School of Computer Science and Engineering

City & Cuisine-Based Restaurant Recommender Using Yelp Dataset

Team Id: 5A07

Data Mining and Analysis Course Project Report

Team Members:
Abhijeet Prakash 01FE16BCS003
Abhishek D Sawant 01FE16BCS004
Adarsh Raj 01FE16BCS009
Apeksha Ninnekar 01FE16BCS038

KLETECH/SoCSE(2018-19)/DMA/Course Project/5ADMACP10

1. Introduction
Today, customer reviews on social media have a deep impact on the chances of success of any
business. Restaurant customers look for a complete and satisfactory experience in terms of food
quality, service and ambience, and they often seek the opinion of other patrons when choosing a
place for their next meal. Yelp offers this information to its users. When users look for a place to
eat, they can ask the service for a list of nearby restaurants in a cuisine category. Users also get
the overall rating that other customers gave to the restaurant as well as some reviews of it.
Review content is very diverse. Reviews can talk about the food, the service or the ambience; they
can reflect a positive experience or a complaint about some specific aspect of a visit.
Reviews are therefore a wealth of information and are usually more informative than a numeric
rating. On the other hand, a service like Yelp receives thousands of reviews each day from every
corner of the world, and summarizing or extracting specific pieces of information from such a big
corpus is a challenging task.
Data mining, and more concretely text mining, techniques allow us to explore a massive corpus
like the one of Yelp reviews. We can obtain new insights about the text content that may be
helpful for customers, restaurant owners, government or even for Yelp.
In this project, we mine a corpus of Yelp restaurant reviews to explore the following questions:
What are the best restaurants in a city? How many different cuisines do restaurants serve? Can we
recommend dishes for a cuisine, and which restaurant is best for trying them? Recommendation
systems make this problem easier for users by using their personal preferences to suggest the
best restaurants for their preferred cuisine.

2. Problem Statement
The project searches for and recommends the best restaurants in a city for different kinds of
cuisine, based on the reviews given by customers.

3. Objectives
● To predict the rating of restaurants listed in the Yelp dataset based on the reviews given by
users. Classification techniques such as Support Vector Machines are used.
● To recommend restaurants to users using the predicted stars and sentiment polarity
values.
● To provide a graphical user interface that takes two inputs from the user, a city and a
cuisine, and lists the ten best restaurants available in that city for that cuisine.

4. Data Description
The data used in this project is part of the Yelp Dataset Challenge (Round 12). The dataset
consists of a set of JSON files that include business information, reviews, tips (shorter reviews),
user information and check-ins. Business objects list the name, location, opening hours, categories,
average star rating, number of reviews about the business and a series of attributes such as noise
level or reservations policy. Review objects list a star rating, the review text, the review date and
the number of votes that the review has received. In this project, we have focused on these two
types of objects.

The dataset is 6.84 GB in total and consists of six sub-datasets:

1. Business Dataset (139 MB)
2. Check-In Dataset (50.3 MB)
3. Photo Dataset (34.9 MB)
4. Review Dataset (4.39 GB)
5. Tips Dataset (203 MB)
6. Users Dataset (2.03 GB)
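
The JSON files of this dataset are commonly distributed with one JSON object per line. A minimal sketch of reading business records under that assumption follows. The helper `read_json_lines`, the file name `business.json` and the two sample records are illustrative assumptions and are not taken from the project notebooks.

```python
import io
import json

def read_json_lines(handle):
    """Yield one dict per non-empty line of a newline-delimited JSON stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)

# Hypothetical two-record sample standing in for the business file.
sample = io.StringIO(
    '{"business_id": "b1", "name": "Pho Place", "city": "Toronto", '
    '"state": "ON", "stars": 4.0, "review_count": 25, '
    '"categories": "Restaurants, Vietnamese"}\n'
    '{"business_id": "b2", "name": "Nail Bar", "city": "Las Vegas", '
    '"state": "NV", "stars": 3.5, "review_count": 8, '
    '"categories": "Beauty & Spas"}\n'
)
businesses = list(read_json_lines(sample))

# With the real data (the path is an assumption):
# with open("business.json", encoding="utf-8") as f:
#     businesses = list(read_json_lines(f))
```

Reading line by line keeps memory use low, which matters for the multi-gigabyte review and user files.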

5. Related Work:

Due to the rich information contained in the Yelp dataset, many past research papers and projects
have tried to use it to predict restaurant ratings and to evaluate future development. For example,
Kong, Nguyen and Xu classified restaurants based on cultural categories and analyzed the success
of international restaurants, mostly with Gaussian Discriminant Analysis (GDA). Several
other papers focused on sentiment analysis of text content from Yelp. Xu, Wu
and Wang combined customer reviews and ratings to conduct sentiment analysis,
while Gingerich and Bochkov mainly used matrix factorization to analyze text information and
predict Yelp ratings. Linshi worked on user-based text analysis for Yelp rating prediction and
showed how the Yelp user experience can be improved through rating prediction. Beyond Yelp
reviews, Tang, Qin, Liu and Yang introduced a neural network for movie review prediction. They
claimed that matrix-vector multiplication is more effective than vector concatenation
for text analysis. So far, most research has focused on text analysis of customer
reviews and has left out other features in the Yelp Dataset Challenge. In this project, we apply
non-text features to predict restaurant ratings and aim to work on a region-based analysis instead
of a user-based analysis in order to provide suggestions to Yelp restaurants.

6. Methodology:
We aim to build a recommendation system that makes restaurant recommendations for Yelp users.
We begin with a brief explanation of the dataset used to build the recommendation system,
followed by a relevant exploratory analysis of the data.

Fig (6.0.1): Methodology flow diagram

6.1 Exploratory Analysis:


The primary features of a business used in our data analysis are business category and
location (state and city). The preliminary exploratory analysis of the dataset studies the
distribution of reviews with respect to the category of the business and its location.
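
A minimal sketch of this kind of frequency count is given here, assuming each business record carries a `state` field and a comma-separated `categories` string. The sample records and the helper `split_categories` are hypothetical.

```python
from collections import Counter

# Hypothetical business records; the field names are assumed.
businesses = [
    {"state": "ON", "categories": "Restaurants, Thai"},
    {"state": "ON", "categories": "Food, Coffee & Tea"},
    {"state": "NV", "categories": "Restaurants, Italian"},
    {"state": "AZ", "categories": None},
]

def split_categories(raw):
    """Turn a comma-separated category string into a clean list."""
    if not raw:
        return []
    return [c.strip() for c in raw.split(",") if c.strip()]

# Number of businesses per state.
state_counts = Counter(b["state"] for b in businesses)

# Number of businesses per category (a business may have several).
category_counts = Counter(
    cat for b in businesses for cat in split_categories(b["categories"])
)

print(state_counts.most_common())
print(category_counts.most_common(3))
```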

Fig (6.1.1): Frequency distribution of State v/s Number of food businesses

Fig (6.1.2): Frequency distribution of categories v/s count

6.2 Dataset Reduction:


After the exploratory analysis, we reduced the dataset to businesses in the province of Ontario
that belong to food-related categories.

We selected instances with:

● Business category as ‘Restaurants’, ‘Food’, ‘Japanese’, ‘Chinese’, ‘Thai’, ‘Italian’, ‘Indian’.

● State as ‘Ontario’ (ON)
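
The two selection criteria above can be sketched as a simple filter. The field names `state` and `categories`, the helper `is_selected` and the sample records are assumptions made for illustration.

```python
FOOD_CATEGORIES = {
    "Restaurants", "Food", "Japanese", "Chinese", "Thai", "Italian", "Indian",
}

def is_selected(business, state="ON", wanted=FOOD_CATEGORIES):
    """True if the business is in the given state and has a wanted category."""
    raw = business.get("categories") or ""
    cats = {c.strip() for c in raw.split(",")}
    return business.get("state") == state and bool(cats & wanted)

# Hypothetical records used only to exercise the filter.
businesses = [
    {"business_id": "b1", "state": "ON", "categories": "Restaurants, Thai"},
    {"business_id": "b2", "state": "ON", "categories": "Beauty & Spas"},
    {"business_id": "b3", "state": "NV", "categories": "Restaurants, Italian"},
    {"business_id": "b4", "state": "ON", "categories": None},
    {"business_id": "b5", "state": "ON", "categories": "Food, Bakeries"},
]

reduced = [b for b in businesses if is_selected(b)]
print([b["business_id"] for b in reduced])
```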

6.3 Predictive Tasks:


There are two major tasks in our project:

● Predicting rating from the review text alone.

● Recommending restaurants based on predicted stars and sentiment polarity.

6.4 Predicting Rating from the Review Text:
To predict the rating from the review text, we implemented the following model:
● Linear Support Vector Machine Classifier

6.4.1 Linear Support Vector Machine Classifier:


A Support Vector Machine (SVM) is primarily a classification method that constructs hyperplanes
in a multidimensional space to separate cases with different class labels. SVMs are effective in
high-dimensional spaces. Because the decision function uses only a subset of the training points
(called support vectors), the method is also memory efficient. Different kernel
functions can be specified for the decision function. In this project, we use the open-source
Python library scikit-learn to implement the classifier.

To build a Linear SVM classifier using the review text, we carried out the following
preprocessing steps:

● Removed punctuation

● Removed stop words
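
A minimal sketch of these two preprocessing steps using only the standard library is shown here. The small `STOP_WORDS` set is a hypothetical stand-in for the NLTK stop word list used in the project, and `preprocess` is an illustrative helper name.

```python
import string

# Tiny illustrative stop word set (an assumption, not the NLTK list).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "to", "of", "but"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation and drop stop words."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    return " ".join(w for w in no_punct.split() if w not in STOP_WORDS)

print(preprocess("The food was great, but the service is not great!"))
# prints: food great service not great
```

The negation "not" is deliberately kept out of this small stop word set so that phrases such as "not great" survive preprocessing. Whether the original notebook treated negations the same way is not stated.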

The classifier needs a feature vector in order to perform the classification task. We
used TF-IDF features to convert the review text into vector format, so each review is
represented as a set of coordinates in a high-dimensional space. During training, the SVM
tries to find hyperplanes that separate the training examples. When we feed it the test data, it
uses the boundaries learned during training to predict the rating of each test review.
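
To make the vector representation concrete, the following sketch computes textbook TF-IDF weights, tf × log(N / df), by hand for three toy reviews. The function `tfidf_vectors` and the toy documents are assumptions for illustration. scikit-learn's TfidfVectorizer applies a smoothed IDF and L2 normalization by default, so its numbers differ from this simplified form.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) using tf * log(N / df) weights."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors

docs = ["food great", "food bad", "service great great"]
vocab, vectors = tfidf_vectors(docs)

print(vocab)
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Each review becomes one row with one coordinate per vocabulary word. Words that occur in fewer reviews receive a larger weight.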

7. Discussion on Predicting Ratings:


Evaluation Metrics: We use precision and recall as the evaluation metrics to measure our rating
prediction performance. SVM performs better than Naïve Bayes, as a Naïve Bayes
classifier simply assumes that the value of a particular feature is unrelated to the presence or
absence of any other feature, given the class variable. SVM, on the other hand, is primarily a
classifier that performs classification tasks by constructing hyperplanes in a
multidimensional space that separate cases of different class labels. TF-IDF with bigrams
performs better, which is intuitively aligned with the observation that phrases like
‘not great’ and ‘not bad’ must be taken into account to understand the sentiment of a review.
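
The report does not state how precision and recall were averaged over the five star classes. One common choice weights each class by its support. Under that averaging, recall is mathematically equal to accuracy, which would be consistent with the identical recall and accuracy values in the results table. The sketch below assumes this weighted averaging; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def weighted_precision_recall(y_true, y_pred):
    """Support-weighted average of per-class precision and recall."""
    support = Counter(y_true)       # true examples per class
    predicted = Counter(y_pred)     # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = sum(
        support[c] / n * (correct[c] / predicted[c] if predicted[c] else 0.0)
        for c in support
    )
    recall = sum(support[c] / n * (correct[c] / support[c]) for c in support)
    return precision, recall

# Toy star ratings, for illustration only.
y_true = [5, 5, 4, 3, 1, 1]
y_pred = [5, 4, 4, 3, 1, 2]

precision, recall = weighted_precision_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(precision, recall, accuracy)
```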

8. Code:
Our code is divided into three parts:

1) Exploratory analysis of datasets

2) Predicting ratings from review text and calculating sentiment polarity

3) Recommendation of restaurants.

ANALYSIS OF DATASETS → PREDICTING RATINGS → RECOMMENDATION

8.1 Linear Support Vector Machine Classifier Python Notebook

● Pre-processed the review text by removing stop words using NLTK and removing
punctuation.

● Converted the review text into vector format with the TF-IDF approach, using the
TfidfVectorizer in sklearn.

● Split the dataset into train and test sets (80:20) using the train-test split of sklearn.

● Built a linear SVM model and fitted it to the training set.

● Evaluated the model for 5 classes (1, 2, 3, 4 and 5-star ratings).
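
The steps above can be sketched with scikit-learn as follows. This is a minimal illustration on hypothetical toy reviews, not the project notebook: the texts are assumed to be already preprocessed, `ngram_range=(1, 2)` is assumed to correspond to the "Bigram + TF-IDF" feature, and weighted averaging of precision and recall is likewise an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, already preprocessed reviews with their star labels.
texts = [
    "terrible food rude staff", "awful service never again",
    "bad food slow service", "not great overpriced bland",
    "okay food average service", "average place nothing special",
    "good food friendly staff", "tasty dishes nice ambience",
    "excellent food amazing service", "best restaurant absolutely loved",
]
stars = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

# 80:20 train-test split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, stars, test_size=0.2, random_state=0
)

# TF-IDF over unigrams and bigrams, followed by a linear SVM.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="weighted", zero_division=0))
print("recall   :", recall_score(y_test, y_pred, average="weighted", zero_division=0))
```

With only ten toy reviews the scores are not meaningful; the sketch only shows how the pieces fit together.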

8.2 Restaurant Recommender Python Notebook


● Calculated the sentiment polarity for each business.

● Kept the rows with a stars value greater than 3.5 and a sentiment polarity value
greater than 0.

● Obtained the top 10 restaurants with the highest sentiment polarity.
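
The filtering and ranking steps above can be sketched as follows, assuming each business record already carries a precomputed `polarity` score together with `city`, `categories` and `stars` fields. How the polarity is computed is not specified here, and the function `recommend` and the sample records are hypothetical.

```python
def recommend(businesses, city, cuisine, top_n=10, min_stars=3.5):
    """Return up to top_n matching restaurants, highest polarity first."""
    candidates = [
        b for b in businesses
        if b["city"] == city
        and cuisine in b["categories"]
        and b["stars"] > min_stars
        and b["polarity"] > 0
    ]
    return sorted(candidates, key=lambda b: b["polarity"], reverse=True)[:top_n]

# Hypothetical records with precomputed sentiment polarity.
businesses = [
    {"name": "A", "city": "Toronto", "categories": ["Restaurants", "Thai"], "stars": 4.5, "polarity": 0.30},
    {"name": "B", "city": "Toronto", "categories": ["Restaurants", "Thai"], "stars": 4.0, "polarity": 0.45},
    {"name": "C", "city": "Toronto", "categories": ["Restaurants", "Thai"], "stars": 3.5, "polarity": 0.60},
    {"name": "D", "city": "Toronto", "categories": ["Restaurants", "Indian"], "stars": 4.5, "polarity": 0.50},
    {"name": "E", "city": "Markham", "categories": ["Restaurants", "Thai"], "stars": 5.0, "polarity": 0.70},
    {"name": "F", "city": "Toronto", "categories": ["Restaurants", "Thai"], "stars": 4.0, "polarity": -0.10},
]

top = recommend(businesses, city="Toronto", cuisine="Thai")
print([b["name"] for b in top])
# prints: ['B', 'A']
```

The two arguments `city` and `cuisine` correspond to the two inputs taken by the graphical user interface.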

8.3 Result:
Model        Feature           Precision        Recall           Accuracy         Number of Classes
Linear SVM   Bigram + TF-IDF   0.590484199818   0.596285137787   0.596285137787   5
