International Journal of Engineering Research & Technology (IJERT)

ISSN: 2278-0181
Vol. 3 Issue 4, April - 2014

Building Personalised Recommendation System With Big Data and Hadoop MapReduce
S. Vinodhini¹, B. Govindarajalu², V. Rajalakshmi³
¹Post Graduate Student, Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Chennai, India
²Professor and Head, Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Chennai, India
³Assistant Professor, Department of Computer Science and Engineering, Sri Venkateswara College of Engineering, Chennai, India

Abstract - Recommender systems are found in many e-commerce applications today. They usually provide the user with a list of recommendations that the user might prefer, or supply predictions of how much the user might prefer each item. Two common approaches for providing recommendations are collaborative filtering and content-based filtering. By combining these two approaches, hybrid recommendation systems can be developed that consider both the ratings of the user and the item's features to recommend items to the user. The features of a limited amount of data can be analyzed with existing data analysis tools, but for an e-book dataset whose size is in terabytes, a big data analysis tool such as Hadoop is used. Hadoop is a software framework for distributed processing of large data sets. Hadoop uses the MapReduce paradigm to perform distributed processing over clusters of computers, reducing the time involved in analyzing the item's features (keywords of a book). The proposed system is reliable and fault tolerant when compared to existing recommendation systems, as it collects ratings from the user to predict the user's interest and analyses the item to find its features. The system is also adaptive, as it updates the rating list frequently and finds the updated interest of the user. Experimental results show that the proposed system is more accurate than existing recommender systems.

Keywords: Recommendation System, Hadoop, Big Data, MapReduce, Keywords and stop words.

I. INTRODUCTION

Big data analysis is one of the upcoming disciplines in data mining. It deals with large unstructured data that is very difficult to store and retrieve in an efficient manner. Big data does not refer only to exabytes or petabytes of data. When the amount of data that needs to be processed is greater than the capacity of the system, it is referred to as big data. The three perspectives of big data are volume, velocity and variety [1]. Volume refers to the amount of data that is being processed. It has moved to zettabytes and petabytes as of 2014 and is expected to increase in future. Velocity refers to the speed at which the data can be processed with minimal error rate. Variety refers to all types of data, from unstructured raw data to semi-structured and structured data, which can be easily analyzed and used for decision making and predictive analysis.

Fig. 1. Three Characteristics of Big Data

This exponential growth in data has led to many vital challenges in business. Existing tools have become inadequate to process such large sets of data. In order to overcome this, Google introduced a programming model called MapReduce [2]. This system was considered a great evolution in the field of data mining. Soon after, a tool called Hadoop was introduced. Hadoop is a tool used for analyzing large sets of data using distributed clusters. This tool can also be used for parallel programming. There are many big data analysis tools, but the key properties that make Hadoop distinct from the others are:

Accessible - Hadoop can run on large, distributed clusters of nodes or on cloud computing services such as Amazon's Elastic Compute Cloud (EC2).

Robust - Hadoop is architected with the capacity to withstand or tolerate hardware malfunctions such as shut
down or data loss. It can gracefully handle most such failures with the help of data replication and the secondary NameNode.

Scalable - Hadoop can be scaled by adding more nodes once the multi-node cluster has been set up.

Simple - Users can easily write parallel code with the help of Hadoop.

Fig. 2. Multinode Cluster

MapReduce is a programming model in which large sets of data are distributed among the nodes of a cluster and processed in parallel. There are two types of node: master and slave. The master node allocates tasks to the slaves, and each slave node carries out the job assigned to it. The master node then collects the results. The model has two main steps: 1) Map, in which the job is distributed among the slaves and each slave processes its portion of the input, and 2) Reduce, in which the intermediate results are collected and combined.

Recommender systems have become popular over the last decade. As the number of products has grown, the need for recommender systems has also increased. A recommender system tries to predict the interest of a user and recommend products that match that interest as accurately as possible. E-commerce businesses also profit from the increase in sales that occurs when the user is presented with more items that are likely to match his or her interest. There are two common approaches to building a recommendation system. One is collaborative filtering, which builds a model from a user's past behavior as well as similar decisions made by other users to predict items that the user may have an interest in. The other is content-based filtering, where the characteristics of an item are analyzed to recommend additional items to the user.

The following sections are arranged as follows. Section 2 covers the works related to the proposed system. Section 3 presents the design of the system along with a modular description of the proposed system. Section 4 describes the implementation setup and the results obtained for the proposed system, along with the performance evaluation. Section 5 gives a brief description of the proposed system and the future extensions that can be made.

II. LITERATURE SURVEY

Existing recommendation systems recommend books to the user based on the book name and the ratings given by that user to the book, or based on the number of views for that book. Fuzhi Zhang et al. (2010) proposed a two-stage algorithm that uses the location of the users to predict their interest. The K-means algorithm is used to cluster the users based on the profile collected during user sign-up. But predicting the concept of a book only from the book name reduces the accuracy of the system. V. Mohanraj et al. (2012) use the concept of ontology to predict the interest of the user. The system was self-adaptive and predicted the future browsing pattern of the user. Ozgur Cakir et al. (2012) developed a recommendation system using association rules. The Apriori algorithm is used to generate the rules for recommendation. The basket ratio, which is the ratio between the number of items viewed and the number of items added to the shopping cart, is increased by this method.

Boban Vesin et al. (2012) developed a recommendation system termed PROTUS (PRogramming TUtoring System) that recommended courses to students. Courses are usually recommended to students based on their age and domain of study, but this system uses semantic web technology concepts. Navigation patterns are obtained from the past history of the student, and future recommendations are made from those patterns. Konstantin Shvachko et al. (2010) made a study of the Hadoop Distributed File System. The study stated that by distributing the storage and computation across the machines of a cluster, the computational time for analyzing big data can be reduced when compared to single-node processing.

Emmanouil Vozalis et al. made an analysis of the types of recommendation algorithms that are in existence. Item-based recommendation is a method in which two users who have rated an item are separated and the similarity index is computed between them. When the similarity index is greater than the threshold, similar items are recommended to them. A model which uses a collaborative filtering algorithm for supervised learning was developed. This model classifies even new, unseen items. According to this model, there are only two classes, C1: like and C2: dislike. Content-Boosted Collaborative Filtering utilizes content-based filtering to fill in the missing ratings from the initial user-item matrix. It then employs classic collaborative filtering techniques to reach a final prediction.

Cai-Nicolas Ziegler et al. (2005) proposed a recommendation system that considers a concept called topic diversification. According to this concept, the list of
top n recommendations will be balanced, as the user's extended interest will also be taken into account. Thus the user will not be bored by similar kinds of recommendations being made often. The concepts of user-based collaborative filtering and item-based collaborative filtering are combined to make the recommendations.

Brian McFee et al. (2012) developed a recommendation system for music by learning content similarity. It used a content-based similarity method initially and then imposed a collaborative similarity method on the results. It avoided the cold start problem and the overhead of the query-to-answer technique.

III. SYSTEM DESIGN

The idea of this system is to develop a recommendation engine that can recommend books to users with increased accuracy by analyzing the interest of the user and the features of the books. A hybrid recommender system is developed that gets its input from the user in the form of ratings. This ratings list and the profile of the user are the key inputs used to predict the interest of the user. The data set considered is a large set of books, which constitutes big data. In order to analyze the features of such a large book set, we use a tool named Hadoop. MapReduce programs have been written to find the features. Preprocessing tasks are also performed in order to eliminate the stop words and to generate the keywords for each book. The overall architecture of the developed system is given below.

Initially the data set, which consists of e-books, is collected from the website www.bookza.org and then preprocessed. Preprocessing involves steps such as converting the PDF format of the books to text, removing stop words, generating the word count and finally extracting keywords from the word count file. These keywords are collected for each book and used while recommending books to the users.

An application is created to do all this preprocessing work. This application is created with Java and MapReduce. The recommendation system that is developed recommends books to the user. The user must create an account in the system. During the creation of the account, a set of 10 books is presented and the user is asked to rate them. The ratings given initially are analyzed to provide further recommendations to the user. These recommendations are provided the next time the user logs in with the password.

Fig. 3. Architecture diagram for proposed system

It can be divided into 4 modules.

A) Dataset Collection

Big data, i.e. a large set of books distributed among nearly 20 domains, is collected. These books are collected from the website www.bookza.com. The domains covered by the data set are

TABLE 1. DOMAINS OF THE DATASET

DOMAINS OF THE DATASET (EBOOKS)
DATABASE MANAGEMENT SYSTEM          COMPETITIVE EXAM
DATA STRUCTURES                     WIRELESS SENSOR NETWORKS
HORROR                              IMAGE PROCESSING
CRYPTOGRAPHY AND NETWORK SECURITY   SOFTWARE ENGINEERING
COMICS                              DATA MINING
CHEMISTRY                           FANTASY
FICTION                             WEB TECHNOLOGY
SYSTEM SOFTWARE                     COOKING
COMPUTER ARCHITECTURE               OPERATING SYSTEM


B) Preprocessing by Stop Words Removal

The initial input is a set of books in the form of PDF files. These PDF files must be converted into text files because the MapReduce programs used here take plain text as input. For a single book, any PDF-to-text converter tool could be used. Since the input is a large set of books, a program that converts the PDF files to text in less time was written. The pseudocode of that program is given below.

The text file obtained from the above process is then used to remove the stop words present in it. The final objective is to generate keywords from the book, and the presence of irrelevant words would degrade them. Thus the stop words are removed from the text file. The pseudocode for removing stop words is given below.

C) Multi-node Cluster Setup for Hadoop

In order to run the MapReduce program in parallel on more than 2 machines, we set up a Hadoop cluster with 5 nodes. This can be done by setting up Hadoop in Ubuntu with a dedicated hduser account for Hadoop. But the better option was to go with HortonWorks Sandbox. HortonWorks is considered better because of its easy installation on Windows and because it is a complete package of all the prerequisites that need to be installed before the installation of Hadoop. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. In order to run Hadoop in the HDP (HortonWorks Data Platform) environment, some supporting tools such as PuTTY and WinSCP are needed.

1) Putty

PuTTY is an SSH client application used here for communication between the master and the slaves. The master node provides the input data and instructs the slaves to perform a task.

2) WinSCP

WinSCP is used for secure file transfer between the master and the slaves. In order to authenticate a slave that connects to the master, the SSH (Secure SHell) protocol is needed. This protocol ensures secure login and logout between the master and the slaves.

The pseudocode that is written to generate the word count is given below.

D) Keywords Generation

The word count of the preprocessed book is stored in a text file. This text file is used to extract the keywords for that book. To do this, a threshold is applied to the value in each <key,value> pair, and the keys whose values are greater than the threshold are selected as keywords. The pseudocode is as follows.
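As a rough illustration of the steps in sections B and D, the following Python sketch chains stop word removal, a map/reduce style word count, and threshold-based keyword selection. It is not the authors' Java/MapReduce code, and it runs in a single process rather than on a Hadoop cluster. The stop word list, the function names and the sample sentences are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical stop word list; the paper does not give the list it used.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def map_words(line):
    """Map step: emit a (word, 1) pair for every non-stop word in a line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        if word not in STOP_WORDS:
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    counts = defaultdict(int)
    for word, value in pairs:
        counts[word] += value
    return dict(counts)


def word_count(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_words(line))
    return reduce_counts(pairs)


def select_keywords(counts, threshold):
    """Keep the keys whose count is strictly greater than the threshold."""
    return sorted(word for word, count in counts.items() if count > threshold)


if __name__ == "__main__":
    # Made-up input standing in for the text extracted from a book.
    text = [
        "The government of the people and the freedom of the citizens",
        "Freedom is the aim of the government",
    ]
    counts = word_count(text)
    print(counts)
    print(select_keywords(counts, threshold=1))
```

On a real cluster the map and reduce functions would run on different nodes and the framework would group the emitted pairs by key between the two steps; here a dictionary plays that role.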


E) Building Recommendation System

A recommendation engine is created with a GUI so that the user can interact with the system easily. The user can log in and log out of the system, rate books, and view and download books from the system. The recommendation system is created with two types of privileges: 1) admin 2) user.

The recommendation system that was developed has a special feature called Region Aggregation (RA). The user is asked to enter details about the country, state and city.

Users are clustered using the K-means clustering algorithm. The profile of the users is considered to form the clusters. For example:

TABLE 2. TABLE FOR K-MEANS CLUSTERING OF USERS

IV. IMPLEMENTATION RESULTS

This section explains the implementation of the system. The implementation is done with tools such as Hadoop, HortonWorks Sandbox, PuTTY, WinSCP and VirtualBox, and programming is done in Java and MapReduce. Here a single book is taken as input and the respective results for each module are shown. Initially a book in PDF format is taken as input. This input file is converted into text with the help of the program for which the pseudocode is given above. Fig. 4 shows the Java application that was developed to convert a PDF file to a text file, to remove stop words and to extract keywords from the book with the help of the Hadoop MapReduce program. The path is specified and linked in the program between the various tasks.

Fig. 4. The Java application developed to generate keywords for a book

From the text file obtained, the word count is generated using the Hadoop MapReduce program. The output of the program is in the format of <key,value> pairs. A sample of the word count generated from a book on politics is given below

<community 146>
<citizens 74>
<divided 50>
<freedom 98>
<government 157>

The keywords are extracted from the word count file by setting a threshold and are entered in the keyword field of the recommendation system while uploading a book. Thus the keywords of the book are

Keywords: Community-citizens-freedom-government
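The sample above can be checked against the threshold rule with a few lines of Python. This is only a sketch: the parsing pattern and function names are illustrative, and the threshold of 50 is an assumed value, since the paper does not state the one it used. Any threshold from 50 up to (but not including) 74 gives the same four keywords.

```python
import re

# Sample word count output quoted in the text above.
SAMPLE = """\
<community 146>
<citizens 74>
<divided 50>
<freedom 98>
<government 157>
"""


def parse_counts(text):
    """Parse lines of the form '<word count>' into (word, count) tuples."""
    return [(w, int(n)) for w, n in re.findall(r"<(\w+)\s+(\d+)>", text)]


def keyword_line(pairs, threshold):
    """Join the words whose count exceeds the threshold, in file order."""
    return "-".join(word for word, count in pairs if count > threshold)


# threshold=50 is an assumption; "divided" (50) is not strictly greater.
print(keyword_line(parse_counts(SAMPLE), threshold=50))
# prints: community-citizens-freedom-government
```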

The admin is responsible for uploading new books and deleting outdated books from the database. Books can be uploaded via the following tab of the GUI.


Fig. 5. Upload books

The recommendations to the user will be made in the following format

Fig. 6. View and download the recommended books

Region aggregation is implemented here: a comic book that has rights to be distributed in India and the book that is most read in Chennai are given as recommendations. Ratings are given out of 10. If the previous rating was 8 and the new rating by a new user is 4, the rating of the book changes to 6, because the average of the previous rating and the new rating is taken.

Fig. 7. Region aggregation and search by keyword

Performance Evaluation

The performance of a recommender system can be measured using accuracy. In this work, the performance of the proposed system is evaluated by calculating accuracy and precision. These values can be calculated easily by forming a confusion matrix, which is also known as a contingency table. The confusion matrix contains the True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN) counts. Precision is the positive predictive value. Accuracy and precision can be calculated with the following formulas.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{Precision} = \frac{TP}{TP + FP}$$

The following table describes the confusion matrix that is formed when a set of 100 books is considered and offline evaluations are made.

TABLE 3. CONFUSION MATRIX OF PROPOSED SYSTEM

CONFUSION MATRIX    Preferred   Non Preferred
Recommended         12          3
Not recommended     5           80
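As a worked check, the two formulas can be applied to the counts in Table 3. The sketch below assumes the usual reading of the table: recommended and preferred books are true positives (12), recommended but non-preferred books are false positives (3), preferred books that were not recommended are false negatives (5), and the rest are true negatives (80). The function names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all books that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of recommended books that the user actually preferred."""
    return tp / (tp + fp)


# Counts read from Table 3 (mapping to TP/FP/FN/TN is an assumption).
tp, fp, fn, tn = 12, 3, 5, 80

print(f"accuracy  = {accuracy(tp, tn, fp, fn):.2f}")  # (12 + 80) / 100
print(f"precision = {precision(tp, fp):.2f}")         # 12 / 15
```

Under this reading the accuracy is 92/100 = 0.92 and the precision is 12/15 = 0.80.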


Fig. 8. Graph plotted to depict the accuracy variations in percentage

V. CONCLUSION

Over more than two decades of research and commercial development, recommender systems have proved to be a successful technology for overcoming the information overload that burdens users in modern online media. According to a survey, 62% of the customers who notice the recommendations purchase the recommended products. The key driver of this success is providing more relevant recommendations by incorporating customer interest. These recommendations can be provided more accurately by analyzing the features of the product to be recommended and matching them with the interest of the user. This recommendation system is built for recommending books to users according to their interest. This work can be extended to movie recommendation, music recommendation, website recommendation, etc. When dealing with website recommendation, however, the total number of views for the website should also be considered as a metric for providing accurate recommendations.

REFERENCES

[1] Asela Gunawardana and Guy Shani, "A Survey of Accuracy Evaluation Metrics of Recommendation Tasks", Journal of Machine Learning Research, Vol. 10, pp. 2935-2962, 2009.
[2] Boban Vesin, Mirjana Ivanovic, Aleksandra Klasnja-Milic and Zoran Budimac (2012), "Ontology-based semantic recommendation in programming tutoring system", Journal on expert systems with applications, Vol. 39, pp 1229-12246.
[3] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan and Georg Lausen (2005), "Improving Recommendation Lists Through Topic Diversification", International World Wide Web Conference Committee (IW3C2), ACM, pp. 5959-30-469.
[4] Feng Xie, Zhen Chen, Hongfeng Xu, Xiwei Feng and Qi Hou (2013), "TST: Threshold Based Similarity Transitivity Method in Collaborative Filtering with Cloud Computing", IEEE Transactions on Tsinghua Science and Technology, Vol. 18, No. 3, pp 318-327.
[5] V. Mohanraj, M. Chandrasekaran, J. Senthilkumar, S. Arumugam and Y. Suresh (2012), "Ontology driven bee's foraging approach based self adaptive online recommendation system", The journal of systems and software, Vol. 85, pp. 2439-2450.
[6] Ozgur Cakir and Murat Efe Aras (2013), "Recommendation engine by using association rules", Journal of Social and Behavioral Sciences, Vol. 62, pp. 452-456.
[7] "Hadoop", http://hadoop.apache.org/core/docs/current/mapred_tutorial.html.
[8] "Google dataset for book", http://books.google.com/ngrams/graph?content=Albert+Einstein%2CSherlock+Holmes%2CFrankenstein&year_start=1800&year_end=2000&corpus=15&smoothing.
[9] Fuzhi Zhang, Huilin Liu, Jinbo Chao, "A Two-stage Recommendation Algorithm Based on K-means Clustering In Mobile E-commerce", Journal of Computational Information Systems, Vol. 6, Issue 10, pp. 3327-3334, 2010.
[10] Taek-Hun Kim, Young-Suk Ryu, Seok-In Park, and Sung-Bong Yang, "An Improved Recommendation Algorithm in Collaborative Filtering", Department of Computer Science, Yonsei University.
[11] Konstantin Shvachko, Hairong Kuang, Sanjay Radia and Robert Chansler, "The Hadoop Distributed File System", IEEE, pp. 978-1-4244-7153-9/10, 2010.
[12] Emmanouil Vozalis, Konstantinos G. Margaritis, "Analysis of Recommender Systems' Algorithms", conference proceeding of IEEE.
[13] Brian McFee, Luke Barrington and Gert Lanckriet, "Learning Content Similarity for Music Recommendation", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 20, No. 8, 2012.
[14] Paul C. Zikopolus and Chris Eaton, "Understanding Big Data Analytics for Enterprise Class Hadoop and Streaming Data", thesis, 2013.
[15] Chuck Lam, "Hadoop in Action", thesis, 2013.

IJERTV3IS042291 www.ijert.org 2316
