Tourist Data Analysis
Project report submitted in partial fulfillment of the requirement for the degree of Bachelor of
Technology in
By
to
I hereby declare that the work presented in this report entitled “Tourist Data Analysis” in partial
fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Computer Science and Engineering/Information Technology submitted in the department of
Computer Science & Engineering and Information Technology, Jaypee University of Information
Technology, Waknaghat is an authentic record of my own work carried out over a period from
January 2020 to May 2020 under the supervision of Dr. Suman Saha (Assistant Professor Sr.
Grade, Dept. of CSE).
The matter embodied in the report has not been submitted for the award of any other degree or
diploma.
This is to certify that the above statement made by the candidate is true to the best of my
knowledge.
Dr. Hemraj Saini
(Associate Professor)
Dr. Suman Saha
(Assistant Professor Sr. Grade)
Dept. of CSE
Dated: 28/05/2020
Acknowledgement
It is our privilege to express our sincerest regards to our project supervisor Dr. Suman Saha
(Assistant Professor Sr. Grade), for their valuable inputs, able guidance, encouragement,
wholehearted cooperation and direction throughout the duration of our project.
We express our sincere thanks to our Head of Department for encouraging and allowing us
to present the project on the topic “Tourist Data Analysis” at our department premises for the
partial fulfillment of the requirements leading to the award of the B.Tech degree.
Finally, we would like to express our sincere thanks to all our friends and others who helped us
directly or indirectly during this project work.
Manika Kansal (161318)
TABLE OF CONTENTS
CONTENT
DECLARATION
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF SNIPPETS
LIST OF TABLES
ABSTRACT
CHAPTER 1: INTRODUCTION
1.1 Introduction
1.1.1 Java
1.1.2 MySQL
1.1.3 NetBeans
1.1.4 Swing
1.1.5 JFrame Class
1.1.6 Big Data
1.1.7 Use of Big Data Analytics
2.7 Study based on MySQL Storage Engine
2.8 Analysis of the perception of accommodation consumers on the use of Online Travel Agencies
2.9 Scalability study of Hadoop MapReduce and Hive in Big Data Analytics
2.10 An Overview of the MapReduce/HBase/Hadoop framework and its current application in Biofield
2.11 Storage and processing speed for knowledge from enhanced Cloud framework
CHAPTER 4: RESULT AND PERFORMANCE ANALYSIS
4.1 Using SQL and Java Applets
4.2 Using Hive (HiveQL)
4.3 Using Pig
4.4 Comparison between existing and adopted solution
CHAPTER 5: CONCLUSION
5.1 Conclusion
5.2 Future work
References
List of Figures
8 Data Modeling for handling structured, unstructured and semi-structured data
11 Mapper Stage
12 Reducer Stage
List of Snippets
1 Code Snippet #1
2 Code Snippet #2
3 Code Snippet #3
4 Code Snippet #4
5 Code Snippet #5
6 Code Snippet #6
7 Code Snippet #7
9 Snippet #9 for Flow Chart
List of Tables
Abstract
Big data refers to the large volume of data that is generated every second. Data at this scale is very
difficult to process with manual checking and older methods, but a number of tools and
techniques are now available to handle it. With customer use of online platforms continually
increasing, it is necessary to know customers' likes and interests so that their preferences can be
well understood. Big Data Analytics is a well-suited way to handle this problem.
Tourism is considered one of the most popular pastimes. People travel with friends to have a good
time or for business purposes, usually for a limited period of time. Tourism is usually associated
with domestic or international travel. Several travel organizations are available on the
internet. Depending on their personal interests, tourists choose a cab service with a
favourable package. Travel companies rely on the enthusiasm associated with tourism to
increase their market value and offer large package deals. Given the large
amount of information available on travel packages, it is important to meet a tourist's personal
needs and preferences in order to offer more attractive packages. To make travel packages
more effective, recommender systems are becoming very popular; they attract people because they
allow them to choose the best package and cab service in a short time.
Customers differ in their opinions and in how they choose a package because of
differences in gender, profession, hobbies and knowledge. Customers have many interests and tastes,
so they can choose any package based on their interests. Package descriptions are provided to the user
on selecting the location, and the packages are displayed based on travel location and duration.
This is a project report on “Tourist Data Analysis”. During the exploration of this project, we
developed new ideas and functionalities while using Big Data technologies. The project
recommends the best cab service for a given location to a new customer based on the ratings of
previous customers.
The project report covers the implementation of the project in Java and in Big Data technologies, and
the reason for the shift from Java/SQL to Big Data using Hive and Pig over Hadoop, with a
concluding result at the end.
CHAPTER 1
INTRODUCTION
1.1 Introduction
1.1.1 JAVA:
Java is an object-oriented programming language that was developed at Sun Microsystems by James
Gosling. Sun Microsystems was later acquired by Oracle. Java was designed to address some of the
problems encountered in C++, such as portability and multithreading. At the time of writing, the latest
release is Java 14, released in March 2020.
1.1.2 MySQL:
MySQL is one of the most popular open-source relational database management systems (RDBMS). It was
initially released on 23 May 1995. It was originally owned by the Swedish company MySQL AB, which
was later acquired by Sun Microsystems; in 2010 it passed to Oracle.
MySQL is used by many popular sites such as Facebook, Twitter and YouTube.
1.1.3 NetBeans:
NetBeans is an integrated development environment (IDE) for Java. It provides all the
modules needed to build a Java application integrated in one place. Its features include the
NetBeans Visual Library, user interface management and integrated
development tools.
1.1.4 Swing:
Java Swing is built on top of AWT (Abstract Window Toolkit).
Swing components are lightweight compared to AWT components. Swing is written entirely
in Java, and its components are platform independent. Swing follows the MVC (Model View Controller) pattern.
Java Swing applications can run on any system that supports Java. They are
lightweight, meaning they do not take up much space or use many system
resources. JFrame is a Java class with its own methods and constructors. Methods are functions that
affect the JFrame, such as setting its size or visibility. Constructors run when the instance is
created: one constructor can create a blank JFrame, while another can create it with a title.
Code Snippet #1
Creating a JFrame
When you create a new JFrame, you actually create an instance of the JFrame class. You can create one
with a title or without one. If you pass a string to the constructor, a frame with that title is created:
Code Snippet #2
But if you run this code, you won't see an application window! The reason is that the frame has
no size or position yet and is not visible by default. You have to give it a size and make it visible to the user.
Code Snippet #3
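The steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name TouristFrameSketch, the 400 x 300 size and the helper centeredOrigin are assumptions made for the example. The frame is created with a title, given a size and a position, and only then made visible.

```java
import java.awt.Dimension;
import java.awt.GraphicsEnvironment;
import java.awt.Point;
import java.awt.Toolkit;
import javax.swing.JFrame;
import javax.swing.SwingUtilities;

// Hypothetical class name, used only for this illustration.
class TouristFrameSketch {

    // Top-left corner that centres a window of the given size on a screen.
    static Point centeredOrigin(Dimension screen, Dimension window) {
        int x = (screen.width - window.width) / 2;
        int y = (screen.height - window.height) / 2;
        return new Point(x, y);
    }

    // Creates a titled frame, sizes and places it, then makes it visible.
    static JFrame showFrame(String title, int width, int height) {
        JFrame frame = new JFrame(title);                 // title passed to the constructor
        frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
        frame.setSize(width, height);                     // a new frame has no useful size
        Dimension screen = Toolkit.getDefaultToolkit().getScreenSize();
        frame.setLocation(centeredOrigin(screen, frame.getSize()));
        frame.setVisible(true);                           // frames are hidden until this call
        return frame;
    }

    public static void main(String[] args) {
        if (GraphicsEnvironment.isHeadless()) {
            System.out.println("No display available; skipping the window.");
            return;
        }
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> showFrame("Tourist Data Analysis", 400, 300));
    }
}
```

Keeping the position arithmetic in a separate method makes it easy to check without opening a window; JFrame also offers setLocationRelativeTo(null) for the same centring effect.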
Setting the window size and location
Code Snippet #4
Code Snippet #5
Example for JFrame
Code Snippet #6
Output:
Code Snippet #7
Buttons
Generic buttons are created from the class JButton (in the package javax.swing). As with other
controls, hundreds of methods are inherited. The standard constructor takes a string to be used as the
button label (or we can later call setText with any string; the older setLabel is deprecated). We can
also call inherited methods such as setFont, setBackground, and setForeground, although default values
are used for most standard buttons. We can also call the setEnabled method (with a boolean parameter)
to tell Java whether a button can be pressed (if not, the label appears in a faded color).
5 Vs of Big Data:
• Volume: This refers to the huge volume of information; datasets are on the order of terabytes and petabytes.
• Velocity: This refers to the high speed at which information is generated and processed.
• Variety: This refers to the wide variety of forms found in the processed information.
• Veracity: This refers to the uncertainty and unreliable values in big data, for
example missing, duplicate and incomplete entries.
• Value: This refers to the useful information contained in big data.
Fig. 1: 5 V's of Big Data
Fig. 3 Big Data Uses
Big data analytics helps companies obtain accurate results that can be used in their favour,
which leads to higher productivity. Some of the important qualities of Big Data
are:
Fig. 5 Significance of Big Data
1.1.7 Big Data Analytics uses:
Banking
Banking is a very large sector and generates a huge amount of information, which can be handled by
big data analytics. Analyzing this data can help protect people from fraud and risk.
Health care
The huge amount of information that a health care center generates, such as patient records,
prescription information and medicine records, can be analyzed by big data analytics.
Manufacturing
Big data analysis can also help the manufacturing sector. With its help
we can make more informed choices, shape our business strategies accordingly and remain
ahead of the competition in the market.
Government
Governments are also using big data analysis. By analyzing data they can
decide, for example, on their next election strategy or on how to reduce crime.
1.2 Problem Statement
Several travel agencies are available to a traveller, but currently there is no efficient
recommendation system that helps us choose a particular travel agency
based on other people's views. To overcome this problem, we propose a Travel Package
Recommendation System that uses tourist data to select the best package.
Firstly, the manually collected dataset is examined and, according to the requirement, various
attributes are selected and shown together. The dataset is processed on two different platforms, i.e.
Java applets and Hadoop, and the data is shown in tabular format. Secondly, we analyze which cab
services were taken by customers in the preferred location and then recommend
a particular cab service according to the customer's requirement.
Finally, we examine the problem with Big Data tools, using HQL and Pig, because of the
constraints of SQL. The reason for the switch from SQL to a Big Data platform is the parallel and
distributed processing available on the Hadoop platform.
1.3 Objective
• To study the constraints of SQL and switch to Big Data technology on the Hadoop platform to get
better results using Hive and Pig, as both use the MapReduce framework, which relies on parallel
processing, and thereby to compare the Hadoop and SQL platforms.
1.4 METHODOLOGY:
This part of the report covers the implementation of the above problem using Java Swing and MySQL.
MySQL is an open-source database that is free to use. It is stable and reliable, and offers
advanced features such as data security, on-demand scalability and complete workflow control.
The Excel sheets contain the dataset, which includes the following attributes:
Person's name (Name), travel service used (Travel Agency), experience with the
service (Rating), package opted for (Package), location of the places (Location).
The project is implemented via a graphical user interface built with Java Swing and the Java
Abstract Window Toolkit. The GUI consists of:
A main JFrame form that asks the customer for login details such as name, age, email, gender,
package to be opted for, number of persons, number of days and location.
Depending on the data provided, the customer is recommended the best cab service to choose in
order to meet their demands optimally.
When the user enters the required details, such as package information, destination,
duration and number of persons, they are offered the cab service that suits them
best; the result is computed at the back end using SQL.
Hadoop capabilities
Apache Hadoop is a well-known Big Data software framework. It was developed with all its modules
integrated in one place. It allows distributed processing of large datasets
across clusters of computers. The storage part at the core of Hadoop is
HDFS (Hadoop Distributed File System), whereas its processing model is MapReduce. Hadoop
splits files into large blocks, which are distributed across the nodes of the cluster, and the
data is then processed in parallel. Because of this parallel computing, data processing is much
faster than in other architectures: analyses of very large datasets that earlier took many hours
can finish in a fraction of that time.
In addition, with several open-source modules, the Hadoop community has helped to expand the
ecosystem. In parallel, IT vendors offer their own Hadoop distributions with enterprise hardening features.
HDFS is the storage unit of Hadoop. It supports structured, semi-structured and
unstructured data, and is designed to handle huge volumes of data. HDFS has a
master-slave architecture and distributes large data across the cluster. Each cluster has one
master, i.e. the NameNode, and several slave nodes, i.e. DataNodes. The NameNode assigns
blocks to particular DataNodes. Because HDFS is built for fault tolerance, it keeps copies of each
block in case of failure; usually it keeps 3 copies.
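The arithmetic behind block splitting and replication can be sketched as below. This is only an illustration: the class HdfsBlockSketch and its methods are hypothetical helpers, not part of the Hadoop API, and the 128 MB block size is an assumption based on the Hadoop 2.x default, which the report itself does not state.

```java
// Hypothetical helper for illustration; not part of the Hadoop API.
class HdfsBlockSketch {
    // Assumed block size: 128 MB, the default in Hadoop 2.x.
    static final long BLOCK_SIZE_BYTES = 128L * 1024 * 1024;
    // The usual three copies of every block.
    static final int REPLICATION = 3;

    // Number of blocks a file is split into (the last block may be partial).
    static long blockCount(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    // Total number of block replicas stored across the DataNodes.
    static long replicaCount(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        // Prints "8 blocks, 24 replicas"
        System.out.println(blockCount(oneGiB) + " blocks, " + replicaCount(oneGiB) + " replicas");
    }
}
```

Under these assumptions a 1 GB file occupies 8 blocks, and the cluster stores 24 block replicas in total, so the loss of a single DataNode leaves at least two copies of every block.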
Hive is a data warehouse software built on top of Apache Hadoop for providing data queries
and analysis. It is written in Java. Hive's query language, known as HiveQL, has a syntax similar
to SQL.
Pig is also a data querying platform built on top of Apache Hadoop. The
language for this platform is Pig Latin, a high-level programming language used to
analyze large datasets. It was developed at Yahoo.
CHAPTER 2
LITERATURE SURVEY
This paper discusses Hive and its abilities as a data warehouse.
It demonstrates how Hive functions in the field of warehousing:
1. Functionality: It presents HiveQL constructs by means of the StatusMeme application.
2. Tuning: It shows the query plan viewer, which demonstrates how HiveQL queries
are converted into physical plans of MapReduce jobs.
3. User interface: It demonstrates the graphical user interface, which allows users to explore a Hive
database, author HiveQL queries and monitor query execution.
4. Scalability: It illustrates the scalability of the system by increasing the
sizes of the input data and the complexity of the queries.
This paper discusses how an organisation can find real opportunities in combining
offline and online data, and gives insight into how combining offline and online
data can be useful. Organizations use recommendation algorithms, which offer the above
advantages. Recommendation algorithms are best known for their use on e-commerce web
sites, where they use a customer's interests as input to generate a list of recommended items.
These algorithms are divided into two types:
The first is content-based filtering, also known as cognitive filtering, which recommends
items based on a comparison between the content of the items and a user profile.
The second is collaborative filtering. It depends not only on the attributes of the items
but also on how people, i.e. different customers, respond to similar items. Organizations
need to bring all of their data attributes, offline and online, into a single database, which
would then be further refined through cutting-edge analysis techniques, and use the combined data
for precise targeting.
Big data is data whose scale, diversity and complexity require new architecture,
techniques, algorithms and analytics to manage it and to extract value and hidden
knowledge from it.
This paper discusses why Hadoop is better. The following points focus on the benefits of
Hadoop:
Its scalability, its strength in handling data of a complex nature, and its open-source
nature make it very popular. The paper explains the architecture of Hadoop and MapReduce.
MapReduce is presented as an independent platform that serves as the service layer suitable for the
different requirements of cloud providers.
It also helps users understand data processing and analysis.
In AWT, a component's look and feel is decided by the platform and not by Java.
This paper therefore discusses Swing, which was developed after the Abstract Window
Toolkit (AWT) to address some of AWT's limitations. Swing components are lightweight
compared to AWT components. Swing follows MVC and supports platform
independence, because it is written in Java. It also provides some extra
features compared to AWT, such as scroll panes, trees and lists.
Big data is an ocean of datasets. It has changed various spheres of our lives. It has made it possible
to analyze data easily and effectively, which in turn has helped us make decisions effectively.
Big data has contributed to various fields such as:
Medicine
Business sites
Science and Engineering
Government
This paper discusses how big data has revolutionized our lives, from science to government and
from enterprises to customers. It is a turning point because it has changed the business
landscape; e-commerce sites, for example, are increasing tremendously. Despite the many
advantages, some challenges still remain, such as security.
MySQL provides different types of storage engines. Different engines use different data storage
mechanisms, indexing techniques and lock levels so as to provide distinct features and
capabilities.
This paper discusses the various storage engines of MySQL.
MyISAM ENGINE:
It is the default storage engine in MySQL 5.1.
Foreign keys and transactions are not supported
by it. It has higher access speeds.
InnoDB ENGINE:
InnoDB provides MySQL with transaction-safe tables.
It offers capabilities such as transaction rollback and crash recovery.
These features improve multi-user concurrency.
Thus, this paper summarizes the capabilities of the two engines and introduces optimization concepts
by examining the two main engines.
2.8 Analysis of the Use of Online Travel Agencies (OTAs) based on the perception of
Consumers[8]
In order to develop effective work in tourism companies, it is necessary to know the opinion of
consumers about their services. Considering that the travel agency sector is constantly changing,
because of the easy access of consumers to information technology, these companies must focus
on service quality.
As for hotel managers and OTAs, they should not focus solely on direct sales, but should also be
concerned about brand reputation, which is projected in UGC on the Internet, as highlighted
throughout the paper.
Regarding traditional travel agencies, the conclusions obtained also suggest that an online presence
should be created and maintained in order to survive and recover the competitiveness of the sector,
since Internet use by young consumers is increasing. In the near future, even older
consumers will book accommodation through online platforms.
2.9 “Scalability Study Of Hive and Hadoop Mapreduce In Big Data Analytics” [9]
The first question that arises is how to move an existing data infrastructure to Hadoop when that
infrastructure depends on traditional relational databases and the Structured Query Language (SQL).
This is where Hive comes in. Facebook created Hive, which relies on the familiar concepts of tables,
columns and partitions, providing a high-level query tool for accessing data from
their existing Hadoop warehouse.
This paper presents a comparison between Hive and MapReduce:
All the work is done through Hive, but MapReduce is what actually does the work
underneath. An experiment was conducted on the word count test, in which Hive completed the task
within 35 seconds and plain MapReduce took around 1 min 10 seconds.
This research paper therefore concludes that Hive is clearly preferable to plain
MapReduce and outperforms it in execution.
Hadoop is a software framework that is installed on a Linux cluster to enable large-scale data
analysis. The Hadoop Distributed File System (HDFS) is a robust file system provided by
Hadoop, and in addition there is a Java-based API that allows parallel processing across the nodes of
the cluster. Programs use a Map/Reduce execution engine, which works as a distributed computing
system over large data sets, an approach introduced by Google.
There are separate Map and Reduce steps; each step is executed in parallel,
operating on sets of key-value pairs. Processing is parallelized over many nodes
working on terabyte or larger data sets. The Hadoop framework automatically
schedules tasks close to the data on which they will work, with "close" meaning the
same node or, at least, the same rack. Node failures are also handled
automatically. Hadoop itself is a top-level Apache project.
2.11 “Storage and Processing Speed For Knowledge from Enhanced Cloud Framework”[11]
Cloud is a pool of servers, all interconnected through the web. The main concern
in the cloud is storing data (knowledge) and processing that variety of data; another concern
is the security of that data. Nowadays, different kinds of data
(structured, semi-structured and unstructured) are managed across various
operational platforms.
Another problem is the retrieval of historical data. These problems are solved with the help
of Hadoop and the Sqoop and Flume tools. Sqoop loads data from a database into Hadoop
(HDFS), and Flume loads data from server files into the Hadoop distributed file
system. The last point is solved with the help of clusters in the Hadoop distributed file system,
with processing handled by MapReduce, Pig, Hive and similar tools.
This paper summarizes storage and processing speed in an enhanced cloud with the Hadoop framework.
CHAPTER 3
SYSTEM DEVELOPMENT
3.1 Designing
This section covers the development of the project in Java and in the big data models using HQL and Pig.
This section covers the hardware and software used while implementing our problem statement using Hive,
Pig and SQL.
3.2.1 Java
Software Requirements
• Language : Java
• Version : JDK 1.5
• IDE : Net Beans
• Back-end : Sql
3.2.2 For Hive and Pig platform
For a single node, the hardware requirements are as follows:
Memory          4 GB (2.5 GB on virtual machine)
Software        Version
Hadoop          2.7.1
Hive            1.2.1
Pig
Netbeans IDE    8.0.2
Web browser     Google Chrome
Advantages of VMware Workstation are :
This part discusses the analytical models in the field of Big Data.
In traditional databases, the schema of the table is enforced at data load
time. If the data being loaded does not conform to the schema, the
load is rejected; this practice is known as Schema-on-Write. Schema-on-Write
allows queries to execute faster, as the data is already arranged in a specific format and
locating the column index or parsing the data is easy.
The main advantages of schema-on-write are precision and query speed. In the
normal (RDBMS) manner, suppose we construct a schema consisting of 9
columns and try to load data that can fill only 8 columns: the data
would be rejected as a result of schema-on-write, because here the data is read against the
schema before it is written to the database.
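The 9-column example above can be sketched as a simple check that runs before a row is written. This is an illustration only: the class SchemaOnWriteSketch and its accepts method are hypothetical names, and the sketch assumes plain comma-separated rows without quoted fields and validates only the number of columns, not their types.

```java
// Hypothetical illustration of a schema-on-write check.
class SchemaOnWriteSketch {

    // Accepts a comma-separated row only if it has exactly as many
    // fields as the schema has columns.
    static boolean accepts(int schemaColumns, String csvRow) {
        // A limit of -1 keeps trailing empty fields, so "a,b," counts as three fields.
        return csvRow.split(",", -1).length == schemaColumns;
    }

    public static void main(String[] args) {
        String eightFields = "1,2,3,4,5,6,7,8";
        String nineFields = "1,2,3,4,5,6,7,8,9";
        // A 9-column schema rejects the 8-field row and accepts the 9-field row.
        System.out.println(accepts(9, eightFields)); // false
        System.out.println(accepts(9, nineFields));  // true
    }
}
```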
Fig. 6 Data Models for shaping data into tabular structure
With the Big Data and NoSQL paradigm, "Schema-on-Read" means we do not need to know how
we will use our data when we are storing it.
We do need to know how we will use our data when we are using it, and model it accordingly.
Example: We may at first put the data on HDFS in files, then apply a table structure
in Hive.
Fig. 7 Process for creating table and uploading data using the HDFS
In the big data ecosystem, there are three kinds of data, i.e. structured, semi-structured
and unstructured. There are a few techniques through which we can process
these kinds of data.
For instance, for structured data we make use of Hive
with HQL.
For unstructured data we first load the file into the HDFS file system and
after that convert it into a tabular form using custom MapReduce programs. For
semi-structured data we use JSON/XML files to convert it into a tabular
format.
3.4 Analysis
This section explains the exploration of the data according to the problem statement using Hive.
In the data analysis, the following steps take place:
1. The downloaded .csv file (tourist.csv) is loaded into HDFS using:
2. After the file is loaded into HDFS, a database (name - tourist) is created.
3.4.2 Design Analysis:
Snippet #8
FLOW CHART DIAGRAM:
Snippet #9
3.5 Algorithms
This section details the algorithms used for the problem addressed by the project.
CSV is a dataset format widely used in machine learning and data science. MS Excel can open CSV
files for basic data manipulation. Often, however, complex SQL queries need to be executed on CSV data,
which is not possible with MS Excel.
We therefore need to convert CSV files to database tables before we can run complex SQL
queries on them. There are many ways to transform CSV data into a database table.
One approach is to create a new table and copy all the information from the CSV file into the table.
However, when the dataset is very large, copying and pasting data can be extremely cumbersome
and time-consuming.
Another way is to write a script that reads the data from the CSV file and inserts it into the table.
This method is faster than copy-pasting, but a hand-written script is still required.
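The reading half of such a script might look like the sketch below. It is an illustration under stated assumptions: the class TouristCsvSketch and its Review type are hypothetical, the column order (Name, Travel Agency, Rating, Package, Location) follows the attribute list given in the methodology, fields are assumed to contain no quoted commas, and the sample rating value is made up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reader for the tourist dataset; names are illustrative.
class TouristCsvSketch {

    // One row of the dataset: Name, Travel Agency, Rating, Package, Location.
    static class Review {
        final String name, agency, pkg, location;
        final double rating;

        Review(String name, String agency, double rating, String pkg, String location) {
            this.name = name;
            this.agency = agency;
            this.rating = rating;
            this.pkg = pkg;
            this.location = location;
        }
    }

    // Parses one comma-separated line (assumes no quoted commas inside fields).
    static Review parseLine(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        return new Review(f[0].trim(), f[1].trim(),
                Double.parseDouble(f[2].trim()), f[3].trim(), f[4].trim());
    }

    // Reads every non-blank row, optionally skipping a header line.
    static List<Review> readAll(Reader source, boolean hasHeader) {
        List<Review> rows = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            boolean first = true;
            while ((line = in.readLine()) != null) {
                boolean isHeader = first && hasHeader;
                first = false;
                if (isHeader || line.trim().isEmpty()) {
                    continue;
                }
                rows.add(parseLine(line));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample row for demonstration only.
        String csv = "Name,Travel Agency,Rating,Package,Location\n"
                   + "Aman,Destination Go,4.5,C,Rishikesh\n";
        for (Review r : readAll(new StringReader(csv), true)) {
            System.out.println(r.agency + " rated " + r.rating + " for " + r.location);
        }
    }
}
```

Each parsed row can then be passed to a JDBC insert statement; reading from a Reader rather than a file path keeps the parsing logic independent of where the CSV is stored.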
• Establish a connection: Use the DriverManager.getConnection() method to
create a Connection object, which represents a physical connection with the database.
• Execute a query: Use an object of type Statement to build and submit an
SQL statement to the database.
• Extract data from the result set: Use the appropriate ResultSet.getXXX() method to
retrieve data from the result set.
• Clean up the environment: Explicitly close all database resources rather than depending
on the JVM's garbage collection.
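The four steps above can be sketched with the standard java.sql API as follows. The table name reviews, the column names, the JDBC URL and the credentials are all placeholders assumed for the illustration; the report does not give the actual schema or query. The sketch uses PreparedStatement, a subtype of Statement, so that the location is passed as a parameter, and try-with-resources to close every resource explicitly.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical JDBC sketch; table, columns, URL and credentials are placeholders.
class BestAgencyJdbcSketch {

    // Assumed schema: reviews(travel_agency, rating, location).
    static final String BEST_AGENCY_SQL =
          "SELECT travel_agency, AVG(rating) AS avg_rating "
        + "FROM reviews WHERE location = ? "
        + "GROUP BY travel_agency ORDER BY avg_rating DESC LIMIT 1";

    // Steps 2 and 3: execute the query and extract data from the result set.
    static String bestAgency(Connection conn, String location) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(BEST_AGENCY_SQL)) {
            stmt.setString(1, location);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next() ? rs.getString("travel_agency") : null;
            }
        } // Step 4: try-with-resources closes the statement and result set.
    }

    public static void main(String[] args) {
        String url = "jdbc:mysql://localhost:3306/tourist"; // placeholder URL
        // Step 1: establish the connection (closed automatically at the end).
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            System.out.println("Best agency: " + bestAgency(conn, "Rishikesh"));
        } catch (SQLException e) {
            System.out.println("Could not query the database: " + e.getMessage());
        }
    }
}
```

Running this against a real database requires the MySQL Connector/J driver on the classpath; without it, getConnection fails with an SQLException, which the sketch reports instead of crashing.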
The MapReduce algorithm contains two important tasks, namely Map and Reduce.
The mapping is done by means of the Mapper class, and the reducing is done by means of the Reducer class.
The Mapper class takes the input, tokenizes it, maps it and sorts it. The output of the Mapper class is
used as input by the Reducer class, which in turn searches for matching pairs and reduces them.
MapReduce implements various mathematical algorithms to divide a job into small parts
and assign them to different systems. In short, the MapReduce algorithm helps in sending the Map
and Reduce tasks to the appropriate servers in a cluster. These algorithms include:
• Sorting
• Searching
Fig. 11 Mapper Stage
• Sorting
Sorting is one of the basic MapReduce algorithms used to process and analyze data.
MapReduce automatically sorts the key-value pairs output by
the Mapper class by their keys.
• Searching
Searching plays a very important role in MapReduce algorithms. It is used first in the combiner
stage and then in the Reducer stage.
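The flow from mapper to reducer can be imitated in plain Java, without the Hadoop libraries, as in the sketch below. It is a single-machine illustration of the idea rather than Hadoop code: the class MapReduceSketch, its method names and the "agency,rating" input format are assumptions made for the example. The map step emits key-value pairs, the shuffle step groups and sorts them by key, and the reduce step averages the values of each key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-machine imitation of map, shuffle/sort and reduce; not Hadoop code.
class MapReduceSketch {

    // Map step: turn one "agency,rating" line into a (key, value) pair.
    static Map.Entry<String, Double> map(String line) {
        String[] parts = line.split(",");
        return new SimpleEntry<>(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    // Shuffle/sort step: group values by key; a TreeMap keeps the keys sorted.
    static TreeMap<String, List<Double>> shuffle(List<Map.Entry<String, Double>> pairs) {
        TreeMap<String, List<Double>> grouped = new TreeMap<>();
        for (Map.Entry<String, Double> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse all values of one key into their average.
    static double reduce(List<Double> ratings) {
        double sum = 0;
        for (double r : ratings) {
            sum += r;
        }
        return sum / ratings.size();
    }

    // Runs the three steps in order and returns the average rating per agency.
    static TreeMap<String, Double> run(List<String> lines) {
        List<Map.Entry<String, Double>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(map(line));
        }
        TreeMap<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In Hadoop the map and reduce calls run on different nodes and the framework performs the shuffle and sort between them; here the same three stages simply run one after another in memory.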
Fig. 12 Reducer Stage
Big data applications have become increasingly important over the past years, as many
organizations from different sectors increasingly depend on information from large volumes of data.
Current data approaches and frameworks are less efficient in the context of big data. Traditional
approaches gave slow responses and lacked scalability, reliability and precision. A lot of
work has been done to face the complex challenges of Big Data. This has resulted in the
development of innumerable types of applications and technologies. The intention is to help identify
and combine the best mix of different Big Data technologies based on technological
requirements and the intended applications. This offers not only a global view of the major Big Data
technologies, but also comparisons across the various layers of the system, such as data storage layers,
data processing layers, querying layers, access layers and management layers.
Big Data mining offers many attractive opportunities. However, researchers face many challenges
when exploring Big Data sets and extracting value and knowledge from such data mines. The
difficulties lie at different levels, including data capture, processing, search, evaluation,
management and visualization. In addition, there are security and privacy issues, especially in
applications driven by distributed data. The deluge of data and distributed streams sometimes
surpasses our ability to harness them. In reality, although the scale of Big Data continues to grow
exponentially, the current technological capacity to handle and explore Big Data sets is only at the
relatively lower levels of petabytes, exabytes and zettabytes of data.
3.7 Test Plan
This section details the test plan for processing the data.
3.7.1 Dataset
In the table, the Name column holds the name of the experienced customer who has reviewed the
travel agencies. The Travel Agency field stands for the various travel agencies available in
the area. The Rating field holds the rating given to the travel agency by the experienced customer.
The Package field represents the package that the customer opted for. The Location field represents
the location of the tour.
One record from the dataset (login values.csv) is shown in the given figure:
Table 2: Sample record of the tourist dataset
In this scenario, the dataset analyzed contains 100 rows.
• To analyze the data records of various column fields collectively.
CHAPTER 4
• STEP 2: Importing the CSV file into MySQL 5.7
First page
Sign up page for new user
Log in page for registered user wanting recommendation on best cabs service for a
particular asked location.
39
Log in page for user who wants to give rating for the cabs agency they have used.
• STEP 4: Connection built – java applets to mySql 5.7 using mysql connector.
• STEP 5: Analysing the dataset to recommend the best cab agency for a
given location using an SQL query.
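The query used in this step is not reproduced in the text. One plausible shape is given as a comment in the sketch below, assuming a table named reviews with travel_agency, location and rating columns (all hypothetical names). The Java code after the comment is an in-memory equivalent, so the logic can be checked without a database; it is a sketch of the idea, not the project's actual applet code.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Collectors;

// Assumed SQL (table and column names are hypothetical):
//   SELECT travel_agency, AVG(rating) AS avg_rating
//   FROM reviews
//   WHERE location = ?
//   GROUP BY travel_agency
//   ORDER BY avg_rating DESC
//   LIMIT 1;
final class CabRecommender {

    // One customer review of a cab agency at a location.
    static final class Review {
        final String agency;
        final String location;
        final double rating;

        Review(String agency, String location, double rating) {
            this.agency = agency;
            this.location = location;
            this.rating = rating;
        }
    }

    // Returns the agency with the highest average rating for the given
    // location, or an empty Optional when no review matches the location.
    static Optional<String> bestAgency(List<Review> reviews, String location) {
        Map<String, Double> averageByAgency = reviews.stream()
                .filter(r -> r.location.equals(location))
                .collect(Collectors.groupingBy(
                        r -> r.agency,
                        Collectors.averagingDouble(r -> r.rating)));

        return averageByAgency.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```

Returning an Optional keeps the "no reviews for this location" case explicit, which corresponds to the SQL query returning no rows.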
• STEP 6: Result
4.2 Using HIVE(HiveQL)
STEP 1: Loading the data set from the local file system to the Hadoop file system. The data
set contains the attributes id (int), name (String), destination (String),
travelagency (String), rating (float), month (String), package (String) and id2 (int).
STEP 3: Creating table new2 in HiveQL. This dataset contains information about
customers who requested a travel agency suggestion based on the
information they provided.
STEP 5: Analyzing the tables new1 and new2 in order to fulfill the request of the
customer who wanted a travel agency suggestion in terms of package,
duration and destination.
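The analysis in this step amounts to joining the review table with the request table. The Java sketch below mimics that join in memory under two assumptions: the join keys are package, destination and month (taking "duration" to mean the month column), and the class and field names are hypothetical stand-ins for the columns of new1 and new2. It does not reproduce the HiveQL statement used in the project, which is not shown in the text.

```java
import java.util.ArrayList;
import java.util.List;

final class RequestJoin {

    // Row of the review table (new1), reduced to the columns used here.
    static final class Review {
        final String travelAgency;
        final String travelPackage;
        final String destination;
        final String month;
        final double rating;

        Review(String travelAgency, String travelPackage,
               String destination, String month, double rating) {
            this.travelAgency = travelAgency;
            this.travelPackage = travelPackage;
            this.destination = destination;
            this.month = month;
            this.rating = rating;
        }
    }

    // Row of the request table (new2).
    static final class Request {
        final String customer;
        final String travelPackage;
        final String destination;
        final String month;

        Request(String customer, String travelPackage,
                String destination, String month) {
            this.customer = customer;
            this.travelPackage = travelPackage;
            this.destination = destination;
            this.month = month;
        }
    }

    // One joined row: a requesting customer paired with a matching review.
    static final class Match {
        final String customer;
        final String travelAgency;
        final double rating;

        Match(String customer, String travelAgency, double rating) {
            this.customer = customer;
            this.travelAgency = travelAgency;
            this.rating = rating;
        }
    }

    // Inner join of requests and reviews on (package, destination, month).
    static List<Match> join(List<Review> reviews, List<Request> requests) {
        List<Match> matches = new ArrayList<>();
        for (Request q : requests) {
            for (Review r : reviews) {
                if (q.travelPackage.equals(r.travelPackage)
                        && q.destination.equals(r.destination)
                        && q.month.equals(r.month)) {
                    matches.add(new Match(q.customer, r.travelAgency, r.rating));
                }
            }
        }
        return matches;
    }
}
```

The nested loop is adequate for a 100-row sample; on large data Hive would compile the equivalent join into MapReduce jobs rather than comparing every pair of rows on a single machine.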
STEP 6: Using ORDER BY, the analyzed result would be:
STEP 8: Final result: Destination Go would be the best travel agency for the
customer named “Aman”, who requested package “C” for destination =
“Rishikesh” in the duration = “October”.
4.3 Using PIG
STEP 3: Joining table1 and table2 on package = “C” to get the information of all
existing customers who have rated the same package.
STEP 4: Using FOREACH and GENERATE to get the travelagency and ratings
for package = “C”.
STEP 5: Using GROUP on travelagency to calculate the average rating of each agency, in
order to find the highest rating and recommend the best travel agency to the
customer in terms of package, duration and rating.
STEP 6: Using ORDER to list the travel agencies in descending order of
rating.
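The GROUP, average and ORDER steps above can be mirrored in plain Java to make the data flow concrete. This is a sketch only: the input is assumed to be the (travelagency, rating) pairs produced by STEP 4, and the class and method names are hypothetical; the Pig script itself is not reproduced in the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class AgencyRanking {

    // One (travelagency, rating) pair, as emitted by FOREACH ... GENERATE.
    static final class AgencyRating {
        final String travelAgency;
        final double rating;

        AgencyRating(String travelAgency, double rating) {
            this.travelAgency = travelAgency;
            this.rating = rating;
        }
    }

    // Groups the pairs by agency, averages the ratings of each group and
    // returns the (agency, average) entries sorted by average, highest first.
    static List<Map.Entry<String, Double>> rankByAverage(List<AgencyRating> rows) {
        // GROUP by travelagency and compute the average rating per group.
        Map<String, Double> averages = rows.stream()
                .collect(Collectors.groupingBy(
                        r -> r.travelAgency,
                        Collectors.averagingDouble(r -> r.rating)));

        // ORDER by average rating, descending.
        List<Map.Entry<String, Double>> ranked =
                new ArrayList<>(averages.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return ranked;
    }
}
```

The first entry of the returned list is the agency that would be recommended; the remaining entries give the runners-up in order.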
4.4 Comparison between existing system and adopted solution
Existing System:
Disadvantages:
The question of which unique features distinguish suggestions for suitable travel packages from those of standard
recommender systems remains very open. Because of the disadvantage of the existing system described above, namely that it
cannot handle very large data, we shifted to Big Data (distributed
architecture). This brings practical and domain challenges in structuring and executing a
suitable recommendation system for customized travel recommendations. The plan would help
visitors find the best travel agency with the selected package deal among all the travel agencies.
A customer gets the best travel agency with the desired package for a particular destination,
based on the ratings given by existing customers who have experience with the packages. In this way,
the travel recommendation system aids in choosing the right travel agency
and makes the deal easy for the client.
Advantages:
CHAPTER 5
CONCLUSION
Conclusion
This project is implemented using SQL and Java applets; for better efficiency when the data set is too big
to query with SQL, it is also implemented using Pig and HiveQL over the MapReduce framework.
The application suggests the best cab agency among all other cab services with respect to factors such as package, duration and
destination. A customer chooses a cab agency for a given
destination based on the recommendations provided by existing customers who have experience with
the same package in a travel agency. This makes it simple and easy for new customers to select
the best travel agency deal.
New customers can select the best travel agency in a short amount of time (instead of navigating
to other websites).
Finally, the aim of the project is to have an optimal system that is efficient in terms of cost and time
and requires less effort. This can be seen in the results of SQL, Hive and Pig. The computing time
they take to analyze the query is SQL > Hive > Pig.
We also conclude that the existing system is not efficient at handling very large real-time data,
so moving to a distributed architecture and parallel computing using Hive and Pig over the MapReduce
framework has certain advantages over it.