RESTAURANT REVIEW ANALYSIS USING PYTHON
One of the most effective tools any restaurant has is the ability to track food and beverage sales daily. Recommender systems currently play an important role in both academia and industry, and they are very helpful for managing information overload. In this paper, we apply machine learning techniques to user reviews and analyze the valuable information they contain. Reviews are useful for decision making by both customers and restaurant owners. We build a machine learning model with Natural Language Processing (NLP) techniques that can capture users' opinions from their reviews. Python was used for the experimentation.
The growth of the internet, driven by social networks such as Facebook, Twitter, LinkedIn, and Instagram, has led to significant user interaction and has empowered users to express their opinions about products, services, events, and their preferences, among other things. It has also given users the opportunity to share their wisdom and experiences with each other. The rapid development of social networks is causing explosive growth of digital content. It has turned online opinions, blogs, tweets, and posts into a very valuable asset for corporations that want to gain insights from the data and plan their strategy. Business organizations need to process and study these sentiments to investigate the data and gain business insights. The traditional approach of manually extracting complex features, identifying which features are relevant, and deriving patterns from this huge volume of information is very time consuming and requires significant human effort.
However, Deep Learning can exhibit excellent performance via Natural Language Processing (NLP) techniques when performing sentiment analysis on this massive amount of information. The core idea of Deep Learning techniques is to identify complex features extracted from this vast amount of data, without much external intervention, using deep neural networks. These algorithms automatically learn new complex features. Both automatic feature extraction and the availability of resources are very important when comparing the traditional machine learning approach with deep learning techniques.
Here the goal is to classify the opinions and sentiments expressed by users.
The online medium has become a significant way for people to express their opinions, and with social media there is an abundance of opinion information available. Using sentiment analysis, the polarity of an opinion can be determined as positive, negative, or neutral by analyzing its text. Sentiment analysis has helped companies gather their customers' opinions on their products, predict the outcomes of elections, and gauge opinions from movie reviews. The information gained from sentiment analysis is useful for companies making future decisions. Many traditional approaches to sentiment analysis use the bag-of-words method. The bag-of-words technique does not consider language morphology, and it can incorrectly classify two phrases as having the same meaning simply because they contain the same bag of words. The relationship between the collection of words is considered instead of the relationship between individual words: when determining the overall sentiment, the sentiment of each word is determined and the values are combined using a function. Bag of words also ignores word order, which leads phrases containing negation to be incorrectly classified.
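This limitation can be seen in a minimal sketch (assuming scikit-learn is available): two phrases with opposite sentiment but identical word counts produce exactly the same bag-of-words vector.

# A minimal sketch showing why a plain bag-of-words representation
# loses word order and negation.
from sklearn.feature_extraction.text import CountVectorizer

phrases = [
    "the food was good, not bad at all",   # positive
    "the food was bad, not good at all",   # negative
]

vectorizer = CountVectorizer()
bags = vectorizer.fit_transform(phrases)

# Both phrases contain exactly the same words, so their bag-of-words
# vectors are identical even though their sentiments are opposite.
print(vectorizer.get_feature_names_out())
print(bags.toarray())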
1.2 EXISTING SYSTEM WITH DRAWBACKS
Without NLP and access to the right data, it is difficult to discover and collect the insight necessary for driving business decisions. Deep Learning algorithms are used to build a model.
Advanced techniques such as natural language processing are used for the sentiment analysis, which improves the accuracy of our project.
NLP defines a relation between a user's posted tweet and the opinions and suggestions of people.
NLP is an effective way to understand the natural language used by people and uncover the sentiment behind it. NLP makes speech analysis easier.
CHAPTER 2
ANALYSIS
System analysis is conducted for the purpose of studying a system or its parts
in order to identify its objectives. It is a problem-solving technique that improves the
system and ensures that all the components of the system work efficiently to accomplish
their purpose.
SOFTWARE REQUIREMENTS
One of the most difficult tasks, once the system requirements are known, is the selection of software, that is, determining whether a particular software package fits those requirements.
For analyzing restaurant reviews, our project has been divided into the following modules:
1. Data Loading:
We collect the data from various sources such as different websites, PDFs, and Word documents. After collecting the data we convert it into a CSV file.
Pandas:
In order to work with the data in Python, we need to read the CSV file into a pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular data has rows and columns, just like our CSV file.
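A minimal sketch of this step, assuming the collected reviews have been saved to a hypothetical file named Restaurant_Reviews.csv with "Review" and "Rating" columns:

# Read the collected reviews from a CSV file into a pandas DataFrame.
import pandas as pd

df = pd.read_csv("Restaurant_Reviews.csv")

print(df.shape)    # number of rows and columns
print(df.head())   # first five reviews as a quick sanity check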
2. Data Preprocessing:
After collecting the data and converting it into a CSV file, we break the data into individual sentences. Then, using Natural Language Processing (NLP), we eliminate stop words. Stop words are words that carry little useful information in a sentence; for example, "the", "a", "an", and "in" are some of the stop words in English.
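A minimal stop-word removal sketch, assuming NLTK and its English stop-word list are available:

# Remove English stop words from a sentence using NLTK's stop-word list.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def remove_stop_words(sentence):
    # Keep only the tokens that are not in the stop-word list.
    tokens = sentence.lower().split()
    return [t for t in tokens if t not in stop_words]

print(remove_stop_words("The food in an average restaurant was amazing"))
# ['food', 'average', 'restaurant', 'amazing']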
3. Model Training:
Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
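A minimal training sketch, assuming scikit-learn is available; the tiny inline dataset is only illustrative and stands in for the reviews loaded from the CSV file:

# Train a Naive Bayes classifier on bag-of-words features of review text.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative dataset; in the project the reviews come from the CSV file.
df = pd.DataFrame({
    "Review": ["Loved the food and the service",
               "Terrible taste and rude staff",
               "Great ambience and tasty dishes",
               "Worst dinner I have ever had"],
    "Sentiment": [1, 0, 1, 0],   # 1 = positive, 0 = negative
})

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Review"])
y = df["Sentiment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = MultinomialNB().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))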
4. Prediction
Here we use a web browser such as Google Chrome to submit a review to the model and visualize the output. When we enter a review, the model classifies whether it is positive or negative.
It gives the result in the form of two labels (a minimal prediction sketch follows):
a. Positive label (rating greater than 3)
b. Negative label (rating less than 3)
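A minimal prediction sketch, continuing from the training sketch above: "vectorizer" and "model" are assumed to be the fitted CountVectorizer and MultinomialNB objects, and the helper names are hypothetical.

def label_from_rating(rating):
    # The dataset labels follow the rule described above:
    # ratings greater than 3 are positive, ratings less than 3 are negative.
    return "Positive" if rating > 3 else "Negative"

def classify_review(text):
    # Vectorize the raw review text and map the model's 0/1 output to a label.
    features = vectorizer.transform([text])
    return "Positive" if model.predict(features)[0] == 1 else "Negative"

print(label_from_rating(5))                                 # Positive
print(classify_review("Cold food and very slow service"))   # expected: Negative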
3. DESIGN
3.1 BLOCK DIAGRAM
The block diagram is typically used for a higher level, less detailed description
aimed more at understanding the overall concepts and less at understanding the details
of implementation.
A DFD shows what kinds of information will be input to and output from the system,
where the data will come from and go to, and where the data will be stored. It doesn’t
show information about timing of processes, or information about whether processes
will operate in sequence or in parallel. A DFD is also called a "bubble chart".
DFD Symbols:
Process: People, procedures or devices that use or produce (Transform) data. The
physical component is not identified.
In our project, we built the data flow diagrams at the very beginning of business process modelling in order to model the functions that our project has to carry out and the interaction between those functions, focusing on the data exchanged between processes.
A context-level data flow diagram is created using the Structured Systems Analysis and Design Method (SSADM). This level shows the overall context of the
system and its operating environment and shows the whole system as just one process.
It does not usually show data stores, unless they are "owned" by external systems (e.g. accessed by but not maintained by this system); however, these are often shown as external entities. The context-level DFD is shown in Fig. 3.2.1.
After starting and executing the application, training and testing the dataset can
be done as shown in the above figure.
3.2.3 Detailed Level Diagram
This level explains each process of the system in a detailed manner. The first detailed-level DFD (generation of individual fields) shows how data flows through the individual processes/fields.
The second detailed-level DFD (generation of the detailed process of the individual fields) shows how data flows through the system to form a detailed description of the individual processes.
After starting and executing the application, the dataset is trained by dividing it into a 2-D array and scaling it with a normalization algorithm, and then testing is done.
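A minimal sketch of this split-and-normalize step, assuming scikit-learn is available; the feature array here is a random placeholder:

# Split a 2-D feature array into training and test sets, then normalize.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X = np.random.rand(100, 5)          # placeholder 2-D feature array
y = np.random.randint(0, 2, 100)    # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)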
• User Model View: The analysis representation describes a usage scenario from the end user's perspective.
• Structural Model View: In this model the data and functionality are arrived at from inside the system.
• Implementation Model View: In this view the structural and behavioral parts of the system are represented as they are to be built.
• Environmental Model View: In this view the structural and behavioral aspects of the environment in which the system is to be implemented are represented.
3.3.1 Use Case Diagram:
Use case diagrams are one of the five diagrams in the UML for modeling the dynamic aspects of systems (activity diagrams, sequence diagrams, statechart diagrams, and collaboration diagrams are the four other kinds of UML diagrams for modeling the dynamic aspects of systems). Use case diagrams are central to modeling the behavior of a system, a subsystem, or a class. Each one shows a set of use cases, actors, and their relationships.
Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the user confidence that the new system will work and be effective.
What is Python?
• Windows 10 & 11
• Python Programming
• Open source libraries: Pandas, NumPy, SciPy, Matplotlib, OpenCV
Python Versions
Python 2.0 was released on 16 October 2000 and had many major new features, including a cycle-detecting garbage collector and support for Unicode. With this release, the development process became more transparent and community-backed.
Python 3.0 (initially called Python 3000 or py3k) was released on 3 December
2008 after a long testing period. It is a major revision of the language that is not
completely backward-compatible with previous versions. However, many of its major
features have been backported to the Python 2.6.x and 2.7.x version series, and releases
of Python 3 include the 2to3 utility, which automates the translation of Python 2 code
to Python 3.
Python 2.7's end-of-life date (a.k.a. EOL, sunset date) was initially set at 2015, then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3. In January 2017, Google announced work on a Python 2.7-to-Go transcompiler to improve performance under concurrent workloads.
Python 3.6 had changes regarding UTF-8 (in Windows, PEP 528 and PEP 529)
and Python 3.7.0b1 (PEP 540) added a new "UTF-8 Mode" (which overrides the POSIX locale).
Why Python?
• Python is a scripting language like PHP, Perl, and Ruby.
• Excellent documentation
• Thriving developer community
• You'll learn how to use Python and its libraries to explore your data with the help of Matplotlib and Principal Component Analysis (PCA).
• You'll preprocess your data with normalization and split it into training and test sets.
• Next, you'll work with the well-known K-Means algorithm to construct an unsupervised model, fit this model to your data, predict values, and validate the model that you have built.
• As an extra, you'll also see how you can use Support Vector Machines (SVM) to construct another model to classify your data; a short sketch of this workflow follows this list.
• Machine learning was born from pattern recognition and the theory that computers can learn without being explicitly programmed for specific tasks.
• It is a method of data analysis that automates analytical model building.
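A minimal sketch of that workflow (normalization, PCA, train/test split, K-Means, and SVM), assuming scikit-learn is available and using its built-in iris dataset purely for illustration:

# Normalize, reduce with PCA, split, then fit K-Means (unsupervised) and SVM (supervised).
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

X_scaled = MinMaxScaler().fit_transform(X)          # normalization
X_2d = PCA(n_components=2).fit_transform(X_scaled)  # PCA for exploration

X_train, X_test, y_train, y_test = train_test_split(X_2d, y, test_size=0.2, random_state=0)

clusters = KMeans(n_clusters=3, n_init=10).fit_predict(X_train)  # unsupervised model
svm = SVC(kernel="rbf").fit(X_train, y_train)                    # supervised model

print("SVM test accuracy:", svm.score(X_test, y_test))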
Machine learning tasks are typically classified into two broad categories,
depending on whether there is a learning "signal" or "feedback" available to a learning
system. They are:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted
to special feedback:
Active learning: The computer can only obtain training labels for a limited set of
instances (based on a budget), and also has to optimize its choice of objects to acquire
labels for. When used interactively, these can be presented to the user for labelling.
Reinforcement learning: training data (in the form of rewards and punishments) is given
only as feedback to the program's actions in a dynamic environment, such as driving a
vehicle or playing a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself
(discovering hidden patterns in data) or a means towards an end (feature learning).
In regression, also a supervised problem, the outputs are continuous rather than
discrete.
Regression: The analysis or measure of the association between one variable (the
dependent variable) and one or more other variables (the independent variables), usually
formulated in an equation in which the independent variables have parametric
coefficients, which may enable future values of the dependent variable to be predicted.
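A minimal regression sketch, assuming scikit-learn is available, fitting a line to a handful of made-up points:

# Fit y = a*x + b so that future values of the dependent variable can be predicted.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])   # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])  # dependent variable

reg = LinearRegression().fit(x, y)
print("coefficient:", reg.coef_[0], "intercept:", reg.intercept_)
print("prediction for x = 6:", reg.predict([[6]])[0])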
Classification
Classification assigns items to categories: for example, filtering emails as "spam" or "not spam", or, when looking at transaction data, labelling it as "fraudulent" or "authorized". In short, classification either predicts categorical class labels or classifies data (constructs a model) based on the training set and the values (class labels) of the classifying attributes, and then uses that model to classify new data. Commonly used classification algorithms include:
1. Logistic regression
2. Decision tree
3. Random forest
4. Naive Bayes.
1. Logistic Regression
Logistic regression is a supervised classification technique that models the probability of a categorical (typically binary) outcome using the logistic function.
2. Decision Tree
Decision Trees are a type of supervised machine learning (that is, you provide both the input and the corresponding output in the training data) in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves.
3. Random Forest
A greater number of trees in the forest leads to higher accuracy and helps prevent the problem of overfitting. The working of the Random Forest algorithm is illustrated in the accompanying figure.
4. Naive Bayes
A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. A Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
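A minimal sketch comparing the four classifiers listed above on a small synthetic dataset, assuming scikit-learn is available; it is illustrative only, not the project's tuned models:

# Fit each classifier on the same synthetic data and compare test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")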
1. Structured Algorithm
2. Unstructured Algorithm
1. Structured Algorithm
Artificial Neural Networks are computational models inspired by the human brain. Many recent advancements in the field of Artificial Intelligence, including voice recognition, image recognition, and robotics, have been made using Artificial Neural Networks. Artificial Neural Networks are biologically inspired simulations performed on a computer to carry out specific tasks such as:
• Clustering
• Classification
• Pattern Recognition
2. Unstructured Algorithm
For example, a DNN that is trained to recognize dog breeds will go over the
given image and calculate the probability that the dog in the image is a certain breed.
The user can review the results and select which probabilities the network should
display (above a certain threshold, etc.) and return the proposed label. Each
mathematical manipulation as such is considered a layer, and complex DNNs have many layers, hence the name "deep" networks.
Modules in Python
Pandas: -
pandas can handle, among other kinds of data:
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational/statistical data sets; the data need not be labelled at all to be placed into a pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other third-party libraries. A few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data
• Size mutability: columns can be inserted into and deleted from DataFrame and higher-dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data in computations
• Powerful, flexible group-by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
• Easy conversion of ragged, differently indexed data in other Python and NumPy data structures into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large datasets
• Merging and joining of data sets
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, and databases, and saving/loading data in the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However, as with anything else, generalization usually sacrifices performance, so if you focus on one feature for your application you may be able to create a faster specialized tool.
pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
pandas has been used extensively in production in financial applications.
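A minimal pandas sketch of two of the features mentioned above, missing-data handling and split-apply-combine with groupby, on a made-up table of ratings:

# Demonstrate NaN handling and groupby aggregation on a tiny DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "restaurant": ["A", "A", "B", "B"],
    "rating": [4.5, np.nan, 3.0, 5.0],   # NaN marks a missing rating
})

print(df["rating"].isna().sum())                 # count missing values
print(df["rating"].fillna(df["rating"].mean()))  # fill them with the mean

# Split-apply-combine: mean rating per restaurant.
print(df.groupby("restaurant")["rating"].mean())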
NumPy: -
NumPy provides operations related to linear algebra; it has built-in functions for linear algebra and for random number generation.
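A minimal NumPy sketch of the linear algebra and random number facilities mentioned above:

# Basic linear algebra and random number generation with NumPy.
import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.linalg.det(a))              # determinant
print(np.linalg.inv(a))              # matrix inverse
print(np.dot(a, np.linalg.inv(a)))   # should be close to the identity matrix

rng = np.random.default_rng(seed=0)
print(rng.random((2, 3)))            # 2x3 array of random numbers in [0, 1)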
Scikit-learn: -
e. Manifold Learning: for summarizing and depicting complex multi-dimensional data.
f. Supervised Models: a vast array, not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines, and decision trees.
Matplotlib:-
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use
is discouraged. SciPy makes use of matplotlib.
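A minimal matplotlib sketch, plotting a bar chart of hypothetical positive and negative review counts:

# Bar chart of sentiment counts (the numbers here are made up for illustration).
import matplotlib.pyplot as plt

labels = ["Positive", "Negative"]
counts = [120, 80]   # hypothetical review counts

plt.bar(labels, counts, color=["green", "red"])
plt.xlabel("Sentiment")
plt.ylabel("Number of reviews")
plt.title("Distribution of restaurant review sentiment")
plt.show()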
5. TESTING
In the process of testing the given project, restaurant review analysis using an AI classification system, we use the testing techniques given below.
Testing is the process of checking functionality; it is the process of executing a program with the intent of finding an error. A good test case is one that has a high probability of finding an as yet undiscovered error, and a successful test is one that uncovers such an error. Software testing is usually performed for one of two reasons:
1) Defect Detection
2) Reliability estimation
The base of the black box testing strategy lies in the selection of appropriate
data as per functionality and testing it against the functional specifications in order
to check for normal and abnormal behavior of the system. Nowadays, it is becoming common to route the testing work to a third party, as the developer of the system knows too much of the internal logic and coding of the system, which makes it unfit for the developer to test the application. The following are different types of techniques
involved in black box testing. They are:
• Equivalence Partitioning
White box testing [8] requires access to source code. Though white box testing [8] can
be performed any time in the life cycle after the code is developed, it is a good practice
to perform white box testing [8] during unit testing phase.
In designing of database the flow of specific inputs through the code, expected output
and the functionality of conditional loops are tested.
• UNIT TESTING: in which each unit (basic component) of the software is tested to verify that the detailed design for the unit has been correctly implemented (a small unit-test sketch is shown after this list of testing levels).
• INTEGRATION TESTING: in which progressively larger groups of tested software
components corresponding to elements of the architectural design are integrated and
tested until the software works as a whole.
• SYSTEM TESTING: in which the software is integrated into the overall product and
tested to show that all requirements are met. A further level of testing is also done, in
accordance with requirements:
• REGRESSION TESTING: refers to the repetition of earlier successful tests
to ensure that changes made in the software have not introduced new bugs/side effects.
• ACCEPTANCE TESTING: Testing to verify a product meets customer specified
requirements. The acceptance test suite is run against supplied input data. Then the
results obtained are compared with the expected results of the client. A correct match
was obtained.
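A minimal unit-test sketch, assuming a hypothetical classify_review() helper that wraps the trained model and returns "Positive" or "Negative":

# Unit tests for the review classifier using Python's built-in unittest module.
import unittest

def classify_review(text):
    # Placeholder standing in for the trained model's prediction.
    return "Negative" if "bad" in text.lower() else "Positive"

class TestClassifier(unittest.TestCase):
    def test_positive_review(self):
        self.assertEqual(classify_review("Great food and service"), "Positive")

    def test_negative_review(self):
        self.assertEqual(classify_review("The food was bad"), "Negative")

if __name__ == "__main__":
    unittest.main()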
BIBLIOGRAPHY
[1] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition. Springer Series in Statistics, 2009.
[2] G. Szabo and B. Huberman. Predicting the popularity of online content. Communications of the ACM, 53(8), 2010.
[3] O. Hegazy and M. Abdul Salam. A Machine Learning Model for Stock Market Prediction.
[4] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language User Guide, Low Price Edition. ISBN 81-7808-769-5, 1997.
[8] B. Beizer. Black-Box Testing: Techniques for Functional Testing of Software and Systems. Wiley, 1995. ISBN 978-0-471-12094-0.