RESTAURANT REVIEW PRODUCTION ANALYSIS USING PYTHON

ABSTRACT

One of the most effective tools any restaurant has is the ability to track food and
beverage sales daily. Recommender systems currently play an important role in both
academia and industry, and they are very helpful for managing information overload. In
this paper, we apply machine learning techniques to user reviews and analyze the valuable
information contained in them. Reviews are useful for decision making by both customers and
owners. We build a machine learning model with Natural Language Processing techniques
that can capture users' opinions from their reviews. The Python language was used for
experimentation.

Keywords: recommender systems, machine learning, python


INTRODUCTION

1.1 ABOUT PROJECT

The growth of the internet due to social networks such as Facebook, Twitter,
LinkedIn and Instagram has led to significant user interaction and has empowered
users to express their opinions about products, services, events and their preferences,
among others. It has also provided opportunities for users to share their wisdom and
experiences with each other. The rapid development of social networks is causing
explosive growth of digital content. It has turned online opinions, blogs, tweets, and
posts into a very valuable asset for corporates to gain insights from the data and plan
their strategy. Business organizations need to process and study these sentiments to
investigate data and to gain business insights. The traditional approach of manually
extracting complex features, identifying which features are relevant, and deriving the
patterns from this huge volume of information is very time consuming and requires
significant human effort. However, Deep Learning can exhibit excellent performance via
Natural Language Processing (NLP) techniques when performing sentiment analysis on this
massive information. The core idea of Deep Learning techniques is to identify complex
features extracted from this vast amount of data without much external intervention,
using deep neural networks. These algorithms automatically learn new complex features.
Both automatic feature extraction and availability of resources are very important when
comparing the traditional machine learning approach and deep learning techniques.
Here the goal is to classify the opinions and sentiments expressed by users.

Sentiment analysis is a set of techniques and algorithms used to detect the sentiment
(positive, negative, or neutral) of a given text. It is a very powerful application of
natural language processing (NLP) and finds usage in a large number of industries. It
refers to the use of NLP, text analysis, computational linguistics, and biometrics to
systematically identify, extract, quantify, and study affective states and subjective
information. Sentiment analysis sometimes goes beyond the categorization of texts to
find opinions and categorizes them as positive or negative, desirable or undesirable.
The figure below describes the architecture of sentiment classification on texts. In it,
we modify the provided reviews by applying specific filters, and we use the prepared
dataset by applying the parameters and implement our proposed model for evaluation.

Another challenge of microblogging is the incredible breadth of topics that is covered.
It is not an exaggeration to say that people tweet about anything and everything.
Therefore, to be able to build systems to mine sentiment about any given topic, we need
a method for quickly identifying data that can be used for training. In this paper, we
explore one method for building such data: using hashtags (e.g., #bestfeeling,
#epicfail, #news) to identify positive, negative, and neutral reviews to use for
training three-way sentiment classifiers.

The online medium has become a significant way for people to express their opinions,
and with social media there is an abundance of opinion information available. Using
sentiment analysis, the polarity of an opinion can be found, such as positive, negative,
or neutral, by analyzing the text of the opinion. Sentiment analysis has been useful for
companies to get their customers' opinions on their products, to predict the outcomes of
elections, and to gather opinions from movie reviews. The information gained from
sentiment analysis is useful for companies making future decisions. Many traditional
approaches in sentiment analysis use the bag-of-words method. The bag-of-words technique
does not consider language morphology, and it could incorrectly classify two phrases as
having the same meaning because they could have the same bag of words. The relationship
between the collections of words is considered instead of the relationship between
individual words. When determining the overall sentiment, the sentiment of each word is
determined and combined using a function. Bag of words also ignores word order, which
leads to phrases containing negation being incorrectly classified.
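As a brief illustration of this limitation, the minimal sketch below (an illustrative example, not taken from the paper's implementation) uses scikit-learn's CountVectorizer to show that two phrases with opposite sentiment, differing only in word order and negation placement, map to exactly the same bag-of-words vector.

from sklearn.feature_extraction.text import CountVectorizer

# Two reviews with opposite sentiment but the same multiset of words.
reviews = [
    "the food was good, not bad at all",
    "the food was bad, not good at all",
]

vectorizer = CountVectorizer()
bags = vectorizer.fit_transform(reviews).toarray()

print(vectorizer.get_feature_names_out())
# Both rows are identical: bag-of-words cannot tell the two phrases apart.
print(bags[0])
print(bags[1])
print("identical representation:", (bags[0] == bags[1]).all())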
1.2 EXISTING SYSTEM WITH DRAWBACKS

Existing approaches to sentiment analysis are knowledge-based techniques.

Knowledge-based techniques make use of prebuilt lexicon sources containing the polarity
of sentiment words, such as SentiWordNet (SWN), for determining the polarity of a tweet.
The lexicon-based approach suffers from poor recognition of sentiment.

1.3 PROPOSED SYSTEM WITH FEATURES

In the proposed system, sentiment analysis is done using natural language processing,
which defines a relation between a user-posted tweet, the opinion expressed in it and,
in addition, people's suggestions.

Truly listening to a customer's voice requires deeply understanding what they have
expressed in natural language. NLP is the best way to understand the natural language
used and uncover the sentiment behind it. NLP also makes speech analysis easier.

Without NLP and access to the right data, it is difficult to discover and collect the
insight necessary for driving business decisions. Deep Learning algorithms are used to
build the model.

ADVANTAGES OF PROPOSED SYSTEM

Advanced techniques such as natural language processing are used for the sentiment
analysis, which makes our project very accurate.

NLP defines a relation between a user-posted tweet, the opinion expressed in it and, in
addition, people's suggestions.

NLP is the best way to understand the natural language used by people and uncover the
sentiment behind it. NLP also makes speech analysis easier.
CHAPTER 2
ANALYSIS

System analysis is conducted for the purpose of studying a system or its parts
in order to identify its objectives. It is a problem-solving technique that improves the
system and ensures that all the components of the system work efficiently to accomplish
their purpose.

2.1 HARDWARE REQUIREMENTS

The selection of hardware is very important to the existence and proper working of any
software. In the selection of hardware, the size and the capacity requirements are also
important.

PROCESSOR      : Intel Core i5 processor
RAM CAPACITY   : 4 GB
HARD DISK      : 512 GB

SOFTWARE REQUIREMENTS

One of the most difficult tasks, once the system requirements are known, is the
selection of software, that is, determining whether a particular software package fits
the requirements.

PROGRAMMING LANGUAGE : Python
TOOL                 : PyCharm
OPERATING SYSTEM     : Windows

2.2 FUNCTIONAL REQUIREMENTS

The following are the functional requirements of our project:

• A training dataset has to be created on which training is performed.

• A testing dataset has to be created on which testing is performed.

NON-FUNCTIONAL REQUIREMENTS:

• Maintainability: Maintainability makes future maintenance easier and helps the system
meet new requirements.
• Robustness: Robustness is the quality of being able to withstand stress, pressures or
changes in procedure or circumstance.
• Reliability: Reliability is the ability of a person or system to perform and maintain
its functions under given circumstances.
• Size: The size of a particular application plays a major role; if the size is small,
efficiency will be high.
• Speed: Higher speed is better. Since the number of lines in our code is small, the
speed is high.

3.2 MODULE DESCRIPTION

For analyzing restaurant reviews, our project has been divided into the following
modules:

1. Dataset Loading
2. Data Preprocessing
3. Data Modeling
4. Prediction

1. Data Loading:

We collect the data from various sources such as different websites, PDFs and Word
documents. After collecting the data, we convert it into a CSV file.

Pandas:
In order to be able to work with the data in Python, we need to read the CSV file into a
pandas DataFrame. A DataFrame is a way to represent and work with tabular data. Tabular
data has rows and columns, just like our CSV file.
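As a minimal sketch of this loading step (the file name Restaurant_Reviews.csv and the column names Review and Liked are assumptions for illustration, not necessarily the paper's exact dataset), the data can be read into a DataFrame as follows:

import pandas as pd

# Read the collected reviews from a CSV file into a DataFrame.
# Assumed columns: 'Review' (text) and 'Liked' (1 = positive, 0 = negative).
df = pd.read_csv("Restaurant_Reviews.csv")

print(df.shape)    # number of rows and columns
print(df.head())   # first few reviews, as a quick sanity check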

2. Data Preprocessing:

After collecting the data and converting it into a CSV file, we break the data into
individual sentences. Then, using Natural Language Processing (NLP), we eliminate stop
words. Stop words are words that are regarded as useless in a sentence, that is, extra
data of no use for classification. For example, "the", "a", "an" and "in" are some
examples of stop words in English.
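A minimal sketch of this preprocessing step is shown below. It assumes NLTK is installed and its English stop word list has been downloaded; the exact cleaning steps used in the paper may differ.

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")          # one-time download of the stop word list
stop_words = set(stopwords.words("english"))

def preprocess(review: str) -> str:
    # Keep only letters, lowercase the text, and split into tokens.
    tokens = re.sub("[^a-zA-Z]", " ", review).lower().split()
    # Drop stop words such as "the", "a", "an", "in".
    kept = [word for word in tokens if word not in stop_words]
    return " ".join(kept)

print(preprocess("The food was good and the ambience was amazing."))
# prints: "food good ambience amazing"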

3. Model Training:

Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any
other feature. A Naive Bayes model is easy to build and particularly useful for very
large data sets. Along with its simplicity, Naive Bayes is known to outperform even
highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c),
P(x) and P(x|c), as shown in the equation below:

P(c|x) = P(x|c) * P(c) / P(x)          ... (I)
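As a small worked illustration of equation (I) (the numbers below are invented for the example, not taken from the paper's data), the posterior probability of a review being positive given that it contains the word "delicious" could be computed as:

# P(c)   : assumed prior probability that a review is positive
# P(x|c) : assumed probability of the word "delicious" appearing in a positive review
# P(x)   : assumed overall probability of the word "delicious" appearing in a review
p_c = 0.6
p_x_given_c = 0.20
p_x = 0.14

posterior = p_x_given_c * p_c / p_x
print(f"P(positive | 'delicious') = {posterior:.3f}")   # about 0.857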

4. Prediction

Here we use Google Chrome or another browser to run the model and visualize the output.
When we enter a review, the model classifies whether the review is positive or negative.

It gives the result in the form of two labels:
a. Positive label (rating greater than 3)
b. Negative label (rating less than 3)
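Putting the modules together, a minimal end-to-end sketch of the pipeline described above is shown below. It uses scikit-learn's CountVectorizer and MultinomialNB; the file name and column names are assumptions carried over from the loading sketch, and the paper's actual implementation may differ in its details.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the prepared dataset (assumed columns: 'Review', 'Liked').
df = pd.read_csv("Restaurant_Reviews.csv")

# Turn each review into a bag-of-words vector; English stop words are dropped here.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Review"])
y = df["Liked"]

# Split into training and testing sets, then fit the Naive Bayes model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Classify a new review as positive or negative.
new_review = vectorizer.transform(["The biryani was delicious and the staff were friendly"])
print("Positive" if model.predict(new_review)[0] == 1 else "Negative")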
3. DESIGN
3.1 BLOCK DIAGRAM
The block diagram is typically used for a higher level, less detailed description
aimed more at understanding the overall concepts and less at understanding the details
of implementation.

Figure 3.1.1 Block Diagram for Sentiment Analysis


3.2 DATAFLOW DIAGRAMS:
A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system, modelling its process aspects. Often it is a preliminary step used to
create an overview of the system, which can later be elaborated. DFDs can also be used
for the visualization of data processing (structured design). [5]

A DFD shows what kinds of information will be input to and output from the system, where
the data will come from and go to, and where the data will be stored. It does not show
information about the timing of processes, or about whether processes will operate in
sequence or in parallel. A DFD is also called a "bubble chart".

DFD Symbols:

In a DFD, there are four symbols:

• A square defines a source or destination of system data.
• An arrow indicates dataflow; it is the pipeline through which information flows.
• A circle or a bubble represents a process that transforms incoming dataflow into
outgoing dataflow.
• An open rectangle is a data store: data at rest, or a temporary repository of data.

Dataflow: Data move in a specific direction from an origin to a destination.

Process: People, procedures or devices that use or produce (transform) data. The
physical component is not identified.

Sources: External sources or destinations of data, which may be programs, organizations
or other entities.

Data store: Here data is stored or referenced by a process in the system.

In our project, we built the data flow diagrams at the very beginning of business
process modelling in order to model the functions that our project has to carry out and
the interaction between those functions, together with a focus on data exchanges between
processes.

3.2.1 Context level DFD:

A context level data flow diagram is created using the Structured Systems Analysis and
Design Method (SSADM). This level shows the overall context of the system and its
operating environment, and shows the whole system as just one process. It does not
usually show data stores, unless they are "owned" by external systems, e.g. accessed by
but not maintained by this system; however, these are often shown as external entities.
The context level DFD is shown in Fig. 3.2.1.

Figure 3.2.1 Context Level DFD for Sentiment Analysis

The context level data flow diagram shows the data flow from the application to the
database and to the system.

3.2.2 Top level DFD:

Figure 3.2.2 Top Level DFD for Sentiment Analysis

In the process of coming up with a data flow diagram for a business venture, the level
one DFD provides an overview of the major functional areas of the undertaking. After
presenting the values for the most important fields of discussion, it gives room for
level two to be drawn.

After starting and executing the application, training and testing of the dataset can be
done as shown in the above figure.
3.2.3 Detailed Level Diagram

This level explains each process of the system in a detailed manner. The first detailed
level DFD (generation of individual fields) shows how data flows through the individual
processes/fields in it.

The second detailed level DFD (generation of the detailed process of the individual
fields) shows how data flows through the system to form a detailed description of the
individual processes.

Figure 3.2.3.1 Detailed Level DFD for Sentiment Analysis

After starting and executing the application, training on the dataset is done by
dividing it into a 2D array and scaling it using normalization algorithms, and then
testing is done.

Figure 3.2.3.2 Detailed Level DFD for Sentiment Analysis


After starting and executing the application, training on the dataset is done using
linear regression and then testing is done.

3.3 UNIFIED MODELLING LANGUAGE DIAGRAMS:

The Unified Modelling Language (UML) is a standard language for specifying, visualizing,
constructing and documenting a software system and its components. The UML focuses on
the conceptual and physical representation of the system. It captures the decisions and
understandings about systems that must be constructed. A UML system is represented using
five different views that describe the system from distinctly different perspectives.
Each view is defined by a set of diagrams, as follows. [6]

• User Model View

i. This view represents the system from the user's perspective.

ii. The analysis representation describes a usage scenario from the end-user's
perspective.

• Structural Model View

i. In this model the data and functionality are viewed from inside the system.

ii. This model view models the static structures.

• Behavioral Model View

It represents the dynamic and behavioral parts of the system, depicting the interactions
between the various structural elements described in the user model and structural model
views.

• Implementation Model View

In this view the structural and behavioral parts of the system are represented as they
are to be built.

• Environmental Model View

In this view the structural and behavioral aspects of the environment in which the
system is to be implemented are represented.
3.3.1 Use Case Diagram:

Use case diagrams are one of the five diagrams in the UML for modeling the dynamic
aspects of systems (activity diagrams, sequence diagrams, state chart diagrams and
collaboration diagrams are the four other kinds of diagrams in the UML for modeling the
dynamic aspects of systems). Use case diagrams are central to modeling the behavior of a
system, a sub-system, or a class. Each one shows a set of use cases, actors and their
relationships.

Figure 3.3.1 USECASE DIAGRAM

3.3.2 Sequence Diagram:


A sequence diagram is an interaction diagram which focuses on the time ordering of
messages. It shows a set of objects and the messages exchanged between these objects.
This diagram illustrates the dynamic view of a system.
Figure 3.3.2 Sequence Diagram
3.3.3 Collaboration Diagram:
A collaboration diagram is an interaction diagram that emphasizes the structural
organization of the objects that send and receive messages. Collaboration diagrams and
sequence diagrams are isomorphic.

Figure 3.3.3 Collaboration Diagram


3.3.4 Activity Diagram:
An activity diagram shows the flow from activity to activity within a system; it
emphasizes the flow of control among objects.

Figure 3.3.4 Activity Diagram


3.3.5 DATA DICTIONARY

Fig 3.3.5 Data Dictionary


4. IMPLEMENTATION

Implementation is the stage of the project when the theoretical design is turned into a
working system. Thus it can be considered to be the most critical stage in achieving a
successful new system and in giving the user confidence that the new system will work
and be effective.

The implementation stage involves careful planning, investigation of the existing system
and its constraints on implementation, designing of methods to achieve changeover, and
evaluation of changeover methods.

The project is implemented by accessing it simultaneously from more than one system and
from more than one window on one system. The application is implemented on the Internet
Information Services 5.0 web server under Windows 7 and above and accessed from various
clients.

4.1 TECHNOLOGIES USED

What is Python?

Python is an interpreted, high-level programming language for general-purpose
programming, created by Guido van Rossum and first released in 1991. Python has a design
philosophy that emphasizes code readability, and a syntax that allows programmers to
express concepts in fewer lines of code, notably using significant whitespace. It
provides constructs that enable clear programming on both small and large scales. Python
features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural,
and has a large and comprehensive standard library.

Python is a general purpose, dynamic, high level and interpreted programming language.
It supports an object-oriented programming approach to developing applications. It is
simple and easy to learn and provides lots of high level data structures.

• Windows 10 & 11
• Python programming
• Open source libraries: Pandas, NumPy, SciPy, matplotlib, OpenCV
Python Versions
Python 2.0 was released on 16 October 2000 and had many major new features, including a
cycle-detecting garbage collector and support for Unicode. With this release, the
development process became more transparent and community-backed.

Python 3.0 (initially called Python 3000 or py3k) was released on 3 December 2008 after
a long testing period. It is a major revision of the language that is not completely
backward-compatible with previous versions. However, many of its major features have
been backported to the Python 2.6.x and 2.7.x version series, and releases of Python 3
include the 2to3 utility, which automates the translation of Python 2 code to Python 3.

Python 2.7's end-of-life date (a.k.a. EOL, sunset date) was initially set at 2015, then
postponed to 2020 out of concern that a large body of existing code could not easily be
forward-ported to Python 3. In January 2017, Google announced work on a Python 2.7 to Go
transcompiler to improve performance under concurrent workloads.

Python 3.6 had changes regarding UTF-8 (on Windows, PEP 528 and PEP 529), and Python
3.7.0b1 (PEP 540) adds a new "UTF-8 Mode" (which overrides the POSIX locale).

Why Python?

• Python is a scripting language like PHP, Perl, and Ruby.
• No licensing, distribution, or development fees.
• It supports desktop applications.
• It runs on Linux and Windows.
• Excellent documentation.
• Thriving developer community.
• Good job opportunities.


4.2 MACHINE LEARNING

Machine Learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves. [1]
Basics of python machine learning:

• You'll know how to use Python and its libraries to explore your data with the help of
matplotlib and Principal Component Analysis (PCA).
• And you'll preprocess your data with normalization and you'll split your data into
training and test sets.
• Next, you'll work with the well-known K-Means algorithm to construct an unsupervised
model, fit this model to your data, predict values, and validate the model that you have
built.
• As an extra, you'll also see how you can also use Support Vector Machines (SVM) to
construct another model to classify your data.

Why Machine Learning?

• It was born from pattern recognition and the theory that computers can learn without
being programmed to perform specific tasks.
• It is a method of data analysis that automates analytical model building.

Machine learning tasks are typically classified into two broad categories,
depending on whether there is a learning "signal" or "feedback" available to a learning
system. They are

Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted
to special feedback:

Semi-supervised learning: The computer is given only an incomplete training signal:


a training set with some (often many) of the target outputs missing.

Active learning: The computer can only obtain training labels for a limited set of
instances (based on a budget), and also has to optimize its choice of objects to acquire
labels for. When used interactively, these can be presented to the user for labelling.

Reinforcement learning: training data (in form of rewards and punishments) is given
only as feedback to the program's actions in a dynamic environment, such as driving a
vehicle or playing a game against an opponent.

Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself
(discovering hidden patterns in data) or a means towards an end (feature learning).

In regression, also a supervised problem, the outputs are continuous rather than
discrete.

Regression: The analysis or measure of the association between one variable (the
dependent variable) and one or more other variables (the independent variables), usually
formulated in an equation in which the independent variables have parametric
coefficients, which may enable future values of the dependent variable to be predicted.

Fig. 4.2 Regression analysis

What is Regression Analysis?


Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) variable and independent variable(s)
(predictors). This technique is used for forecasting, time series modelling and finding
the causal-effect relationship between variables; for example, the relationship between
rash driving and the number of road accidents by a driver.
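As a minimal regression sketch (the data points below are invented purely for illustration), a straight line can be fitted with scikit-learn's LinearRegression and then used to predict a new value:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented example: hours of rash driving per week vs. number of accidents.
hours = np.array([[1], [2], [3], [4], [5], [6]])
accidents = np.array([0, 1, 1, 2, 3, 3])

model = LinearRegression()
model.fit(hours, accidents)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted accidents for 8 hours:", model.predict([[8]])[0])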

Classification

A classification problem is when the output variable is a category, such as "red" or
"blue", or "disease" and "no disease". A classification model attempts to draw some
conclusion from observed values. Given one or more inputs, a classification model will
try to predict the value of one or more outcomes: for example, when filtering emails,
"spam" or "not spam", or when looking at transaction data, "fraudulent" or "authorized".
In short, classification either predicts categorical class labels or classifies data
(constructs a model) based on the training set and the values (class labels) in the
classifying attributes, and uses it in classifying new data.

There are a number of classification models. Classification models include

1. Logistic regression
2. Decision tree
3. Random forest
4. Naive Bayes.

1. Logistic Regression

Logistic regression is a supervised learning classification algorithm used to predict
the probability of a target variable. The nature of the target or dependent variable is
dichotomous, which means there are only two possible classes. In simple words, the
dependent variable is binary in nature, having data coded as either 1 (stands for
success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is
one of the simplest ML algorithms and can be used for various classification problems
such as spam detection, diabetes prediction, cancer detection, etc.
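As a brief sketch (the tiny dataset below is invented for illustration), a logistic regression classifier can be fitted and used to estimate P(Y=1) as follows:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented features per review: [review length in words, count of positive words];
# label 1 = liked, 0 = not liked.
X = np.array([[5, 0], [12, 1], [8, 3], [20, 4], [6, 1], [15, 5]])
y = np.array([0, 0, 1, 1, 0, 1])

clf = LogisticRegression()
clf.fit(X, y)

new_review = [[10, 2]]
print("P(Y=1):", clf.predict_proba(new_review)[0][1])
print("predicted class:", clf.predict(new_review)[0])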

2. Decision Tree

Decision Trees are a type of Supervised Machine Learning (that is you explain
what the input is and what the corresponding output is in the training data) where the
data is continuously split according to a certain parameter. The tree can be explained
by two entities, namely decision nodes and leaves.

3. Random Forest

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both classification and regression problems in
ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the
model.

As the name suggests, Random Forest is a classifier that contains a number of decision
trees built on various subsets of the given dataset and combines their outputs to
improve the predictive accuracy on that dataset. Instead of relying on one decision
tree, the random forest takes the prediction from each tree and, based on the majority
vote of predictions, predicts the final output.

A greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.

The diagram below explains the working of the Random Forest algorithm:

Fig 4.2.1 Image for Random Forest Algorithm
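A minimal sketch of this majority-vote idea with scikit-learn is shown below (the feature matrix is invented for illustration; the paper's own model is the Naive Bayes classifier described elsewhere):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Invented toy features for six reviews and their labels (1 = positive, 0 = negative).
X = np.array([[5, 0], [12, 1], [8, 3], [20, 4], [6, 1], [15, 5]])
y = np.array([0, 0, 1, 1, 0, 1])

# 100 decision trees, each trained on a bootstrap sample; the majority vote decides.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

print("predicted class for [10, 2]:", forest.predict([[10, 2]])[0])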

4. Naive Bayes: It is a classification technique based on Bayes' Theorem with an
assumption of independence among predictors. In simple terms, a Naive Bayes classifier
assumes that the presence of a particular feature in a class is unrelated to the
presence of any other feature. A Naive Bayes model is easy to build and particularly
useful for very large data sets. Along with its simplicity, Naive Bayes is known to
outperform even highly sophisticated classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c),
P(x) and P(x|c), as given in equation (I) above.

4.3 DEEP LEARNING

Deep Learning is a class of Machine Learning which performs much better on unstructured
data. Deep learning techniques are outperforming current machine learning techniques.
Deep learning enables computational models to learn features progressively from data at
multiple levels. The popularity of deep learning grew as the amount of available data
increased, along with the advancement of hardware that provides powerful computers.

Deep learning has emerged as a powerful machine learning technique that learns multiple
layers of representations or features of the data and produces state-of-the-art
prediction results. Along with the success of deep learning in many other application
domains, deep learning has also been popularly used for sentiment analysis in recent
years. [2]

Deep Learning Algorithms:

There are two types of algorithms:

1. Structured algorithms

2. Unstructured algorithms

1. Structured Algorithm

One of the structured algorithms is the Artificial Neural Network.

Artificial Neural Networks are computational models inspired by the human brain. Many
recent advancements in the field of Artificial Intelligence, including voice
recognition, image recognition and robotics, have been made using Artificial Neural
Networks. Artificial Neural Networks are biologically inspired simulations performed on
the computer to carry out certain specific tasks such as:

• Clustering
• Classification
• Pattern Recognition

2. Unstructured Algorithm

One of the unstructured algorithms is the Deep Neural Network.

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers
between the input and output layers. There are different types of neural networks, but
they always consist of the same components: neurons, synapses, weights, biases, and
activation functions. These components function similarly to the human brain and can be
trained like any other ML algorithm.

For example, a DNN that is trained to recognize dog breeds will go over the given image
and calculate the probability that the dog in the image is a certain breed. The user can
review the results and select which probabilities the network should display (above a
certain threshold, etc.) and return the proposed label. Each mathematical manipulation
as such is considered a layer, and complex DNNs have many layers, hence the name "deep"
networks.
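For illustration only, a minimal sketch of a small feed-forward network for binary sentiment classification is shown below, using Keras (assumed to be available via TensorFlow). The placeholder data and network shape are assumptions; the model actually reported in this paper is the Naive Bayes classifier described earlier.

import numpy as np
from tensorflow import keras

# Assume X is a (num_reviews, vocab_size) bag-of-words matrix and y holds 0/1 labels,
# for example produced by the CountVectorizer sketch shown earlier. Placeholder data here.
vocab_size = 1000
X = np.random.randint(0, 2, size=(200, vocab_size)).astype("float32")
y = np.random.randint(0, 2, size=(200,))

model = keras.Sequential([
    keras.layers.Input(shape=(vocab_size,)),
    keras.layers.Dense(64, activation="relu"),    # hidden layer 1
    keras.layers.Dense(16, activation="relu"),    # hidden layer 2
    keras.layers.Dense(1, activation="sigmoid"),  # probability of a positive review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)

print(model.predict(X[:1]))  # predicted probability for the first review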

Modules in python

Module: - A module allows you to logically organize your Python code.


Grouping related code into a module makes the code easier to understand and use. A
module is a Python object with arbitrarily named attributes that you can bind and
reference.
Pandas: -
Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labelled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real world data analysis in Python. Additionally, it has the broader goal of becoming
the most powerful and flexible open source data analysis / manipulation tool available
in any language. It is already well on its way toward this goal.

Pandas is well suited for many different kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel
spreadsheet.
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column
labels.
• Any other form of observational / statistical data sets. The data actually need not be
labelled at all to be placed into a pandas data structure.

The two primary data structures of pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering. For R users, DataFrame provides
everything that R's data frame provides and much more. Pandas is built on top of NumPy
and is intended to integrate well within a scientific computing environment with many
other 3rd party libraries. A few of the things that pandas does well:

• Easy handling of missing data (represented as NaN) in floating point as well as
non-floating-point data.
• Size mutability: columns can be inserted into and deleted from DataFrame and higher
dimensional objects.
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of
labels, or the user can simply ignore the labels and let Series, DataFrame, etc.
automatically align the data in computations.
• Powerful, flexible group-by functionality to perform split-apply-combine operations on
data sets, for both aggregating and transforming data.
• Easy conversion of ragged, differently-indexed data in other Python and NumPy data
structures into DataFrame objects.
• Intelligent label-based slicing, fancy indexing, and subsetting of large datasets;
merging and joining of data sets.
• Flexible reshaping and pivoting of data sets.
• Hierarchical labelling of axes (possible to have multiple labels per tick).
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files and
databases, and saving / loading data from the ultrafast HDF5 format.
• Time series-specific functionality: date range generation and frequency conversion,
moving window statistics, moving window linear regressions, date shifting and lagging,
etc.

Many of these principles are here to address the shortcomings frequently


experienced using other languages / scientific research environments. For data
scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analyzing / modelling it, then organizing the results of the analysis into
a form suitable for plotting or tabular display. Pandas is the ideal tool for all of these
tasks.

Pandas is fast. Many of the low-level algorithmic bits have been extensively tuned in
Cython code. However, as with anything else, generalization usually sacrifices
performance, so if you focus on one feature for your application you may be able to
create a faster specialized tool.

Pandas is a dependency of statsmodels, making it an important part of the statistical
computing ecosystem in Python.

Pandas has been used extensively in production in financial applications.
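To make a couple of these features concrete, here is a small illustrative sketch (the DataFrame contents are invented for the example): it fills a missing rating and then uses group-by to compute the average rating per sentiment label.

import numpy as np
import pandas as pd

reviews = pd.DataFrame({
    "review": ["great food", "slow service", "tasty", "awful", "nice place"],
    "rating": [5, 2, np.nan, 1, 4],        # one missing value (NaN)
    "label":  ["pos", "neg", "pos", "neg", "pos"],
})

# Fill the missing rating with the column mean, then aggregate by label.
reviews["rating"] = reviews["rating"].fillna(reviews["rating"].mean())
print(reviews.groupby("label")["rating"].mean())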

NumPy: -

NumPy, which stands for Numerical Python, is a library consisting of multidimensional
array objects and a collection of routines for processing those arrays. Using NumPy,
mathematical and logical operations on arrays can be performed. This tutorial explains
the basics of NumPy, such as its architecture and environment. It also discusses the
various array functions, types of indexing, etc. An introduction to Matplotlib is also
provided. All of this is explained with the help of examples for better understanding.

NumPy is a Python package. It stands for 'Numerical Python'. It is a library consisting
of multidimensional array objects and a collection of routines for processing arrays.
Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray,
was also developed, having some additional functionalities. In 2005, Travis Oliphant
created the NumPy package by incorporating the features of Numarray into the Numeric
package. There are many contributors to this open source project.

Operations using NumPy: -

Using NumPy, a developer can perform the following operations:

• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra; NumPy has in-built functions for linear algebra
and random number generation.
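A short illustrative sketch of such array operations is shown below (the ratings array is invented for the example):

import numpy as np

ratings = np.array([5, 3, 4, 1, 2, 5, 4])

print("mean rating:", ratings.mean())
print("ratings above 3:", ratings[ratings > 3])        # logical (boolean) indexing
print("normalized:", (ratings - ratings.min()) / (ratings.max() - ratings.min()))
print("random sample:", np.random.default_rng(0).choice(ratings, size=3))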

NumPy – A Replacement for MATLAB

NumPy is often used along with packages like SciPy (Scientific Python) and matplotlib
(plotting library). This combination is widely used as a replacement for MATLAB, a
popular platform for technical computing. However, the Python alternative to MATLAB is
now seen as a more modern and complete programming language. It is open source, which is
an added advantage of NumPy.

Scikit-learn: -

Scikit-learn (formerly scikits.learn) is a free software machine learning library for
the Python programming language. It features various classification, regression and
clustering algorithms including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical
and scientific libraries NumPy and SciPy.

The scikit-learn project started as scikits.learn, a Google Summer of Code project by
David Cournapeau. Its name stems from the notion that it is a "SciKit" (SciPy Toolkit),
a separately-developed and distributed third-party extension to SciPy.

The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa,
Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership
of the project and made the first public release on 1 February 2010. Of the various
scikits, scikit-learn and scikit-image were described as "well-maintained and popular"
in November 2012. Scikit-learn is largely written in Python, with some core algorithms
written in Cython to achieve performance. Support vector machines are implemented by a
Cython wrapper around LIBSVM; logistic regression and linear support vector machines by
a similar wrapper around LIBLINEAR.
Some popular groups of models provided by scikit-learn include:

a. Ensemble methods: for combining the predictions of multiple supervised models.

b. Feature extraction: for defining attributes in image and text data.

c. Feature selection: for identifying meaningful attributes from which to create
supervised models.

d. Parameter tuning: for getting the most out of supervised models.

e. Manifold learning: for summarizing and depicting complex multi-dimensional data.

f. Supervised models: a vast array not limited to generalized linear models,
discriminant analysis, naive Bayes, lazy methods, neural networks, support vector
machines and decision trees.
Matplotlib:-
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use
is discouraged. SciPy makes use of matplotlib.
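As a small illustrative sketch (the counts below are invented), matplotlib can be used to visualize the number of positive and negative reviews predicted by the model:

import matplotlib.pyplot as plt

labels = ["Positive", "Negative"]
counts = [640, 360]   # invented example counts

plt.bar(labels, counts, color=["green", "red"])
plt.ylabel("Number of reviews")
plt.title("Predicted sentiment distribution")
plt.show()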
5. TESTING

In the process of testing the given project, restaurant review analysis using an AI
classification system, we use the testing techniques given below.

Testing is the process of checking functionality; it is the process of executing a
program with the intent of finding an error. A good test case is one that has a high
probability of finding an as-yet-undiscovered error. A successful test is one that
uncovers an as-yet-undiscovered error. Software testing is usually performed for one of
two reasons:

1) Defect detection
2) Reliability estimation

5.1 BLACK BOX TESTING:

The basis of the black box testing strategy lies in the selection of appropriate data as
per functionality and testing it against the functional specifications in order to check
for normal and abnormal behavior of the system. Nowadays, it is becoming common to route
the testing work to a third party, as the developer of the system knows too much of the
internal logic and coding of the system, which makes it unfit for the developer to test
the application. The following are the different types of techniques involved in black
box testing:

• Decision table testing

• All-pairs testing

• State transition table testing

• Equivalence partitioning

Software testing is used in association with verification and validation. Verification
is the checking or testing of items, including software, for conformance and consistency
with an associated specification. Software testing is just one kind of verification,
which also uses techniques such as reviews, inspections and walk-throughs. Validation is
the process of checking that what has been specified is what the user actually wanted. [6]

• Validation: Are we doing the right job?
• Verification: Are we doing the job right?

In order to achieve consistency in the testing style, it is imperative to have and
follow a set of testing principles. This enhances the efficiency of testing within the
SQA team members and thus contributes to increased productivity. The purpose of this
document is to provide an overview of the testing and its techniques. Here, after
training is done on the training dataset, testing is done.

5.2 WHITE BOX TESTING:

White box testing [8] requires access to source code. Though white box testing [8] can
be performed at any time in the life cycle after the code is developed, it is good
practice to perform white box testing [8] during the unit testing phase.

In the design of the database, the flow of specific inputs through the code, the
expected output and the functionality of conditional loops are tested.

At SDEI, three levels of software testing are done at various SDLC phases:

• UNIT TESTING: in which each unit (basic component) of the software is tested to verify
that the detailed design for the unit has been correctly implemented.
• INTEGRATION TESTING: in which progressively larger groups of tested software
components corresponding to elements of the architectural design are integrated and
tested until the software works as a whole.
• SYSTEM TESTING: in which the software is integrated into the overall product and
tested to show that all requirements are met. Further levels of testing are also done,
in accordance with requirements:
• REGRESSION TESTING: used to refer to the repetition of earlier successful tests to
ensure that changes made in the software have not introduced new bugs/side effects.
• ACCEPTANCE TESTING: testing to verify that a product meets customer-specified
requirements. The acceptance test suite is run against the supplied input data, and the
results obtained are compared with the expected results of the client. A correct match
was obtained.
BIBLIOGRAPHY

[1] J. Friedman, T. Hastie, and R. Tibshirani, The Elements of Statistical Learning:
Data Mining, Inference, and Prediction, Second Edition, Springer Series in Statistics,
2009.

[2] G. Szabo and B. Huberman, "Predicting the popularity of online content,"
Communications of the ACM, 53(8), 2010.

[3] O. Hegazy and M. Abdul Salam, "A Machine Learning Model for Stock Market Prediction."

[4] G. Booch, J. Rumbaugh and I. Jacobson, The Unified Modeling Language User Guide, Low
Price Edition, ISBN: 81-7808-769-5, 1997.

[5] S. Ambler, The Elements of UML 2.0, Cambridge University Press, New York, 2005.

[6] C. Kaner, J. Bach and B. Pettichord, Software Testing, ISBN: 978-0-471-120-940, 2001.

[7] R. S. Pressman, Software Engineering: A Practitioner's Approach, 3rd Edition,
ISBN: 978-007-126782-3, 2014.

[8] B. Beizer, Black Box Testing: Techniques for Functional Testing of Software and
Systems, Wiley, 1995.
