Exploratory_Project_Report
Submitted in partial fulfilment of the requirements for the award of the Degree of
B. Tech- AIDS
Submitted By:
Mohi Ganeriwal [2024PUFCEBADX17926]
Mahak Dhakad [2024PUFCEBADX17926]
Submitted To:
Department of First Year
Faculty of Computer Science & Engineering,
Poornima University
Ramchandrapura, Sitapura Ext., Jaipur, Rajasthan- (303905)
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the Exploratory Project report entitled “Women
Safety App”, submitted by Mohi Ganeriwal [17926], is in fulfillment of the requirements for the
award of Bachelor of Technology in Data Science from Poornima University, Jaipur, during the
academic year 2024-25. The work has been found satisfactory, is an authentic record of my own
work carried out during my degree, and is approved for submission.
The work reported here has not been submitted by me for the award of any other degree or
diploma.
Date:
Mohi Ganeriwal [17926]
Mahak Dhakad [17925]
CERTIFICATE
Certified that the Exploratory Project work entitled “Women Safety App” is a bonafide work carried
out in the second semester by Mohi Ganeriwal [17926] in partial fulfillment of the requirements for
the award of Bachelor of Technology in Computer Science & Engineering with Specialization in
Data Science from Poornima University, Jaipur, during the academic year 2024-2025.
ACKNOWLEDGEMENT
I have undergone an exploratory project which was meticulously planned and guided at every
stage, so that it became a lifetime experience for me. This could not have been realized without
help from numerous sources and people at Poornima University and Programming Express.
I am thankful to Dr. Ajay Khunteta, Dean, FCE for providing us a platform to carry out this
activity successfully.
I am also very grateful to Mr. Pratish Rawat, HOD, for his kind support and guidance. I would
also like to take this opportunity to express my gratitude towards Ms. Deepali Chaudary, who
helped me in the successful completion of my project. She has been a guide, motivator and source
of inspiration for us while carrying out the training and related activities, and I am grateful for her
guidance and support.
I am thankful for their kind support and for providing us with the domain expertise needed to
develop the project. I would also like to express heartfelt appreciation to all of our friends whose
direct or indirect suggestions helped us develop this project, and to the entire team for their
valuable suggestions.
Lastly, thanks to all faculty members of Department of Computer Science & Engineering for their
moral support and guidance.
ABSTRACT
Text classification (a.k.a. text categorization) is one of the most prominent applications of
Machine Learning. It is used to assign predefined categories (labels) to free-text documents
automatically. The purpose of text classification is to give conceptual organization to a large
collection of documents. It has become more relevant with the exponential growth of data and has
broad applicability in real-world applications, e.g., categorization of news stories by topic,
classification of products into multilevel categories, organization of emails into various groups
(such as social, primary, and promotions), and indexing of patients’ data in the healthcare sector
from multiple aspects, e.g., disease type, surgical procedure and medication given, etc.
In this project we classify products using NLP. The Amazon catalog consists of billions of
products that belong to thousands of browse nodes (each browse node represents a collection of
items for sale). Browse nodes are used to help customers navigate through the website and to group
products into product types. So, it is important to predict the node assignment at the time of listing
of the product, or when the browse node information is absent. We will use product metadata to
classify products into browse nodes. We will also use big data analytics because our data is very
large. We have taken the dataset from the Amazon ML Challenge. The task is challenging because
working with real-life data is difficult; the dataset contains many ambiguities. We use NLP along
with a classification model to classify the products.
TABLE OF CONTENTS
Cover Page i
Candidate’s Declaration ii
Certificate iii
Acknowledgement iv
Abstract v
Table of Contents vi-vii
List of figures vii-viii
List of screenshots of Python scripts viii
List of Graphs viii-ix
List of Tables ix
Chapter 1- Introduction 1-6
1.1 Aims and Objectives 2
1.2 Problem Statement 2
1.3 Scope 2-3
1.4 Duration of Project 3
1.5 Dataset Description 3-4
1.6 Work Flow 4-5
1.7 Tools used 6
1.8 Platform used in model development 6
Chapter 2- Used libraries to develop project in python 7-10
2.1 Pandas 7
2.2 Numpy… 7
2.3 Matplotlib 8
2.4 Seaborn 8
2.5 Sklearn 9
2.6 RE 9
2.7 CSV 9
2.8 NLTK… 10
Chapter 3- Working of project 11-42
3.1 Study about the project 11-18
3.2 Data collection 18
3.3 Data cleaning/Preprocessing 18-24
3.4 Remove duplicates 24
3.5 Tokenization 24-25
3.6 Stemming 25-27
3.7 Lemmatization 27
3.8 Word embedding 27-31
3.9 EDA 31-34
3.10 Final feature engineering 34-35
3.11 Model building 35-40
3.12 Model testing and accuracy measures 40
3.13 Model deployment and saving 40-42
Chapter 5- Conclusion 44
Conclusions 44
LIST OF FIGURES
LIST OF SCREENSHOTS OF PYTHON SCRIPTS
Chapter 1
Introduction
Text data is a very important kind of data that we deal with nowadays. It is hard to work with
text data because we first need to transform it into numeric values; we cannot process raw text
directly. There are many tools available in Natural Language Processing (NLP) which make this
task easier. We can see how advanced tools such as Google Translate have become, and they exist
only because of these advances in technology.
NLP is defined as the automatic manipulation of natural language, like speech and text, by
software. Product classification is an area of study within NLP, and it is also one of the biggest
challenges for ecommerce companies. With the advancement of NLP, AI, and machine learning
techniques, it has become easier for companies to tackle product classification and similar
problems.
Product Classification is the Placement and Organization of products into their respective
categories based on their description with the help of Machine Learning, Deep Learning and
NLP. It sounds simple: choose the correct department for a product. However, this process is
complicated by the sheer volume of products on many ecommerce platforms. Furthermore, there
are many products that could belong to multiple categories.
The problem is that numerous sites contain incorrect product classifications. This is because these
sites often require merchants to input their product information and select categories manually.
Furthermore, sometimes multiple merchants select different categories for the same product. To
solve this problem, ecommerce sites often employ automated product classification.
For example, Amazon hosts around 350 million products on their platform. To help merchants
choose the correct category, Amazon and other ecommerce companies have automated product
categorization tools available. After simply inputting the title or a few words about the product,
the system can automatically choose the correct category for you.
When there are hundreds of millions of products in the catalogue, even a 1% increase in accuracy
can lead to millions of additional accurate classifications. As a result, many ecommerce
companies heavily invest toward improving their automatic product classification systems.
1.1 Objectives of the Project
1. The aim of the project is multi-class text classification of make-up products based on
their description.
2. Strengthen the user experience.
3. Improve search relevance.
4. Help customers find your site.
S.No. | Task | Time Required
1 | Study about the project and gather all information required to solve the problem | 12
2 | Data collection | 5
3 | Data cleaning/preprocessing using Python | 10
4 | Remove duplicates | 5
5 | Tokenization | 4
6 | Stemming | 6
7 | Word embedding | 2
8 | EDA (Exploratory Data Analysis) & uncovering facets of data | 7
9 | Final feature engineering | 4
10 | Model building using different classification algorithms | 15
11 | Model testing and different accuracy measures (AUC & ROC) | 5
12 | Model deployment | 5
Table 1 Tasks and Time Required
Overall Test dataset size – 110,775
1.6 Workflow
Fig1-Flow Chart1
Fig2-Flow Chart 2
Fig3-Model Architecture
1.7 Tools used
1. Python
2. Spark
3. NLP (Natural Language Processing)
4. Machine Learning Models
5. Model deployment Technique
1.8 Platform used in Model development
SYSTEM CONFIGURATION:
HARDWARE SPECIFICATIONS:
1. Processor: i5 or above
2. RAM: 8 GB (in case of Hadoop)
3. Disk Space: 100 GB (SSD/HDD)
SOFTWARE SPECIFICATIONS:
1. Windows 7/8/10
2. Linux/Ubuntu
3. VMware Workstation / VirtualBox
4. Jupyter Notebook (Python)
Chapter 2
Used libraries to develop the project in Python
2.1 Pandas
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the Numpy
package and its key data structure is called the DataFrame. DataFrames allow you to store and
manipulate tabular data in rows of observations and columns of variables. pandas is a Python
package that provides fast, flexible, and expressive data structures designed to make working with
structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and
intuitive.
Syntax in Python: import pandas as pd
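As a brief illustration of the syntax above, a minimal sketch of loading the project's product metadata with pandas is shown below. The file name "train.csv" and the exact columns are assumptions for illustration; the actual Amazon ML Challenge files may differ.

import pandas as pd

# Hypothetical file name; the actual data file may differ.
train_df = pd.read_csv("train.csv")

print(train_df.shape)            # number of rows and columns
print(train_df.head())           # first few products
print(train_df.isnull().sum())   # missing values per column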
2.2 Numpy
Numpy is a Python library used for working with arrays. It also has functions for working in the
domains of linear algebra, Fourier transforms, and matrices. Numpy was created in 2005 by Travis
Oliphant. It is an open source project and you can use it freely. Numpy stands for Numerical Python.
Syntax in Python: import numpy as np
Fig5-Numpy
2.3 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-
platform data visualization library built on Numpy arrays and designed to work with the broader
SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data
in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.
Syntax in Python: import matplotlib.pyplot as plt
Fig 6-Matplotlib
2.4 Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you explore and understand your data.
Its plotting functions operate on DataFrames and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its
dataset-oriented, declarative API lets you focus on what the different elements of your plots mean,
rather than on the details of how to draw them.
Syntax in Python: import seaborn as sns; sns.set(style="ticks", color_codes=True)
Fig7-Seaborn
2.5 Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction via a consistent interface in
Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Fig8-Sklearn
2.6 Re
The Python module re provides full support for Perl-like regular expressions in Python. The re
module raises the exception re.error if an error occurs while compiling or using a regular expression.
A regular expression is a special sequence of characters that helps you match or find other strings or
sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in
UNIX world.
Syntax in Python: import re
Fig9-re
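A minimal sketch of how re can be used for the URL and punctuation removal steps described later in Chapter 3. The helper name clean_text and the sample string are illustrative, not the project's exact code.

import re

def clean_text(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)          # keep only letters and spaces
    return re.sub(r"\s+", " ", text).strip().lower()  # collapse extra whitespace

print(clean_text("Logitech M185 Mouse!! https://example.com/item?id=1"))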
2.7 CSV
The Python CSV module is used to handle CSV files. CSV files can hold a lot of information, and
the CSV module lets Python read and write to CSV files with the reader() and writer() functions.
Syntax in Python: import csv
Fig10-CSV
2.8 Nltk
NLTK is a leading platform for building Python programs to work with human language data. It
provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
Syntax in Python: import nltk
Fig11-nltk
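A minimal sketch of typical NLTK usage for this project (tokenization and stop-word removal). The corpora downloads are one-time steps, and the sample sentence is illustrative only.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop-word lists

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is a nice wireless mouse for laptops")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['nice', 'wireless', 'mouse', 'laptops']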
Chapter 3
Working of Project
3.1 Study About the Project and gather all information required to
solve the problem
Text classification is a machine learning technique that assigns a set of predefined categories to
open-ended text. Text classifiers can be used to organize, structure, and categorize pretty much any
kind of text – from documents, medical studies, and files to content from all over the web.
For example, news articles can be organized by topics; support tickets can be organized by urgency;
chat conversations can be organized by language; brand mentions can be organized by sentiment; and
so on.
Text classification is one of the fundamental tasks in natural language processing with broad
applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
A text classifier can take a phrase as input, analyze its content, and then automatically assign
relevant tags, such as UI and Easy To Use.
1. Rule-based systems
Rule-based approaches classify text using a set of handcrafted rules. For example, to sort texts into
Sports and Politics, the first step is to define lists of words that characterize each group (words
related to sports, such as football, basketball, LeBron James, etc., and words related to politics, such
as Donald Trump, Hillary Clinton, Putin, etc.).
Next, when you want to classify a new incoming text, you'll need to count the number of sports-
related words that appear in the text and do the same for politics-related words. If the number of
sports-related word appearances is greater than the politics-related word count, then the text is
classified as Sports, and vice versa. For example, this rule-based system will classify the
headline "When is LeBron James' first game with the Lakers?" as Sports because it counted one
sports-related term (LeBron James) and it didn't count any politics-related terms.
Rule-based systems are human comprehensible and can be improved over time. But this approach
has some disadvantages. For starters, these systems require deep knowledge of the domain. They are
also time-consuming, since generating rules for a complex system can be quite challenging and
usually requires a lot of analysis and testing. Rule-based systems are also difficult to maintain and
don't scale well, given that adding new rules can affect the results of the pre-existing rules.
2. Machine learning based systems
Instead of relying on manually crafted rules, machine learning text classification learns to make
classifications based on past observations. By using pre-labeled examples as training data, machine
learning algorithms can learn the different associations between pieces of text and the particular
output (i.e., tags) expected for a particular input (i.e., text). A "tag" is the predetermined
classification or category that any given text could fall into.
The first step towards training a machine learning NLP classifier is feature extraction: a method is
used to transform each text into a numerical representation in the form of a vector. One of the most
frequently used approaches is bag of words, where a vector represents the frequency of a word in a
predefined dictionary of words.
For example, if we have defined our dictionary to have the following words {This, is, the, not,
awesome, bad, basketball}, and we wanted to vectorize the text "This is awesome," we would
have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).
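The dictionary example above can be reproduced with a few lines of Python. This is a toy sketch of the bag-of-words idea; a real project would typically use a vectorizer such as scikit-learn's CountVectorizer instead.

vocabulary = ["this", "is", "the", "not", "awesome", "bad", "basketball"]

def vectorize(text, vocab):
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]   # word frequency per dictionary entry

print(vectorize("This is awesome", vocabulary))     # [1, 1, 0, 0, 1, 0, 0]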
Then, the machine learning algorithm is fed with training data that consists of pairs of feature
sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:
Fig13-Text Classifier Training
Once it's trained with enough training samples, the machine learning model can begin to make
accurate predictions. The same feature extractor is used to transform unseen text to feature sets,
which can be fed into the classification model to get predictions on tags (e.g., sports, politics).
Naive Bayes
Naive Bayes classifiers are built on Bayes' Theorem, which can be stated as follows: the probability
of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true,
divided by the probability of B being true.
This means that any vector that represents a text will have to contain information about the
probabilities of the appearance of certain words within the texts of a given category, so that the
algorithm can compute the likelihood of that text belonging to the category.
Support Vector Machines
Support Vector Machines (SVM) is another powerful text classification machine learning algorithm
because, like Naive Bayes, SVM doesn't need much training data to start providing accurate results.
SVM does require more computational resources than Naive Bayes, but the results are even faster
and more accurate.
In short, SVM draws a line or "hyperplane" that divides a space into two subspaces. One
subspace contains vectors (tags) that belong to a group, and another subspace contains vectors
that do not belong to that group.
The optimal hyperplane is the one with the largest distance between each tag. In two dimensions it
looks like this:
Fig16-SVM Hyperplane
Those vectors are representations of your training texts, and a group is a tag you have tagged your
texts with.
As data gets more complex, it may not be possible to classify vectors/tags into only two categories.
So, it looks like this:
Fig17-SVM classify
But that's the great thing about SVM algorithms – they're "multi-dimensional." So, the more
complex the data, the more accurate the results will be. Imagine the above in three dimensions, with
an added Z-axis, to create a circle.
Mapped back to two dimensions, the ideal hyperplane looks like this:
Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works,
called neural networks. Deep learning architectures offer huge benefits for text classification because
they perform at super high accuracy with lower-level engineering and computation.
The two main deep learning architectures for text classification are Convolutional Neural Networks
(CNN) and Recurrent Neural Networks (RNN).
Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of
events. It's similar to how the human brain works when making decisions, using different techniques
simultaneously to process huge amounts of data.
Deep learning algorithms do require much more training data than traditional machine learning
algorithms (at least millions of tagged examples). However, unlike traditional machine learning
algorithms such as SVM and NB, they don't have a threshold for learning from training data: deep
learning classifiers continue to get better the more data you feed them with:
Fig19-Deep-Learning Vs SVM
Deep learning algorithms, like Word2Vec or GloVe are also used in order to obtain better vector
representations for words and improve the accuracy of classifiers trained with traditional machine
learning algorithms.
Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to
further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules
for those conflicting tags that haven't been correctly modeled by the base classifier.
Metrics and Evaluation
Cross-validation is a common method to evaluate the performance of a text classifier. It works by
splitting the training dataset into random, equal-length example sets (e.g., 4 sets with 25% of the
data). For each set, a text classifier is trained with the remaining samples (e.g., 75% of the samples).
Next, the classifiers make predictions on their respective sets, and the results are compared against
the human-annotated tags. This will determine when a prediction was right (true positives and true
negatives) and when it made a mistake (false positives, false negatives).
With these results, you can build performance metrics that are useful for a quick assessment on how
well a classifier works:
Accuracy: the percentage of texts that were categorized with the correct tag.
Precision: the percentage of examples the classifier got right out of the total number of examples
that it predicted for a given tag.
Recall: the percentage of examples the classifier predicted for a given tag out of the total number of
examples it should have predicted for that given tag.
F1 Score: the harmonic mean of precision and recall.
Balanced Accuracy: the balanced accuracy is used in binary and multiclass classification problems
to deal with imbalanced datasets. It is defined as the average of the recall obtained on each class.
The best value is 1 and the worst value is 0.
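These metrics are all available in scikit-learn; the sketch below computes them on a small set of hypothetical true and predicted tags (the labels are illustrative only, not the project's data).

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

y_true = ["sports", "sports", "politics", "politics", "sports"]
y_pred = ["sports", "politics", "politics", "politics", "sports"]

print(accuracy_score(y_true, y_pred))                    # accuracy
print(precision_score(y_true, y_pred, average="macro"))  # precision (macro-averaged)
print(recall_score(y_true, y_pred, average="macro"))     # recall (macro-averaged)
print(f1_score(y_true, y_pred, average="macro"))         # F1 score
print(balanced_accuracy_score(y_true, y_pred))           # balanced accuracy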
Fig 4 Script Screenshot Train data information
Fig 5 Script Screenshot Train and Test Data Set Descriptions
Remove punctuation
Remove URLs
Fig 8 Script Screenshot Remove URLs
Fig 13 Script Screenshot Removes Stopwords from Data Set
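Putting the cleaning steps shown in the screenshots above together, the sketch below shows how they might be applied to a product-title column of the training DataFrame. It reuses the train_df loaded earlier, and the column names "TITLE" and "clean_title" are assumptions for illustration.

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))   # assumes the NLTK stop-word corpus is downloaded

def preprocess(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)           # remove punctuation, digits, emojis
    words = [w for w in text.split() if w not in stop_words]   # remove stop words
    return " ".join(words)

# Hypothetical column names for illustration.
train_df["clean_title"] = train_df["TITLE"].apply(preprocess)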
3.5 Tokenization:
There are other libraries such as Keras, spaCy, etc., which also support stop-word corpora by
default. Once we are done with the data cleaning, we perform tokenization on the dataset.
Tokenization splits the sentences into small pieces, a.k.a. tokens. A token can be a single word or
number, or it can also be a sentence. Once we are done with tokenization we can continue our data
cleaning with the methods below for better results. The NLTK library supports most of these
methods as functions.
Removal of Numerical/Alpha-Numeric words
Stemming / Lemmatization (Finding the root word)
Part of Speech (POS) Tagging
Create bi-grams or tri-grams model
Dealing with Typos
Splitting the attached words (eg: WeCanSplitThisSentence)
Spelling and Grammar Correction
Fig 14 Script Screenshot of Tokenization
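A small sketch of tokenization with NLTK, together with the bi-gram/tri-gram step mentioned in the list above. The sample title is illustrative only.

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("wireless optical mouse with usb receiver")
print(tokens)                        # each word becomes a token

bigrams = list(ngrams(tokens, 2))    # pairs of consecutive tokens
trigrams = list(ngrams(tokens, 3))   # triples of consecutive tokens
print(bigrams[:3])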
3.6 Stemming
Stemming is the process of reducing inflected or derived words to their root/base word. Stemming
programs are commonly referred to as stemming algorithms or stemmers.
Often when searching text for a certain keyword, it helps if the search returns variations of the word.
For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be
the stem for [boat, boater, boating, boats].
Porter Stemmer
One of the most common — and effective — stemming tools is Porter's Algorithm, developed by
Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set
of mapping rules.
Fig 15 Script Screenshot Stemming with the Porter Stemmer
Snowball Stemmer
This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by
Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or
"Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic
and speed.
Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas
lemmatization would likely return either see or saw depending on whether the use of the token was as
a verb or a noun.
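A minimal sketch comparing the Porter and Snowball (English/"Porter2") stemmers from NLTK on a few words; the word list is illustrative.

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # the "English" / "Porter2" stemmer

for word in ["boats", "boating", "running", "runner"]:
    print(word, "->", porter.stem(word), snowball.stem(word))
# e.g. "boats" and "boating" both reduce to the stem "boat", and "running" to "run"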
Fig 17 Script Screenshot Before and After Stemming
3.7 Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full
vocabulary to apply a morphological analysis to words. The lemma of 'was' is 'be' and the lemma
of 'mice' is 'mouse'.
Lemmatization is typically seen as much more informative than simple stemming, which is why
spaCy has opted to only provide lemmatization instead of stemming.
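A small sketch of lemmatization with spaCy, as mentioned above. It assumes the small English model has been installed (python -m spacy download en_core_web_sm); the sample sentence is illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
doc = nlp("The mice were running because the light was on")
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. 'mice' -> 'mouse' and 'was' -> 'be', as described above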
Fig20-Continuous Bag of Words (CBOW)
Fig21-Skip gram
After applying the above neural embedding methods, we get trained vectors of each word after many
iterations through the corpus. These trained vectors preserve syntactical or semantic information and
are converted to lower dimensions. The vectors with similar meaning or semantic information are
placed close to each other in space.
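A minimal Word2Vec sketch using gensim (a library assumed here for illustration; it is not listed in Chapter 2). Setting sg=0 trains the CBOW architecture and sg=1 the skip-gram architecture shown in the figures above; the toy sentences stand in for tokenized product titles.

from gensim.models import Word2Vec

sentences = [
    ["wireless", "optical", "mouse"],
    ["wireless", "bluetooth", "keyboard"],
    ["usb", "optical", "mouse"],
]

# vector_size/epochs follow the gensim 4.x parameter names
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["mouse"].shape)          # 50-dimensional vector for 'mouse'
print(model.wv.most_similar("mouse"))   # words closest to 'mouse' in the embedding space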
2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus, iterate
through it, and get the co-occurrence of each word with the other words in the corpus. We get a
co-occurrence matrix through this. Words which occur next to each other get a value of 1; if they
are one word apart then 1/2, if two words apart then 1/3, and so on.
Let us take an example to understand how the matrix is created. We have a small corpus: Corpus:
It is a nice evening. Good Evening!
Is it a nice evening?
Fig22-GloVe Example
The upper half of the matrix will be a reflection of the lower half. We can consider a window frame
as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps
gather information about the context in which the word is used.
Initially, the vectors for each word are assigned randomly. Then we take two pairs of vectors and see
how close they are to each other in space. If they occur together more often (i.e., have a higher value
in the co-occurrence matrix) but are far apart in space, then they are brought closer to each other. If
they are close to each other in space but are rarely or never used together, then they are moved
further apart. After many iterations of this process, we'll get a vector space representation that
approximates the information from the co-occurrence matrix. The performance of GloVe is better
than Word2Vec in terms of both semantic and syntactic capturing.
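The weighting scheme described above (1 for adjacent words, 1/2 for words one position apart, and so on) can be sketched for the toy corpus as follows. The window size and the plain-Python counting are illustrative only, not how GloVe is actually trained at scale.

from collections import defaultdict

sentences = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

cooc = defaultdict(float)
window = 3   # how far apart two words may be and still co-occur
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(i + 1, min(i + 1 + window, len(sent))):
            weight = 1.0 / (j - i)            # 1 if adjacent, 1/2 if one word apart, ...
            cooc[(word, sent[j])] += weight
            cooc[(sent[j], word)] += weight   # the matrix is symmetric

print(cooc[("nice", "evening")])   # co-occurrence weight of 'nice' and 'evening'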
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them are:
SpaCy
fastText
Flair etc.
Common Errors Made:
You need to use exactly the same pipeline during deployment of your model as was used to
create the training data for the word embedding. If you use a different tokenizer or a different
method of handling white space, punctuation, etc., you might end up with incompatible inputs.
Words in your input that don't have a pre-trained vector are known as out-of-vocabulary (OOV)
words. What you can do is replace those words with "UNK", which means unknown, and then
handle them separately.
Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of
length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into
errors. So make sure to use the same dimensions throughout.
Benefits of using Word Embeddings:
It is much faster to train than hand-built models like WordNet (which uses graph embeddings)
Almost all modern NLP applications start with an embedding layer
It stores an approximation of meaning
Drawbacks of Word Embeddings:
It can be memory intensive
It is corpus dependent: any underlying bias will have an effect on your model
It cannot distinguish between homophones, e.g., brake/break, cell/sell, weather/whether, etc.
Graph-1 Data set Dimensions
Graph-2 Unique Values in Training Data Set
Graph-4 Missing Values in Training Data Set
Graph-6 Title Sequence Vs No. of Sequences
Fig23-KNN
We want to select a value of K that is reasonable: not too big (it will simply predict the majority
class among all data samples) and not too small (it will be overly sensitive to noise in the data).
Fig 19 Script Screenshot KNN Model Building
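A sketch of the KNN model building step with scikit-learn. For brevity it uses TF-IDF features and a tiny hand-made dataset; in the project the inputs are the cleaned product titles/descriptions (embedded as described in Section 3.8) and the targets are the browse nodes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Illustrative texts and labels only, not the project's data.
texts = ["wireless optical mouse", "usb gaming mouse", "mechanical keyboard",
         "cotton summer dress", "floral printed dress", "denim jacket"]
labels = ["electronics", "electronics", "electronics", "apparel", "apparel", "apparel"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, knn.predict(X_test)))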
Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and
Bayes' Theorem to predict the tag of a text (like a piece of news or a customer review). They are
probabilistic, which means that they calculate the probability of each tag for a given text, and then
output the tag with the highest one. The way they get these probabilities is by using Bayes' Theorem,
which describes the probability of a feature based on prior knowledge of conditions that might be
related to that feature.
We're going to be working with an algorithm called Multinomial Naive Bayes. We'll walk through
the algorithm applied to NLP with an example, so by the end, not only will you know how this
method works, but also why it works. Then, we'll lay out a few advanced techniques that can make
Naive Bayes competitive with more complex machine learning algorithms, such as SVM and neural
networks.
Let's see how this works in practice with a simple example. Suppose we are building a
classifier that says whether a text is about sports or not. Our training data has 5 sentences:
Table 4 Naive Bayes Example
Now, which tag does the sentence "A very close game" belong to?
Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence
"A very close game" is Sports and the probability that it's Not Sports. Then, we take the largest one.
Written mathematically, what we want is P (Sports | a very close game) — the probability that the
tag of a sentence is Sports given that the sentence is "A very close game".
Feature Engineering
The first thing we need to do when creating a machine learning model is to decide what to use as
features. We call features the pieces of information that we take from the text and give to the
algorithm so it can work its magic. For example, if we were doing classification on health, some
features could be a person‘s height, weight, gender, and so on. We would exclude things that maybe
are known but aren‘t useful to the model, like a person‘s name or favorite color.
In this case though, we don‘t even have numeric features. We just have text. We need to
somehow convert this text into numbers that we can do calculations on.
So what do we do? Simple! We use word frequencies. That is, we ignore word order and sentence
construction, treating every document as a set of the words it contains. Our features will be the
counts of each of these words. Even though it may seem too simplistic an approach, it works
surprisingly well.
Bayes’ Theorem
Now we need to transform the probability we want to calculate into something that can be calculated
using word frequencies. For this, we will use some basic properties of probabilities, and Bayes'
Theorem. If you feel like your knowledge of these topics is a bit rusty, read up on it, and you'll be up
to speed in a couple of minutes.
Bayes' Theorem is useful when working with conditional probabilities (like we are doing here),
because it provides us with a way to reverse them:
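In symbols (reconstructing the equation that the original screenshot showed), Bayes' Theorem is:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}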
In our case, we have P (Sports | a very close game), so using this theorem we can reverse the
conditional probability:
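Written out (again reconstructing the missing equation), this gives:

P(\text{Sports} \mid \text{a very close game}) = \frac{P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports})}{P(\text{a very close game})}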
Since for our classifier we're just trying to find out which tag has a bigger probability, we can
discard the divisor — which is the same for both tags — and just compare with:
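That is, we compare (reconstructed from the surrounding text):

P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports}) \quad \text{versus} \quad P(\text{a very close game} \mid \text{Not Sports}) \times P(\text{Not Sports})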
This is better, since we could actually calculate these probabilities! Just count how many times the
sentence "A very close game" appears in the Sports tag, divide it by the total, and obtain P (a very
close game | Sports).
There's a problem though: "A very close game" doesn't appear in our training data, so this
probability is zero. Unless every sentence that we want to classify appears in our training data, the
model won't be very useful.
Being Naive
So here comes the Naive part: we assume that every word in a sentence is independent of the other
ones. This means that we're no longer looking at entire sentences, but rather at individual words. So
for our purposes, "this was a fun party" is the same as "this party was fun" and "party fun was this".
We write this as:
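In symbols (the equation itself is missing from the text, so this reconstructs it from the description):

P(\text{a very close game}) = P(a) \times P(\text{very}) \times P(\text{close}) \times P(\text{game})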
This assumption is very strong but super useful. It's what makes this model work well with little data
or data that may be mislabeled. The next step is just applying this to what we had before:
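That is (reconstructed):

P(\text{a very close game} \mid \text{Sports}) = P(a \mid \text{Sports}) \times P(\text{very} \mid \text{Sports}) \times P(\text{close} \mid \text{Sports}) \times P(\text{game} \mid \text{Sports})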
And now, all of these individual words actually show up several times in our training data, and we
can calculate them!
Calculating Probabilities
The final step is just to calculate every probability and see which one turns out to be larger.
Calculating a probability is just counting in our training data.
First, we calculate the a priori probability of each tag: for a given sentence in our training data, the
probability that it is Sports, P (Sports), is ⅗. Then, P (Not Sports) is ⅖. That's easy enough.
Then, calculating P (game | Sports) means counting how many times the word "game" appears in
Sports texts (2) divided by the total number of words in Sports (11). Therefore,
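That is (reconstructing the missing value):

P(\text{game} \mid \text{Sports}) = \frac{2}{11}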
However, we run into a problem here: "close" doesn't appear in any Sports text! That means that
P (close | Sports) = 0. This is rather inconvenient, since we are going to be multiplying it with the
other probabilities, so the whole product ends up being 0: in a multiplication, if one of the terms is
zero, the whole calculation is nullified. Doing things this way simply doesn't give us any
information at all, so we have to find a way around it.
How do we do it? By using something called Laplace smoothing: we add 1 to every count so it's
never zero. To balance this, we add the number of possible words to the divisor, so the division will
never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but',
'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].
Since the number of possible words is 14 (I counted them!), applying smoothing we get, for
example, P (game | Sports) = (2 + 1) ÷ (11 + 14). The full results are:
Word   | P (word | Sports)     | P (word | Not Sports)
a      | (2 + 1) ÷ (11 + 14)   | (1 + 1) ÷ (9 + 14)
very   | (1 + 1) ÷ (11 + 14)   | (0 + 1) ÷ (9 + 14)
close  | (0 + 1) ÷ (11 + 14)   | (1 + 1) ÷ (9 + 14)
game   | (2 + 1) ÷ (11 + 14)   | (0 + 1) ÷ (9 + 14)
Table 5 Results of Naive Bayes Example
Now we just multiply all the probabilities, and see which is bigger:
Excellent! Our classifier gives "A very close game" the Sports tag.
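The whole worked example can be reproduced in a few lines of Python. The five training sentences below are assumed from the word list and counts quoted above (Table 4 itself is not reproduced in the text), so treat this as an illustrative sketch rather than the project's code.

from collections import Counter

# Assumed training sentences, consistent with the counts used above
# (11 words tagged Sports, 9 tagged Not Sports, 14 distinct words).
train = [
    ("a great game", "Sports"),
    ("the election was over", "Not Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not Sports"),
]

word_counts = {"Sports": Counter(), "Not Sports": Counter()}
class_counts = Counter()
for text, tag in train:
    class_counts[tag] += 1
    word_counts[tag].update(text.split())

vocabulary = {w for text, _ in train for w in text.split()}   # the 14 possible words

def score(text, tag):
    prior = class_counts[tag] / sum(class_counts.values())    # e.g. P(Sports) = 3/5
    total = sum(word_counts[tag].values())                    # 11 for Sports, 9 for Not Sports
    likelihood = 1.0
    for word in text.split():
        # Laplace smoothing: add 1 to each count and |vocabulary| to the divisor
        likelihood *= (word_counts[tag][word] + 1) / (total + len(vocabulary))
    return prior * likelihood

for tag in ("Sports", "Not Sports"):
    print(tag, score("a very close game", tag))
# The Sports score comes out larger, so the sentence is tagged Sports.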
When the same trained model is needed for some other project or at a later time, to avoid wasting the
training time, we store the trained model so that it can be used anytime in the future.
There are two ways we can save a model in scikit-learn:
1. Pickle string: the pickle module implements a fundamental, but powerful, algorithm for
serializing and de-serializing a Python object structure.
2. Pickled model as a file using joblib: joblib.dump is used to serialize an object hierarchy and
joblib.load to deserialize a data stream. These functions also accept file-like objects instead of
filenames.
Example: let's apply K-Nearest Neighbors on the iris dataset and then save the model.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.externals import joblib  # in newer scikit-learn versions use: import joblib

# Load dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

# Train the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Save the model to a file
joblib.dump(knn, 'filename.pkl')

# Load the model from the file and use it to make predictions
knn_from_joblib = joblib.load('filename.pkl')
knn_from_joblib.predict(X_test)
Chapter 4
Future Scope
Chapter 5
Conclusion
Text data is a very important kind of data that we deal with nowadays. It is hard to work with text
data, find meaningful insights in it, and use them for business growth. In this project we classified
product brands using NLP. Amazon is a very big company in the market; the Amazon catalog
consists of billions of products that belong to thousands of browse nodes (each browse node
represents a collection of items for sale). Browse nodes are used to help customers navigate through
the website and to group products into product types. So, it is important to predict the node
assignment at the time of listing of the product, or when the browse node information is absent. We
classified the product brands using NLP along with big data analytics. In this project we applied
many preprocessing steps, such as removing URLs, stop words, emojis, and other unwanted content
which is not necessary. After that we performed tokenization, stemming, lemmatization, word
embeddings, and much more. Finally, we performed feature engineering and applied KNN and
Naive Bayes models; the balanced accuracy of the KNN model is 49.87% and the balanced accuracy
of the Naive Bayes model is 64%. So, we conclude that Naive Bayes works better than KNN for this
task.
Chapter 6
References
[1] "Classifying Itens with NLP", Medium, 2021. [Online]. Available:
https://medium.com/neuronio/classifying-itens-with-nlp-b3b28a4b7873. [Accessed: 16- Aug-
2021].
[2] "What is Natural Language Processing? An Introduction to NLP", SearchEnterpriseAI, 2021.
[Online]. Available: https://searchenterpriseai.techtarget.com/definition/natural-language-
processing-NLP. [Accessed: 16- Aug- 2021].
[3] "GitHub - aniass/Product-Categorization-NLP: Multi-Class Text Classification for products
based on their description with Machine Learning algorithms and Neural Networks (MLP,
CNN).", GitHub, 2021. [Online]. Available: https://github.com/aniass/Product-
Categorization- NLP. [Accessed: 16- Aug- 2021].
[4] "Beginner's Guide to Product Categorization in Machine Learning | Hacker Noon",
Hackernoon.com, 2021. [Online]. Available: https://hackernoon.com/beginners-guide-to-
product- categorization-in-machine-learning-bai3tip. [Accessed: 16- Aug- 2021].
[5] A. Saeed, "Research paper categorization using machine learning and NLP", Aqibsaeed.github.io,
2021. [Online]. Available: http://aqibsaeed.github.io/2016-07-26-text- classification/. [Accessed: 16-
Aug- 2021].
[6] How to use NLP in Python: a Practical Step-by-Step Example - Just into Data", Just into Data, 2021.
[Online]. Available: https://www.justintodata.com/use-nlp-in-python-practical-step- by-step-example/.
[Accessed: 16- Aug- 2021].
[7] N. Guide, "Natural Language Processing Step by Step Guide | NLP for Data Scientists", Analytics
Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/05/natural-
language-processing-step-by-step-guide/. [Accessed: 16- Aug- 2021].
[8] U. Example, "Text Classification in Natural Language Processing", Analytics Vidhya, 2021. [Online].
Available: https://www.analyticsvidhya.com/blog/2020/12/understanding-text- classification-in-nlp-
with-movie-review-example-example/. [Accessed: 16- Aug- 2021].
[9] H. problems, B. DialogFlow, B. Processing and R. NLP, "How to solve NLP problems | i2tutorials",
i2tutorials, 2021. [Online].
[10] By: IBM Cloud Education, What is natural language processing? IBM. Available at:
https://www.ibm.com/cloud/learn/natural-language-processing [Accessed August 16, 2021].
[11] Chirag, W.by, 2017. NLP for big data Analytics - a guide. EuroSTAR Huddle. Available at:
https://huddle.eurostarsoftwaretesting.com/nlp-for-big-data-how-nlp-will-revolutionise-big-data-
analytics/ [Accessed August 16, 2021].
56
[12] By krishna Gandhi blog, P., 2020. NLP and big data. Data Science.
Available at: https://dapperdatadig.wordpress.com/2020/09/04/nlp-and-big-data/
[Accessed August 16, 2021].
[13] Anon, Amazon ml challenge. HackerEarth. Available at:
https://www.hackerearth.com/challenges/competitive/amazon-ml-challenge/ [Accessed
August 16, 2021].
57