Exploratory_Project_Report
Submitted in partial fulfilment of the requirements for the award of the Degree of
B. Tech- AIDS
Submitted By:
Mohi Ganeriwal [2024PUFCEBADX17926]
Mahak Dhakad [2024PUFCEBADX17926]
Submitted To:
Department of First Year
Faculty of Computer Science & Engineering,
Poornima University
Ramchandrapura, Sitapura Ext., Jaipur, Rajasthan- (303905)
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the Exploratory Project report entitled “Women
Safety App”, submitted by Mohi Ganeriwal [17926], is in fulfillment of the requirements for the
award of Bachelor of Technology in Data Science from Poornima University, Jaipur, during the
academic year 2024-25. The work has been found satisfactory, is an authentic record of my own
work carried out during my degree, and is approved for submission.
The work reported here has not been submitted by me for the award of any other degree or
diploma.
Date:
Mohi Ganeriwal [17926]
Mahak Dhakad [17925]
CERTIFICATE
Certified that the Exploratory Project work entitled “Women Safety App” is a bonafide work carried
out in the second semester by Mohi Ganeriwal [17926] in partial fulfillment of the requirements for
the award of Bachelor of Technology in Computer Science & Engineering with Specialization in
Data Science from Poornima University, Jaipur, during the academic year 2024-2025.
ACKNOWLEDGEMENT
I have undergone an exploratory project which was meticulously planned and guided at every
stage, so that it became a lifetime experience for me. This could not have been realized without
help from numerous sources and people at Poornima University and Programming Express.
I am thankful to Dr. Ajay Khunteta, Dean, FCE for providing us a platform to carry out this
activity successfully.
I am also very grateful to Mr. Pratish Rawat, HOD, for his kind support and guidance. I would
also like to take this opportunity to express my gratitude towards Ms. Deepali Chaudary, who
helped me in the successful completion of my project. She has been a guide, motivator and source
of inspiration for us while carrying out the training and related activities, and I am grateful for her
guidance and support.
I am thankful for their kind support and for providing us with the domain expertise needed to
develop the project. I would also like to express heartfelt appreciation to all of our friends whose
direct or indirect suggestions helped us develop this project, and to the entire team for their
valuable suggestions.
Lastly, thanks to all faculty members of Department of Computer Science & Engineering for their
moral support and guidance.
ABSTRACT
Text classification (a.k.a. text categorization) is one of the most prominent applications of
Machine Learning. It is used to assign predefined categories (labels) to free-text documents
automatically. The purpose of text classification is to give conceptual organization to a large
collection of documents. It has become more relevant with the exponential growth of data and has
broad applicability in real-world applications, e.g., categorization of news stories by topic,
classification of products into multilevel categories, organization of emails into various groups
(such as social, primary, and promotions), and indexing of patients’ data in the healthcare sector
from multiple aspects, e.g., disease type, surgical procedure and medication given, etc.
In this project we classify products using NLP. The Amazon catalog consists of billions of
products that belong to thousands of browse nodes (each browse node represents a collection of
items for sale). Browse nodes are used to help customers navigate through the website and to group
products into product types. So, it is important to predict the node assignment at the time of listing
of the product, or when the browse node information is absent. We will use product metadata to
classify products into browse nodes. We will also use big data analytics because our data is very
large. We have taken the dataset from the Amazon ML Challenge. The task is challenging because
working with real-life data is difficult; the dataset contains many ambiguities. We use NLP along
with a classification model to classify the products.
TABLE OF CONTENTS
Cover Page i
Candidate’s Declaration ii
Certificate iii
Acknowledgement iv
Abstract v
Table of Contents vi-vii
List of figures vii-viii
List of screenshots of Python scripts viii
List of Graphs viii-ix
List of Tables ix
Chapter 1- Introduction 1-6
1.1 Aims and Objectives 2
1.2 Problem Statement 2
1.3 Scope 2-3
1.4 Duration of Project 3
1.5 Dataset Description 3-4
1.6 Work Flow 4-5
1.7 Tools used 6
1.8 Platform used in model development 6
Chapter 2- Used libraries to develop project in python 7-10
2.1 Pandas 7
2.2 Numpy… 7
2.3 Matplotlib 8
2.4 Seaborn 8
2.5 Sklearn 9
2.6 RE 9
2.7 CSV 9
2.8 NLTK… 10
Chapter 3- Working of project 11-42
3.1 Study about the project 11-18
3.2 Data collection 18
3.3 Data cleaning/Preprocessing 18-24
3.4 Remove duplicates 24
3.5 Tokenization 24-25
3.6 Stemming 25-27
3.7 Lemmatization 27
3.8 Word embedding 27-31
3.9 EDA 31-34
3.10 Final feature engineering 34-35
3.11 Model building 35-40
3.12 Model testing and accuracy measures 40
3.13 Model deployment and saving 40-42
Chapter 5- Conclusion 44
Conclusions 44
LIST OF FIGURES
LIST OF SCREENSHOTS OF PYTHON SCRIPTS
Chapter 1
Introduction
Text data is a very important kind of data that we deal with nowadays. It is hard to work with
text data because we first need to transform it into numeric values; we cannot process raw text
directly. There are many tools available in Natural Language Processing (NLP) which make this
task easier. We can see how advanced tools such as Google Translate have become, and they exist
only because of these advances in technology.
NLP is defined as the automatic manipulation of natural language, like speech and text, by
software. Product classification is an area of study within NLP, and it is also one of the biggest
challenges for ecommerce companies. With the advancement of NLP, AI, and machine learning
techniques, it has become easier for companies to tackle product classification and similar
problems.
Product Classification is the Placement and Organization of products into their respective
categories based on their description with the help of Machine Learning, Deep Learning and
NLP. It sounds simple: choose the correct department for a product. However, this process is
complicated by the sheer volume of products on many ecommerce platforms. Furthermore, there
are many products that could belong to multiple categories.
The problem is that numerous sites contain incorrect product classifications. This is because these
sites often require merchants to input their product information and select categories manually.
Furthermore, sometimes multiple merchants select different categories for the same product. To
solve this problem, ecommerce sites often employ automated product classification.
For example, Amazon hosts around 350 million products on their platform. To help merchants
choose the correct category, Amazon and other ecommerce companies have automated product
categorization tools available. After simply inputting the title or a few words about the product,
the system can automatically choose the correct category for you.
When there are hundreds of millions of products in the catalogue, even a 1% increase in accuracy
can lead to millions of additional accurate classifications. As a result, many ecommerce
companies heavily invest toward improving their automatic product classification systems.
1.1 Objectives of the Project
1. The aim of the project is multi-class text classification of make-up products based on
their description.
2. Strengthen the user experience.
3. Improve search relevance.
4. Help customers find your site.
S.No. | Task | Time Required
1 | Study about the project and gather all information required to solve the problem | 12
2 | Data collection | 5
3 | Data cleaning/preprocessing using Python | 10
4 | Remove duplicates | 5
5 | Tokenization | 4
6 | Stemming | 6
7 | Word embedding | 2
8 | EDA (Exploratory Data Analysis) & uncovering facets of data | 7
9 | Final feature engineering | 4
10 | Model building using different classification algorithms | 15
11 | Model testing and different accuracy measures (AUC & ROC) | 5
12 | Model deployment | 5
Table 1 Tasks and Time Required
Overall Test dataset size – 110,775
1.6 Workflow
Fig1-Flow Chart1
Fig2-Flow Chart 2
Fig3-Model Architecture
1.7 Tools used
1. Python
2. Spark
3. NLP (Natural Language Processing)
4. Machine Learning Models
5. Model deployment Technique
1.8 Platform used in Model development
SYSTEM CONFIGURATION:
HARDWARE SPECIFICATIONS:
1. Processor: i5 or above
2. RAM: 8 GB (in case of Hadoop)
3. Disk Space: 100 GB (SSD/HDD)
SOFTWARE SPECIFICATIONS:
1. Windows 7/8/10
2. Linux/Ubuntu
3. VMware Workstation / VirtualBox
4. Jupyter Notebook (Python)
Chapter 2
Used libraries to develop the project in Python
2.1 Pandas
Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the Numpy
package and its key data structure is called the DataFrame. DataFrames allow you to store and
manipulate tabular data in rows of observations and columns of variables. pandas is a Python
package that provides fast, flexible, and expressive data structures designed to make working with
structured (tabular, multidimensional, potentially heterogeneous) and time series data both easy and
intuitive.
Syntax in Python: import pandas as pd
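As a brief illustration of the syntax above, a minimal sketch of loading the project's product metadata with pandas is shown below. The file name "train.csv" and the exact columns are assumptions for illustration; the actual Amazon ML Challenge files may differ.

import pandas as pd

# Hypothetical file name; the actual data file may differ.
train_df = pd.read_csv("train.csv")

print(train_df.shape)            # number of rows and columns
print(train_df.head())           # first few products
print(train_df.isnull().sum())   # missing values per column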
2.2 Numpy
Numpy is a Python library used for working with arrays. It also has functions for working in the
domains of linear algebra, Fourier transforms, and matrices. Numpy was created in 2005 by Travis
Oliphant. It is an open source project and you can use it freely. Numpy stands for Numerical Python.
Syntax in Python: import numpy as np
Fig5-Numpy
2.3 Matplotlib
Matplotlib is an amazing visualization library in Python for 2D plots of arrays. Matplotlib is a multi-
platform data visualization library built on Numpy arrays and designed to work with the broader
SciPy stack. It was introduced by John Hunter in the year 2002.
One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data
in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.
Syntax in Python: import matplotlib.pyplot as plt
Fig 6-Matplotlib
2.4 Seaborn
Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and
integrates closely with pandas data structures. Seaborn helps you explore and understand your data.
Its plotting functions operate on DataFrames and arrays containing whole datasets and internally
perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its
dataset-oriented, declarative API lets you focus on what the different elements of your plots mean,
rather than on the details of how to draw them.
Syntax in Python: import seaborn as sns; sns.set(style="ticks", color_codes=True)
Fig7-Seaborn
2.5 Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction via a consistent interface in
Python. This library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.
Fig8-Sklearn
2.6 Re
The Python module re provides full support for Perl-like regular expressions in Python. The re
module raises the exception re.error if an error occurs while compiling or using a regular expression.
A regular expression is a special sequence of characters that helps you match or find other strings or
sets of strings, using a specialized syntax held in a pattern. Regular expressions are widely used in
UNIX world.
Syntax in Python: import re
Fig9-re
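A minimal sketch of how re can be used for the URL and punctuation removal steps described later in Chapter 3. The helper name clean_text and the sample string are illustrative, not the project's exact code.

import re

def clean_text(text):
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)          # keep only letters and spaces
    return re.sub(r"\s+", " ", text).strip().lower()  # collapse extra whitespace

print(clean_text("Logitech M185 Mouse!! https://example.com/item?id=1"))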
2.7 CSV
The Python CSV module is used to handle CSV files. CSV files can hold a lot of information, and
the CSV module lets Python read and write to CSV files with the reader() and writer() functions.
Syntax in Python: import csv
Fig10-CSV
2.8 Nltk
NLTK is a leading platform for building Python programs to work with human language data. It
provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.
Syntax in Python: import nltk
Fig11-nltk
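A minimal sketch of typical NLTK usage for this project (tokenization and stop-word removal). The corpora downloads are one-time steps, and the sample sentence is illustrative only.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer models
nltk.download("stopwords")   # stop-word lists

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is a nice wireless mouse for laptops")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # ['nice', 'wireless', 'mouse', 'laptops']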
Chapter 3
Working of Project
3.1 Study About the Project and gather all information required to
solve the problem
Text classification is a machine learning technique that assigns a set of predefined categories to
open-ended text. Text classifiers can be used to organize, structure, and categorize pretty much any
kind of text – from documents, medical studies, and files to content from all over the web.
For example, news articles can be organized by topics; support tickets can be organized by urgency;
chat conversations can be organized by language; brand mentions can be organized by sentiment; and
so on.
Text classification is one of the fundamental tasks in natural language processing with broad
applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
A text classifier can take a phrase as input, analyze its content, and then automatically assign
relevant tags, such as UI and Easy To Use.
1. Rule-based systems
Rule-based approaches classify text using a set of handcrafted rules. For example, to sort texts into
Sports and Politics, the first step is to define lists of words that characterize each group (words
related to sports, such as football, basketball, LeBron James, etc., and words related to politics, such
as Donald Trump, Hillary Clinton, Putin, etc.).
Next, when you want to classify a new incoming text, you'll need to count the number of sports-
related words that appear in the text and do the same for politics-related words. If the number of
sports-related word appearances is greater than the politics-related word count, then the text is
classified as Sports, and vice versa. For example, this rule-based system will classify the
headline "When is LeBron James' first game with the Lakers?" as Sports because it counted one
sports-related term (LeBron James) and it didn't count any politics-related terms.
Rule-based systems are human comprehensible and can be improved over time. But this approach
has some disadvantages. For starters, these systems require deep knowledge of the domain. They are
also time-consuming, since generating rules for a complex system can be quite challenging and
usually requires a lot of analysis and testing. Rule-based systems are also difficult to maintain and
don't scale well, given that adding new rules can affect the results of the pre-existing rules.
2. Machine learning based systems
Instead of relying on manually crafted rules, machine learning text classification learns to make
classifications based on past observations. By using pre-labeled examples as training data, machine
learning algorithms can learn the different associations between pieces of text and the particular
output (i.e., tags) expected for a particular input (i.e., text). A "tag" is the predetermined
classification or category that any given text could fall into.
The first step towards training a machine learning NLP classifier is feature extraction: a method is
used to transform each text into a numerical representation in the form of a vector. One of the most
frequently used approaches is bag of words, where a vector represents the frequency of a word in a
predefined dictionary of words.
For example, if we have defined our dictionary to have the following words {This, is, the, not,
awesome, bad, basketball}, and we wanted to vectorize the text "This is awesome," we would
have the following vector representation of that text: (1, 1, 0, 0, 1, 0, 0).
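The dictionary example above can be reproduced with a few lines of Python. This is a toy sketch of the bag-of-words idea; a real project would typically use a vectorizer such as scikit-learn's CountVectorizer instead.

vocabulary = ["this", "is", "the", "not", "awesome", "bad", "basketball"]

def vectorize(text, vocab):
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocab]   # word frequency per dictionary entry

print(vectorize("This is awesome", vocabulary))     # [1, 1, 0, 0, 1, 0, 0]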
Then, the machine learning algorithm is fed with training data that consists of pairs of feature
sets (vectors for each text example) and tags (e.g. sports, politics) to produce a classification model:
Fig13-Text Classifier Training
Once it's trained with enough training samples, the machine learning model can begin to make
accurate predictions. The same feature extractor is used to transform unseen text to feature sets,
which can be fed into the classification model to get predictions on tags (e.g., sports, politics).
Naive Bayes
Naive Bayes classifiers are built on Bayes' Theorem, which can be stated as follows: the probability
of A, if B is true, is equal to the probability of B, if A is true, times the probability of A being true,
divided by the probability of B being true.
This means that any vector that represents a text will have to contain information about the
probabilities of the appearance of certain words within the texts of a given category, so that the
algorithm can compute the likelihood of that text belonging to the category.
Support Vector Machines
Support Vector Machines (SVM) is another powerful text classification machine learning algorithm
because, like Naive Bayes, SVM doesn't need much training data to start providing accurate results.
SVM does require more computational resources than Naive Bayes, but the results are even faster
and more accurate.
In short, SVM draws a line or "hyperplane" that divides a space into two subspaces. One
subspace contains vectors (tags) that belong to a group, and another subspace contains vectors
that do not belong to that group.
The optimal hyperplane is the one with the largest distance between each tag. In two dimensions it
looks like this:
Fig16-SVM Hyperplane
Those vectors are representations of your training texts, and a group is a tag you have tagged your
texts with.
As data gets more complex, it may not be possible to classify vectors/tags into only two categories.
So, it looks like this:
Fig17-SVM classify
But that's the great thing about SVM algorithms – they're "multi-dimensional." So, the more
complex the data, the more accurate the results will be. Imagine the above in three dimensions, with
an added Z-axis, to create a circle.
Mapped back to two dimensions, the ideal hyperplane looks like this:
Deep Learning
Deep learning is a set of algorithms and techniques inspired by how the human brain works,
called neural networks. Deep learning architectures offer huge benefits for text classification because
they perform at super high accuracy with lower-level engineering and computation.
The two main deep learning architectures for text classification are Convolutional Neural Networks
(CNN) and Recurrent Neural Networks (RNN).
Deep learning is hierarchical machine learning, using multiple algorithms in a progressive chain of
events. It's similar to how the human brain works when making decisions, using different techniques
simultaneously to process huge amounts of data.
Deep learning algorithms do require much more training data than traditional machine learning
algorithms (at least millions of tagged examples). However, unlike traditional machine learning
algorithms such as SVM and NB, they don't have a threshold for learning from training data: deep
learning classifiers continue to get better the more data you feed them with:
Fig19-Deep-Learning Vs SVM
Deep learning algorithms, like Word2Vec or GloVe are also used in order to obtain better vector
representations for words and improve the accuracy of classifiers trained with traditional machine
learning algorithms.
Hybrid Systems
Hybrid systems combine a machine learning-trained base classifier with a rule-based system, used to
further improve the results. These hybrid systems can be easily fine-tuned by adding specific rules
for those conflicting tags that haven't been correctly modeled by the base classifier.
Metrics and Evaluation
Cross-validation is a common method to evaluate the performance of a text classifier. It works by
splitting the training dataset into random, equal-length example sets (e.g., 4 sets with 25% of the
data). For each set, a text classifier is trained with the remaining samples (e.g., 75% of the samples).
Next, the classifiers make predictions on their respective sets, and the results are compared against
the human-annotated tags. This will determine when a prediction was right (true positives and true
negatives) and when it made a mistake (false positives, false negatives).
With these results, you can build performance metrics that are useful for a quick assessment on how
well a classifier works:
Accuracy: the percentage of texts that were categorized with the correct tag.
Precision: the percentage of examples the classifier got right out of the total number of examples
that it predicted for a given tag.
Recall: the percentage of examples the classifier predicted for a given tag out of the total number of
examples it should have predicted for that given tag.
F1 Score: the harmonic mean of precision and recall.
Balanced Accuracy: the balanced accuracy is used in binary and multiclass classification problems
to deal with imbalanced datasets. It is defined as the average of the recall obtained on each class.
The best value is 1 and the worst value is 0.
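These metrics are all available in scikit-learn; the sketch below computes them on a small set of hypothetical true and predicted tags (the labels are illustrative only, not the project's data).

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

y_true = ["sports", "sports", "politics", "politics", "sports"]
y_pred = ["sports", "politics", "politics", "politics", "sports"]

print(accuracy_score(y_true, y_pred))                    # accuracy
print(precision_score(y_true, y_pred, average="macro"))  # precision (macro-averaged)
print(recall_score(y_true, y_pred, average="macro"))     # recall (macro-averaged)
print(f1_score(y_true, y_pred, average="macro"))         # F1 score
print(balanced_accuracy_score(y_true, y_pred))           # balanced accuracy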
Fig 4 Script Screenshot Train data information
Fig 5 Script Screenshot Train and Test Data Set Descriptions
Remove punctuation
Remove URLs
Fig 8 Script Screenshot Remove URLs
Fig 13 Script Screenshot Removes Stopwords from Data Set
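Putting the cleaning steps shown in the screenshots above together, the sketch below shows how they might be applied to a product-title column of the training DataFrame. It reuses the train_df loaded earlier, and the column names "TITLE" and "clean_title" are assumptions for illustration.

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))   # assumes the NLTK stop-word corpus is downloaded

def preprocess(text):
    text = str(text).lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)           # remove punctuation, digits, emojis
    words = [w for w in text.split() if w not in stop_words]   # remove stop words
    return " ".join(words)

# Hypothetical column names for illustration.
train_df["clean_title"] = train_df["TITLE"].apply(preprocess)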
3.5 Tokenization:
There are other libraries such as Keras, spaCy, etc., which also support stop-word corpora by
default. Once we are done with the data cleaning, we perform tokenization on the dataset.
Tokenization splits the sentences into small pieces, a.k.a. tokens. A token can be a single word or
number, or it can also be a sentence. Once we are done with tokenization we can continue our data
cleaning with the methods below for better results. The NLTK library supports most of these
methods as functions.
Removal of Numerical/Alpha-Numeric words
Stemming / Lemmatization (Finding the root word)
Part of Speech (POS) Tagging
Create bi-grams or tri-grams model
Dealing with Typos
Splitting the attached words (eg: WeCanSplitThisSentence)
Spelling and Grammar Correction
Fig 14 Script Screenshot of Tokenization
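A small sketch of tokenization with NLTK, together with the bi-gram/tri-gram step mentioned in the list above. The sample title is illustrative only.

from nltk import ngrams
from nltk.tokenize import word_tokenize

tokens = word_tokenize("wireless optical mouse with usb receiver")
print(tokens)                        # each word becomes a token

bigrams = list(ngrams(tokens, 2))    # pairs of consecutive tokens
trigrams = list(ngrams(tokens, 3))   # triples of consecutive tokens
print(bigrams[:3])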
3.6 Stemming
Stemming is the process of reducing inflected or derived words to their root/base word. Stemming
programs are commonly referred to as stemming algorithms or stemmers.
Often when searching text for a certain keyword, it helps if the search returns variations of the word.
For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be
the stem for [boat, boater, boating, boats].
Porter Stemmer
One of the most common — and effective — stemming tools is Porter's Algorithm, developed by
Martin Porter in 1980. The algorithm employs five phases of word reduction, each with its own set
of mapping rules.
Fig 15 Script Screenshot Stemming with the Porter Stemmer
Snowball Stemmer
This is somewhat of a misnomer, as Snowball is the name of a stemming language developed by
Martin Porter. The algorithm used here is more accurately called the "English Stemmer" or
"Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic
and speed.
Stemming has its drawbacks. If given the token saw, stemming might always return saw, whereas
lemmatization would likely return either see or saw depending on whether the use of the token was as
a verb or a noun.
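A minimal sketch comparing the Porter and Snowball (English/"Porter2") stemmers from NLTK on a few words; the word list is illustrative.

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")   # the "English" / "Porter2" stemmer

for word in ["boats", "boating", "running", "runner"]:
    print(word, "->", porter.stem(word), snowball.stem(word))
# e.g. "boats" and "boating" both reduce to the stem "boat", and "running" to "run"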
Fig 17 Script Screenshot Before and After Stemming
3.7 Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full
vocabulary to apply a morphological analysis to words. The lemma of 'was' is 'be' and the lemma
of 'mice' is 'mouse'.
Lemmatization is typically seen as much more informative than simple stemming, which is why
spaCy has opted to only provide lemmatization instead of stemming.
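A small sketch of lemmatization with spaCy, as mentioned above. It assumes the small English model has been installed (python -m spacy download en_core_web_sm); the sample sentence is illustrative.

import spacy

nlp = spacy.load("en_core_web_sm")   # assumes the model is installed
doc = nlp("The mice were running because the light was on")
for token in doc:
    print(token.text, "->", token.lemma_)
# e.g. 'mice' -> 'mouse' and 'was' -> 'be', as described above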
Fig20-Continuous Bag of Words (CBOW)
Fig21-Skip gram
After applying the above neural embedding methods, we get trained vectors of each word after many
iterations through the corpus. These trained vectors preserve syntactical or semantic information and
are converted to lower dimensions. The vectors with similar meaning or semantic information are
placed close to each other in space.
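A minimal Word2Vec sketch using gensim (a library assumed here for illustration; it is not listed in Chapter 2). Setting sg=0 trains the CBOW architecture and sg=1 the skip-gram architecture shown in the figures above; the toy sentences stand in for tokenized product titles.

from gensim.models import Word2Vec

sentences = [
    ["wireless", "optical", "mouse"],
    ["wireless", "bluetooth", "keyboard"],
    ["usb", "optical", "mouse"],
]

# vector_size/epochs follow the gensim 4.x parameter names
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["mouse"].shape)          # 50-dimensional vector for 'mouse'
print(model.wv.most_similar("mouse"))   # words closest to 'mouse' in the embedding space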
2) GloVe:
This is another method for creating word embeddings. In this method, we take the corpus, iterate
through it, and get the co-occurrence of each word with the other words in the corpus. We get a
co-occurrence matrix through this. Words which occur next to each other get a value of 1; if they
are one word apart then 1/2, if two words apart then 1/3, and so on.
Let us take an example to understand how the matrix is created. We have a small corpus: Corpus:
It is a nice evening. Good Evening!
Is it a nice evening?
Fig22-GloVe Example
The upper half of the matrix will be a reflection of the lower half. We can consider a window frame
as well to calculate the co-occurrences by shifting the frame till the end of the corpus. This helps
gather information about the context in which the word is used.
Initially, the vectors for each word are assigned randomly. Then we take two pairs of vectors and see
how close they are to each other in space. If they occur together more often (i.e., have a higher value
in the co-occurrence matrix) but are far apart in space, then they are brought closer to each other. If
they are close to each other in space but are rarely or never used together, then they are moved
further apart. After many iterations of this process, we'll get a vector space representation that
approximates the information from the co-occurrence matrix. The performance of GloVe is better
than Word2Vec in terms of both semantic and syntactic capturing.
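The weighting scheme described above (1 for adjacent words, 1/2 for words one position apart, and so on) can be sketched for the toy corpus as follows. The window size and the plain-Python counting are illustrative only, not how GloVe is actually trained at scale.

from collections import defaultdict

sentences = [
    ["it", "is", "a", "nice", "evening"],
    ["good", "evening"],
    ["is", "it", "a", "nice", "evening"],
]

cooc = defaultdict(float)
window = 3   # how far apart two words may be and still co-occur
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(i + 1, min(i + 1 + window, len(sent))):
            weight = 1.0 / (j - i)            # 1 if adjacent, 1/2 if one word apart, ...
            cooc[(word, sent[j])] += weight
            cooc[(sent[j], word)] += weight   # the matrix is symmetric

print(cooc[("nice", "evening")])   # co-occurrence weight of 'nice' and 'evening'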
Pre-trained Word Embedding Models:
People generally use pre-trained models for word embeddings. A few of them are:
SpaCy
fastText
Flair etc.
Common Errors Made:
You need to use exactly the same pipeline during deployment of your model as was used to
create the training data for the word embedding. If you use a different tokenizer or a different
method of handling white space, punctuation, etc., you might end up with incompatible inputs.
Words in your input that don't have a pre-trained vector are known as out-of-vocabulary (OOV)
words. What you can do is replace those words with "UNK", which means unknown, and then
handle them separately.
Dimension mismatch: vectors can be of many lengths. If you train a model with vectors of
length, say, 400 and then try to apply vectors of length 1000 at inference time, you will run into
errors. So make sure to use the same dimensions throughout.
Benefits of using Word Embeddings:
It is much faster to train than hand-built models like WordNet (which uses graph embeddings)
Almost all modern NLP applications start with an embedding layer
It stores an approximation of meaning
Drawbacks of Word Embeddings:
It can be memory intensive
It is corpus dependent: any underlying bias will have an effect on your model
It cannot distinguish between homophones, e.g., brake/break, cell/sell, weather/whether, etc.
Graph-1 Data set Dimensions
Graph-2 Unique Values in Training Data Set
Graph-4 Missing Values in Training Data Set
Graph-6 Title Sequence Vs No. of Sequences
Fig23-KNN
We want to select a value of K that is reasonable: not too big (it will simply predict the majority
class among all data samples) and not too small (it will be overly sensitive to noise in the data).
Fig 19 Script Screenshot KNN Model Building
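A sketch of the KNN model building step with scikit-learn. For brevity it uses TF-IDF features and a tiny hand-made dataset; in the project the inputs are the cleaned product titles/descriptions (embedded as described in Section 3.8) and the targets are the browse nodes.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score

# Illustrative texts and labels only, not the project's data.
texts = ["wireless optical mouse", "usb gaming mouse", "mechanical keyboard",
         "cotton summer dress", "floral printed dress", "denim jacket"]
labels = ["electronics", "electronics", "electronics", "apparel", "apparel", "apparel"]

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.33, random_state=42)

knn = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
knn.fit(X_train, y_train)
print(balanced_accuracy_score(y_test, knn.predict(X_test)))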
Naive Bayes is a family of probabilistic algorithms that take advantage of probability theory and
Bayes' Theorem to predict the tag of a text (like a piece of news or a customer review). They are
probabilistic, which means that they calculate the probability of each tag for a given text, and then
output the tag with the highest one. The way they get these probabilities is by using Bayes' Theorem,
which describes the probability of a feature based on prior knowledge of conditions that might be
related to that feature.
We're going to be working with an algorithm called Multinomial Naive Bayes. We'll walk through
the algorithm applied to NLP with an example, so by the end, not only will you know how this
method works, but also why it works. Then, we'll lay out a few advanced techniques that can make
Naive Bayes competitive with more complex machine learning algorithms, such as SVM and neural
networks.
Let's see how this works in practice with a simple example. Suppose we are building a
classifier that says whether a text is about sports or not. Our training data has 5 sentences:
Table 4 Naive Bayes Example
Now, which tag does the sentence "A very close game" belong to?
Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence
"A very close game" is Sports and the probability that it's Not Sports. Then, we take the largest one.
Written mathematically, what we want is P (Sports | a very close game) — the probability that the
tag of a sentence is Sports given that the sentence is "A very close game".
Feature Engineering
The first thing we need to do when creating a machine learning model is to decide what to use as
features. We call features the pieces of information that we take from the text and give to the
algorithm so it can work its magic. For example, if we were doing classification on health, some
features could be a person‘s height, weight, gender, and so on. We would exclude things that maybe
are known but aren‘t useful to the model, like a person‘s name or favorite color.
In this case though, we don‘t even have numeric features. We just have text. We need to
somehow convert this text into numbers that we can do calculations on.
So what do we do? Simple! We use word frequencies. That is, we ignore word order and sentence
construction, treating every document as a set of the words it contains. Our features will be the
counts of each of these words. Even though it may seem too simplistic an approach, it works
surprisingly well.
Bayes’ Theorem
Now we need to transform the probability we want to calculate into something that can be calculated
using word frequencies. For this, we will use some basic properties of probabilities, and Bayes'
Theorem. If you feel like your knowledge of these topics is a bit rusty, read up on it, and you'll be up
to speed in a couple of minutes.
Bayes' Theorem is useful when working with conditional probabilities (like we are doing here),
because it provides us with a way to reverse them:
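In symbols (reconstructing the equation that the original screenshot showed), Bayes' Theorem is:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}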
In our case, we have P (Sports | a very close game), so using this theorem we can reverse the
conditional probability:
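Written out (again reconstructing the missing equation), this gives:

P(\text{Sports} \mid \text{a very close game}) = \frac{P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports})}{P(\text{a very close game})}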
Since for our classifier we're just trying to find out which tag has a bigger probability, we can
discard the divisor — which is the same for both tags — and just compare with:
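That is, we compare (reconstructed from the surrounding text):

P(\text{a very close game} \mid \text{Sports}) \times P(\text{Sports}) \quad \text{versus} \quad P(\text{a very close game} \mid \text{Not Sports}) \times P(\text{Not Sports})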
This is better, since we could actually calculate these probabilities! Just count how many times the
sentence "A very close game" appears in the Sports tag, divide it by the total, and obtain P (a very
close game | Sports).
There's a problem though: "A very close game" doesn't appear in our training data, so this
probability is zero. Unless every sentence that we want to classify appears in our training data, the
model won't be very useful.
Being Naive
So here comes the Naive part: we assume that every word in a sentence is independent of the other
ones. This means that we're no longer looking at entire sentences, but rather at individual words. So
for our purposes, "this was a fun party" is the same as "this party was fun" and "party fun was this".
We write this as:
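In symbols (the equation itself is missing from the text, so this reconstructs it from the description):

P(\text{a very close game}) = P(a) \times P(\text{very}) \times P(\text{close}) \times P(\text{game})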
This assumption is very strong but super useful. It's what makes this model work well with little data
or data that may be mislabeled. The next step is just applying this to what we had before:
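That is (reconstructed):

P(\text{a very close game} \mid \text{Sports}) = P(a \mid \text{Sports}) \times P(\text{very} \mid \text{Sports}) \times P(\text{close} \mid \text{Sports}) \times P(\text{game} \mid \text{Sports})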
And now, all of these individual words actually show up several times in our training data, and we
can calculate them!
Calculating Probabilities
The final step is just to calculate every probability and see which one turns out to be larger.
Calculating a probability is just counting in our training data.
First, we calculate the a priori probability of each tag: for a given sentence in our training data, the
probability that it is Sports, P (Sports), is ⅗. Then, P (Not Sports) is ⅖. That's easy enough.
Then, calculating P (game | Sports) means counting how many times the word "game" appears in
Sports texts (2) divided by the total number of words in Sports (11). Therefore,
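That is (reconstructing the missing value):

P(\text{game} \mid \text{Sports}) = \frac{2}{11}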
However, we run into a problem here: "close" doesn't appear in any Sports text! That means that
P (close | Sports) = 0. This is rather inconvenient, since we are going to be multiplying it with the
other probabilities, so the whole product ends up being 0: in a multiplication, if one of the terms is
zero, the whole calculation is nullified. Doing things this way simply doesn't give us any
information at all, so we have to find a way around it.
How do we do it? By using something called Laplace smoothing: we add 1 to every count so it's
never zero. To balance this, we add the number of possible words to the divisor, so the division will
never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but',
'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match'].
Since the number of possible words is 14 (I counted them!), applying smoothing we get, for
example, P (game | Sports) = (2 + 1) ÷ (11 + 14). The full results are:
Word   | P (word | Sports)     | P (word | Not Sports)
a      | (2 + 1) ÷ (11 + 14)   | (1 + 1) ÷ (9 + 14)
very   | (1 + 1) ÷ (11 + 14)   | (0 + 1) ÷ (9 + 14)
close  | (0 + 1) ÷ (11 + 14)   | (1 + 1) ÷ (9 + 14)
game   | (2 + 1) ÷ (11 + 14)   | (0 + 1) ÷ (9 + 14)
Table 5 Results of Naive Bayes Example
Now we just multiply all the probabilities, and see which is bigger:
Excellent! Our classifier gives "A very close game" the Sports tag.
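The whole worked example can be reproduced in a few lines of Python. The five training sentences below are assumed from the word list and counts quoted above (Table 4 itself is not reproduced in the text), so treat this as an illustrative sketch rather than the project's code.

from collections import Counter

# Assumed training sentences, consistent with the counts used above
# (11 words tagged Sports, 9 tagged Not Sports, 14 distinct words).
train = [
    ("a great game", "Sports"),
    ("the election was over", "Not Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not Sports"),
]

word_counts = {"Sports": Counter(), "Not Sports": Counter()}
class_counts = Counter()
for text, tag in train:
    class_counts[tag] += 1
    word_counts[tag].update(text.split())

vocabulary = {w for text, _ in train for w in text.split()}   # the 14 possible words

def score(text, tag):
    prior = class_counts[tag] / sum(class_counts.values())    # e.g. P(Sports) = 3/5
    total = sum(word_counts[tag].values())                    # 11 for Sports, 9 for Not Sports
    likelihood = 1.0
    for word in text.split():
        # Laplace smoothing: add 1 to each count and |vocabulary| to the divisor
        likelihood *= (word_counts[tag][word] + 1) / (total + len(vocabulary))
    return prior * likelihood

for tag in ("Sports", "Not Sports"):
    print(tag, score("a very close game", tag))
# The Sports score comes out larger, so the sentence is tagged Sports.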
When the same trained model is needed for some other project or at a later time, to avoid wasting the
training time, we store the trained model so that it can be used anytime in the future.
There are two ways we can save a model in scikit-learn:
1. Pickle string: the pickle module implements a fundamental, but powerful, algorithm for
serializing and de-serializing a Python object structure.
2. Pickled model as a file using joblib: joblib.dump is used to serialize an object hierarchy and
joblib.load to deserialize a data stream. These functions also accept file-like objects instead of
filenames.
Example: let's apply K-Nearest Neighbors on the iris dataset and then save the model.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.externals import joblib  # in newer scikit-learn versions use: import joblib

# Load dataset and split it into train and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2018)

# Train the KNN model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Save the model to a file
joblib.dump(knn, 'filename.pkl')

# Load the model from the file and use it to make predictions
knn_from_joblib = joblib.load('filename.pkl')
knn_from_joblib.predict(X_test)
Chapter 4
Future Scope
Chapter 5
Conclusion
Text data is a very important kind of data that we deal with nowadays. It is hard to work with text
data, find meaningful insights in it, and use them for business growth. In this project we classified
product brands using NLP. Amazon is a very big company in the market; the Amazon catalog
consists of billions of products that belong to thousands of browse nodes (each browse node
represents a collection of items for sale). Browse nodes are used to help customers navigate through
the website and to group products into product types. So, it is important to predict the node
assignment at the time of listing of the product, or when the browse node information is absent. We
classified the product brands using NLP along with big data analytics. In this project we applied
many preprocessing steps, such as removing URLs, stop words, emojis, and other unwanted content
which is not necessary. After that we performed tokenization, stemming, lemmatization, word
embeddings, and much more. Finally, we performed feature engineering and applied KNN and
Naive Bayes models; the balanced accuracy of the KNN model is 49.87% and the balanced accuracy
of the Naive Bayes model is 64%. So, we conclude that Naive Bayes works better than KNN for this
task.
Chapter 6
References
[1] "Classifying Itens with NLP", Medium, 2021. [Online]. Available:
https://medium.com/neuronio/classifying-itens-with-nlp-b3b28a4b7873. [Accessed: 16- Aug-
2021].
[2] "What is Natural Language Processing? An Introduction to NLP", SearchEnterpriseAI, 2021.
[Online]. Available: https://searchenterpriseai.techtarget.com/definition/natural-language-
processing-NLP. [Accessed: 16- Aug- 2021].
[3] "GitHub - aniass/Product-Categorization-NLP: Multi-Class Text Classification for products
based on their description with Machine Learning algorithms and Neural Networks (MLP,
CNN).", GitHub, 2021. [Online]. Available: https://github.com/aniass/Product-
Categorization- NLP. [Accessed: 16- Aug- 2021].
[4] "Beginner's Guide to Product Categorization in Machine Learning | Hacker Noon",
Hackernoon.com, 2021. [Online]. Available: https://hackernoon.com/beginners-guide-to-
product- categorization-in-machine-learning-bai3tip. [Accessed: 16- Aug- 2021].
[5] A. Saeed, "Research paper categorization using machine learning and NLP", Aqibsaeed.github.io,
2021. [Online]. Available: http://aqibsaeed.github.io/2016-07-26-text- classification/. [Accessed: 16-
Aug- 2021].
[6] How to use NLP in Python: a Practical Step-by-Step Example - Just into Data", Just into Data, 2021.
[Online]. Available: https://www.justintodata.com/use-nlp-in-python-practical-step- by-step-example/.
[Accessed: 16- Aug- 2021].
[7] N. Guide, "Natural Language Processing Step by Step Guide | NLP for Data Scientists", Analytics
Vidhya, 2021. [Online]. Available: https://www.analyticsvidhya.com/blog/2021/05/natural-
language-processing-step-by-step-guide/. [Accessed: 16- Aug- 2021].
[8] U. Example, "Text Classification in Natural Language Processing", Analytics Vidhya, 2021. [Online].
Available: https://www.analyticsvidhya.com/blog/2020/12/understanding-text- classification-in-nlp-
with-movie-review-example-example/. [Accessed: 16- Aug- 2021].
[9] H. problems, B. DialogFlow, B. Processing and R. NLP, "How to solve NLP problems | i2tutorials",
i2tutorials, 2021. [Online].
[10] By: IBM Cloud Education, What is natural language processing? IBM. Available at:
https://www.ibm.com/cloud/learn/natural-language-processing [Accessed August 16, 2021].
[11] Chirag, W.by, 2017. NLP for big data Analytics - a guide. EuroSTAR Huddle. Available at:
https://huddle.eurostarsoftwaretesting.com/nlp-for-big-data-how-nlp-will-revolutionise-big-data-
analytics/ [Accessed August 16, 2021].
56
[12] By krishna Gandhi blog, P., 2020. NLP and big data. Data Science.
Available at: https://dapperdatadig.wordpress.com/2020/09/04/nlp-and-big-data/
[Accessed August 16, 2021].
[13] Anon, Amazon ml challenge. HackerEarth. Available at:
https://www.hackerearth.com/challenges/competitive/amazon-ml-challenge/ [Accessed
August 16, 2021].
57