A Guide To Text Classification (NLP)
A guide to Text Classification (NLP) using SVM and Naive Bayes with Python
Gunjit Bedi Nov 9, 2018 · 8 min read
https://medium.com/@bedigunjit/simple-guide-to-text-classification-nlp-using-svm-and-naive-bayes-with-python-421db3a72d34 1/17
26-06-2021 A guide to Text Classification(NLP) using SVM and Naive Bayes with Python | by Gunjit Bedi | Medium
I went through a lot of articles, books and videos to understand text
classification when I first started out. The content was sometimes too
overwhelming for someone just beginning their journey into NLP or text
classification algorithms.
This is my take on explaining the Text classification technique with just the
right content to get you started. By the end of this article you will have
enough knowledge and a working model to take on the interesting world of
Natural Language Processing with Python.
Text classification can be done with the help of Natural Language
Processing and different classification algorithms like Naive Bayes, SVM
and even Neural Networks in Python.
I will be using the Amazon Review data set, which has 10,000 rows of text
data, each classified as “Label 1” or “Label 2”. The data set has two
columns, “Text” and “Label”. You can download the data from here.
The following libraries are used in this article (NLTK and scikit-learn are
also needed for the later steps). If any are missing, they can be installed
with pip, for example: pip install pandas numpy nltk scikit-learn
import pandas as pd
import numpy as np
Setting a random seed makes the script reproduce the same result on every
run, as long as the script itself is unchanged; without it, each run can
produce different results. The seed can be any fixed integer.
np.random.seed(500)
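To see what the seed does in isolation, here is a minimal sketch. The helper name draw_sample is made up for illustration; it simply re-seeds NumPy's global generator and draws a few numbers.

```python
import numpy as np

def draw_sample(seed, size=3):
    """Seed NumPy's global generator, then draw `size` random floats."""
    np.random.seed(seed)
    return np.random.rand(size)

first = draw_sample(500)
second = draw_sample(500)   # same seed, so the same numbers come out
other = draw_sample(501)    # a different seed gives a different sequence

print(np.array_equal(first, second))
print(np.array_equal(first, other))
```

The same idea carries over to any later step that relies on NumPy's global random state, such as shuffling rows before a split.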
The data set can be easily added as a pandas Data Frame with the help of
‘read_csv’ function. I have set the encoding to ‘latin-1’ as the text had many
special characters.
Corpus = pd.read_csv(r"C:\Users\gunjit.bedi\Desktop\NLP Project\corpus.csv",encoding='latin-1')
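If you want to check why the encoding argument matters without having the file at hand, the following sketch may help. The two sample rows and the variable names are invented for illustration; only read_csv and its encoding parameter come from the article.

```python
import io
import pandas as pd

# Hypothetical two-row sample; 'é' is a single byte (0xE9) in latin-1.
raw_bytes = (
    "text,label\n"
    "The café was great,Label 2\n"
    "Not worth it,Label 1\n"
).encode("latin-1")

# These bytes are not valid UTF-8, so a UTF-8 decode fails.
try:
    raw_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# With the matching encoding, pandas reads the special character correctly.
sample = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(utf8_ok)
print(sample)
```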
This is an important step in any data mining process. It involves
transforming raw data into a format that NLP models can work with.
Real-world data is often incomplete, inconsistent, and/or lacking in certain
behaviors or trends, and is likely to contain many errors. Data
pre-processing is a way of dealing with these issues before modelling.
Below, I explain the two techniques that are performed alongside the other,
easy-to-understand steps in data pre-processing:
Here’s the complete script which performs the aforementioned data
pre-processing steps. You can always add or remove steps to best suit the
data set you are dealing with:
3. Word Tokenization
6. Word Lemmatization
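from collections import defaultdict
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords, wordnet as wn
from nltk.stem import WordNetLemmatizer
# Note: the NLTK tokenizer, tagger, stop word and WordNet data must be fetched once with nltk.download().

# Remove rows with missing text and reset the index so the positions used below match the row labels.
Corpus = Corpus.dropna(subset=['text']).reset_index(drop=True)
# Tokenization (reconstructed: this step was lost from the extracted script).
Corpus['text'] = [word_tokenize(entry) for entry in Corpus['text']]
# WordNetLemmatizer needs a POS tag; every tag maps to noun unless set otherwise (reconstructed default).
tag_map = defaultdict(lambda: wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
stop_words = set(stopwords.words('english'))
for index, entry in enumerate(Corpus['text']):
    # Declaring Empty List to store the words that follow the rules for this step
    Final_words = []
    # Initializing WordNetLemmatizer()
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag function below will provide the 'tag' i.e if the word is Noun(N) or Verb(V) or something else.
    for word, tag in pos_tag(entry):
        # Assumed rules (the original condition was lost): keep alphabetic words that are not stop words.
        if word.isalpha() and word not in stop_words:
            word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
            Final_words.append(word_Final)
    # Store the processed words for this row in 'text_final'.
    Corpus.loc[index,'text_final'] = str(Final_words)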
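The tag_map lookup is the least obvious part of the script, so here is a small sketch of it that needs no NLTK download. It assumes the one-letter codes that WordNet's constants stand for (wn.NOUN is 'n', wn.ADJ is 'a', wn.VERB is 'v', wn.ADV is 'r'), and it uses a few hand-written Penn Treebank tags in place of real pos_tag output. The tagged sample is invented.

```python
from collections import defaultdict

# Plain-string stand-ins for wn.NOUN, wn.ADJ, wn.VERB and wn.ADV.
NOUN, ADJ, VERB, ADV = "n", "a", "v", "r"

# Any tag whose first letter is not J, V or R falls back to noun.
tag_map = defaultdict(lambda: NOUN)
tag_map["J"] = ADJ
tag_map["V"] = VERB
tag_map["R"] = ADV

# Hand-written (word, Penn Treebank tag) pairs standing in for pos_tag output.
tagged = [
    ("movies", "NNS"),
    ("were", "VBD"),
    ("really", "RB"),
    ("good", "JJ"),
    ("the", "DT"),
]

# Only the first letter of each tag is used for the lookup, as in tag_map[tag[0]].
wordnet_pos = [(word, tag_map[tag[0]]) for word, tag in tagged]
print(wordnet_pos)
```

Only the first letter of the tag matters, which is why 'NNS' and 'DT' both end up as nouns: neither 'N' nor 'D' has an explicit entry, so the default applies.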
The Corpus will be split into two data sets, Training and Test. The training
data set will be used to fit the model and the predictions will be performed
on the test data set. This can be done with train_test_split from the
sklearn library. The training data will contain 70% of the corpus and the
test data the remaining 30%, because we set the parameter test_size=0.3.
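The split described above can be sketched on toy data as follows. The ten placeholder reviews and the random_state value are assumptions added for illustration; in the article's script the inputs would presumably be the processed text column and the label column of Corpus.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical reviews with alternating labels.
texts = [f"review {i}" for i in range(10)]
labels = ["Label 1" if i % 2 == 0 else "Label 2" for i in range(10)]

# test_size=0.3 holds back 30% of the rows for testing.
# random_state is an added assumption that fixes the shuffle for this sketch.
Train_X, Test_X, Train_Y, Test_Y = train_test_split(
    texts, labels, test_size=0.3, random_state=500
)

print(len(Train_X), len(Test_X))
```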
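from sklearn.preprocessing import LabelEncoder
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
# Reuse the mapping learned from the training labels instead of refitting on the test labels.
Test_Y = Encoder.transform(Test_Y)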
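As a quick illustration of what label encoding produces, here is a sketch on made-up labels (the lists and lowercase variable names are hypothetical). LabelEncoder sorts the distinct labels and numbers them from 0; learning that mapping once on the training labels and reusing it for the test labels keeps the numbers consistent across both sets.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["Label 2", "Label 1", "Label 2", "Label 1"]
test_labels = ["Label 1", "Label 2"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # learns the label-to-integer mapping
test_encoded = encoder.transform(test_labels)        # reuses the same mapping

print(list(encoder.classes_))
print(list(train_encoded), list(test_encoded))

# The integers can be mapped back to the original labels when needed.
print(list(encoder.inverse_transform(test_encoded)))
```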
Text Encoding
Without going into the math, TF-IDF scores are word frequency scores that
try to highlight words that are more interesting, e.g. frequent in a
document but not across documents.
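For readers who do want a peek at the idea, here is a minimal sketch of one common TF-IDF variant: the raw count of a word in a document multiplied by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1. This is the formula scikit-learn documents for TfidfVectorizer's default settings, before its final normalization step, which this sketch skips. The three tiny documents and the function name tfidf_scores are invented for illustration.

```python
import math
from collections import Counter

# Three hypothetical, already tokenized documents.
docs = [
    ["good", "sound", "good", "track"],
    ["bad", "sound"],
    ["good", "video"],
]

def tfidf_scores(doc, corpus):
    """Return {word: tf * idf} for one tokenized document."""
    n_docs = len(corpus)
    counts = Counter(doc)
    scores = {}
    for word, tf in counts.items():
        df = sum(1 for d in corpus if word in d)        # documents containing the word
        idf = math.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency
        scores[word] = tf * idf
    return scores

scores = tfidf_scores(docs[0], docs)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In the first document, "track" and "sound" each occur once, but "track" appears in no other document, so it receives the higher score of the two.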
The following syntax can be used to first fit the TF-IDF model on the whole
corpus. This will help TF-IDF build a vocabulary of words which it has
learned from the corpus data, and it will assign a unique integer number to
each of these words. There will be a maximum of 5000 unique words/features,
as we have set the parameter max_features=5000.
from sklearn.feature_extraction.text import TfidfVectorizer
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
You can use the syntax below to see the vocabulary that it has learned from
the corpus:
print(Tfidf_vect.vocabulary_)
{'even': 1459, 'sound': 4067, 'track': 4494, 'beautiful': 346, 'paint': 3045,
'mind': 2740, 'well': 4864, 'would': 4952, 'recomend': 3493, 'people': 3115,
'hate': 1961, 'video': 4761, ...}
And you can directly print the vectorized data to see what it looks like:
print(Train_X_Tfidf)
Each entry in the output has three parts: (1) the row number in
‘Train_X_Tfidf’, (2) the unique integer number of a word in that row, and
(3) the score calculated by the TF-IDF Vectorizer.
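To make the vocabulary and the printed matrix easier to read, here is a sketch on a three-sentence toy corpus (the sentences are made up, and the vectorizer is fitted and applied to the same texts for brevity). It shows the word-to-index mapping and how a single score can be looked up by row and column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["good sound track", "bad sound", "good video"]

vectorizer = TfidfVectorizer(max_features=5000)
matrix = vectorizer.fit_transform(corpus)   # sparse matrix, one row per document

print(vectorizer.vocabulary_)   # word -> column index
print(matrix.shape)             # (number of documents, number of distinct words)
print(matrix)                   # only non-zero entries: row, column and score

# Look up one score directly: row 0 is the first document.
col = vectorizer.vocabulary_["track"]
print(matrix[0, col])
```

Only the non-zero scores are stored, which is why the printout lists coordinates rather than a full table: most words do not occur in most documents.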
Now our data sets are ready to be fed into different classification
algorithms.
First up, let's try the Naive Bayes classifier algorithm. You can read more
about it here.
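from sklearn import naive_bayes, metrics
# Fit the classifier on the training vectors, then predict labels for the test vectors.
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)
predictions_NB = Naive.predict(Test_X_Tfidf)
print("Naive Bayes Accuracy Score -> ", metrics.accuracy_score(Test_Y, predictions_NB)*100)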
Next is the SVM (Support Vector Machine). You can read more about it
here.
from sklearn import svm, metrics
# The line that created the classifier was lost in extraction; a linear-kernel SVC is assumed here.
SVM = svm.SVC(kernel='linear')
SVM.fit(Train_X_Tfidf,Train_Y)
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Accuracy Score -> ", metrics.accuracy_score(Test_Y, predictions_SVM)*100)
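To try both classifiers without the full data set, here is a self-contained sketch on a handful of invented reviews. The texts, the labels, the linear kernel and the choice to fit the vectorizer on the training texts only are all assumptions made for this toy example, and the accuracy it prints says nothing about the Amazon data.

```python
from sklearn import naive_bayes, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hypothetical reviews: 1 = positive, 0 = negative.
train_texts = [
    "great sound love it",
    "love this great album",
    "terrible waste of money",
    "bad product terrible quality",
]
train_labels = [1, 1, 0, 0]
test_texts = ["great album", "terrible product"]
test_labels = [1, 0]

# Turn the texts into TF-IDF vectors.
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_texts)
test_vectors = vectorizer.transform(test_texts)

models = {
    "Naive Bayes": naive_bayes.MultinomialNB(),
    "SVM": svm.SVC(kernel="linear"),  # kernel choice is an assumption
}

# Fit each model, predict on the test vectors and record the accuracy in percent.
results = {}
for name, model in models.items():
    model.fit(train_vectors, train_labels)
    predictions = model.predict(test_vectors)
    results[name] = accuracy_score(test_labels, predictions) * 100
    print(f"{name} accuracy: {results[name]:.1f}%")
```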
I hope this has explained well what text classification is and how it can be
easily implemented in Python. If you want the full code you can access it
from here.
1. Play around with the data pre-processing steps and see how it affects
the accuracy.
If you have any thoughts or suggestions, please feel free to comment, or if
you want you can reach me at [email protected]. I will try to get back to
you as soon as possible. Hit the clap button or share if you like the post.
For some interesting tips and tricks on text classification, you can refer
to the article below as well.
https://neptune.ai/blog/text-classification-tips-and-tricks-kaggle-competitions