A Comprehensive Guide to Understand and Implement Text Classification in Python
SHIVAM BANSAL, APRIL 23, 2018
Introduction
One of the most widely used natural language processing tasks in different business problems is "Text Classification". The goal of text classification is to automatically classify text documents into one or more defined categories. Some examples of text classification are the detection of spam emails, sentiment analysis of customer reviews, auto-tagging of customer queries, and the categorization of news articles into defined topics.
Table of Contents
In this article, I will explain text classification and the step-by-step process to implement it in Python:
1. Dataset Preparation
2. Feature Engineering
3. Model Building
4. Improve Performance of Text Classifier: we will also look at the different ways to improve the performance of text classifiers.
Note: This article does not cover NLP tasks in depth. If you want to revise the basics and come back here, you can always go through this article.
You will need the requisite libraries to run this code – you can install them from their individual official links:
Pandas
Scikit-learn
XGBoost
TextBlob
Keras
Once installed, load the libraries that will be used throughout this article:

# load the required libraries
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics, svm
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import decomposition, ensemble
import pandas, xgboost, numpy, textblob, string
from keras.preprocessing import text, sequence
from keras import layers, models, optimizers
1. Dataset Preparation
For the purpose of this article, I am using the dataset of Amazon reviews, which can be downloaded at this link. The dataset consists of 3.6M text reviews and their labels; we will use only a small fraction of the data. To prepare the dataset, load the downloaded data into a pandas dataframe containing two columns – text and label. (Source)
# load the dataset
data = open('data/corpus').read()
labels, texts = [], []
for i, line in enumerate(data.split("\n")):
    content = line.split()
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
Next, we will split the dataset into training and validation sets so that we can train and test the classifier. We will also encode our target column so that it can be used in machine learning models.
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)   # reuse the encoder fitted on the training labels
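To see which integer was assigned to each original label, you can inspect the fitted encoder's classes (a small usage note added here for illustration):

print(encoder.classes_)   # the original string labels, in the order of their encoded integer ids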
2. Feature Engineering
The next step is feature engineering. In this step, the raw text data will be transformed into feature vectors, and new features will be created using the existing dataset. We will implement the following ideas in order to obtain relevant features from our dataset.
2.1 Count Vectors as features
2.2 TF-IDF Vectors as features
    Word level
    N-Gram level
    Character level
2.3 Word Embeddings as features
2.4 Text / NLP based features
2.5 Topic Models as features
2.1 Count Vectors as features
Count Vectors are a matrix representation of the dataset in which every row represents a document from the corpus, every column represents a term from the corpus, and every cell represents the frequency count of a particular term in a particular document.

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

# transform the training and validation data using count vectorizer object
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)
2.2 TF-IDF Vectors as features
The TF-IDF score represents the relative importance of a term in a document and in the entire corpus. It is composed of two terms: the first computes the normalized Term Frequency (TF), the second computes the Inverse Document Frequency (IDF).

TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
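As a quick illustration of these two formulas, here is a hypothetical toy corpus (not the Amazon reviews data used in this article) with the TF-IDF score of one term computed by hand:

import math

# toy corpus: four tiny documents; we score the term "good" in the first one
docs = ["good movie good story", "bad movie", "good acting", "boring plot"]
doc = docs[0].split()

tf = doc.count("good") / float(len(doc))                  # 2 / 4 = 0.5
docs_with_term = sum("good" in d.split() for d in docs)   # 2 documents contain "good"
idf = math.log(len(docs) / docs_with_term)                # log_e(4 / 2) ≈ 0.693
print(tf * idf)                                           # TF-IDF ≈ 0.35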
TF-IDF vectors can be generated at different levels of input tokens (words, characters, n-grams):
a. Word Level TF-IDF: matrix representing the tf-idf scores of every term in different documents.
b. N-gram Level TF-IDF: N-grams are combinations of N terms together; this matrix represents the tf-idf scores of N-grams.
c. Character Level TF-IDF: matrix representing the tf-idf scores of character level n-grams in the corpus.
# word level tf-idf
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)
# ngram level tf-idf
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['text'])
xtrain_tfidf_ngram = tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram = tfidf_vect_ngram.transform(valid_x)
# character level tf-idf
tfidf_vect_ngram_chars = TfidfVectorizer(analyzer='char', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram_chars.fit(trainDF['text'])
xtrain_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(train_x)
xvalid_tfidf_ngram_chars = tfidf_vect_ngram_chars.transform(valid_x)
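As a quick sanity check (not part of the original walkthrough), the shapes of the resulting document-term matrices can be inspected:

print(xtrain_count.shape)        # (number of training documents, vocabulary size)
print(xtrain_tfidf_ngram.shape)  # (number of training documents, at most 5000 n-gram features)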
2.3 Word Embeddings as features
A word embedding is a way of representing words and documents using dense vector representations. Pre-trained word embeddings (for example GloVe, fastText or word2vec) can be downloaded and used instead of training embeddings from scratch. The following snippet shows how to use pre-trained word embeddings in the model. There are four essential steps:
1. Loading the pre-trained word embeddings
2. Creating a tokenizer object
3. Transforming text documents into sequences of tokens and padding them
4. Creating a mapping of tokens and their respective embeddings
# load the pre-trained word-embedding vectors (the file path is an example, e.g. fastText vectors; adjust to your local file)
embeddings_index = {}
for i, line in enumerate(open('data/wiki-news-300d-1M.vec')):
    values = line.split()
    embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')

# create a tokenizer
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)

# create a token-embedding mapping (300-dimensional vectors)
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
2.4 Text / NLP based features
A number of extra text based features can also be created, which are sometimes helpful for improving text classification models. Some examples are the character count, punctuation count, title word count, and the frequency distribution of part of speech tags (noun, pronoun, verb, adjective and adverb counts). These features are highly experimental and should be used only if they suit the problem statement.
trainDF['char_count'] = trainDF['text'].apply(len)
trainDF['punctuation_count'] = trainDF['text'].apply(lambda x: len("".join(_ for _ in x if _ in string.punctuation)))
trainDF['title_word_count'] = trainDF['text'].apply(lambda x: len([wrd for wrd in x.split() if wrd.istitle()]))
pos_family = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' : ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}
# function to check and get the part of speech tag count of words in a given sentence
def check_pos_tag(x, flag):
    cnt = 0
    try:
        wiki = textblob.TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]
            if ppo in pos_family[flag]:
                cnt += 1
    except:
        pass
    return cnt
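For example, this function can then be applied to the dataframe to create one count column per part-of-speech family (the column names here are illustrative):

trainDF['noun_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'noun'))
trainDF['verb_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'verb'))
trainDF['adj_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adj'))
trainDF['adv_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'adv'))
trainDF['pron_count'] = trainDF['text'].apply(lambda x: check_pos_tag(x, 'pron'))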
2.5 Topic Models as features
Topic modelling is a technique for identifying groups of words (called "topics") that best describe a collection of documents. Here we use Latent Dirichlet Allocation (LDA) to generate topic features from the count vectors.

# train an LDA model
lda_model = decomposition.LatentDirichletAllocation(n_components=20, learning_method='online', max_iter=20)
X_topics = lda_model.fit_transform(xtrain_count)
topic_word = lda_model.components_
vocab = count_vect.get_feature_names()

# view the top words of each topic
n_top_words = 10
topic_summaries = []
for i, topic_dist in enumerate(topic_word):
    topic_words = numpy.array(vocab)[numpy.argsort(topic_dist)][:-(n_top_words+1):-1]
    topic_summaries.append(' '.join(topic_words))
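To inspect the topics that were learned, the collected summaries can simply be printed (a small usage sketch):

for i, summary in enumerate(topic_summaries):
    print("Topic {}: {}".format(i, summary))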
3. Model Building
The final step in the text classification framework is to train a classifier using the features created in the previous step. There are many different choices of machine learning models that can be used to train the final model. We will implement the following classifiers for this purpose:
1. Naive Bayes Classifier
2. Linear Classifier (Logistic Regression)
3. Bagging Models (Random Forest)
4. Boosting Models (Extreme Gradient Boosting)
5. Shallow Neural Networks
6. Deep Neural Networks (CNN, LSTM, GRU, bidirectional RNN, RCNN)
Let's implement these models and understand their details. The following utility function can be used to train any of them. It accepts a classifier, the feature vectors of the training data, the labels of the training data and the feature vectors of the validation data as inputs. Using these inputs, the model is trained and the accuracy score on the validation set is computed.
def train_model(classifier, feature_vector_train, label, feature_vector_valid, is_neural_net=False):
    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the labels on the validation dataset
    predictions = classifier.predict(feature_vector_valid)

    if is_neural_net:
        # the neural networks below end in a single sigmoid unit, so threshold the predicted probabilities
        predictions = (predictions > 0.5).astype(int).ravel()

    return metrics.accuracy_score(predictions, valid_y)
Naive Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. Let's train a Naive Bayes model on the different feature vectors we created.

# Naive Bayes on Count Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_count, train_y, xvalid_count)
print("NB, Count Vectors: ", accuracy)

# Naive Bayes on Word Level TF-IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf)
print("NB, WordLevel TF-IDF: ", accuracy)

# Naive Bayes on Ngram Level TF-IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("NB, N-Gram Vectors: ", accuracy)

# Naive Bayes on Character Level TF-IDF Vectors
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("NB, CharLevel Vectors: ", accuracy)
Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic (sigmoid) function. Let's train a linear classifier on the same feature vectors.

# Linear Classifier on Count Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_count, train_y, xvalid_count)
print("LR, Count Vectors: ", accuracy)

# Linear Classifier on Word Level TF-IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf)
print("LR, WordLevel TF-IDF: ", accuracy)

# Linear Classifier on Ngram Level TF-IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram)
print("LR, N-Gram Vectors: ", accuracy)

# Linear Classifier on Character Level TF-IDF Vectors
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf_ngram_chars, train_y, xvalid_tfidf_ngram_chars)
print("LR, CharLevel Vectors: ", accuracy)
Random Forest models are a type of ensemble model, specifically bagging models. They are part of the tree-based model family. You can read more about bagging and random forests here.
# RF on Count Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_count, train_y, xvalid_count)
print("RF, Count Vectors: ", accuracy)

# RF on Word Level TF-IDF Vectors
accuracy = train_model(ensemble.RandomForestClassifier(), xtrain_tfidf, train_y, xvalid_tfidf)
print("RF, WordLevel TF-IDF: ", accuracy)
Boosting models are another type of tree-based ensemble model. Boosting is an ensemble meta-algorithm used primarily to reduce bias, and also variance, in supervised learning; it is a family of machine learning algorithms that convert weak learners into strong ones. A weak learner is a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). Read more about these models here.
# Extreme Gradient Boosting on Count Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_count.tocsc(), train_y, xvalid_count.tocsc())
print("Xgb, Count Vectors: ", accuracy)

# Extreme Gradient Boosting on Word Level TF-IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf.tocsc(), train_y, xvalid_tfidf.tocsc())
print("Xgb, WordLevel TF-IDF: ", accuracy)

# Extreme Gradient Boosting on Character Level TF-IDF Vectors
accuracy = train_model(xgboost.XGBClassifier(), xtrain_tfidf_ngram_chars.tocsc(), train_y, xvalid_tfidf_ngram_chars.tocsc())
print("Xgb, CharLevel Vectors: ", accuracy)
Next, we implement a shallow neural network: a feed-forward model with a single hidden dense layer, trained here on the n-gram level TF-IDF vectors.

def create_model_architecture(input_size):
    # create the input, hidden and output layers of a fully connected network
    input_layer = layers.Input((input_size, ), sparse=True)
    hidden_layer = layers.Dense(100, activation="relu")(input_layer)
    output_layer = layers.Dense(1, activation="sigmoid")(hidden_layer)

    classifier = models.Model(inputs=input_layer, outputs=output_layer)
    classifier.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return classifier

classifier = create_model_architecture(xtrain_tfidf_ngram.shape[1])
accuracy = train_model(classifier, xtrain_tfidf_ngram, train_y, xvalid_tfidf_ngram, is_neural_net=True)
print("NN, Ngram Level TF-IDF Vectors: ", accuracy)
In convolutional neural networks (CNNs), convolutions over the input layer are used to compute
the output. This results in local connections, where each region of the input is connected
to a neuron in the output. Each layer applies different filters and combines their results.
def create_cnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the convolutional Layer
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(embedding_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return model

classifier = create_cnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("CNN, Word Embeddings: ", accuracy)
Unlike feed-forward neural networks, in which activation outputs are propagated only in one direction, in recurrent neural networks the activation outputs from neurons propagate in both directions (from inputs to outputs and from outputs to inputs). This creates loops in the neural network architecture which act as a 'memory state' for the neurons. This state allows the neurons to remember what has been learned so far.
The memory state in RNNs gives them an advantage over traditional neural networks, but a problem called the vanishing gradient is associated with them: when learning with a large number of layers, it becomes very hard for the network to learn and tune the parameters of the earlier layers. To address this problem, a new type of RNN called the LSTM (Long Short Term Memory) model was developed.
def create_rnn_lstm():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the LSTM Layer
    lstm_layer = layers.LSTM(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(lstm_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return model

classifier = create_rnn_lstm()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-LSTM, Word Embeddings: ", accuracy)
Gated Recurrent Units (GRU) are another form of recurrent neural network. Let's add a GRU layer instead of the LSTM layer in our network.
def create_rnn_gru():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the GRU Layer
    gru_layer = layers.GRU(100)(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(gru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return model

classifier = create_rnn_gru()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-GRU, Word Embeddings: ", accuracy)
RNN layers can be wrapped in bidirectional layers as well. Let's wrap our GRU layer in a bidirectional layer.
def create_bidirectional_rnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the bidirectional GRU Layer
    bigru_layer = layers.Bidirectional(layers.GRU(100))(embedding_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(bigru_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return model

classifier = create_bidirectional_rnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RNN-Bidirectional, Word Embeddings: ", accuracy)
Once the essential architectures have been tried out, one can experiment with variants of these layers, such as the recurrent convolutional neural network (RCNN). Other variants can also be explored.
def create_rcnn():
    # Add an Input Layer
    input_layer = layers.Input((70, ))

    # Add the word embedding Layer
    embedding_layer = layers.Embedding(len(word_index) + 1, 300, weights=[embedding_matrix], trainable=False)(input_layer)
    embedding_layer = layers.SpatialDropout1D(0.3)(embedding_layer)

    # Add the recurrent layer
    rnn_layer = layers.Bidirectional(layers.GRU(50, return_sequences=True))(embedding_layer)

    # Add the convolutional Layer on top of the recurrent outputs
    conv_layer = layers.Convolution1D(100, 3, activation="relu")(rnn_layer)

    # Add the pooling Layer
    pooling_layer = layers.GlobalMaxPool1D()(conv_layer)

    # Add the output Layers
    output_layer1 = layers.Dense(50, activation="relu")(pooling_layer)
    output_layer1 = layers.Dropout(0.25)(output_layer1)
    output_layer2 = layers.Dense(1, activation="sigmoid")(output_layer1)

    # Compile the model
    model = models.Model(inputs=input_layer, outputs=output_layer2)
    model.compile(optimizer=optimizers.Adam(), loss='binary_crossentropy')
    return model

classifier = create_rcnn()
accuracy = train_model(classifier, train_seq_x, train_y, valid_seq_x, is_neural_net=True)
print("RCNN, Word Embeddings: ", accuracy)
Improve Performance of Text Classifier
While the above framework can be applied to a number of text classification problems, a few additional techniques can help to further improve the performance of the classifiers:

1. Text Cleaning: text cleaning can help to reduce the noise present in text data in the form of stopwords, punctuation marks, suffix variations etc. This article can help in understanding how to implement text cleaning in detail.
2. Hstacking Text / NLP features with text feature vectors: in the feature engineering section, we generated a number of different feature vectors; combining them together can help to improve the accuracy of the classifier (a minimal sketch is shown after this list).
3. Hyperparameter Tuning in modelling: tuning the parameters is an important step; a number of parameters such as tree depth, number of leaves, network parameters etc. can be fine-tuned to obtain the best-fitting model.
4. Ensemble Models: stacking different models and blending their outputs can help to further improve the results. Read more about ensemble models here.
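As an illustration of idea 2, here is a minimal sketch, assuming the word level TF-IDF vectors from section 2.2 and the hand-crafted feature columns from section 2.4 are available; scipy (installed alongside scikit-learn) is used for the horizontal stacking, and the chosen columns are only examples:

from scipy import sparse

# pick a few of the hand-crafted NLP feature columns created in section 2.4
extra_cols = ['char_count', 'punctuation_count', 'title_word_count']
xtrain_extra = trainDF.loc[train_x.index, extra_cols].values
xvalid_extra = trainDF.loc[valid_x.index, extra_cols].values

# horizontally stack the dense NLP features with the sparse TF-IDF vectors
xtrain_combined = sparse.hstack([xtrain_tfidf, xtrain_extra]).tocsr()
xvalid_combined = sparse.hstack([xvalid_tfidf, xvalid_extra]).tocsr()

accuracy = train_model(linear_model.LogisticRegression(), xtrain_combined, train_y, xvalid_combined)
print("LR, WordLevel TF-IDF + NLP features: ", accuracy)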
End Notes
In this article, we discussed how to prepare a text dataset (cleaning it and creating training and validation sets), performed different types of feature engineering (count vectors, TF-IDF, word embeddings, topic models and basic text features), and finally trained a variety of classifiers (Naive Bayes, logistic regression, SVM, MLP, LSTM and GRU). At the end, we discussed different approaches to improve the performance of text classifiers.
Note: There is a video course, Natural Language Processing using Python, with 3 real-life projects, two of which involve text classification.
Did you find this article useful? Share your views and opinions in the comments section below.