Multi-Class Text Classification With Scikit-Learn
There are lots of applications of text classification in the commercial world. For
example, news stories are typically organized by topics; content or products are
often tagged by categories; users can be classified into cohorts based on how they
talk about a product or brand online …
However, the vast majority of text classification articles and tutorials on the
internet cover binary text classification, such as email spam filtering (spam vs. ham)
or sentiment analysis (positive vs. negative). In most cases, our real-world problems are
much more complicated than that. Therefore, this is what we are going to do today:
Classifying Consumer Finance Complaints into 12 pre-defined classes. The data can
be downloaded from data.gov.
We use Python and Jupyter Notebook to develop our system, relying on Scikit-Learn
for the machine learning components. If you would like to see an implementation in
PySpark, read the next article.
Problem Formulation
This is a supervised text classification problem, and our goal is to investigate
which supervised machine learning methods are best suited to solving it.
Given a new complaint, we want to assign it to one of 12 categories. The
classifier assumes that each new complaint belongs to one and only
one category, which makes this a multi-class text classification problem. I can’t wait to see what
we can achieve!
Data Exploration
Before diving into training machine learning models, we should look at some
examples first and the number of complaints in each class:
import pandas as pd

# Load the consumer complaints data downloaded from data.gov.
df = pd.read_csv('Consumer_Complaints.csv')
df.head()
Figure 1: the first five rows of the raw data.
For this project, we need only two columns: “Product” and “Consumer complaint
narrative”.
Input: Consumer complaint narrative
Output: Product
We will remove rows with missing values in the “Consumer complaint narrative” column, and add
a column that encodes each product as an integer, because categorical variables are often
better represented by integers than by strings.
After cleaning up, these are the first five rows of the data we will be working on:
# Keep only rows that actually contain a complaint narrative.
df = df[pd.notnull(df['Consumer complaint narrative'])]

# Encode each product as an integer id.
df['category_id'] = df['Product'].factorize()[0]
category_id_df = df[['Product', 'category_id']].drop_duplicates().sort_values('category_id')

# Lookup dictionaries to map between product names and ids.
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'Product']].values)
df.head()
Figure 2: the first five rows after cleaning, with the new category_id column.
Imbalanced Classes
We see that the number of complaints per product is imbalanced: consumers’
complaints skew heavily towards Debt collection, Credit reporting and Mortgage.
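A quick bar chart makes the skew easy to see; the plotting code below is a minimal
sketch, with figure size and styling as my own choices:

import matplotlib.pyplot as plt

# Number of complaints in each product category.
fig = plt.figure(figsize=(8, 6))
df.groupby('Product')['Consumer complaint narrative'].count().plot.bar(ylim=0)
plt.show()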
Figure 3: the number of complaints in each product category.
When we encounter such problems, we are bound to have difficulties solving them
with standard algorithms. Conventional algorithms are often biased towards the
majority class, not taking the data distribution into consideration; in the worst case,
minority classes are treated as outliers and ignored. In some cases, such as fraud
detection or cancer prediction, we would need to carefully configure our model or
artificially balance the dataset, for example by undersampling the majority classes or
oversampling the minority ones.
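As an aside, here is a hypothetical sketch of what oversampling a minority class
with scikit-learn's resample utility might look like; the class name and target
size are made up for illustration:

from sklearn.utils import resample

# Hypothetical: oversample one minority class to 5000 rows (illustration only).
minority = df[df['Product'] == 'Money transfers']
upsampled = resample(minority, replace=True, n_samples=5000, random_state=0)
df_balanced = pd.concat([df[df['Product'] != 'Money transfers'], upsampled])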
However, in our case of learning from imbalanced data, the majority classes might be of
the greatest interest to us. It is desirable to have a classifier that gives high prediction
accuracy on the majority classes, while maintaining reasonable accuracy for the
minority classes. Therefore, we will leave the dataset as it is.
Text Representation
The classifiers and learning algorithms cannot directly process the text documents
in their original form, as most of them expect numerical feature vectors of a fixed
size rather than raw text documents of variable length. Therefore, during the
preprocessing step, the texts are converted to a more manageable representation.
One common approach for extracting features from text is to use the bag of words
model: a model where for each document, a complaint narrative in our case, the
presence (and often the frequency) of words is taken into consideration, but the
order in which they occur is ignored.
Specifically, for each term in our dataset, we will calculate a measure called Term
Frequency-Inverse Document Frequency, abbreviated tf-idf. We will use
sklearn.feature_extraction.text.TfidfVectorizer to calculate a tf-idf vector for
each of the consumer complaint narratives:
norm is set to l2, to ensure all our feature vectors have a Euclidean norm of 1.
(Apart from norm='l2', the vectorizer parameters below are assumptions, chosen to
be consistent with the unigram and bigram features used later.)

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5, norm='l2',
                        ngram_range=(1, 2), stop_words='english')

features = tfidf.fit_transform(df['Consumer complaint narrative']).toarray()
labels = df.category_id
features.shape

(4569, 12633)

Each of the 4569 complaint narratives is now represented by 12633 tf-idf features.
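As a quick sanity check of what norm='l2' buys us, here is a toy sketch on a
made-up corpus: with l2 normalization, every document vector has unit Euclidean
length.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely for illustration.
toy = ["the loan was denied",
       "the credit report was wrong",
       "loan servicing problem"]

toy_features = TfidfVectorizer(norm='l2').fit_transform(toy)
print(np.linalg.norm(toy_features.toarray(), axis=1))  # [1. 1. 1.]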
We can use sklearn.feature_selection.chi2 to find the terms that are the most
correlated with each of the products:
from sklearn.feature_selection import chi2
import numpy as np

N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Chi-squared statistic of each feature against a one-vs-rest label.
    features_chi2 = chi2(features, labels == category_id)
    indices = np.argsort(features_chi2[0])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}':".format(Product))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
With these vector representations of the text in hand, we can train supervised
classifiers and use them to predict the “Product” of unseen “Consumer complaint
narrative” texts.
After all the above data transformations, we have all the features and
labels; it is time to train the classifiers. There are a number of algorithms we can use
for this type of problem.
Naive Bayes Classifier: the one most suitable for word counts is the
multinomial variant:
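A minimal sketch of such a pipeline, with the split parameters as assumptions,
could look like this:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train, X_test, y_train, y_test = train_test_split(
    df['Consumer complaint narrative'], df['Product'], random_state=0)

# Bag-of-words counts, tf-idf weighting, then a multinomial Naive Bayes fit.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = MultinomialNB().fit(X_train_tfidf, y_train)

# A new raw complaint goes through the same transformations before predicting:
# clf.predict(tfidf_transformer.transform(count_vect.transform([some_text])))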
Trying the fitted classifier on two example complaints yields:
[‘Debt collection’]
Figure 4: the example complaint texts.
[‘Credit reporting’]
Model Selection
We are now ready to experiment with different machine learning models, evaluate
their accuracy and find the source of any potential issues.
We benchmark the following four models:
Logistic Regression
(Multinomial) Naive Bayes
Linear Support Vector Machine
Random Forest
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

models = [
    RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]

# Evaluate each model with 5-fold cross-validation.
CV = 5
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])

cv_df.groupby('model_name').accuracy.mean()
cv_df.groupby('model_name').accuracy.mean()
model_name
LinearSVC                 0.822890
LogisticRegression        0.792927
MultinomialNB             0.688519
RandomForestClassifier    0.443826
Name: accuracy, dtype: float64
LinearSVC and Logistic Regression perform better than the other two classifiers,
with LinearSVC having a slight advantage: a mean cross-validation accuracy of around 82%.
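To see the spread across folds rather than just the means, a boxplot helps; a
minimal sketch:

import seaborn as sns
import matplotlib.pyplot as plt

# Per-fold accuracy distribution for each model.
sns.boxplot(x='model_name', y='accuracy', data=cv_df)
sns.stripplot(x='model_name', y='accuracy', data=cv_df,
              size=8, jitter=True, edgecolor="gray", linewidth=2)
plt.show()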
Model Evaluation
Continuing with our best model (LinearSVC), we are going to look at the confusion
matrix and show the discrepancies between predicted and actual labels.
model = LinearSVC()
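Only the model instantiation is shown above; a minimal sketch of the evaluation
that produces the confusion matrix, with the test split size and plotting choices
as assumptions:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Hold out a third of the data for evaluation.
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(
    features, labels, df.index, test_size=0.33, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are actual labels, columns are predicted labels.
conf_mat = confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(conf_mat, annot=True, fmt='d',
            xticklabels=category_id_df.Product.values,
            yticklabels=category_id_df.Product.values)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()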
Figure 7: confusion matrix of actual vs. predicted labels for LinearSVC.
The vast majority of the predictions end up on the diagonal (predicted label =
actual label), where we want them to be. However, there are a number of
misclassifications, and it might be interesting to see what caused them:
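One way to dig in is to list the most-confused class pairs together with a couple
of examples for each, assuming conf_mat, y_pred and indices_test from the sketch
above; the threshold of 10 confusions is arbitrary:

import numpy as np
from IPython.display import display

for actual in category_id_df.category_id:
    for predicted in category_id_df.category_id:
        if predicted != actual and conf_mat[actual, predicted] >= 10:
            print("'{}' predicted as '{}' : {} examples.".format(
                id_to_category[actual], id_to_category[predicted],
                conf_mat[actual, predicted]))
            mask = (np.asarray(y_test) == actual) & (y_pred == predicted)
            display(df.loc[indices_test[mask],
                           ['Product', 'Consumer complaint narrative']].head(2))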
Figures 8 and 9: examples of misclassified complaints.
As you can see, some of the misclassified complaints touch on
more than one subject (for example, complaints involving both credit card and
credit report). This sort of error will always happen.
This time, instead of the chi-squared statistic, we look at the terms with the
largest LinearSVC coefficients for each of the categories:
model.fit(features, labels)

N = 2
for Product, category_id in sorted(category_to_id.items()):
    # Features sorted by this category's LinearSVC coefficient.
    indices = np.argsort(model.coef_[category_id])
    feature_names = np.array(tfidf.get_feature_names())[indices]
    unigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 1][:N]
    bigrams = [v for v in reversed(feature_names) if len(v.split(' ')) == 2][:N]
    print("# '{}':".format(Product))
    print("  . Top unigrams:\n       . {}".format('\n       . '.join(unigrams)))
    print("  . Top bigrams:\n       . {}".format('\n       . '.join(bigrams)))
Source code can be found on GitHub. I look forward to hearing any feedback or
questions.
Written by Susan Li