Clustering Text Documents using K-Means in Scikit Learn

Last Updated : 15 May, 2025

Clustering text documents is a common problem in Natural Language Processing (NLP) where similar documents are grouped based on their content. K-Means clustering is a popular clustering technique used for this purpose. In this article we'll learn how to perform text document clustering using the K-Means algorithm in Scikit-Learn.

Implementation using Python

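In this example we cluster a dataset of news headlines in which each headline is labeled as sarcastic or not sarcastic. Sarcasm can make a sentence mean the opposite of what its words say, which can confuse sentiment-analysis systems. K-Means is unsupervised: it groups the headlines by the similarity of their TF-IDF vectors and does not use the labels.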

Step 1: Import Necessary Libraries

We use NumPy, pandas, Matplotlib and scikit-learn for this task, plus requests to download the dataset.

Python

import json
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Step 2: Load the Dataset

Now let's load the dataset of sarcasm headlines. We download it with requests.get(url), and the .json() method parses the JSON response into Python objects. Then we create a pandas DataFrame df to make the data easier to work with.

Python

url = "https://raw.githubusercontent.com/PawanKrGunjan/Natural-Language-Processing/main/Sarcasm%20Detection/sarcasm.json"
response = requests.get(url)
data = response.json()
df = pd.DataFrame(data)

Step 3: Convert Text to Numeric Representation using TF-IDF

We need to convert the text data into a format that the K-Means algorithm can understand (numbers). We use TF-IDF for this.

TfidfVectorizer converts text into a numeric format.
stop_words='english' removes common words such as "the" and "and" that don't add much meaning.
fit_transform(sentence) learns the vocabulary and returns a sparse TF-IDF matrix where each row represents a headline and each column represents a term; each value is the TF-IDF weight of that term in that headline.

Python

sentence = df['headline']
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)

Step 4: Reduce Dimensionality using PCA

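Since TF-IDF produces a high-dimensional matrix (one column per term), we reduce it to two dimensions so that it can be plotted. The reduced data is used only for visualization; the clustering in Step 5 runs on the full TF-IDF matrix.

PCA(n_components=2) projects the data onto its first two principal components.
The code converts the sparse TF-IDF matrix to a dense array with .toarray() before passing it to PCA, which can need a lot of memory when the vocabulary is large.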

Python

pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())
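If the dense array produced by .toarray() does not fit in memory, one possible alternative is TruncatedSVD, which accepts sparse input directly. This is a swapped-in technique rather than the article's method: unlike PCA it does not center the data, so the 2-D projection will differ somewhat. The sketch below uses a made-up toy corpus, and the names docs, X_sparse and reduced_sparse are illustrative.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; in the article this would be df['headline']
docs = [
    "tech shares lift stock market",
    "stock market rallies on tech shares",
    "football team wins championship final",
    "football team loses championship final",
]
X_sparse = TfidfVectorizer(stop_words='english').fit_transform(docs)

# TruncatedSVD works on the sparse matrix, so no .toarray() call is needed
svd = TruncatedSVD(n_components=2, random_state=42)
reduced_sparse = svd.fit_transform(X_sparse)

print(reduced_sparse.shape)  # one 2-D point per document
```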

Step 5: Applying K-Means Clustering

We will now apply the K-Means algorithm to split the headlines into two clusters. Because K-Means is unsupervised, the clusters are not guaranteed to correspond to the sarcastic and not sarcastic labels.

KMeans(n_clusters=2): We choose 2 clusters since the dataset has headlines labeled as either sarcastic or not sarcastic.
n_init=5: Runs K-Means 5 times with different initial centroids and keeps the run with the lowest inertia (within-cluster sum of squared distances).
max_iter=500: Allows up to 500 iterations per run; a run stops earlier if it converges.
random_state=42: Fixes the random seed so that results are reproducible.

Python

num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5, max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)
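To illustrate what n_init and max_iter control, here is a simplified NumPy sketch of the basic K-Means loop (Lloyd's algorithm) with several restarts. It is not scikit-learn's implementation, which by default uses k-means++ initialization and optimized code; the function names lloyd_kmeans and nearest_center and the toy data are hypothetical.

```python
import numpy as np

def nearest_center(X, centers):
    """Index of the closest center for each row of X (squared Euclidean)."""
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def lloyd_kmeans(X, k, max_iter=500, seed=0):
    """One K-Means run from one random initialization."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        labels = nearest_center(X, centers)          # assignment step
        new_centers = np.array([                     # update step
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # converged before reaching max_iter
        centers = new_centers
    labels = nearest_center(X, centers)
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, inertia

# Two well-separated toy blobs (made-up data)
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 0.1, (20, 2)),
               data_rng.normal(5, 0.1, (20, 2))])

# Equivalent in spirit to n_init=5: keep the run with the lowest inertia
runs = [lloyd_kmeans(X, k=2, seed=s) for s in range(5)]
best_labels, best_centers, best_inertia = min(runs, key=lambda r: r[2])
print(best_labels, best_inertia)
```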

Step 6: Storing Clustering Results

After clustering we store the results in a DataFrame for easy viewing.

kmeans.labels_ contains the cluster label for each headline (0 or 1).
We print 5 random samples of the results to check the clustering.

Python

results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_

print(results.sample(5))

Step 7: Visualizing Clusters

Finally we visualize the clustered headlines in a scatter plot.

We use plt.scatter to plot the data points.
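Each cluster is drawn in its own color: red for cluster 0 and green for cluster 1. The code names them "Not Sarcastic" and "Sarcastic", but K-Means numbers its clusters arbitrarily, so this naming is an assumption that should be checked against the dataset's labels.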
The scatter plot shows how K-Means has grouped the headlines.

Python

colors = ['red', 'green']
# Assumed names: K-Means cluster ids are arbitrary, so verify them against the true labels
cluster_labels = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    plt.scatter(reduced_data[kmeans.labels_ == i, 0], 
                reduced_data[kmeans.labels_ == i, 1], 
                s=10, color=colors[i], 
                label=f'{cluster_labels[i]}')

plt.legend()
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of Sarcasm Headlines')
plt.show()

Output:

Figure: Text clustering using KMeans

The scatter plot shows the K-Means clusters projected onto the first two principal components. Red points belong to cluster 0 (named Not Sarcastic in the code) and green points belong to cluster 1 (named Sarcastic). Because K-Means is unsupervised, the clusters reflect similarity between TF-IDF vectors and need not match the true sarcasm labels, so the cluster names should be compared with the labels before drawing conclusions. The example shows how TF-IDF and K-Means can be combined to cluster text with scikit-learn.
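A simple way to check how the clusters relate to the true labels is a cross-tabulation together with the adjusted Rand index, which does not depend on how the clusters are numbered. The values below are made up; on the real data cluster_ids would be kmeans.labels_ and true_labels would be the label column of df, assumed here to be named is_sarcastic.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical values for illustration only
cluster_ids = [0, 0, 1, 1, 1, 0]
true_labels = [1, 1, 0, 0, 1, 1]

# Rows: true label, columns: cluster id, cells: number of headlines
table = pd.crosstab(pd.Series(true_labels, name='is_sarcastic'),
                    pd.Series(cluster_ids, name='cluster'))
print(table)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(true_labels, cluster_ids)
print(round(ari, 3))

# Swapping the cluster numbers leaves the score unchanged
ari_swapped = adjusted_rand_score(true_labels, [1 - c for c in cluster_ids])
```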

akshat22roy
