Clustering Text Documents using K-Means in Scikit Learn

Last Updated : 15 May, 2025

Clustering text documents is a common problem in Natural Language Processing (NLP) where similar documents are grouped based on their content. K-Means clustering is a popular clustering technique used for this purpose. In this article we'll learn how to perform text document clustering using the K-Means algorithm in Scikit-Learn.

Implementation using Python

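In this walkthrough we cluster news headlines taken from a sarcasm detection dataset. Sarcasm can make a sentence mean the opposite of what it literally says, which can confuse systems that analyze sentiment, so it is interesting to see how an unsupervised method such as K-Means groups these headlines.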

Step 1: Import Necessary Libraries

We need a few Python libraries for this task: NumPy, pandas, requests, Matplotlib and scikit-learn.

Python
import json
import numpy as np
import pandas as pd
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Step 2: Load the Dataset

Now let's load the dataset of sarcasm headlines. We download it with requests.get(url), and the .json() method parses the response body from JSON into Python objects (lists and dictionaries). Then we create a pandas DataFrame df to make the data easier to work with.

Python
url = "https://raw.githubusercontent.com/PawanKrGunjan/Natural-Language-Processing/main/Sarcasm%20Detection/sarcasm.json"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
data = response.json()       # parse the JSON body into Python objects
df = pd.DataFrame(data)
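
Before vectorizing, it can help to confirm that the text column exists and contains no missing or blank entries. The sketch below is a minimal illustration on hypothetical in-memory records: the helper name to_clean_frame, the sample headlines and the is_sarcastic field are assumptions made for the example, not something shown in the download step above.

```python
import pandas as pd

# Hypothetical records mimicking an assumed layout of the downloaded JSON.
# Only the 'headline' field is used by the article; 'is_sarcastic' is assumed.
sample_records = [
    {"headline": "scientists discover water is wet", "is_sarcastic": 1},
    {"headline": "city council approves new budget", "is_sarcastic": 0},
    {"headline": None, "is_sarcastic": 0},
]


def to_clean_frame(records, text_column="headline"):
    """Build a DataFrame and drop rows whose text is missing or blank."""
    frame = pd.DataFrame(records)
    if text_column not in frame.columns:
        raise KeyError(f"expected a '{text_column}' column, got {list(frame.columns)}")
    frame = frame.dropna(subset=[text_column])
    frame = frame[frame[text_column].str.strip() != ""]
    return frame.reset_index(drop=True)


sample_df = to_clean_frame(sample_records)
print(sample_df)
```

The same helper could be applied to the parsed JSON from the real download, provided it has the same record layout.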

Step 3: Convert Text to Numeric Representation using TF-IDF

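K-Means operates on numeric vectors, so we first convert the headlines into numbers. We use TF-IDF (term frequency-inverse document frequency) for this, which gives a word a higher weight when it is frequent in one document but rare across the whole collection.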

  • TfidfVectorizer converts text into a numeric format.
  • stop_words='english' removes common words like "the" and "and" that don't add much meaning.
  • fit_transform(sentence) learns the vocabulary and returns a sparse TF-IDF matrix where each row represents a document (headline), each column represents a word in the vocabulary, and each value is that word's weight in that document.
Python
sentence = df['headline']
vectorizer = TfidfVectorizer(stop_words='english')
vectorized_documents = vectorizer.fit_transform(sentence)

Step 4: Reduce Dimensionality using PCA

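TF-IDF produces one dimension per vocabulary word, which is far too many to plot directly, so we project the data onto 2 dimensions for visualization.

  • PCA(n_components=2) keeps the two directions of greatest variance so that each headline can be drawn as a point on a plane.
  • .toarray() converts the sparse TF-IDF matrix to a dense array before PCA. With a large vocabulary this dense array can require a lot of memory.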
Python
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(vectorized_documents.toarray())
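
If the dense array is too large for memory, one possible alternative (not the method used in this article) is TruncatedSVD, which works directly on sparse matrices. Unlike PCA it does not center the data, so the resulting coordinates generally differ from the PCA projection. The sketch below uses made-up headlines, and the variable names are assumptions for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines used only for illustration.
toy_docs = [
    "cat chases mouse around the house",
    "kitten and cat sleep on the sofa",
    "mouse escapes from hungry cat",
    "stock market falls after rate hike",
    "investors sell stock as market drops",
    "market rally lifts bank stock prices",
]

toy_tfidf = TfidfVectorizer(stop_words="english").fit_transform(toy_docs)

# TruncatedSVD accepts the sparse matrix as-is, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=2, random_state=42)
toy_reduced = svd.fit_transform(toy_tfidf)

print(toy_reduced.shape)
print(svd.explained_variance_ratio_)
```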

Step 5: Applying K-Means Clustering

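We will now apply the K-Means algorithm to group the headlines into two clusters. K-Means is unsupervised: it never sees the sarcasm labels, so the clusters it finds are not guaranteed to match the sarcastic and not sarcastic categories.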

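  • KMeans(n_clusters=2): We choose 2 clusters since the dataset has headlines labeled as either sarcastic or not sarcastic.
  • n_init=5: Runs K-Means 5 times with different initial centroids and keeps the run with the lowest inertia (within-cluster sum of squared distances).
  • max_iter=500: Each run may perform up to 500 iterations, stopping earlier if it converges.
  • random_state=42: Fixes the random seed so that results are reproducible.
  • The model is fitted on the full TF-IDF matrix, not on the 2-dimensional PCA output, which is used only for plotting.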
Python
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, n_init=5, max_iter=500, random_state=42)
kmeans.fit(vectorized_documents)
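
The following sketch repeats the same KMeans settings on six made-up headlines so the fitted attributes are easy to inspect. The toy_ names and example sentences are assumptions for illustration; labels_, inertia_, n_iter_ and predict are standard scikit-learn KMeans members.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up headlines used only for illustration.
toy_headlines = [
    "cat chases mouse around the house",
    "kitten and cat sleep on the sofa",
    "mouse escapes from hungry cat",
    "stock market falls after rate hike",
    "investors sell stock as market drops",
    "market rally lifts bank stock prices",
]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_headlines)

toy_kmeans = KMeans(n_clusters=2, n_init=5, max_iter=500, random_state=42)
toy_kmeans.fit(toy_matrix)

print(toy_kmeans.labels_)   # cluster id assigned to each headline
print(toy_kmeans.inertia_)  # within-cluster sum of squared distances
print(toy_kmeans.n_iter_)   # iterations used by the best run

# A new headline must go through the same fitted vectorizer before predict().
new_label = toy_kmeans.predict(
    toy_vectorizer.transform(["cat naps while stock market closes"])
)
print(new_label)
```

Note that the cluster ids themselves (0 or 1) carry no meaning; only the grouping does.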

Output:

[Figure: KMeans Clustering]

Step 6: Storing Clustering Results

After clustering we store the results in a DataFrame for easy viewing.

  • kmeans.labels_ contains the cluster label for each headline (0 or 1).
  • We print 5 random samples of the results to check the clustering.
Python
results = pd.DataFrame()
results['document'] = sentence
results['cluster'] = kmeans.labels_

print(results.sample(5))

Output:

[Figure: Clustering results]

Step 7: Visualizing Clusters

Finally we visualize the clustered headlines in a scatter plot.

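  • We use plt.scatter to plot the data points at their 2-dimensional PCA coordinates.
  • Each cluster is drawn in its own color: red for cluster 0 and green for cluster 1. The code labels these "Not Sarcastic" and "Sarcastic", but K-Means cluster numbers are arbitrary, so this mapping is an assumption rather than something the algorithm determines.
  • The scatter plot shows how K-Means has grouped the headlines.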
Python
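colors = ['red', 'green']
# Names given to cluster 0 and cluster 1; K-Means itself does not know these categories
cluster_labels = ['Not Sarcastic', 'Sarcastic']
for i in range(num_clusters):
    # Select the PCA coordinates of the headlines assigned to cluster i
    mask = kmeans.labels_ == i
    plt.scatter(reduced_data[mask, 0],
                reduced_data[mask, 1],
                s=10, color=colors[i],
                label=cluster_labels[i])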

plt.legend()
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('K-Means Clustering of Sarcasm Headlines')
plt.show()

Output:

[Figure: Text clustering using KMeans]

The scatter plot shows the K-Means cluster assignments for the headlines, projected onto the first two PCA components. Red points belong to the cluster labeled Not Sarcastic and green points to the cluster labeled Sarcastic. Because K-Means never uses the true labels, the plot alone does not show how well these clusters match actual sarcasm. Even so, the example shows how TF-IDF and K-Means can be combined in scikit-learn to explore structure in text data.
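
When true labels are available, agreement between cluster assignments and labels can be measured rather than judged by eye. The sketch below uses the adjusted Rand index from scikit-learn together with a pandas cross-tabulation on small hand-written label lists; the values and the column name is_sarcastic are assumptions for illustration, not results from the sarcasm dataset.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hand-written labels used only for illustration.
true_labels = pd.Series([0, 0, 0, 1, 1, 1], name="is_sarcastic")
# Same grouping as the true labels, but with the cluster ids swapped.
swapped_clusters = pd.Series([1, 1, 1, 0, 0, 0], name="cluster")
# A grouping that cuts across the true labels.
mixed_clusters = pd.Series([0, 1, 0, 1, 0, 1], name="cluster")

# The adjusted Rand index ignores how clusters are numbered:
# identical groupings score 1.0, and unrelated groupings score near 0.
swapped_score = adjusted_rand_score(true_labels, swapped_clusters)
mixed_score = adjusted_rand_score(true_labels, mixed_clusters)
print(swapped_score, mixed_score)

# The cross-tabulation shows which cluster id each true label falls into.
table = pd.crosstab(true_labels, swapped_clusters)
print(table)
```

In the article's pipeline, the same two calls could be applied to kmeans.labels_ and the dataset's label column, assuming such a column is present in df.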