Assignment Prepared By
Moin Mostakim
April 25, 2025
1. Introduction
This document outlines the process and requirements to develop a neural network model
for clustering using open datasets available on Hugging Face. Clustering is an unsupervised
task where the model identifies intrinsic groupings in the data without labeled outputs.
2. Objective
To design and implement a neural network capable of performing clustering on a selected
Hugging Face dataset, with evaluation on metrics such as Silhouette Score, Davies-Bouldin
Index, or cluster separation.
3. Requirements
3.2 Libraries
pip install torch torchvision datasets transformers scikit-learn matplotlib
Build A Neural Network Model Clustering via NN
3.3 Dataset
Choose a dataset from Hugging Face: https://huggingface.co/datasets
Examples:
• Decoder: Mirror of the encoder for reconstruction (used only during training).
\[ L = L_{\text{reconstruction}} + \lambda \cdot L_{\text{clustering}} \]
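The combined objective can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not a reference solution: the `AutoEncoder` class, the layer sizes, the weight `lam = 0.1`, and the use of squared distance to the nearest centroid as the clustering term are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Hypothetical autoencoder: the decoder mirrors the encoder."""
    def __init__(self, input_dim=784, hidden_dim=128, latent_dim=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # latent embedding
        return z, self.decoder(z)    # embedding and reconstruction

def combined_loss(x, x_hat, z, centroids, lam=0.1):
    """L = L_reconstruction + lam * L_clustering (clustering term is an assumed choice)."""
    l_rec = nn.functional.mse_loss(x_hat, x)
    # Squared distance from each latent point to its nearest centroid
    d2 = torch.cdist(z, centroids).pow(2)      # shape: (batch, n_clusters)
    l_clu = d2.min(dim=1).values.mean()
    return l_rec + lam * l_clu, l_rec, l_clu

torch.manual_seed(0)
model = AutoEncoder()
centroids = nn.Parameter(torch.randn(3, 10))   # 3 hypothetical cluster centres
x = torch.rand(16, 784)                        # random stand-in for a data batch
z, x_hat = model(x)
loss, l_rec, l_clu = combined_loss(x, x_hat, z, centroids)
loss.backward()                                # gradients reach model and centroids
```

A larger `lam` pulls embeddings toward the centroids more strongly at the expense of reconstruction quality, so it is usually tuned alongside the learning rate.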
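• Loss is KL divergence between soft cluster assignments and auxiliary target distribution:
\[ L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \]
Where:
\[ q_{ij} = \frac{\left(1 + \|z_i - \mu_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_k \left(1 + \|z_i - \mu_k\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}} \]
\[ p_{ij} = \frac{q_{ij}^2 / \sum_i q_{ij}}{\sum_k \left( q_{ik}^2 / \sum_i q_{ik} \right)} \]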
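The soft assignment q_ij, the target distribution p_ij, and the KL loss can be checked numerically with a short NumPy sketch. The function names, the random toy embeddings, and the default `alpha = 1.0` are assumptions made for illustration; in a real model `z` would come from the encoder and `mu` would be trainable centroids.

```python
import numpy as np

def soft_assignments(z, mu, alpha=1.0):
    """q_ij: Student's t similarity between embedding z_i and centroid mu_j."""
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)                    # normalise over clusters

def target_distribution(q):
    """p_ij: q_ij^2 divided by the soft cluster frequency, normalised over clusters."""
    w = q ** 2 / q.sum(axis=0, keepdims=True)
    return w / w.sum(axis=1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all samples i and clusters j."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 2))     # 6 toy embeddings in a 2-dimensional latent space
mu = rng.normal(size=(3, 2))    # 3 toy centroids
q = soft_assignments(z, mu)
p = target_distribution(q)
loss = kl_divergence(p, q)
```

Each row of `q` and `p` sums to 1, so both can be read as per-sample probability distributions over clusters.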
1 Visualization
5. Clustering Algorithm
Apply a clustering algorithm to latent embeddings:
• K-Means
• DBSCAN
• Hierarchical Clustering
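As a minimal sketch of this step, the snippet below runs K-Means from scikit-learn on a synthetic array that stands in for the latent embeddings. The blob data, the latent dimension of 10, and `n_clusters=3` are assumptions for illustration; in practice `embeddings` would be the encoder output for the chosen dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for encoder outputs: three well-separated blobs in a 10-dim latent space
centres = np.array([[-5.0] * 10, [0.0] * 10, [5.0] * 10])
embeddings = np.vstack([c + 0.3 * rng.normal(size=(50, 10)) for c in centres])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(embeddings)    # one cluster index per embedding
```

`sklearn.cluster.DBSCAN` and `sklearn.cluster.AgglomerativeClustering` expose the same `fit_predict` method, so either can be swapped in for the other two options in the list.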
6. Evaluation Metrics
• Silhouette Score
• Davies-Bouldin Index
• Calinski-Harabasz Index
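All three metrics are available in `sklearn.metrics` and need only the embeddings and the predicted cluster labels. The sketch below uses synthetic two-dimensional blobs and K-Means labels as stand-ins; the data and the cluster count are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    davies_bouldin_score,
    calinski_harabasz_score,
)

rng = np.random.default_rng(0)
# Stand-in embeddings: three well-separated 2-D blobs
centres = np.array([[-5.0, -5.0], [0.0, 0.0], [5.0, 5.0]])
embeddings = np.vstack([c + 0.3 * rng.normal(size=(50, 2)) for c in centres])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)

sil = silhouette_score(embeddings, labels)          # range [-1, 1], higher is better
dbi = davies_bouldin_score(embeddings, labels)      # >= 0, lower is better
chi = calinski_harabasz_score(embeddings, labels)   # higher is better
print(f"Silhouette: {sil:.3f}, Davies-Bouldin: {dbi:.3f}, Calinski-Harabasz: {chi:.1f}")
```

None of these metrics uses ground-truth labels, which is what makes them suitable for an unsupervised task.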
7. Expected Output
• Clustered representations
5 Set Hyperparameters
Set the size of layers, learning rate, batch size, and number of training epochs.
input_size = 784 # 28x28 images flattened
hidden_size = 128
output_size = 10 # Number of classes in MNIST
learning_rate = 0.001
batch_size = 64
epochs = 5
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MyMLP(input_size, hidden_size, output_size).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
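The definition of `MyMLP` does not appear in this excerpt. One plausible shape, given the constructor arguments used above, is a single hidden layer with a ReLU activation; the layer names and the choice of activation below are assumptions, not the original definition.

```python
import torch
import torch.nn as nn

class MyMLP(nn.Module):
    """Assumed definition: one hidden layer with ReLU, raw logits as output."""
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # No softmax here: nn.CrossEntropyLoss expects unnormalised logits
        return self.fc2(self.relu(self.fc1(x)))

mlp = MyMLP(784, 128, 10)
logits = mlp(torch.rand(4, 784))    # batch of 4 flattened 28x28 inputs
```

The output of `fc1` is a 128-dimensional hidden representation, which could also serve as an embedding for a clustering step.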
8 Training Loop
Perform training for the specified number of epochs.
for epoch in range(epochs):
    model.train()
    for batch_idx, (data, targets) in enumerate(train_loader):
        # Flatten each image to a vector and move the batch to the device
        data = data.view(data.size(0), -1).to(device)
        targets = targets.to(device)
        optimizer.zero_grad()
        # Forward pass and loss
        outputs = model(data)
        loss = criterion(outputs, targets)
        # Backward pass and parameter update
        loss.backward()
        optimizer.step()
    # Loss of the last batch in this epoch
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")
9 Evaluation
Evaluate the trained model on the test dataset.
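model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data, targets in test_loader:
        data = data.view(data.size(0), -1).to(device)
        targets = targets.to(device)
        # Predicted class is the index of the largest logit
        outputs = model(data)
        predicted = outputs.argmax(dim=1)
        total += targets.size(0)
        correct += (predicted == targets).sum().item()
print(f"Test accuracy: {100 * correct / total:.2f}%")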
References
• Hugging Face Datasets: https://huggingface.co/docs/datasets/index
• PyTorch: https://pytorch.org