phase 3
In this phase, we select suitable algorithms, train the models on the processed data, and
evaluate their performance. We choose algorithms that are well suited to deep clustering and
market segmentation. Hyperparameter tuning is performed to optimize model performance, and
several evaluation metrics are used to assess clustering quality and reconstruction accuracy.
Cross-validation is also performed to check that the model generalizes to unseen data.
For the Advanced Market Segmentation using Deep Clustering project, the key algorithms
are an autoencoder for dimensionality reduction and K-Means for clustering.
Source code :
# Import necessary libraries
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score
# Assume 'data_scaled' is the preprocessed (cleaned, scaled) dataset and
# 'autoencoder' is a Keras model that has already been defined
# Train the autoencoder to reconstruct its own input
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(data_scaled, data_scaled, epochs=50, batch_size=256, validation_split=0.2)
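The training call above presumes that `autoencoder` already exists. A minimal sketch of one way to define it with the Keras functional API follows. The layer sizes, the 2-dimensional bottleneck, the `encoder` name, the reduced epoch count, and the random stand-in data are all illustrative assumptions, not the project's actual configuration.

```python
import numpy as np
import tensorflow as tf

# Stand-in for the preprocessed dataset (assumption: 500 rows, 10 scaled features)
rng = np.random.default_rng(0)
data_scaled = rng.normal(size=(500, 10)).astype("float32")

input_dim = data_scaled.shape[1]
latent_dim = 2  # assumed bottleneck size

# Encoder: compress the input to the latent representation
inputs = tf.keras.Input(shape=(input_dim,))
hidden = tf.keras.layers.Dense(8, activation="relu")(inputs)
latent = tf.keras.layers.Dense(latent_dim, activation="relu", name="latent")(hidden)

# Decoder: reconstruct the input from the latent representation
hidden_dec = tf.keras.layers.Dense(8, activation="relu")(latent)
outputs = tf.keras.layers.Dense(input_dim, activation="linear")(hidden_dec)

autoencoder = tf.keras.Model(inputs, outputs)
encoder = tf.keras.Model(inputs, latent)  # shares its layers with the autoencoder

autoencoder.compile(optimizer="adam", loss="mean_squared_error")
autoencoder.fit(data_scaled, data_scaled, epochs=5, batch_size=64,
                validation_split=0.2, verbose=0)

# The encoder output, not the autoencoder output, is what gets clustered
latent_features = encoder.predict(data_scaled, verbose=0)
print(latent_features.shape)  # (500, 2)
```

Because `encoder` is built from the same layer objects as `autoencoder`, training the autoencoder also trains the encoder. No separate fitting step is needed.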
Hyperparameter tuning is a crucial step to ensure that the model performs optimally. In this
project, we will perform grid search for the K-Means algorithm to find the best number of
clusters. Additionally, the autoencoder model's architecture and training parameters (e.g.,
learning rate, batch size) can be tuned using techniques like random search or Bayesian
optimization.
Source code for grid search for K-Means to find the best number of clusters
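# Assume 'data_scaled' is the preprocessed (cleaned, scaled) dataset
# Step 1: Train the autoencoder
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.fit(data_scaled, data_scaled, epochs=50, batch_size=256, validation_split=0.2)
# Step 2: Extract latent features
# Use the encoder half of the model ('encoder' is assumed to be defined alongside
# 'autoencoder'); autoencoder.predict would return reconstructions, not latent features
latent_features = encoder.predict(data_scaled)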
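The grid search over the number of clusters can be sketched as follows. This is a minimal illustration that assumes the silhouette score is the selection criterion. The search range of 2 to 8 clusters and the synthetic `make_blobs` data standing in for the latent features are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the latent features (assumption: synthetic, well-separated groups)
latent_features, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6,
                                random_state=42)

candidate_k = range(2, 9)  # assumed search range for the number of clusters
scores = {}
for k in candidate_k:
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(latent_features)
    # Higher silhouette means tighter, better-separated clusters
    scores[k] = silhouette_score(latent_features, labels)

# Keep the k with the highest silhouette score
best_k = max(scores, key=scores.get)
print(f"Best number of clusters: {best_k} (silhouette = {scores[best_k]:.3f})")
```

Other criteria, such as the inertia "elbow", could be substituted for the silhouette score in the same loop.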
The performance of the model is evaluated using several metrics that measure clustering quality
and the reconstruction accuracy of the autoencoder. These include:
1. Silhouette Score – Measures how similar data points are within their cluster compared to
other clusters. A higher score indicates better clustering.
Source code :
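A minimal sketch of computing the silhouette score for a fitted K-Means model. The synthetic data and the choice of 3 clusters are assumptions standing in for the project's latent features and tuned cluster count.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the latent features produced by the encoder (assumption)
latent_features, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Assign each point to a cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(latent_features)

# The score lies in [-1, 1]; values near 1 indicate compact, well-separated clusters
sil = silhouette_score(latent_features, clusters)
print(f"Silhouette Score: {sil:.4f}")
```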
2. Adjusted Rand Index (ARI) – Measures the similarity between the predicted clusters and
ground truth labels, adjusting for chance. ARI values closer to 1 indicate better alignment
with true labels.
Source code :
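A minimal sketch of computing the ARI. It assumes ground truth segment labels are available, which is often not the case in market segmentation. Here the hypothetical `true_labels` come from the synthetic data generator.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic features with known group membership (assumption: stand-in data)
latent_features, true_labels = make_blobs(n_samples=300, centers=3, random_state=0)

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(latent_features)

# ARI compares the two partitions, so cluster IDs need not match label IDs
ari = adjusted_rand_score(true_labels, clusters)
print(f"Adjusted Rand Index: {ari:.4f}")

# Identical partitions with swapped IDs still score a perfect 1.0
relabelled_ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])
print(relabelled_ari)  # 1.0
```

When no ground truth exists, the ARI can still be used to compare two clustering runs with each other, for example to check stability across random seeds.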
3. Mean Squared Error (MSE) – Measures the difference between the original and
reconstructed data, indicating how well the autoencoder captures the data's structure.
Source code :
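# Reconstruct the inputs with the trained autoencoder
reconstructed_data = autoencoder.predict(data_scaled)
# Calculate mean squared error (MSE) between original and reconstructed data
reconstruction_loss = np.mean(np.square(data_scaled - reconstructed_data))
print(f"Reconstruction Loss: {reconstruction_loss:.4f}")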
3.5 Cross-Validation
Source code:
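from sklearn.model_selection import KFold
# 5 shuffled folds is an assumed setting; 'clusters' holds the K-Means labels as a NumPy array
kf = KFold(n_splits=5, shuffle=True, random_state=42)
silhouette_scores = []
# Perform cross-validation
for train_index, test_index in kf.split(latent_features):
    X_train, X_test = latent_features[train_index], latent_features[test_index]
    y_train, y_test = clusters[train_index], clusters[test_index]
    # Assumed completion: score the held-out fold using its cluster assignments
    silhouette_scores.append(silhouette_score(X_test, y_test))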
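A self-contained sketch of fold-wise evaluation that refits K-Means on each training fold and scores the held-out fold. This checks whether centroids learned on one subset still separate unseen points. The fold count, the cluster count, and the synthetic stand-in data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import KFold

# Stand-in for the latent features (assumption: synthetic data)
latent_features, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.6,
                                random_state=42)
n_clusters = 4  # assumed result of the earlier search over k

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_index, test_index in kf.split(latent_features):
    X_train, X_test = latent_features[train_index], latent_features[test_index]
    # Learn centroids on the training fold only
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X_train)
    # Assign held-out points to the nearest learned centroid and score them
    test_labels = kmeans.predict(X_test)
    fold_scores.append(silhouette_score(X_test, test_labels))

print(f"Silhouette per fold: {np.round(fold_scores, 3)}")
print(f"Mean: {np.mean(fold_scores):.3f} +/- {np.std(fold_scores):.3f}")
```

A small spread across folds suggests the segmentation is stable. A large spread suggests the clusters depend heavily on which customers happen to be in the sample.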
In Phase 3, the model was trained using the autoencoder for dimensionality reduction and K-
Means for clustering. We tuned the K-Means clustering algorithm’s hyperparameters using grid
search and evaluated the model’s performance using several metrics, including silhouette score,
adjusted Rand index, and reconstruction loss. Cross-validation was applied to assess the model's
robustness and ensure generalizability. The evaluation metrics provided insights into the
clustering quality and the effectiveness of the autoencoder in capturing the underlying data
patterns.