
Customer Journey Analysis Using Clustering and Dimensionality Reduction
Phase 3: Model Training and Evaluation
3.1 Overview of Model Training and Evaluation
In this phase, we focus on selecting suitable algorithms, training the models using processed
customer journey data, and evaluating their performance. The goal is to identify distinct
customer behavior patterns and enhance user experience by optimizing touchpoints along
their journey. Principal Component Analysis (PCA) is used for dimensionality reduction,
followed by K-Means clustering for segmentation. Various evaluation metrics are employed
to assess clustering effectiveness, ensuring robust model performance.

3.2 Choosing Suitable Algorithms


For customer journey analysis, the key algorithms employed are:
1. Principal Component Analysis (PCA) – for feature extraction and dimensionality reduction. PCA projects the high-dimensional customer interaction data into a lower-dimensional space while retaining the key behavioral features.
2. K-Means Clustering – for behavior segmentation. After dimensionality reduction, K-Means clustering groups customers based on their journey patterns.
Source Code:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Load and preprocess data


data = pd.read_csv('customer_journey_data.csv')
data = data.dropna()
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Apply PCA
pca = PCA(n_components=2)
pca_data = pca.fit_transform(scaled_data)
pca_df = pd.DataFrame(data=pca_data, columns=['PCA1', 'PCA2'])

# Determine optimal clusters


sil_scores = []
for i in range(2, 11):
    kmeans = KMeans(n_clusters=i, random_state=42, n_init='auto')
    kmeans.fit(pca_df)
    sil_scores.append(silhouette_score(pca_df, kmeans.labels_))

optimal_clusters = sil_scores.index(max(sil_scores)) + 2

# Apply K-Means with optimal clusters


kmeans = KMeans(n_clusters=optimal_clusters, random_state=42, n_init='auto')
cluster_labels = kmeans.fit_predict(pca_df)
pca_df['Cluster'] = cluster_labels

3.3 Hyperparameter Tuning


Hyperparameter tuning is crucial to ensure optimal clustering. We optimize PCA components
to retain maximum variance while reducing dimensions effectively. Additionally, we
determine the best number of clusters using silhouette scores.
# Evaluate optimal PCA components (cumulative explained variance)
explained_variance = []
for n in range(1, data.shape[1] + 1):
    pca = PCA(n_components=n)
    pca.fit(scaled_data)
    explained_variance.append(sum(pca.explained_variance_ratio_))
optimal_pca_components = next(i for i, var in enumerate(explained_variance) if var > 0.95) + 1

# Evaluate optimal clusters using Silhouette Score


best_k = optimal_clusters
best_score = max(sil_scores)
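
Section 3.7 lists a "Silhouette Score vs. Cluster Count" plot among the visualizations, but no code for it appears in this document. A minimal sketch, reusing the sil_scores list computed in Section 3.2 (the figure size and styling are assumptions), could look like this:
# Plot silhouette score against candidate cluster counts (k = 2 to 10)
plt.figure(figsize=(8, 5))
plt.plot(range(2, 11), sil_scores, marker='o')
plt.axvline(optimal_clusters, color='red', linestyle='--', label=f'Optimal k = {optimal_clusters}')
plt.title('Silhouette Score vs. Cluster Count')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Silhouette score')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()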

3.4 Model Evaluation Metrics


We evaluate clustering quality and the autoencoder's reconstruction accuracy using:
1. Silhouette Score – Measures how well-separated the clusters are.
2. Adjusted Rand Index (ARI) – Compares predicted clusters with ground-truth labels, when such labels are available.
3. Mean Squared Error (MSE) – Measures the difference between the original and reconstructed data.
Source Code:
from sklearn.metrics import adjusted_rand_score

# Applying K-Means with the best number of clusters
kmeans = KMeans(n_clusters=best_k, random_state=42, n_init='auto')
clusters = kmeans.fit_predict(latent_features)

# Silhouette Score
sil_score = silhouette_score(latent_features, clusters)
print(f"Silhouette Score: {sil_score:.4f}")

# Adjusted Rand Index (provide ground-truth labels if available)
true_labels = None
if true_labels is not None:
    ari_score = adjusted_rand_score(true_labels, clusters)
    print(f"Adjusted Rand Index: {ari_score:.2f}")

# Autoencoder Reconstruction Loss (MSE between original and reconstructed data)
reconstructed_data = autoencoder.predict(scaled_data)
reconstruction_loss = np.mean(np.square(scaled_data - reconstructed_data))
print(f"Reconstruction Loss: {reconstruction_loss:.4f}")
3.5 Cross-Validation
Cross-validation ensures the model's robustness by testing it on multiple data subsets.
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Use only the PCA features (exclude the 'Cluster' column added earlier)
pca_features = pca_df[['PCA1', 'PCA2']]

silhouette_scores = []
for train_idx, test_idx in kf.split(pca_features):
    X_train, X_test = pca_features.iloc[train_idx], pca_features.iloc[test_idx]
    kmeans = KMeans(n_clusters=best_k, random_state=42, n_init='auto')
    kmeans.fit(X_train)
    clusters_pred = kmeans.predict(X_test)
    score = silhouette_score(X_test, clusters_pred)
    silhouette_scores.append(score)

avg_silhouette_score = np.mean(silhouette_scores)
print(f'Average Silhouette Score: {avg_silhouette_score:.4f}')

3.6 Enhanced Visualizations


Scatter plot, bar graph, and cluster insights.

Source Code:
# Scatter Plot with Centroids
plt.figure(figsize=(14, 10))
sns.scatterplot(x=latent_features[:, 0], y=latent_features[:, 1], hue=clusters, palette='viridis',
                s=100, alpha=0.7, edgecolor='w', linewidth=0.6)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=400, c='red',
            marker='X', edgecolor='black', linewidth=1.5, label='Centroids')
plt.title('Customer Journey Clusters with Centroids', fontsize=18, weight='bold')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()

# Bar Graph for Cluster Distribution
plt.figure(figsize=(10, 6))
sns.countplot(x=clusters, palette='Set3')
plt.title('Customer Count per Cluster', fontsize=16, weight='bold')
plt.xlabel('Cluster', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()
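
The OUTPUT section below also describes a histogram of session duration per cluster, for which no code appears in this document. A possible sketch, assuming the original data contains a session_duration column (a hypothetical column name) and reusing the cluster labels from Section 3.4, is:
# Histogram: Session Duration Distribution per Cluster
# Assumes a 'session_duration' column exists in the original data (hypothetical name)
data_with_clusters = data.copy()
data_with_clusters['Cluster'] = clusters
plt.figure(figsize=(10, 6))
sns.histplot(data=data_with_clusters, x='session_duration', hue='Cluster', palette='viridis',
             element='step')
plt.title('Session Duration Distribution per Cluster', fontsize=16, weight='bold')
plt.xlabel('Session Duration', fontsize=14)
plt.ylabel('Count', fontsize=14)
plt.show()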

3.7 Conclusion of Phase 3


In this phase, we applied PCA for dimensionality reduction and used K-Means clustering to
segment customers based on their journey data. Hyperparameter tuning was conducted to
optimize PCA components and cluster numbers. Evaluation metrics such as silhouette score
and cluster distribution provided insights into clustering quality. Cross-validation ensured
robustness. These insights enable businesses to enhance user experience by tailoring
interactions based on customer behavior.
Visualizations:
1. PCA Projection Scatter Plot – Visualizing clusters in 2D space.
2. Silhouette Score vs. Cluster Count – Determining the optimal number of clusters.
3. Cluster Distribution – Analyzing customer distribution across segments.
This document outlines the customer journey analysis framework using PCA and K-Means
clustering for segmentation.
OUTPUT
Here are the generated visualizations:
• Enhanced Scatter Plot with Centroids: It helps identify the distinct boundaries between clusters and the relative positioning of cluster centers, highlighting customer segments with similar behaviors.
• Modified Bar Graph: Customer Count per Cluster: It reveals the popularity or dominance of certain journey patterns and helps identify niche versus mainstream customer behaviors.
• Histogram: Session Duration Distribution per Cluster: It highlights engagement patterns, helping to distinguish between short-session users and more engaged customer segments.
• Line Graph: Mean Feature Values per Cluster: It provides a comparative view of how clusters differ across multiple dimensions, enabling targeted marketing strategies based on behavioral tendencies (a sketch of this plot appears after this list).
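
A sketch of the line graph of mean feature values per cluster, reusing the original data and the cluster labels from Section 3.4 (the figure styling is an assumption), is:
# Line Graph: Mean Feature Values per Cluster
# Average each original feature within a cluster and draw one line per cluster
cluster_means = data.assign(Cluster=clusters).groupby('Cluster').mean(numeric_only=True)
plt.figure(figsize=(12, 6))
for cluster_id, row in cluster_means.iterrows():
    plt.plot(row.index, row.values, marker='o', label=f'Cluster {cluster_id}')
plt.title('Mean Feature Values per Cluster', fontsize=16, weight='bold')
plt.xlabel('Feature', fontsize=14)
plt.ylabel('Mean value', fontsize=14)
plt.xticks(rotation=45)
plt.legend()
plt.tight_layout()
plt.show()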
