0% found this document useful (0 votes)

3 views

Mastering Customer Segmentation with LLMs _ Towards Data Science

The document discusses advanced customer segmentation techniques using LLMs, focusing on three methods: Kmeans, K-Prototype, and LLM + Kmeans. It emphasizes the importance of preprocessing data, detecting outliers, and selecting the optimal number of clusters through methods like the Elbow Method and Silhouette analysis. The project utilizes a banking dataset from Kaggle and provides a structured approach to clustering analysis for data scientists looking to enhance their skills.

Uploaded by

edtele2rdso

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

3 views

Mastering Customer Segmentation with LLMs _ Towards Data Science

Uploaded by

edtele2rdso

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 38

10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Open in app

Get unlimited access to the best of Medium for less than $1/week. Become a member

Mastering Customer Segmentation with LLM

Unlock advanced customer segmentation techniques using LLMs, and improve your clustering models with advanced techniques

Damian Gil · Follow

Published in Towards Data Science
23 min read · Sep 27

Listen Share More

Content Table
· Intro
· Data
· Method 1: Kmeans
· Method 2: K-Prototype
· Method 3: LLM + Kmeans
· Conclusion

Intro
A customer segmentation project can be approached in multiple ways. In this article I will teach you advanced techniques, not
only to define the clusters, but to analyze the result. This post is intended for those data scientists who want to have several tools
to address clustering problems and be one step closer to being seniors DS.

What will we see in this article?

Let’s see 3 methods to approach this type of project:

Kmeans

K-Prototype

LLM + Kmeans

As a small preview I will show the following comparison of the 2D representation (PCA) of the different models created:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 1/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Graphic comparison of the three methods (Image by Author).

You will also learn dimensionality reduction techniques such as:

PCA

t-SNE

MCA

Some of the results being these:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 2/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Graphical comparison of the three dimensionality reduction methods (Image by Author).

You can find the project with the notebooks here. And you can also take a look at my github:

damiangilgonzalez1995 - Overview
Passionate about data, I transitioned from physics to data science. Worked at Telefonica, HP, and now CTO at…
github.com

A very important clarification is that this is not an end-to-end project. This is because we have skipped one of the most important
parts in this type of project: The exploratory data analysis (EDA) phase or the selection of variables.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 3/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Data
The original data used in this project is from a public Kaggle: Banking Dataset — Marketing Targets. Each row in this data set
contains information about a company’s customers. Some fields are numerical and others are categorical, we will see that this
expands the possible ways to approach the problem.

We will only be left with the first 8 columns. Our dataset looks like this:

Let’s see a brief description of the columns of our dataset:

age (numeric)

job : type of job (categorical: “admin.” ,”unknown”,”unemployed”, ”management”, ”housemaid”, ”entrepreneur”, ”student”,
“blue-collar”, ”self-employed”, ”retired”, ”technician”, ”services”)

marital : marital status (categorical: “married”,”divorced”,”single”; note: “divorced” means divorced or widowed)

education (categorical: “unknown”,”secondary”,”primary”,”tertiary”)

default: has credit in default? (binary: “yes”,”no”)

balance: average yearly balance, in euros (numeric)

housing: has housing loan? (binary: “yes”,”no”)

loan: has personal loan? (binary: “yes”,”no”)

For the project, I’ve utilized the train

training dataset by Kaggle. In the project repository, you can locate the “data” folder where a
compressed file of the dataset used in the project is stored. Additionally, you will find two CSV files inside of the compressed file.
One is the training
train dataset provided by Kaggle (train.csv)
train , and the other is the dataset after performing an embedding
(embedding_train.csv
train ), which we will explain further later on.

To further clarify how the project is structured, the project tree is shown:

clustering_llm
├─ data
│ ├─ data.rar
├─ img
├─ embedding.ipynb
├─ embedding_creation.py
├─ kmeans.ipynb
├─ kprototypes.ipynb
├─ README.md
└─ requirements.txt

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 4/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Method 1: Kmeans
This is the most common method and the one you will surely know. Anyway, we are going to study it because I will show
advanced analysis techniques in these cases. The Jupyter notebook where you will find the complete procedure is called
kmeans.ipynb

Preprocessed
A preprocessing of the variables is carried out:

1. It consists of converting categorical variables into numeric ones.

We are going to apply Onehot encoder for the nominal variables and a OrdinalEncoder for the ordinals features (education).

2. We try to ensure that the numerical variables have a Gaussian distribution. For them we will apply a PowerTransformer.

Let’s see how it looks in code.

import pandas as pd # dataframe manipulation

import numpy as np # linear algebra

# data visualization
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import shap

# sklearn
from sklearn.cluster import KMeans
from sklearn.preprocessing import PowerTransformer, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score, silhouette_samples, accuracy_score, classification_report

from pyod.models.ecod import ECOD

from yellowbrick.cluster import KElbowVisualizer

import lightgbm as lgb

import prince

# Read file
df = pd.read_csv("train.csv"
train , sep = ";")
df = df.iloc[:, 0:8]

# Preprocessing part
categorical_transformer_onehot = Pipeline(
steps=[
("encoder", OneHotEncoder(handle_unknown="ignore", drop="first", sparse=False))
])

categorical_transformer_ordinal = Pipeline(
steps=[
("encoder", OrdinalEncoder())
])

num = Pipeline(
steps=[
("encoder", PowerTransformer())
])

preprocessor = ColumnTransformer(transformers = [
('cat_onehot', categorical_transformer_onehot, ["default", "housing", "loan", "job", "marital"]),
('cat_ordinal', categorical_transformer_ordinal, ["education"]),
('num', num, ["age", "balance"])
])

pipeline = Pipeline(
steps=[("preprocessor", preprocessor)]

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 5/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
)
pipe_fit = pipeline.fit(df)

data = pd.DataFrame(pipe_fit.transform(df), columns = pipe_fit.get_feature_names_out().tolist())

print(data.columns.tolist())

# OUTPUT
['cat_onehot__default_yes',
'cat_onehot__housing_yes',
'cat_onehot__loan_yes',
'cat_onehot__job_blue-collar',
'cat_onehot__job_entrepreneur',
'cat_onehot__job_housemaid',
'cat_onehot__job_management',
'cat_onehot__job_retired',
'cat_onehot__job_self-employed',
'cat_onehot__job_services',
'cat_onehot__job_student',
'cat_onehot__job_technician',
'cat_onehot__job_unemployed',
'cat_onehot__job_unknown',
'cat_onehot__marital_married',
'cat_onehot__marital_single',
'cat_ordinal__education',
'num__age',
'num__balance']

Output:

Outliers
It is crucial that there are as few outliers in our data as Kmeans is very sensitive to this. We can apply the typical method of
choosing outliers using the z score, but in this post I will show you a much more advanced and cool method.

Well, what is this method? Well, we will use the Python Outlier Detection (PyOD) library. This library is focused on detecting
outliers for different cases. To be more specific we will use the ECOD method (“empirical cumulative distribution functions for
outlier detection”).

This method seeks to obtain the distribution of the data and thus know which are the values where the probability density is
lower (outliers). Take a look at the Github if you want.

from pyod.models.ecod import ECOD

clf = ECOD()
clf.fit(data)
outliers = clf.predict(data)

data["outliers"] = outliers

# Data without outliers

data_no_outliers = data[data["outliers"] == 0]
data_no_outliers = data_no_outliers.drop(["outliers"], axis = 1)

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 6/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
# Data with Outliers
data_with_outliers = data.copy()
data_with_outliers = data_with_outliers.drop(["outliers"], axis = 1)

print(data_no_outliers.shape) -> (40690, 19)

print(data_with_outliers.shape) -> (45211, 19)

Modeling
One of the disadvantages of using the Kmeans algorithm is that you must choose the number of clusters you want to use. In this
case, in order to obtain that data, we will use Elbow Method. It consists of calculating the distortion that exists between the points
of a cluster and its centroid. The objective is clear, to obtain the least possible distortion. In this case we use the following code:

from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer

km = KMeans(init="k-means++", random_state=0, n_init="auto")
visualizer = KElbowVisualizer(km, k=(2,10))

visualizer.fit(data_no_outliers) # Fit the data to the visualizer

visualizer.show()

Output:

Elbow score for different numbers of clusters (Image by Author).

We see that from k=5, the distortion does not vary drastically. It is true that the ideal is that the behavior starting from k= 5 would
be almost flat. This rarely happens and other methods can be applied to be sure of the most optimal number of clusters. To be
sure, we could perform a Silhoutte visualization. The code is the following:

from sklearn.metrics import davies_bouldin_score, silhouette_score, silhouette_samples

import matplotlib.cm as cm

def make_Silhouette_plot(X, n_clusters):

plt.xlim([-0.1, 1])
plt.ylim([0, len(X) + (n_clusters + 1) * 10])
clusterer = KMeans(n_clusters=n_clusters, max_iter = 1000, n_init = 10, init = 'k-means++', random_state=10)
cluster_labels = clusterer.fit_predict(X)
silhouette_avg = silhouette_score(X, cluster_labels)
print(
"For n_clusters =", n_clusters,

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 7/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
"The average silhouette_score is :", silhouette_avg,
)
# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(X, cluster_labels)
y_lower = 10
for i in range(n_clusters):
ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / n_clusters)
plt.fill_betweenx(
np.arange(y_lower, y_upper),
0,
ith_cluster_silhouette_values,
facecolor=color,
edgecolor=color,
alpha=0.7,
)
plt.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
y_lower = y_upper + 10
plt.title(f"The Silhouette Plot for n_cluster = {n_clusters}", fontsize=26)
plt.xlabel("The silhouette coefficient values", fontsize=24)
plt.ylabel("Cluster label", fontsize=24)
plt.axvline(x=silhouette_avg, color="red", linestyle="--")
plt.yticks([])
plt.xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

range_n_clusters = list(range(2,10))

for n_clusters in range_n_clusters:

print(f"N cluster: {n_clusters}")
make_Silhouette_plot(data_no_outliers, n_clusters)
plt.savefig('Silhouette_plot_{}.png'.format(n_clusters))
plt.close()

OUTPUT:

"""
N cluster: 2
For n_clusters = 2 The average silhouette_score is : 0.18111287366156115
N cluster: 3
For n_clusters = 3 The average silhouette_score is : 0.16787543108034586
N cluster: 4
For n_clusters = 4 The average silhouette_score is : 0.1583411958880734
N cluster: 5
For n_clusters = 5 The average silhouette_score is : 0.1672987260052535
N cluster: 6
For n_clusters = 6 The average silhouette_score is : 0.15485098506258177
N cluster: 7
For n_clusters = 7 The average silhouette_score is : 0.1495307642182009
N cluster: 8
For n_clusters = 8 The average silhouette_score is : 0.15098396457075294
N cluster: 9
For n_clusters = 9 The average silhouette_score is : 0.14842917303536465
"""

It can be seen that the highest silhouette score is obtained with n_cluster=9, but it is also true that the variation in the score is
quite small if we compare it with the other scores. At the moment the previous result does not provide us with much information.
On the other hand, the previous code creates the Silhouette visualization, which gives us more information:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 8/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 9/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Graphic representation of the silhouette method for different numbers of clusters (Image by Author).

Since understanding these representations well is not the goal of this post, I will conclude that there seems to be no very clear
decision as to which number is best. After viewing the previous representations, we can choose K=5 or K= 6. This is because for
the different clusters, their Silhouette score is above the average value and there is no imbalance in cluster size. Furthermore, in
some situations, the marketing department may be interested in having the smallest number of clusters/types of customers (This
may or may not be the case).

Finally we can create our Kmeans model with K=5.

km = KMeans(n_clusters=5,
init='k-means++',
n_init=10,
max_iter=100,
random_state=42)

clusters_predict = km.fit_predict(data_no_outliers)

"""
clusters_predict -> array([4, 2, 0, ..., 3, 4, 3])
np.unique(clusters_predict) -> array([0, 1, 2, 3, 4])
"""

Evaluation
The way of evaluating kmeans models is somewhat more open than for other models. We can use

metrics

visualizations

interpretation (Something very important for companies).

In relation to the model evaluation metrics, we can use the following code:

from sklearn.metrics import silhouette_score

from sklearn.metrics import calinski_harabasz_score
from sklearn.metrics import davies_bouldin_score

"""
The Davies Bouldin index is defined as the average similarity measure
of each cluster with its most similar cluster, where similarity
is the ratio of within-cluster distances to between-cluster distances.

The minimum value of the DB Index is 0, whereas a smaller

value (closer to 0) represents a better model that produces better clusters.
"""
print(f"Davies bouldin score: {davies_bouldin_score(data_no_outliers,clusters_predict)}")

"""
Calinski Harabaz Index -> Variance Ratio Criterion.

Calinski Harabaz Index is defined as the ratio of the

sum of between-cluster dispersion and of within-cluster dispersion.

The higher the index the more separable the clusters.

"""
print(f"Calinski Score: {calinski_harabasz_score(data_no_outliers,clusters_predict)}")

"""
The silhouette score is a metric used to calculate the goodness of
fit of a clustering algorithm, but can also be used as
a method for determining an optimal value of k (see here for more).

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 10/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Its value ranges from -1 to 1.

A value of 0 indicates clusters are overlapping and either
the data or the value of k is incorrect.

1 is the ideal value and indicates that clusters are very

dense and nicely separated.
"""
print(f"Silhouette Score: {silhouette_score(data_no_outliers,clusters_predict)}")

OUTPUT:

"""Davies bouldin score: 1.676769775662788

Calinski Score: 6914.705500337112
Silhouette Score: 0.16729335453305272
"""

As far as can be shown, we do not have an excessively good model. Davies’ score is telling us that the distance between clusters is
quite small.

This may be due to several factors, but keep in mind that the energy of a model is the data; if the data does not have sufficient
predictive power, you cannot expect to achieve exceptional results.

For visualizations, we can use the method to reduce dimensionality, PCA. For them we are going to use the Prince library,
focused on exploratory analysis and dimensionality reduction. If you prefer, you can use Sklearn’s PCA, they are identical.

First we will calculate the principal components in 3D, and then we will make the representation. These are the two functions
performed by the previous steps:

import prince
import plotly.express as px

def get_pca_2d(df, predict):

pca_2d_object = prince.PCA(
n_components=2,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=True,
copy=True,
check_input=True,
engine='sklearn',
random_state=42
)

pca_2d_object.fit(df)

df_pca_2d = pca_2d_object.transform(df)
df_pca_2d.columns = ["comp1", "comp2"]
df_pca_2d["cluster"] = predict

return pca_2d_object, df_pca_2d

def get_pca_3d(df, predict):

pca_3d_object = prince.PCA(
n_components=3,
n_iter=3,
rescale_with_mean=True,
rescale_with_std=True,
copy=True,
check_input=True,
engine='sklearn',
random_state=42
)

pca_3d_object.fit(df)

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 11/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
df_pca_3d = pca_3d_object.transform(df)
df_pca_3d.columns = ["comp1", "comp2", "comp3"]
df_pca_3d["cluster"] = predict

return pca_3d_object, df_pca_3d

def plot_pca_3d(df, title = "PCA Space", opacity=0.8, width_line = 0.1):

df = df.astype({"cluster": "object"})
df = df.sort_values("cluster")

fig = px.scatter_3d(
df,
x='comp1',
y='comp2',
z='comp3',
color='cluster',
template="plotly",

# symbol = "cluster",

color_discrete_sequence=px.colors.qualitative.Vivid,
title=title).update_traces(
# mode = 'markers',
marker={
"size": 4,
"opacity": opacity,
# "symbol" : "diamond",
"line": {
"width": width_line,
"color": "black",
}
}
).update_layout(
width = 800,
height = 800,
autosize = True,
showlegend = True,
legend=dict(title_font_family="Times New Roman",
font=dict(size= 20)),
scene = dict(xaxis=dict(title = 'comp1', titlefont_color = 'black'),
yaxis=dict(title = 'comp2', titlefont_color = 'black'),
zaxis=dict(title = 'comp3', titlefont_color = 'black')),
font = dict(family = "Gilroy", color = 'black', size = 15))

fig.show()

Don’t worry too much about those functions, use them as follows:

pca_3d_object, df_pca_3d = pca_plot_3d(data_no_outliers, clusters_predict)

plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=1, width_line = 0.1)
print("The variability is :", pca_3d_object.eigenvalues_summary)

Output:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 12/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

PCA space and the clusters created by the model (Image by Author).

It can be seen that the clusters have almost no separation between them and there is no clear division. This is in accordance with
the information provided by the metrics.

Something to keep in mind and that very few people keep in mind is the PCA and
the variability of the eigenvectors.
Let’s say that each field contains a certain amount of information, and this adds its bit of information. If the accumulated sum of
the 3 main components adds up to around 80% variability, we can say that it is acceptable, obtaining good results in the
representations. If the value is lower, we have to take the visualizations with a grain of salt since we are missing a lot of
information that is contained in other eigenvectors.

The next question is obvious: What is the variability of the PCA executed?
The answer is the following:

As can be seen, we have 27.98% variability with the first 3 components, something insufficient to draw informed conclusions.

When we apply the PCA method, since it is a linear algorithm, it is not capable of capturing more complex relationships. Luckily
there is a method called t-SNE, which is capable of capturing these complex polynomial relationships. This can help us visualize,
since with the previous method we have not had much success.

If you try it on your computers, keep in mind that it has a higher computational cost. For this reason, I sampled my original
dataset and it still took me about 5 minutes to get the result. The code is as follows:

from sklearn.manifold import TSNE

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 13/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
sampling_data = data_no_outliers.sample(frac=0.5, replace=True, random_state=1)
sampling_clusters = pd.DataFrame(clusters_predict).sample(frac=0.5, replace=True, random_state=1)[0].values

df_tsne_3d = TSNE(
n_components=3,
learning_rate=500,
init='random',
perplexity=200,
n_iter = 5000).fit_transform(sampling_data)

df_tsne_3d = pd.DataFrame(df_tsne_3d, columns=["comp1", "comp2",'comp3'])

df_tsne_3d["cluster"] = sampling_clusters
plot_pca_3d(df_tsne_3d, title = "PCA Space", opacity=1, width_line = 0.1)

As a result, I got the following image. It shows a clearer separation between clusters but unfortunately, we still don’t have good
results.

t-SNE space and the clusters created by the model (Image by Author).

In fact, we can compare the reduction carried out by the PCA and by the t-SNE, in 2 dimensions. The improvement is clear using
the second method.

Different results for different dimensionality reduction methods and clusters defined by the model (Image by
Author).

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 14/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Finally, let’s explore a little how the model works, in which features are the most important and what are the main characteristics
of the clusters.

To see the importance of each of the variables we will use a typical “trick” in this type of situation. We are going to create a
classification model where the “X” is the inputs of the Kmeans model, and the “y” is the clusters predicted by the Kmeans model.

The chosen model is an LGBMClassifier. This model is quite powerful and works well having categorical and numerical
variables. Having the new model train
trained, using the SHAP library, we can obtain the importance of each of the features in the
prediction. The code is:

import lightgbm as lgb

import shap

# We create the LGBMClassifier model and train it

clf_km = lgb.LGBMClassifier(colsample_by_tree=0.8)
clf_km.fit(X=data_no_outliers, y=clusters_predict)

#SHAP values
explainer_km = shap.TreeExplainer(clf_km)
shap_values_km = explainer_km.shap_values(data_no_outliers)
shap.summary_plot(shap_values_km, data_no_outliers, plot_type="bar", plot_size=(15, 10))

Output:

The importance of the variables in the model (Image by Author).

It can be seen that feature age has the greatest predictive power. It can also be seen that cluster number 3 (green) is mainly
differentiated by the balance variable.

Finally we must analyze the characteristics of the clusters. This part of the study is what is decisive for the business. For them we
are going to obtain the means (for the numerical variables) and the most frequent value (categorical variables) of each of the
features of the dataset for each of the clusters:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 15/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

df_no_outliers = df[df.outliers == 0]
df_no_outliers["cluster"] = clusters_predict

df_no_outliers.groupby('cluster').agg(
{
'job': lambda x: x.value_counts().index[0],
'marital': lambda x: x.value_counts().index[0],
'education': lambda x: x.value_counts().index[0],
'housing': lambda x: x.value_counts().index[0],
'loan': lambda x: x.value_counts().index[0],
'contact': lambda x: x.value_counts().index[0],
'age':'mean',
'balance': 'mean',
'default': lambda x: x.value_counts().index[0],

}
).reset_index()

Output:

We see that the clusters with job=blue-collar do not have a great differentiation between their characteristics, except by the age
feature. This is something that is not desirable since it is difficult to differentiate the clients of each of the clusters. In the
job=management case, we obtain better differentiation.

After carrying out the analysis in different ways, they converge on the same conclusion: “We need to improve the results”.

Method 2: K-Prototype
If we remember our original dataset, we see that we have categorical and numerical variables. Unfortunately, the Kmeans
algorithm provided by Skelearn does not accept categorical variables, forcing the original dataset to be modified and drastically
altered.

Luckily, you’ve taken with me and my post. But above all, thanks to ZHEXUE HUANG and his article Extensions to the k-Means
Algorithm for Clustering Large Data Sets with Categorical Values, there is an algorithm that accepts categorical variables for
clustering. This algorithm is called K-Prototype. The bookstore that provides it is Prince.

The procedure is the same as in the previous case. In order not to make this article eternal, let’s go to the most interesting parts.
But remember that you can access the Jupyter notebook here.
Preprocessed
Because we have numerical variables, we must make certain modifications to them. It is always recommended that all numerical
variables be on similar scales and with distributions as close as possible to Gaussian ones. The dataset that we will use to create
the models is created as follows:

pipe = Pipeline([('scaler', PowerTransformer())])

df_aux = pd.DataFrame(pipe_fit.fit_transform(df_no_outliers[["age", "balance"]] ), columns = ["age", "balance"])

df_no_outliers_norm = df_no_outliers.copy()
# Replace age and balance columns by preprocessed values
df_no_outliers_norm = df_no_outliers_norm.drop(["age", "balance"], axis = 1)
df_no_outliers_norm["age"] = df_aux["age"].values

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 16/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
df_no_outliers_norm["balance"] = df_aux["balance"].values
df_no_outliers_norm

Outliers
Because the method that I have presented for outlier detection (ECOD) only accepts numerical variables, the same
transformation must be performed as for the kmeans method. We apply the outlier detection model that will provide us with
which rows to eliminate, finally leaving the dataset that we will use as input for the K-Prototype model:

Modeling
We create the model and to do this we first need to obtain the optimal k. To do this we use the Elbow Method and this piece of
code:

# Choose optimal K using Elbow method

from kmodes.kprototypes import KPrototypes
from plotnine import *
import plotnine
cost = []
range_ = range(2, 15)
for cluster in range_:

kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster, init = 'Huang', random_state = 0)

kprototype.fit_predict(df_no_outliers, categorical = categorical_columns_index)
cost.append(kprototype.cost_)
print('Cluster initiation: {}'.format(cluster))

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 17/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

# Converting the results into a dataframe and plotting them

df_cost = pd.DataFrame({'Cluster':range_, 'Cost':cost})
# Data viz
plotnine.options.figure_size = (8, 4.8)
(
ggplot(data = df_cost)+
geom_line(aes(x = 'Cluster',
y = 'Cost'))+
geom_point(aes(x = 'Cluster',
y = 'Cost'))+
geom_label(aes(x = 'Cluster',
y = 'Cost',
label = 'Cluster'),
size = 10,
nudge_y = 1000) +
labs(title = 'Optimal number of cluster with Elbow Method')+
xlab('Number of Clusters k')+
ylab('Cost')+
theme_minimal()
)

Output:

Elbow score for different numbers of clusters (Image by Author).

We can see that the best option is K=5.

Be careful, since this algorithm takes a little longer than those normally used. For the previous graph, 86 minutes were needed,
something to keep in mind.

Well, we are now clear about the number of clusters, we just have to create the model:

# We get the index of categorical columns

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
categorical_columns = df_no_outliers_norm.select_dtypes(exclude=numerics).columns
print(categorical_columns)

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 18/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
categorical_columns_index = [df_no_outliers_norm.columns.get_loc(col) for col in categorical_columns]

# Create the model

cluster_num = 5
kprototype = KPrototypes(n_jobs = -1, n_clusters = cluster_num, init = 'Huang', random_state = 0)
kprototype.fit(df_no_outliers_norm, categorical = categorical_columns_index)
clusters = kprototype.predict(df_no_outliers , categorical = categorical_columns_index)

print(clusters) " -> array([3, 1, 1, ..., 1, 1, 2], dtype=uint16)"

We already have our model and its predictions, we just need to evaluate it.
Evaluation
As we have seen before we can apply several visualizations to obtain an intuitive idea of how good our model is. Unfortunately the
PCA method and t-SNE do not admit categorical variables. But don’t worry, since the Prince library contains the MCA (Multiple
correspondence analysis) method and it does accept a mixed dataset. In fact, I encourage you to visit the Github of this library, it
has several super useful methods for different situations, see the following image:

The different methods of dimensionality reduction by type of case (Image by Author and Prince Documentation).

Well, the plan is to apply a MCA to reduce the dimensionality and be able to make graphical representations. For this we use the
following code:

from prince import MCA

def get_MCA_3d(df, predict):

mca = MCA(n_components =3, n_iter = 100, random_state = 101)
mca_3d_df = mca.fit_transform(df)
mca_3d_df.columns = ["comp1", "comp2", "comp3"]
mca_3d_df["cluster"] = predict
return mca, mca_3d_df

def get_MCA_2d(df, predict):

mca = MCA(n_components =2, n_iter = 100, random_state = 101)
mca_2d_df = mca.fit_transform(df)
mca_2d_df.columns = ["comp1", "comp2"]
mca_2d_df["cluster"] = predict
return mca, mca_2d_df

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 19/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
"-------------------------------------------------------------------"
mca_3d, mca_3d_df = get_MCA_3d(df_no_outliers_norm, clusters)

Remember that if you want to follow each step 100%, you can take a look at Jupyter notebook.

The dataset named mca_3d_df contains that information:

Let’s make a plot using the reduction provided by the MCA method:

MCA space and the clusters created by the model (Image by Author)

Wow, it doesn’t look very good… It is not possible to differentiate the clusters from each other. We can say then that the model is
not good enough, right?

I hope you said something like:

“Hey Damian, don’t go so fast!! Have you looked at the variability of the 3 components provided by the MCA?”

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 20/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Indeed, we must see if the variability of the first 3 components is sufficient to be able to draw conclusions. The MCA method
allows us to obtain these values in a very simple way:

mca_3d.eigenvalues_summary

Aha, here we have something interesting. Due to our data we obtain basically zero variability.

In other words, we cannot draw clear conclusions from our model with the information provided by the dimensionality reduction
provided by MCA.

By showing these results I try to give an example of what happens in real data projects. Good results are not always obtained, but
a good data scientist knows how to recognize the causes.

We have one last option to visually determine if the model created by the K-Prototype method is suitable or not. This path is
simple:

1. This is applying PCA to the dataset to which preprocessing has been performed to transform the categorical variables into
numerical ones.

2. Obtain the components of the PCA

3. Make a representation using the PCA components such as the axes and the color of the points to predict the K-Prototype
model.

Note that the components provided by the PCA will be the same as for method 1: Kmeans, since it is the same dataframe.

Let’s see what we get…

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 21/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
PCA space and the clusters created by the model (Image by Author).

It doesn’t look bad, in fact it has a certain resemblance to what has been obtained in Kmeans.

Finally we obtain the average value of the clusters and the importance of each of the variables:

The importance of the variables in the model. The table represents the most frequent value of each of the clusters
(Image by Author).

The variables with the greatest weight are the numerical ones, notably seeing that the confinement of these two features is
almost sufficient to differentiate each cluster.

In short, it can be said that results similar to those of Kmeans have been obtained.

Method 3: LLM + Kmeans

This combination can be quite powerful and improve the results obtained. Let’s get to the point!

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 22/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

LLMs cannot understand written text directly, we need to transform the input of this type of models. For this, Sentence
Embedding is carried out. It consists of transforming the text into numerical vectors. The following image can clarify the idea:

Concept of embedding and similarity (Image by Author).

This coding is done intelligently, that is, phrases that contain a similar meaning will have a more similar vector. See the following
image:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 23/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Concept of embedding and similarity (Image by Author).

Sentence embedding is carried out by so-called transforms, algorithms specialized in this coding. Typically you can choose what
the size of the numerical vector coming from this encoding is. And here is one of the key points:

Thanks to the large dimension of the vector created by embedding, small

variations in the data can be seen with greater precision.
Therefore, if we provide input to our information-rich Kmeans model, it will return better predictions. This is the idea we are
pursuing and these are its steps:

1. Transform our original dataset through Sentence embedding

2. Create a Kmeans model

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 24/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

3. Evaluate it

Well, the first step is to encode the information through Sentence embedding. What is intended is to take the information of each
client and unify it into text that contains all its characteristics. This part takes a lot of computing time. That’s why I created a
script that did this job, call embedding_creation.py. This script collects the values contained in the train
training dataset and creates a
new dataset provided by the embedding. This is the script code:

import pandas as pd # dataframe manipulation

import numpy as np # linear algebra
from sentence_transformers import SentenceTransformer

df = pd.read_csv("data/train.csv"
train , sep = ";")

# -------------------- First Step --------------------

def compile_text(x):

text = f"""Age: {x['age']},

housing load: {x['housing']},
Job: {x['job']},
Marital: {x['marital']},
Education: {x['education']},
Default: {x['default']},
Balance: {x['balance']},
Personal loan: {x['loan']},
contact: {x['contact']}
"""

return text

sentences = df.apply(lambda x: compile_text(x), axis=1).tolist()

# -------------------- Second Step --------------------

model = SentenceTransformer(r"sentence-transformers/paraphrase-MiniLM-L6-v2")
output = model.encode(sentences=sentences,
show_progress_bar=True,
normalize_embeddings=True)

df_embedding = pd.DataFrame(output)
df_embedding

As it is quite important that this step is understood. Let’s go by points:

Step 1: The text is created for each row, which contains the complete customer/row information. We also store it in a python
list for later use. See the following image that exemplifies it.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 25/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Graphic description of the first step (Image by Author).

Step 2: This is when the call to the transformer is made. For this we are going to use the model stored in HuggingFace. This
model is specifically train
trained to perform embedding at the sentence level, unlike Bert’s model, which is focused on encoding
at the level of tokens and words. To call the model you only have to give the repository address, which in this case is “sentence-
transformers/paraphrase-MiniLM-L6-v2”. The numerical vector that is returned to us for each text will be normalized, since
the Kmeans model is sensitive to the scales of the inputs. The vectors created have a length of 384. With them what we do is
create a dataframe with the same number of columns. See the following image:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 26/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Graphic description of the second step (Image by Author),

Finally we obtain the dataframe from the embedding, which will be the input of our Kmeans model.

This step has been one of the most interesting and important, since we have created the input for the Kmeans model that we will
create.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 27/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

The creation and evaluation procedure is similar to that shown above. In order not to make the post excessively long, only the
results of each point will be shown. Don’t worry, all the code is contained in the jupyter notebook called embedding, so you can
reproduce the results for yourself.

In addition, the dataset resulting from applying the Sentence embedding has been saved in a csv file. This csv file is called
embedding_train.csv
train . In the Jupyter notebook you will see that we access that dataset and create our model based on it.

# Normal Dataset
df = pd.read_csv("data/train.csv"
train , sep = ";")
df = df.iloc[:, 0:8]

# Embedding Dataset
df_embedding = pd.read_csv("data/embedding_train.csv"
train , sep = ",")

Preprocessed
We could consider embedding as preprocessing.

Outliers
We apply the method already presented to detect outliers, ECOD. We create a dataset that does not contain these types of points.

df_embedding_no_out.shape -> (40690, 384)

df_embedding_with_out.shape -> (45211, 384)

Modeling
First we must find out what the optimal number of clusters is. For this we use Elbow Method.

Elbow score for different numbers of clusters (Image by Author).

After viewing the graph, we choose k=5 as our number of clusters.

n_clusters = 5
clusters = KMeans(n_clusters=n_clusters, init = "k-means++").fit(df_embedding_no_out)
print(clusters.inertia_)
clusters_predict = clusters.predict(df_embedding_no_out)

Evaluation

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 28/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

The next thing is to create our Kmeans model with k=5. Next we can obtain some metrics like these:

Davies bouldin score: 1.8095386826791042

Calinski Score: 6419.447089002081
Silhouette Score: 0.20360442824114108

Seeing then that the values are really similar to those obtained in the previous case. Let’s study the representations obtained with
PCA analysis:

PCA space and the clusters created by the model (Image by Author).

It can be seen that the clusters are much better differentiated than with the traditional method. This is good news. Let us
remember that it is important to take into account the variability contained in the first 3 components of our PCA analysis. From
experience, I can say that when it is around 50% (3D PCA) more or less clear conclusions can be drawn.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 29/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

PCA space and the clusters created by the model. The variability of the first 3 components of the PCA is also
shown (Image by Author).

We see then that it is 40.44% cumulative variability of the 3 components, it is acceptable but not ideal.

One way I can visually see how compact the clusters are is by modifying the opacity of the points in the 3D representation. This
means that when the points are agglomerated in a certain space, a black spot can be observed. In order to understand what I’m
saying, I show the following gif:

plot_pca_3d(df_pca_3d, title = "PCA Space", opacity=0.2, width_line = 0.1)

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 30/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

PCA space and the clusters created by the model (Image by Author).

As can be seen, there are several points in space where the points of the same cluster cluster together. This indicates that they are
well differentiated from the other points and that the model knows how to recognize them quite well.

Even so, it can be seen that various clusters cannot be differentiated well (Ex: cluster 1 and 3). For this reason, we carry out a t-
SNE analysis, which we remember is a method that allows reducing dimensionality, taking into account complex polynomial
relationships.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 31/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
t-SNE space and the clusters created by the model (Image by Author).

A noticeable improvement is seen. The clusters do not overlap each other and there is a clear differentiation between points. The
improvement obtained using the second dimensionality reduction method is notable. Let’s see a 2D comparison:

Different results for different dimensionality reduction methods and clusters defined by the model (Image by
Author).

Again, it can be seen that the clusters in the t-SNE are more separated and better differentiated than with the PCA. Furthermore,
the difference between the two methods in terms of quality is smaller than when using the traditional Kmeans method.

To understand which variables our Kmeans model relies on, we do the same move as before: we create a classification model
(LGBMClassifier) and analyze the importance of the features.

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 32/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

The importance of the variables in the model (Image by Author).

We see then that this model is based above all on the “marital” and “job” variables. On the other hand we see that there are
variables that do not provide much information. In a real case, a new version of the model should be created without these
variables with little information.

The Kmeans + Embedding model is more optimal since it needs fewer variables to be able to give good predictions. Good news!

We finish with the part that is most revealing and important.

Managers and the business are not interested in PCA, t-SNE or embedding. What
they want is to be able to know what the main traits are, in this case, of their clients.
To do this, we create a table with information about the predominant profiles that we can find in each of the clusters:

Something very curious happens: the clusters where the most frequent position is that of “management” are 3. In them we find a
very peculiar behavior where the single managers are younger, those who are married are older and the divorced are the how
older they are. On the other hand, the balance behaves differently, single people have a higher average balance than divorced
people, and married people have a higher average balance. What was said can be summarized in the following image:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 33/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

Different customer profiles defined by the model (Image by Author).

This revelation is in line with reality and social aspects. It also reveals very specific customer profiles. This is the magic of data
science.

Conclusion
The conclusion is clear:

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 34/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

(Image by Author)

You have to have different tools because in a real project, not all strategies work and you must have resources to add value. It is
clearly seen that the model created with the help of the LLMs stands out.

Clustering Llm Hugging Face Hands On Tutorials Customer Segmentation

Written by Damian Gil

825 Followers · Writer for Towards Data Science

Passionate about data, I transitioned from physics to data science. Worked at Telefonica, HP, and now CTO at Seniority.AI since 2022.

Talk to your documents as PDFs, txts, and even web pages

Damian Gil in Towards AI

Talk to your documents as PDFs, txts, and even web pages

Complete guide to creating a web and the intelligence that allows you to ask questions to docs such as PDFs, TXTs, and even web pages using…

13 min read · Aug 24

648 5

Don’t Start Your Data Science Journey Without These 5 Must-Do Steps From a Spotify Data Scientist

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 35/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
Khouloud El Alami in Towards Data Science

Don’t Start Your Data Science Journey Without These 5 Must-Do Steps From a Spotify Data Scientist
A complete guide to everything I wish I’d done before starting my Data Science journey, here’s to acing your first year with data

18 min read · Sep 24

2.4K 23

Coding was Hard Until I Learned These 2 Things!

Natassha Selvaraj in Towards Data Science

Coding was Hard Until I Learned These 2 Things!

Here’s what helped me go from “aspiring programmer” to actually landing a job in the field.

· 7 min read · Oct 2

1.8K 21

AI interpretation of what Langchain does with the data(Stability AI)

Damian Gil in LatinXinAI

Data democratization with LangChain and OpenAI

Ask your database with natural language as a member of the marketing team.

8 min read · Jul 31

120 2

See all from Damian Gil

See all from Towards Data Science

Recommended from Medium

Talk to your documents as PDFs, txts, and even web pages

Damian Gil in Towards AI

Talk to your documents as PDFs, txts, and even web pages

Complete guide to creating a web and the intelligence that allows you to ask questions to docs such as PDFs, TXTs, and even web pages using…

13 min read · Aug 24

648 5

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 36/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science

The ChatGPT Hype Is Over — Now Watch How Google Will Kill ChatGPT.

AL Anany

The ChatGPT Hype Is Over — Now Watch How Google Will Kill ChatGPT.
It never happens instantly. The business game is longer than you know.

· 6 min read · Sep 1

13.8K 423

Lists

Natural Language Processing

702 stories · 312 saves

Stories to Help You Grow as a Software Developer

19 stories · 458 saves

Predictive Modeling w/ Python

20 stories · 490 saves

ChatGPT
21 stories · 197 saves

Cat holding a atom where the nucleus are documents.

Adrian H. Raudaschl in Towards Data Science

Forget RAG, the Future is RAG-Fusion

The Next Frontier of Search: Retrieval Augmented Generation meets Reciprocal Rank Fusion and Generated Queries

· 10 min read · Oct 6

1K 17

Data engineering at Meta: High-Level Overview of the internal tech stack

Analytics at Meta

Data engineering at Meta: High-Level Overview of the internal tech stack

This article provides an overview of the internal tech stack that we use on a daily basis as data engineers at Meta. The idea is to shed…

12 min read · 4 days ago

720 11

You’re Decent At Python If You Can Find These 4 Mistakes

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 37/38
10/15/23, 5:09 PM Mastering Customer Segmentation with LLMs | Towards Data Science
Liu Zuo Lin in Level Up Coding

You’re Decent At Python If You Can Find These 4 Mistakes

How this works

· 2 min read · Aug 20

1.91K 22
Customer Retention for Data Scientists with Causal AI

causaLens Research & Development

Customer Retention for Data Scientists with Causal AI

“Predictive insights are of little help unless combined with data-driven prescriptive recommendations for where to engage, what to offer…

9 min read · Sep 22

See more recommendations

https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41 38/38

Low Code AIML USL Project CreditCardCustomerSegmentation Vijay Borade Aug23
67% (3)
Low Code AIML USL Project CreditCardCustomerSegmentation Vijay Borade Aug23
66 pages
Three Words, Eight Letters. Say It and I - M Yours
100% (4)
Three Words, Eight Letters. Say It and I - M Yours
327 pages
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
From Everand
Google Cloud Platform for Data Engineering: From Beginner to Data Engineer using Google Cloud Platform
alasdair gilchrist
5/5 (1)
The Cloud Adoption Playbook: Proven Strategies for Transforming Your Organization with the Cloud
From Everand
The Cloud Adoption Playbook: Proven Strategies for Transforming Your Organization with the Cloud
Moe Abdula
No ratings yet
Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
From Everand
Machine Learning with Clustering: A Visual Guide for Beginners with Examples in Python
Artem Kovera
No ratings yet
SolidWorks CAM 2021 Black Book
From Everand
SolidWorks CAM 2021 Black Book
Gaurav Verma
No ratings yet
Métodos numéricos aplicados a Ingeniería: Casos de estudio usando MATLAB
From Everand
Métodos numéricos aplicados a Ingeniería: Casos de estudio usando MATLAB
Héctor Jorquera González
5/5 (1)
Ads Phase 4
No ratings yet
Ads Phase 4
12 pages
K-Means Clustering For Customer Segmentation - A Practical Example - Kimberly Coffey, PH.D - PDF
100% (1)
K-Means Clustering For Customer Segmentation - A Practical Example - Kimberly Coffey, PH.D - PDF
41 pages
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
From Everand
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
César Pérez López
No ratings yet
Modern C++23 QuickStart Pro: Advanced programming including variadic templates, lambdas, async IO, multithreading and thread sync
From Everand
Modern C++23 QuickStart Pro: Advanced programming including variadic templates, lambdas, async IO, multithreading and thread sync
Jarek Thalor
No ratings yet
Modern C++23 QuickStart Pro
From Everand
Modern C++23 QuickStart Pro
Jarek Thalor
No ratings yet
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING: CLUSTER ANALYSIS and kNN CLASSIFIERS. Examples with MATLAB
César Pérez López
No ratings yet
The Secret Of Machine Learning
From Everand
The Secret Of Machine Learning
Mhd Arjunanta
No ratings yet
Computer Vision Graph Cuts: Exploring Graph Cuts in Computer Vision
From Everand
Computer Vision Graph Cuts: Exploring Graph Cuts in Computer Vision
Fouad Sabry
No ratings yet
Data Mining Models: Techniques and Applications
From Everand
Data Mining Models: Techniques and Applications
Ravi Deshpande
No ratings yet
Microprediction: Building an Open AI Network
From Everand
Microprediction: Building an Open AI Network
Peter Cotton
No ratings yet
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. CLASSIFICATION PREDICTIVE TECHNIQUES: SUPPORT VECTOR MACHINE, LOGISTIC REGRESSION, DISCRIMINANT ANALYSIS and DECISION TREES: Examples with MATLAB
César Pérez López
No ratings yet
Customer Profiling, Segmentation, and Sales Prediction Using AI in Direct Marketing
No ratings yet
Customer Profiling, Segmentation, and Sales Prediction Using AI in Direct Marketing
11 pages
Artificial Intelligence 2024 Book 2 of 2: AI, #2
From Everand
Artificial Intelligence 2024 Book 2 of 2: AI, #2
Yang Yen Thaw
No ratings yet
ML Review PPT 2
No ratings yet
ML Review PPT 2
29 pages
A Beginner's Guide To Customer Segmentation With Python - by Sigli Mumuni - Medium
No ratings yet
A Beginner's Guide To Customer Segmentation With Python - by Sigli Mumuni - Medium
14 pages
Computational Geometry: Exploring Geometric Insights for Computer Vision
From Everand
Computational Geometry: Exploring Geometric Insights for Computer Vision
Fouad Sabry
No ratings yet
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
From Everand
DEEP LEARNING TECHNIQUES: CLUSTER ANALYSIS and PATTERN RECOGNITION with NEURAL NETWORKS. Examples with MATLAB
César Pérez López
No ratings yet
C++ Programming: From Novice to Expert in a Step-by-Step Journey
From Everand
C++ Programming: From Novice to Expert in a Step-by-Step Journey
Ryan Campbell
No ratings yet
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet
Extension courseware based on the ArchiMate Standard, Version 3.1 Standard by Van Haren Publishing
From Everand
Extension courseware based on the ArchiMate Standard, Version 3.1 Standard by Van Haren Publishing
Van Haren Learning Solutions a.o.
No ratings yet
Pragmatic Machine Learning with Python: Learn How to Deploy Machine Learning Models in Production
From Everand
Pragmatic Machine Learning with Python: Learn How to Deploy Machine Learning Models in Production
Avishek Nag
No ratings yet
Customer Profiling Segmentation and Sales Predicti
No ratings yet
Customer Profiling Segmentation and Sales Predicti
12 pages
Mastering Data Structures and Algorithms in Python & Java
From Everand
Mastering Data Structures and Algorithms in Python & Java
Sachin Naha
No ratings yet
5
No ratings yet
5
14 pages
C Programming Concepts
From Everand
C Programming Concepts
Jitendra Patel
No ratings yet
Behavioural Customer Segmentation Based
No ratings yet
Behavioural Customer Segmentation Based
7 pages
Data Science Programming In Python
From Everand
Data Science Programming In Python
Anita Raichand
No ratings yet
Software Design Simplified
From Everand
Software Design Simplified
Liviu Catalin Dorobantu
No ratings yet
Creo Manufacturing 4.0 Black Book
From Everand
Creo Manufacturing 4.0 Black Book
Gaurav Verma
No ratings yet
Six Sigma Yellow Belt: Introduction to Lean six Sigma Methodology for Beginners
From Everand
Six Sigma Yellow Belt: Introduction to Lean six Sigma Methodology for Beginners
Elias Soussi
No ratings yet
Literature[1]
No ratings yet
Literature[1]
22 pages
DESIGN ALGORITHMS TO SOLVE COMMON PROBLEMS: Mastering Algorithm Design for Practical Solutions (2024 Guide)
From Everand
DESIGN ALGORITHMS TO SOLVE COMMON PROBLEMS: Mastering Algorithm Design for Practical Solutions (2024 Guide)
ARCHER PAUL
No ratings yet
How Can Algorithms Help in Segmenting Users and Customers? A Systematic Review and Research Agenda For Algorithmic Customer Segmentation
No ratings yet
How Can Algorithms Help in Segmenting Users and Customers? A Systematic Review and Research Agenda For Algorithmic Customer Segmentation
16 pages
A Machine Learning
No ratings yet
A Machine Learning
9 pages
Learn Design and Analysis of Algorithms in 24 Hours
From Everand
Learn Design and Analysis of Algorithms in 24 Hours
Alex Nordeen
No ratings yet
AI-Driven Web Apps: Practical Machine Learning for Software Developers
From Everand
AI-Driven Web Apps: Practical Machine Learning for Software Developers
Sivaramarajalu Ramadurai Venkataraajalu
No ratings yet
Top 20 MS Excel VBA Simulations, VBA to Model Risk, Investments, Growth, Gambling, and Monte Carlo Analysis
From Everand
Top 20 MS Excel VBA Simulations, VBA to Model Risk, Investments, Growth, Gambling, and Monte Carlo Analysis
Andrei Besedin
2.5/5 (2)
Customer Segmentation Using Machine Learning
No ratings yet
Customer Segmentation Using Machine Learning
6 pages
Machine Learning with Python: Foundations and Applications: ML, #1
From Everand
Machine Learning with Python: Foundations and Applications: ML, #1
Mohammed Nurudeen
No ratings yet
Blue Prism Developer Certification Case Based Practice Question - Latest 2023
From Everand
Blue Prism Developer Certification Case Based Practice Question - Latest 2023
Exam OG
No ratings yet
The MSP’s Guide to the Ultimate Client Experience: Optimizing service efficiency, account management productivity, and client engagement with a modern digital-first approach.
From Everand
The MSP’s Guide to the Ultimate Client Experience: Optimizing service efficiency, account management productivity, and client engagement with a modern digital-first approach.
Jeff Farris
No ratings yet
1
No ratings yet
1
15 pages
Customer Profiling, Segmentation, and Sales Prediction Using AI in Direct Marketing
No ratings yet
Customer Profiling, Segmentation, and Sales Prediction Using AI in Direct Marketing
35 pages
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
From Everand
Python Advanced Programming: The Guide to Learn Python Programming. Reference with Exercises and Samples About Dynamical Programming, Multithreading, Multiprocessing, Debugging, Testing and More
Marcus Richards
No ratings yet
Digital Engineering: Complex System Design
From Everand
Digital Engineering: Complex System Design
S Mathioudakis
No ratings yet
MCS-011: Problem Solving and Programming
From Everand
MCS-011: Problem Solving and Programming
Dr. DK Sukhani
No ratings yet
First Coding Session - Overview!
No ratings yet
First Coding Session - Overview!
5 pages
C# Data Structures and Algorithms: Harness the power of C# to build a diverse range of efficient applications
From Everand
C# Data Structures and Algorithms: Harness the power of C# to build a diverse range of efficient applications
Marcin Jamro
No ratings yet
Getting Started with Greenplum for Big Data Analytics
From Everand
Getting Started with Greenplum for Big Data Analytics
Sunila Gollapudi
No ratings yet
Cloud Native AI and Machine Learning on AWS: Use SageMaker for building ML models, automate MLOps, and take advantage of numerous AWS AI services (English Edition)
From Everand
Cloud Native AI and Machine Learning on AWS: Use SageMaker for building ML models, automate MLOps, and take advantage of numerous AWS AI services (English Edition)
Premkumar Rangarajan
No ratings yet
C++ Essentials
From Everand
C++ Essentials
Zoe Codewell
No ratings yet
ADS_Phase2
No ratings yet
ADS_Phase2
6 pages
Practical Full Stack Machine Learning: A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
From Everand
Practical Full Stack Machine Learning: A Guide to Build Reliable, Reusable, and Production-Ready Full Stack ML Solutions
Alok Kumar
No ratings yet
Weaknesses of Indonesian Language and Literature
No ratings yet
Weaknesses of Indonesian Language and Literature
2 pages
Simple Present Tense 1
No ratings yet
Simple Present Tense 1
1 page
Aragorn Training Document
No ratings yet
Aragorn Training Document
34 pages
Cagayan Hymn
No ratings yet
Cagayan Hymn
1 page
Bodewitz - Prana, Apana and Other Pranas in Vedic Literature
No ratings yet
Bodewitz - Prana, Apana and Other Pranas in Vedic Literature
13 pages
4ps Monthly Distributioin - Abderrahmane Omara
No ratings yet
4ps Monthly Distributioin - Abderrahmane Omara
10 pages
DPTE-Full-Course-2024-1
100% (1)
DPTE-Full-Course-2024-1
2 pages
A Proposal For Intervention in Dysorthographic Disorders by Attending To The Semiology of The Mistakes
No ratings yet
A Proposal For Intervention in Dysorthographic Disorders by Attending To The Semiology of The Mistakes
11 pages
1 Reading Grammar
No ratings yet
1 Reading Grammar
54 pages
Phonetics & Phonology. Lecture 8 - Aspects of Connected Speech
No ratings yet
Phonetics & Phonology. Lecture 8 - Aspects of Connected Speech
32 pages
Grammar: What Are You Going To Do On Friday?
No ratings yet
Grammar: What Are You Going To Do On Friday?
2 pages
Title of Paper: (Tajuk Artikel Dalam Bahasa Malaysia)
No ratings yet
Title of Paper: (Tajuk Artikel Dalam Bahasa Malaysia)
3 pages
Autobiographical Slide Show
No ratings yet
Autobiographical Slide Show
2 pages
Evonaj mat' ves' noс television karaulil - His mother watched TV all night long. On the loss of gender as a grammatical category in Alaskan Russian - Daly 1986
No ratings yet
Evonaj mat' ves' noс television karaulil - His mother watched TV all night long. On the loss of gender as a grammatical category in Alaskan Russian - Daly 1986
31 pages
Summer Programme 13 PDF
No ratings yet
Summer Programme 13 PDF
12 pages
1st Periodical Test in English
0% (1)
1st Periodical Test in English
3 pages
Writing With Noun Clauses: Example: She Didn't Hear What I Was Saying
No ratings yet
Writing With Noun Clauses: Example: She Didn't Hear What I Was Saying
2 pages
Celtic Thesis
100% (3)
Celtic Thesis
6 pages
Museum Analysis Assignment
100% (1)
Museum Analysis Assignment
2 pages
Unit 7 - Go For It!: Grammar Vocabulary Pronunciation
No ratings yet
Unit 7 - Go For It!: Grammar Vocabulary Pronunciation
2 pages
Summery of Cognitive Accounts
No ratings yet
Summery of Cognitive Accounts
2 pages
Gcse English Poetry Revision
No ratings yet
Gcse English Poetry Revision
40 pages
Semicolon
No ratings yet
Semicolon
2 pages
Vocabulary Words: Subject-Verb Agreement
No ratings yet
Vocabulary Words: Subject-Verb Agreement
3 pages
Misolas - EDUM 204 - Synchronous Activity 3
No ratings yet
Misolas - EDUM 204 - Synchronous Activity 3
1 page
Elements Tchniques Literary Devices
No ratings yet
Elements Tchniques Literary Devices
16 pages
Spring Syllable Counting Cards
No ratings yet
Spring Syllable Counting Cards
6 pages
7th Eng Unit 3 Lesson Plan
100% (1)
7th Eng Unit 3 Lesson Plan
5 pages
Sample Research Gap
No ratings yet
Sample Research Gap
2 pages