Agglomerative Clustering - Customer Segmentation Term Paper
ABSTRACT
This project explores the application of agglomerative clustering on customer
data taken from a grocery mart’s database. The aim of this project is to segment
the customers into distinct clusters based on purchasing behaviour. The dataset
has been streamlined using dimensionality reduction methods, followed by
agglomerative clustering to identify clusters. The analysis yielded four distinct
customer segments, which were profiled on factors such as family structure,
income level and spending patterns. These insights support targeted marketing
strategies tailored to the needs of each customer segment, which can make
marketing in the retail industry more effective.
INTRODUCTION
The process of grouping a set of physical or abstract objects into classes of
similar objects is called clustering. A cluster is a collection of data objects that
are similar to one another within the same cluster and dissimilar to the objects in
other clusters. A cluster of data objects can be treated collectively as one group
and so may be considered as a form of data compression. Although
classification is an effective means for distinguishing groups or classes of
objects, it requires the often-costly collection and labelling of a large set of
training tuples or patterns, which the classifier uses to model each group.
Clustering is also called data segmentation in some applications because
clustering partitions large data sets into groups according to their similarity
(Ali and Kadhum, 2017).
4. Place
• NumWebPurchases – Number of purchases made through the website.
• NumCatalogPurchases – Number of purchases made using the
catalogue.
• NumStorePurchases – Number of purchases made directly in
stores.
• NumWebVisitsMonth – Number of visits to the company website in
the last month.
For this project, the model has been built in Python, one of the most widely
used programming languages for machine learning applications. The code for
the agglomerative clustering model has been executed in Jupyter Notebook, a
web-based interactive computing environment that supports Python.
The dataset has been imported into the notebook, following which the data has
been cleaned to deal with missing values. After cleaning, feature engineering has
been done to aid dimensionality reduction later in the project.
The features have then been plotted. The plots show a few clear outliers in the
Income and Age features, and these outliers have been removed.
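The cleaning, feature engineering and outlier removal steps described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the toy rows, the raw column names Year_Birth and Income, the reference year and the two cut-off values are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the customer table. The values are made up, and the raw
# column names Year_Birth and Income are assumptions for this sketch.
raw = pd.DataFrame({
    "Year_Birth": [1970, 1985, 1900, 1992, 1964],
    "Income": [58000.0, None, 42000.0, 666666.0, 71000.0],
    "Kidhome": [0, 1, 0, 1, 0],
    "Teenhome": [1, 0, 0, 0, 2],
})

def prepare(df, reference_year=2021, max_age=90, max_income=600_000):
    """Drop rows with missing values, derive features, filter outliers.

    reference_year, max_age and max_income are hypothetical settings.
    """
    out = df.dropna().copy()                          # cleaning: remove missing values
    out["Age"] = reference_year - out["Year_Birth"]   # engineered feature
    out["Children"] = out["Kidhome"] + out["Teenhome"]
    out["Is_parent"] = (out["Children"] > 0).astype(int)
    keep = (out["Age"] < max_age) & (out["Income"] < max_income)  # outlier filter
    return out[keep].reset_index(drop=True)

clean = prepare(raw)
print(clean[["Age", "Income", "Children", "Is_parent"]])
```

Fixed cut-offs are only one way to remove outliers; an interquartile-range rule would be an equally reasonable choice if the plots do not suggest obvious thresholds.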
Based on this plot, the number of clusters chosen is 4. This is the point just
before the curve plateaus, which indicates that the clusters already have high
cohesion and that adding more clusters would bring little further improvement.
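A rough sketch of how such a curve and the final clustering could be produced is given below. It assumes standardisation followed by PCA with 3 components as the dimensionality reduction step, and it measures cohesion as the within-cluster sum of squares; neither choice is confirmed by the text above, and the data here is synthetic.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered customer features (assumption):
# 4 groups of 50 points in 8 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=6.0, size=(4, 8))
X = np.vstack([c + rng.normal(size=(50, 8)) for c in centers])

# Dimensionality reduction: scale, then project onto 3 principal components.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=3).fit_transform(X_scaled)

def wcss(points, labels):
    """Within-cluster sum of squares: lower means more cohesive clusters."""
    total = 0.0
    for lab in np.unique(labels):
        members = points[labels == lab]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Cohesion curve for k = 2..8; the "elbow" is where it starts to flatten.
curve = {}
for k in range(2, 9):
    k_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X_reduced)
    curve[k] = wcss(X_reduced, k_labels)

# Final model with the chosen number of clusters (default Ward linkage).
labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_reduced)
```

Plotting the values in curve against k gives the elbow plot; the resulting labels array can then be attached to the customer table as a cluster column for profiling.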
Based on the plot, no customer has taken part in all 5 campaigns. The overall
response to the campaigns is underwhelming.
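The count behind this observation can be sketched as below. The column names AcceptedCmp1 to AcceptedCmp5, the derived Total_Promos column and the Cluster column are assumptions for illustration, and the 0/1 values are made up.

```python
import pandas as pd

# Illustrative rows only; column names and values are assumptions.
df = pd.DataFrame({
    "AcceptedCmp1": [0, 1, 0, 0, 1, 0],
    "AcceptedCmp2": [0, 0, 0, 0, 1, 0],
    "AcceptedCmp3": [1, 0, 0, 0, 1, 0],
    "AcceptedCmp4": [0, 1, 0, 0, 1, 0],
    "AcceptedCmp5": [0, 0, 0, 1, 0, 0],
    "Cluster":      [0, 1, 2, 3, 1, 2],
})

campaign_cols = [f"AcceptedCmp{i}" for i in range(1, 6)]

# Number of campaigns each customer responded to (0 to 5).
df["Total_Promos"] = df[campaign_cols].sum(axis=1)

# Customers per cluster for each campaign count; this is what a count plot shows.
promo_counts = pd.crosstab(df["Cluster"], df["Total_Promos"])

# True only if at least one customer responded to every campaign.
accepted_all = bool((df["Total_Promos"] == len(campaign_cols)).any())
print(promo_counts)
print("Anyone in all campaigns:", accepted_all)
```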
Deals offered:
The deals offered have done well. The best outcomes can be seen with clusters 0
and 3. Clusters 1 and 2 have responded to the deals less strongly.
1. “Kidhome”,
2. “Teenhome”,
3. “Customer_for”,
4. “Age”,
5. “Children”,
6. “Family_Size”,
7. “Is_parent”,
8. “education”,
9. “Living_with”
Based on these plots, the following information can be deduced about the
customers:
Cluster 0:
• A parent.
• At least 2 and at most 4 members in the family.
• Single parents are a subset of this group.
• Most have a teenager at home.
• Relatively older.
Cluster 1:
• Not a parent.
• At most 2 family members.
• Slight majority of couples.
• Span all ages.
• High income.
Cluster 2:
• Mostly parents.
• At most 3 members in the family.
• Most have one child.
• Relatively younger.
Cluster 3:
• A parent.
• At most 5 and at least 2 family members.
• Majority of them have a teenager at home.
• Relatively older.
• Lower-income group.
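Profiles like the ones above can also be summarised numerically instead of being read from plots. The sketch below is a hedged illustration: it assumes a Cluster column has been attached to the customer table, and the rows are made up and do not reproduce the project's results.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow the feature
# list used for profiling, and the Cluster column is an assumption.
customers = pd.DataFrame({
    "Cluster":     [0, 0, 1, 1, 2, 2, 3, 3],
    "Family_Size": [2, 4, 1, 2, 3, 2, 5, 2],
    "Is_parent":   [1, 1, 0, 0, 1, 0, 1, 1],
    "Teenhome":    [1, 1, 0, 0, 0, 0, 1, 2],
    "Age":         [58, 61, 35, 66, 33, 38, 60, 63],
})

# One row per cluster: family size range, share of parents,
# share of customers with at least one teenager at home, and median age.
profile = customers.groupby("Cluster").agg(
    min_family=("Family_Size", "min"),
    max_family=("Family_Size", "max"),
    parent_share=("Is_parent", "mean"),
    teen_share=("Teenhome", lambda s: (s > 0).mean()),
    median_age=("Age", "median"),
)
print(profile)
```

Statements such as "at least 2 and at most 4 members in the family" correspond to the min_family and max_family columns, while "majority parents" corresponds to a parent_share above 0.5.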
REFERENCES:
1. H. H. Ali and L. E. Kadhum, ‘K-Means Clustering Algorithm Applications in Data Mining and Pattern
Recognition’, Int. J. Sci. Res., vol. 6, no. 8, pp. 1577–1584, 2017
2. Customer Personality Analysis (kaggle.com)