Customer Personality Analysis & Predictive Segmentation
2.Customer Segmentation
4.Product Recommendations
Project Stages
The project will proceed in two major stages, mirroring real-life project
execution:
Customer Segmentation: This stage involves the use of clustering algorithms to classify
customers into different groups based on their distinct attributes and purchasing
behaviors. The derived segments will serve as a foundation for tailored marketing
strategies and decision-making processes.
Model Development for Future Data: Building on the customer segmentation, this stage
focuses on developing a predictive model that can handle future data. This model will
enable the company to anticipate changes in customer behavior and adapt their
strategies accordingly.
Business Requirements
Proposed Solution
My solution focuses on leveraging machine learning techniques to meet
the defined business requirements:
Data Analysis and Clustering: Conduct extensive data preprocessing and exploratory
analysis to identify key features. Use clustering algorithms to create distinct customer
segments.
Predictive Modeling for Future Data: Train a classification model on the labeled segments so that new and future customers can be assigned to the appropriate segment and their behavior anticipated.
Approach
Our approach to this project follows a structured, step-by-step methodology grounded in
data science best practices. Each stage is thoughtfully designed to build upon the previous,
ensuring a cohesive and comprehensive solution.
Understanding the Data: The first step involves a thorough understanding of the dataset, its
variables, and its structure. This step is crucial for shaping the subsequent stages of the
project.
Data Preprocessing: After understanding the dataset, we clean and preprocess the data.
This involves handling missing values, potential outliers, and categorical variables, ensuring
the data is ready for analysis.
Exploratory Data Analysis (EDA): This stage involves unearthing patterns, spotting
anomalies, testing hypotheses, and checking assumptions through visual and quantitative
methods. It provides an in-depth understanding of the variables and their interrelationships,
which aids in feature selection.
Feature Selection: Based on the insights from EDA, relevant features are selected for
building the machine learning model. Feature selection is critical to improve the model's
performance by eliminating irrelevant or redundant information.
Customer Segmentation: The preprocessed data is then fed into a clustering algorithm to
group customers into distinct segments based on their attributes and behavior. This
segmentation enables targeted marketing and personalized customer engagement.
Prediction on Future Data: The final step involves utilizing the trained model to make
predictions on future data. This will allow the business to anticipate changes in customer
behavior and adapt their strategies accordingly.
This approach ensures a systematic and thorough analysis of the customer data, leading to
robust and reliable customer segments and predictions. It aims to provide a foundation
upon which strategic business decisions can be made and future customer trends can be
anticipated.
import warnings
warnings.filterwarnings('ignore')
Data Preprocessing
1.Data Collection
The data collection process in my project involved using the "requests" library to retrieve
a CSV file from a specific URL. The content was decoded using UTF-8 encoding, replacing
semicolons with commas. The decoded content was saved to a local file. This process
ensured the successful acquisition of the necessary data for further analysis and
modeling.
url = "https://raw.githubusercontent.com/amankharwal/Website-data/master/marketing_campaign.csv"
response = requests.get(url)
response.raise_for_status()  # Check for any errors
In [3]: df = pd.read_csv('marketing_campaign.csv')
df.head()
Out[3]: ID;Year_Birth;Education;Marital_Status;Income;Kidhome;Teenhome;Dt_Customer;Recency;MntWines;M
response = requests.get(url)
response.raise_for_status()  # Check for any errors
# Decode the content using UTF-8 and replace semicolons with commas
content = response.content.decode("utf-8").replace(";", ",")
# Save the corrected content locally and re-read it with pandas
with open("marketing_campaign.csv", "w") as f:
    f.write(content)
df = pd.read_csv("marketing_campaign.csv")
df.head()
In [53]: df.head()
People:
1. ID: Customer's unique identifier.
2. Year_Birth: Customer's year of birth.
3. Education: Customer's education level.
4. Marital_Status: Customer's marital status.
5. Income: Customer's yearly household income, representing the total income of all
members in the household.
6. Kidhome: Number of children in the customer's household.
7. Teenhome: Number of teenagers in the customer's household.
8. Dt_Customer: Date of customer's enrollment with the company, indicating when the
customer became a registered member.
9. Recency: Number of days since the customer's last purchase, providing a measure of
the customer's engagement and recent activity.
10. Complain: Indicates whether the customer has made a complaint in the last 2 years. (1 if
the customer has complained, 0 otherwise)
Products:
1. MntWines: Amount spent on wine in the last 2 years, reflecting the customer's
expenditure on wine products.
2. MntFruits: Amount spent on fruits in the last 2 years, representing the customer's
expenditure on fruit products.
3. MntMeatProducts: Amount spent on meat in the last 2 years, indicating the customer's
expenditure on meat products.
4. MntFishProducts: Amount spent on fish in the last 2 years, representing the customer's
expenditure on fish products.
5. MntSweetProducts: Amount spent on sweets in the last 2 years, reflecting the customer's expenditure on sweet products.
6. MntGoldProds: Amount spent on gold products in the last 2 years, indicating the
customer's expenditure on gold items.
Promotion:
1. NumDealsPurchases: Number of purchases made with a discount.
2. AcceptedCmp1: Indicates whether the customer accepted the offer in the 1st campaign.
(1 if the customer accepted, 0 otherwise)
3. AcceptedCmp2: Indicates whether the customer accepted the offer in the 2nd
campaign. (1 if the customer accepted, 0 otherwise)
4. AcceptedCmp3: Indicates whether the customer accepted the offer in the 3rd
campaign. (1 if the customer accepted, 0 otherwise)
5. AcceptedCmp4: Indicates whether the customer accepted the offer in the 4th
campaign. (1 if the customer accepted, 0 otherwise)
6. AcceptedCmp5: Indicates whether the customer accepted the offer in the 5th
campaign. (1 if the customer accepted, 0 otherwise)
7. Response: Indicates whether the customer accepted the offer in the last campaign. (1 if
the customer accepted, 0 otherwise)
Place:
1. NumWebPurchases: Number of purchases made through the company's website.
2. NumCatalogPurchases: Number of purchases made using a catalogue.
3. NumStorePurchases: Number of purchases made directly in stores.
4. NumWebVisitsMonth: Number of visits to the company's website in the last month.
Acknowledgement: The dataset for this project is provided by Dr. Omar Romero-Hernandez.
Source: https://raw.githubusercontent.com/amankharwal/Website-data/master/marketing_campaign.csv
pd.set_option('display.max_columns', 50)
df.head()
3.Data Inspection
A pandas profiling analysis was conducted on the dataset using the ProfileReport
function from the pandas profiling library. The resulting profile report provided a
comprehensive summary of the data's structure, statistics, and distributions. The report
was saved as an HTML file for reference and further analysis. The profiling analysis
facilitated data understanding by revealing patterns, identifying missing values, and
offering insights for subsequent data preprocessing and analysis tasks.
import pandas_profiling as pp

data_profile = pp.ProfileReport(df)
data_profile.to_file('data_profile.html')
data_profile
Out[9]:
Overview
Dataset statistics
  Number of variables   29
  Missing cells         24
  Duplicate rows         0
Variable types
  Numeric               15
  Categorical           13
  DateTime               1
Alerts
  Z_CostContact has a constant value (Constant)
4.Data Cleaning
1. Dropped duplicate rows from the DataFrame.
2. Checked every column for missing values and dropped the rows with a missing Income.
3. Converted Dt_Customer to a datetime type.
4. Dropped the constant columns Z_CostContact and Z_Revenue.
5. Identified and counted the number of outliers for each numeric column using the
interquartile range (IQR) method.
df.drop_duplicates(inplace=True)
In [11]: df.isnull().sum()
Out[11]:
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Z_CostContact 0
Z_Revenue 0
Response 0
dtype: int64
To address missing values in the "Income" attribute, the affected rows are dropped to
avoid potential distortions in the clustering analysis, as income is a crucial feature for
grouping the data accurately.
In [12]: df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2216 entries, 0 to 2239
Data columns (total 29 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 2216 non-null int64
1 Year_Birth 2216 non-null int64
2 Education 2216 non-null object
3 Marital_Status 2216 non-null object
4 Income 2216 non-null float64
5 Kidhome 2216 non-null int64
6 Teenhome 2216 non-null int64
7 Dt_Customer 2216 non-null object
8 Recency 2216 non-null int64
9 MntWines 2216 non-null int64
10 MntFruits 2216 non-null int64
11 MntMeatProducts 2216 non-null int64
12 MntFishProducts 2216 non-null int64
13 MntSweetProducts 2216 non-null int64
14 MntGoldProds 2216 non-null int64
15 NumDealsPurchases 2216 non-null int64
16 NumWebPurchases 2216 non-null int64
17 NumCatalogPurchases 2216 non-null int64
18 NumStorePurchases 2216 non-null int64
19 NumWebVisitsMonth 2216 non-null int64
20 AcceptedCmp3 2216 non-null int64
21 AcceptedCmp4 2216 non-null int64
22 AcceptedCmp5 2216 non-null int64
23 AcceptedCmp1 2216 non-null int64
24 AcceptedCmp2 2216 non-null int64
25 Complain 2216 non-null int64
26 Z_CostContact 2216 non-null int64
27 Z_Revenue 2216 non-null int64
28 Response 2216 non-null int64
dtypes: float64(1), int64(25), object(3)
memory usage: 519.4+ KB
df['Dt_Customer'] = pd.to_datetime(df['Dt_Customer'])
In [15]: df.nunique()
Out[15]:
ID 2216
Year_Birth 59
Education 5
Marital_Status 8
Income 1974
Kidhome 3
Teenhome 3
Dt_Customer 662
Recency 100
MntWines 776
MntFruits 158
MntMeatProducts 554
MntFishProducts 182
MntSweetProducts 176
MntGoldProds 212
NumDealsPurchases 15
NumWebPurchases 15
NumCatalogPurchases 14
NumStorePurchases 14
NumWebVisitsMonth 16
AcceptedCmp3 2
AcceptedCmp4 2
AcceptedCmp5 2
AcceptedCmp1 2
AcceptedCmp2 2
Complain 2
Z_CostContact 1
Z_Revenue 1
Response 2
dtype: int64
After examining the unique values in each feature of the DataFrame df using the
df.nunique() function, it was observed that the columns "Z_CostContact" and "Z_Revenue"
contain only a single unique value. As these columns do not contribute to the analysis and
model development, it is decided to drop them from the dataset.
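The cell that performs this drop did not survive the export cleanly; a minimal sketch of the intended operation, assuming df is the cleaned frame from the steps above:

# Drop the two constant columns flagged by the profiling report
df.drop(columns=['Z_CostContact', 'Z_Revenue'], inplace=True)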
Outliers
Outliers can have a significant impact on clustering models. They can distort the overall
distribution and lead to biased cluster assignments. Outliers tend to pull cluster centroids
towards them, resulting in less accurate cluster representation and potentially affecting the
cluster boundaries. It is important to handle outliers carefully by either removing them or
using robust clustering algorithms that are less sensitive to outliers to ensure more accurate
and reliable clustering results.
# Count outliers in each numeric column using the 1.5 * IQR rule
numeric_cols = df.select_dtypes(include='number').columns
outlier_count = {}
for col in numeric_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outlier_count[col] = int(((df[col] < lower_bound) | (df[col] > upper_bound)).sum())

print(outlier_count)
The identification of outliers in the dataset prompts further analysis to understand their
context and relationship to the data. These outliers may provide insights into unique
patterns, market segments, or exceptional customer behavior. It is crucial to investigate their
validity and consider their influence carefully during modeling by using robust clustering
algorithms or outlier-resistant techniques.
# total purchase
df['Total_purchase'] = df['NumDealsPurchases'] + df['NumCatalogPurchases'] + df['NumWebPurchases'] + df['NumStorePurchases']
In [20]: df['Marital_Status'].value_counts()
Out[20]:
Married 857
Together 573
Single 471
Divorced 232
Widow 76
Alone 3
Absurd 2
YOLO 2
Name: Marital_Status, dtype: int64
marital_status_mapping = {
"Married": "Couple",
"Together": "Couple",
"Single": "Single",
"Divorced": "Single",
"Widow": "Single",
"Alone": "Single",
"Absurd": "Single",
"YOLO": "Single"
}
df['Marital_Status'] = df['Marital_Status'].map(marital_status_mapping)
df['Marital_Status'].value_counts()
Out[21]:
Couple 1430
Single 786
Name: Marital_Status, dtype: int64
total_adults_mapping = {
"Couple": 2,
"Single": 1
}
df['Total_adults'] = df['Marital_Status'].map(total_adults_mapping)
# Extract the year from 'Dt_Customer' and subtract from the current year
df['Customer_Since_Years'] = current_year - pd.to_datetime(df['Dt_Customer']).dt.year
In [24]: df['Education'].value_counts()
Out[24]:
Graduation 1116
PhD 481
Master 365
2n Cycle 200
Basic 54
Name: Education, dtype: int64
Total Spend: A new feature 'Total_spend' was created by adding up the amount spent
on different categories of products. This gives a holistic view of the customer's
spending habits.
Total Purchase: The total number of purchases made by each customer across different
platforms was calculated and stored in the 'Total_purchase' feature. This helps in
understanding the overall purchasing activity of the customer.
Family Size: The total number of adults and children in the family was calculated and
stored in 'Total_adults' and 'Total_children' respectively. A 'Family_size' feature was also
created that sums up these two, providing a complete picture of the family size.
Marital Status: The 'Marital_Status' feature was simplified by categorizing the customers
into 'Single' and 'Couple'. This simplification can make the data easier to analyze and
interpret.
Customer Since Years: The number of years a customer has been with the company was
calculated by subtracting the year of joining from the current year. This
'Customer_Since_Years' feature can provide insights into customer loyalty and retention.
Accepted Campaigns: A new feature 'accepted_camp' was created that sums up the
campaigns accepted by each customer. This can provide insights into the customer's
responsiveness to marketing campaigns.
Education Encoding: '2n Cycle' was replaced with 'Master' in the 'Education' feature for
clarity. The 'Education' feature was then encoded into numerical values, which can be
easier for machine learning algorithms to process.
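A rough sketch of the derivations described above; the exact expressions in the original notebook may differ, current_year is assumed to be defined earlier, and the ordinal values used for Education as well as the inclusion of Response in accepted_camp are assumptions:

# Total amount spent across all product categories
df['Total_spend'] = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts']
                     + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])

# Children, family size and age
df['Total_children'] = df['Kidhome'] + df['Teenhome']
df['Family_size'] = df['Total_adults'] + df['Total_children']
df['Age'] = current_year - df['Year_Birth']  # 'Age' is used in the EDA below

# Number of accepted campaign offers per customer
df['accepted_camp'] = (df['AcceptedCmp1'] + df['AcceptedCmp2'] + df['AcceptedCmp3']
                       + df['AcceptedCmp4'] + df['AcceptedCmp5'] + df['Response'])

# Education: merge '2n Cycle' into 'Master', then encode as ordered integers
df['Education'] = df['Education'].replace('2n Cycle', 'Master')
df['Education'] = df['Education'].map({'Basic': 0, 'Graduation': 1, 'Master': 2, 'PhD': 3})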
EDA
In [26]: df.describe()
Upon reviewing the descriptive statistics of the dataset, it's observed that there are
significant variances in certain features, notably 'Income' and 'Age'.
The 'Age' feature has a maximum value of 130, which is unusually high and likely indicates
the presence of outliers. Similarly, the 'Income' feature shows a substantial difference
between the mean and the maximum value, suggesting potential outliers in this feature as
well.
To ensure the accuracy of further analysis and modeling, it's crucial to address these outliers.
This can be done by either removing the outliers or replacing them with more
representative values, such as the median or mean of the feature.
The next step would be to conduct an outlier detection analysis to identify these extreme
values and decide on the most appropriate method to handle them. This will help to
improve the quality of the dataset and the reliability of subsequent insights derived from
the data.
Upon reviewing the box plots for 'Income' and 'Age', it's observed that there are only a few
outliers present in the data. Given their minimal presence, these outliers are unlikely to
provide meaningful insights and may potentially skew the results of the analysis. Therefore,
to maintain the integrity of the dataset and ensure more accurate results, the decision has
been made to remove these outliers from the dataset.
# Remove rows where Income or Age lies outside the 1.5 * IQR bounds (Q1, Q3 and IQR computed on these two columns)
df = df[~((df[['Income', 'Age']] < (Q1 - 1.5 * IQR)) | (df[['Income', 'Age']] > (Q3 + 1.5 * IQR))).any(axis=1)]
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df.to_csv('final_cleaned_data_csv', index_label=False)
Dimensionality Reduction
PCA
Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly correlated variables into a set of
values of linearly uncorrelated variables called principal components.
PCA identifies the axes in the feature space along which the data varies the most. It does
this by performing a covariance analysis between factors. In simple terms, it calculates an
'importance' score for each feature of your data, and then it orders these features by their
score, giving you the components in order of significance.
This can help to mitigate the curse of dimensionality, improve computational efficiency, and
make it easier to visualize the data. The transformed data (principal components) retain
most of the variance in the data with fewer dimensions, which can lead to more meaningful
and efficient clustering.
The filtered variables were selected based on expert advice from domain knowledge and
expertise.
In [40]: # Before performing PCA, it is important to standardize the data to bring the values onto a common scale
scaler = StandardScaler()
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.legend()
plt.show()
Based on the scree plot for df_scaled, a variance of 80% to 90% is achieved with 4 or 5
principal components. For df_filtered_scaled, a variance of 70% to 80% is achieved with 8
or 9 principal components. In PCA, it is desirable to retain a variance between 70% and
90% to capture a significant portion of the data's information while reducing
dimensionality.
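A minimal sketch of how the cumulative explained variance behind the scree plots can be computed and used to pick a component count (frame names follow the notebook; the 80% threshold is only an illustration):

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA with all components to inspect the variance profile of df_scaled
pca_full = PCA().fit(df_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components retaining at least 80% of the variance
n_components = int(np.argmax(cum_var >= 0.80)) + 1
print(n_components, cum_var[:n_components].round(3))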
df_scaled_pca = pca1.fit_transform(df_scaled)
Looking at the explained variance ratio (EVR) for the df_scaled dataset, the first component
explains 30.4% of the variance, while the remaining seven components account for 39.2%,
8.5%, 5.5%, 4.6%, 4.1%, 3.5%, and 3.5% respectively. This suggests that the first component
is contributing a substantial amount of information, but there is still a considerable amount
of information distributed across the remaining components.
On the other hand, in the df_filtered_scaled dataset, the first component explains a
significantly larger proportion of the variance, at 33.7%. The next four components account
for 16.9%, 11.4%, 10.1%, and 8.9% respectively. This represents a more substantial
proportion of the total variance explained by the first few components, indicating that this
dataset may have a lower effective dimensionality.
Based on the above analysis, it would seem that the df_filtered_scaled dataset may be a
better choice for further analysis or modeling. The reason for this is that a smaller number of
components explain a larger proportion of the variance, meaning that we retain more
information while reducing dimensionality. In other words, df_filtered_scaled is likely a more
efficiently compressed representation of our data.
In [44]: pca1_df.head()
In [45]: pca2_df.head()
1. df_scaled
2. df_filtered_scaled
3. pca1_df ( PCA of df_scaled data with n = 8)
4. pca2_df (PCA of df_filtered_scaled data with n = 5)
Stage 1
Clustering
We now have four dataframes: two original dataframes (df_scaled and df_filtered_scaled)
and two dimensionality-reduced dataframes using PCA (pca1_df and pca2_df). Our next step
is to segment customers by implementing various clustering algorithms. Specifically, we will
utilize K-Means++, Hierarchical Clustering, and DBSCAN.
1. K-Means++: A variant of K-Means that chooses the initial centroids more carefully, which speeds up convergence and typically yields better-separated partitions of the data into k clusters.
2. Hierarchical Clustering: This algorithm starts by treating each observation as its own atomic cluster and then successively
merges these atomic clusters into larger and larger clusters, until all objects are in a
single cluster or a termination condition is met.
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): This is a density-
based clustering algorithm, which groups together points in high-density areas and
identifies points in low-density areas as noise. It's different from K-Means and
Hierarchical Clustering as it doesn’t require the user to specify the number of clusters.
K-Means++
In the next step of our analysis, we will be using the Elbow Method and Within-Cluster-Sum-
of-Squares (WCSS) to determine the optimal number of clusters for the K-Means++
algorithm.
Elbow Method: This method involves running the K-Means algorithm multiple times
in a loop with an increasing number of clusters, and then plotting a clustering
score as a function of the number of clusters. The score could be within-cluster
variance, average silhouette, or any other internal clustering validation indices. The
optimal number of clusters is usually where the change in the clustering score begins to
diminish, often called the "elbow" in the plot.
By using the Elbow method with WCSS, we can determine an appropriate number of
clusters for our K-Means++ algorithm without excessive computation. The optimal number
of clusters is where we start to get diminishing returns in terms of reducing the WCSS, which
is visually represented as an 'elbow' in the plot.
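A sketch of the elbow computation for one of the dataframes (scikit-learn's KMeans with init='k-means++'; the notebook loops over all four dataframes, and the k range and random seed here are assumptions):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

wcss = []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=42)
    km.fit(pca1_df)
    wcss.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(k_values), wcss, 'o-')
plt.xlabel('Number of clusters (k)')
plt.ylabel('WCSS')
plt.title('Elbow Method - pca1_df')
plt.show()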
dataframes = {
'df_scaled': df_scaled,
'df_filtered_scaled': df_filtered_scaled,
'pca1_df': pca1_df,
'pca2_df': pca2_df,
}
plt.legend()
plt.show()
The scatter plots of PCA1 and PCA2 show that the data points are well-distributed into
distinct groups, indicating that the PCA transformation effectively captured the
underlying structure of the data. In contrast, the scatter plots of the scaled DF and
DF_filtered reveal more noise and lack clear clustering patterns.
Moving forward, it is advisable to focus the analysis on PCA1 and PCA2 to gain further
insights and perform subsequent clustering or classification tasks. These transformed
components contain the most meaningful information from the original data while
reducing noise and dimensionality. This approach allows for a more concise and accurate
representation of the data, facilitating more effective analysis and decision-making.
pca_dfs = {
'pca1_df': pca1_df,
'pca2_df': pca2_df,
}
fig.set_size_inches(18, 7)
y_lower = 10
for i in range(k):
ith_cluster_silhouette_values = \
sample_silhouette_values[labels == i]
ith_cluster_silhouette_values.sort()
size_cluster_i = ith_cluster_silhouette_values.shape[0]
y_upper = y_lower + size_cluster_i
color = cm.nipy_spectral(float(i) / k)
ax1.fill_betweenx(np.arange(y_lower, y_upper),
0, ith_cluster_silhouette_values,
facecolor=color, edgecolor=color, alpha=0.7)
for i, c in enumerate(centers):
ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
s=50, edgecolor='k')
The clustering analysis using pca1_df with 3 clusters achieved a moderately good silhouette
score of 0.315. This indicates a reasonable level of separation and coherence among the
identified clusters. Compared to the alternative pca2_df clustering solution, which obtained
a lower silhouette score of 0.213, pca1_df demonstrates a stronger clustering structure and
better differentiation between the groups. However, it's essential to consider the specific
context and domain knowledge when evaluating the quality of the clustering results. Further
validation and exploration of the clusters using additional techniques are recommended to
ensure the robustness and reliability of the chosen solution.
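The silhouette scores quoted above can be reproduced with scikit-learn's silhouette_score; a sketch, assuming k = 3 for both frames and that the frames still contain only the principal components:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for name, frame in [('pca1_df', pca1_df), ('pca2_df', pca2_df)]:
    labels = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=42).fit_predict(frame)
    print(name, round(silhouette_score(frame, labels), 3))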
# Create a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Add legend
ax.legend()
# Create a 3D plot
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
# Add legend
ax.legend()
Hierarchical Clustering
Hierarchical clustering is a clustering algorithm that seeks to build a hierarchy of clusters. It
does not require a predefined number of clusters but instead forms clusters by recursively
merging or splitting them based on the similarity between data points.
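A sketch of the agglomerative step applied below to the PCA dataframes; Ward linkage and the cut at 3 clusters are assumptions:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering

# Use only the principal components (drop any previously assigned labels)
X_h = pca1_df.drop(columns=['Cluster'], errors='ignore')

# Dendrogram to inspect the merge hierarchy
Z = linkage(X_h, method='ward')
dendrogram(Z, truncate_mode='lastp', p=20)
plt.show()

# Flat clustering by cutting the hierarchy at 3 clusters
hier_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X_h)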
In [55]: pca1_df.head()
Out[55]: PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 Cluster
In [56]: pca2_df.head()
pca1_df.drop(columns=['Cluster'], inplace=True)
pca2_df.drop(columns=['Cluster'], inplace=True)
In [59]: dataframes = {
'pca1_df': pca1_df,
'pca2_df': pca2_df,
}
plt.legend()
plt.show()
Based on the observation of the scatter plots, it appears that the clusters
created by hierarchical clustering showed significant overlap. This suggests
that the algorithm was not able to generate well-separated and distinct
clusters for this particular dataset. Consequently, the effectiveness of
hierarchical clustering in creating meaningful clusters for this dataset is limited.

DBSCAN
# Display metrics/sample
n_clusters1_ = len(set(clusters1)) - (1 if -1 in clusters1 else 0)
n_noise1_ = list(clusters1).count(-1)
# Display metrics/sample
n_clusters2_ = len(set(clusters2)) - (1 if -1 in clusters2 else 0)
n_noise2_ = list(clusters2).count(-1)
Hyperparameter Tuning
Check Best eps and min_samples
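A common heuristic for eps is the k-distance plot: compute every point's distance to its k-th nearest neighbour (with k equal to min_samples), sort the distances, and read eps off the "knee". A sketch for pca1_df, with min_samples assumed to be 20:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

min_samples = 20  # assumed, matching the DBSCAN run below
nn = NearestNeighbors(n_neighbors=min_samples).fit(pca1_df)
distances, _ = nn.kneighbors(pca1_df)

# Sorted distance to the k-th nearest neighbour; the knee suggests a value for eps
k_dist = np.sort(distances[:, -1])
plt.plot(k_dist)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 20th nearest neighbour')
plt.show()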
In [64]: # Create an instance of DBSCAN to create non-spherical clusters based on data density
# For pca1_df
db1 = DBSCAN(eps=1, min_samples=20)
# Display metrics/sample
In [61]: # Create an instance of DBSCAN to create non-spherical clusters based on data density
# For pca1_df
db1 = DBSCAN(eps=1, min_samples=5)
# Display metrics/sample
n_clusters1_ = len(set(clusters1)) - (1 if -1 in clusters1 else 0)
n_noise1_ = list(clusters1).count(-1)
Based on these results, using DBSCAN with the current parameter settings is not
recommended for these datasets. The parameters, especially epsilon and the minimum
sample count, should be fine-tuned based on the characteristics of the data. If tuning the
parameters doesn't improve the results, it might be more appropriate to use a different
clustering algorithm that is more suited to the data's structure. Alternative algorithms to
consider could include K-means, Hierarchical Clustering, or Gaussian Mixture Models.
Throughout the clustering stage we have worked primarily with two dataframes, pca1_df and
pca2_df. These dataframes represent two distinct sets of principal components derived from
our original dataset.
After an exhaustive and iterative process of model tuning and evaluation, we found that the
K-means++ algorithm performed the most effectively in revealing the underlying structure
of our data. The clusters formed by the K-means++ algorithm were more cohesive and
better separated than those generated by the Hierarchical Clustering and DBSCAN
algorithms.
To further explore and validate our clustering results from the K-means++ model, we have
conducted an in-depth exploratory data analysis (EDA) on the formed clusters for both
pca1_df and pca2_df. This process included assessing the distributions, central tendencies,
and dispersions of our data within and across the identified clusters.
We also employed visual analytics, creating an interactive dashboard using Power BI. This
step greatly facilitated the presentation of our clustering results, enabling us to interactively
explore each cluster and their respective properties. This dashboard offered a visually
intuitive understanding of the cluster formations and their characteristics, providing
significant insights into the structure of our data.
Based on the insights from our EDA and Power BI visualization, we were able to evaluate
and compare the clustering results between pca1_df and pca2_df. These evaluations will
allow us to ascertain which set of principal components, and consequently, which feature
space better represents the structure of our original data.
In conclusion, the application of K-means++ has yielded valuable insights into the hidden
structure of our datasets. This, combined with the subsequent EDA and Power BI dashboard
creation, has provided us with a comprehensive understanding of our data and will be
instrumental in guiding our future data-driven decision-making processes.
Feature Engineering
As a part of our feature engineering strategy, we have enriched the original datasets df1
and df2 by incorporating the cluster assignments from our K-means++ algorithm applied
to the respective PCA-transformed dataframes, pca1_df and pca2_df. This additional
'Cluster' feature serves as an indicator of the underlying group structure detected in our
high-dimensional datasets, thus providing a valuable source of information for further
exploratory data analysis.
df1.to_csv('group1_pca1.csv', index=False)
df2.to_csv('group2_pca2.csv', index=False)
Evaluation (EDA)
Sum of accepted campaigns by groups
plt.figure(figsize=(6, 4))
sns.barplot(x="Cluster", y="accepted_camp", data=df1, estimator=sum, ci=None)
plt.title('Sum of Accepted Campaigns for Each Cluster df1(pca1)')
plt.xlabel('Cluster')
plt.ylabel('Sum of Accepted Campaigns')
plt.show()
#df2(pca2 group)
plt.figure(figsize=(6, 4))
sns.barplot(x="Cluster", y="accepted_camp", data=df2, estimator=sum, ci=None)
plt.title('Sum of Accepted Campaigns for Each Cluster df2 (pca2)')
plt.xlabel('Cluster')
plt.ylabel('Sum of Accepted Campaigns')
plt.show()
# Pairplot
sns.pairplot(df_subset, hue='Cluster')
plt.show()
# Pairplot
sns.pairplot(df_subset, hue='Cluster')
plt.show()
In [72]: # Define the bin edges and labels for age bins
age_bins = [20, 40, 60, 80, 100]
age_labels = ['20-40', '40-60', '60-80', '80-100']
# Define the bin edges and labels for age bins df2
age_bins = [20, 40, 60, 80, 100]
age_labels = ['20-40', '40-60', '60-80', '80-100']
This group consists of lower income individuals, typically earning between 0 to 50K.
Their spending capacity is comparably lower, ranging between 10 to 250.
The group is largely made up of couples, most of whom have at least 1 or 2 children.
A high proportion of individuals in this group are educated.
The family size is typically larger, ranging from 3 to 5 with no singles in this group.
The age range is quite mature, from 40 to 60 years old.
Discounts, family-centric offerings, and educational resources might be well-received by
this group.
This cluster comprises a middle-income group, with earnings between 50K and 80K.
They spend an average amount on purchases, with the range falling between 1000 to
1600.
Most are single and have only one child, indicating a small family size not exceeding 3
members.
Given the balance between income and spending, this group may respond well to
value-for-money offerings.
Targeted marketing strategies may include single parent-focused campaigns or
promotions for moderate to high-priced items.
This group consists of lower income individuals, typically earning between 0 to 50K.
Their spending capacity is comparably lower, ranging between 10 to 250.
The group is largely made up of couples, most of whom have at least 1 or 2 children.
A high proportion of individuals in this group are educated.
The family size is typically larger, ranging from 3 to 5 with no singles in this group.
The age range is quite mature, from 40 to 60 years old.
Discounts, family-centric offerings, and educational resources might be well-received by
this group.
This cluster comprises a middle-income group, with earnings between 50K and 80K.
They spend an average amount on purchases, with the range falling between 1000 to
1600.
Most are single and have only one child, indicating a small family size not exceeding 3
members.
Given the balance between income and spending, this group may respond well to
value-for-money offerings.
Targeted marketing strategies may include single parent-focused campaigns or
promotions for moderate to high-priced items.
3. Value Deals for Cluster 0: Offer value deals, discounts, or bundle deals to cater to the
lower income and spending power of this group.
5. Balanced Approach for Cluster 1: Offer a balanced approach with value for money
products and emphasize quality and durability.
7. Single Parent Campaigns for Cluster 1: Tailor campaigns to cater to the unique
challenges and needs of single-parent households in Cluster 1.
8. Lifecycle Marketing: Track customers through different life stages and adjust
marketing strategies accordingly.
9. Customer Feedback: Seek regular feedback from each cluster group to understand
their evolving needs and make necessary adjustments.
10. Predictive Analytics: Utilize predictive analytics to anticipate future buying behavior
and tailor marketing efforts proactively.
STAGE 2
model_df2 = model_df.copy()
model_df3 = model_df.copy()
model_df4 = model_df.copy()
Experiment 1 (SMOTE)
In [74]: # drop unimportant features
X = model_df.drop(columns=['Cluster'], axis=1)
y = model_df['Cluster']
In [76]: y_train.value_counts()
Out[76]:
1 968
2 655
0 141
Name: Cluster, dtype: int64
To address the class imbalance in the target variable, we apply the SMOTE
(Synthetic Minority Over-sampling Technique) algorithm.
SMOTE (Synthetic Minority Over-sampling Technique) is a technique used to address class
imbalance problems in machine learning. It creates synthetic samples of the minority
class to balance out the number of samples between classes, thereby helping to improve
model performance and robustness. SMOTE works by considering the k-nearest
neighbors of a data point in the feature space and generating new instances along the
lines joining the neighbors. While it's a powerful technique to counter class imbalance, it
should be applied thoughtfully as it can sometimes introduce noise by generating
synthetic instances without considering the instances from other classes.
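A sketch of the oversampling step with imbalanced-learn (the random seed is an assumption); the value counts below show the result, with every class brought up to 968 training samples:

from imblearn.over_sampling import SMOTE

# Oversample only the training split so the test data stays untouched
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)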
In [78]: y_train.value_counts()
Out[78]:
2 968
1 968
0 968
Name: Cluster, dtype: int64
# Calculate metrics
training_accuracy = accuracy_score(y_train, y_train_pred)
testing_accuracy = accuracy_score(y_test, y_test_pred)
cv_score = cross_val_score(classifier, X, y, cv=5).mean()
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')
# Display results
experiment_1.head(10)
    Model                           Training Accuracy  Testing Accuracy  CV Score   Precision  Recall    F1 Score
4   Support Vector Machines (SVM)   0.748967           0.845805          0.832653   0.858930   0.845805  0.850161
5   K-Nearest Neighbors (KNN)       0.907025           0.841270          0.874830   0.868029   0.841270  0.849118
7   Neural Networks (MLP)           0.619490           0.587302          0.716100   0.585578   0.587302  0.558211
Evaluation
Among these models, Naive Bayes, Decision Tree, Random Forest, and Gradient Boosting
performed well in terms of accuracy, precision, recall, and F1 score. Further evaluation was
conducted using visualizations such as confusion matrix, learning curve, and class
prediction error.
Confusion matrix: A table that summarizes the classification results by showing the
distribution of true positives, true negatives, false positives, and false negatives.
Learning curve: A plot that depicts the performance of a model as training data size
increases. It helps in assessing model performance and identifying issues such as
overfitting or underfitting.
Class prediction error: A plot that shows the difference between the true class
distribution and the predicted class distribution. It helps in identifying which classes are
misclassified more frequently.
These visualizations were used to gain further insights into the models' performance and
identify areas for improvement.
plt.tight_layout()
plt.show()
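The plots themselves were lost in this export; a minimal sketch of how a confusion matrix can be produced for one of the selected models (the helper names follow scikit-learn, not necessarily the original cell):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
from xgboost import XGBClassifier

clf = XGBClassifier(eval_metric='mlogloss').fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.title('Gradient Boosting - Confusion Matrix')
plt.show()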
axes.set_title(title)
axes.set_xlabel("Training examples")
axes.set_ylabel("Score")
axes.grid()
axes.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
axes.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
axes.legend(loc="best")
# Make sure your 'selected_classifiers' is a list or ordered dict, not a normal dic
# Normal dict doesn't preserve order before Python 3.7
selected_classifiers = [
("Naive Bayes", GaussianNB()),
("Decision Tree", DecisionTreeClassifier()),
("Random Forest", RandomForestClassifier()),
("Gradient Boosting", XGBClassifier(eval_metric='mlogloss')),
]
plt.tight_layout()
plt.show()
Experiment 2 (Without SMOTE)
X = model_df.drop(columns=['Cluster'], axis=1)
y = model_df['Cluster']
In [84]: y_train.value_counts()
Out[84]:
1 968
2 655
0 141
Name: Cluster, dtype: int64
# Calculate metrics
training_accuracy = accuracy_score(y_train, y_train_pred)
testing_accuracy = accuracy_score(y_test, y_test_pred)
cv_score = cross_val_score(classifier, X, y, cv=5).mean()
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')
# Display results
experiment_2.head(10)
    Model                           Training Accuracy  Testing Accuracy  CV Score   Precision  Recall    F1 Score
4   Support Vector Machines (SVM)   0.824830           0.836735          0.832653   0.774417   0.836735  0.801250
5   K-Nearest Neighbors (KNN)       0.906463           0.895692          0.874830   0.897592   0.895692  0.893322
7   Neural Networks (MLP)           0.759070           0.739229          0.697506   0.761691   0.739229  0.715073
In [86]: selected_classifiers = {
"Naive Bayes": GaussianNB(),
"Decision Tree": DecisionTreeClassifier(),
"Random Forest": RandomForestClassifier(),
"Gradient Boosting": XGBClassifier(eval_metric='mlogloss'),
}
plt.tight_layout()
plt.show()
axes.set_title(title)
axes.set_xlabel("Training examples")
axes.set_ylabel("Score")
return_times=True)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
axes.grid()
axes.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
axes.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
axes.legend(loc="best")
selected_classifiers = [
("Naive Bayes", GaussianNB()),
("Decision Tree", DecisionTreeClassifier()),
("Random Forest", RandomForestClassifier()),
("Gradient Boosting", XGBClassifier(eval_metric='mlogloss')),
]
plt.tight_layout()
plt.show()
Experiment 3 (SMOTE)

In [95]: model_df2.head()
X = model_df2.drop(columns=['Cluster'], axis=1)
y = model_df2['Cluster']
In [97]: y_train.value_counts()
Out[97]:
1 968
2 655
0 141
Name: Cluster, dtype: int64
In [98]: y_train.value_counts()
Out[98]:
2 968
1 968
0 968
Name: Cluster, dtype: int64
classifier.fit(X_train, y_train)
# Calculate metrics
training_accuracy = accuracy_score(y_train, y_train_pred)
testing_accuracy = accuracy_score(y_test, y_test_pred)
cv_score = cross_val_score(classifier, X, y, cv=10).mean()
precision = precision_score(y_test, y_test_pred, average='weighted')
recall = recall_score(y_test, y_test_pred, average='weighted')
f1 = f1_score(y_test, y_test_pred, average='weighted')
# Display results
experiment_3.head(10)
    Model                           Training Accuracy  Testing Accuracy  CV Score   Precision  Recall    F1 Score
4   Support Vector Machines (SVM)   0.756887           0.845805          0.832631   0.858930   0.845805  0.850161
5   K-Nearest Neighbors (KNN)       0.926997           0.841270          0.876639   0.866046   0.841270  0.849378
7   Neural Networks (MLP)           0.642906           0.632653          0.694126   0.735080   0.632653  0.601716
plt.tight_layout()
plt.show()
axes.set_title(title)
axes.set_xlabel("Training examples")
axes.set_ylabel("Score")
axes.grid()
axes.fill_between(train_sizes, train_scores_mean - train_scores_std,
train_scores_mean + train_scores_std, alpha=0.1,
color="r")
axes.fill_between(train_sizes, test_scores_mean - test_scores_std,
test_scores_mean + test_scores_std, alpha=0.1,
color="g")
axes.plot(train_sizes, train_scores_mean, 'o-', color="r",
label="Training score")
axes.plot(train_sizes, test_scores_mean, 'o-', color="g",
label="Cross-validation score")
axes.legend(loc="best")
selected_classifiers = [
("Naive Bayes", GaussianNB()),
("Decision Tree", DecisionTreeClassifier()),
("Random Forest", RandomForestClassifier()),
("Gradient Boosting", XGBClassifier(eval_metric='mlogloss')),
]
plt.tight_layout()
plt.show()
SMOTE's Impact: Experiment 3, which made use of SMOTE to balance the classes, was
chosen over Experiment 2 which did not use SMOTE. This decision underscores the
importance of handling class imbalance in the dataset. While the impact of SMOTE was not
drastically apparent in the difference between the performances of the models, its utilization
is critical in datasets with imbalanced classes to ensure that the minority class is not ignored.
Model Performance: Among all the classification algorithms tested, the Gradient Boosting
method, specifically the XGBoost Classifier, showed the best performance with fewer errors
in the Class Prediction Error plot. It accurately identified the main target customers (class 1),
which was the objective of this exercise.
Given these findings, the XGBoost Classifier was chosen for further hyperparameter
tuning. This process of optimization helps enhance the performance of the model by
adjusting the model parameters to their ideal values. Once this is achieved, the model will
be saved and deployed. The deployment of the model allows it to be used in practical
applications, providing predictions on new, unseen data.
Hyperparameter Tuning
Gradient Boosting (XGB Classifier)
To ensure an accurate evaluation, the dataset model_df2 will be split into three subsets:
training, validation, and testing. The training data will be used for model training, the
validation data for hyperparameter optimization, and the testing data for final evaluation.
This approach ensures a robust assessment of the model's performance.
Train-Test-Valid
X = model_df2.drop('Cluster', axis=1)
y = model_df2['Cluster']
# Split data into training, validation, and test sets (70% - 15% - 15%)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.15, random_state=42)  # seed value assumed
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.18, random_state=42)
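The cells that produce X_train_smote, y_train_smote and best_parameters used below were partly cut from this export; a sketch of the intended flow, where the parameter grid and search settings are assumptions:

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Oversample only the training portion
X_train_smote, y_train_smote = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Randomized search over an assumed XGBoost parameter grid
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
}
search = RandomizedSearchCV(XGBClassifier(eval_metric='mlogloss'), param_dist,
                            n_iter=20, cv=5, scoring='accuracy', random_state=42)
search.fit(X_train_smote, y_train_smote)
best_parameters = search.best_params_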
# Train the model using the best parameters on the training data
best_classifier = XGBClassifier(**best_parameters, eval_metric='mlogloss')
best_classifier.fit(X_train_smote, y_train_smote)
ax.grid()
ax.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score
ax.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation
ax.set_title(title)
ax.set_xlabel("Training examples")
ax.set_ylabel("Score")
ax.legend(loc="best")
# Plotting
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 21))
plt.tight_layout()
plt.show()
Conclusion
The purpose of the presented analysis was to develop an effective predictive model that can
identify the target customers. Three separate experiments were conducted with varying
features and data manipulation techniques, which included SMOTE for handling class
imbalance.
The experiments compared a set of classification models including logistic regression, Naive
Bayes, decision tree, random forest, SVM, K-Nearest Neighbors (KNN), gradient boosting
(XGB Classifier), neural networks (MLP), and AdaBoost. Among these, the gradient boosting
model, specifically XGBoost, consistently performed the best across all experiments.
Experiment 3, which incorporated nearly all features and applied SMOTE for oversampling,
was the most successful in terms of performance metrics, implying that both a more
comprehensive feature set and balanced data contribute positively to the model's
performance.
Moreover, XGBoost's performance was verified by splitting the data into training, validation,
and test sets. Hyperparameter optimization was also performed to ensure that the best
parameters were selected for the final model. This was initially done with GridSearchCV, but
due to computational constraints, RandomizedSearchCV was utilized as a more efficient
alternative.
The accuracy scores from both validation and test data demonstrate that the XGBoost
model generalized well, reducing the likelihood of overfitting. Confusion matrix, learning
curve, and class prediction error plots further confirmed the model's good performance.
To conclude, the XGBoost model developed in this analysis has shown to be a strong
predictor for the target customers. It has a good balance between bias and variance, which
makes it a reliable tool for new, unseen data. Therefore, this model was chosen as the final
model, and will be saved and deployed for further use.
filename = 'final_model.sav'
pickle.dump(best_classifier, open(filename, 'wb'))
# New data
data = {
'Income': [24882.0, 22979.0, 27071.0, 36957.0, 70044.0],
'Kidhome': [1, 1, 1, 1, 0],
'Teenhome': [0, 0, 0, 1, 1],
'Recency': [52, 29, 90, 43, 46],
'MntWines': [1, 16, 8, 100, 1073],
'MntFruits': [4, 17, 3, 2, 0],
'MntMeatProducts': [10, 19, 19, 16, 250],
'MntFishProducts': [29, 20, 0, 2, 153],
'MntSweetProducts': [0, 21, 2, 1, 14],
'MntGoldProds': [36, 22, 3, 31, 14],
'NumDealsPurchases': [1, 3, 2, 4, 4],
'NumWebPurchases': [1, 3, 2, 3, 7],
'NumCatalogPurchases': [1, 2, 0, 2, 10],
'NumStorePurchases': [2, 2, 3, 2, 5],
'NumWebVisitsMonth': [6, 8, 6, 9, 5],
'AcceptedCmp3': [1, 0, 0, 0, 0],
    # ... remaining feature columns omitted here ...
}
# Convert to DataFrame
new_data = pd.DataFrame(data)
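The prediction cell itself was truncated in the export; a sketch of the intended call, loading the model pickled above (the new-data columns must match the order used at training time):

import pickle

loaded_model = pickle.load(open('final_model.sav', 'rb'))
print(loaded_model.predict(new_data))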
[1 1 1 1 2]
In [106… X_train.head(5)
178 24882.0 1 0 52 1 4 10
1976 22979.0 1 0 29 16 17 19
1816 27071.0 1 0 90 8 3 19
In [107… y_train.head(5)
Out[107]:
178 1
1976 1
1816 1
1384 1
1709 2
Name: Cluster, dtype: int64
Deployed Streamlit App

The final model is deployed as a Streamlit application at
https://customer-personality-analysis-clustering-navee2357.streamlit.app, with navigation pages for
Home, Insights & Cluster Analysis, Predict, Power BI Dashboard, Model Development, Model Flow Chart,
and About Me.

The Home page frames the portfolio around the customer segmentation and predictive modelling stages
and describes the two main audiences: marketing and strategy teams, who can use the segments to
target campaigns and improve customer conversion and repeat purchases, and product development
teams, who can align product features with the preferences of each segment to boost customer
satisfaction and brand reputation.

The Insights & Cluster Analysis page summarizes the exploratory data analysis and Power BI
visualizations and lists per-cluster business recommendations, for example marketing premium and
exclusive products/services to the high-income, high-spending segment that showed high engagement
with past campaigns.

The Predict page accepts new customer attributes (Income, Kidhome, Teenhome, Recency, the spending
amounts such as MntWines through MntFishProducts, and so on) and returns the predicted customer
segment.