The document describes using k-means cluster analysis on customer usage data from a telecommunications provider to segment their customer base. Initially, a 3-cluster solution was obtained but did not capture all important groups. A 4-cluster solution identified a potentially profitable "Internet" customer cluster missed previously. Examining the final cluster centers and distances between clusters provided insight into the natural groupings of customers and how they compare.
SPSS ANNOTATED OUTPUT K-MEANS CLUSTER ANALYSIS
K-means cluster analysis is a tool designed to assign cases to a fixed number of groups (clusters) whose characteristics are not yet known but are based on a set of specified variables. It is most useful when you want to classify a large number (thousands) of cases.

A good cluster analysis is:
Efficient. Uses as few clusters as possible.
Effective. Captures all statistically and commercially important clusters. For example, a cluster with five customers may be statistically different but not very profitable.
The K-Means Cluster Analysis procedure begins with the construction of initial cluster centers. You can assign these yourself or have the procedure select k well-spaced observations for the cluster centers. After obtaining initial cluster centers, the procedure:
1. Assigns cases to clusters based on distance from the cluster centers.
2. Updates the locations of cluster centers based on the mean values of cases in each cluster.
These steps are repeated until any reassignment of cases would make the clusters more internally variable or externally similar.
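The assign-and-update loop described above can be sketched in a few lines of Python with NumPy. This is only an illustration of the general technique, not the SPSS implementation: the function name `kmeans`, the stopping rule (stop when no case changes cluster), and the handling of empty clusters are all assumptions made for this sketch.

```python
import numpy as np

def kmeans(X, initial_centers, max_iter=20):
    """Illustrative k-means loop: alternate assignment and mean-update steps."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(initial_centers, dtype=float).copy()
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Assumed stopping rule: stop once no case changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: move each center to the mean of the cases assigned to it.
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:  # an empty cluster keeps its previous center
                centers[j] = members.mean(axis=0)
    return centers, labels
```

With four toy cases `[[0, 0], [0, 1], [10, 10], [10, 11]]` and initial centers `[[0, 0], [10, 10]]`, the loop settles after one update, with the first two cases in one cluster and the last two in the other.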
A telecommunications provider wants to segment its customer base by service usage patterns. If customers can be classified by usage, the company can offer more attractive packages to its customers.
1. To run the cluster analysis, from the menus choose: Analyze > Classify > K-Means Cluster...
Figure 1. K-Means Cluster Analysis dialog box
2. If the variable list does not display variable labels in file order, right-click anywhere in the variable list and from the context menu choose Display Variable Labels and Sort by File Order.
3. Select Standardized log-long distance through Standardized log-wireless and Standardized multiple lines through Standardized electronic billing as analysis variables.
4. Type 3 as the number of clusters.
5. Click Iterate.
Figure 2. Iterate dialog box
6. Type 20 as the maximum iterations.
7. Click Continue.
8. Click Options in the K-Means Cluster Analysis dialog box.
Figure 3. Options dialog box
9. Select ANOVA table and Cluster information for each group in the Statistics group.
10. Select Exclude cases pairwise in the Missing Values group. There are many missing values because most customers do not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data, at the cost of possibly biasing the results.
11. Click Continue, then click OK in the K-Means Cluster Analysis dialog box.
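To make the trade-off in step 10 concrete, here is a rough Python sketch of one way to assign cases with missing values: the distance to each center is computed only from the variables a case actually has. The function name `nearest_center_pairwise` and the NaN coding of missing values are assumptions for this illustration, and the sketch is not claimed to match the internals of the SPSS procedure.

```python
import numpy as np

def nearest_center_pairwise(X, centers):
    """Assign each case to its nearest center, ignoring missing (NaN) variables."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    diff = X[:, None, :] - centers[None, :, :]
    # A missing value contributes nothing to the squared distance.
    squared = np.where(np.isnan(diff), 0.0, diff ** 2)
    dists = np.sqrt(squared.sum(axis=2))
    return dists.argmin(axis=1)
```

Every case is kept, which is the gain in information. The possible bias is also visible: a case with few observed variables has its distance summed over fewer terms, so its assignment rests on less evidence than that of a fully observed case.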
Figure 1. Initial cluster centers for three-cluster solution
The initial cluster centers are the variable values of the k well-spaced observations.
Figure 1. Iteration history for three-cluster solution
The iteration history shows the progress of the clustering process at each step. In early iterations, the cluster centers shift quite a lot. By the 14th iteration, they have settled down to the general area of their final location, and the last four iterations are minor adjustments. If the algorithm stops because the maximum number of iterations is reached, you may want to increase the maximum because the solution may otherwise be unstable. For example, if you had left the maximum number of iterations at 10, the reported solution would still be in a state of flux.
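The behavior described above, large early shifts followed by minor adjustments, can be reproduced with a small Python sketch that records how far each center moves at every iteration. The function name `iteration_history`, the `tol` parameter, and the convergence rule are assumptions for illustration and do not reproduce the SPSS convergence criterion.

```python
import numpy as np

def iteration_history(X, initial_centers, max_iter=10, tol=0.0):
    """Return per-iteration center shifts and whether the run converged."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(initial_centers, dtype=float).copy()
    history = []
    for _ in range(max_iter):
        # Assign each case to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its cases.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        # One row of the history: how far each center moved this iteration.
        shift = np.linalg.norm(new_centers - centers, axis=1)
        history.append(shift)
        centers = new_centers
        if shift.max() <= tol:
            return np.array(history), True
    # The iteration cap was reached while the centers were still moving.
    return np.array(history), False
```

If the second return value is `False`, the run stopped at the cap rather than by settling, which is the situation in which raising the maximum number of iterations is worth trying.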
Figure 1. ANOVA table for three-cluster solution
The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with large F values provide the greatest separation between clusters.
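As a sketch of where such F values come from, the following Python function computes a one-way ANOVA F statistic for each variable, using cluster membership as the grouping factor. The function name `cluster_f_values` is hypothetical. Because the clusters were formed to separate the cases, these values are best read descriptively rather than as formal significance tests.

```python
import numpy as np

def cluster_f_values(X, labels):
    """One-way ANOVA F statistic per variable, grouped by cluster label."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    n, k = len(X), len(groups)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for g in groups:
        members = X[labels == g]
        center = members.mean(axis=0)
        # Between-cluster variation: how far this cluster's mean is from the grand mean.
        ss_between += len(members) * (center - grand_mean) ** 2
        # Within-cluster variation: spread of cases around their own cluster mean.
        ss_within += ((members - center) ** 2).sum(axis=0)
    # F = mean square between / mean square within.
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A variable whose cluster means are far apart relative to the spread inside each cluster gets a large F; a variable whose cluster means coincide gets an F of zero.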
Figure 1. Final cluster centers for three-cluster solution
The final cluster centers are computed as the mean for each variable within each final cluster. The final cluster centers reflect the characteristics of the typical case for each cluster. Customers in cluster 1 tend to be big spenders who purchase a lot of services. Customers in cluster 2 tend to be moderate spenders who purchase the "calling" services. Customers in cluster 3 tend to spend very little and do not purchase many services.
Figure 1. Distances between final cluster centers for three-cluster solution
This table shows the Euclidean distances between the final cluster centers. Greater distances between clusters correspond to greater dissimilarities. Clusters 1 and 3 are most different. Cluster 2 is approximately equally similar to clusters 1 and 3. These relationships between the clusters can also be intuited from the final cluster centers, but this becomes more difficult as the number of clusters and variables increases.
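A table of this kind is a matrix of pairwise Euclidean distances between the rows of the final cluster centers. A minimal Python sketch follows; the function name `center_distances` and the example centers are invented for illustration and are not the values from the telecommunications data.

```python
import numpy as np

def center_distances(centers):
    """Symmetric matrix of Euclidean distances between cluster centers."""
    centers = np.asarray(centers, dtype=float)
    diff = centers[:, None, :] - centers[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Hypothetical centers for three clusters on two variables.
d = center_distances([[0, 0], [3, 4], [6, 8]])
print(d[0, 1], d[1, 2], d[0, 2])  # 5.0 5.0 10.0
```

In this made-up example the middle cluster is equally far from the other two, and the outer two are the most different, the same pattern the table above describes.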
Figure 1. Number of cases in each cluster for three-cluster solution
A large number of cases were assigned to the third cluster, which unfortunately is the least profitable group. Perhaps a fourth, more profitable, cluster could be extracted from this "basic service" group.
Figure 1. K-Means Cluster Analysis dialog box
1. To run a cluster analysis with four clusters, reopen the K-Means Cluster Analysis dialog box.
2. Type 4 as the number of clusters.
3. Click Save.
Figure 2. Save dialog box
4. Select Cluster membership and Distance from cluster center.
5. Click Continue.
6. Click OK in the K-Means Cluster Analysis dialog box.
7. The saved variables can be used to create a useful boxplot. From the menus, choose: Graphs > Chart Builder...
8. Click the Gallery tab, select Boxplot from the list of chart types, and drag and drop the Simple Boxplot icon onto the canvas.
9. Drag and drop Distance of Case from its Classification Cluster Center onto the y axis.
10. Drag and drop Cluster Number of Case onto the x axis.
11. Click OK to create the boxplot.
Figure 3. Chart Builder
Figure 1. Plot of distances from cluster center by cluster membership for four-cluster solution
This is a diagnostic plot that helps you to find outliers within clusters. There is a lot of variability in cluster 2, but all the distances are within reason.
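The same diagnostic can be approximated numerically. The Python sketch below computes each case's distance from its own cluster center and flags cases beyond the usual boxplot upper fence (third quartile plus 1.5 times the interquartile range) within each cluster. The function names, the 0-based cluster labels, and the 1.5 multiplier are assumptions for this sketch; the exact quartile method used by the Chart Builder may differ.

```python
import numpy as np

def distance_to_own_center(X, labels, centers):
    """Euclidean distance of each case from the center of its assigned cluster."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    labels = np.asarray(labels)  # assumed 0-based so they can index `centers`
    return np.linalg.norm(X - centers[labels], axis=1)

def boxplot_outliers(distances, labels, whisker=1.5):
    """Flag cases whose distance exceeds the upper fence of their own cluster."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels)
    flagged = np.zeros(len(distances), dtype=bool)
    for g in np.unique(labels):
        mask = labels == g
        q1, q3 = np.percentile(distances[mask], [25, 75])
        upper_fence = q3 + whisker * (q3 - q1)
        flagged[mask] = distances[mask] > upper_fence
    return flagged
```

Only the upper fence is checked because distances cannot be negative and it is the unusually distant cases that matter for this diagnostic.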
Figure 1. Final cluster centers for four-cluster solution
This table shows that an important grouping is missed in the three-cluster solution. Members of clusters 1 and 2 are largely drawn from cluster 3 in the three-cluster solution, and they are unlikely to be big spenders. However, members of cluster 1 are highly likely to purchase Internet-related services, which establishes them as a distinct and possibly profitable group. Clusters 3 and 4 seem to correspond to clusters 1 and 2 from the three-cluster solution.
Figure 1. Distances between final cluster centers for four-cluster solution
The distances between the clusters have not changed greatly. Clusters 1 and 2 are the most similar, which makes sense because they were combined into one cluster in the three-cluster solution. Clusters 2 and 3 are the most dissimilar, since they represent opposite spending behaviors. Cluster 4 is still equally similar to the other clusters.
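Statements about how clusters in one solution map onto clusters in another can be checked by cross-tabulating the two sets of saved memberships. A minimal Python sketch, assuming both label vectors are 0-based integers of the same length (the function name `membership_crosstab` is hypothetical):

```python
import numpy as np

def membership_crosstab(labels_a, labels_b):
    """Count cases for every (cluster in solution A, cluster in solution B) pair."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    table = np.zeros((labels_a.max() + 1, labels_b.max() + 1), dtype=int)
    # Add one to the matching cell for every case; np.add.at handles repeated cells.
    np.add.at(table, (labels_a, labels_b), 1)
    return table
```

Passing the three-cluster memberships as `labels_a` and the four-cluster memberships as `labels_b`, a row whose counts are spread over two columns indicates an old cluster that was split in the new solution.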
Figure 1. Number of cases in each cluster for four-cluster solution
Nearly 25% of cases belong to the newly created group of "E-service" customers, a share large enough to matter for your profits.
Using k-means cluster analysis, you initially grouped the customers into three clusters. However, this solution was not very satisfactory, so you reran the analysis with four clusters. These results were better, and from the final cluster centers, you saw that a potentially profitable "Internet" grouping was missed in the three-cluster solution. This example underscores the exploratory nature of cluster analysis, since it is impossible to determine the "best" number of clusters until you have run the analyses and examined the solutions.

The next step for the company is to try to construct a model that classifies the customers according to their demographic information. With such a model, the company can customize offers for individual prospective customers. For information on how the company builds such a model, see Using Discriminant Analysis to Classify Telecommunications Customers.
The K-Means Cluster Analysis procedure is a tool for finding natural groupings of cases, given their values on a set of variables. It is most useful when you want to classify a large number (thousands) of cases. The TwoStep Cluster Analysis procedure allows you to use both categorical and continuous variables, and can automatically select the "best" number of clusters. If you want to cluster variables instead of cases, or have a small number of cases, try the Hierarchical Cluster Analysis procedure. If your k-means analysis is part of a segmentation solution, these newly created clusters can be analyzed in the Discriminant Analysis procedure.
See the following texts for more information on k-means cluster analysis:
Aldenderfer, M. S., and R. K. Blashfield. 1984. Cluster Analysis. Newbury Park: Sage Publications.