K Means Handout

This document discusses k-means clustering, an unsupervised machine learning technique that groups unlabeled data points into a specified number of clusters based on their similarity. It provides an example of using k-means clustering in SPSS to analyze customer data from a telecommunications provider, grouping customers into 3 clusters based on their usage patterns. The results are then interpreted, identifying characteristics of customers in each cluster and which variables best discriminate between the clusters.

Uploaded by

Ankit Seth
Copyright
© Attribution Non-Commercial (BY-NC)

Cluster Analysis

What is Cluster Analysis?

Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous subgroups based on responses to variables. PASW (SPSS) 17.0 offers three clustering procedures: two-step, k-means, and hierarchical.

K-means clustering allows you to select the number of clusters, and the procedure can be used with moderate to large datasets. The k-means clustering algorithm assigns each case to the cluster whose mean is the smallest distance from the case. This is an iterative process that stops once the cluster means do not change much in successive steps.
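The assign-and-update loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the PASW implementation: the function names (`euclidean`, `nearest_center`, `k_means`), the `tol` stopping tolerance, and the four sample points are all assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two cases given as sequences of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(point, centers):
    """Index of the cluster center closest to the point."""
    distances = [euclidean(point, c) for c in centers]
    return distances.index(min(distances))

def k_means(points, centers, max_iter=10, tol=1e-6):
    """Assign every case to its nearest center, then recompute each cluster mean."""
    centers = [list(c) for c in centers]
    for _ in range(max_iter):
        labels = [nearest_center(p, centers) for p in points]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(old)  # an empty cluster keeps its previous center
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # cluster means barely moved, so stop iterating
            break
    labels = [nearest_center(p, centers) for p in points]
    return labels, centers

# Hypothetical data: two obvious groups and two starting centers.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels, centers = k_means(points, centers=[(0.0, 0.0), (5.0, 5.0)])
print(labels, centers)
```

The starting centers are passed in explicitly here; how PASW picks its own initial centers is not covered by this sketch.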

K-Means Clustering
As an example of k-means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav, telecommunications provider data that has 14 continuous variables. The continuous variables have already been standardized, with a mean of 0 and standard deviation of 1, to allow for the different units in which the variables were measured. This analysis will cluster customers by their service usage patterns.

In PASW 17.0, go to Analyze > Classify > K-Means Cluster.
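For readers working outside PASW, the standardization mentioned above (mean of 0, standard deviation of 1) can be reproduced with a z-score transformation. The sketch below is an assumption about how one might do this in Python; the `standardize` helper and the usage values are made up and do not come from telco_extra.sav.

```python
import statistics

def standardize(values):
    """Convert raw values to z-scores: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mean) / sd for v in values]

# Hypothetical monthly usage values in arbitrary units.
minutes = [120.0, 300.0, 45.0, 210.0]
z = standardize(minutes)
print([round(value, 3) for value in z])
```

After this step every variable is on the same scale, so no single variable dominates the Euclidean distances just because of its units.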

Next, the K-Means Cluster Analysis menu appears. Select the Standardized log-long distance through Standardized log-wireless and Standardized multiple lines through Standardized electronic billing variables and place them in the Variables box.

Label Cases by. Optional; place a variable here to label cases.
Number of Clusters. You have to specify the number of clusters you want. For this example, type 3 in the box.
Method. The default, "Iterate and classify," is an iterative process that recomputes the cluster means each time a case is added to or deleted from a cluster. Cases are then classified once the cluster centers have been updated. With the "Classify only" method, cases are classified based on the initial cluster centers, which are not iteratively updated. For this example, Iterate and classify is chosen.
Cluster Centers. You can draw initial cluster centers from a file (Read initial) or you can save the final cluster centers (Write final). For this example, we are not using either option.

Click the Iterate button; the K-Means Cluster Analysis: Iterate box appears. Change Maximum Iterations to 20. Click Continue.

Maximum Iterations. Sets the maximum number of iterations.
Convergence Criterion. The default terminates once the largest change in means of any cluster is less than 2% of the minimum distance between initial cluster centers.
Use running means. If this box is checked, cluster centers will be updated after each case is classified, instead of after all of the cases are classified.
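The convergence criterion and the running-means option described above can be illustrated with a short Python sketch. It only mirrors the wording of the handout and is not PASW's internal code; the helper names and the three initial centers are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def convergence_threshold(initial_centers, criterion=0.02):
    """2% (by default) of the smallest distance between any two initial centers."""
    min_gap = min(euclidean(a, b) for a, b in combinations(initial_centers, 2))
    return criterion * min_gap

def has_converged(old_centers, new_centers, threshold):
    """True when the largest movement of any cluster center is below the threshold."""
    largest_shift = max(euclidean(a, b) for a, b in zip(old_centers, new_centers))
    return largest_shift < threshold

def add_case_running_mean(center, n_cases, case):
    """Running means: update a cluster mean right after one more case joins it."""
    return [c + (x - c) / (n_cases + 1) for c, x in zip(center, case)]

# Hypothetical initial centers; the closest pair is 5 units apart.
initial = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
threshold = convergence_threshold(initial)
moved_little = [(0.05, 0.0), (3.0, 4.0), (10.0, 0.0)]
moved_a_lot = [(0.5, 0.0), (3.0, 4.0), (10.0, 0.0)]
print(threshold, has_converged(initial, moved_little, threshold))
```

Without running means, the centers would instead be recomputed once per pass, after every case has been assigned.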

Click Options in the K-Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table, Cluster information for each case, and Exclude cases pairwise. Click Continue. Click OK.

Initial cluster centers. Prints the initial variable means for each cluster in the output.
ANOVA table. ANOVA F tests are conducted for each variable to indicate how well the variable discriminates between clusters.
Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and the cluster center in the output.
Missing Values. The default is listwise deletion. For this example, there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
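One way to picture pairwise exclusion is a distance that simply skips the variables a case is missing, so the case can still be assigned to a cluster. The sketch below is an assumption for illustration only: the handout does not say whether PASW rescales such distances, and the `pairwise_distance` helper, the use of `None` for missing values, and the sample numbers are all hypothetical.

```python
import math

def pairwise_distance(case, center):
    """Euclidean distance using only the variables the case actually has."""
    pairs = [(x, c) for x, c in zip(case, center) if x is not None]
    if not pairs:
        return None  # no observed variables, so the case cannot be compared
    return math.sqrt(sum((x - c) ** 2 for x, c in pairs))

# Hypothetical customer with no value for the second service.
case = (3.0, None, 4.0)
center = (0.0, 9.0, 0.0)
d = pairwise_distance(case, center)
print(d)
```

Under listwise deletion, the same customer would be dropped from the analysis entirely.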

K-Means Clustering Interpretation
The Initial Cluster Centers table shows the first step of k-means clustering: finding the k initial centers.

The Iteration History table shows how many iterations were needed before the cluster centers stopped changing substantially.

The Cluster Membership table gives the cluster each case belongs to and the Euclidean distance of each case to its cluster center. Below is a printout of the first and last 10 cases. Visual inspection of distances is necessary to check for outliers that may not adequately reflect the population.
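When there are too many cases to scan by eye, ranking them by distance to their own cluster center can help direct the visual inspection. The following Python sketch assumes the cluster labels and centers are already available; the helper names and the four cases are hypothetical and are not taken from the handout's output.

```python
import math

def euclidean(a, b):
    """Euclidean distance between a case and a cluster center."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distances_to_centers(cases, labels, centers):
    """Distance of every case to the center of the cluster it was assigned to."""
    return [euclidean(case, centers[lab]) for case, lab in zip(cases, labels)]

def most_distant(distances, top=3):
    """(case index, distance) pairs, largest distance first."""
    ranked = sorted(enumerate(distances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top]

# Hypothetical single cluster centered at the origin with one far-away case.
cases = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 8.0)]
labels = [0, 0, 0, 0]
centers = [(0.0, 0.0)]
distances = distances_to_centers(cases, labels, centers)
ranked = most_distant(distances, top=2)
print(ranked)
```

The ranking only points to candidates; deciding whether a distant case is truly an outlier remains a judgment call.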

The Final Cluster Centers table below allows you to describe the clusters by the variables. For example, customers in Cluster 1 tend to purchase a lot of services, as evidenced by values above the mean for all variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values for the four calling services (caller ID, call waiting, call forwarding, and 3-way calling). Customers in Cluster 3 tend to spend very little and do not purchase many services; they have negative values on most of the variables.

The Distances between Final Cluster Centers table shows the Euclidean distances between the final cluster centers. Greater distances between clusters mean there are greater dissimilarities.

Clusters 1 and 3 have the greatest dissimilarities.

Cluster 2 is equally similar to Clusters 1 and 3.
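The table of distances between final cluster centers is simply a matrix of pairwise Euclidean distances. The Python sketch below shows the computation; the three centers are made-up numbers chosen so that the middle cluster sits equally far from the other two, and they are not the values from the handout's output.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two cluster centers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def center_distance_matrix(centers):
    """Square matrix whose entry [i][j] is the distance between centers i and j."""
    return [[euclidean(a, b) for b in centers] for a in centers]

# Hypothetical final centers on two standardized variables.
final_centers = [(3.0, 4.0), (0.0, 0.0), (-3.0, -4.0)]
matrix = center_distance_matrix(final_centers)
for row in matrix:
    print(row)
```

The matrix is symmetric with zeros on the diagonal, so only the entries above the diagonal need to be read.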

The ANOVA table indicates which variables contribute the most to your cluster solution. Variables with large mean square errors provide the least help in differentiating between clusters. For example, long distance and calling card had the two highest mean square errors (and lowest F statistics); therefore, the two variables were not as helpful as the other variables in forming and differentiating clusters.
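The link between a large mean square error and a low F statistic follows from the one-way ANOVA formula, F = (between-cluster mean square) / (within-cluster mean square). The Python sketch below computes it for a single variable; the `anova_f` helper and both sets of values are hypothetical and only illustrate the contrast between a variable that separates clusters well and one that does not.

```python
def anova_f(values, labels):
    """One-way ANOVA for one variable: returns (F, between mean square, error mean square)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-cluster sum of squares: how far each cluster mean is from the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values())
    # Within-cluster (error) sum of squares: spread of cases around their own cluster mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within, ms_between, ms_within

labels = [1, 1, 2, 2, 3, 3]
separated = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]   # clusters occupy distinct ranges
overlapping = [1.0, 9.0, 2.0, 10.0, 5.0, 6.0]  # clusters overlap heavily
f_sep, _, mse_sep = anova_f(separated, labels)
f_overlap, _, mse_overlap = anova_f(overlapping, labels)
print(f_sep, mse_sep, f_overlap, mse_overlap)
```

In this made-up data the overlapping variable has the larger error mean square and the smaller F, which is the pattern the handout describes for long distance and calling card.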

The Number of Cases in each Cluster table illustrates the split of cases into clusters. A large number of cases were assigned to the third cluster, which is the least profitable group.
