Data Analysis Course
Cluster Analysis
Venkat Reddy
Contents
• What is the need of Segmentation
• Introduction to Segmentation & Cluster analysis
• Applications of Cluster Analysis
• Types of Clusters
• K-Means clustering
What is the need of segmentation?
Problem:
• 10,000 customers: we know their age, city name, income, employment status, and designation
• You have to sell 100 BlackBerry phones (each costing $1,000) to people in this group, and you have at most 7 days
• Giving a demo to every individual would mean 10,000 demos and take more than a year. How do you sell the maximum number of phones with the minimum number of demos?
What is the need of segmentation?
Solution
• Divide the whole population into two groups: employed and unemployed
• Further divide the employed population into two groups: high and low salary
• Further divide the high-salary group into high and low designation
10,000 customers
  Unemployed: 3,000
  Employed: 7,000
    Low salary: 5,000
    High salary: 2,000
      Low designation: 1,800
      High designation: 200
Segmentation and Cluster Analysis
• A cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc.)
• Cluster analysis finds groups of cases/observations/objects in the population such that the objects are
  • Homogeneous within each group (high intra-class similarity)
  • Heterogeneous between groups (low inter-class similarity)
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
Applications of Cluster Analysis
• Market Segmentation: Grouping people (with the willingness,
purchasing power, and the authority to buy) according to their
similarity in several dimensions related to a product under
consideration.
• Sales Segmentation: Clustering can tell you what types of customers
buy what products
• Credit Risk: Segmentation of customers based on their credit history
• Operations: High performer segmentation & promotions based on
person’s performance
• Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost.
• City-planning: Identifying groups of houses according to their house
type, value, and geographical location
• Geographical: Identification of areas of similar land use in an earth
observation database.
Types of Clusters
• Partitional (non-hierarchical) clustering: a division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster
  • Non-hierarchical methods divide a dataset of N objects into M clusters
  • K-means clustering, a non-hierarchical technique, is the most commonly used one in business analytics
• Hierarchical clustering: a set of nested clusters organized as a hierarchical tree
  • Hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains
  • CHAID trees are widely used for segmentation in business analytics; note that CHAID is a supervised decision-tree method rather than a hierarchical clustering algorithm
Cluster Analysis -Example
Original data:
            Maths  Science   GK  Apt
Student-1      94       82   87   89
Student-2      46       67   33   72
Student-3      98       97   93  100
Student-4      14        5    7   24
Student-5      86       97   95   95
Student-6      34       32   75   66
Student-7      69       44   59   55
Student-8      85       90   96   89
Student-9      24       26   15   22

Sorted by Maths score:
            Maths  Science   GK  Apt
Student-4      14        5    7   24
Student-9      24       26   15   22
Student-6      34       32   75   66
Student-2      46       67   33   72
Student-7      69       44   59   55
Student-8      85       90   96   89
Student-5      86       97   95   95
Student-1      94       82   87   89
Student-3      98       97   93  100

Groups shown on the slide (student numbers): {4, 9, 6}, {2, 7}, {8, 5, 1, 3}
Building Clusters
1. Select a distance measure
2. Select a clustering algorithm
3. Define the distance between two clusters
4. Determine the number of clusters
5. Validate the analysis
• The aim is to build clusters, i.e., divide the whole population into groups of similar objects
• What is similarity/dissimilarity?
• How do you define the distance between two clusters?
Dissimilarity & Similarity
        Weight
Cust1       68
Cust2       72
Cust3      100
Which two customers are similar?

        Weight  Age
Cust1       68   25
Cust2       72   70
Cust3      100   28
Which two customers are similar now?

        Weight  Age  Income
Cust1       68   25  60,000
Cust2       72   70   9,000
Cust3      100   28  62,000
Which two customers are similar in this case?
Quantify dissimilarity - Distance measures
• To measure similarity between two observations, a distance measure is needed. With a single variable, similarity is straightforward
  • Example: income. Two individuals are similar if their income levels are similar, and the level of dissimilarity increases as the income gap increases
• Multiple variables require an aggregate distance measure
  • With many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job…), it becomes more difficult to define similarity with a single value
• The best-known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates.
Examples of distances
Euclidean distance:
$$D_{ij} = \sqrt{\sum_{k=1}^{n} (x_{ki} - x_{kj})^2}$$
City-block (Manhattan) distance:
$$D_{ij} = \sum_{k=1}^{n} |x_{ki} - x_{kj}|$$
where $D_{ij}$ is the distance between cases $i$ and $j$, and $x_{kj}$ is the value of variable $x_k$ for case $j$.
Other distance measures: Chebychev, Minkowski, Mahalanobis, maximum distance, cosine similarity, simple correlation between observations, etc.
Data matrix:
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$
Dissimilarity matrix:
$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
Calculating the distance
        Weight
Cust1       68
Cust2       72
Cust3      100
• Cust1 vs Cust2: |68 - 72| = 4
• Cust2 vs Cust3: |72 - 100| = 28
• Cust3 vs Cust1: |100 - 68| = 32

        Weight  Age
Cust1       68   25
Cust2       72   70
Cust3      100   28
• Cust1 vs Cust2: sqrt((68-72)^2 + (25-70)^2) = 45.18
• Cust2 vs Cust3: sqrt((72-100)^2 + (70-28)^2) = 50.48
• Cust3 vs Cust1: sqrt((100-68)^2 + (28-25)^2) = 32.14
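The distance formulas and the weight/age example can be checked with a few lines of code. The sketch below uses Python purely for illustration (the course demos themselves use SAS); the function names `euclidean`, `manhattan`, and `dissimilarity_matrix` are made up for this example and are not part of any library.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences across variables (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def dissimilarity_matrix(rows, dist=euclidean):
    """Lower-triangular matrix: row i holds d(i, j) for every j <= i."""
    return [[dist(rows[i], rows[j]) for j in range(i + 1)]
            for i in range(len(rows))]

# (weight, age) for Cust1, Cust2, Cust3, copied from the table above.
customers = [(68, 25), (72, 70), (100, 28)]

matrix = dissimilarity_matrix(customers)
for name, row in zip(("Cust1", "Cust2", "Cust3"), matrix):
    print(name, [round(d, 2) for d in row])
```

Passing `dist=manhattan` to `dissimilarity_matrix` gives the city-block version of the same matrix.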
Demo: Calculation of distance
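/* Compute pairwise Euclidean distances between the observations in cust_data */
proc distance data=cust_data out=Dist method=Euclid nostd;
   var interval(Credit_score Expenses);
run;
/* Print the resulting distance matrix */
proc print data=Dist;
run;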
Lab: Distance Calculation
proc distance data=cust_data out=Count_Dist method=Euclid nostd;
   var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate);
run;
proc print data=Count_Dist;
run;
Clustering algorithms
• k-means clustering algorithm
• Fuzzy c-means clustering algorithm
• Hierarchical clustering algorithm
• Gaussian (EM) clustering algorithm
• Quality Threshold (QT) clustering algorithm
• MST based clustering algorithm
• Density based clustering algorithm
• kernel k-means clustering algorithm
K-Means Clustering – Algorithm
1. The number k of clusters is fixed
2. An initial set of k “seeds” (aggregation centres) is provided, either as
   a. the first k elements, or
   b. other seeds (randomly selected or explicitly defined)
3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed
4. New seeds are computed
5. Go back to step 3 until no reclassification is necessary
Or simply:
Initialize k cluster centers
Do
   Assignment step: assign each data point to its closest cluster center
   Re-estimation step: re-compute cluster centers
While (there are still changes in the cluster centers)
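The loop above can be sketched in a few lines of Python (an illustration only; the course demo uses SAS PROC FASTCLUS). The function `kmeans` and the choice of Students 4, 7, and 3 as explicitly defined seeds are assumptions made for this example; the data is the student marks table from the earlier example slide.

```python
import math

def kmeans(points, seeds, max_iter=20):
    """Plain k-means with explicitly supplied seeds (k = len(seeds))."""
    centers = [tuple(s) for s in seeds]
    k = len(centers)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no reclassification, so the centers would not move
        labels = new_labels
        # Re-estimation step: each center becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centers

# Maths, Science, GK, Apt marks for Student-1 .. Student-9 (from the slides).
marks = [
    (94, 82, 87, 89), (46, 67, 33, 72), (98, 97, 93, 100),
    (14, 5, 7, 24), (86, 97, 95, 95), (34, 32, 75, 66),
    (69, 44, 59, 55), (85, 90, 96, 89), (24, 26, 15, 22),
]

# Hypothetical seeds: one low, one middle, one high scorer.
labels, centers = kmeans(marks, seeds=[marks[3], marks[6], marks[2]])
for c in range(3):
    print(c, [f"Student-{i + 1}" for i, lab in enumerate(labels) if lab == c])
```

Different seeds can give a different partition, which is why the seeding options in step 2 matter in practice.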
K-Means clustering
[Figure sequence, slides 18 to 31, illustrating the algorithm step by step. Slide captions:]
1. Overall population
2. Fix the number of clusters
3. Calculate the distance of each case from all clusters
4. Assign each case to the nearest cluster
5. Recalculate the cluster centers
6. Reassign after changing the cluster centers
7. Continue until there is no significant change between two iterations
K-Means clustering in action
• Dividing the data into 10 clusters using K-Means
[Figure annotation: the distance metric will decide the cluster for these points]
K-Means Clustering SAS Demo
• A supermarket wants to send promotional coupons to 100 families
• The idea is to identify 100 customers with medium income and low recent spend
/* Cluster the supermarket customers into at most 5 clusters */
proc fastclus data=sup_market radius=0 replace=full
              maxclusters=5 maxiter=20 distance out=clustr_out;
   id cust_id;
   var age family_size income spend visit_Other_shops;
run;
Lab: K- Means Clustering
• Download the contact center agents data
• The performance data contains
  • Average handling time
  • Average number of calls
  • CSAT
  • Resolution score
• Identify the top 10 agents for promotion based on the criteria below
  • High CSAT
  • High resolution score
  • Low average handling time
  • High number of calls
SAS Code Options
• The RADIUS= option establishes the minimum distance criterion for
selecting new seeds. No observation is considered as a new seed unless its
minimum distance to previous seeds exceeds the value given by the
RADIUS= option. The default value is 0.
• The MAXCLUSTERS= option specifies the maximum number of clusters
allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
• The REPLACE= option specifies how seed replacement is performed.
  • FULL: requests default seed replacement.
  • PART: requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds.
  • NONE: suppresses seed replacement.
  • RANDOM: selects a simple pseudo-random sample of complete observations as initial cluster seeds.
SAS Code & Options
• The MAXITER= option specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.
• The LIST option lists all observations, giving the value of the ID variable (if
any), the number of the cluster to which the observation is assigned, and
the distance between the observation and the final cluster seed.
• The DISTANCE option computes distances between the cluster means.
• The ID variable, which can be character or numeric, identifies observations
on the output when you specify the LIST option.
• The VAR statement lists the numeric variables to be used in the cluster
analysis. If you omit the VAR statement, all numeric variables not listed in
other statements are used.
Distance between Clusters
In the formulas below, tip is an element of cluster Ki, tjq is an element of cluster Kj, and min, max, and avg are taken over all such pairs.
• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min dist(tip, tjq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max dist(tip, tjq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg dist(tip, tjq)
• Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
• Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj), where a medoid is a chosen, centrally located object in the cluster
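The five definitions can be compared side by side on a toy example. This is a minimal Python sketch, assuming Euclidean distance between elements; the two clusters `ki` and `kj` are invented points on a line, and the function names are illustrative. The medoid is taken here as the member with the smallest total distance to the other members, which is one common way to choose a "centrally located object".

```python
import math
from itertools import product

def pairwise(ci, cj):
    """All element-to-element distances between two clusters."""
    return [math.dist(p, q) for p, q in product(ci, cj)]

def single_link(ci, cj):
    return min(pairwise(ci, cj))

def complete_link(ci, cj):
    return max(pairwise(ci, cj))

def average_link(ci, cj):
    d = pairwise(ci, cj)
    return sum(d) / len(d)

def centroid(cluster):
    """Mean of the cluster, variable by variable."""
    return tuple(sum(v) / len(cluster) for v in zip(*cluster))

def medoid(cluster):
    """Member with the smallest total distance to the other members."""
    return min(cluster, key=lambda p: sum(math.dist(p, q) for q in cluster))

def centroid_link(ci, cj):
    return math.dist(centroid(ci), centroid(cj))

def medoid_link(ci, cj):
    return math.dist(medoid(ci), medoid(cj))

# Hypothetical clusters, chosen so the arithmetic is easy to follow.
ki = [(0, 0), (1, 0), (2, 0)]
kj = [(5, 0), (7, 0), (9, 0)]

for f in (single_link, complete_link, average_link, centroid_link, medoid_link):
    print(f.__name__, f(ki, kj))
```

On this data the single link is the gap between the nearest members and the complete link is the span between the farthest ones, so the choice of definition visibly changes how "far apart" two clusters are.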
SAS output interpretation
• RMSSTD: pooled standard deviation of all the variables forming the cluster (variance within a cluster). Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible
• SPRSQ: semipartial R-squared measures the loss of homogeneity due to combining two groups or clusters to form a new group or cluster (the error incurred by combining two groups)
  • Thus, the SPRSQ value should be small, implying that we are merging two homogeneous groups
SAS output interpretation
• RSQ (R-squared) measures the extent to which groups or clusters differ from each other (variance between the clusters)
  • When you have just one cluster, the RSQ value is, intuitively, zero. Thus, the RSQ value should be high.
• Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged
  • So, Centroid Distance is a measure of the homogeneity of merged clusters, and the value should be small.
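To make RMSSTD and RSQ concrete, here is a small Python sketch under stated assumptions: RMSSTD is taken as the square root of a cluster's sum of squared deviations divided by p(n - 1) for n points and p variables, and RSQ as 1 minus the ratio of within-cluster to total sum of squares. These are the usual textbook definitions and may differ in detail from what SAS reports; the two clusters are invented data and the function names are illustrative.

```python
import math

def column_means(points):
    return [sum(col) / len(points) for col in zip(*points)]

def sum_of_squares(points):
    """Sum of squared deviations from the column means, over all variables."""
    means = column_means(points)
    return sum((v - m) ** 2 for p in points for v, m in zip(p, means))

def rmsstd(cluster):
    """Pooled standard deviation of a cluster (needs at least 2 points)."""
    n, p = len(cluster), len(cluster[0])
    return math.sqrt(sum_of_squares(cluster) / (p * (n - 1)))

def r_squared(clusters):
    """Share of total variation explained by the split into clusters."""
    everything = [pt for c in clusters for pt in c]
    within = sum(sum_of_squares(c) for c in clusters)
    return 1 - within / sum_of_squares(everything)

# Hypothetical data: two tight, well-separated clusters.
clusters = [[(1, 1), (3, 3)], [(10, 10), (12, 12)]]

print("RMSSTD per cluster:", [round(rmsstd(c), 3) for c in clusters])
print("RSQ, two clusters:", round(r_squared(clusters), 3))
print("RSQ, one cluster: ", r_squared([clusters[0] + clusters[1]]))
```

With everything lumped into one cluster the within and total sums of squares coincide, which is why RSQ is zero in that case.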
Distance Calculation on standardized data
Raw data:
          Weight  Income
Cust1         68  60,000
Cust2         72   9,000
Cust3        100  62,000
Average       80  43,667
Stdev         14  24,527

Standardized data:
          Weight  Income
Cust1      -0.84    0.67
Cust2      -0.56   -1.41
Cust3       1.40    0.75
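The standardized table can be reproduced, and its effect on distances checked, with a short Python sketch (Python is used for illustration only). It assumes the population standard deviation (`statistics.pstdev`), since that is what matches the slide's Stdev figures of about 14 and 24,527; the variable names are made up for this example.

```python
import math
from statistics import mean, pstdev

# (weight, income) per customer, copied from the raw-data table.
customers = {"Cust1": (68, 60000), "Cust2": (72, 9000), "Cust3": (100, 62000)}

columns = list(zip(*customers.values()))
means = [mean(col) for col in columns]
stdevs = [pstdev(col) for col in columns]  # population standard deviation

# z-score: (value - column mean) / column standard deviation
z = {
    name: tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
    for name, row in customers.items()
}

for name, row in z.items():
    print(name, [round(v, 2) for v in row])

# Compare Euclidean distances before and after standardization.
for a, b in (("Cust1", "Cust2"), ("Cust1", "Cust3"), ("Cust2", "Cust3")):
    raw = math.dist(customers[a], customers[b])
    std = math.dist(z[a], z[b])
    print(f"{a} vs {b}: raw={raw:.2f} standardized={std:.2f}")
```

On the raw data the income column dominates every distance because of its scale; after standardization weight and income contribute on comparable terms, and the closest pair can change.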
Cluster Analysis for Dummies

  • 1. Data Analysis Course Cluster Analysis Venkat Reddy
  • 2. Contents • What is the need of Segmentation • Introduction to Segmentation & Cluster analysis • Applications of Cluster Analysis • Types of Clusters • K-Means clustering DataAnalysisCourse VenkatReddy 2
  • 3. What is the need of segmentation? Problem: • 10,000 Customers - we know their age, city name, income, employment status, designation • You have to sell 100 Blackberry phones(each costs $1000) to the people in this group. You have maximum of 7 days • If you start giving demos to each individual, 10,000 demos will take more than one year. How will you sell maximum number of phones by giving minimum number of demos? DataAnalysisCourse VenkatReddy 3
  • 4. What is the need of segmentation? Solution • Divide the whole population into two groups employed / unemployed • Further divide the employed population into two groups high/low salary • Further divide that group into high /low designation DataAnalysisCourse VenkatReddy 4 10000 customers Unemployed 3000 Employed 7000 Low salary 5000 High Salary 2000 Low Designation 1800 High Designation 200
  • 5. Segmentation and Cluster Analysis • Cluster is a group of similar objects (cases, points, observations, examples, members, customers, patients, locations, etc) • Finding the groups of cases/observations/ objects in the population such that the objects are • Homogeneous within the group (high intra-class similarity) • Heterogeneous between the groups(low inter-class similarity ) DataAnalysisCourse VenkatReddy 5 Inter-cluster distances are maximized Intra-cluster distances are minimized DataAnalysisCourse VenkatReddy
  • 6. Applications of Cluster Analysis • Market Segmentation: Grouping people (with the willingness, purchasing power, and the authority to buy) according to their similarity in several dimensions related to a product under consideration. • Sales Segmentation: Clustering can tell you what types of customers buy what products • Credit Risk: Segmentation of customers based on their credit history • Operations: High performer segmentation & promotions based on person’s performance • Insurance: Identifying groups of motor insurance policy holders with a high average claim cost. • City-planning: Identifying groups of houses according to their house type, value, and geographical location • Geographical: Identification of areas of similar land use in an earth observation database. DataAnalysisCourse VenkatReddy 6
  • 7. Types of Clusters DataAnalysisCourse VenkatReddy 7 • Partitional clustering or non-hierarchical : A division of objects into non-overlapping subsets (clusters) such that each object is in exactly one cluster • The non-hierarchical methods divide a dataset of N objects into M clusters. • K-means clustering, a non-hierarchical technique, is the most commonly used one in business analytics • Hierarchical clustering: A set of nested clusters organized as a hierarchical tree • The hierarchical methods produce a set of nested clusters in which each pair of objects or clusters is progressively nested in a larger cluster until only one cluster remains • CHAID tree is most widely used in business analytics
  • 8. Cluster Analysis -Example DataAnalysisCourse VenkatReddy 8 Maths Science Gk Apt Student-1 94 82 87 89 Student-2 46 67 33 72 Student-3 98 97 93 100 Student-4 14 5 7 24 Student-5 86 97 95 95 Student-6 34 32 75 66 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-9 24 26 15 22 Maths Science Gk Apt Student-1 94 82 87 89 Student-2 46 67 33 72 Student-3 98 97 93 100 Student-4 14 5 7 24 Student-5 86 97 95 95 Student-6 34 32 75 66 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-9 24 26 15 22 Maths Science Gk Apt Student-4 14 5 7 24 Student-9 24 26 15 22 Student-6 34 32 75 66 Student-2 46 67 33 72 Student-7 69 44 59 55 Student-8 85 90 96 89 Student-5 86 97 95 95 Student-1 94 82 87 89 Student-3 98 97 93 100 4,9,6 2,7 8,5,1,3
  • 9. Building Clusters 1. Select a distance measure 2. Select a clustering algorithm 3. Define the distance between two clusters 4. Determine the number of clusters 5. Validate the analysis DataAnalysisCourse VenkatReddy 9 • The aim is to build clusters i.e divide the whole population into group of similar objects • What is similarity/dis-similarity? • How do you define distance between two clusters
  • 10. Dissimilarity & Similarity DataAnalysisCourse VenkatReddy 10 Weight Cust1 68 Cust2 72 Cust3 100 Weight Age Cust1 68 25 Cust2 72 70 Cust3 100 28 Weight Age Income Cust1 68 25 60,000 Cust2 72 70 9,000 Cust3 100 28 62,000 Which two customers are similar? Which two customers are similar now? Which two customers are similar in this case?
  • 11. Quantify dissimilarity-Distancemeasures • To measure similarity between two observations a distance measure is needed. With a single variable, similarity is straightforward • Example: income – two individuals are similar if their income level is similar and the level of dissimilarity increases as the income gap increases • Multiple variables require an aggregate distance measure • Many characteristics (e.g. income, age, consumption habits, family composition, owning a car, education level, job…), it becomes more difficult to define similarity with a single value • The most known measure of distance is the Euclidean distance, which is the concept we use in everyday life for spatial coordinates. DataAnalysisCourse VenkatReddy 11
  • 12. Examples of distances DataAnalysisCourse VenkatReddy 12   2 1 n ij ki kj k D x x    1 n ij ki kj k D x x    Euclidean distance City-block (Manhattan) distance A B A B Dij distance between cases i and j xkj - value of variable xk for case j Other distance measures: Chebychev, Minkowski, Mahalanobis, maximum distance, cosine similarity, simple correlation between observations etc.,                   npx...nfx...n1x ............... ipx...ifx...i1x ............... 1px...1fx...11x                 0...)2,()1,( ::: )2,3() ...ndnd 0dd(3,1 0d(2,1) 0 Data matrix Dissimilarity matrix
  • 13. Calculating the distance DataAnalysisCourse VenkatReddy 13 Weight Cust1 68 Cust2 72 Cust3 100 • Cust1 vs Cust2 :- (68-72)= 4 • Cust2 vs Cust3 :- (72-100) = 28 • Cust3 vs Cust1 :- (100-68) =32 Weight Age Cust1 68 25 Cust2 72 70 Cust3 100 28 • Cust1 vs Cust2 :- sqrt((68-72)^2 + (25-70)^2)=44.9 • Cust2 vs Cust3 :- 50.54 • Cust3 vs Cust1 :- 32.14
  • 14. Demo: Calculation of distance proc distance data=cust_data out=Dist method=Euclid nostd; var interval(Credit_score Expenses); run; proc print data=Dist; run; DataAnalysisCourse VenkatReddy 14
  • 15. Lab: Distance Calculation proc distance data=cust_data out=Count_Dist method=Euclid nostd; var interval(Area_Sq_Miles_ GDP_MM_ Unemp_rate); run; proc print data=Count_Dist; run; DataAnalysisCourse VenkatReddy 15
  • 16. Clustering algorithms • k-means clustering algorithm • Fuzzy c-means clustering algorithm • Hierarchical clustering algorithm • Gaussian(EM) clustering algorithm • Quality Threshold (QT) clustering algorithm • MST based clustering algorithm • Density based clustering algorithm • kernel k-means clustering algorithm DataAnalysisCourse VenkatReddy 16
  • 17. K -Means Clustering – Algorithm 1. The number k of clusters is fixed 2. An initial set of k “seeds” (aggregation centres) is provided 1. First k elements 2. Other seeds (randomly selected or explicitly defined) 3. Given a certain fixed threshold, all units are assigned to the nearest cluster seed 4. New seeds are computed 5. Go back to step 3 until no reclassification is necessary Or simply Initialize k cluster centers Do Assignment step: Assign each data point to its closest cluster center Re-estimation step: Re-compute cluster centers While (there are still changes in the cluster centers) DataAnalysisCourse VenkatReddy 17
  • 31. K-Means clustering: Continue until there is no significant change between two successive iterations
  • 32. K-Means clustering in action
    • Dividing the data into 10 clusters using K-Means
    • The distance metric decides which cluster these points are assigned to
  • 33. K-Means Clustering SAS Demo
    • A supermarket wanted to send some promotional coupons to 100 families
    • The idea is to identify 100 customers with medium income and low recent spends
      proc fastclus data=sup_market radius=0 replace=full maxclusters=5 maxiter=20 distance out=clustr_out;
        id cust_id;
        var age family_size income spend visit_Other_shops;
      run;
  • 34. Lab: K-Means Clustering
    • Download the contact center agents data
    • The performance data contains
      • Average handling time
      • Average number of calls
      • CSAT
      • Resolution score
    • Identify the top 10 agents for promotion based on the criteria below
      • High CSAT
      • High resolution score
      • Low average handling time
      • High number of calls
  • 35. SAS Code Options
    • The RADIUS= option establishes the minimum distance criterion for selecting new seeds. No observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0.
    • The MAXCLUSTERS= option specifies the maximum number of clusters allowed. If you omit the MAXCLUSTERS= option, a value of 100 is assumed.
    • The REPLACE= option specifies how seed replacement is performed.
      • FULL: requests default seed replacement.
      • PART: requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds.
      • NONE: suppresses seed replacement.
      • RANDOM: selects a simple pseudo-random sample of complete observations as initial cluster seeds.
  • 36. SAS Code & Options
    • The MAXITER= option specifies the maximum number of iterations for recomputing cluster seeds. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.
    • The LIST option lists all observations, giving the value of the ID variable (if any), the number of the cluster to which the observation is assigned, and the distance between the observation and the final cluster seed.
    • The DISTANCE option computes distances between the cluster means.
    • The ID variable, which can be character or numeric, identifies observations in the output when you specify the LIST option.
    • The VAR statement lists the numeric variables to be used in the cluster analysis. If you omit the VAR statement, all numeric variables not listed in other statements are used.
  • 37. Distance between Clusters
    Here $t_{ip}$ is an element of cluster $K_i$, $t_{jq}$ an element of cluster $K_j$, and $d(\cdot,\cdot)$ the distance between two elements.
    • Single link: smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \min_{p,q} d(t_{ip}, t_{jq})$
    • Complete link: largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \max_{p,q} d(t_{ip}, t_{jq})$
    • Average: average distance between an element in one cluster and an element in the other, i.e., $\mathrm{dist}(K_i, K_j) = \mathrm{avg}_{p,q}\, d(t_{ip}, t_{jq})$
    • Centroid: distance between the centroids of two clusters, i.e., $\mathrm{dist}(K_i, K_j) = d(C_i, C_j)$
    • Medoid: distance between the medoids of two clusters, i.e., $\mathrm{dist}(K_i, K_j) = d(M_i, M_j)$
      Medoid: a chosen, centrally located object in the cluster
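The five cluster-to-cluster distances can be compared on a small example. Below is a minimal Python sketch using Euclidean distance between elements; the function names and the toy clusters `A` and `B` are hypothetical, and the medoid is taken here as the member with the smallest total distance to the other members, which is one common way to choose it.

```python
import math
from itertools import product

def single_link(A, B):
    """Smallest distance over all cross-cluster pairs."""
    return min(math.dist(a, b) for a, b in product(A, B))

def complete_link(A, B):
    """Largest distance over all cross-cluster pairs."""
    return max(math.dist(a, b) for a, b in product(A, B))

def average_link(A, B):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(A, B)) / (len(A) * len(B))

def centroid(cluster):
    """Coordinate-wise mean of the cluster's members."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def medoid(cluster):
    """Member with the smallest total distance to all other members."""
    return min(cluster, key=lambda m: sum(math.dist(m, x) for x in cluster))

def centroid_link(A, B):
    return math.dist(centroid(A), centroid(B))

def medoid_link(A, B):
    return math.dist(medoid(A), medoid(B))

# Hypothetical toy clusters
A = [(0.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
B = [(4.0, 0.0), (5.0, 0.0), (7.0, 0.0)]
for name, fn in [("single", single_link), ("complete", complete_link),
                 ("average", average_link), ("centroid", centroid_link),
                 ("medoid", medoid_link)]:
    print(f"{name}: {fn(A, B):.3f}")
```

Unlike the centroid, the medoid is always an actual observation in the cluster.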
  • 38. SAS output interpretation
    • RMSSTD: Pooled standard deviation of all the variables forming the cluster (variance within a cluster). Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible.
    • SPRSQ: Semipartial R-squared is a measure of the homogeneity of merged clusters, so SPRSQ is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster (the error incurred by combining two groups). Thus, the SPRSQ value should be small to imply that we are merging two homogeneous groups.
  • 39. SAS output interpretation
    • RSQ (R-squared) measures the extent to which groups or clusters differ from each other (variance between the clusters). With just one cluster the RSQ value is, intuitively, zero. Thus, the RSQ value should be high.
    • Centroid Distance is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged. So, Centroid Distance is a measure of the homogeneity of merged clusters and the value should be small.
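To make RMSSTD and RSQ concrete, here is a minimal Python sketch. It assumes the usual definitions, which the slides do not spell out: RMSSTD as the square root of the within-cluster sum of squares pooled over p variables and divided by p(n - 1), and RSQ as 1 - SSW/SST. The function names and toy clusters are hypothetical, and SAS output may differ in details.

```python
def sum_of_squares(rows):
    """Squared deviations from each variable's mean, summed over variables."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return sum((x - m) ** 2 for row in rows for x, m in zip(row, means))

def rmsstd(cluster):
    """Pooled standard deviation of all variables within one cluster (assumed formula)."""
    p = len(cluster[0])
    return (sum_of_squares(cluster) / (p * (len(cluster) - 1))) ** 0.5

def r_squared(clusters):
    """Share of total variation explained by the clustering: 1 - SSW/SST."""
    all_rows = [row for c in clusters for row in c]
    sst = sum_of_squares(all_rows)                  # total sum of squares
    ssw = sum(sum_of_squares(c) for c in clusters)  # within-cluster sum of squares
    return 1 - ssw / sst

# Hypothetical toy clustering: two tight, well-separated groups
clusters = [[(1.0, 1.0), (2.0, 2.0)], [(8.0, 8.0), (9.0, 9.0)]]
print(rmsstd(clusters[0]))   # small: the cluster is homogeneous
print(r_squared(clusters))   # close to 1: the clusters are well separated

# Putting everything in one cluster gives RSQ = 0
one_cluster = [[row for c in clusters for row in c]]
print(r_squared(one_cluster))
```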
  • 40. Distance Calculation on standardized data
    Raw data:
                Weight    Income
      Cust1       68      60,000
      Cust2       72       9,000
      Cust3      100      62,000
      Average     80      43,667
      Stdev       14      24,527
    Standardized data:
                Weight    Income
      Cust1     -0.84      0.67
      Cust2     -0.56     -1.41
      Cust3      1.40      0.75
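The standardized values in the table can be reproduced as z-scores, (value - average) / stdev. The Python sketch below assumes the population standard deviation (dividing by n), since that is what matches the Stdev row on the slide; the helper name `zscores` is illustrative.

```python
import math
from statistics import mean, pstdev

# Raw data from the slide
weight = [68, 72, 100]
income = [60000, 9000, 62000]

def zscores(values):
    """Standardize: subtract the mean, divide by the population stdev."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

z_weight = zscores(weight)
z_income = zscores(income)

raw = list(zip(weight, income))
std = list(zip(z_weight, z_income))

# Compare Cust1 vs Cust2 before and after standardization
print(f"raw distance:          {math.dist(raw[0], raw[1]):.2f}")
print(f"standardized distance: {math.dist(std[0], std[1]):.2f}")
```

On the raw data the distance is almost entirely the income difference; after standardization both variables contribute on a comparable scale.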