Clustering Algorithms: An Introduction

Classification
  • Method of supervised learning
  • Learns a method for predicting the instance class from pre-labeled (classified) instances

Clustering
  • Method of unsupervised learning
  • Finds a “natural” grouping of instances given unlabeled data

Clustering Methods
  Many different methods and algorithms:
  • For numeric and/or symbolic data
  • Deterministic vs. probabilistic
  • Exclusive vs. overlapping
  • Hierarchical vs. flat
  • Top-down vs. bottom-up

Clusters: exclusive vs. overlapping
  [Figure: instances labeled a through k, shown in exclusive and in overlapping clusters]

Example of Outlier
  [Figure: scatter plot of points marked x, with one point labeled as the outlier]

Methods of Clustering
  • Hierarchical (agglomerative): initially, each point is in a cluster by itself; repeatedly combine the two “nearest” clusters into one.
  • Point assignment: maintain a set of clusters; place each point into its “nearest” cluster.

Hierarchical clustering
  • Bottom up
    - Start with single-instance clusters
    - At each step, join the two closest clusters
    - Design decision: distance between clusters, e.g. the two closest instances in the clusters vs. the distance between their means
  • Top down
    - Start with one universal cluster
    - Find two clusters
    - Proceed recursively on each subset
    - Can be very fast
  • Both methods produce a dendrogram
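
The bottom-up procedure and its design decision (how to measure distance between clusters) can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `single_link`, `mean_link`, and `agglomerate`, the stopping rule (stop at `k` clusters), and the sample points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def single_link(a, b):
    """Distance between the two closest instances of clusters a and b."""
    return min(dist(p, q) for p in a for q in b)


def mean_link(a, b):
    """Distance between the means of clusters a and b."""
    def mean(cluster):
        return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return dist(mean(a), mean(b))


def agglomerate(points, k, cluster_distance=single_link):
    """Join the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]   # start with single-instance clusters
    merges = []                        # merge history, i.e. the dendrogram bottom up
    while len(clusters) > k:
        # Find the pair of clusters with the smallest distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical sample data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerate(points, k=3)
```

Passing `cluster_distance=mean_link` switches the design decision from "two closest instances" to "distance between means" without changing the merge loop.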

Incremental clustering
  • Heuristic approach (COBWEB/CLASSIT)
  • Forms a hierarchy of clusters incrementally
  • Start: the tree consists of an empty root node
  • Then: add instances one by one, updating the tree appropriately at each stage
  • To update, find the right leaf for an instance; this may involve restructuring the tree
  • Base update decisions on category utility
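
The slide names category utility without defining it. As a hedged point of reference, one common formulation for nominal attributes scores a partition of the instances into clusters C_1, ..., C_k as shown below. The symbols are assumptions introduced for this illustration and do not appear in the slides: a_i is the i-th attribute and v_ij is its j-th possible value.

```latex
CU(C_1, \dots, C_k) = \frac{1}{k} \sum_{l=1}^{k} \Pr[C_l]
  \sum_{i} \sum_{j} \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)
```

The inner difference is large when knowing an instance's cluster makes its attribute values easier to guess, and dividing by k discourages solutions with many tiny clusters.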

And in the Non-Euclidean Case?
  • The only “locations” we can talk about are the points themselves; i.e., there is no “average” of two points.
  • Approach 1: clustroid = the point “closest” to the other points.
  • Treat the clustroid as if it were the centroid when computing intercluster distances.

“Closest” Point?
  Possible meanings:
  • Smallest maximum distance to the other points.
  • Smallest average distance to the other points.
  • Smallest sum of squares of distances to the other points.
  • Etc.
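
The three meanings of “closest” can give different clustroids for the same cluster. The sketch below makes this concrete under stated assumptions: the function name `clustroid`, the criterion labels, and the use of Hamming distance on equal-length strings as the non-Euclidean distance are all choices made for this example only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def clustroid(points, distance, criterion="sum_squares"):
    """Return the cluster member that is 'closest' to the other members."""
    def score(i):
        # Distances from point i to every other point in the cluster.
        ds = [distance(points[i], q) for j, q in enumerate(points) if j != i]
        if criterion == "max":
            return max(ds)                  # smallest maximum distance
        if criterion == "average":
            return sum(ds) / len(ds)        # smallest average distance
        if criterion == "sum_squares":
            return sum(d * d for d in ds)   # smallest sum of squared distances
        raise ValueError(f"unknown criterion: {criterion}")

    best = min(range(len(points)), key=score)
    return points[best]


# Hypothetical cluster of strings; there is no meaningful "average" string.
cluster = ["aaaa", "aaab", "aabb", "bbbb"]
by_max = clustroid(cluster, hamming, "max")
by_average = clustroid(cluster, hamming, "average")
by_sum_squares = clustroid(cluster, hamming, "sum_squares")
```

On this data the maximum and sum-of-squares criteria both pick "aabb", while the average criterion is a tie between "aaab" and "aabb", so the choice of criterion (and of tie-breaking) matters.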

k-Means Algorithm(s)
  • Assumes Euclidean space.
  • Start by picking k, the number of clusters.
  • Initialize clusters by picking one point per cluster.
  • Example: pick one point at random, then k − 1 other points, each as far away as possible from the previous points.
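
The initialization example above can be sketched as follows. The slide does not say how “as far away as possible from the previous points” is measured; this sketch assumes it means maximizing the distance to the nearest already-chosen point. The function name `farthest_first`, the `seed` parameter, and the sample points are hypothetical.

```python
import random
from math import dist


def farthest_first(points, k, seed=0):
    """Pick k initial cluster points: one at random, the rest far from earlier picks."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]           # first point chosen at random
    while len(centers) < k:
        # Assumed reading of "as far away as possible": maximize the
        # distance to the nearest point chosen so far.
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Hypothetical data: a left group, a right group, and one point at the top.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 10.0)]
centers = farthest_first(points, k=3)
```

Whatever the random first pick, this data yields one starting point from each of the three groups, which is the purpose of spreading the initial points out.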

Populating Clusters
  • Place each point in the cluster whose current centroid is nearest to it.
  • After all points are assigned, fix the centroids of the k clusters.
  • Optional: reassign all points to their closest centroid. This sometimes moves points between clusters.

Simple Clustering: K-means
  Works with numeric data only.
  1. Pick a number (K) of cluster centers (at random)
  2. Assign every item to its nearest cluster center (e.g. using Euclidean distance)
  3. Move each cluster center to the mean of its assigned items
  4. Repeat steps 2 and 3 until convergence (change in cluster assignments less than a threshold)
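
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the name `k_means`, the `max_iter` and `seed` parameters, and the sample data are hypothetical, and convergence is simplified to "no assignment changed" rather than a change threshold.

```python
import random
from math import dist


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of numeric tuples."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def k_means(items, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(items, k)            # step 1: K centers at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign every item to its nearest center (Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: dist(x, centers[c])) for x in items
        ]
        if new_assignment == assignment:      # step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned items.
        for c in range(k):
            members = [x for x, a in zip(items, assignment) if a == c]
            if members:                       # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return centers, assignment


# Hypothetical data: two well-separated groups of three items each.
items = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, assignment = k_means(items, k=2)
```

Every item receives a cluster label, which reflects the point that all items are forced into a cluster, including any outliers.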

K-means clustering summary
  Advantages
  • Simple, understandable
  • Items are automatically assigned to clusters
  Disadvantages
  • Must pick the number of clusters beforehand
  • All items are forced into a cluster
  • Too sensitive to outliers

K-means variations
  • K-medoids: instead of the mean, use a median-like representative of each cluster (the medoid, an actual data item)
    - Mean of 1, 3, 5, 7, 9 is 5
    - Mean of 1, 3, 5, 7, 1009 is 205
    - Median of 1, 3, 5, 7, 1009 is 5
    - Median advantage: not affected by extreme values
  • For large databases, use sampling
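
The arithmetic on this slide can be checked with Python's standard `statistics` module. The variable names are chosen only for this illustration.

```python
from statistics import mean, median

typical = [1, 3, 5, 7, 9]
with_outlier = [1, 3, 5, 7, 1009]

mean_typical = mean(typical)             # (1 + 3 + 5 + 7 + 9) / 5 = 5
mean_outlier = mean(with_outlier)        # (1 + 3 + 5 + 7 + 1009) / 5 = 205
median_outlier = median(with_outlier)    # middle value of the sorted list = 5

# A single extreme value moves the mean from 5 to 205 but leaves the median at 5.
shift_in_mean = mean_outlier - mean_typical
shift_in_median = median_outlier - median(typical)
```

The mean shifts by 200 while the median does not move, which is why a median-like cluster representative is less sensitive to outliers than the mean.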

Examples of Clustering Applications
  • Marketing: discover customer groups and use them for targeted marketing and re-organization
  • Astronomy: find groups of similar stars and galaxies
  • Earthquake studies: observed earthquake epicenters should be clustered along continental faults
  • Genomics: find groups of genes with similar expression
  • And many more.

Clustering Summary
  • Unsupervised
  • Many approaches
    - K-means: simple, sometimes useful
    - K-medoids: less sensitive to outliers
    - Hierarchical clustering: works for symbolic attributes

References
  This PPT is compiled from: Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, March 2006. ISBN 1-55860-901-6.
