Chapter 12 Clustering: Large Databases. Written by Farial Shahnaz. Presented by Zhao Xinyou. Data Mining Technology
Contents: Introduction; Ideas behind the three major approaches to scalable clustering {Divide-and-Conquer, Incremental, Parallel}; Three algorithms for scalable clustering {BIRCH, DBSCAN, CURE}; Application
Introduction: Common method. The common method for clustering visits all the data in the database and analyzes it. Time: the computational complexity is O(n*n). Memory: all data must be loaded into main memory. For a huge database (millions of records), both costs become prohibitive. [Figure: time/memory plotted against data size] PP133
Motivation: Clustering for large databases. [Figure: time/memory plotted against data size, growing as O(n*n) for the common method and as O(n) for the desired method] Which method achieves this? PP134
Requirements: Clustering for large databases. No more than one scan of the database (preferably less). Process each record only once. Work with limited memory. Can suspend, stop, and resume. Can update the results when data are inserted or removed. Can use different techniques to scan the database. During execution, the method should provide its status and the ‘best’ answer so far. [Figure: time/memory plotted against data size, O(n*n) versus O(n)] PP134
Major approaches for scalable clustering: Divide-and-Conquer approach; Parallel clustering approach; Incremental clustering approach. PP135
Divide-and-Conquer approach. Definition: Divide-and-conquer is a problem-solving approach in which we divide the problem into sub-problems, recursively conquer (solve) each sub-problem, and then combine the sub-problem solutions to obtain a solution to the original problem. Key assumptions: 1. Problem solutions can be constructed from sub-problem solutions. 2. Sub-problem solutions are independent of one another. Example: 9×9 Sudoku. PP135
Parallel clustering approach. Idea: divide the data into small sets and then process each small set on a different machine (derived from Divide-and-Conquer). PP136-137
Explanation of Divide-and-Conquer: the Divide step is carried out by some algorithm, and the Conquer step is carried out by some algorithm.
Applications: sorting (quicksort and merge sort); fast Fourier transforms; Tower of Hanoi puzzle; matrix multiplication; ... PP135
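Merge sort, one of the applications listed above, shows the three steps of the definition in a few lines. This is a minimal sketch; the function name `merge_sort` is chosen here for illustration and does not come from the slides.

```python
def merge_sort(items):
    """Sort a list with divide-and-conquer: divide, conquer, combine."""
    if len(items) <= 1:               # a sub-problem this small is already solved
        return list(items)
    mid = len(items) // 2             # divide: split the problem in two halves
    left = merge_sort(items[:mid])    # conquer: solve each half independently
    right = merge_sort(items[mid:])
    # combine: merge the two sorted halves into one sorted list
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The two recursive calls never share state, which is the independence assumption that also makes the parallel approach possible.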
CURE: Divide-and-Conquer. 1. Get the size n of the data set D and partition D into p groups (each containing n/p elements). 2. Cluster each group p_i into k groups, using a heap and a k-d tree. 3. Delete the nodes that have no relationship to the others (outliers) from the heap and the k-d tree. 4. Cluster the partial clusters to get the final clusters. PP140-141
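The partition, cluster, and re-cluster flow of the steps above can be sketched as follows. This is not CURE itself: it is a simplified illustration that assumes plain centroid-distance merging, and it leaves out the heap, the k-d tree, outlier removal, and CURE's representative points. The names `agglomerate` and `partitioned_clustering` and the parameters `k_partial` and `k_final` are hypothetical.

```python
import math

def centroid(points):
    """Mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def agglomerate(clusters, k):
    """Repeatedly merge the two clusters with the nearest centroids until k remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # j > i, so index i stays valid
    return clusters

def partitioned_clustering(data, p, k_partial, k_final):
    """Step 1: partition; step 2: cluster each group; step 4: cluster the partial clusters."""
    size = max(1, math.ceil(len(data) / p))
    partial = []
    for start in range(0, len(data), size):
        group = data[start:start + size]
        partial.extend(agglomerate([[x] for x in group], k_partial))
    return agglomerate(partial, k_final)
```

Each group costs roughly the square of n/p pairwise comparisons per merge instead of the square of n, which is where the saving from partitioning comes from.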
Heap  PP140-141
k-d tree: Technically, the letter k refers to the number of dimensions. [Figure: a 3-dimensional k-d tree] PP140-141
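To make the role of k concrete, here is a minimal k-d tree sketch: the splitting axis cycles through the k dimensions, one per tree level. The dictionary layout and the function names `build_kdtree` and `nearest` are assumptions made for this example, not taken from the slides.

```python
import math

def build_kdtree(points, depth=0):
    """Return a nested dict node, or None for an empty point list."""
    if not points:
        return None
    k = len(points[0])                      # k = number of dimensions
    axis = depth % k                        # cycle through the k dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                  # the median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, target, best=None):
    """Return the stored point closest to target (Euclidean distance)."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], target) < math.dist(best, target):
        best = node["point"]
    diff = target[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    if abs(diff) < math.dist(best, target):  # the far side may still hold a closer point
        best = nearest(far, target, best)
    return best
```

A nearest-neighbor query of this kind is what lets a clustering step find the closest cluster without comparing against every point.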
K-D Tree PP140-141
CURE: Divide-and-Conquer. [Figure: partial clusters are combined by repeatedly merging the nearest pair] PP140-141
Incremental clustering approach. Idea: scan the data in the database one record at a time and compare each record with the existing clusters. If a similar cluster is found, assign the record to that cluster; otherwise create a new cluster. Continue until no data remain. PP135-136
Steps:
1. S = {};  // start with an empty set of clusters
2. do {
3.   read one record d;
4.   r = find_similar_cluster(d, S);
5.   if (r exists)
6.     assign d to the cluster r;
7.   else
8.     add_cluster(d, S);
9. } until (no record is left in the database);
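The steps above can be sketched in Python. The slides do not say how similarity is measured, so this example assumes Euclidean distance to a running cluster centroid and a hypothetical `threshold` parameter that decides when a cluster counts as similar.

```python
import math

def incremental_clustering(records, threshold):
    """Single pass: assign each record to the nearest cluster within threshold."""
    clusters = []  # each cluster: {"centroid": tuple, "members": list}
    for d in records:                      # every record is read exactly once
        best, best_dist = None, threshold
        for c in clusters:                 # find the most similar existing cluster
            dist = math.dist(d, c["centroid"])
            if dist <= best_dist:
                best, best_dist = c, dist
        if best is None:                   # nothing similar: start a new cluster
            clusters.append({"centroid": tuple(d), "members": [d]})
        else:                              # assign d and update the running mean
            best["members"].append(d)
            n = len(best["members"])
            best["centroid"] = tuple(
                m + (x - m) / n for m, x in zip(best["centroid"], d)
            )
    return clusters
```

The result depends on the order in which records arrive, which is a known trade-off of single-pass methods.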
Applications of the incremental clustering approach: BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies); DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). Based on distance measurements, BIRCH computes the similarity between a record and a cluster and produces the clusters. Distances are measured within a cluster (inner-cluster) and between clusters (among-cluster). PP137-138
BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). [Figure: inner-cluster and among-cluster distances] PP137-138
Related Definitions. Cluster: {x_i}, where i = 1, 2, …, N. CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is the square sum of the N data points.
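Written out, the triple and the quantities usually derived from it look as follows. This assumes SS is the scalar sum of squared norms (the slide only says "square sum"); the symbols for the centroid and the radius R are introduced here for illustration.

```latex
\begin{aligned}
CF &= (N, \vec{LS}, SS), \qquad \vec{LS} = \sum_{i=1}^{N} \vec{x}_i, \qquad SS = \sum_{i=1}^{N} \lVert \vec{x}_i \rVert^2 \\
CF_1 + CF_2 &= (N_1 + N_2,\ \vec{LS}_1 + \vec{LS}_2,\ SS_1 + SS_2) \\
\vec{x}_0 &= \frac{\vec{LS}}{N}, \qquad R = \sqrt{\frac{SS}{N} - \left\lVert \frac{\vec{LS}}{N} \right\rVert^2}
\end{aligned}
```

For the two points (1, 2) and (3, 4): N = 2, LS = (4, 6), SS = 1 + 4 + 9 + 16 = 30, so the centroid is (2, 3) and R = sqrt(30/2 - 13) = sqrt(2). Because the triple is additive, two clusters can be merged without revisiting their points.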
Related Definitions. CF tree = (B, T). B = (CF_i, child_i) if the node is an internal node; B = (CF_i, prev, next) if the node is an external (leaf) node. T: the threshold for all leaf nodes, each of which must satisfy mean distance D < T.
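The threshold test can be illustrated with a flat list of leaf entries. This is only a sketch under stated assumptions: it uses the cluster radius as the distance measure D, keeps no tree structure, and never splits nodes. The class `CF` and the function `insert_point` are hypothetical names for this example.

```python
import math

class CF:
    """Clustering Feature (N, LS, SS) for points given as tuples of numbers."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def added(self, x):
        """Return a new CF equal to this one plus the point x."""
        out = CF(len(self.ls))
        out.n = self.n + 1
        out.ls = [a + b for a, b in zip(self.ls, x)]
        out.ss = self.ss + sum(v * v for v in x)
        return out

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """Root mean squared distance of the members from the centroid."""
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

def insert_point(entries, x, threshold):
    """Absorb x into the closest entry if its radius stays below threshold."""
    if entries:
        closest = min(range(len(entries)),
                      key=lambda i: math.dist(entries[i].centroid(), x))
        candidate = entries[closest].added(x)
        if candidate.radius() < threshold:   # assumed form of the D < T test
            entries[closest] = candidate
            return
    entries.append(CF(len(x)).added(x))      # otherwise start a new entry
```

Only the triples are stored, never the points themselves, which is how memory use stays bounded.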
Algorithm for BIRCH
DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Ex1: we want to classify the houses along a river in a spatial photo. Ex2:
Definition for DBSCAN. Eps-neighborhood of a point: the Eps-neighborhood of a point p, denoted by $N_{Eps}(p)$, is defined by $N_{Eps}(p) = \{\, q \in D \mid dist(p, q) \le Eps \,\}$. Minimum number (MinPts): MinPts is the minimum number of data points in any cluster; it is the number of points required in the Eps-neighborhood of a core point.
Definition for DBSCAN. Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if 1) $p \in N_{Eps}(q)$ and 2) $|N_{Eps}(q)| \ge MinPts$.
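The two definitions translate almost directly into code. This sketch assumes Euclidean distance for dist and points stored as tuples; the function names are chosen here for illustration.

```python
import math

def eps_neighborhood(data, p, eps):
    """All points q in the data set with dist(p, q) <= eps (p itself included)."""
    return [q for q in data if math.dist(p, q) <= eps]

def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p lies in the Eps-neighborhood of q and q has at least min_pts neighbors."""
    neighborhood = eps_neighborhood(data, q, eps)
    return p in neighborhood and len(neighborhood) >= min_pts
```

The relation is not symmetric: a point on the edge of a cluster can be directly density-reachable from a dense point without the reverse being true.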
Definition for DBSCAN. Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$.
Algorithm of DBSCAN
Input: D = {t_1, t_2, …, t_n}, MinPts, Eps
Output: K = {K_1, K_2, …, K_k}
k = 0;
for i = 1 to n do
  if t_i is not in a cluster then
    X = {t_j | t_j is density-reachable from t_i};
    if X is a valid cluster then
      k = k + 1;
      K_k = X;
    end if
  end if
end for
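A runnable version of this loop might look like the sketch below. It assumes Euclidean distance, a brute-force neighborhood search, and that a "valid cluster" is one grown from a point with at least MinPts neighbors; the label -1 for noise is a convention chosen for this example.

```python
import math

def dbscan(data, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    NOISE = -1
    labels = [None] * len(data)

    def neighbors(i):
        # Eps-neighborhood of point i, as indices (brute force, O(n) per call)
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    k = 0
    for i in range(len(data)):
        if labels[i] is not None:          # t_i is already in a cluster or noise
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: no valid cluster here
            labels[i] = NOISE              # may still be claimed later as a border point
            continue
        labels[i] = k
        queue = [j for j in seeds if j != i]
        while queue:                       # collect everything density-reachable from t_i
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = k              # border point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = k
            nj = neighbors(j)
            if len(nj) >= min_pts:         # only core points extend the chain
                queue.extend(nj)
        k += 1
    return labels
```

With the brute-force search the whole run is O(n*n); a spatial index such as a k-d tree is the usual way to speed up the neighborhood queries.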
Clustering: Large Databases in data mining

  • 1. Chapter 12, Clustering: Large Databases. Written by Farial Shahnaz. Presented by Zhao Xinyou. Data Mining Technology.
  • 2. Contents: Introduction; the ideas behind three major approaches to scalable clustering {Divide-and-Conquer, Incremental, Parallel}; three approaches for scalable clustering {BIRCH, DBSCAN, CURE}; Application.
  • 3. Introduction: the common method. The common method for clustering visits all the data in the database and analyzes it. Time: the computational complexity is O(n*n). Memory: all the data must be loaded into main memory. The number of records can be huge (millions). [Figure: time/memory versus data size] (p. 133)
  • 4. Motivation: clustering for large databases. Which method takes the cost from f(x): O(n*n) down to f(x): O(n)? [Figures: time/memory versus data size for O(n*n) and for O(n)] (p. 134)
  • 5. Requirements: clustering for large databases. No more (preferably less) than one scan of the database. Process each record only once. Work with limited memory. Can suspend, stop, and resume. Can update the results when data is inserted or removed. Can use different techniques to scan the database. During execution, the method should provide its status and the 'best' answer so far. [Figures: time/memory versus data size for O(n*n) and for O(n)] (p. 134)
  • 6. Major approaches to scalable clustering: the Divide-and-Conquer approach, the parallel clustering approach, and the incremental clustering approach. (p. 135)
  • 7. Divide-and-Conquer approach. Definition: Divide-and-conquer is a problem-solving approach in which we divide the problem into sub-problems, recursively conquer (solve) each sub-problem, and then combine the sub-problem solutions to obtain a solution to the original problem. Key assumptions: 1. Problem solutions can be constructed from sub-problem solutions. 2. Sub-problem solutions are independent of one another. Example: 9*9 Sudoku. (p. 135)
  • 8. Parallel clustering approach. Idea: divide the data into small sets and run each small set on a different machine (derived from Divide-and-Conquer). (pp. 136-137)
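The parallel idea on slide 8 could be sketched roughly as follows. This is only an illustration under stated assumptions: threads stand in for separate machines, the data is one-dimensional, and the per-chunk work is a single center-update step in the style of k-means, which the slides do not prescribe. The names `partial_sums` and `parallel_update` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(chunk, centers):
    """For one chunk, return [count, sum] of the 1-D points nearest to each center."""
    stats = [[0, 0.0] for _ in centers]
    for x in chunk:
        nearest = min(range(len(centers)), key=lambda c: abs(x - centers[c]))
        stats[nearest][0] += 1
        stats[nearest][1] += x
    return stats


def parallel_update(data, centers, workers=2):
    """Split the data into chunks, summarize each chunk in its own worker,
    then combine the summaries into updated centers."""
    size = max(1, -(-len(data) // workers))  # ceiling division, at least 1
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda ch: partial_sums(ch, centers), chunks))
    new_centers = []
    for c, old in enumerate(centers):
        count = sum(r[c][0] for r in results)
        total = sum(r[c][1] for r in results)
        # A center that attracted no points keeps its old position.
        new_centers.append(total / count if count else old)
    return new_centers
```

Each worker only ever sees its own chunk, and only the small per-center summaries are combined, which is the point of the approach.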
  • 9. Explanation of Divide-and-Conquer: the divide step is itself an algorithm, and so is the conquer step.
  • 10. Applications: sorting (quick-sort and merge sort), fast Fourier transforms, the Tower of Hanoi puzzle, matrix multiplication, … (p. 135)
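Merge sort, one of the applications listed on slide 10, shows the three phases in a few lines. The function name `merge_sort` is chosen here for illustration.

```python
def merge_sort(items):
    """Sort a list with the divide / conquer / combine pattern."""
    if len(items) <= 1:              # a list of 0 or 1 items is already sorted
        return list(items)
    mid = len(items) // 2            # divide: split the problem in two
    left = merge_sort(items[:mid])   # conquer: solve each half recursively
    right = merge_sort(items[mid:])
    merged = []                      # combine: merge the two sorted halves
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

The two halves are sorted independently of one another, which matches the second key assumption on slide 7.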
  • 11. CURE: Divide-and-Conquer. 1. Get the size n of the data set D and partition D into p groups of n/p elements each. 2. Cluster each group p_i into k groups using a heap and a k-d tree. 3. Delete unrelated nodes from the heap and the k-d tree. 4. Cluster the partial clusters to get the final clusters. (pp. 140-141)
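The partition, partial-cluster, final-cluster flow of slide 11 might be sketched as below. This is a deliberately simplified stand-in and not CURE itself: it merges clusters by nearest centroid, and it omits the heap, the k-d tree, and the node-deletion step. All function names and the parameters `partial_k` and `final_k` are assumptions made for the example.

```python
def centroid(cluster):
    """Mean point of a non-empty list of equal-length tuples."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_nearest(clusters, k):
    """Repeatedly merge the two clusters with the closest centroids until k remain."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


def partitioned_clustering(points, p, partial_k, final_k):
    """Step 1: partition into p groups. Step 2: cluster each group.
    Step 4: cluster the partial clusters. (Step 3 is omitted in this sketch.)"""
    size = max(1, -(-len(points) // p))  # ceiling division
    partial = []
    for start in range(0, len(points), size):
        group = points[start:start + size]
        partial.extend(merge_nearest([[pt] for pt in group], partial_k))
    return merge_nearest(partial, final_k)
```

The final pass works on the partial clusters rather than on the raw points, so it handles far fewer items than a single pass over the whole data set would.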
  • 13. k-d tree: technically, the letter k refers to the number of dimensions. [Figure: 3-dimensional k-d tree] (pp. 140-141)
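A minimal k-d tree can be built by cycling through the k dimensions and splitting at the median on each level. The sketch below is one common construction, not necessarily the one used in the book; `Node`, `build_kdtree`, and `nearest` are names chosen for the example.

```python
from collections import namedtuple

Node = namedtuple("Node", ["point", "axis", "left", "right"])


def build_kdtree(points, depth=0):
    """Recursively split on axis (depth mod k) at the median point."""
    if not points:
        return None
    k = len(points[0])               # k = number of dimensions
    axis = depth % k
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return Node(pts[mid], axis,
                build_kdtree(pts[:mid], depth + 1),
                build_kdtree(pts[mid + 1:], depth + 1))


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, target, best=None):
    """Return the stored point closest to target."""
    if node is None:
        return best
    if best is None or sq_dist(node.point, target) < sq_dist(best, target):
        best = node.point
    diff = target[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, best)
    return best
```

With three-dimensional points, as in the slide's figure, the split axis cycles x, y, z, x, and so on down the tree.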
  • 15. CURE: Divide-and-Conquer. [Figure: find the nearest clusters, merge; find the nearest clusters, merge] (pp. 140-141)
  • 16. Incremental clustering approach. Idea: scan all the data in the database and compare each record with the existing clusters; if a similar cluster is found, assign the record to that cluster, otherwise create a new cluster. Continue until no data is left. Steps:
        S = {};  // the set of clusters starts empty
        do {
          read one record d;
          r = find_similarity_cluster(d, S);
          if (r exists)
            assign d to the cluster r;
          else
            Add_cluster(d, S);
        } until (no record in database);
    (pp. 135-136)
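The steps on slide 16 translate almost directly into Python. The slide leaves the similarity test open, so this sketch assumes one: a record joins the nearest cluster when its Euclidean distance to that cluster's centroid is at most a `threshold` parameter. Both the parameter and the dictionary layout are assumptions.

```python
import math


def incremental_clustering(records, threshold):
    """Single pass: each record joins the nearest cluster or starts a new one."""
    clusters = []  # each cluster: {"centroid": tuple, "members": list}
    for d in records:
        # Stand-in for find_similarity_cluster(d, S): the nearest centroid, if any.
        nearest = min(clusters,
                      key=lambda c: math.dist(d, c["centroid"]),
                      default=None)
        if nearest is not None and math.dist(d, nearest["centroid"]) <= threshold:
            nearest["members"].append(d)
            n = len(nearest["members"])
            nearest["centroid"] = tuple(
                sum(m[i] for m in nearest["members"]) / n for i in range(len(d)))
        else:
            clusters.append({"centroid": tuple(d), "members": [d]})
    return clusters
```

Each record is read once, as slide 5 requires, but the result depends on the order in which the records arrive.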
  • 17. Applications of the incremental clustering approach. BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies. DBSCAN: Density-Based Spatial Clustering of Applications with Noise.
  • 18. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). Based on a distance measurement, compute the similarity between a record and a cluster and produce the clusters. [Figure: inner-cluster and among-cluster distances] (pp. 137-138)
  • 19. BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies). [Figure: inner-cluster and among-cluster distances] (pp. 137-138)
  • 20. Related definitions. Cluster: {x_i}, where i = 1, 2, …, N. CF (Clustering Feature): a triple (N, LS, SS), where N is the number of data points, LS is the linear sum of the N data points, and SS is their square sum.
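A clustering feature is useful because two CFs can be added component by component, and the centroid and radius can be recovered from the triple alone. The sketch below assumes SS is the scalar sum of squared norms; the class and method names are chosen for the example.

```python
import math


class CF:
    """Clustering feature (N, LS, SS) of a set of points."""

    def __init__(self, n, ls, ss):
        self.n = n      # number of data points
        self.ls = ls    # linear sum, one component per dimension
        self.ss = ss    # square sum (assumed scalar: sum of squared norms)

    @classmethod
    def from_point(cls, x):
        return cls(1, tuple(x), sum(v * v for v in x))

    def __add__(self, other):
        # Merging two clusters only requires adding their triples.
        return CF(self.n + other.n,
                  tuple(a + b for a, b in zip(self.ls, other.ls)),
                  self.ss + other.ss)

    def centroid(self):
        return tuple(v / self.n for v in self.ls)

    def radius(self):
        # sqrt(SS/N - ||LS/N||^2): root-mean-square distance to the centroid.
        sq = self.ss / self.n - sum(c * c for c in self.centroid())
        return math.sqrt(max(sq, 0.0))
```

For example, the points (1, 2) and (3, 4) give the triple (2, (4, 6), 30), with centroid (2, 3) and radius sqrt(2), without storing either point.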
  • 21. Related definitions. CF tree = (B, T). B = (CF_i, child_i) for an internal node; B = (CF_i, prev, next) for an external (leaf) node. T: the threshold for every leaf node; the mean distance D within a leaf entry must satisfy D < T.
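The threshold test D < T can be illustrated without building the tree. The sketch below keeps a flat list of leaf entries, each a plain (N, LS, SS) tuple, and takes D to be the average pairwise distance inside an entry; both choices are assumptions, since the slide does not define D precisely. Node splitting and the branching factor are left out, and the function names are hypothetical.

```python
import math


def cf_add_point(cf, x):
    """Return the (N, LS, SS) triple after adding point x."""
    n, ls, ss = cf
    return (n + 1, tuple(a + b for a, b in zip(ls, x)), ss + sum(v * v for v in x))


def cf_diameter(cf):
    """Root of the mean squared pairwise distance, computed from the triple alone."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    sq = (2 * n * ss - 2 * sum(v * v for v in ls)) / (n * (n - 1))
    return math.sqrt(max(sq, 0.0))


def insert_point(leaf_entries, x, threshold):
    """Absorb x into the closest entry if D stays below threshold, else add an entry."""
    if leaf_entries:
        def dist_to_centroid(cf):
            n, ls, _ = cf
            return math.dist(x, tuple(v / n for v in ls))
        idx = min(range(len(leaf_entries)),
                  key=lambda i: dist_to_centroid(leaf_entries[i]))
        candidate = cf_add_point(leaf_entries[idx], x)
        if cf_diameter(candidate) < threshold:
            leaf_entries[idx] = candidate
            return
    leaf_entries.append((1, tuple(x), sum(v * v for v in x)))
```

A smaller T yields more, tighter leaf entries; a larger T yields fewer, looser ones.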
  • 23. DBSCAN: Density-Based Spatial Clustering of Applications with Noise. Ex. 1: we want to classify the houses along a river in a spatial photo. Ex. 2:
  • 24. Definitions for DBSCAN. Eps-neighborhood of a point: the Eps-neighborhood of a point p, denoted N_Eps(p), is defined by N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}. Minimum number (MinPts): MinPts is the minimum number of data points in any cluster.
  • 25. Definitions for DBSCAN. Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if (1) p ∈ N_Eps(q) and (2) |N_Eps(q)| ≥ MinPts.
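The two definitions on slides 24 and 25 can be written as small predicates. The sketch assumes Euclidean distance and points stored as tuples; the function names are chosen here.

```python
import math


def eps_neighborhood(data, p, eps):
    """N_Eps(p): every point of data within distance eps of p (p itself included)."""
    return [q for q in data if math.dist(p, q) <= eps]


def directly_density_reachable(data, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and N_Eps(q) holds at least min_pts points."""
    return (math.dist(p, q) <= eps
            and len(eps_neighborhood(data, q, eps)) >= min_pts)
```

The relation is not symmetric: with the points (0, 0), (0, 1), (1, 0), (5, 5), Eps = 1 and MinPts = 3, the point (0, 1) is directly density-reachable from (0, 0), but not the other way round, because the neighborhood of (0, 1) holds only two points.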
  • 26. Definitions for DBSCAN. Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
  • 27. Definitions for DBSCAN. Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points p_1, p_2, …, p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
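Density-reachability is the chained form of the previous definition, so it can be checked with a breadth-first search that starts at q and only expands points whose neighborhood holds at least MinPts points. The sketch assumes Euclidean distance and hashable tuple points; `density_reachable` is a name chosen for the example.

```python
import math
from collections import deque


def density_reachable(data, p, q, eps, min_pts):
    """True if a chain of directly density-reachable steps leads from q to p."""
    def neighbors(x):
        return [y for y in data if math.dist(x, y) <= eps]

    visited = {q}
    queue = deque([q])
    while queue:
        current = queue.popleft()
        nbrs = neighbors(current)
        if len(nbrs) < min_pts:
            continue  # nothing is directly density-reachable from this point
        for y in nbrs:
            if y == p:
                return True
            if y not in visited:
                visited.add(y)
                queue.append(y)
    return False
```

Like the direct relation, this one is not symmetric: a point on the edge of a dense region can be reached from inside it, but no chain can start from that edge point.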
  • 28. Algorithm of DBSCAN.
        Input:  D = {t_1, t_2, …, t_n}, MinPts, Eps
        Output: K = {K_1, K_2, …, K_k}
        k = 0;
        for i = 1 to n do
          if t_i is not in a cluster then
            X = {t_j | t_j is density-reachable from t_i};
            if X is a valid cluster then
              k = k + 1;
              K_k = X;
            end if
          end if
        end for
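The algorithm on slide 28 could be implemented along the following lines. This is a sketch under assumptions: Euclidean distance, a brute-force neighborhood search, and "valid cluster" read as "the starting point has at least MinPts points in its Eps-neighborhood". It returns one label per point rather than the sets K_1, …, K_k, with -1 marking noise.

```python
import math


def dbscan(data, eps, min_pts):
    """Return a label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]

    labels = [None] * len(data)   # None means "not visited yet"
    k = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue              # t_i is already in a cluster or marked as noise
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1        # not dense enough to start a cluster
            continue
        labels[i] = k
        frontier = list(nbrs)
        while frontier:           # collect everything density-reachable from t_i
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = k     # former noise turns out to be a border point
            if labels[j] is not None:
                continue
            labels[j] = k
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                frontier.extend(j_nbrs)
        k += 1
    return labels
```

The brute-force `neighbors` makes this O(n*n); a spatial index such as the k-d tree from slide 13 is the usual way to speed up the neighborhood queries.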