0% found this document useful (0 votes)

16 views

Unit 5 DM

Uploaded by

apkbala107

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views

Unit 5 DM

Uploaded by

apkbala107

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 47

Unit 5

 Cluster Analysis: Basic Concepts

 Partitioning Methods
 Hierarchical Methods

1
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)

 Finding similarities between data according to the

characteristics found in the data and grouping similar

data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

2
Clustering for Data Understanding
and Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
 Climate: understanding earth climate, find patterns of atmospheric
and ocean
 Economic Science: market resarch
3
Clustering as a Preprocessing Tool
(Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster

4
Quality: What Is Good
Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 Its ability to discover some or all of the hidden patterns

5
Measure the Quality of
Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”

The answer is typically highly subjective
6
Considerations for Cluster
Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidian, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering)
7
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples

 Ability to deal with different types of attributes

 Numerical, binary, categorical, ordinal, linked, and mixture of

these
 Constraint-based clustering
 User may give inputs on constraints

 Use domain knowledge to determine input parameters

 Interpretability and usability

 Others
 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

8
Major Clustering Approaches
(I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion, e.g., minimizing the sum of square errors

 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)

using some criterion

 Typical methods: Diana, Agnes, BIRCH, CAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSACN, OPTICS, DenClue

 Grid-based approach:
 based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

9
Major Clustering Approaches
(II)
 Model-based:
 A model is hypothesized for each of the clusters and tries to find

the best fit of that model to each other

 Typical methods: EM, SOM, COBWEB

 Frequent pattern-based:
 Based on the analysis of frequent patterns

 Typical methods: p-Cluster

 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific

constraints
 Typical methods: COD (obstacles), constrained clustering

 Link-based clustering:
 Objects are often linked together in various ways

 Massive links can be used to cluster objects: SimRank, LinkClus

10
Unit 5

 Cluster Analysis: Basic Concepts

 Partitioning Methods
 Hierarchical Methods

11
Partitioning Algorithms: Basic
Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)

E  ik1 pCi ( p  ci ) 2
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
12
The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in four

steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when the assignment does
not change

13
An Example of K-Means Clustering

K=2

Arbitrarily Update
partition the
objects cluster
into k centroids
groups
The initial data Loop if
set Reassign objects
needed
 Partition objects into k nonempty
subsets
 Repeat
 Compute centroid (i.e., mean Update
the
point) for each partition cluster
 Assign each object to the centroids
cluster of its nearest centroid
 Until no change
14
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is

# iterations. Normally, k, t << n.

Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
 Comment: Often terminates at a local optimal.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Using the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of
data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k (see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes 15
Variations of the K-Means Method

 Most of the variants of the k-means which differ in

 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method

16
What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers !

 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the object in a cluster
as a reference point, medoids can be used, which is the most
centrally located object in a cluster

10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

17
PAM: A Typical K-Medoids Algorithm
Total Cost = 20
10 10 10

9 9 9

8 8 8

7 7 7

6
Arbitrar 6
Assign 6

5
y 5 each 5

4 choose 4 remaini 4

3
k object 3
ng 3

2
as 2
object 2

1 1
initial to
1

0 0 0
0 1 2 3 4 5 6 7 8 9 10
medoid 0 1 2 3 4 5 6 7 8 9 10
nearest 0 1 2 3 4 5 6 7 8 9 10

s medoid
K=2 s Randomly select a
Total Cost = 26 nonmedoid
object,Oramdom
10 10

Do loop 9

8
Compute
9

8
Swapping 7 total cost 7

Until no O and 6
of 6

Oramdom
change
5 5

4
swapping 4

If quality is 3

2
3

improved. 1 1

0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

18
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters

 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling
19
Unit 5

 Cluster Analysis: Basic Concepts

 Partitioning Methods
 Hierarchical Methods

20
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
21
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
10 10 10

9 9 9

8 8 8

7 7 7

6 6 6

5 5 5

4 4 4

3 3 3

2 2 2

1 1 1

0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10

22
Dendrogram: Shows How Clusters are
Merged
Decompose data objects into a several levels of nested
partitioning (tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting

the dendrogram at the desired level, then each
connected component forms a cluster

23
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)

 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

10 10
10

9 9
9
8 8
8

7 7
7
6 6
6

5 5
5
4 4
4

3 3
3
2 2
2

1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10

24
Distance between X X

Clusters
 Single link: smallest distance between an element in one cluster and
an element in the other, i.e., dist(K i, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(K i, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(K i, Kj) = avg(tip, tjq)

 Centroid: distance between the centroids of two clusters, i.e., dist(K i,

Kj) = dist(Ci, Cj)

 Medoid: distance between the medoids of two clusters, i.e., dist(K i,

Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
25
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster iN1(t )
Cm  N ip

 Radius: square root of average distance from any point

of the cluster to its centroid  N (t  cm ) 2
Rm  i 1 ip
N
 Diameter: square root of average mean squared
distance between all pairs of points in the cluster
 N  N (t  t ) 2
Dm  i 1 i 1 ip iq
N ( N  1)

26
Extensions to Hierarchical Clustering
 Major weakness of agglomerative clustering methods

Can never undo what was done previously

Do not scale well: time complexity of at least O(n2), where
n is the number of total objects
 Integration of hierarchical & distance-based clustering

BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters

CHAMELEON (1999): hierarchical clustering using
dynamic modeling
27
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
 Zhang, Ramakrishnan & Livny, SIGMOD’96
 Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
 Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
 Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
 Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
 Weakness: handles only numeric data, and sensitive to the order of the
data record

28
Clustering Feature Vector in
BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: Number of data points
N
LS: linear sum of N points:  X i
i 1

CF = (5, (16,30),(54,190))
SS: square sum of N points
N 2 10
(3,4)
 Xi
9

(2,6)
8

i 1
7

(4,5)
5

1
(4,7)
(3,8)
0
0 1 2 3 4 5 6 7 8 9 10

29
CF-Tree in BIRCH
 Clustering feature:
 Summary of the statistics for a given subcluster: the 0-th, 1st,

and 2nd moments of the subcluster from the statistical point

of view
 Registers crucial measurements for computing cluster and

utilizes storage efficiently

A CF tree is a height-balanced tree that stores the clustering
features for a hierarchical clustering
 A nonleaf node in a tree has descendants or “children”

 The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters

 Branching factor: max # of children

 Threshold: max diameter of sub-clusters stored at the leaf

nodes 30
The CF Tree Structure
Root

B=7 CF1 CF2 CF3 CF6

L=6 child1 child2 child3 child6

Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5

Leaf node Leaf node

prev CF1 CF2 CF6 next prev CF1 CF2 CF4 next

31
The Birch Algorithm
 Cluster Diameter 1 2
 (x  x )
n( n  1) i j

 For each point in the input

 Find closest leaf entry

 Add point to leaf entry and update CF

 If entry diameter > max_diameter, then split leaf, and possibly

parents
 Algorithm is O(n)
 Concerns
 Sensitive to insertion order of data points

 Since we fix the size of leaf nodes, so clusters may not be so

natural
 Clusters tend to be spherical given the radius and diameter

measures
32
CHAMELEON: Hierarchical Clustering
Using Dynamic Modeling (1999)
 CHAMELEON: G. Karypis, E. H. Han, and V. Kumar, 1999
 Measures the similarity based on a dynamic model
 Two clusters are merged only if the interconnectivity
and closeness (proximity) between two clusters are
high relative to the internal interconnectivity of the
clusters and closeness of items within the clusters
 Graph-based, and a two-phase algorithm
1. Use a graph-partitioning algorithm: cluster objects into
a large number of relatively small sub-clusters
2. Use an agglomerative hierarchical clustering algorithm:
find the genuine clusters by repeatedly combining
these sub-clusters
33
Overall Framework of CHAMELEON

Construct (K-NN)
Sparse Graph Partition the Graph

Data Set

K-NN Graph
P and q are connected if Merge Partition
q is among the top k
closest neighbors of p
Relative interconnectivity:
connectivity of c1 and c2
over internal connectivity
Final Clusters
Relative closeness:
closeness of c1 and c2 over
internal closeness 34
CHAMELEON (Clustering Complex
Objects)

35
Probabilistic Hierarchical Clustering
 Algorithmic hierarchical clustering
 Nontrivial to choose a good distance measure
 Hard to handle missing attribute values
 Optimization goal not clear: heuristic, local search
 Probabilistic hierarchical clustering
 Use probabilistic models to measure distances between clusters
 Generative model: Regard the set of data objects to be clustered
as a sample of the underlying data generation mechanism to be
analyzed
 Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
 In practice, assume the generative models adopt common distributions
functions, e.g., Gaussian distribution or Bernoulli distribution, governed
by parameters
36
Generative Model
 Given a set of 1-D points X = {x1, …, xn} for clustering
analysis & assuming they are generated by a
Gaussian distribution:

 The probability that a point xi ∈ X is generated by the

model

 The likelihood that X is generated by the model:

 The task of learning the generative model: find the

the maximum
parameters μ and σ2 such that likelihood

37
A Probabilistic Hierarchical Clustering
Algorithm

 For a set of objects partitioned into m clusters C1, . . . ,Cm, the quality can
be measured by,

where P() is the maximum likelihood

 Distance between clusters C1 and C2:
 Algorithm: Progressively merge points and clusters
Input: D = {o1, ..., on}: a data set containing n objects
Output: A hierarchy of clusters
Method
Create a cluster for each object Ci = {oi}, 1 ≤ i ≤ n;
For i = 1 to n {
Find pair of clusters Ci and Cj such that
Ci,Cj = argmaxi ≠ j {log (P(Ci∪Cj )/(P(Ci)P(Cj ))};
If log (P(Ci∪Cj )/(P(Ci)P(Cj )) > 0 then merge Ci and Cj }
38
Data Mining Applications
 Data mining: A young discipline with broad and diverse
applications
 There still exists a nontrivial gap between generic data

mining methods and effective and scalable data mining

tools for domain-specific applications
 Some application domains (briefly discussed here)
 Data Mining for Financial data analysis

 Data Mining for Retail and Telecommunication

Industries
 Data Mining in Science and Engineering

 Data Mining for Intrusion Detection and Prevention

 Data Mining and Recommender Systems

39
Data Mining for Financial Data Analysis (I)
 Financial data collected in banks and financial institutions
are often relatively complete, reliable, and of high quality
 Design and construction of data warehouses for
multidimensional data analysis and data mining
 View the debt and revenue changes by month, by
region, by sector, and by other factors
 Access statistical information such as max, min, total,
average, trend, etc.
 Loan payment prediction/consumer credit policy analysis
 feature selection and attribute relevance ranking

 Loan payment performance

 Consumer credit rating

40
Data Mining for Financial Data
Analysis (II)
 Classification and clustering of customers for targeted
marketing
 multidimensional segmentation by nearest-neighbor,

classification, decision trees, etc. to identify customer

groups or associate a new customer to an appropriate
customer group
 Detection of money laundering and other financial crimes
 integration of from multiple DBs (e.g., bank

transactions, federal/state crime history DBs)

 Tools: data visualization, linkage analysis,

classification, clustering tools, outlier analysis, and

sequential pattern analysis tools (find unusual access
sequences)
41
Data Mining for Retail & Telcomm. Industries (I)

 Retail industry: huge amounts of data on sales, customer

shopping history, e-commerce, etc.
 Applications of retail data mining
 Identify customer buying behaviors
 Discover customer shopping patterns and trends
 Improve the quality of customer service
 Achieve better customer retention and satisfaction
 Enhance goods consumption ratios
 Design more effective goods transportation and
distribution policies
 Telcomm. and many other industries: Share many similar
goals and expectations of retail data mining
42
Data Mining Practice for Retail
Industry
 Design and construction of data warehouses
 Multidimensional analysis of sales, customers, products, time, and
region
 Analysis of the effectiveness of sales campaigns
 Customer retention: Analysis of customer loyalty
 Use customer loyalty card information to register sequences of
purchases of particular customers
 Use sequential pattern mining to investigate changes in customer
consumption or loyalty
 Suggest adjustments on the pricing and variety of goods
 Product recommendation and cross-reference of items
 Fraudulent analysis and the identification of usual patterns
 Use of visualization tools in data analysis
43
Data Mining in Science and Engineering
 Data warehouses and data preprocessing

Resolving inconsistencies or incompatible data collected in
diverse environments and different periods (e.g. eco-system
studies)
 Mining complex data types

Spatiotemporal, biological, diverse semantics and relationships
 Graph-based and network-based mining

Links, relationships, data flow, etc.
 Visualization tools and domain-specific knowledge
 Other issues

Data mining in social sciences and social studies: text and social
media

Data mining in computer science: monitoring systems, software
bugs, network intrusion
44
Data Mining for Intrusion Detection and
Prevention
 Majority of intrusion detection and prevention systems use
 Signature-based detection: use signatures, attack patterns that are
preconfigured and predetermined by domain experts
 Anomaly-based detection: build profiles (models of normal
behavior) and detect those that are substantially deviate from the
profiles
 What data mining can help
 New data mining algorithms for intrusion detection
 Association, correlation, and discriminative pattern analysis help
select and build discriminative classifiers
 Analysis of stream data: outlier detection, clustering, model shifting
 Distributed data mining
 Visualization and querying tools

45
Data Mining and Recommender Systems
 Recommender systems: Personalization, making product
recommendations that are likely to be of interest to a user
 Approaches: Content-based, collaborative, or their hybrid
 Content-based: Recommends items that are similar to items the
user preferred or queried in the past
 Collaborative filtering: Consider a user's social environment,
opinions of other customers who have similar tastes or preferences
 Data mining and recommender systems
 Users C × items S: extract from known to unknown ratings to
predict user-item combinations
 Memory-based method often uses k-nearest neighbor approach
 Model-based method uses a collection of ratings to learn a model
(e.g., probabilistic models, clustering, Bayesian networks, etc.)
 Hybrid approaches integrate both to improve performance (e.g.,
using ensemble)
46
Trends of Data Mining
 Application exploration: Dealing with application-specific problems
 Scalable and interactive data mining methods
 Integration of data mining with Web search engines, database
systems, data warehouse systems and cloud computing systems
 Mining social and information networks
 Mining spatiotemporal, moving objects and cyber-physical systems
 Mining multimedia, text and web data
 Mining biological and biomedical data
 Data mining with software engineering and system engineering
 Visual and audio data mining
 Distributed data mining and real-time data stream mining
 Privacy protection and information security in data mining

Class 1: Statistics in Business Decisions Introduction To Probability Conditional Probability Independence
No ratings yet
Class 1: Statistics in Business Decisions Introduction To Probability Conditional Probability Independence
15 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
101 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
101 pages
10ClusBasic
No ratings yet
10ClusBasic
66 pages
10ClusBasic (1)
No ratings yet
10ClusBasic (1)
31 pages
10ClusBasic
No ratings yet
10ClusBasic
95 pages
10ClusBasic Editted v1
No ratings yet
10ClusBasic Editted v1
41 pages
10clustering - Han and Kamber
No ratings yet
10clustering - Han and Kamber
93 pages
Concepts and Techniques: Data Mining
No ratings yet
Concepts and Techniques: Data Mining
50 pages
05 Clustering
No ratings yet
05 Clustering
96 pages
Clustering For Big Data Analytics
No ratings yet
Clustering For Big Data Analytics
28 pages
Data Mining Clustering
No ratings yet
Data Mining Clustering
76 pages
Cluster-Analysis
No ratings yet
Cluster-Analysis
89 pages
Clustering
No ratings yet
Clustering
104 pages
Data Mining: I Gede Mahendra Darmawiguna
No ratings yet
Data Mining: I Gede Mahendra Darmawiguna
25 pages
Cluster Analysis
No ratings yet
Cluster Analysis
21 pages
UNIT-6
No ratings yet
UNIT-6
102 pages
5 Algoritma Klastering
No ratings yet
5 Algoritma Klastering
85 pages
Concepts and Techniques: - Chapter 10
No ratings yet
Concepts and Techniques: - Chapter 10
97 pages
Lecture 16
No ratings yet
Lecture 16
29 pages
Clustering
No ratings yet
Clustering
80 pages
Clustering-Part 1
No ratings yet
Clustering-Part 1
35 pages
Partitioning Methods
100% (1)
Partitioning Methods
3 pages
3
No ratings yet
3
23 pages
DM Lecture 06
No ratings yet
DM Lecture 06
32 pages
Unit-5 DM
No ratings yet
Unit-5 DM
11 pages
Clustering Algorithm
No ratings yet
Clustering Algorithm
47 pages
Clustering
No ratings yet
Clustering
7 pages
Cluster Analysis Clustering
No ratings yet
Cluster Analysis Clustering
6 pages
Clustering
No ratings yet
Clustering
26 pages
Clustering
No ratings yet
Clustering
32 pages
کتاب چهارم بارگزاری شده
No ratings yet
کتاب چهارم بارگزاری شده
63 pages
Data Mining-Model Based Clustering
No ratings yet
Data Mining-Model Based Clustering
8 pages
DMDW Qa-5
No ratings yet
DMDW Qa-5
7 pages
APznzaaxpWzYylHJmwXGn2puBz7GP1usZYf9XTi7oqfrrKnFV9DMMfVzPCu6yO0UOnr_XFt1gJv4TE1ITR6850n9k65DydQUgoRlylNdn2acWAu6KNonoO8z7QULN6BlLxY_B-JhKko0tJ3K77woLz26oTaAv1YNcIuMcOSqInmgeCUzpUxjKC9VqnT_lhE7vDyWp_LQQjGTRnamgIC6ya3nlwi7mjjE9EUIiO2sUhjkD6RV
No ratings yet
APznzaaxpWzYylHJmwXGn2puBz7GP1usZYf9XTi7oqfrrKnFV9DMMfVzPCu6yO0UOnr_XFt1gJv4TE1ITR6850n9k65DydQUgoRlylNdn2acWAu6KNonoO8z7QULN6BlLxY_B-JhKko0tJ3K77woLz26oTaAv1YNcIuMcOSqInmgeCUzpUxjKC9VqnT_lhE7vDyWp_LQQjGTRnamgIC6ya3nlwi7mjjE9EUIiO2sUhjkD6RV
38 pages
Grouping
No ratings yet
Grouping
98 pages
Data Mining - Clustering
No ratings yet
Data Mining - Clustering
90 pages
Clustering K Means Agnes
No ratings yet
Clustering K Means Agnes
36 pages
DMW UNIT 5
No ratings yet
DMW UNIT 5
10 pages
L 12 Flat Cluster
No ratings yet
L 12 Flat Cluster
26 pages
ML Lecture06 Unsupervised Learning
No ratings yet
ML Lecture06 Unsupervised Learning
87 pages
Experiment 4 1
No ratings yet
Experiment 4 1
4 pages
Pam Clustering Technique
No ratings yet
Pam Clustering Technique
10 pages
K Means Clustering
No ratings yet
K Means Clustering
22 pages
DMW Unit-V
No ratings yet
DMW Unit-V
47 pages
Unit 4 Introduction to Algorithm
No ratings yet
Unit 4 Introduction to Algorithm
10 pages
Chapter 7
100% (1)
Chapter 7
31 pages
Data Mining Unit-Iv
No ratings yet
Data Mining Unit-Iv
34 pages
Clustering Analysis (Unsupervised)
No ratings yet
Clustering Analysis (Unsupervised)
6 pages
ML4 Unsupervised Learning
No ratings yet
ML4 Unsupervised Learning
60 pages
DWDM Unit Vi
No ratings yet
DWDM Unit Vi
23 pages
Unit5 Clustering
No ratings yet
Unit5 Clustering
74 pages
Assignment No 5 K-Means Clustering
No ratings yet
Assignment No 5 K-Means Clustering
2 pages
Unsupervised Learning
No ratings yet
Unsupervised Learning
83 pages
Cluster Analysis: Dr. Bernard Chen Ph.D. Assistant Professor
No ratings yet
Cluster Analysis: Dr. Bernard Chen Ph.D. Assistant Professor
43 pages
UCS 401 Unit-lll Lect 13 Distance Based Models Neighbours and Examples
No ratings yet
UCS 401 Unit-lll Lect 13 Distance Based Models Neighbours and Examples
20 pages
Cluster Analysis
No ratings yet
Cluster Analysis
76 pages
SPK Clustering
No ratings yet
SPK Clustering
35 pages
Unit 5 - Cluster Analysis
No ratings yet
Unit 5 - Cluster Analysis
14 pages
Chapter 5 Clustering
No ratings yet
Chapter 5 Clustering
40 pages
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet
Design and Implementation of Adaptive Filtering Algorithm For Noise Cancellation of Speech Signal On Fpga For Hearing Aid
No ratings yet
Design and Implementation of Adaptive Filtering Algorithm For Noise Cancellation of Speech Signal On Fpga For Hearing Aid
27 pages
Capm From Ccapm
No ratings yet
Capm From Ccapm
3 pages
Class Test 6 Questions With Answers
No ratings yet
Class Test 6 Questions With Answers
3 pages
1breohbaq 289967
No ratings yet
1breohbaq 289967
94 pages
Lecture 1
No ratings yet
Lecture 1
114 pages
Combinatorial Optimization - Chekuri (2022)
No ratings yet
Combinatorial Optimization - Chekuri (2022)
255 pages
Data Analytics: Relation Analysis
No ratings yet
Data Analytics: Relation Analysis
88 pages
Deep Learning Unit 1
No ratings yet
Deep Learning Unit 1
32 pages
2A5 Linear Programming - Simplex Method - Maximization Case 3RD FILE Try This SC
No ratings yet
2A5 Linear Programming - Simplex Method - Maximization Case 3RD FILE Try This SC
21 pages
Artificial Intelligence MCQS
No ratings yet
Artificial Intelligence MCQS
22 pages
Edge Drawing A Combined Real-Time Edge and Segment Detector
No ratings yet
Edge Drawing A Combined Real-Time Edge and Segment Detector
11 pages
QM_Tutorial_Sheet#3
No ratings yet
QM_Tutorial_Sheet#3
2 pages
21MAT756
No ratings yet
21MAT756
3 pages
IPM Template V5
No ratings yet
IPM Template V5
26 pages
Mid - Sem - 2019 - Linear Control System
No ratings yet
Mid - Sem - 2019 - Linear Control System
1 page
Time Series Forcasting
No ratings yet
Time Series Forcasting
19 pages
Operations Research
100% (2)
Operations Research
293 pages
Balaji 1
No ratings yet
Balaji 1
30 pages
Artificial Intelligence - AL3391 2021 Regulation
No ratings yet
Artificial Intelligence - AL3391 2021 Regulation
11 pages
K024 K006 DWM ResearchPaper
No ratings yet
K024 K006 DWM ResearchPaper
16 pages
VTU DSP Lab Manual 5th Sem E C Matlab Programs and CCS Studio Programs
No ratings yet
VTU DSP Lab Manual 5th Sem E C Matlab Programs and CCS Studio Programs
35 pages
A Mean-VaR Based Deep Reinforcement Learning Framework For Practical Algorithmic Trading
No ratings yet
A Mean-VaR Based Deep Reinforcement Learning Framework For Practical Algorithmic Trading
14 pages
Admission Requirements: SAIT Logo
No ratings yet
Admission Requirements: SAIT Logo
27 pages
3 The Quadratic Family and Bifurcations
No ratings yet
3 The Quadratic Family and Bifurcations
2 pages
Machine Learning Deep Learning Natural Language Processing (NLP) Algorithms Machine Learning Models
No ratings yet
Machine Learning Deep Learning Natural Language Processing (NLP) Algorithms Machine Learning Models
2 pages
Signal Theory and Application
No ratings yet
Signal Theory and Application
16 pages
Data Warehouse
No ratings yet
Data Warehouse
5 pages
PGP Aiml2024
No ratings yet
PGP Aiml2024
22 pages
1 1optimization
No ratings yet
1 1optimization
25 pages

Unit 5 DM

Uploaded by

Unit 5 DM

Uploaded by

Unit 5

 Cluster Analysis: Basic Concepts

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)

characteristics found in the data and grouping similar

 As a preprocessing step for other algorithms

 Ability to deal with different types of attributes

 Use domain knowledge to determine input parameters

 Interpretability and usability

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

criterion, e.g., minimizing the sum of square errors

using some criterion

 Typical methods: DBSACN, OPTICS, DenClue

 Typical methods: STING, WaveCluster, CLIQUE

the best fit of that model to each other

 Typical methods: p-Cluster

 Massive links can be used to cluster objects: SimRank, LinkClus

 Cluster Analysis: Basic Concepts

 Given k, the k-means algorithm is implemented in four

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is

 Most of the variants of the k-means which differ in

 The k-means algorithm is sensitive to outliers !

 K-Medoids Clustering: Find representative objects (medoids) in clusters

 Cluster Analysis: Basic Concepts

A clustering of the data objects is obtained by cutting

 Introduced in Kaufmann and Rousseeuw (1990)

 Centroid: distance between the centroids of two clusters, i.e., dist(K i,

 Medoid: distance between the medoids of two clusters, i.e., dist(K i,

 Radius: square root of average distance from any point

and 2nd moments of the subcluster from the statistical point

utilizes storage efficiently

 The nonleaf nodes store sums of the CFs of their children

 A CF tree has two parameters

 Threshold: max diameter of sub-clusters stored at the leaf

B=7 CF1 CF2 CF3 CF6

L=6 child1 child2 child3 child6

Leaf node Leaf node

 For each point in the input

 Add point to leaf entry and update CF

 If entry diameter > max_diameter, then split leaf, and possibly

 Since we fix the size of leaf nodes, so clusters may not be so

 The probability that a point xi ∈ X is generated by the

 The likelihood that X is generated by the model:

 The task of learning the generative model: find the

where P() is the maximum likelihood

mining methods and effective and scalable data mining

 Data Mining for Retail and Telecommunication

 Data Mining for Intrusion Detection and Prevention

 Data Mining and Recommender Systems

 Loan payment performance

 Consumer credit rating

classification, decision trees, etc. to identify customer

transactions, federal/state crime history DBs)

classification, clustering tools, outlier analysis, and

 Retail industry: huge amounts of data on sales, customer

You might also like