
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies

- A hierarchical clustering method.
- It introduces two concepts:
  - Clustering feature (CF)
  - Clustering feature tree (CF tree)
- These structures help the clustering method achieve good speed and scalability in large databases.
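For a quick hands-on impression (not part of the original slides), scikit-learn ships a BIRCH implementation; the sketch below assumes scikit-learn and NumPy are installed, and its `threshold` and `branching_factor` parameters play the roles of the diameter threshold T and branching factor B discussed later.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Toy data: three well-separated 2-D blobs.
centers = [(0, 0), (5, 5), (0, 5)]
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(100, 2)) for c in centers])

# threshold plays the role of the sub-cluster radius/diameter threshold T,
# branching_factor the role of the node capacity B.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly 100 points per cluster
```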
Motivation
- Major weakness of agglomerative clustering methods:
  - Do not scale well; time complexity of at least O(n^2), where n is the total number of objects
  - Can never undo what was done previously
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies
  - Incrementally constructs a CF (Clustering Feature) tree, a hierarchical data structure summarizing cluster information; finds a good clustering with a single scan
  - Applies multi-phase clustering; improves quality with a few additional scans
Summarized Info for a Single Cluster
- Given a cluster with N objects X_1, ..., X_N:
- Centroid:
  X_0 = ( \sum_{i=1}^{N} X_i ) / N
- Radius (average distance from member objects to the centroid):
  R = ( \sum_{i=1}^{N} (X_i - X_0)^2 / N )^{1/2}
- Diameter (average pairwise distance within the cluster):
  D = ( \sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2 / ( N (N - 1) ) )^{1/2}
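A minimal sketch (NumPy, not from the original slides) that computes these three statistics directly from the raw points, using the five example points that appear later in the deck:

```python
import numpy as np

def cluster_stats(points: np.ndarray):
    """Centroid X0, radius R, and diameter D of a cluster, computed from raw points."""
    n = len(points)
    centroid = points.sum(axis=0) / n                       # X0
    radius = np.sqrt(((points - centroid) ** 2).sum() / n)  # R
    # All pairwise squared distances (the i == j terms are zero).
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.sqrt((diffs ** 2).sum() / (n * (n - 1)))  # D
    return centroid, radius, diameter

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)
print(cluster_stats(points))
```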
Summarized Info for Two Clusters
- Given two clusters with N1 and N2 objects, X_1, ..., X_{N1} and Y_1, ..., Y_{N2}, respectively:
- Centroid Euclidean distance:
  D0 = ( (X_0 - Y_0)^2 )^{1/2}
- Centroid Manhattan distance:
  D1 = | X_0 - Y_0 |
- Average inter-cluster distance:
  D2 = ( \sum_{i=1}^{N1} \sum_{j=1}^{N2} (X_i - Y_j)^2 / ( N1 N2 ) )^{1/2}
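A small sketch (NumPy, illustrative only) of these three inter-cluster distances, where each cluster is given as an array of raw points:

```python
import numpy as np

def d0(c1: np.ndarray, c2: np.ndarray) -> float:
    """Centroid Euclidean distance between two clusters."""
    return float(np.linalg.norm(c1.mean(axis=0) - c2.mean(axis=0)))

def d1(c1: np.ndarray, c2: np.ndarray) -> float:
    """Centroid Manhattan distance between two clusters."""
    return float(np.abs(c1.mean(axis=0) - c2.mean(axis=0)).sum())

def d2(c1: np.ndarray, c2: np.ndarray) -> float:
    """Average inter-cluster distance over all cross-cluster pairs."""
    diffs = c1[:, None, :] - c2[None, :, :]
    return float(np.sqrt((diffs ** 2).sum() / (len(c1) * len(c2))))
```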
Clustering Feature (CF)
- CF = (N, LS, SS)
  - N = |C|: the number of data points in the cluster
  - LS = \sum_{i=1}^{N} X_i = ( \sum_{i=1}^{N} V_{i1}, ..., \sum_{i=1}^{N} V_{id} ): the linear sum of the N data points (one component per dimension)
  - SS = \sum_{i=1}^{N} X_i^2: the square sum of the N data points (component-wise)
Example of Clustering Feature Vector
- Clustering feature: CF = (N, LS, SS), where N is the number of data points, LS = \sum_{i=1}^{N} X_i, and SS = \sum_{i=1}^{N} X_i^2
- For the five points (3,4), (2,6), (4,5), (4,7), (3,8):
  CF = (5, (16, 30), (54, 190))
[Figure: the five points plotted on a 10 x 10 grid]
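A minimal sketch (plain Python, using the component-wise SS convention of this slide) that reproduces this CF vector:

```python
def clustering_feature(points):
    """CF = (N, LS, SS) with component-wise linear and square sums."""
    n = len(points)
    dims = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dims))
    ss = tuple(sum(p[d] ** 2 for p in points) for d in range(dims))
    return n, ls, ss

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# -> (5, (16, 30), (54, 190))
```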

The radius can be rewritten so that it depends only on the CF entries:
  R = ( \sum_{i=1}^{N} (X_i - X_0)^2 / N )^{1/2}
    = ( \sum_{i=1}^{N} (X_i^2 - 2 X_0 X_i + X_0^2) / N )^{1/2}
    = ( ( \sum_{i=1}^{N} X_i^2 - 2 X_0 \sum_{i=1}^{N} X_i + N X_0^2 ) / N )^{1/2}
    = ( ( SS - 2 (LS / N) LS + N (LS / N)^2 ) / N )^{1/2}
    = ( ( SS - LS^2 / N ) / N )^{1/2}
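A small sketch (plain Python, assuming the component-wise CF convention used above) showing that R needs only the CF entries, not the raw points:

```python
import math

def radius_from_cf(n, ls, ss):
    """Radius computed from CF = (N, LS, SS) alone."""
    ss_total = sum(ss)                      # sum_i ||X_i||^2
    ls_sq = sum(v * v for v in ls)          # ||LS||^2
    return math.sqrt((ss_total - ls_sq / n) / n)

print(radius_from_cf(5, (16, 30), (54, 190)))
# -> 1.6, the same value obtained by computing R directly from the five points
```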
CF Additive Theorem
- Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
- If we merge C1 with C2, the CF for the merged cluster C is
  CF = CF1 + CF2 = ( N1 + N2, LS1 + LS2, SS1 + SS2 )
- Why CF?
  - Summarized info for a single cluster
  - Summarized info for two clusters
  - Additive theorem
Example of Clustering Feature Vector
- For the first five points (3,4), (2,6), (4,5), (4,7), (3,8):
  CF1 = (5, (16, 30), (54, 190))
- For the second five points (6,2), (7,2), (7,4), (8,4), (8,5):
  CF2 = (5, (36, 17), (262, 61))
- By the additive theorem, the CF of the merged cluster is
  CF = CF1 + CF2 = (10, (52, 47), (316, 251))
[Figure: both groups of five points plotted on a 10 x 10 grid]
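A minimal sketch of the additive theorem that reproduces the merged CF above (CF values taken verbatim from the slide):

```python
def merge_cf(cf1, cf2):
    """CF additive theorem: the merged CF is the entry-wise sum."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

cf1 = (5, (16, 30), (54, 190))
cf2 = (5, (36, 17), (262, 61))
print(merge_cf(cf1, cf2))   # -> (10, (52, 47), (316, 251))
```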
Clustering Feature Tree (CFT)
- A clustering feature tree (CFT) is an alternative representation of the data set:
  - Each non-leaf node is a cluster comprising the sub-clusters that correspond to its entries (at most B entries per non-leaf node)
  - Each leaf node is a cluster comprising the sub-clusters that correspond to its entries (at most L entries per leaf node), and each sub-cluster's diameter is at most T; the larger T is, the smaller the CFT
  - Each node must fit in a memory page
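A minimal structural sketch (hypothetical Python classes, not the original implementation) of the two node types and the B/L/T limits:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

B, L = 7, 6   # max entries in a non-leaf / leaf node (illustrative values)
T = 1.5       # sub-cluster diameter threshold (illustrative value)

@dataclass
class CF:
    n: int                    # number of points in the sub-cluster
    ls: Tuple[float, ...]     # linear sum, one component per dimension
    ss: Tuple[float, ...]     # square sum, one component per dimension

@dataclass
class CFNode:
    is_leaf: bool
    entries: List[CF] = field(default_factory=list)        # <= B (non-leaf) or <= L (leaf) entries
    children: List["CFNode"] = field(default_factory=list)  # one child per entry in non-leaf nodes
    next_leaf: Optional["CFNode"] = None                     # leaf nodes are chained for scanning
```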
Example of CF Tree
[Figure: a CF tree with B = 7 and L = 6. The root holds entries CF1, CF2, CF3, ..., CF6, each with a child pointer; a non-leaf node holds its own CF entries with child pointers; leaf nodes hold the sub-cluster CFs and are chained together with prev/next pointers.]
BIRCH Phase 1
- Phase 1 scans the data points and builds the in-memory CFT. For each incoming data point d (see the sketch below):
  - Starting from the root, traverse down the tree to choose the closest leaf node for d
  - Search for the closest entry Li in that leaf node
  - If d can be inserted into Li (without violating the diameter threshold T), update the CF vector of Li
  - Else, if the node has space for a new entry, insert d as a new entry; otherwise split the node
  - Once d is inserted, update the CF vectors of the nodes along the path to the root; if a split occurred, a new entry must be inserted into the parent node (which may result in further splitting)
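A heavily simplified sketch of the Phase 1 absorption test. It ignores the tree navigation, the B/L node capacities, node splitting, and tree rebuilding, keeping just a flat list of leaf sub-cluster CFs: each point is either absorbed into the closest sub-cluster (if the diameter stays at most T) or starts a new one.

```python
import math

def merge_cf(cf1, cf2):
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

def diameter(cf):
    """Diameter from a CF alone: D^2 = (2*N*sum(SS) - 2*||LS||^2) / (N*(N-1))."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    return math.sqrt(max(0.0, (2 * n * sum(ss) - 2 * sum(v * v for v in ls)) / (n * (n - 1))))

def insert_point(subclusters, point, T):
    """Absorb `point` into the closest sub-cluster if the diameter stays <= T, else start a new one."""
    cf_point = (1, tuple(point), tuple(v * v for v in point))
    if subclusters:
        # Closest sub-cluster by squared distance from its centroid (LS/N) to the point.
        best = min(range(len(subclusters)),
                   key=lambda i: sum((l / subclusters[i][0] - p) ** 2
                                     for l, p in zip(subclusters[i][1], point)))
        merged = merge_cf(subclusters[best], cf_point)
        if diameter(merged) <= T:
            subclusters[best] = merged
            return
    subclusters.append(cf_point)

subclusters = []
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8), (6, 2), (7, 2), (7, 4), (8, 4), (8, 5)]:
    insert_point(subclusters, p, T=3.0)
print(len(subclusters))   # -> 2 sub-clusters with this threshold
```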
Example of the BIRCH Algorithm
[Figure: a new sub-cluster sc8 arrives. The root points to leaf nodes LN1, LN2, and LN3; LN1 holds sc1, sc2, sc3 (and receives sc8), LN2 holds sc4, sc5, sc6, and LN3 holds sc7.]
Merge Operation in BIRCH
- If the branching factor of a leaf node cannot exceed 3, then LN1 is split.
[Figure: LN1 is split into LN1' and LN1''; the root now points to LN1', LN1'', LN2, and LN3.]
Merge Operation in BIRCH
- If the branching factor of a non-leaf node cannot exceed 3, then the root is split and the height of the CF tree increases by one.
[Figure: the root is split into non-leaf nodes NLN1 and NLN2; NLN1 points to LN1' and LN1'', and NLN2 points to LN2 and LN3.]
Merge Operation in BIRCH
- Assume that the sub-clusters are numbered according to their order of formation.
[Figure: six sub-clusters sc1-sc6; the root points to leaf nodes LN1 and LN2.]
- If the branching factor of a leaf node cannot exceed 3, then LN2 is split.
[Figure: LN2 is split into LN2' and LN2''; the root now points to LN1, LN2', and LN2''.]
- LN2' and LN1 will be merged, and the newly formed node will be split immediately.
[Figure: after the merge and immediate re-split, the root points to leaf nodes LN3', LN3'', and LN2''.]
Cases that Trouble BIRCH
- The objects are numbered by their incoming order, and assume that the distance between objects 1 and 2 exceeds the diameter threshold.
[Figure: fifteen objects that naturally form a single elongated group; because objects 1 and 2 arrive first and cannot share a sub-cluster, the remaining objects are divided between Subcluster 1 and Subcluster 2.]
Order Dependence
- Incremental clustering algorithms such as BIRCH suffer from order dependence.
- As the previous example demonstrates, the split and merge operations can alleviate order dependence to a certain extent.
- In the example that demonstrates the merge operation, the split and merge operations together improve clustering quality.
- However, order dependence cannot be completely avoided. If no new objects had been added to form sub-cluster 6, the clustering quality would not be satisfactory.
Several Issues with the CF Tree
- The number of entries in a CFT node is limited by the page size; thus, a node may not correspond to a natural cluster
  - Two sub-clusters that should be in one cluster may be split across nodes
  - Two sub-clusters that should not be in one cluster may be kept in the same node (depending on input order and data skew)
- Sensitive to skewed input order
  - A data point may end up in a leaf node where it should not be
  - If a data point is inserted twice at different times, it may end up as two copies in two distinct leaf nodes
Later BIRCH Phases
- Global clustering phase
  - Apply an existing clustering algorithm (e.g., AGNES) to the sub-clusters at the leaf nodes
  - May treat each sub-cluster as a point (its centroid) and perform clustering on these points
  - The clusters produced are closer to the data distribution pattern
- Cluster refinement phase (see the sketch below)
  - Redistribute (re-label) all data points w.r.t. the clusters produced by global clustering
  - This phase may be repeated to improve cluster quality
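A minimal sketch of these two later phases (illustrative only: it assumes the flat sub-cluster list produced by the Phase 1 sketch above, a NumPy array `points` of the raw data, and uses scikit-learn's AgglomerativeClustering as a stand-in for the global clustering algorithm):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def global_and_refine(subclusters, points, k):
    """Cluster sub-cluster centroids globally, then re-label every raw point."""
    # Global clustering phase: treat each sub-cluster as a single point (its centroid).
    centroids = np.array([[l / n for l in ls] for n, ls, _ in subclusters])
    sub_labels = AgglomerativeClustering(n_clusters=k).fit_predict(centroids)
    # Cluster refinement phase: redistribute raw points to the nearest global cluster center.
    centers = np.array([centroids[sub_labels == c].mean(axis=0) for c in range(k)])
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)
```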
Data Set
[Figure: the synthetic data set used for the comparison, with clusters laid out in a grid of 10 columns and 10 rows.]
BIRCH and CLARANS Comparison
[Figures: clustering results used for the BIRCH and CLARANS comparison.]
 Result visualization:

Cluster represented as a circle

Circle center is centroid; circle radius is
cluster radius

Number of points in cluster labeled in circle
 BIRCH results:

BIRCH clusters are similar to actual clusters

Maximal and average difference between
centroids of actual and corresponding BIRCH
cluster are 0.17 and 0.07 respectively

No of points in BIRCH cluster is no more than
4% different from that of corresponding actual
cluster

Radii of BIRCH clusters are close to those of
actual clusters
27
- CLARANS results:
  - The pattern of cluster-center locations is distorted
  - The number of data points in a cluster differs by as much as 57% from that of the actual cluster
  - The radii of the clusters vary from 1.15 to 1.94, with an average of 1.44
[Figure: BIRCH performance on the base workload w.r.t. time, diameter, data set, and input order.]
[Figure: CLARANS performance on the base workload w.r.t. time, diameter, data set, and input order.]
BIRCH and CLARANS Comparison
- Parameters:
  - D: average diameter; smaller means better cluster quality
  - Time: time to cluster the data set (in seconds)
  - Order ('o'): points belonging to the same cluster are placed together in the input data
- Results:
  - BIRCH took less than 50 seconds to cluster the 100,000 data points of each data set (on an HP 9000 workstation with 80K of memory)
  - The ordering of the data points has no impact on BIRCH
  - CLARANS is at least 15 times slower than BIRCH; when the data points are ordered, CLARANS performance degrades further
