BIRCH

Summarized Info for a Single Cluster
Given a cluster with N objects X1, ..., XN:
Diameter (average pairwise distance within the cluster)
$D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (X_i - X_j)^2}{N(N-1)} \right)^{1/2}$
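As a quick check of the diameter formula, here is a small worked example with three hypothetical 1-D values 1, 3, 5 (chosen only for illustration):

$\sum_{i=1}^{3} \sum_{j=1}^{3} (X_i - X_j)^2 = 2\left[(1-3)^2 + (1-5)^2 + (3-5)^2\right] = 2(4 + 16 + 4) = 48, \qquad D = \left( \frac{48}{3 \cdot 2} \right)^{1/2} = \sqrt{8} \approx 2.83$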
Summarized Info for Two Clusters
Given two clusters with N1 and N2 objects and centroids X0 and Y0, respectively:
Centroid Euclidean distance
$D_0 = \left( (X_0 - Y_0)^2 \right)^{1/2}$
Centroid Manhattan distance
$D_1 = |X_0 - Y_0|$
Average inter-cluster distance
$D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (X_i - Y_j)^2}{N_1 N_2} \right)^{1/2}$
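A minimal sketch of how these three distances could be computed, in plain Python with NumPy (the function names and the example clusters are my own, for illustration only):

```python
import numpy as np

def centroid_euclidean(X, Y):
    """D0: Euclidean distance between the two cluster centroids."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def centroid_manhattan(X, Y):
    """D1: Manhattan distance between the two cluster centroids."""
    return np.abs(X.mean(axis=0) - Y.mean(axis=0)).sum()

def average_inter_cluster(X, Y):
    """D2: square root of the mean squared distance over all N1*N2 cross pairs."""
    diffs = X[:, None, :] - Y[None, :, :]        # shape (N1, N2, d)
    sq = (diffs ** 2).sum(axis=2)                # squared pairwise distances
    return np.sqrt(sq.sum() / (len(X) * len(Y)))

X = np.array([[2.0, 6.0], [4.0, 5.0], [4.0, 7.0]])   # cluster 1 (N1 = 3)
Y = np.array([[7.0, 2.0], [8.0, 4.0]])               # cluster 2 (N2 = 2)
print(centroid_euclidean(X, Y), centroid_manhattan(X, Y), average_inter_cluster(X, Y))
```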
Clustering Feature (CF)
CF = (N, LS, SS)
N = |C|: the number of data points in the cluster
LS: the linear sum of the N data points,
$LS = \sum_{i=1}^{N} X_i = \left( \sum_{i=1}^{N} V_{i1}, \sum_{i=1}^{N} V_{i2}, \ldots, \sum_{i=1}^{N} V_{id} \right)$
SS: the square sum of the N data points,
$SS = \sum_{i=1}^{N} X_i^2$
[Figure: five data points (3,4), (2,6), (4,5), (4,7), (3,8) plotted on a 0-10 grid.]
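Working out CF for the five points in the figure: N = 5; LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30); SS = (3²+2²+4²+4²+3², 4²+6²+5²+7²+8²) = (54, 190), so CF = (5, (16, 30), (54, 190)). Some presentations keep SS per dimension, as here; others keep it as the single scalar 54 + 190 = 244.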
The radius R (average distance of the cluster's points from the centroid $X_0 = LS / N$) can be computed directly from the CF:
$R = \left( \frac{\sum_{i=1}^{N} (X_i - X_0)^2}{N} \right)^{1/2}$
$\; = \left( \frac{\sum_{i=1}^{N} (X_i^2 - 2 X_0 X_i + X_0^2)}{N} \right)^{1/2}$
$\; = \left( \frac{\sum_{i=1}^{N} X_i^2 - 2 X_0 \sum_{i=1}^{N} X_i + N X_0^2}{N} \right)^{1/2}$
$\; = \left( \frac{SS - 2 \frac{LS}{N} LS + N \left( \frac{LS}{N} \right)^2}{N} \right)^{1/2}$
$\; = \left( \frac{SS - \frac{LS^2}{N}}{N} \right)^{1/2}$
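Continuing the worked example (CF = (5, (16, 30), (54, 190)), with SS taken as the scalar 244 and $LS^2$ read as $\|LS\|^2 = 16^2 + 30^2 = 1156$):

$R = \left( \frac{244 - 1156/5}{5} \right)^{1/2} = \left( \frac{12.8}{5} \right)^{1/2} = \sqrt{2.56} = 1.6$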
CF Additive Theorem
Suppose cluster C1 has CF1 = (N1, LS1, SS1) and cluster C2 has CF2 = (N2, LS2, SS2).
If we merge C1 with C2, the CF for the merged cluster C is
$CF = CF_1 + CF_2 = (N_1 + N_2, \; LS_1 + LS_2, \; SS_1 + SS_2)$
Why CF?
Summarized info for single cluster
Summarized info for two clusters
Additive theorem
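To make the CF bookkeeping concrete, here is a minimal sketch in plain Python with NumPy (the class and method names are my own, not from the BIRCH paper). SS is kept as a scalar, and the diameter uses the standard identity $\sum_i \sum_j (X_i - X_j)^2 = 2N \cdot SS - 2\|LS\|^2$:

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS): point count, linear-sum vector, scalar sum of squared norms."""

    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, np.asarray(ls, dtype=float), float(ss)

    @classmethod
    def from_points(cls, points):
        pts = np.asarray(points, dtype=float)
        return cls(len(pts), pts.sum(axis=0), (pts ** 2).sum())

    def merge(self, other):
        """Additive theorem: the CF of the union is the component-wise sum."""
        return ClusteringFeature(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        """R = sqrt((SS - |LS|^2 / N) / N): average distance of the points to the centroid."""
        return float(np.sqrt((self.ss - self.ls @ self.ls / self.n) / self.n))

    def diameter(self):
        """D = sqrt((2N*SS - 2|LS|^2) / (N(N-1))): average pairwise distance."""
        return float(np.sqrt((2 * self.n * self.ss - 2 * self.ls @ self.ls)
                             / (self.n * (self.n - 1))))

# The five points from the earlier figure, split into two subclusters and merged back
a = ClusteringFeature.from_points([(3, 4), (2, 6), (4, 5)])
b = ClusteringFeature.from_points([(4, 7), (3, 8)])
c = a.merge(b)
print(c.n, c.ls, c.ss)   # 5 [16. 30.] 244.0
print(c.radius())        # 1.6
```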
Example of Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points, $\sum_{i=1}^{N} X_i$
SS: square sum of the N data points, $\sum_{i=1}^{N} X_i^2$
[Figure: data points (2,6), (4,5), (4,7), (3,8), (6,2), (7,2), (7,4), (8,4), (8,5) plotted on a 0-10 grid.]
Example of CF Tree
[Figure: a CF tree with branching factor B = 7. The root holds entries CF1 ... CF6, each with a pointer to a child node; non-leaf nodes likewise hold CF entries with child pointers; leaf nodes hold CF entries for subclusters and are chained together by prev/next pointers.]
BIRCH Phase 1
Phase 1 scans the data points and builds the in-memory CF tree; for each data point d (a sketch of this insertion logic is given below):
Start from the root and traverse down the tree, choosing the closest child at each level, to reach the closest leaf node for d
Search for the closest entry Li in that leaf node
If d can be inserted into Li (i.e., the updated Li still satisfies the threshold), update the CF vector of Li
Else, if the leaf node has space for a new entry, insert d as a new entry; otherwise split the node
Once d is inserted, update the CF vectors of the nodes along the path to the root; if there was a split, a new entry must be inserted into the parent node (which may cause further splitting)
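A minimal sketch of this insertion logic, assuming several simplifications the paper does not make: one branching limit B serves for both leaf and non-leaf nodes, node summaries are recomputed on demand rather than maintained incrementally along the insertion path, there is no prev/next chaining of leaf nodes, and there is no tree rebuilding when memory runs out. All names are hypothetical:

```python
import numpy as np

class CF:
    """Clustering feature (N, LS, SS) with the additive merge."""
    def __init__(self, n=0, ls=0.0, ss=0.0):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def of(cls, point):
        p = np.asarray(point, dtype=float)
        return cls(1, p, float(p @ p))

    def __add__(self, other):
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        return float(np.sqrt(max(self.ss - self.ls @ self.ls / self.n, 0.0) / self.n))

class Node:
    """CF-tree node: a leaf holds subcluster CFs, a non-leaf holds child Nodes."""
    def __init__(self, leaf):
        self.leaf = leaf
        self.items = []

    def cf(self):
        # Node summary recomputed on demand; real BIRCH maintains it incrementally.
        total = CF()
        for it in self.items:
            total = total + (it if self.leaf else it.cf())
        return total

def closest(items, p, leaf):
    key = lambda it: np.linalg.norm((it.centroid() if leaf else it.cf().centroid()) - p)
    return min(items, key=key)

def split(node):
    """Split an over-full node in two, seeding with the farthest pair of centroids."""
    cents = [(it if node.leaf else it.cf()).centroid() for it in node.items]
    pairs = [(np.linalg.norm(cents[i] - cents[j]), i, j)
             for i in range(len(cents)) for j in range(i + 1, len(cents))]
    _, a, b = max(pairs)
    left, right = Node(node.leaf), Node(node.leaf)
    for it, c in zip(node.items, cents):
        near_a = np.linalg.norm(c - cents[a]) <= np.linalg.norm(c - cents[b])
        (left if near_a else right).items.append(it)
    if not right.items:                       # degenerate tie: keep both halves non-empty
        right.items.append(left.items.pop())
    return [left, right]

def insert(node, point, T, B):
    """Insert a point into the subtree rooted at node.
    Returns [node] normally, or [left, right] if the node had to split."""
    p = np.asarray(point, dtype=float)
    if node.leaf:
        if node.items:
            li = closest(node.items, p, leaf=True)
            if (li + CF.of(p)).radius() <= T:          # absorb into the closest subcluster
                node.items[node.items.index(li)] = li + CF.of(p)
                return [node]
        node.items.append(CF.of(p))                    # otherwise start a new subcluster
    else:
        child = closest(node.items, p, leaf=False)
        i = node.items.index(child)
        node.items[i:i + 1] = insert(child, point, T, B)   # child may come back as two nodes
    return split(node) if len(node.items) > B else [node]

def birch_phase1(points, T=1.5, B=3):
    root = Node(leaf=True)
    for x in points:
        parts = insert(root, x, T, B)
        if len(parts) == 2:                            # root split: grow the tree by one level
            root = Node(leaf=False)
            root.items = parts
    return root

# Usage with the ten points from the earlier figures:
# tree = birch_phase1([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8),
#                      (6, 2), (7, 2), (7, 4), (8, 4), (8, 5)])
```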
Example of the BIRCH Algorithm
[Figure: data points forming subclusters sc1-sc7; a new subcluster sc8 arrives and is directed to leaf node LN1. LN1 overflows and is split into LN1' and LN1''; the root then holds entries for LN1', LN1'', LN2, and LN3, overflows in turn, and is split into two non-leaf nodes NLN1 and NLN2, growing the tree by one level.]
Merge Operation in BIRCH
Assume that the subclusters are numbered according to
the order of formation.
[Figure: data points forming subclusters sc1-sc6; the CF tree root has two leaf nodes, LN1 and LN2, whose entries are the subclusters sc1-sc6.]
If the branching factor of a leaf node cannot exceed 3, then LN2 is split.
[Figure: after splitting LN2 into LN2' and LN2'', the root's leaf nodes are LN1, LN2', and LN2''.]
LN2' and LN1 will be merged, and the newly formed node will be split immediately.
[Figure: after the merge and immediate re-split, the root's leaf nodes are LN2'', LN3', and LN3''.]
Cases that Trouble BIRCH
[Figure: data points numbered 1-15 in order of arrival, grouped into Subcluster 1 and Subcluster 2.]
Order Dependence
Incremental clustering algorithms such as BIRCH suffer from order dependence.
As the previous example demonstrates, the split and merge operations can alleviate order dependence to a certain extent.
In the example that demonstrates the merge operation, the split and merge operations together improve the clustering quality.
However, order dependence cannot be completely avoided. If no new objects had been added to form subcluster sc6, the clustering quality would not have been satisfactory.
Several Issues with the CF Tree
The number of entries in a CF tree node is limited by the page size; thus, a node may not correspond to a natural cluster:
Two sub-clusters that should be in one cluster may be split across nodes
Two sub-clusters that should not be in one cluster may be kept in the same node (depending on the input order and the skewness of the data)
Sensitive to skewed input order:
A data point may end up in a leaf node where it should not be
If a data point is inserted twice at different times, the two copies may end up in two distinct leaf nodes
BIRCH Phases 3 and 4
Phase 3: Global clustering
Apply an existing clustering algorithm (e.g., AGNES) to the sub-clusters at the leaf nodes
May treat each sub-cluster as a single point (its centroid) and perform clustering on these points
The clusters produced are closer to the data distribution pattern
Phase 4: Cluster refinement
Redistribute (re-label) all data points w.r.t. the clusters produced by global clustering
This phase may be repeated to improve cluster quality
A sketch of both steps is given below.
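A minimal sketch of these two phases, assuming scikit-learn's AgglomerativeClustering as a stand-in for the global algorithm (the slides mention AGNES, i.e., agglomerative clustering); each leaf subcluster is represented by its CF centroid, and all function names are my own:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def global_clustering(leaf_cfs, k):
    """Phase 3: cluster the leaf subclusters by their centroids.
    leaf_cfs: list of (N, LS, SS) tuples collected from the CF tree's leaf entries."""
    centroids = np.array([np.asarray(ls, dtype=float) / n for n, ls, ss in leaf_cfs])
    return AgglomerativeClustering(n_clusters=k).fit_predict(centroids)

def refine(points, leaf_cfs, labels, k):
    """Phase 4: re-label every original data point by its nearest global-cluster
    centroid, where each global centroid is the CF-weighted mean of its subclusters."""
    centers = []
    for c in range(k):
        members = [(n, np.asarray(ls, dtype=float))
                   for (n, ls, ss), l in zip(leaf_cfs, labels) if l == c]
        n_tot = sum(n for n, ls in members)
        ls_tot = sum(ls for n, ls in members)
        centers.append(ls_tot / n_tot)
    centers = np.array(centers)
    pts = np.asarray(points, dtype=float)
    dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Usage: labels = global_clustering(leaf_cfs, k=2)
#        point_labels = refine(data, leaf_cfs, labels, k=2)
```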
Data set
[Figure: synthetic data with clusters arranged in a grid of 10 rows by 10 columns.]
BIRCH and CLARANS comparison
[Figures: visualizations of the BIRCH and CLARANS clustering results on the synthetic data set.]
BIRCH and CLARANS comparison
Result visualization:
Each cluster is represented as a circle
The circle center is the cluster centroid; the circle radius is the cluster radius
The number of points in the cluster is labeled inside the circle
BIRCH results:
The BIRCH clusters are similar to the actual clusters
The maximal and average distances between the centroids of an actual cluster and the corresponding BIRCH cluster are 0.17 and 0.07, respectively
The number of points in a BIRCH cluster differs by no more than 4% from that of the corresponding actual cluster
The radii of the BIRCH clusters are close to those of the actual clusters
CLARANS results:
The pattern of cluster-center locations is distorted
The number of data points in a CLARANS cluster can differ by as much as 57% from that of the actual cluster
Cluster radii vary from 1.15 to 1.94, against an average of 1.44
BIRCH Performance on the Base Workload w.r.t. Time, Diameter, Data Set, and Input Order
BIRCH and CLARANS comparison
Parameters:
D: average diameter; smaller means better cluster quality
Time: time to cluster the data set (in seconds)
Order ('o'): points in the same cluster are placed together in the input data
Results:
BIRCH took less than 50 seconds to cluster the 100,000 data points of each data set (on an HP 9000 workstation with 80K memory)
The ordering of the data points also has no impact on BIRCH
CLARANS is at least 15 times slower than BIRCH; when the data points are ordered, CLARANS's performance degrades