Tanagra Clustering Tree
Subject
We show how to induce clustering trees with TANAGRA.
The aim of clustering is to build groups of individuals so that the examples in the same
group are similar and the examples in different groups are dissimilar.
Top-down induction of clustering trees adapts the supervised decision/regression tree
framework to clustering. The groups are built by recursive partitioning of the dataset; the
internal nodes of the tree are split on the input attributes, as usual. The resulting model,
the clustering tree, describes the groups; the learning algorithm automatically selects the
relevant attributes.
The clustering tree approach is not widely known; in this tutorial we show the interesting
properties of this method. Our main references are the papers of Chavent (1998)1 and
Blockeel et al. (1998)2.
Dataset
We use the ZOO dataset (UCI). We want to group the animals using their characteristics, such
as the number of legs, whether they produce milk, …
The domain expert proposes 7 clusters. We want to know (1) whether our algorithm can find
these clusters, and (2) whether we find the same clusters as the well-known K-MEANS algorithm.
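Outside TANAGRA, the same dataset can be loaded with a few lines of Python. This is only a hedged sketch: the file name zoo.xls mirrors the tutorial, but the column names (we assume here a column named "type" holding the expert grouping) depend on the local export.

# Hedged sketch (outside TANAGRA): load the ZOO data with pandas.
# The file name "zoo.xls" mirrors the tutorial; the column name "type"
# (the expert grouping, 7 classes) is an assumption about the local export.
import pandas as pd

zoo = pd.read_excel("zoo.xls")
descriptors = zoo.drop(columns=["type"])   # keep only the descriptive attributes
print(zoo.shape)
print(zoo["type"].value_counts())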
1 M. Chavent (1998), « A monothetic clustering method », Pattern Recognition Letters, 19, 989-996.
2 H. Blockeel, L. De Raedt, J. Ramon (1998), « Top-Down Induction of Clustering Trees », ICML, 55-63.
Clustering trees
Because we have discrete attributes, we use multiple correspondence analysis (MCA). This
data transformation combines several advantages: we can now use the classical Euclidean
distance, all the more so as the factorial axes (the latent variables) are uncorrelated; and by
selecting only the first 10 axes, we keep the "useful" information and leave aside the "noisy"
information specific to the file (the artifacts in the dataset).
We add an MCA component in the diagram and set the number of produced axes to 10
(approximately half of the total number of axes).
Note: In the case of continuous attributes, we follow the same principle and use instead a principal
component analysis (PCA). We observe the same advantages.
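For readers who want to replay this preprocessing step outside TANAGRA, here is a rough Python stand-in, continuing the loading sketch above. It uses complete disjunctive (0/1) coding followed by a PCA, which is not exactly an MCA but plays the same role here: orthogonal factorial axes on which the Euclidean distance makes sense.

import pandas as pd
from sklearn.decomposition import PCA

# Complete disjunctive coding of the discrete attributes, then 10 orthogonal axes.
dummies = pd.get_dummies(descriptors.astype(str))
pca = PCA(n_components=10)
axes = pca.fit_transform(dummies)          # (n_examples, 10) "factorial" axes
print("share of variance kept:", pca.explained_variance_ratio_.sum())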
We click on the VIEW contextual menu. The 10 axes summarize 90% of the available
information, which is fully suitable.
Note: In this tutorial, we use the same attributes for the homogeneity computation and the
construction of the tree. But in fact we can use two separate sets of attributes. We then obtain
a generalization of decision/regression trees; some authors call this approach “multi-objective
regression/decision trees” or “predictive clustering trees” (a small sketch of this idea is given below).
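As an illustration of this “predictive clustering tree” idea, and only as an illustration (this is not TANAGRA's CTP component), a multi-output regression tree can be grown with scikit-learn: the splits are searched on the coded input attributes while the homogeneity is measured on the factorial axes. The variable names continue the sketches above.

from sklearn.tree import DecisionTreeRegressor

# Multi-output regression tree: splits on the coded attributes (dummies),
# homogeneity measured on the factorial axes (axes), i.e. a within-node
# variance that mimics the within-cluster inertia. 4 leaves, as in the tutorial.
pct = DecisionTreeRegressor(max_leaf_nodes=4, random_state=1)
pct.fit(dummies, axes)
leaves = pct.apply(dummies)                # the leaf id plays the role of the cluster
print(pd.Series(leaves).value_counts())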
Clustering trees
We add the clustering tree component in the diagram (CTP -- CLUSTERING TREE WITH
PRUNING).
Roughly speaking, it is a generalization of the CART algorithm (Breiman et al., 1984) with two
specificities:
1. We compute inertia instead of variance to evaluate homogeneity of groups.
2. Our goal is not to produce an accurate prediction but to find “natural” groups. So we try to
detect the “angle” (elbow) of the within-inertia curve computed on the pruning set. At present,
we use a regression on 3 successive points and select the cut point for which the slope of the
line is close to zero (a sketch of this heuristic is given after this list).
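Here is a small sketch of that “angle” heuristic as we read it (the exact pruning rule implemented in TANAGRA may differ): fit a line on each run of 3 successive within-inertia values and keep the first number of leaves for which the local slope becomes close to zero.

import numpy as np

def detect_angle(within_inertia, tol=0.01):
    """within_inertia[k]: within-cluster inertia with k + 1 leaves (pruning set)."""
    x = np.arange(3, dtype=float)
    for k in range(len(within_inertia) - 2):
        slope = np.polyfit(x, within_inertia[k:k + 3], 1)[0]
        if abs(slope) <= tol:              # the curve becomes flat here
            return k + 1                   # number of leaves at the "angle"
    return len(within_inertia)

# illustrative curve that flattens after 4 leaves
curve = [1.00, 0.55, 0.30, 0.12, 0.11, 0.10, 0.10]
print(detect_angle(curve, tol=0.02))       # -> 4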
In this tutorial, we use 20% of the dataset as the pruning set; 80% of the examples are used for
the growing phase. We obtain the following clustering tree (VIEW menu).
We obtain 4 groups (the leaves of the tree); each cluster is described by the rule read along the path from the root to its leaf.
We can also see the decrease of the within-class inertia according to the number of leaves
(groups), on both the growing and the pruning sets.
The 14-group clustering minimizes the within-class inertia on the pruning set (green mark), but
we see an “angle” at 4 groups (red mark). The following chart shows the variation of the
within-class inertia.
[Chart: within-class inertia (between 0 and 0.4) as a function of the number of leaves, from 1 to 15]
We add a DEFINE STATUS component in the diagram. We set TYPE as TARGET and our
clustering suggestion (CLUSTER_CTP_1) as INPUT. Then we add a CONTINGENCY CHI-
SQUARE (NON PARAMETRIC STATISTICS tab) in order to compare the groups.
The diagram is now as follows:
Dataset (zoo.xls)
  Define status 1
    Multiple Correspondence Analysis 1
      Define status 2
        CTP 1
          Define status 3
            Contingency Chi-Square 1
Each expert group falls into a single cluster, and each cluster is either a pure group (Cluster 1
and Cluster 4) or a mix of similar species (Cluster 2 and Cluster 3)3.
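To reproduce this kind of cross-tabulation outside TANAGRA (continuing the sketches above, so the clusters come from the scikit-learn tree rather than from CTP), one can cross the expert groups with the leaves and run a chi-square test of independence:

from scipy.stats import chi2_contingency

table = pd.crosstab(zoo["type"], leaves)   # expert groups x tree clusters
chi2, pvalue, dof, _ = chi2_contingency(table)
print(table)
print("chi2 =", chi2, "p-value =", pvalue)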
We insert again a DEFINE STATUS component under the CTP (Clustering Tree) component.
We set the factorial axes as INPUT. We add the K-MEANS component, configured so that the
results of the two approaches (tree and K-MEANS) are comparable: we ask for 4 groups and we
do not normalize the factorial axes in the inertia computation.
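A hedged equivalent of this K-MEANS step with scikit-learn, on the unnormalized factorial axes and with 4 clusters; the seeding differs from TANAGRA, so the partition may not be strictly identical.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, n_init=10, random_state=1).fit(axes)
km_clusters = kmeans.labels_               # K-MEANS partition on the factorial axes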
3 I am not an expert!
We want to compare these groups with the groups obtained with CTP.
We insert another DEFINE STATUS in the diagram; we set as TARGET the clusters of the
tree (CLUSTER_CTP_1), as INPUT the clusters of the K-MEANS (CLUSTER_KMEANS_1).
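The comparison itself can be sketched with a cross-tabulation of the two partitions; the adjusted Rand index, which is not used in the tutorial, is added here only as a convenient single agreement score.

from sklearn.metrics import adjusted_rand_score

print(pd.crosstab(leaves, km_clusters))    # tree clusters x K-MEANS clusters
print("adjusted Rand index:", adjusted_rand_score(leaves, km_clusters))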
The two methods give equivalent results; the gain in interpretability of the tree is not
counterbalanced by a degradation of the quality of the partition. Another advantage of the
tree in this case is that it selects the relevant variables automatically.
Visualization of groups
Factorial analysis allows us to visualize the dataset in a reduced dimension space. We want
to see if we can perceive the expert groups in the first two “latent” variables.
[Scatter plot: the individuals on the first two factorial axes, colored by expert group]
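Such a plot can be reproduced with matplotlib, continuing the sketches above (the column name "type" is still an assumption about the local file):

import matplotlib.pyplot as plt

groups = zoo["type"].to_numpy()
for g in pd.unique(groups):
    mask = groups == g
    plt.scatter(axes[mask, 0], axes[mask, 1], label=str(g), s=20)
plt.xlabel("axis 1")
plt.ylabel("axis 2")
plt.legend(title="expert group")
plt.show()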
This result is striking. The groups proposed by the experts are clearly distinct even on the first
two axes, which summarize only about 50% of the available information (see the MCA results,
44.89%).
While this example shows that visual tools are often very powerful, the main difficulty is then
to come back to the initial description space and obtain results that are interpretable in terms
of the original descriptors. Reading the results of the MCA remains obscure for people who
are not used to it.
The clustering tree approach is a simple method to build clusters automatically and obtain
interpretable results.