Tanagra Clustering Tree

This document describes using clustering trees to induce groupings of animals (the zoo dataset) based on their characteristics. It shows how to: 1. Import the zoo dataset and select relevant attributes for clustering. 2. Perform multiple correspondence analysis to transform the data and build a new representation space. 3. Use clustering trees to recursively partition the data into groups defined by rules based on the attributes. 4. Compare the clustering tree groups to the expert classification and to k-means clustering, finding them to produce equivalent groupings in this case.


Subject
We show how to induce clustering trees with TANAGRA.

The aim of clustering is to build groups of individuals such that examples in the same
group are similar and examples in different groups are dissimilar.

Top-down induction of clustering trees adapts the supervised decision/regression tree
framework to clustering. The groups are built by recursive partitioning of the dataset;
the internal nodes of the tree are split on the input attributes, as in a classical tree. The
resulting model, the clustering tree, describes the groups; the learning algorithm
automatically selects the relevant attributes.

The clustering tree approach is not widely known; in this tutorial we show the interesting
properties of this method. Our main references are the papers of Chavent (1998)1 and
Blockeel et al. (1998)2.

Dataset
We use the ZOO dataset (UCI). We want to group animals using their characteristics, such as
the number of legs, whether they produce milk, etc.

The domain expert proposes 7 clusters. We want to know (1) whether our algorithm can find these
clusters; (2) whether we find the same clusters as the well-known K-MEANS algorithm.

Clustering trees

Downloading the dataset


First, we create a new diagram and import ZOO.XLS. We click on the FILE/NEW
menu.

1 M. Chavent (1998), "A monothetic clustering method", Pattern Recognition Letters, 19, 989-996.
2 H. Blockeel, L. De Raedt, J. Ramon (1998), "Top-Down Induction of Clustering Trees", ICML, 55-63.


Selecting the attributes


In the next step, we select the attributes that we use to characterize the homogeneity
of the groups. We choose all the measured attributes; we do not use the TYPE attribute, which is
provided by the experts. We use the DEFINE STATUS component.

Feature construction
Computing a distance on discrete attributes is possible but not easy. Moreover, some
attributes may be redundant. We use factorial analysis to build a new representation
space that preserves, as much as possible, the proximities between individuals.

Because we have discrete attributes, we use multiple correspondence analysis (MCA). This
data transformation has several advantages: we can now use the classical Euclidean
distance, all the more so because the factorial axes (the latent variables) are uncorrelated; and by
selecting only the first 10 axes, we keep the "useful" information and set aside the "noisy"
information specific to this file (the artifact information in the dataset).

We add an MCA component to the diagram and set the number of produced axes to 10
(approximately half of the total number of axes).

Note: In the case of continuous attributes, we would follow the same principle and use a principal
component analysis (PCA) instead. We observe the same advantages.

We click on the VIEW contextual menu. The first 10 axes summarize 90% of the available information,
which is fully suitable.
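
Outside Tanagra, this step can be approximated in a few lines. The sketch below one-hot encodes the discrete attributes and reduces them with a PCA, which is only a rough stand-in for MCA (MCA applies correspondence analysis to the indicator matrix); the file name and the TYPE column name are assumptions.

# Approximate the MCA step: indicator coding of the discrete attributes,
# then projection on 10 axes. A dedicated MCA implementation would be
# closer to what Tanagra actually computes.
import pandas as pd
from sklearn.decomposition import PCA

zoo = pd.read_excel("zoo.xls")                       # assumed file name
X = zoo.drop(columns=["type"], errors="ignore")      # "type" = expert label (assumed name)
X_indicators = pd.get_dummies(X.astype(str))         # full disjunctive coding

pca = PCA(n_components=10)                           # keep ~half of the axes, as in the tutorial
axes = pca.fit_transform(X_indicators)

print("share of variance kept:", pca.explained_variance_ratio_.sum())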


Target and input attributes for clustering tree


In order to build groups, we want to split the dataset using the original attributes (INPUT); the
homogeneity of the groups is computed on the factorial axes (TARGET). We add a DEFINE STATUS
component to the diagram and set these parameters.

We obtain the following results (VIEW menu).


Note: In this tutorial, we use the same attributes for the homogeneity computation and for the
construction of the tree. In fact, we can use two separate sets of attributes. We then obtain a
generalization of decision/regression trees; some authors call this approach "multi-objective
regression/decision trees" or "predictive clustering trees".

Clustering trees
We add the clustering tree component to the diagram (CTP, CLUSTERING TREE WITH
PRUNING).


Roughly speaking, it is a generalization of the CART algorithm (Breiman et al., 1984) with two
specificities:
1. We compute inertia instead of variance to evaluate the homogeneity of the groups.
2. Our goal is not to produce an accurate prediction but to find "natural" groups. So we try to
detect the "angle" of the within-inertia curve computed on the pruning set. At present,
we fit a regression on 3 successive points and select the cut point for which the slope of the
fitted line is close to zero (see the sketch after this list).
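
A minimal sketch of this kind of elbow detection, assuming we already have the within-inertia values on the pruning set for 1, 2, 3, ... leaves; the exact rule implemented in Tanagra may differ.

import numpy as np

def elbow_size(within_inertia, tol=0.02):
    # within_inertia[k] = within-groups inertia of the tree with k+1 leaves (pruning set)
    y = np.asarray(within_inertia, dtype=float)
    for k in range(len(y) - 2):
        # slope of the regression line fitted on 3 successive points
        slope = np.polyfit(np.arange(k, k + 3), y[k:k + 3], deg=1)[0]
        if abs(slope) < tol:
            return k + 1          # number of leaves at the start of the flat zone
    return len(y)                 # no flat zone found: keep the largest tree

# Toy curve shaped like the tutorial's chart (made-up numbers, not the real ones)
print(elbow_size([1.0, 0.55, 0.35, 0.20, 0.19, 0.18, 0.17]))   # -> 4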

In this tutorial, we use 20% of the dataset as the pruning set; 80% of the examples are used for the
growing phase. We obtain the following clustering tree (VIEW menu).


We obtain 4 groups (the leaves of the tree); each cluster corresponds to the following rule:

If milk = true Then Cluster 1
If milk = false And feathers = false And backbone = true Then Cluster 2
If milk = false And feathers = false And backbone = false Then Cluster 3
If milk = false And feathers = true Then Cluster 4

It is very easy to assign a group to a new example with these rules.
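
These rules translate directly into code. Here is a small Python transcription of the four rules above; the boolean arguments correspond to the milk, feathers and backbone attributes of the zoo dataset.

# Direct transcription of the four rules produced by the clustering tree.
def assign_cluster(milk: bool, feathers: bool, backbone: bool) -> int:
    if milk:
        return 1
    if not feathers and backbone:
        return 2
    if not feathers and not backbone:
        return 3
    return 4   # milk = false and feathers = true

# Example: an animal producing milk falls into Cluster 1
print(assign_cluster(milk=True, feathers=False, backbone=True))   # -> 1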

We can also see the decrease of the within-class inertia according to the number of leaves
(groups), on the growing set and on the pruning set.

The 14-group clustering minimizes the within-inertia on the pruning set (green mark). But
we see an "angle" when we have 4 groups (red mark). The following chart shows the
variation of the within-inertia.


[Chart: within-groups inertia according to the number of clusters (1 to 15), on the growing set and on the pruning set]

Comparison with the classification of the domain expert


The experts suggest 7 groups. Our aim is to compare our 4-group clustering with this
classification. It is a good indicator of the relevance of our results.

We add a DEFINE STATUS component to the diagram. We set TYPE as TARGET and our
clustering result (CLUSTER_CTP_1) as INPUT. Then we add a CONTINGENCY CHI-
SQUARE component (NON PARAMETRIC STATISTICS tab) in order to compare the groups.
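
The same cross-tabulation can be reproduced outside Tanagra. Below is a sketch with pandas and scipy, assuming a data frame holding the expert label and the tree clusters; the column names "type" and "cluster_ctp" are placeholders.

import pandas as pd
from scipy.stats import chi2_contingency

def compare_to_expert(df: pd.DataFrame, expert_col: str = "type",
                      cluster_col: str = "cluster_ctp") -> None:
    # Cross-tabulate expert groups (rows) against tree clusters (columns)
    table = pd.crosstab(df[expert_col], df[cluster_col])
    print(table)
    # Chi-square test of association between the two partitions
    chi2, p_value, dof, _ = chi2_contingency(table)
    print(f"chi2 = {chi2:.1f}, dof = {dof}, p-value = {p_value:.4f}")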

Diagram:
Dataset (zoo.xls)
  Define status 1
    Multiple Correspondence Analysis 1
      Define status 2
        CTP 1
          Define status 3
            Contingency Chi-Square 1

We note that we have very similar groups.


Each expert group falls into a single cluster, and each cluster is either a pure group (Cluster 1 and
Cluster 4) or a mix of similar species (Cluster 2 and Cluster 3)3.

Comparison with K-MEANS clustering algorithm


The learning and representation bias of clustering trees can lead to less effective
solutions than well-known methods such as K-MEANS. In this step, we
compare the groups of CTP with the groups produced by K-MEANS.

We insert again a DEFINE STATUS component under the CTP (Clustering Tree) component.
We set the factorial axes as INPUT. We add the K-MEANS component, configured so
that the results of the two approaches (tree and k-means) are comparable: we ask for 4 groups,
and we do not normalize the factorial axes in the inertia computation.
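
For reference, the equivalent k-means run outside Tanagra could be sketched as follows; axes stands for the matrix of the 10 factor scores computed earlier, and the settings (4 clusters, no normalization of the axes) mirror the tutorial.

import numpy as np
from sklearn.cluster import KMeans

def kmeans_on_axes(axes: np.ndarray, n_clusters: int = 4, seed: int = 0) -> np.ndarray:
    # One label per animal, computed on the unnormalized factor scores
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(axes)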

3 I am not an expert!


We obtain the following results.

We want to compare these groups with the groups obtained with CTP.

We insert another DEFINE STATUS component in the diagram; we set the clusters of the
tree (CLUSTER_CTP_1) as TARGET and the clusters of K-MEANS (CLUSTER_KMEANS_1) as INPUT.

Then we add the contingency table component again in order to compare the two
approaches.

The two methods produce equivalent groupings; the gain in interpretability of the tree is not
counterbalanced by a degradation of the precision of the calculations. The other advantage of the
tree in this case is that it selects the relevant variables automatically.
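
Beyond the contingency table, the agreement between the two partitions can also be summarized by a single number. This is not part of the original tutorial, but the adjusted Rand index is a common choice: it equals 1 when the two clusterings are identical up to a relabeling of the groups.

from sklearn.metrics import adjusted_rand_score

def partition_agreement(labels_tree, labels_kmeans) -> float:
    # 1.0 = identical partitions, ~0.0 = agreement no better than chance
    return adjusted_rand_score(labels_tree, labels_kmeans)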

Visualization of groups
Factorial analysis allows us to visualize the dataset in a reduced-dimension space. We want
to see whether we can perceive the expert groups on the first two "latent" variables.

[Scatter plot: (X1) MCA_1_Axis_1 vs. (X2) MCA_1_Axis_2, points labeled by (Y) type: mammal, fish, bird, invertebrate, insect, amphibian, reptile]
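
A sketch of the same scatter plot with matplotlib, assuming the first two factor scores and the expert TYPE have been gathered in a data frame; the column names axis_1, axis_2 and type are placeholders.

import matplotlib.pyplot as plt
import pandas as pd

def plot_first_axes(df: pd.DataFrame) -> None:
    # One color per expert group on the first two factorial axes
    for animal_type, group in df.groupby("type"):
        plt.scatter(group["axis_1"], group["axis_2"], label=str(animal_type), s=20)
    plt.xlabel("MCA axis 1")
    plt.ylabel("MCA axis 2")
    plt.legend()
    plt.show()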


This result is edifying. The groups proposed by the experts are really distinct even on the first
two axes, which summarize only about 50% of the available information (see the MCA result,
44.89%).

This example shows that visual tools are often very powerful; the main difficulty is then
to come back to the initial description space and obtain results that are interpretable
in terms of the original descriptors. Reading the results of an MCA
remains obscure for people who are not accustomed to it.

The clustering tree approach is a simple method to build clusters automatically and obtain
interpretable results.
