Clustering for Classification
Reuben Evans
Contents
1 Introduction

2 Description
   2.1 Motivation
   2.2 Idea
   2.3 The Clusterers
       2.3.1 First K
       2.3.2 Simple K Means
       2.3.3 Farthest First
       2.3.4 Bisecting K Means
       2.3.5 Expectation Maximization (EM)

3 Experiments
   3.1 Algorithms
       3.1.1 Algorithms for Nominal Datasets
       3.1.2 Algorithms for Numeric Datasets
   3.2 Datasets
   3.3 Experiment One
       3.3.1 Nominal Datasets
       3.3.2 Numeric Datasets
   3.4 Experiment Two
       3.4.1 Nominal Datasets
       3.4.2 Numeric Datasets
   3.5 Experiment Three
   3.6 Experiment Four
   3.7 Experiment Five

4 Related Work
   4.1 Data Squishing
   4.2 Instance Selection

5 Conclusions
   5.0.1 Future Work
Chapter 1
Introduction
Chapter 2
Description
This chapter details the problem and describes an algorithm that addresses it.
2.1 Motivation
This section describes the reasons behind this research and what makes this
algorithm of interest. There are two main motivations: the dataset may be too
large to fit in memory, or the classifier may be too complex to train on the
full data in reasonable time. In both cases, reducing the size of the input
speeds up training.
2.2 Idea
The classifier handles nominal values by using the nominal-to-binary filter to
convert them into a number of binary attributes, so only numeric and binary
attributes are used in the filtering process. All attributes are normalized to
prevent some attributes from having more weight than others in the distance
calculations.
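As a concrete illustration, this preprocessing can be expressed with the standard Weka filters. The sketch below is a simplified, hypothetical wiring: the class name Preprocess and the choice of the unsupervised NominalToBinary and Normalize filters are assumptions for illustration, not taken from the actual implementation.

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NominalToBinary;
    import weka.filters.unsupervised.attribute.Normalize;

    public class Preprocess {

        // Convert nominal attributes into binary ones, then normalise every
        // attribute to [0, 1] so that no attribute dominates the distance
        // calculations used during clustering.
        public static Instances prepare(Instances data) throws Exception {
            NominalToBinary toBinary = new NominalToBinary();
            toBinary.setInputFormat(data);
            Instances binary = Filter.useFilter(data, toBinary);

            Normalize normalise = new Normalize();
            normalise.setInputFormat(binary);
            return Filter.useFilter(binary, normalise);
        }
    }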
For a numeric class the data is clustered directly, producing exactly as many
clusters as specified by the user in the C parameter to the classifier. For all
other class types the data is first separated into sets that each contain only
one class value, and the clustering process is then used on each set, resulting
in C clusters for each possible class value. When building the clusters, if
there are fewer instances than the desired number of clusters then the
instances themselves are returned as the clusters. Otherwise the instances are
randomized and a number of instances equal to the desired number of clusters
is taken from the start of the dataset as cluster centres.
Each of the remaining instances is then taken in turn and merged with the
closest cluster centre, where closeness is determined by the relative squared
Euclidean distance between the instance and each cluster centre. The cluster
centre is then updated so that each of its attribute values is the sum of the
weight-adjusted attribute values of the cluster centre and the instance: if the
centre had a weight of three and the instance had a weight of one, the
resulting attribute value would be three quarters of the value of the centre
plus one quarter of the value of the instance. The weight of the instance is
added to the weight of the centre to give the new centre weight.
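Writing w_c and w_i for the weights of the centre and the instance, and a_c and a_i for one of their attribute values (notation introduced here only for illustration), the update described above is

    a_c' = \frac{w_c a_c + w_i a_i}{w_c + w_i}, \qquad w_c' = w_c + w_i,

so with w_c = 3 and w_i = 1 the new value is (3/4) a_c + (1/4) a_i, matching the example.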
When all instances have been merged into the clusters, the set of clusters is
passed to the classifier specified by the user to build the model.
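Since the merged cluster centres form an ordinary (weighted) Weka dataset, handing them to the base classifier is a single call. The sketch below is illustrative only, with Naive Bayes standing in for whichever classifier the user specified.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class BuildModel {

        // Build the user-specified base classifier on the set of weighted
        // cluster centres produced by the clustering step.
        public static Classifier buildOnClusters(Instances clusterCentres) throws Exception {
            Classifier base = new NaiveBayes();   // stand-in for the user's choice
            base.buildClassifier(clusterCentres);
            return base;
        }
    }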
ClustersForClasses:
    Separate the instances into collections so that each collection contains only one class value
    Use ClustersForData on each collection to create its clusters
    Merge the resulting cluster sets from ClustersForData into one large set of clusters
    Return the clusters
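A minimal sketch of the separation step, assuming a nominal class and the Weka Instances API (the class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import weka.core.Instance;
    import weka.core.Instances;

    public class ClustersForClasses {

        // Separate the data into one collection per class value.  ClustersForData
        // (sketched in Section 2.3.1) would then be applied to each subset and
        // the resulting cluster sets merged into one.
        public static List<Instances> splitByClass(Instances data) {
            List<Instances> subsets = new ArrayList<>();
            for (int c = 0; c < data.numClasses(); c++) {
                subsets.add(new Instances(data, data.numInstances()));
            }
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                subsets.get((int) inst.classValue()).add(inst);
            }
            return subsets;
        }
    }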
2.3 The Clusterers

2.3.1 First K
First K is a very naive clusterer focused on speed. It declares the first K
instances encountered to be the cluster centres; each subsequent instance in
the dataset is then merged with the closest cluster centre by Euclidean
distance. The resulting clusters are certainly not the best clusters that could
have been obtained from the data, but they can be created in a single pass
through the data, which gives a clusterer that is linear in the number of
clusters and instances.
ClustersForData:
    IF the number of instances is less than or equal to the desired number of clusters (C)
        Return the instances as the clusters
    Randomize the instances
    Set the first C instances as the cluster centres
    For each remaining instance
        MergeWithClosest
    Return the clusters
MergeWithClosest:
    Set closest = the first cluster
    Set minD = relativeSquaredEuclideanDistance from the instance to the first cluster
    For each cluster
        IF relativeSquaredEuclideanDistance from the instance to the cluster < minD
            minD = relativeSquaredEuclideanDistance from the instance to the cluster
            closest = the cluster
    totalWeight = Weight(closest) + Weight(instance)
    w0 = Weight(closest) / totalWeight
    w1 = Weight(instance) / totalWeight
    For each attribute of closest
        attributeValue = w0 * attributeValue + w1 * instanceAttributeValue
    Weight(closest) = Weight(closest) + Weight(instance)
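Putting the two routines together, a minimal Java sketch of First K over the Weka Instance and Instances classes might look as follows. It is a simplified illustration rather than the actual implementation: the class name, the seed parameter, and the use of a plain squared Euclidean distance in place of the relative squared Euclidean distance are assumptions.

    import java.util.Random;
    import weka.core.Instance;
    import weka.core.Instances;

    public class FirstK {

        // Build at most c clusters in a single pass over the (already filtered
        // and normalised) data.
        public static Instances clustersForData(Instances data, int c, long seed) {
            // With no more instances than desired clusters, every instance is
            // its own cluster.
            if (data.numInstances() <= c) {
                return new Instances(data);
            }
            Instances work = new Instances(data);
            work.randomize(new Random(seed));

            // The first c (randomised) instances become the cluster centres.
            Instances centres = new Instances(work, 0, c);

            // Merge every remaining instance into its closest centre.
            for (int i = c; i < work.numInstances(); i++) {
                mergeWithClosest(centres, work.instance(i));
            }
            return centres;
        }

        // Merge an instance into the closest centre, weighting each attribute
        // value by the weights of the centre and the instance.
        private static void mergeWithClosest(Instances centres, Instance inst) {
            int closest = 0;
            double minD = squaredDistance(centres.instance(0), inst);
            for (int j = 1; j < centres.numInstances(); j++) {
                double d = squaredDistance(centres.instance(j), inst);
                if (d < minD) {
                    minD = d;
                    closest = j;
                }
            }
            Instance centre = centres.instance(closest);
            double total = centre.weight() + inst.weight();
            double w0 = centre.weight() / total;
            double w1 = inst.weight() / total;
            for (int a = 0; a < centre.numAttributes(); a++) {
                centre.setValue(a, w0 * centre.value(a) + w1 * inst.value(a));
            }
            centre.setWeight(total);
        }

        // Simplified squared Euclidean distance over all attribute values.
        private static double squaredDistance(Instance a, Instance b) {
            double sum = 0.0;
            for (int i = 0; i < a.numAttributes(); i++) {
                double diff = a.value(i) - b.value(i);
                sum += diff * diff;
            }
            return sum;
        }
    }

Because each instance is visited once and compared against at most C centres, the cost is proportional to the number of instances times the number of clusters, which matches the linear behaviour claimed above.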
Chapter 3
Experiments
3.1 Algorithms
This section describes the four different algorithms we used for testing and
why each of them was chosen. The experiments use two algorithms for each class
type: a simple one and a more complex one. Two algorithms were chosen because
earlier testing had shown that some clusterers worked better with the more
complex classifiers, while others worked better with the simpler ones.
Table 3.1: Algorithms used in testing

              Nominal                 Numeric
    Simple    Naive Bayes             Linear Regression
    Complex   Logistic Regression     M5 Model Trees
3.2 Datasets
The experiments use twenty different datasets: ten nominal and ten numeric.
Table 3.2: Datasets used in these experiments
3.4.1 Nominal Datasets
All nominal datasets were used.
Arrg here be results!! - eventually
Chapter 4
Related Work
This chapter describes how the Clustered Meta Instance Classifier fits in with
other work.
Chapter 5
Conclusions