Clustering for Classification
Reuben Evans
Contents
1 Introduction

2 Description
   2.1 Motivation
   2.2 Idea
   2.3 The Clusterers
       2.3.1 First K
       2.3.2 Simple K Means
       2.3.3 Farthest First
       2.3.4 Bisecting K Means
       2.3.5 Expectation Maximization (EM)

3 Experiments
   3.1 Algorithms
       3.1.1 Algorithms for Nominal Datasets
       3.1.2 Algorithms for Numeric Datasets
   3.2 Datasets
   3.3 Experiment One
       3.3.1 Nominal Datasets
       3.3.2 Numeric Datasets
   3.4 Experiment Two
       3.4.1 Nominal Datasets
       3.4.2 Numeric Datasets
   3.5 Experiment Three
   3.6 Experiment Four
   3.7 Experiment Five

4 Related Work
   4.1 Data Squishing
   4.2 Instance Selection

5 Conclusions
   5.0.1 Future Work
Chapter 1
Introduction
Chapter 2
Description
This chapter details the problem and describes an algorithm that addresses it.
2.1 Motivation
This section describes the reasons behind this research and what makes this
algorithm of interest. There are two main motivations: the dataset may be too
large to fit in memory, or the classifier may be too complex to train on the
full data in reasonable time. In both cases, reducing the size of the input
speeds up training.
2.2 Idea
The classifier handles nominal values by using the nominal-to-binary filter to
convert them into a number of binary attributes, so only numeric and binary
attributes are used in the filtering process. All attributes are normalized to
prevent some attributes from having more weight than others in the distance
calculations.
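As a concrete illustration, this preprocessing can be expressed with the standard Weka filters. The sketch below is a simplified, hypothetical wiring: the class name Preprocess and the choice of the unsupervised NominalToBinary and Normalize filters are assumptions for illustration, not taken from the actual implementation.

    import weka.core.Instances;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.NominalToBinary;
    import weka.filters.unsupervised.attribute.Normalize;

    public class Preprocess {

        // Convert nominal attributes into binary ones, then normalise every
        // attribute to [0, 1] so that no attribute dominates the distance
        // calculations used during clustering.
        public static Instances prepare(Instances data) throws Exception {
            NominalToBinary toBinary = new NominalToBinary();
            toBinary.setInputFormat(data);
            Instances binary = Filter.useFilter(data, toBinary);

            Normalize normalise = new Normalize();
            normalise.setInputFormat(binary);
            return Filter.useFilter(binary, normalise);
        }
    }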
For a numeric class the data is clustered directly, producing exactly as many
clusters as specified by the user in the C parameter to the classifier. For all
other class types the data is first separated into sets that each contain only
one class value, and the clustering process is then used on each set, resulting
in C clusters for each possible class value. When building the clusters, if
there are fewer instances than the desired number of clusters then the
instances themselves are returned as the clusters. Otherwise the instances are
randomized and a number of instances equal to the desired number of clusters
is taken from the start of the dataset as cluster centres.
Each of the remaining instances is then taken in turn and merged with the
closest cluster centre, where closeness is determined by the relative squared
Euclidean distance between the instance and each cluster centre. The cluster
centre is then updated so that each of its attribute values is the sum of the
weight-adjusted attribute values of the cluster centre and the instance: if the
centre had a weight of three and the instance had a weight of one, the
resulting attribute value would be three quarters of the value of the centre
plus one quarter of the value of the instance. The weight of the instance is
added to the weight of the centre to give the new centre weight.
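Writing w_c and w_i for the weights of the centre and the instance, and a_c and a_i for one of their attribute values (notation introduced here only for illustration), the update described above is

    a_c' = \frac{w_c a_c + w_i a_i}{w_c + w_i}, \qquad w_c' = w_c + w_i,

so with w_c = 3 and w_i = 1 the new value is (3/4) a_c + (1/4) a_i, matching the example.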
When all instances have been merged into the clusters, the set of clusters is
passed to the classifier specified by the user to build the model.
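Since the merged cluster centres form an ordinary (weighted) Weka dataset, handing them to the base classifier is a single call. The sketch below is illustrative only, with Naive Bayes standing in for whichever classifier the user specified.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class BuildModel {

        // Build the user-specified base classifier on the set of weighted
        // cluster centres produced by the clustering step.
        public static Classifier buildOnClusters(Instances clusterCentres) throws Exception {
            Classifier base = new NaiveBayes();   // stand-in for the user's choice
            base.buildClassifier(clusterCentres);
            return base;
        }
    }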
ClustersForClasses:
    Separate the instances into collections so that each collection contains only one class value
    Use ClustersForData on each collection to create its clusters
    Merge the resulting cluster sets from ClustersForData into one large set of clusters
    Return the clusters
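A minimal sketch of the separation step, assuming a nominal class and the Weka Instances API (the class and method names are illustrative):

    import java.util.ArrayList;
    import java.util.List;
    import weka.core.Instance;
    import weka.core.Instances;

    public class ClustersForClasses {

        // Separate the data into one collection per class value.  ClustersForData
        // (sketched in Section 2.3.1) would then be applied to each subset and
        // the resulting cluster sets merged into one.
        public static List<Instances> splitByClass(Instances data) {
            List<Instances> subsets = new ArrayList<>();
            for (int c = 0; c < data.numClasses(); c++) {
                subsets.add(new Instances(data, data.numInstances()));
            }
            for (int i = 0; i < data.numInstances(); i++) {
                Instance inst = data.instance(i);
                subsets.get((int) inst.classValue()).add(inst);
            }
            return subsets;
        }
    }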
2.3 The Clusterers

2.3.1 First K
First K is a very naive clusterer focused on speed. It declares the first K
instances encountered to be the cluster centres; each subsequent instance in
the dataset is then merged with the closest cluster centre by Euclidean
distance. The resulting clusters are certainly not the best clusters that could
have been obtained from the data, but they can be created in a single pass
through the data, which gives a clusterer that is linear in the number of
clusters and instances.
ClustersForData:
    IF the number of instances is less than or equal to the desired number of clusters (C)
        Return the instances as the clusters
    Randomize the instances
    Set the first C instances as the cluster centres
    For each remaining instance
        MergeWithClosest
    Return the clusters
MergeWithClosest:
    Set closest = the first cluster
    Set minD = relativeSquaredEuclideanDistance from the instance to the first cluster
    For each cluster
        IF relativeSquaredEuclideanDistance from the instance to the cluster < minD
            minD = relativeSquaredEuclideanDistance from the instance to the cluster
            closest = the cluster
    totalWeight = Weight(closest) + Weight(instance)
    w0 = Weight(closest) / totalWeight
    w1 = Weight(instance) / totalWeight
    For each attribute of closest
        attributeValue = w0 * attributeValue + w1 * instanceAttributeValue
    Weight(closest) = Weight(closest) + Weight(instance)
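Putting the two routines together, a minimal Java sketch of First K over the Weka Instance and Instances classes might look as follows. It is a simplified illustration rather than the actual implementation: the class name, the seed parameter, and the use of a plain squared Euclidean distance in place of the relative squared Euclidean distance are assumptions.

    import java.util.Random;
    import weka.core.Instance;
    import weka.core.Instances;

    public class FirstK {

        // Build at most c clusters in a single pass over the (already filtered
        // and normalised) data.
        public static Instances clustersForData(Instances data, int c, long seed) {
            // With no more instances than desired clusters, every instance is
            // its own cluster.
            if (data.numInstances() <= c) {
                return new Instances(data);
            }
            Instances work = new Instances(data);
            work.randomize(new Random(seed));

            // The first c (randomised) instances become the cluster centres.
            Instances centres = new Instances(work, 0, c);

            // Merge every remaining instance into its closest centre.
            for (int i = c; i < work.numInstances(); i++) {
                mergeWithClosest(centres, work.instance(i));
            }
            return centres;
        }

        // Merge an instance into the closest centre, weighting each attribute
        // value by the weights of the centre and the instance.
        private static void mergeWithClosest(Instances centres, Instance inst) {
            int closest = 0;
            double minD = squaredDistance(centres.instance(0), inst);
            for (int j = 1; j < centres.numInstances(); j++) {
                double d = squaredDistance(centres.instance(j), inst);
                if (d < minD) {
                    minD = d;
                    closest = j;
                }
            }
            Instance centre = centres.instance(closest);
            double total = centre.weight() + inst.weight();
            double w0 = centre.weight() / total;
            double w1 = inst.weight() / total;
            for (int a = 0; a < centre.numAttributes(); a++) {
                centre.setValue(a, w0 * centre.value(a) + w1 * inst.value(a));
            }
            centre.setWeight(total);
        }

        // Simplified squared Euclidean distance over all attribute values.
        private static double squaredDistance(Instance a, Instance b) {
            double sum = 0.0;
            for (int i = 0; i < a.numAttributes(); i++) {
                double diff = a.value(i) - b.value(i);
                sum += diff * diff;
            }
            return sum;
        }
    }

Because each instance is visited once and compared against at most C centres, the cost is proportional to the number of instances times the number of clusters, which matches the linear behaviour claimed above.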
Chapter 3
Experiments
3.1 Algorithms
This section describes the four different algorithms we used for testing and
why each of them was chosen. The experiments use two algorithms for each class
type: a simple one and a more complex one. Two algorithms were chosen because
earlier testing had shown that some clusterers worked better with the more
complex classifiers, while others worked better with the simpler ones.
Table 3.1: Algorithms used in testing

              Nominal                 Numeric
    Simple    Naive Bayes             Linear Regression
    Complex   Logistic Regression     M5 Model Trees
3.2 Datasets
The experiments use twenty different datasets: ten nominal and ten numeric.
Table 3.2: Datasets used in these experiments
3.4.1 Nominal Datasets
All nominal datasets were used.
Arrg here be results!! - eventually
Chapter 4
Related Work
This chapter describes how the Clustered Meta Instance Classifier fits in with
other work.
Chapter 5
Conclusions