Genetic Algorithm Applied in Clustering Datasets
James Cunha Werner
SCISM
South Bank University
103 Borough Road
London SE1 0AA
Abstract. This paper compares the k-means clustering technique and two
different Genetic Algorithm approaches on a sample dataset and on the
EachMovie dataset. The comparison between the techniques shows better
results for the GA.
Introduction.
Unsupervised learning is a technique for extracting knowledge from datasets where there is no correct
solution against which to compare the estimated output of the model. The best output or model groups the
elements by some attribute, building clusters with minimum total distance under a chosen metric.
The final result contains a set of elements that form a base for all elements and minimise the adopted
metric. All events in one cluster have some attribute in common: kind of movie or book, taste, smell,
diagnostic, etc. This information helps to recommend similar things with the same characteristics to a
customer, bringing up new business opportunities.
Preparing data for this kind of problem requires some non-objective data acquisition. For example,
the classification of a movie differs from person to person between classes such as animation and family,
or sometimes horror and comedy, due to different points of view, social class and personality. It is
difficult to establish a classification for the users too.
To address this, a large number of users is required to establish an average classification for each
product, with a like/dislike mark or a rating. The next step consists in obtaining the sets that
group all events with minimum distance, which is the objective of clustering techniques.
This paper describes the k-means and genetic algorithm techniques and their application to one example
for a comparative study, and the application of genetic algorithms to the EachMovie dataset for a proper
performance study.
K-means algorithm.
This algorithm is based on an adaptive approach in which a random set of cluster centres is selected from
the original dataset, and each element updates the nearest centre of the base with the average of its
attributes (see [1] for a detailed description). To study this algorithm we used software available on
the Internet with an example dataset (see [2]).
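The procedure described above can be sketched as follows (a minimal Lloyd-style k-means in Python with NumPy, assuming normalized numeric attributes; an illustration, not the software of [2]):

```python
import numpy as np

def kmeans(data, k, iters=100, seed=0):
    """Minimal k-means: pick k random elements as initial centres,
    then alternate assignment and centre-update steps."""
    rng = np.random.default_rng(seed)
    centres = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each element to its nearest centre (Euclidean distance).
        dists = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centre to the average of its members.
        for j in range(k):
            members = data[labels == j]
            if len(members):
                centres[j] = members.mean(axis=0)
    return centres, labels
```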
Genetic algorithm.
Genetic algorithms (GA) mimic the evolution and improvement of life through reproduction, in which
each individual contributes its own genetic information to build a new one better fitted to the
environment, with greater chances of survival. These are the bases of genetic algorithms and genetic
programming (Holland [3], Goldberg [4] and Koza [5]). Specialized Markov chains underlie the theoretical
basis of this algorithm's changes of state and searching procedures.
Each ‘individual’ of the generation represents a feasible solution to the problem, coding the distinct
algorithms/parameters to be evaluated by a fitness function.
The GA operators are mutation (the change of a random position of the chromosome) and crossover (the
exchange of slices of chromosome between parents).
The best individuals are continuously selected, and crossover and mutation take place. After a
few generations, the population converges to the solution that best satisfies the performance function.
Genetic algorithm concepts and implementation are structured in the following diagram.

Step 1: Population initialization

Step 2: Crossover
    11010 11111001100 ---\  11010 10100110110
    10000 10100110110 ---/  10000 11111001100

Step 3: Mutation
    1101010100110110 -> 1111010100100110   (z = 8.044)
    1000011111001100 -> 1000011111001100   (z = 6.092)

Step 4: Reinsertion
    Second population       Objective value z = f(x,y)
    ==> 1111010100100110    8.044
    ==> 1000011111001100    6.092
        1000010100110110    6.261380
        1101011111001100    12.864222
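The crossover and mutation operators illustrated in the steps above can be sketched on bit-string chromosomes as follows (a simplified illustration in Python, not the exact implementation used in this work):

```python
import random

def crossover(parent_a, parent_b, point):
    """Exchange the tails of two bit-string chromosomes after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chrom, rate=0.05, rng=random):
    """Flip each bit independently with probability `rate`."""
    return ''.join(b if rng.random() > rate else ('1' if b == '0' else '0')
                   for b in chrom)

# The crossover example from Step 2, cut after the fifth bit:
a, b = crossover('1101011111001100', '1000010100110110', point=5)
# a == '1101010100110110', b == '1000011111001100'
```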
Comparative example.
The data available with the k-means software consist of the name of each element and 10 attributes. The
distance is the Euclidean distance between normalized elements.
GA can be applied with two different approaches:
1. Coding each component of the cluster base in the chromosome: in this case, GA must find each
component of each cluster centre, resulting in a chromosome whose size is the number of attributes
times the number of clusters in the base.
2. The chromosome contains, for each element, the cluster to which it belongs, and each centre is the
average of all its members. The chromosome size is the number of elements.
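The shapes of the two encodings can be sketched as follows (illustrative only, assuming NumPy arrays and arbitrary example sizes):

```python
import numpy as np

n_elements, n_attributes, n_clusters = 100, 10, 5
rng = np.random.default_rng(0)
data = rng.random((n_elements, n_attributes))

# Option 1: the chromosome holds the cluster centres directly;
# its length is n_attributes * n_clusters real-valued genes.
chromosome_1 = rng.random(n_attributes * n_clusters)
centres_1 = chromosome_1.reshape(n_clusters, n_attributes)

# Option 2: the chromosome holds, for each element, the index of the
# cluster it belongs to; each centre is the average of its members.
chromosome_2 = rng.integers(0, n_clusters, size=n_elements)
centres_2 = np.array([data[chromosome_2 == j].mean(axis=0)
                      for j in range(n_clusters)])
```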
The fitness function in both cases is:

    Fitness = 1.0 / ( Σ_mark  min_cluster_j  Σ_attribute_i ( pattern_i − cluster_j,i )² )
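This fitness can be computed as follows (a sketch assuming NumPy arrays of normalized patterns and cluster centres): the reciprocal of the total squared Euclidean distance of each pattern to its nearest cluster centre.

```python
import numpy as np

def fitness(patterns, centres):
    """1 / (sum over patterns of the squared distance, summed over
    attributes, to the nearest cluster centre)."""
    # dists[p, j] = sum_i (patterns[p, i] - centres[j, i]) ** 2
    dists = ((patterns[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / dists.min(axis=1).sum()
```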
The same fitness function can be evaluated for all approaches: the final value for k-means was
0.138; GA using the number of the cluster for each element (option 2) gave a final fitness of 0.180, and
coding each element of the base (option 1) gave 0.189.
In both cases the GA results are more accurate than the k-means approach.
To apply GA to a real clustering problem, we used the EachMovie dataset available on the Internet [6],
which contains 1,732,627 votes by 47,636 people for 1,628 movies. The data were filtered, keeping only
the correct and complete records, from which we normalised and selected the following attributes:
• Person (person.txt) provides optional, unaudited demographic data supplied by each person:
ID: Number -- primary key
Age: Number
Gender: Text -- one of "M", "F"
• Movie (movie.txt) provides descriptive information about each movie:
ID: Number -- primary key
Name: Text
Type: Action, Animation, Art_Foreign, Classic, Comedy, Drama, Family, Horror,
Romance, Thriller: Yes/No
• Vote (vote.txt) is the actual rating data:
Person_ID: Number
Movie_ID: Number
Mark or Score: Number -- 0 <= Score <= 1
We adopted the approach of determining the cluster base for 40 different groups and 3 attributes: sex,
age, and type. Type is obtained by associating a 10-bit sequence with the movie types:
for example, Action is 1 in base 2 (1 in base 10), Art_Foreign is 100 in base 2 (4 in base 10), etc.
Table I shows the elements of the most important groups.
                         Male                          Female
Cluster                  16                            22
Age                      17                            11
Types                    Animation, Classic,           Animation, Art_Foreign,
                         Comedy, Drama                 Classic, Comedy
# Patterns in Cluster    118190                        22619
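The 10-bit type encoding described above can be sketched as follows (the genre-to-bit order is assumed to follow the movie.txt field order, which matches the examples given in the text; an illustration, not the exact code used):

```python
GENRES = ['Action', 'Animation', 'Art_Foreign', 'Classic', 'Comedy',
          'Drama', 'Family', 'Horror', 'Romance', 'Thriller']

def encode_type(flags):
    """Pack ten Yes/No genre flags into one integer: Action is bit 0
    (value 1), Art_Foreign is bit 2 (value 4), Drama is bit 5
    (value 32), and so on."""
    return sum(1 << i for i, g in enumerate(GENRES) if flags.get(g))

# e.g. a drama that is also an action movie encodes as 32 + 1 = 33.
code = encode_type({'Drama': True, 'Action': True})
```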
The least initial distance in the first generation was 62,321, and the final result after convergence was
7,098. The result shows that if I (a 43-year-old man) am in Blockbuster renting “Dances with
Wolves”, I should be invited to rent “The Fugitive” and “True Lies”, which in my opinion is satisfactory.
Conclusion
A class of problems without a training dataset, where the relation between stimulus and response is not
formally defined, can be addressed by clustering algorithms.
Working with clustering means defining a distance metric and the features by which each event will be
grouped to obtain the least distance over the whole set. In the EachMovie application, the definition of
the movie type as 10 adjacent bits is a feature definition that affects the solution because, for example,
the distance between drama (32 base 10 or 100000 base 2) and drama-and-action (33 base 10 or
100001 base 2) is not the same as (or comparable to) that between animation (2 base 10) and action
(1 base 10). The definition of the distance as Euclidean affects this as well.
The application of genetic algorithms to clustering problems shows better results than the k-means
approach in a controlled example, and the clustering results for different movies are
reasonable in the author's judgement.
However, the classification, the data acquisition and the clusters are completely subjective: there is no
correct answer. There is no function able to assign the same classification of products for every
individual, but it can recommend other alternatives, and new business opportunities would be created.
References.