Evaluation of Clustering Methods For Mining Duplicate Image Groups
Keywords: Clustering, duplicate image groups, hierarchical clustering method, K-means clustering method, distribution-based method, DBSCAN.
1. Introduction
Clustering is one of the most fundamental tasks in data mining. Clustering is used for unsupervised learning; the goal of a clustering method is descriptive, and the new groups are of interest in themselves. Data are grouped into subsets in such a manner that similar images are grouped together, while different images belong to different groups. In [1], social networks are used for clustering image groups. Social networks are very popular among young people for gaining information, especially with the development of smartphones and the internet. People tend to forward, share and follow what they are interested in. On favorite image-sharing websites such as Flickr, the total number of images is more than 6 billion. Facebook has gathered about one billion users, and about 0.25 billion images are uploaded per day. The major problem is that managing such big image data is very challenging for effective indexing and retrieval. Most users share large numbers of images of their places of interest. Meanwhile, some of the images are modified, forwarded and copied by other users before being shared in social communities.
Philbin models images as nodes of a graph and image-to-image similarity as an edge between the corresponding nodes. A clustering-based approach is adopted to divide the set into smaller groups containing near-duplicate image groups. In graph-based NDIG (near-duplicate image group) detection, the weights of the edges are merely measured by the co-occurrence of visual words, which neglects the context information between images. Moreover, computing image-to-image similarity is too time-consuming for a large-scale dataset.
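The graph-based grouping idea can be sketched as follows. This is a minimal illustration, not Philbin's actual implementation: the function name, the similarity callback and the threshold value are all assumptions. Images become nodes, an edge is added when pairwise similarity exceeds a threshold, and connected components of the graph are taken as candidate near-duplicate groups.

```python
from itertools import combinations

def near_duplicate_groups(images, similarity, threshold=0.8):
    """Group images whose pairwise similarity exceeds `threshold`.

    `images` is a list of hashable image identifiers and `similarity`
    a function returning a score in [0, 1]; connected components of
    the resulting similarity graph are the candidate duplicate groups.
    """
    # Build adjacency lists: an edge means "probably near-duplicates".
    adj = {img: [] for img in images}
    for a, b in combinations(images, 2):
        if similarity(a, b) >= threshold:
            adj[a].append(b)
            adj[b].append(a)

    # Depth-first search over the graph to collect connected components.
    seen, groups = set(), []
    for img in images:
        if img in seen:
            continue
        stack, comp = [img], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.append(node)
            stack.extend(adj[node])
        groups.append(comp)
    return groups
```

As the text notes, computing all pairwise similarities is quadratic in the number of images, which is exactly why this direct approach becomes too slow at large scale.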
Hu et al. proposed a coherent phrase model for near-duplicate image retrieval. Different from the standard BoW model, their model represents every local region using multiple descriptors and enforces coherency across the descriptors [2]. A spatial coherent phrase and a feature coherent phrase are designed to represent spatial and feature coherency, respectively. They mention that a near-duplicate image retrieval approach is hard to adapt to the task of near-duplicate image group detection.
Gao et al. simultaneously utilize both textual and visual information to estimate image relevance, which is determined with a hypergraph learning approach [4]. Additionally, they propose an interactive 3-D object retrieval scheme. Wang et al. obtain relevant images by exploring the image content and the associated tags, with a greedy ordering algorithm that optimizes average diverse precision as the ranking method.

Correspondence:
Ramya S
CSE, Sri Krishna College of Technology, Coimbatore, Tamil Nadu, India

In the present paper a study of various clustering techniques has been made. Section 2 deals
International Journal of Multidisciplinary Research and Development
with a study on various clustering methods: section 2.1 deals with the K-means clustering algorithm, section 2.2 with centroid-based clustering, section 2.3 with the hierarchical clustering method, section 2.4 with agglomerative hierarchical clustering, section 2.5 with distribution-based clustering methods, and section 2.6 with DBSCAN. Section 3 deals with results and discussion. Finally, section 4 concludes the paper.

2. Various Clustering Methods
2.1 K-Means Clustering Method
K-means is an algorithm to classify or group objects based on features/attributes into K groups, where K is a positive integer. It minimizes the sum of squared distances between each cluster centroid and the corresponding cluster data. The main purpose of K-means clustering is to classify the data. Image classification has a major role in the field of mining analysis, and cluster analysis is useful for image classification and object diagnosis. In image processing, various clustering algorithms are used for image classification. The K-means algorithm splits the given image into different clusters of pixels in the feature space, each of them defined by its center. Each pixel in the image is allocated to the nearest cluster; then the new centers are computed from the resulting clusters, and these steps are repeated until convergence. First of all, we have to find out the number of clusters K. Then the centroids for these clusters are assumed: random objects, or the first K objects in sequence, could serve as the initial centroids. The K-means algorithm, in a logical representation, executes the following steps until convergence, i.e., while no object moves to another group [3]:
a) First, the centroid coordinates are determined (random assignment).
b) Calculate the distance of each object (pixel) to the centroids.
c) Based on minimum distance, group the objects with the nearest centroid.

Steps:
1. If the number of data points < the number of clusters, then assign each data point as a centroid.
2. Each centroid gets a cluster number.
3. If the number of data points > the number of clusters, then for each data point calculate the minimum distance to all centroids and assign the point to the nearest cluster.
4. Since the centroid locations are not determined correctly at first, the centroid locations must be adjusted based on the current updated data.
5. Then assign all data points to the new centroids. Repeat these steps until no data point moves to another cluster.
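The steps above can be sketched in plain Python. This is a minimal illustrative version for 2-D points with Euclidean distance, using the first K points as initial centroids as the text suggests; the function name and parameters are assumptions for the example.

```python
import math

def kmeans(points, k, iterations=100):
    """Minimal k-means on 2-D points: assign each point to the nearest
    centroid, recompute centroids as cluster means, and repeat until no
    point changes cluster (convergence) or `iterations` is reached."""
    # Use the first k points as initial centroids, as described above.
    centroids = [points[i] for i in range(k)]
    assignment = [None] * len(points)
    for _ in range(iterations):
        moved = False
        # Assignment step: nearest centroid by Euclidean distance.
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                moved = True
        if not moved:  # no object moved to another group: converged
            break
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = [p for i, p in enumerate(points) if assignment[i] == j]
            if members:
                centroids[j] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
    return assignment, centroids
```

For image clustering, the same loop is applied with pixels (e.g. RGB triples) in place of 2-D points.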
2.2 Centroid-Based Clustering
In centroid-based clustering, clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering gives a formal definition as an optimization problem: find the k cluster centers and allocate the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized:

E = ∑_{i=1}^{k} ∑_{p ∈ c_i} dist(p, C_i)²

where E is the sum of the squared error for all objects in the data set, p is the point in space representing an object, and C_i is the centroid of cluster c_i. The optimization problem is known to be NP-hard, and thus the common approach is to search only for approximate solutions. A particularly well-known approximate method is referred to as the "k-means algorithm". It finds only a local optimum, and is commonly run multiple times with different random initializations. Some variations of k-means apply optimizations such as choosing the best of multiple runs, restricting the centroids to members of the data set (k-medoids), choosing medians (k-medians clustering), choosing the initial centers less randomly (K-means++), or allowing a fuzzy cluster assignment (Fuzzy c-means) [4-5].
One of the biggest drawbacks of these k-means algorithms is that they require the number of clusters to be specified in advance.
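The error criterion E can be computed directly from a set of cluster assignments. The short sketch below is only a worked instance of the formula, for 2-D points; the function name and the example values are made up for illustration.

```python
def sum_squared_error(clusters):
    """E = sum over clusters c_i of sum over points p in c_i of
    dist(p, C_i)^2, where C_i is the centroid (mean) of cluster c_i.
    `clusters` is a list of clusters, each a list of 2-D points."""
    E = 0.0
    for members in clusters:
        # Centroid C_i is the component-wise mean of the cluster.
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        # Squared Euclidean distance of each member to the centroid.
        E += sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in members)
    return E
```

For example, a single cluster {(0, 0), (2, 0)} has centroid (1, 0), so E = 1 + 1 = 2.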
Additionally, the k-means algorithms prefer clusters of approximately similar size, and always assign an object to the nearest centroid. This algorithm optimizes cluster centers, not cluster borders.

2.3 Hierarchical Clustering Method
Hierarchical clustering is a method of cluster analysis which builds a hierarchy of clusters. There are two types of hierarchical methods:
Agglomerative: This method is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
Divisive: This method is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
Given a set of N objects to be clustered, and an N*N distance matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each object to a cluster, so that if you have N objects, you now have N clusters, each containing just one object. Let the distances between the clusters be the same as the distances between the objects they contain.
2. Find the most similar pair of clusters and merge them into a single cluster, so that you have one cluster less.
3. Compute similarities between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.

2.4 Agglomerative Hierarchical Clustering
The data are grouped one by one on the basis of the nearest distance measure, using all the pairwise distances between the data points. The distances between the data points are then recalculated, but which distance should be considered when groups have been formed? Several methods are available for this. Some of them are:
1. Single linkage or single-nearest distance
2. Complete linkage or complete-farthest distance
3. Average linkage or average-average distance
4. Centroid distance
5. Ward's method, in which the sum of squared Euclidean distances is minimized
In this way we keep grouping the data until one cluster is formed. We can then determine how many clusters should actually be present on the basis of the dendrogram graph [6].

Fig 2: Agglomerative clustering on data objects

Single-link clustering is also called the nearest neighbor or minimum method. It considers the distance between two clusters to be equal to the shortest distance from one cluster to the other; equivalently, the similarity between a pair of clusters is considered to be equal to the greatest similarity from one cluster to the other.
Complete-link clustering is also called the furthest neighbor or maximum method. It considers the distance between two clusters to be equal to the longest distance from one cluster to the other.
Average-link clustering is also called the minimum variance method. It considers the distance between two clusters to be equal to the average distance from one cluster to the other.
Hierarchical methods are characterized by the following advantages:
Versatility - The single-link methods, for example, maintain good performance on data sets containing separated, chain-like and concentric clusters.
Multiple partitions - Hierarchical methods produce multiple nested partitions rather than only one, which allows different partitions to be chosen by different users according to the preferred similarity level.
The main disadvantages of the hierarchical methods are:
Inability to scale well - The time complexity is at least O(m²), where m is the total number of instances; using a hierarchical algorithm to cluster a large number of objects is characterized by huge input/output costs. There is also no back-tracking capability in hierarchical methods.

2.5 Distribution-Based Clustering
In distribution models, the clustering model is closely related to statistics. Clusters can be defined as the objects most likely belonging to the same distribution. A convenient property of this approach is that it closely resembles the way artificial data sets are generated: by sampling random objects from a distribution. Although the theoretical foundation of these distribution methods is excellent, they suffer from one key problem, known as overfitting, if no constraints are put on the model complexity. A more complex model will usually be able to explain the data better, which makes choosing a suitable model complexity inherently difficult [7].
Gaussian mixture models are a prominent method, fitted using the expectation-maximization algorithm. To handle the overfitting problem, a fixed number of Gaussian distributions are initialized randomly, and their parameters are iteratively optimized to fit the data set better. This converges to a local optimum, so multiple runs may produce different results. For hard clustering, objects are assigned to the most likely Gaussian distribution; for soft clustering, this is not necessary. Distribution-based clustering produces complex models for clusters that can capture dependence and correlation between attributes.
Spatial Database Systems (SDBS) [9] are used for the management of spatial data such as points and polygons. In the task of clustering in spatial databases, the main problem is to detect clusters of points which are distributed as a Poisson point process. This type of distribution is referred to as a random or uniform distribution.
The application of clustering to large spatial databases raises the following requirements for clustering algorithms:
1. In many applications, appropriate parameter values are not known in advance, so a minimal number of input parameters should be required.
2. We usually do not know the density, shape and number of clusters, for example in the application of detecting minefields.
3. The shape of clusters in spatial databases may be spherical, elongated, drawn-out, etc., so clusters of arbitrary shape should be discovered.
4. Good efficiency on large databases should be provided.
Consider the problem of detecting surface-laid minefields on the basis of an image from a reconnaissance aircraft. After processing, the image is reduced to a set of points; some of the points may be mines, and some may be noise, such as rocks or other metal objects. The main aim of the analysis is to find out whether minefields are present or not.

2.6 DBSCAN
The algorithm Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is designed to discover spatial data clusters with noise.
The steps involved in this algorithm are as follows:
1. Choose an arbitrary point p.
2. Retrieve all points that are density-reachable from p with respect to Eps and MinPts.
3. A cluster is formed when p is a core point.
4. DBSCAN visits the next point of the database when p is a border point and no points are density-reachable from p.
5. Keep on the process until all the points have been processed.
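The steps above can be sketched in plain Python. This is a simplified, illustrative implementation, not the original algorithm's pseudocode: Euclidean distance is assumed, the parameters eps and min_pts mirror Eps and MinPts, and neighborhoods are found by a linear scan rather than a spatial index.

```python
import math

def dbscan(points, eps, min_pts):
    """Simplified DBSCAN: returns labels[i] = cluster id, or -1 for noise.
    A point is a core point if at least `min_pts` points (itself
    included) lie within distance `eps`; clusters grow by repeatedly
    adding the neighborhoods of core points (density-reachability)."""
    labels = [None] * len(points)   # None = not yet visited
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(len(points))
                     if math.dist(points[i], points[j]) <= eps]
        if len(neighbors) < min_pts:
            labels[i] = -1          # noise (possibly a border point later)
            continue
        cluster += 1                # p is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:                # expand via density-reachable points
            j = queue.pop()
            if labels[j] == -1:     # previously noise: becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [m for m in range(len(points))
                           if math.dist(points[j], points[m]) <= eps]
            if len(j_neighbors) >= min_pts:   # j is also a core point
                queue.extend(j_neighbors)
    return labels
```

With a spatial index for the neighborhood queries, the same procedure is what makes DBSCAN efficient on large spatial databases.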
3. Results and Discussion
Fig: Performance index (precision, recall and F-measure) of hierarchical clustering, distribution-based clustering, K-means and DBSCAN.
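One common way to compute precision, recall and F-measure for a clustering, as compared in the figure, is over pairs of items: a pair counts as positive when both items share a cluster label. This is a generic illustration under that assumption; the paper's exact evaluation protocol is not shown in this excerpt, and the function name is made up.

```python
from itertools import combinations

def pairwise_prf(truth, predicted):
    """Pairwise precision/recall/F-measure for a clustering.
    `truth` and `predicted` map each item to a cluster label; a pair of
    items is positive in a clustering when both share the same label."""
    items = list(truth)
    true_pairs = {(a, b) for a, b in combinations(items, 2)
                  if truth[a] == truth[b]}
    pred_pairs = {(a, b) for a, b in combinations(items, 2)
                  if predicted[a] == predicted[b]}
    tp = len(true_pairs & pred_pairs)      # pairs correct in both
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

For example, merging two true groups into one predicted cluster keeps recall at 1.0 while lowering precision, and the F-measure balances the two.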