
Al-Nahrain Journal of Science, Vol. 21 (4), December 2018, pp. 76-82

Image Classification Using Bag of Visual Words (BoVW)

Abdul Amir Abdullah Karim1 and Rafal Ali Sameer2
1 Department of Computers, University of Technology, Baghdad, Iraq.
2 Department of Computers, College of Science, University of Baghdad, Baghdad, Iraq.
Corresponding author: [email protected]
Abstract
In this paper, two main stages for image classification are presented. The training stage consists of collecting images of interest and applying BoVW to them (feature extraction and description using SIFT, followed by vocabulary generation), while the testing stage classifies a new unlabeled image by applying nearest-neighbor classification to its feature descriptors. The supervised bag of visual words gives good results, shown clearly in the experimental part, where unlabeled images are classified correctly even though a small number of images is used in the training process.
[DOI: 10.22401/ANJS.21.4.11]

Keywords: SIFT, Euclidean distance, classification, k-nearest neighbor, Bag of Visual Words.
1. Introduction
Recognition is the main problem of learning visual categories and classifying new instances into those categories. Vision tasks almost always rely on the capability to identify objects, scenes, and categories. Visual recognition has different applications that touch many areas of artificial intelligence and information retrieval, e.g. content-based image retrieval, data mining, or object identification for mobile robots [1].

Content-based image retrieval (CBIR) makes it possible to search for and classify images. Images can be analyzed based on their features (such as color, texture, shape, or edges). Keypoints are salient image patches that contain rich local information about an image, and they can be detected automatically using various detectors [2].

Local features have been widely used; the most well-known local feature detection and description approaches are Speeded Up Robust Features (SURF) and the Scale Invariant Feature Transform (SIFT). To find images similar to a query image, all image feature descriptors must be compared using some distance measure. The Bag of Words (BoW) method has gained popularity: vectors of image features are clustered, and histograms (counts of feature occurrences) are created from the feature descriptors. The resulting histograms are then compared [3].

The Bag of Visual Words (BoVW) model in computer vision represents an image as visual words. The concept of BoVW is taken from the idea of Bag of Words (BoW) in text documents; this text classification technique is therefore easily applicable to the problem of image classification [4].

The remainder of this paper is organized as follows. Section 2 reviews existing work on image classification and Bag of Words. Section 3 presents the concept of the Scale Invariant Feature Transform. Section 4 presents the general concept of clustering and the k-means clustering algorithm. Section 5 presents the Euclidean distance metric used for comparing corresponding features. Section 6 presents the K-Nearest Neighbor classification algorithm. Section 7 presents the Bag of Visual Words approach. Section 8 presents image classification based on the bag of visual words algorithm. Section 9 presents the images of interest and the experimental results of running the algorithm on unlabeled images. Section 10 presents the conclusions of this work.

2. Related Work
Various studies of image classification using BoVW can be found in the literature; those most related to this work are summarized below:


1. In 2007, Jun Yang, Yu-Gang Jiang, Alexander Hauptmann, and Chong-Wah Ngo used text categorization steps to create different representations of visual words and studied their impact on classification performance on the TRECVID and PASCAL collections. Their empirical study gives a basis for representing visual words in a way that is likely to yield high classification performance [2].
2. In 2012, Mingyuan Jiu, Christian Wolf, Christophe Garcia, and Atilla Baskurt offered a novel method for learning a supervised codebook and optimizing the bag of words approach. The proposed approach preserves or improves the distinctive power of an unsupervised codebook while reducing the learned codebook size. Codebook learning and recognition are integrated so that the cluster centers are updated through back-propagated errors. One variant is based on classical error backpropagation; its drawback is that gradient descent applied to a nonlinear system has difficulty learning an optimal set of parameters, as the algorithm mostly converges to local minima and sometimes even diverges. The other variant is based on a cluster reassignment algorithm that adjusts the cluster centers indirectly by rearranging the cluster labels of all the feature vectors; it needs more iterations to converge to a better solution [5].
3. In 2015, Marcin Korytkowski, Rafał Scherer, Paweł Staszewski, and Piotr Woldan presented a method to classify and retrieve visual words using a novel relational database architecture. This work created a special database indexing algorithm which significantly speeds up answering visual query-by-example SQL queries in relational databases. The proposed method was tested on three classes of visual objects divided into learning and testing examples, with the testing set consisting of 15% of the images in the whole dataset. Local keypoints were generated for all images before the learning procedure using the SIFT algorithm. All simulations were performed on a Hyper-V virtual machine [3].

3. Scale Invariant Feature Transform (SIFT)
SIFT is a local feature detection and description algorithm that provides stable points for image matching. It is a popular algorithm for detecting important points that are invariant to image translation, rotation, scaling, and lighting variation. SIFT is a patented algorithm with a heavy processing cost that makes it slow [6].

SIFT is composed of four main stages: (a) detect the scale space, (b) localize keypoints, (c) assign orientations, and (d) describe keypoints. The first step defines the locations and scales of the interest points using the extrema of the scale space in Difference of Gaussian (DoG) functions with various values of σ. Different image scales are created using different values of σ in the Gaussian function (σ in consecutive scales separated by a constant factor k), and consecutive images are then subtracted to create the DoG pyramid. The DoG is used instead of the more expensive Laplacian of Gaussian to increase processing speed. After that, the Gaussian image is down-sampled by 2 and a DoG pyramid is created for the down-sampled image. The Gaussian function is shown in equation (1) and the DoG in equation (2) [7][8]:

G(x, y, σ) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²)) ............... (1)

where
G(x, y, σ) represents a changing-scale Gaussian,
σ represents the scale variable of the consecutive scale space,
x and y represent the horizontal and vertical coordinates in the Gaussian window,
π ≈ 3.14.

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) ............... (2)

where
∗ represents the convolution operation,
k represents the scaling factor between consecutive scales,
G(x, y, σ) represents the changing-scale Gaussian function,
I(x, y) represents the input image,
D(x, y, σ) represents the Difference of Gaussians between scales kσ and σ,
x and y represent the horizontal and vertical coordinates in the image I(x, y) and the corresponding coordinates in the Gaussian window G(x, y, σ).
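For illustration, equations (1) and (2) can be realized in a few lines of Python. The following is a minimal sketch of one octave of the DoG pyramid; the base scale σ0 = 1.6, the scale step k = 2^(1/3), and the five scales per octave are illustrative assumptions, not values fixed by this paper.

import numpy as np
from scipy.ndimage import gaussian_filter

def dog_octave(image, sigma0=1.6, k=2 ** (1.0 / 3.0), num_scales=5):
    # Equation (1) applied by convolution: blur the image at the
    # successive scales sigma0 * k**i.
    blurred = [gaussian_filter(image.astype(np.float64), sigma0 * k ** i)
               for i in range(num_scales)]
    # Equation (2): D(x, y, sigma) = (G(x, y, k*sigma) - G(x, y, sigma)) * I(x, y),
    # i.e. the difference of consecutive blurred images.
    return [blurred[i + 1] - blurred[i] for i in range(num_scales - 1)]

The next octave would repeat the same computation on the image down-sampled by 2, as described above.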


Local extrema are obtained by comparing every pixel of the DoG with 26 other pixels (the eight neighboring pixels at the current pixel's level, nine pixels in the upper level, and nine pixels in the lower level). When the compared pixel is an extremum (smaller than all 26 pixels or larger than all 26 pixels), its position and scale are saved. In the keypoint localization step, low-contrast points are eliminated, and poorly localized points along edges are also eliminated using the (2×2) Hessian matrix [7][9].

The descriptors are built by calculating the gradient magnitude and orientation for each neighbor of a keypoint. The neighborhood of every keypoint is characterized by creating an 8-bin gradient orientation histogram over a 16×16 region of neighbors around the keypoint. The region is split into 4×4 sub-regions, and each sub-region has 8 directions; this produces a 4×4×8 = 128-dimensional vector that describes every keypoint [9][10].
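In practice, these detection and description stages are available as library routines. The sketch below uses OpenCV's SIFT implementation (cv2.SIFT_create, available in opencv-python 4.4 and later) as a stand-in for a from-scratch implementation; the file name is hypothetical.

import cv2

# Hypothetical input file; any grayscale image that exists on disk will do.
image = cv2.imread("car_01.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
# Each row of `descriptors` is one 128-dimensional vector:
# 4x4 sub-regions x 8 orientation bins, as described above.
print(len(keypoints), descriptors.shape)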
The existence of a large number of features produces irrelevant or redundant features that increase the processing time and can also affect accuracy. The aim of feature selection is to reduce the dimensionality of the feature space while keeping the distinctive features [11].

4. K-means clustering Algorithm
Clustering is an unsupervised iterative method that groups a set of points into clusters based on similarity. The similarity measure frequently depends on a distance function, e.g. the Euclidean distance, used to assign points to groups (or clusters) [12].

The k-means clustering algorithm is an unsupervised classification procedure which groups objects automatically into K groups, where each group contains points with minimum distance between them. K-means is also called the C-means or ISODATA clustering method. The k-means algorithm initializes the cluster centers (or centroids) by selecting samples at random from the training vectors. K-means is an iterative method used to collect data into groups, and these groups change every iteration [13].

The k-means algorithm is described as follows [12] (a code sketch is given after the list):
1) Select an integer k that represents the number of centroids.
2) Compute the distance between every data point and the centroids using the Euclidean distance, equation (4).
3) Assign each data point to the cluster whose center has the minimum distance to the point among all centroids.
4) Recompute each centroid as the mean of its assigned points:

Vi = (1 / ci) Σ xj .................................. (3)

where the sum runs over the ci data points xj assigned to the i-th cluster.
5) Recompute the distance between every data point and the new centroids.
6) If no data point was reassigned, stop; otherwise repeat from step 3.

Note that:
k represents a positive integer,
x represents a group of data points {x1, x2, x3, ..., xm},
ci represents the number of data points in the i-th cluster,
V represents the set of centers {v1, v2, ..., vc}.
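The steps above translate directly into a minimal NumPy sketch; the random seed and the iteration cap are implementation choices, not part of the description above.

import numpy as np

def kmeans(points, k, max_iters=100, seed=0):
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centroids at random from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(max_iters):
        # Steps 2-3: assign each point to its nearest centroid (equation (4)).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 6: stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4 (equation (3)): each centroid becomes the mean of its cluster.
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    return centroids, labels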
5. Euclidean distance
The Euclidean distance is considered the standard metric for geometric problems: it is simply the ordinary distance between two points. It is extensively used in clustering and classification problems, and it is the default distance measure of the k-means and k-nearest neighbor algorithms. The Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of objects, as shown in equation (4) [12]:

Dist(P1, P2) = √((x1 − x2)² + (y1 − y2)²) .............................. (4)

where
P1 = (x1, y1) is the first point,
P2 = (x2, y2) is the second point.
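Equation (4) in code; the function is written so that it also covers higher-dimensional vectors such as the 128-dimensional SIFT descriptors, a small generalization beyond the two-dimensional form above.

import numpy as np

def euclidean(p1, p2):
    # Square root of the sum of squared coordinate differences (equation (4)).
    return float(np.sqrt(np.sum((np.asarray(p1, float) - np.asarray(p2, float)) ** 2)))

assert euclidean((0, 0), (3, 4)) == 5.0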


6. K-Nearest Neighbor
The k-nearest neighbor (KNN) classifier, also called an instance-based classifier, is a traditional nonparametric classification algorithm that gives good performance for a well-chosen value of k. The KNN classifier classifies an unlabeled image by relating the unlabeled image's features to the labeled features through a distance function (equation (4)) or a similarity measure. In k-nearest neighbor classification, a test sample is allocated to the class that occurs most frequently among its k nearest training samples. If two or more such classes exist, the test sample is assigned to the class with the minimum average distance to it [14].
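A minimal sketch of this rule, including the tie-break by minimum average distance described above; the default k = 3 is illustrative.

import numpy as np
from collections import Counter

def knn_classify(test_vec, train_vecs, train_labels, k=3):
    train_vecs = np.asarray(train_vecs, dtype=float)
    dists = np.linalg.norm(train_vecs - np.asarray(test_vec, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest).most_common()
    tied = [label for label, n in votes if n == votes[0][1]]
    if len(tied) == 1:
        return tied[0]
    # Tie-break: among tied classes, the smallest average distance wins.
    return min(tied, key=lambda c: np.mean([dists[i] for i in nearest
                                            if train_labels[i] == c]))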
7. Bag of Visual Words (BoVW)
An image has keypoints, or local features, identified as prominent image regions with rich local information (such as color or texture), and these features can be detected using different detection and description methods. The detected features are then split into a number of clusters using the k-means clustering algorithm, so that each cluster holds features with similar descriptors, and each keypoint is encoded by the index of the cluster to which it belongs; this is called the vector quantization (VQ) technique [2].

VQ can be considered a generalization of scalar quantization to the quantization of a vector. A VQ encoder encodes a given set of k-dimensional data vectors with a much smaller subset C, called a codebook; its elements Ci are called codewords, codevectors, reproducing vectors, prototypes, or design samples. The most commonly used vector quantizers are based on the nearest neighbor and are called Voronoi or nearest-neighbor vector quantizers [13].

Each cluster is represented by a visual word that stands for the specific local pattern shared by the keypoints in that cluster, so a visual word vocabulary identifies all types of local patterns in an image. The image can then be identified as a bag of visual words, or in other words, as a visual-word vector containing the count (weight) of each visual word in the image (i.e., the number of keypoints in the corresponding cluster), which in a classification task can be used as a feature vector [2].

The BoVW approach in general builds supervised classifiers that depend on visual words taken from labeled images to predict the label of a new image. The clustering method therefore creates a visual word vocabulary describing the different local patterns in images. The number of clusters defines the size of the visual vocabulary, which can vary from hundreds to more than tens of thousands. By mapping its keypoints to visual words, each image can be represented as a "bag of visual words" [2].

The BoVW of a new unlabeled image is calculated in a similar way: local features are extracted from the image and described, the descriptors are projected onto the dictionary previously calculated from the training set, and a histogram of the occurrences of each visual word of the dictionary is computed [5].
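The mapping from descriptors to a visual-word histogram is compact in code. The sketch below assumes the visual words are the k-means cluster centers from Section 4; it returns raw counts, matching the weighting described above (normalizing by the total count is a common refinement that the paper does not specify).

import numpy as np

def bovw_histogram(descriptors, visual_words):
    descriptors = np.asarray(descriptors, dtype=float)
    visual_words = np.asarray(visual_words, dtype=float)
    # Vector quantization: index of the nearest visual word per descriptor.
    dists = np.linalg.norm(descriptors[:, None, :] - visual_words[None, :, :],
                           axis=2)
    word_ids = dists.argmin(axis=1)
    # Histogram of visual-word occurrences (keypoints per cluster).
    return np.bincount(word_ids, minlength=len(visual_words))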
8. Proposed Algorithm
The proposed algorithm for image classification using bag of visual words can be described by two main algorithms, a training algorithm and a testing algorithm, as follows (an end-to-end code sketch is given at the end of this section):

Training Algorithm
Input: collection of images
Output: k clusters, k visual words
Step 1: Collect a set of images for each class of interest (in this paper, the experimental classes of interest are Car, Motorbike, and Ship).
Step 2: Apply BoVW to the collected images. BoVW consists of three main steps:
1. Extract keypoints from the images using the SIFT feature detection and description algorithm.
2. Create a descriptor for each extracted keypoint.
3. Cluster the features using the k-means clustering algorithm (create the visual vocabulary by vector quantization of the descriptor space) and save the resulting "visual words".

Testing Algorithm
Input: k visual words
Output: labeled image
Step 1: Open the new unlabeled image.
Step 2: Extract and describe the features of the unlabeled image using SIFT.
Step 3: Extract the visual words (centroids) of the testing image.
Step 4: Calculate the nearest neighbor using the Euclidean distance between the visual words of the tested image and the visual words of the training images.
Step 5: Take the decision: compare the extracted features of the unlabeled image with the visual words extracted in the training stage.

The new image is classified by finding the smallest distance between the new image's feature vectors and the feature vectors of the clusters obtained in the learning phase: the unlabeled image belongs to the cluster that has the smallest distance to it based on their feature vectors or, equivalently, the difference between the histogram of the unlabeled image and that of a training cluster is smallest, indicating that the two are closest.
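The two algorithms above can be summarized in a compact end-to-end sketch, here using OpenCV's SIFT and k-means in place of the authors' Visual Basic .NET implementation. The file paths, the vocabulary size k, and the aggregation of per-word distances into a single class decision are assumptions made for illustration; the sketch also assumes every image yields at least k descriptors.

import cv2
import numpy as np

sift = cv2.SIFT_create()
KMEANS_CRITERIA = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)

def visual_words(image_paths, k=50):
    # Training steps 1-2: SIFT descriptors of all images, then the k-means
    # cluster centers serve as the visual words.
    descs = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, d = sift.detectAndCompute(img, None)
        descs.append(d)
    data = np.vstack(descs).astype(np.float32)
    _, _, centers = cv2.kmeans(data, k, None, KMEANS_CRITERIA, 5,
                               cv2.KMEANS_RANDOM_CENTERS)
    return centers

def train(paths_by_class, k=50):
    return {label: visual_words(paths, k) for label, paths in paths_by_class.items()}

def classify(path, words_by_class, k=50):
    # Testing steps 1-3: visual words (centroids) of the test image.
    test_words = visual_words([path], k)
    # Testing steps 4-5 (one possible reading): sum each test word's distance
    # to its nearest training word; the smallest total decides the class.
    def total_dist(train_words):
        d = np.linalg.norm(test_words[:, None, :] - train_words[None, :, :], axis=2)
        return d.min(axis=1).sum()
    return min(words_by_class, key=lambda label: total_dist(words_by_class[label]))

# Hypothetical usage:
# model = train({"car": car_paths, "ship": ship_paths, "motorbike": moto_paths})
# print(classify("unknown.jpg", model))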


9. Experimental Results
The quantitative performance of the algorithm is reported in terms of sensitivity, specificity, and accuracy. Sensitivity is the fraction of positive-class samples correctly classified (the ability of the classifier to find all the positive samples). Specificity is the fraction of negative-class samples correctly identified. Accuracy is the proportion of true results, either true positive or true negative, in a population; it measures the degree of veracity of a diagnostic test on a condition [15].

Sensitivity = TP / (TP + FN) ................................... (5)

Specificity = TN / (TN + FP) ................................... (6)

Accuracy = (TP + TN) / (TP + TN + FP + FN) ................... (7)

Note that TP (true positive) represents the number of images correctly labeled with the corresponding class by the algorithm; FP (false positive) represents the number of images that do not belong to the class but are labeled as one of its clusters (an unexpected result); FN (false negative) represents the number of missed images; and TN (true negative) represents the number of images that do not belong to the class and are correctly not labeled with it [15].
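Equations (5)-(7) in code; the counts in the example call are illustrative, not taken from the results below.

def sensitivity(tp, fn):
    return tp / (tp + fn)                     # equation (5)

def specificity(tn, fp):
    return tn / (tn + fp)                     # equation (6)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)    # equation (7)

# Illustrative confusion counts only:
print(sensitivity(14, 2), specificity(14, 2), accuracy(14, 14, 2, 2))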
The program was written in the Visual Basic .NET programming language. A dataset with three different classes (car, ship, and motorbike) was used for the training process; each class set contains 16 images.

First data set (car class): [figure]

Second data set (ship class): [figure]

Third data set (motorbike class): [figure]

Table (1)
Classification performance.
Class        Sensitivity   Specificity   Accuracy
Car          0.875         0.875         0.9
Ship         0.83          0.7           0.8
Motorbike    0.7           0.83          0.8

The experimental results for the 16 unlabeled (tested) images, presented in Table (1), show the performance of the nearest neighbor classification method based on bag of visual words. The nearest neighbor results based on the Euclidean distance for the 16 tested images are presented in Table (2), where the distance is calculated between the (x, y) position of the tested image's centroid and the (x, y) positions of the training images' centroids in each cluster. Table (2) lists the original image type alongside the distance to each cluster; the minimum distance determines the class of the tested image.


Table (2)
K-Nearest Neighbor distances.
Original image   Car class   Ship class   Motorbike class
Car              14          154          154
Car              30          154          17
Car              18          30           154
Car              10          16           19
Car              154         574          518
Ship             21          15           24
Ship             24          11           51
Ship             17          9            18
Ship             18          30           37
Ship             15          18           23
Motorbike        8           44           7
Motorbike        23          154          17
Motorbike        14          16           6
Motorbike        9           16           9
Motorbike        16          21           17
Motorbike        14          17           11

Sensitivity and specificity are limited to values between 0 and 1; they are ratios of correct classification, and high values of sensitivity and specificity indicate good method performance.

10. Conclusion
The bag of visual words (BoVW) technique is an efficient image representation for the classification task. In this paper there are two main stages, a training stage and a testing stage, and each stage has a number of steps. In general, the first stage creates a visual vocabulary from the training images. The information extracted in the first stage is used to classify new unlabeled images based on the bag of features created by the supervised BoVW approach on the set of training images. This approach gives very good results even though a small number of images is used in the training process.

References
[1] Kristen G., "Visual Object Recognition", thesis, 2010.
[2] Jun Y., Yu-Gang J., Alexander H., Chong-Wah N., "Evaluating Bag-of-Visual-Words Representations in Scene Classification", Proceedings of the International Workshop on Multimedia Information Retrieval, vol. 2, pp. 197-206, 2007.
[3] Marcin K., Rafał S., Paweł S., Piotr W., "Bag-of-Features Image Indexing and Classification in Microsoft SQL Server Relational Database", IEEE, 46, 746-751, 2015.
[4] Pornntiwa P., Emmanuel O., Olarik S., Lambert S., Marco W., "Comparing Local Descriptors and Bags of Visual Words to Deep Convolutional Neural Networks for Plant Recognition", 6th International Conference on Pattern Recognition Applications and Methods, 1, 886-893, 2017.
[5] Mingyuan J., Christian W., Christophe G., Atilla B., "Supervised Learning and Codebook Optimization for Bag-of-Words Models", Springer Science+Business Media, 4, 409-419, 2012.
[6] Yi H., Guohua D., Yuanyuan W., Ling W., Jinsheng Y., Xiqi L., Yudong Z., "Optimization of SIFT algorithm for fast-image feature extraction in line-scanning ophthalmoscope", Optik, 152, 21-28, 2017.
[7] El-gayar M., Soliman H., Meky N., "A comparative study of image low level feature extraction algorithms", Egyptian Informatics Journal, 14, 175-181, 2013.
[8] Panchal P., Panchal S., Shah S., "A Comparison of SIFT and SURF", International Journal of Innovative Research in Computer and Communication Engineering, 1, 323-327, 2013.
[9] Jian W., Zhiming C., Victor S., Pengpeng Z., Dongliang S., Shengrong G., "A Comparative Study of SIFT and its Variants", Measurement Science Review, 13, 122-131, 2013.
[10] Pedro J., "Contribution to the Completeness and Complementarity of Local Image Features", thesis, 2013.
[11] Soumyadeep G., Tejas I., Rohit K., Richa S., Mayank V., "Feature and Keypoint Selection for Visible to Near-infrared Face Matching", International Conference on Biometrics: Theory, Applications & Systems, 978, 1109-1119, 2015.


[12] Jasmine I., Nitin P., Madhura P., "Clustering Techniques and the Similarity Measures used in Clustering: A Survey", International Journal of Computer Applications, 134, 93-103, 2016.
[13] Balwant A., Doye D., "Speech Recognition Using Vector Quantization through Modified K-means LBG Algorithm", Computer Engineering and Intelligent Systems, 3, 137-144, 2012.
[14] Aman K., Singh M., "A Review of Data Classification Using K-Nearest Neighbour Algorithm", International Journal of Emerging Technology and Advanced Engineering, 3, 354-360, 2013.
[15] Chama C., Mukhopadhyay S., Biswas P., Dhara A., Madaiah M., Khandelwal N., "Automated Lung Field Segmentation in CT Images using Mean Shift Clustering and Geometrical Features", Medical Imaging 2013: Computer-Aided Diagnosis, 8670, 867032-867042, 2013.
