Article
Scalable Database Indexing and Fast Image Retrieval
Based on Deep Learning and Hierarchically Nested
Structure Applied to Remote Sensing and
Plant Biology
Pouria Sadeghi-Tehran 1, * , Plamen Angelov 2 , Nicolas Virlet 1 and Malcolm J. Hawkesford 1
1 Department of Plant Sciences, Rothamsted Research, Harpenden AL5 2JQ, UK;
[email protected] (N.V.); [email protected] (M.J.H.)
2 School of Computing and Communications, InfoLab21, Lancaster University, Lancaster LA1 4WA, UK;
[email protected]
* Correspondence: [email protected]
Received: 6 November 2018; Accepted: 18 February 2019; Published: 1 March 2019
Abstract: Digitalisation has opened a wealth of new data opportunities by revolutionizing how
images are captured. Although the cost of data generation is no longer a major concern, the data
management and processing have become a bottleneck. Any successful visual trait system requires
automated data structuring and a data retrieval model to manage, search, and retrieve unstructured
and complex image data. This paper investigates a highly scalable and computationally efficient
image retrieval system for real-time content-based searching through large-scale image repositories
in the domain of remote sensing and plant biology. Images are processed independently without
considering any relevant context between sub-sets of images. We utilize a deep Convolutional
Neural Network (CNN) model as a feature extractor to derive deep feature representations from
the imaging data. In addition, we propose an effective scheme to optimize data structure that can
facilitate faster querying at search time based on the hierarchically nested structure and recursive
similarity measurements. A thorough series of tests were carried out for plant identification and
high-resolution remote sensing data to evaluate the accuracy and the computational efficiency of the
proposed approach against other content-based image retrieval (CBIR) techniques, such as the bag
of visual words (BOVW) and multiple feature fusion techniques. The results demonstrate that the
proposed scheme is effective and considerably faster than conventional indexing structures.
Keywords: content-based image retrieval; deep convolutional neural networks; information retrieval;
data indexing; recursive similarity measurement; deep learning; bag of visual words; remote sensing
1. Introduction
Today, digital images and videos are ubiquitous in every domain. The advancement in
multi-media technologies has led to the generation of an enormous number of images and videos.
The size of image repositories has increased rapidly in many domains, such as biology, remote sensing,
medical, military, and web-searching. The use of automated data acquisition systems, such as modern
phenotyping platforms [1–3], has revolutionized the way data are collected and analyzed. The plant
science community is seeking novel solutions to fully exploit all the potential offered by such new
platforms equipped with high-resolution remote sensing sensors. Any large-scale dataset in modern
biological sciences first and foremost requires reliable data infrastructure and an efficient information
retrieval system. For large-scale image repositories, manual tagging is infeasible and prone
to errors due to users’ subjective opinions. Thus, to utilize such unstructured and complex image
collections, there is a substantial need for content-based image retrieval (CBIR) systems for browsing
through images at a large scale and to classify, structure, and retrieve relevant information requested
by the users.
Information retrieval (IR) refers to finding material (image repositories or documents) of an
unstructured nature (image or text) that satisfies an information need from within large collections [4].
There is a fundamental difference between CBIR and search by text and metadata. Searching methods
based on metadata rarely examine the content of an image itself but rather rely on manual annotations
and tagging. In these systems, words are stored as ASCII character strings to describe image content.
However, the high complexity of images cannot be described easily by keywords; thus, retrieval
systems which are based solely on manual annotation often lead to unsatisfactory outcomes. In contrast,
CBIR does not require keywords (manual annotation) and desired images are retrieved automatically
based on their similarity to the query representation [5–7].
Although CBIR techniques are beginning to find a foothold in many applications, such as biology,
remote sensing, satellite imaging, etc., the technology still suffers from lack of maturity due to
a significant gap towards semantic-aware retrieval from visual content. A major challenge associated
with CBIR systems is to extract information from an image which is unique and representative,
to overcome the issue of the so-called semantic gap. The semantic gap refers to the discrepancy between
low-level image features, such as color and texture, and the higher level of understanding of the image
perceived by humans [8]. Due to the absence of solid evidence on the effectiveness of
CBIR techniques for high-throughput datasets with varied collections of images, opinion is still sharply
divided regarding the reliability and performance of such systems in real-time. It is essential to
standardize CBIR for easy access to data and speed up the retrieval process.
In this paper, a new concept of CBIR is employed to exploit the opportunities presented by large
image-based repositories, particularly in remote sensing and plant biology. The proposed approach,
which relies solely on the contents of the images, will pave the way for a computationally efficient and
real-time image querying through an unstructured image database. The end-to-end CBIR framework operates
without supervision. First, we utilize a deep CNN model as a feature extractor to obtain the
feature representations from the activations of the convolutional layers. In the next step, a hierarchically
nested database indexing structure and local recursive density estimation are developed to facilitate
an efficient and fast retrieval process. Finally, the key elements of CBIR, accuracy and computational
efficiency, are evaluated and compared with the state-of-the-art CBIR techniques.
2. Related Works
The core modules of any CBIR system include image representation, database indexing,
and image scoring. Image representation has traditionally relied on primitive visual properties, described in detail below:
• Color properties are extracted directly from the pixel densities over the whole image, segmented
regions/bins, or sub-image. Image descriptors that characterize the color properties of an image
seek to model the distribution of the pixel intensities in each channel of the image. These methods
include color statistics, such as deviation, mean, and skewness, along with color histograms.
Since color features are robust to background complications and are invariant to the size or
orientation of an image, the color based methods have become one of the most common techniques
in CBIR [9–11].
• Texture properties measure visual patterns in images that contain important information about
the structural arrangement of surfaces, i.e., fabric, bricks, etc. Texture descriptors seek to model the
feel, appearance, and overall tactile quality of an object in an image and are defined as a structure
of surfaces formed by repeating a particular element or several elements in different relative
spatial distribution and synthetic structure. In general, the repetition involves local variations of
scale, orientation, or other geometric and optical features of the elements [12,13].
• Shape properties can also be considered as one of the fundamental perceptual characteristics.
Shape properties take on many non-geometric and geometric forms, such as moment invariants,
aspect ratio, circularity, and boundary segments. There are difficulties associated with shape
representation and descriptors techniques due to noise, occlusion, and arbitrary distortion, which
often causes inaccuracies in extracting shape features. Nonetheless, the method has shown
promising results to describe the image content [14,15].
Whilst the above techniques focus on primitive features, more recent techniques have been aimed
to find semantically richer image representations by extracting a collection of local invariant features.
The main advantage of semantic features is locality, which means that the extracted features are local
and robust to clutter and occlusion. Also, individual features can be matched to a large database of
objects and have close to real-time performance.
One of the most effective techniques is the bag of visual words technique [16,17]. The main reasons
that BOVW has gained popularity in classification and retrieval applications are the use of powerful
local descriptors, such as Scale Invariant Feature Transform (SIFT) [18], Speeded Up Robust Features
(SURF) [19], and Binary Robust Invariant Scalable Keypoints (BRISK) [20]. In addition, the vector
representations can be compared with standard distances, and subsequently be used for effective
CBIR. However, the main drawback of BOVW is the high dimensional vector representing an image.
Although a high-dimensional vector usually provides better exhaustive search results compared to a
low-dimensional one, it is more difficult to index efficiently. Aggregated vectors, such as Fisher Vector
(FV) [21] and Vector of Locally Aggregated Descriptors (VLAD) [22] aim to address this problem by
encoding an image into a single vector, reducing the dimensionality without noticeably impacting the
accuracy [16,17].
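For illustration, the following is a minimal bag-of-visual-words sketch (not the implementation used in the works cited above): local SIFT descriptors are clustered into a visual vocabulary with k-means, and each image is then encoded as a normalized histogram of visual-word occurrences. The vocabulary size and the OpenCV/scikit-learn tooling are assumptions made for this example.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    """Return the local SIFT descriptors of an image (empty array if none found)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_vocabulary(image_paths, vocab_size=256):
    """Cluster all local descriptors of the training images into a visual vocabulary."""
    all_desc = np.vstack([sift_descriptors(p) for p in image_paths])
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(all_desc)

def bovw_histogram(image_path, vocabulary):
    """Encode one image as an L1-normalized histogram of visual words."""
    desc = sift_descriptors(image_path)
    if len(desc) == 0:
        return np.zeros(vocabulary.n_clusters)
    words = vocabulary.predict(desc)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histograms can be compared with standard distances, which is what makes the representation convenient for retrieval, at the cost of the high dimensionality discussed above.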
Nevertheless, despite the robustness of local descriptors techniques, global features are still
desirable in a variety of computer vision applications. Ultimately, having an intimate knowledge of
the dataset contents will provide a better perspective for which feature extraction techniques might be
appropriate. For example, for datasets whose classes have distinct color distributions, color descriptors
will be more effective. Nonetheless, the effectiveness of hand-crafted feature representation in CBIR
is inherently limited, as these approaches mainly operate at the primitive level. As presented in the
following section, higher accuracy will be achieved by extracting semantic features from images based
on learning-based features using deep networks.
Deep learning models learn hierarchical representations of data through multiple layers of
non-linear transformations [27]. In CNNs, features are extracted at multiple levels of abstraction, which
allows the system to learn complex functions that directly map raw sensory input data to the output,
without relying on hand-engineered features based on domain knowledge.
CNN has achieved state-of-the-art performance in a variety of applications, including natural
language processing [28,29], speech recognition [30], and object recognition [31]. Inspired by the
success of CNN in many computer vision applications, it has started to gain a foothold in the
research area of CBIR. Subsequently, CNN models have been proposed to improve the image
retrieval workflow [16,32,33]. For instance, in Sun et al. [34], features derived from local image
regions identified with a general object detector and an adapted CNN model have been evaluated
on two public large-scale image datasets. Lai et al. [35] proposed simultaneous feature learning
using deep neural networks and hash coding. The short binary codes resulted from hash coding
achieved efficient retrieval and a considerable saving in memory usage. In other techniques, CNN
descriptors are combined with conventional descriptors such as the VLAD representation [36,37].
Finally, in Mohedano et al. [38] authors proposed a method based on encoding the convolutional
features of CNN and the BOVW aggregation scheme. The approach outperformed the state-of-the-art
tested on landmark datasets.
3. Methodology
In this paper, we focus on three key challenges of any content-based image retrieval: image
representation, database indexing, and image similarity measurement. Figure 1 illustrates an overall
view of the proposed framework. The first step in the prescriptive analytics process is to transform the
initial unstructured and structured data sources into analytically prepared data. To achieve a balance
between complexity and efficiency, a pre-trained CNN is used to utilize the ability of the model to
produce better image representations for the retrieval task. We leverage an existing model trained on
the ImageNet dataset [52], known as residual network (ResNet) [53]. The model is used as a fixed
feature extractor without the last fully connected layer. The trained model provides access to the visual
descriptors previously learnt by the CNN after processing millions of images in the ImageNet dataset,
without requiring a computationally expensive training phase.

Although the deep learning model is effective in extracting discriminative visual features from
images (Section 4.2), it computes multi-dimensional feature vectors (2048-D in our case) for
every image, which increases the computational complexity of feature indexing and querying. To
address the multi-dimensional complexity caused by the CNN model, a novel nested hierarchical
database indexing is proposed to facilitate fast querying. In addition, a recursive calculation based
on local density estimation is used to measure the similarity between the given query and all the
images from a given image cluster.
Figure 1. Schematic representation of the retrieval model.
In a standard classification setting, an input image is forward propagated through the entire
network and the final probabilities are obtained from the end of the network. However, in
representation learning, instead of allowing the image to forward propagate through the entire network,
we can stop the propagation at an arbitrary layer, such as the last fully connected layer, extract the
values from the network at this point, and then use them as feature vectors.
Figure 2. Representation learning scheme. Deep feature extraction from the pretrained Convolutional
Neural Network (CNN) model.
In this study, we utilize the convolutional layers merely as a feature extractor. The aim is to
generalize a trained CNN in learning discriminative feature representations for the images in our
dataset. The trained model is used to derive feature vectors that are more powerful than hand-designed
algorithms such as SIFT, GIST, HOG, etc. We exploit the ability of a well-known deep convolutional
neural network framework known as residual learning (ResNet) [53,56]. Residual learning frameworks
ease the training of deeper networks and are a great candidate to capture the discriminative properties
of images as a fixed feature extractor model. Network depth is a key element in neural network
architecture; however, deeper networks are more difficult to train, as the accuracy gets saturated and
then degrades rapidly. When deeper networks start converging, a degradation problem is exposed
which is not caused by overfitting, while adding more layers causes even higher training error.
In residual learning models, instead of learning a direct mapping x → y with a function H(x),
the residual function is defined using H(x) = F(x) + x, where F(x) and x represent the residual mapping
function and the identity function, respectively. The authors’ hypothesis is that it is easier to optimize
F(x) than to optimize the original mapping function, H(x). We refer readers to [53,56] for more details.
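As a concrete illustration of the residual formulation H(x) = F(x) + x, the following is a minimal PyTorch-style residual block; it is a generic sketch rather than the exact bottleneck block of the pre-trained ResNet used here, and the channel count is an arbitrary assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with F built from two conv layers."""
    def __init__(self, channels=64):
        super().__init__()
        self.residual = nn.Sequential(          # F(x): the residual mapping
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # H(x) = F(x) + x: the identity shortcut is what makes F easier to optimize.
        return self.relu(self.residual(x) + x)

# Example: a batch of feature maps passes through with its shape unchanged.
features = torch.randn(8, 64, 56, 56)
assert ResidualBlock(64)(features).shape == features.shape
```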
The employed ResNet model has been pre-trained on the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) 2012, to classify 1.3 million images into 1000 ImageNet classes [52]. The ResNet
consists of convolutional layers, pooling layers, and fully connected layers. The network takes images
of size 224 × 224 pixels as input, which then pass through the network in a forward pass after filters
are applied to the input image. When treating the network as a fixed feature extractor, we cut it off at
an arbitrary point (normally prior to the last fully-connected layers); thus, all image features are
extracted directly from the activations of the convolutional feature maps. This computes a 2048-D
feature vector for every image, containing the activations of the hidden layer immediately before the
classifier. The 2048-D feature vectors are directly used for computing the similarity between images.
The computational complexity and retrieval process may become cumbersome as the dimensionality
grows. This requires us to optimize the retrieval process by proposing a hierarchically nested indexing
structure and recursive similarity measurements to facilitate faster access and comparison of
multi-dimensional feature vectors, as described in the following sections.
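The feature-extraction step described above can be sketched as follows, treating a pre-trained ResNet-50 as a fixed feature extractor by dropping the final fully connected layer and keeping the 2048-D activations of the global pooling layer. The use of torchvision (≥ 0.13) and the standard ImageNet preprocessing are assumptions of this sketch, not a description of the authors’ exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load ResNet-50 pre-trained on ImageNet and cut off the final classifier (fc) layer,
# leaving the global-average-pooled 2048-D activations as the image descriptor.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()

# Standard ImageNet preprocessing: 224 x 224 input, per-channel normalization.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_feature(image_path):
    """Return the 2048-D deep feature vector for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return extractor(x).flatten(1).squeeze(0)   # shape: (2048,)
```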
3.2. Feature Indexing Based on Hierarchical Nested Data Clusters
The success of a CBIR system not only depends on image delineation; feature indexing and the
similarity measurement metric also play vital roles in facilitating the execution of queries. In general,
feature indexing refers to a database organizing structure that assists a fast retrieval process. Whilst it
is feasible to retrieve information from datasets which are small in size by measuring the similarity
between a query and every image in the dataset, the computational complexity will soon increase
significantly on a larger-scale image database.
In an attempt to address the challenges faced by information retrieval on a large-scale dataset,
we present a hierarchically nested structure. The introduced database indexing aims at arranging and
structuring the image database into a simple yet effective form of data clusters and hierarchies.
Although forming a hierarchical structure for retrieval optimization has been explored before [57–60],
the method presented in this study is quite different. Hierarchically nested data clusters are structured
such that data clusters at higher layers represent one or multiple clusters at a lower layer, based on
the mean values of the cluster centers (Figure 3). The first-layer clusters are generated based on the
feature representations derived from the CNN model. Data clusters are formed by grouping the relevant
data points using a partition-based clustering approach known as K-means clustering [61]. Figure 3
illustrates how the hierarchical structure of clusters is formed; µ and X are abstract values and denote
the mean values and scalar products explained in Section 3.3.
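A minimal sketch of the two-layer nested structure described above (an illustrative reconstruction, not the authors’ code): the first layer clusters the CNN feature vectors with K-means, and the second layer clusters the mean values (centroids) of the first-layer clusters, so that every top-layer cluster points to the group of lower-layer clusters it summarizes. The cluster counts are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_nested_index(features, n_lower=44, n_upper=10, seed=0):
    """Build a two-layer hierarchically nested index over deep feature vectors.

    features: (N, 2048) array of CNN descriptors.
    Returns the lower/upper K-means models and, for each upper cluster,
    the indices of the lower-layer clusters it represents.
    """
    lower = KMeans(n_clusters=n_lower, n_init=10, random_state=seed).fit(features)
    # The upper layer clusters the mean values (centroids) of the lower-layer clusters.
    upper = KMeans(n_clusters=n_upper, n_init=10, random_state=seed).fit(lower.cluster_centers_)
    children = {c: np.where(upper.labels_ == c)[0] for c in range(n_upper)}
    return lower, upper, children
```

In this sketch, the lower and upper cluster counts would correspond to the per-dataset settings reported later in Section 4.2.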
Figure 3. Schematic representation of the hierarchical nested indexing structure.
3.3. Fast Searching and Similarity Measure Based on Recursive Data Density Estimation
The final step after forming the hierarchically nested data clusters is to find the cluster which
contains the images most similar to a query image. We applied recursive density estimation [62,63] to
measure the similarity between the query image and all images inside each cluster recursively. The main
idea of the recursive density function is to estimate the probability density function with a Cauchy-type
kernel and to calculate it recursively. The method has also been applied for novelty detection in real-time
data streams and video analytics [64]. The recursive calculation allows us to discard each data sample
once it has been processed and to store in memory only the accumulated information concerning the
local mean (per cluster), µ, and the scalar product, X. In order to speed up the retrieval process by an
order of magnitude, the searching process is performed from the top of the pyramid in an ordered
hierarchy, based on a “winner takes all” principle with the maximum local recursive density estimation
at each level (Figure 4).
Figure 4. Schematic representation of searching through the hierarchical nested structure and retrieving
the most similar images (winner cluster) to the query.
The degree of similarity between the query image and the images inside each cluster is measured by
the relative local density with regard to the query image, which is defined by a suitable kernel over
the distance between the current image sample and all the other images inside the cluster:

D_i^c = K\left( \sum_{j=1}^{M_c} d_{ij}^c \right), \qquad c = [1, C]    (1)

Different types of distance measures can be used, such as the Euclidean or Cosine distance. We used
a Cauchy type of kernel to define the local density D_i^c. It can be proven that the Cauchy type kernel
asymptotically tends to a Gaussian, but it can be calculated recursively [63]:

D_i^c = \frac{1}{1 + \| F_i - \mu_i^c \|^2 + X_i^c - \| \mu_i^c \|^2}    (2)

where F = \{ f_1, \cdots, f_{2048} \} is the feature vector; i = 1, 2, \ldots, N_c; and N_c is the number of
images within the cth cluster.

Both the mean, \mu_i, and the scalar product, X_i, are updated recursively as follows [63]:

\mu_i = \frac{i-1}{i}\, \mu_{i-1} + \frac{1}{i}\, F_i; \qquad \mu_1 = F_1    (3)

X_i = \frac{i-1}{i}\, X_{i-1} + \frac{1}{i}\, \| F_i \|^2; \qquad X_1 = \| F_1 \|^2    (4)

Finally, the cluster with the maximum local density D_i^c with respect to the query image is the one
most likely to contain similar images:

C^* = \underset{c = [1, C]}{\arg\max} \{ D_i^c \}    (5)
The final step is the similarity measurement between the query image and all the images inside
the winning cluster at the lowest layer. The relevance score is defined by distance-based scoring using
City Block distance. Images are then ranked according to their obtained scores. A smaller value
of City Block distance implies that the corresponding image is more similar to the query image and
vice versa. The City Block distance between the query image and images inside the winner cluster is
calculated as follows:
d(I_j, Q) = \sum_{k=1}^{K} \left| Q_k - I_k^j \right|; \qquad j = 1, \ldots, N_c    (6)
where N_c is the number of images in the winning cluster; K is the number of extracted features (K = 2048);
Q denotes the query image; and I^j is the jth image in the winning cluster.
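To tie Equations (1)–(6) together, here is a minimal sketch of the scoring stage: per-cluster means and scalar products are accumulated recursively as in Equations (3) and (4), the query is scored against each cluster with the Cauchy-type local density of Equation (2), the winner is selected as in Equation (5), and the images of the winning cluster are ranked by City Block distance as in Equation (6). The cluster data structure (a list of dictionaries) is an assumption made for this example.

```python
import numpy as np

def update_cluster_stats(mu, X, i, F):
    """Recursive update of the cluster mean and scalar product, Eqs. (3)-(4).
    i is the 1-based index of the new feature vector F within its cluster."""
    if i == 1:
        return F.copy(), float(F @ F)
    mu_new = (i - 1) / i * mu + F / i
    X_new = (i - 1) / i * X + (F @ F) / i
    return mu_new, X_new

def local_density(F, mu, X):
    """Cauchy-type local density of query F with respect to one cluster, Eq. (2)."""
    return 1.0 / (1.0 + np.sum((F - mu) ** 2) + X - np.sum(mu ** 2))

def retrieve(query, clusters):
    """Pick the winner cluster (Eq. (5)) and rank its images by City Block distance (Eq. (6)).
    clusters: list of dicts with keys 'mu', 'X', and 'features' (N_c x 2048 array)."""
    densities = [local_density(query, c["mu"], c["X"]) for c in clusters]
    winner = clusters[int(np.argmax(densities))]
    scores = np.sum(np.abs(winner["features"] - query), axis=1)   # L1 distance per image
    return winner, np.argsort(scores)   # ascending order: most similar images first
```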
4.1. Datasets
MalayaKew (MK) Leaf-Dataset: This dataset [72] consists of a collection of leaves from 44 species
classes, with 52 images in each class. The data is in the form of digital images of 256 × 256 pixels,
collected at the Royal Botanic Garden, Kew, England. The dataset has previously been used only for
supervised image classification, since it is extremely challenging: some of the classes have very similar
appearances (Figure 5), making it extremely difficult to distinguish differences between classes with a
fully unsupervised model, as presented in this study. Although the MK dataset is
not considered a big dataset, we believe the similarity between classes can be a good example to
demonstrate how discriminative the features are between the convolutional neural networks and the
hand-crafted methods.
Figure 5. Sample images of the MalayaKew 44 leaf collection.
The University of California Merced (UCM) Dataset: The UCM dataset [73] consists of 21 land-cover
classes of large-scale aerial images from the USGS national map urban area imagery. Each class contains
100 images of 256 × 256 pixels; the spatial resolution of each pixel is 30 cm, measured in the RGB spectral
space. The dataset has been widely utilized for evaluating the performance of high-resolution remote
sensing image scene classification [74–76]. The UCM dataset shows very small inter-class diversity
among some categories that share a few similar texture patterns or objects, which makes this dataset
very challenging. Some sample image scenes from the UCM dataset are shown in Figure 6.
Figure 6. Sample images of the University of California Merced (UCM) dataset.
4.2. Performance and Accuracy
Throughout this work, we use two evaluation metrics widely used to assess CBIR performance,
known as mean Average Precision (mAP) and precision at rank N (P@N). Average Precision (AP) is
one of the most frequent methods used to evaluate the retrieval quality of a single query’s retrieval
results. AP takes into consideration both Precision (Pr) and Recall (Re). Precision is the fraction of
retrieved images that are relevant, whereas Recall is the fraction of relevant images that are retrieved.
AP averages the precision values at the rank positions where relevant images are retrieved. The mean
average precision (mAP) is widely used to summarize the retrieval quality; it averages the AP over
all queries. The definition of the above metrics follows below [4]:

AP = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{R}    (7)
where P(k) denotes the precision of the top k retrieval results; rel(k) is a binary indicator function
equaling 1 if the kth retrieved result is relevant to the current query image and 0 otherwise; and R
and n denote the number of relevant results for the current query image and the total number of
retrieved results, respectively. The precision at a particular rank N (P@N) is another evaluation metric
used to evaluate CBIR performance: the P@N score refers to the average number of correctly retrieved
images within the top-N ranked images. It should be noted that although mAP and P@N are widely
used as evaluation metrics in CBIR, defining a suitable metric to measure the quality of results for an
arbitrary query image is not a trivial process. In CBIR, it is hard to define the ground-truth, since
different users might have a different measure of similarity. If the degree of similarity of some of the
images is very low, ignoring or not displaying those images is not critical and does not impact the
overall performance of the system. Labelling images as non-relevant is not always satisfactory to the
users. Any CBIR system should have a certain tolerance for false positives, which often provides
useful information.

In this study, to form a hierarchically nested pyramid, at the lower layer, images were grouped
into a fixed number of clusters, while at the second layer, the means of the clusters at the first layer
were further grouped into smaller numbers of clusters. Since the number of images in both datasets is
in the region of a few thousand, two-layer hierarchies are enough to achieve real-time image querying.
In the MK and UCM datasets, based on our experience, the number of clusters at the first layer was set
to 44 and 21 clusters (the number of categories) and to 10 and 4 clusters at the top layer, respectively.

The retrieval process begins by calculating the local recursive density estimation between the
query image and all the clusters at the top layer and selecting the winning cluster with the maximum
local Recursive Density Estimation (RDE). The search continues at the lower layers, but only with the
clusters associated with the winning cluster at the top layer. Finally, images in the winning cluster
at the lowest stage are ranked based on calculating the eigenvector distance to the query image.
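Before turning to the per-dataset results, the evaluation protocol described above can be made concrete with a short sketch computing AP as in Equation (7), its mean over queries (mAP), and precision at rank N; binary relevance labels for the ranked results are assumed.

```python
import numpy as np

def average_precision(relevant, n_relevant_total):
    """Eq. (7): relevant is a binary list over the ranked retrieval results."""
    relevant = np.asarray(relevant, dtype=float)
    precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float(np.sum(precision_at_k * relevant) / n_relevant_total)

def mean_average_precision(per_query_relevance, per_query_totals):
    """mAP: average AP over all queries."""
    return float(np.mean([average_precision(r, t)
                          for r, t in zip(per_query_relevance, per_query_totals)]))

def precision_at_n(relevant, n):
    """P@N: fraction of relevant images within the top-N ranked results."""
    return float(np.mean(np.asarray(relevant[:n], dtype=float)))

# Example: 20 retrieved results for one query; 52 relevant images exist in total.
ranked_hits = [1, 1, 0, 1] + [0] * 16
print(average_precision(ranked_hits, n_relevant_total=52), precision_at_n(ranked_hits, 20))
```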
4.2.1. Retrieval Performance on MalayaKew Leaf-Dataset
The results of the convolutional neural network as a feature extractor (RL-CNN) are shown in
Figure 7 and Table 1. The precision accuracy at rank-20 is compared in Figure 7 based on 20 queries.
The queries were selected to cover every range of visual appearances, with either a unique shape, such as
qoxyodon, or similar appearances, like q-aff-cerris and qlaurifolia.
(Bar chart: MK dataset, retrieval Rank-20 accuracy (%) per class number.)
Figure 7. The retrieval Rank-20 accuracy between the Convolutional Neural Network (CNN) as a
feature extractor, bag of visual words, and multiple feature fusion (color and texture).
Table 1. The retrieval accuracy mAP of the convolutional neural network as a feature extractor (RL-CNN),
bag of visual words (BOVW), and multiple fused global features (MFF) on the Malaya–Kew (MK) and
University of California Merced (UCM) datasets.

Dataset      Method    mAP (%)
MalayaKew    FE-CNN    88.1
             BOVW      66.2
             MFF       52.6
UCM          FE-CNN    90.5
             BOVW      86.2
             MFF       69.8
Several observations can be achieved from the precision results. The RL-CNN method
outperformed the two state-of-the-art techniques by a large margin. The proposed method not
only performed well on classes with unique visual appearances, such as qlobata or qpetraea, but it
also distinguished categories with similar appearances, such as quercus and q-x-kewensis. In RL-CNN
method, q-x-mannifera, qboissieri, qellipsoidalis, qmacransmera, and qpetraea obtained maximum accuracy
with over 90%, whereas qlaurifolia and q-aff-cerris had the lowest value of 55% and 45%, respectively.
The qlaurifolia class achieved 55% accuracy, whereas 9 out of 20 images belong to qcanariensis,
qrhysophylla, and qtrotana categories (Figure 8A). The accuracy dropped to 35% and 30% in BOVW and
MFF, accordingly. The q-aff-cerris class obtained the lowest accuracy in RL-CNN with 45% accuracy
rate, whereas 11 out of 20 images belong to qrobur category, which is visually almost identical to the
query image.
Figure 8. Qualitative evaluation of the proposed image retrieval on the two lowest-performing classes
in the Malaya–Kew Leaf-Dataset: (A) retrieval result from the qlaurifolia class; (B) retrieval result from
the q-aff-cerris class. The first image is the query and the following images are the images most similar to
the query image. The retrieved images wrongly categorized are highlighted in red.
On the other hand, the BOVW and MFF performed poorly in identifying small differences between
leaf varieties in the MK dataset. Both methods retrieved images with visual similarity to the queries;
however, they failed to distinguish small visual differences among classes. As illustrated in Figure 7,
BOVW performed better than MFF in most cases, except classes q_rubur_f_purpubascens, qagriefolia,
qagrifolia, and qpetraea. (The results for each class are presented in the Supplementary Materials.)
Table 1 summarizes the mAP evaluation of the Malaya–Kew leaf dataset. The results are obtained
from 20 queries in which the retrieval system can be tested and evaluated. The best accuracy score is
88.1%, achieved by RL-CNN, followed by BOVW and MFF with 66.2% and 52.6%, respectively.
(Bar chart: UCM dataset, retrieval Rank-40 accuracy (%) per class for RL-CNN, BOW, and MFF.)
Figure 9. The retrieval Rank-40 accuracy between the feature extractor using a convolutional neural
network, bag of visual words, and multiple feature fusion (color and texture).
Figure 11 shows the retrieval results of the dense building class on a randomly given query. The
class achieved 35% accuracy, with 14 out of 40 retrieved images belonging to the same class as the query
image. However, the rest of the images, retrieved from the medium residential and mobile home park
classes, are still visually similar to the query. The freeway class, with 50% accuracy, has a similar
performance, whereas half of the retrieved images belong to the runway and overpass classes, which are
still visually very similar to the freeway class (Figure 12).
Figure 10. Retrieval results of the airplane category using the convolutional neural network as a feature
extractor (RL-CNN). The method obtained 100% retrieval accuracy.

Figure 11. Retrieval results of the dense-building category using the convolutional neural network as a
feature extractor (RL-CNN). The green rectangles indicate correct retrieval results.

Figure 12. Retrieval results of the freeway category using the convolutional neural network as a feature
extractor (RL-CNN). The red rectangles indicate incorrect retrieval results.
The retrieval mAP values of the different models on the UCM image dataset are listed in Table 1. As shown
in the table, the RL-CNN outperformed both of the state-of-the-art techniques. The mAP measure for
RL-CNN is 90.1%, whereas the BOW and MFF achieved 86.2% and 69.8%, respectively.
(Bar chart: retrieval time (sec) on the MK and UCM datasets for RL-CNN with the hierarchically nested
index, RL-CNN with sequential searching, BOW with an inverted index, and BOW without a hierarchy.)
A possible improvement is to employ an evolving clustering approach, removing the requirement of
pre-defining the number of clusters in advance. The advantage of using such a model is that, if new
images are added to the dataset, clustering the images and forming the hierarchical structure will not
need to be repeated from scratch.
Another improvement would be adding relevance feedback, which enables users to interact more with
the system and provide feedback on the relevance of the retrieved images. The feedback can be used for
learning and improving the performance of the CBIR system.
5. Conclusions
The research scope of this paper focused on a highly scalable and memory-efficient image retrieval
system. The aim was to overcome the limitations of conventional retrieval methods in the field of plant
biology and remote sensing to significantly boost the retrieval performance in terms of accuracy and
computational efficiency. The challenge was to preserve multi-dimensional and high discriminative
image representations derived by the CNN model and still maintain the computational efficiency of
the querying process. It is worth highlighting the following advantages of the proposed method:
• Fast Retrieval time: The proposed approach improves the retrieval process and is over 16 times
faster than the traditional brute-force sequential searching which is vital for large-scale databases.
• Scalability: The model is constructed in a hierarchical structure. The feature indexing in a
hierarchical form can handle a dynamic image database and can be easily integrated into the
server-client architecture.
• Unsupervised data mining: The proposed technique does not require any prior knowledge of
image repositories or any human intervention. However, in future work, human input/feedback
can potentially improve the performance.
• Recursive similarity measurement: The similarity measurements are done recursively,
which significantly reduces memory cost in high-scale multimedia CBIR systems.
• Discriminative power for quantifying images: Transfer learning is applied by utilizing a
pre-trained deep neural network model merely as a feature extractor. The results indicate that
the generic descriptors extracted from the CNNs are effective and powerful, and performed
consistently better than conventional content-based retrieval systems.
Furthermore, although the visual content was the main focus of this study, integrating keywords
and text into the CBIR pipeline could capture images’ semantic content and describe visually identical
images by means of linguistic cues.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Virlet, N.; Sabermanesh, K.; Sadeghi-Tehran, P.; Hawkesford, M.J. Field Scanalyzer: An automated robotic
field phenotyping platform for detailed crop monitoring. Funct. Plant Biol. 2017, 44, 143. [CrossRef]
2. Busemeyer, L.; Mentrup, D.; Möller, K.; Wunder, E.; Alheit, K.; Hahn, V.; Maurer, H.P.; Reif, J.C.; Würschum, T.;
Müller, J.; et al. BreedVision—A Multi-Sensor Platform for Non-Destructive Field-Based Phenotyping in
Plant Breeding. Sensors 2013, 13, 2830–2847. [CrossRef] [PubMed]
3. Kirchgessner, N.; Liebisch, F.; Yu, K.; Pfeifer, J.; Friedli, M.; Hund, A.; Walter, A. The ETH field phenotyping
platform FIP: A cable-suspended multi-sensor system. Funct. Plant Biol. 2017, 44, 154. [CrossRef]
4. Larson, R.R. Introduction to Information Retrieval. J. Am. Soc. Inf. Sci. 2010, 61, 852–853. [CrossRef]
5. Datta, R.; Joshi, D.; Li, J.; Wang, J.Z. Image retrieval: Ideas, influences, and trends of the new age.
ACM Comput. Surv. 2008, 40, 5–60. [CrossRef]
6. Lew, M.; Sebe, N.; Djeraba, C.; Jain, R. Content-based multimedia information retrieval: State of the art and
challenges. ACM Trans. Multimed. Comput. Commun. Appl. 2006, 2, 1–19. [CrossRef]
7. Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R. Content-based image retrieval at the end of
the early years. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1349–1380. [CrossRef]
8. Alzu’bi, A.; Amira, A.; Ramzan, N. Semantic content-based image retrieval: A comprehensive study. J. Vis.
Commun. Image Represent. 2015, 32, 20–54. [CrossRef]
9. Yu, H.; Li, M.; Zhang, H.-J.; Feng, J. Color texture moments for content-based image retrieval. In Proceedings
of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; pp. 929–932.
10. Lin, C.-H.; Chen, R.-T.; Chan, Y.-K. A smart content-based image retrieval system based on color and texture
feature. J. Image Vis. Comput. 2009, 27, 658–665. [CrossRef]
11. Singh, S.M.; Hemachandran, K. Content-Based Image Retrieval using Color Moment and Gabor
Texture Feature. IJCSI Int. J. Comput. Sci. 2012, 9, 299–309.
12. Guo, Y.; Zhao, G.; Pietikainen, M. Discriminative features for texture description. Pattern Recognit.
2012, 45, 3834–3843. [CrossRef]
13. Ahonen, T.; Matas, J.; He, C.; Pietikainen, M. Rotation invariant image description with local binary
pattern histogram fourier features. In Proceedings of the 16th Scandinavian Conference on Image Analysis
(SCIA 2009), Oslo, Norway, 15–18 June 2009; Springer: Berlin/Heidelberg, Germany, 2009.
14. Mezaris, V.; Kompatsiaris, I.; Strintzis, M.G. An ontology approach to object-based image retrieval.
In Proceedings of the 2003 International Conference on Image Processing (Cat. No.03CH37429), Barcelona,
Spain, 14–17 September 2003.
15. Nikkam, P.S.; Reddy, B.E. A Key Point Selection Shape Technique for Content based Image Retrieval System.
Int. J. Comput. Vis. Image Process. 2016, 6, 54–70. [CrossRef]
J. Imaging 2019, 5, 33 19 of 21
16. Zhou, W.; Li, H.; Tian, Q. Recent Advance in Content-based Image Retrieval: A Literature Survey. arXiv 2017,
arXiv:1706.06064.
17. Tsai, C.F. Bag-of-words representation in image annotation: A review. ISRN Artif. Intell. 2012, 2012.
[CrossRef]
18. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
[CrossRef]
19. Bay, H.; Tuytelaars, T.; Gool, L. Surf: Speeded Up Robust Features. In Proceedings of the 9th European
Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006;
pp. 404–417.
20. Leutenegger, S.; Chli, M.; Siegwart, R.Y. Brisk: Binary Robust Invariant Scalable Keypoints. In Proceedings
of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011;
pp. 2548–2555.
21. Perronnin, F.; Liu, Y.; Sánchez, J. Large-scale image retrieval with compressed fisher vectors. In Proceedings
of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco,
CA, USA, 13–18 June 2010.
22. Jegou, H.; Douze, M.; Schmid, C. Aggregating local descriptors into a compact image representation. In
Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
San Francisco, CA, USA, 13–18 June 2010.
23. Bengio, Y. Learning Deep Architectures for AI. Found. Trends®Mach. Learn. 2009, 2, 1–127. [CrossRef]
24. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition.
arXiv 2014, arXiv:1409.1556.
25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
26. Tzelepi, M.; Tefas, A. Deep convolutional learning for Content Based Image Retrieval. Neurocomputing
2018, 275, 2467–2478. [CrossRef]
27. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
[PubMed]
28. Johnson, R.; Zhang, T. Semi-supervised Convolutional Neural Networks for Text Categorization via Region
Embedding. In Proceedings of the Twenty-Ninth Conference on Neural Information Processing Systems
(NIPS 2015), Montreal, QC, Canada, 7–12 December 2015.
29. Shen, Y.; He, X.; Gao, J.; Deng, L.; Mesnil, G. A Latent Semantic Model with Convolutional-Pooling Structure
for Information Retrieval. In Proceedings of the 23rd ACM International Conference on Information and
Knowledge Management, Shanghai, China, 3–7 November 2014.
30. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional Neural Networks for
Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [CrossRef]
31. Borji, A.; Cheng, M.-M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
[CrossRef] [PubMed]
32. Tzelepi, M.; Tefas, A. Deep convolutional image retrieval: A general framework. Signal Process. Image
Commun. 2018, 63, 30–43. [CrossRef]
33. Wan, J.; Wang, D.; Hoi, S.; Wu, P.; Zhu, J. Deep learning for content-based image retrieval: A comprehensive
study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA,
3–7 November 2014; pp. 157–166.
34. Sun, S.; Zhou, W.; Tian, Q.; Li, H. Scalable Object Retrieval with Compact Image Representation from Generic
Object Regions. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2016, 12, 29. [CrossRef]
35. Lai, H.; Pan, Y.; Liu, Y.; Yan, S. Simultaneous Feature Learning and Hash Coding with Deep Neural Networks.
In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston,
MA, USA, 7–12 June 2015; pp. 3270–3278.
36. Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. Multi-scale Orderless Pooling of Deep Convolutional Activation
Features. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014;
Springer: Cham, Switzerland, 2014; Volume 8695, pp. 392–407.
37. Ng, J.Y.-H.; Yang, F.; Davis, L.S. Exploiting local features from deep networks for image retrieval.
In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 53–61.
38. Mohedano, E.; McGuinness, K.; O’Connor, N.E.; Salvador, A.; Marques, F.; Giro-i-Nieto, X. Bags of Local
Convolutional Features for Scalable Instance Search. In Proceedings of the 2016 ACM on International
Conference on Multimedia Retrieval, New York, NY, USA, 6–9 June 2016; ACM Press: New York, NY, USA,
2016; pp. 327–331.
39. Angelov, P.; Sadeghi-Tehran, P. Look-a-Like: A Fast Content-Based Image Retrieval Approach Using a
Hierarchically Nested Dynamically Evolving Image Clouds and Recursive Local Data Density. Int. J. Intell.
Syst. 2016, 32, 82–103. [CrossRef]
40. Angelov, P.; Sadeghi-Tehran, P. A Nested Hierarchy of Dynamically Evolving Clouds for Big Data Structuring
and Searching. Procedia Comput. Sci. 2015, 53, 1–8. [CrossRef]
41. Cai, J.; Liu, Q.; Chen, F.; Joshi, D.; Tian, Q. Scalable Image Search with Multiple Index Tables. In Proceedings
of the International Conference on Multimedia Retrieval, Glasgow, UK, 1–4 April 2014; p. 407.
42. Nister, D.; Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22
June 2006; pp. 2161–2168.
43. Zhou, W.; Lu, Y.; Li, H.; Song, Y.; Tian, Q. Spatial coding for large scale partial-duplicate web image search.
In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010;
pp. 511–520.
44. Wu, Z.; Ke, Q.; Isard, M.; Sun, J. Bundling features for large scale partial-duplicate web image search. In
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA,
20–25 June 2009; pp. 25–32.
45. Bartolini, I.; Patella, M. WINDSURF: The best way to SURF. Multimed. Syst. 2018, 24, 459–476. [CrossRef]
46. Zhang, J.; Peng, Y.; Ye, Z. Deep Reinforcement Learning for Image Hashing. arXiv 2018, arXiv:1802.02904.
47. Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep Supervised Hashing for Fast Image Retrieval. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA,
27–30 June 2016; pp. 2064–2072.
48. Jiang, K.; Que, Q.; Kulis, B. Revisiting Kernelized Locality-Sensitive Hashing for Improved Large-Scale
Image Retrieval. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4933–4941.
49. Tang, J.; Li, Z.; Wang, M. Neighborhood discriminant hashing for large-scale image retrieval. IEEE Trans.
Image Process. 2015, 24, 2827–2840. [CrossRef] [PubMed]
50. Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable
distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, Brooklyn,
NY, USA, 8–11 June 2004; pp. 253–262.
51. Cao, Z.; Long, M.; Wang, J.; Yu, P.S. HashNet: Deep Learning to Hash by Continuation. In Proceedings
of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017;
pp. 5609–5618.
52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks.
In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV,
USA, 3–8 December 2012; pp. 1097–1105.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the 14th
European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016;
Springer: Cham, Switzerland, 2016; Volume 9908, pp. 630–645.
54. Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN Features Off-the-Shelf: An Astounding
Baseline for Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, Columbus, OH, USA, 24–27 June 2014; pp. 806–813.
55. Olivas, E.S. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and
Techniques; IGI Global: Hershey, PA, USA, 2009.
56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June
2016; pp. 770–778.
57. Yang, L.; Qi, X.; Xing, F.; Kurc, T.; Saltz, J.; Foran, D.J. Parallel content-based sub-image retrieval using
hierarchical searching. Bioinformatics 2013, 30, 996–1002. [CrossRef] [PubMed]
58. Distasi, R.; Vitulano, D.; Vitulano, S. A Hierarchical Representation for Content-based Image Retrieval. J. Vis.
Lang. Comput. 2000, 11, 369–382. [CrossRef]
59. Jiang, F.; Hu, H.M.; Zheng, J.; Li, B. A hierarchal BoW for image retrieval by enhancing feature salience.
Neurocomputing 2016, 175, 146–154. [CrossRef]
60. You, J.; Li, Q. On hierarchical content-based image retrieval by dynamic indexing and guided search.
In Proceedings of the 2009 8th IEEE International Conference on Cognitive Informatics (ICCI’09), Hong Kong,
China, 15–17 June 2009; pp. 188–195.
61. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [CrossRef]
62. Angelov, P. Anomalous System State Identification. U.S. Patent US9390265B2, 15 May 2012.
63. Angelov, P. Evolving Rule-Based Models: A Tool for Design of Flexible Adaptive Systems; Springer:
Berlin/Heidelberg, Germany, 2002.
64. Angelov, P.; Sadeghi-Tehran, P.; Ramezani, R. A Real-time Approach to Autonomous Novelty Detection and
Object Tracking in Video Stream. Int. J. Intell. Syst. 2011, 26, 189–205. [CrossRef]
65. Zhang, C.; Huang, L. Content-Based Image Retrieval Using Multiple Features. J. Comput. Inf. Technol.
2014, 22, 1–10. [CrossRef]
66. Wang, X.-Y.; Zhang, B.-B.; Yang, H.-Y. Content-based image retrieval by integrating color and texture features.
Multimed. Tools Appl. 2012, 68, 545–569. [CrossRef]
67. Yue, J.; Li, Z.; Liu, L.; Fu, Z. Content-based image retrieval using color and texture fused features.
Math. Comput. Model. 2011, 54, 1121–1127. [CrossRef]
68. Oliva, A.; Torralba, A. Building the Gist of A Scene: The Role of Global Image Features in Recognition.
Prog. Brain Res. 2006, 155, 23–36. [PubMed]
69. Huang, J.; Kumar, S.R.; Mitra, M.; Zhu, W.-J.; Zabih, R. Image indexing using color correlograms.
In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
San Juan, Puerto Rico, USA, 17–19 June 1997; pp. 762–768.
70. Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; Gong, Y. Locality-Constrained Linear Coding For Image
Classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3360–3367.
71. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing
Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 2169–2178.
72. Lee, S.H.; Chan, C.S.; Wilkin, P.; Remagnino, P. Deep-Plant: Plant Identification with Convolutional
Neural Networks. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP),
Quebec City, QC, Canada, 27–30 September 2015.
73. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings
of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose,
CA, USA, 2–5 November 2010; ACM: New York, NY, USA, 2010; pp. 270–279.
74. Yu, H.; Yang, W.; Xia, G.-S.; Liu, G. A Color-Texture-Structure Descriptor for High-Resolution Satellite Image
Classification. Remote Sens. 2016, 8, 259. [CrossRef]
75. Li, Y.; Tao, C.; Tan, Y. Unsupervised multilayer feature learning for satellite image scene classification.
IEEE Geosci. Remote Sens. Lett. 2016, 13, 157–161. [CrossRef]
76. Romero, A.; Gatta, C.; Camps-Valls, G. Unsupervised deep feature extraction for remote sensing image classification. IEEE Trans. Geosci.
Remote Sens. 2016, 54, 1349–1362. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).