Article
Scalable Database Indexing and Fast Image Retrieval
Based on Deep Learning and Hierarchically Nested
Structure Applied to Remote Sensing and
Plant Biology
Pouria Sadeghi-Tehran 1, * , Plamen Angelov 2 , Nicolas Virlet 1 and Malcolm J. Hawkesford 1
1 Department of Plant Sciences, Rothamsted Research, Harpenden AL5 2JQ, UK;
[email protected] (N.V.); [email protected] (M.J.H.)
2 School of Computing and Communications, InfoLab21, Lancaster University, Lancaster LA1 4WA, UK;
[email protected]
* Correspondence: [email protected]
Received: 6 November 2018; Accepted: 18 February 2019; Published: 1 March 2019
Abstract: Digitalisation has opened a wealth of new data opportunities by revolutionizing how
images are captured. Although the cost of data generation is no longer a major concern, the data
management and processing have become a bottleneck. Any successful visual trait system requires
automated data structuring and a data retrieval model to manage, search, and retrieve unstructured
and complex image data. This paper investigates a highly scalable and computationally efficient
image retrieval system for real-time content-based searching through large-scale image repositories
in the domain of remote sensing and plant biology. Images are processed independently without
considering any relevant context between sub-sets of images. We utilize a deep Convolutional
Neural Network (CNN) model as a feature extractor to derive deep feature representations from
the imaging data. In addition, we propose an effective scheme to optimize data structure that can
facilitate faster querying at search time based on the hierarchically nested structure and recursive
similarity measurements. A thorough series of tests were carried out for plant identification and
high-resolution remote sensing data to evaluate the accuracy and the computational efficiency of the
proposed approach against other content-based image retrieval (CBIR) techniques, such as the bag
of visual words (BOVW) and multiple feature fusion techniques. The results demonstrate that the
proposed scheme is effective and considerably faster than conventional indexing structures.
Keywords: content-based image retrieval; deep convolutional neural networks; information retrieval;
data indexing; recursive similarity measurement; deep learning; bag of visual words; remote sensing
1. Introduction
Today, digital images and videos are ubiquitous in every domain. The advancement in
multi-media technologies has led to the generation of an enormous number of images and videos.
The size of image repositories has increased rapidly in many domains, such as biology, remote sensing,
medical, military, and web-searching. The use of automated data acquisition systems, such as modern
phenotyping platforms [1–3], has revolutionized the way data are collected and analyzed. The plant
science community is seeking novel solutions to fully exploit all the potential offered by such new
platforms equipped with high-resolution remote sensing sensors. Any large-scale dataset in modern
biological sciences first and foremost requires reliable data infrastructure and an efficient information
retrieval system. For large-scale image repositories, manual tagging is infeasible and prone
to errors due to users’ subjective opinions. Thus, to utilize such unstructured and complex image
collections, there is a substantial need for content-based image retrieval (CBIR) systems for browsing
through images at a large scale and to classify, structure, and retrieve relevant information requested
by the users.
Information retrieval (IR) refers to finding material (image repositories or documents) of an
unstructured nature (image or text) that satisfies an information need from within large collections [4].
There is a fundamental difference between CBIR and search by text and metadata. Searching methods
based on metadata rarely examine the content of an image itself but rather rely on manual annotations
and tagging. In these systems, words are stored as ASCII character strings to describe image content.
However, the high complexity of images cannot be described easily by keywords; thus, retrieval
systems which are based solely on manual annotation often lead to unsatisfactory outcomes. In contrast,
CBIR does not require keywords (manual annotation) and desired images are retrieved automatically
based on their similarity to the query representation [5–7].
Although CBIR techniques are beginning to find a foothold in many applications, such as biology,
remote sensing, satellite imaging, etc., the technology still suffers from lack of maturity due to
a significant gap towards semantic-aware retrieval from visual content. A major challenge associated
with CBIR systems is to extract information from an image which is unique and representative,
to overcome the issue of the so-called semantic gap. The semantic gap refers to the discrepancy between
low-level image features, such as color and texture, and the higher level of understanding of the image
perceived by humans [8]. Due to the absence of solid evidence on the effectiveness of
CBIR techniques for high-throughput datasets with varied collections of images, opinion is still sharply
divided regarding the reliability and performance of such systems in real-time. It is essential to
standardize CBIR for easy access to data and speed up the retrieval process.
In this paper, a new concept of CBIR is employed to exploit the opportunities presented by large
image-based repositories, particularly in remote sensing and plant biology. The proposed approach,
which relies solely on the contents of the images, will pave the way for a computationally efficient and
real-time image querying through an unstructured image database. The end-to-end CBIR framework operates
without supervision. First, we utilize a deep CNN model as a feature extractor to obtain the
feature representations from the activations of the convolutional layers. In the next step, a hierarchically
nested database indexing structure and local recursive density estimation are developed to facilitate
an efficient and fast retrieval process. Finally, the key elements of CBIR, accuracy and computational
efficiency, are evaluated and compared with the state-of-the-art CBIR techniques.
2. Related Works
The core modules of any CBIR system include image representation, database indexing,
and image scoring. Image representation has traditionally relied on primitive visual properties, described in detail below:
• Color properties are extracted directly from the pixel densities over the whole image, segmented
regions/bins, or sub-image. Image descriptors that characterize the color properties of an image
seek to model the distribution of the pixel intensities in each channel of the image. These methods
include color statistics, such as deviation, mean, and skewness, along with color histograms.
Since color features are robust to background complications and are invariant to the size or
orientation of an image, the color based methods have become one of the most common techniques
in CBIR [9–11].
• Texture properties measure visual patterns in images that contain important information about
the structural arrangement of surfaces, i.e., fabric, bricks, etc. Texture descriptors seek to model the
feel, appearance, and overall tactile quality of an object in an image and are defined as a structure
of surfaces formed by repeating a particular element or several elements in different relative
spatial distribution and synthetic structure. In general, the repetition involves local variations of
scale, orientation, or other geometric and optical features of the elements [12,13].
• Shape properties can also be considered as one of the fundamental perceptual characteristics.
Shape properties take on many non-geometric and geometric forms, such as moment invariants,
aspect ratio, circularity, and boundary segments. There are difficulties associated with shape
representation and descriptors techniques due to noise, occlusion, and arbitrary distortion, which
often causes inaccuracies in extracting shape features. Nonetheless, the method has shown
promising results to describe the image content [14,15].
Whilst the above techniques focus on primitive features, more recent techniques have been aimed
to find semantically richer image representations by extracting a collection of local invariant features.
The main advantage of semantic features is locality, which means that the extracted features are local
and robust to clutter and occlusion. Also, individual features can be matched to a large database of
objects and have close to real-time performance.
One of the most effective techniques is the bag of visual words technique [16,17]. The main reasons
that BOVW has gained popularity in classification and retrieval applications are the use of powerful
local descriptors, such as Scale Invariant Feature Transform (SIFT) [18], Speeded Up Robust Features
(SURF) [19], and Binary Robust Invariant Scalable Keypoints (BRISK) [20]. In addition, the vector
representations can be compared with standard distances, and subsequently be used for effective
CBIR. However, the main drawback of BOVW is the high dimensional vector representing an image.
Although a high-dimensional vector usually provides better exhaustive search results compared to a
low-dimensional one, it is more difficult to index efficiently. Aggregated vectors, such as Fisher Vector
(FV) [21] and Vector of Locally Aggregated Descriptors (VLAD) [22] aim to address this problem by
encoding an image into a single vector, reducing the dimensionality without noticeably impacting the
accuracy [16,17].
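For illustration, the following is a minimal bag-of-visual-words sketch (not the implementation used in the works cited above): local SIFT descriptors are clustered into a visual vocabulary with k-means, and each image is then encoded as a normalized histogram of visual-word occurrences. The vocabulary size and the OpenCV/scikit-learn tooling are assumptions made for this example.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_path):
    """Return the local SIFT descriptors of an image (empty array if none found)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), dtype=np.float32)

def build_vocabulary(image_paths, vocab_size=256):
    """Cluster all local descriptors of the training images into a visual vocabulary."""
    all_desc = np.vstack([sift_descriptors(p) for p in image_paths])
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=0).fit(all_desc)

def bovw_histogram(image_path, vocabulary):
    """Encode one image as an L1-normalized histogram of visual words."""
    desc = sift_descriptors(image_path)
    if len(desc) == 0:
        return np.zeros(vocabulary.n_clusters)
    words = vocabulary.predict(desc)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histograms can be compared with standard distances, which is what makes the representation convenient for retrieval, at the cost of the high dimensionality discussed above.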
Nevertheless, despite the robustness of local descriptors techniques, global features are still
desirable in a variety of computer vision applications. Ultimately, having an intimate knowledge of
the dataset contents will provide a better perspective for which feature extraction techniques might be
appropriate. For example, for datasets whose classes have distinct color distributions, color descriptors
will be more effective. Nonetheless, the effectiveness of hand-crafted feature representation in CBIR
is inherently limited, as these approaches mainly operate at the primitive level. As presented in the
following section, higher accuracy will be achieved by extracting semantic features from images based
on learning-based features using deep networks.
Deep learning models learn hierarchical representations of data through multiple layers of
non-linear transformations [27]. In CNNs, features are extracted at multiple levels of abstraction, which
allows the system to learn complex functions that directly map raw sensory input data to the output,
without relying on hand-engineered features based on domain knowledge.
CNN has achieved state-of-the-art performance in a variety of applications, including natural
language processing [28,29], speech recognition [30], and object recognition [31]. Inspired by the
success of CNN in many computer vision applications, it has started to gain a foothold in the
research area of CBIR. Subsequently, CNN models have been proposed to improve the image
retrieval workflow [16,32,33]. For instance, in Sun et al. [34], features derived from local image
regions identified with a general object detector and an adapted CNN model have been evaluated
on two public large-scale image datasets. Lai et al. [35] proposed simultaneous feature learning
using deep neural networks and hash coding. The short binary codes resulted from hash coding
achieved efficient retrieval and a considerable saving in memory usage. In other techniques, CNN
descriptors are combined with conventional descriptors such as the VLAD representation [36,37].
Finally, in Mohedano et al. [38] authors proposed a method based on encoding the convolutional
features of CNN and the BOVW aggregation scheme. The approach outperformed the state-of-the-art
tested on landmark datasets.
3. Methodology
In this paper, we focus on three key challenges of any content-based image retrieval: image
representation, database indexing, and image similarity measurement. Figure 1 illustrates an overall
view of the proposed framework. The first step in the prescriptive analytics process is to transform the
initial unstructured and structured data sources into analytically prepared data. To achieve a balance
between complexity and efficiency, a pre-trained CNN is used to utilize the ability of the model to
produce better image representations for the retrieval task. We leverage an existing model trained on
the ImageNet dataset [52], known as residual network (ResNet) [53]. The model is used as a fixed
feature extractor without the last fully connected layer. The trained model provides access to the visual
descriptors previously learnt by the CNN after processing millions of images in the ImageNet dataset,
without requiring a computationally expensive training phase.

Although the deep learning model is effective in extracting discriminative visual features from
images (Section 4.2), it computes multi-dimensional feature vectors (2048-D in our case) for
every image, which increases the computational complexity of feature indexing and querying. To
address the multi-dimensional complexity caused by the CNN model, a novel nested hierarchical
database indexing is proposed to facilitate fast querying. In addition, a recursive calculation based
on local density estimation is used to measure the similarity between the given query and all the
images from a given image cluster.
Figure 1. Schematic representation of the retrieval model.
In a standard classification setting, an input image is forward propagated through the entire
network and the final probabilities are obtained from the end of the network. However, in
representation learning, instead of allowing the image to forward propagate through the entire network,
we can stop the propagation at an arbitrary layer, such as the last fully connected layer, extract the
values from the network at this point, and then use them as feature vectors.
Figure 2. Representation learning scheme. Deep feature extraction from the pretrained Convolutional
Neural Network (CNN) model.
In this study, we utilize the convolutional layers merely as a feature extractor. The aim is to
generalize a trained CNN in learning discriminative feature representations for the images in our
dataset. The trained model is used to derive feature vectors that are more powerful than hand-designed
algorithms such as SIFT, GIST, HOG, etc. We exploit the ability of a well-known deep convolutional
neural network framework known as residual learning (ResNet) [53,56]. Residual learning frameworks
ease the training of deeper networks and are a great candidate to capture the discriminative properties
of images as a fixed feature extractor model. Network depth is a key element in neural network
architecture; however, deeper networks are more difficult to train, as the accuracy gets saturated and
then degrades rapidly. When deeper networks start converging, a degradation problem is exposed
which is not caused by overfitting, while adding more layers causes even higher training error.
In residual learning models, instead of learning a direct mapping x → y with a function H(x),
the residual function is defined using H(x) = F(x) + x, where F(x) and x represent the residual mapping
function and the identity function, respectively. The authors’ hypothesis is that it is easier to optimize
F(x) than to optimize the original mapping function, H(x). We refer readers to [53,56] for more details.
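As a concrete illustration of the residual formulation H(x) = F(x) + x, the following is a minimal PyTorch-style residual block; it is a generic sketch rather than the exact bottleneck block of the pre-trained ResNet used here, and the channel count is an arbitrary assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with F built from two conv layers."""
    def __init__(self, channels=64):
        super().__init__()
        self.residual = nn.Sequential(          # F(x): the residual mapping
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # H(x) = F(x) + x: the identity shortcut is what makes F easier to optimize.
        return self.relu(self.residual(x) + x)

# Example: a batch of feature maps passes through with its shape unchanged.
features = torch.randn(8, 64, 56, 56)
assert ResidualBlock(64)(features).shape == features.shape
```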
The employed ResNet model has been pre-trained on the ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) 2012, to classify 1.3 million images into 1000 ImageNet classes [52]. The ResNet
consists of convolutional layers, pooling layers, and fully connected layers. The network takes images
of size 224 × 224 pixels as input, which then pass through the network in a forward pass after filters
are applied to the input image. When treating the network as a fixed feature extractor, we cut it off at
an arbitrary point (normally prior to the last fully-connected layers); thus, all image features are
extracted directly from the activations of the convolutional feature maps. This computes a 2048-D
feature vector for every image, containing the activations of the hidden layer immediately before the
classifier. The 2048-D feature vectors are directly used for computing the similarity between images.
The computational complexity and retrieval process may become cumbersome as the dimensionality
grows. This requires us to optimize the retrieval process by proposing a hierarchically nested indexing
structure and recursive similarity measurements to facilitate faster access and comparison of
multi-dimensional feature vectors, as described in the following sections.
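The feature-extraction step described above can be sketched as follows, treating a pre-trained ResNet-50 as a fixed feature extractor by dropping the final fully connected layer and keeping the 2048-D activations of the global pooling layer. The use of torchvision (≥ 0.13) and the standard ImageNet preprocessing are assumptions of this sketch, not a description of the authors’ exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

# Load ResNet-50 pre-trained on ImageNet and cut off the final classifier (fc) layer,
# leaving the global-average-pooled 2048-D activations as the image descriptor.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
extractor = nn.Sequential(*list(resnet.children())[:-1]).eval()

# Standard ImageNet preprocessing: 224 x 224 input, per-channel normalization.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def deep_feature(image_path):
    """Return the 2048-D deep feature vector for a single image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return extractor(x).flatten(1).squeeze(0)   # shape: (2048,)
```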
3.2. Feature Indexing Based on Hierarchical Nested Data Clusters
The success of a CBIR system not only depends on image delineation; feature indexing and the
similarity measurement metric also play vital roles in facilitating the execution of queries. In general,
feature indexing refers to a database organizing structure that assists a fast retrieval process. Whilst it
is feasible to retrieve information from datasets which are small in size by measuring the similarity
between a query and every image in the dataset, the computational complexity will soon increase
significantly on a larger-scale image database.
In an attempt to address the challenges faced by information retrieval on a large-scale dataset,
we present a hierarchically nested structure. The introduced database indexing aims at arranging and
structuring the image database into a simple yet effective form of data clusters and hierarchies.
Although forming a hierarchical structure for retrieval optimization has been explored before [57–60],
the method presented in this study is quite different. Hierarchically nested data clusters are structured
such that data clusters at higher layers represent one or multiple clusters at a lower layer, based on
the mean values of the cluster centers (Figure 3). The first-layer clusters are generated based on the
feature representations derived from the CNN model. Data clusters are formed by grouping the relevant
data points using a partition-based clustering approach known as K-means clustering [61]. Figure 3
illustrates how the hierarchical structure of clusters is formed; µ and X are abstract values and denote
the mean values and scalar products explained in Section 3.3.
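A minimal sketch of the two-layer nested structure described above (an illustrative reconstruction, not the authors’ code): the first layer clusters the CNN feature vectors with K-means, and the second layer clusters the mean values (centroids) of the first-layer clusters, so that every top-layer cluster points to the group of lower-layer clusters it summarizes. The cluster counts are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_nested_index(features, n_lower=44, n_upper=10, seed=0):
    """Build a two-layer hierarchically nested index over deep feature vectors.

    features: (N, 2048) array of CNN descriptors.
    Returns the lower/upper K-means models and, for each upper cluster,
    the indices of the lower-layer clusters it represents.
    """
    lower = KMeans(n_clusters=n_lower, n_init=10, random_state=seed).fit(features)
    # The upper layer clusters the mean values (centroids) of the lower-layer clusters.
    upper = KMeans(n_clusters=n_upper, n_init=10, random_state=seed).fit(lower.cluster_centers_)
    children = {c: np.where(upper.labels_ == c)[0] for c in range(n_upper)}
    return lower, upper, children
```

In this sketch, the lower and upper cluster counts would correspond to the per-dataset settings reported later in Section 4.2.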
Figure 3. Schematic representation of the hierarchical nested indexing structure.
3.3. Fast Searching and Similarity Measure Based on Recursive Data Density Estimation
The final step after forming the hierarchically nested data clusters is to find the cluster which
contains the images most similar to a query image. We applied recursive density estimation [62,63] to
measure the similarity between the query image and all images inside each cluster recursively. The main
idea of the recursive density function is to estimate the probability density function with a Cauchy-type
kernel and to calculate it recursively. The method has also been applied for novelty detection in real-time
data streams and video analytics [64]. The recursive calculation allows us to discard each data sample
once it has been processed and to store in memory only the accumulated information concerning the
local mean (per cluster), µ, and the scalar product, X. In order to speed up the retrieval process by an
order of magnitude, the searching process is performed from the top of the pyramid in an ordered
hierarchy, based on a “winner takes all” principle with the maximum local recursive density estimation
at each level (Figure 4).
Figure 4. Schematic representation of searching through the hierarchical nested structure and retrieving
the most similar images (winner cluster) to the query.
The degree of similarity between the query image and the images inside each cluster is measured by
the relative local density with regard to the query image, which is defined by a suitable kernel over
the distance between the current image sample and all the other images inside the cluster:

D_i^c = K\left( \sum_{j=1}^{M_c} d_{ij}^c \right), \qquad c = [1, C]    (1)

Different types of distance measures can be used, such as the Euclidean or Cosine distance. We used
a Cauchy type of kernel to define the local density D_i^c. It can be proven that the Cauchy type kernel
asymptotically tends to a Gaussian, but it can be calculated recursively [63]:

D_i^c = \frac{1}{1 + \| F_i - \mu_i^c \|^2 + X_i^c - \| \mu_i^c \|^2}    (2)

where F = \{ f_1, \cdots, f_{2048} \} is the feature vector; i = 1, 2, \ldots, N_c; and N_c is the number of
images within the cth cluster.

Both the mean, \mu_i, and the scalar product, X_i, are updated recursively as follows [63]:

\mu_i = \frac{i-1}{i}\, \mu_{i-1} + \frac{1}{i}\, F_i; \qquad \mu_1 = F_1    (3)

X_i = \frac{i-1}{i}\, X_{i-1} + \frac{1}{i}\, \| F_i \|^2; \qquad X_1 = \| F_1 \|^2    (4)

Finally, the cluster with the maximum local density D_i^c with respect to the query image is the one
most likely to contain similar images:

C^* = \underset{c = [1, C]}{\arg\max} \{ D_i^c \}    (5)
The final step is the similarity measurement between the query image and all the images inside
the winning cluster at the lowest layer. The relevance score is defined by distance-based scoring using
City Block distance. Images are then ranked according to their obtained scores. A smaller value
of City Block distance implies that the corresponding image is more similar to the query image and
vice versa. The City Block distance between the query image and images inside the winner cluster is
calculated as follows:
d(I_j, Q) = \sum_{k=1}^{K} \left| Q_k - I_k^j \right|; \qquad j = 1, \ldots, N_c    (6)
where N_c is the number of images in the winning cluster; K is the number of extracted features (K = 2048);
Q denotes the query image; and I^j is the jth image in the winning cluster.
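To tie Equations (1)–(6) together, here is a minimal sketch of the scoring stage: per-cluster means and scalar products are accumulated recursively as in Equations (3) and (4), the query is scored against each cluster with the Cauchy-type local density of Equation (2), the winner is selected as in Equation (5), and the images of the winning cluster are ranked by City Block distance as in Equation (6). The cluster data structure (a list of dictionaries) is an assumption made for this example.

```python
import numpy as np

def update_cluster_stats(mu, X, i, F):
    """Recursive update of the cluster mean and scalar product, Eqs. (3)-(4).
    i is the 1-based index of the new feature vector F within its cluster."""
    if i == 1:
        return F.copy(), float(F @ F)
    mu_new = (i - 1) / i * mu + F / i
    X_new = (i - 1) / i * X + (F @ F) / i
    return mu_new, X_new

def local_density(F, mu, X):
    """Cauchy-type local density of query F with respect to one cluster, Eq. (2)."""
    return 1.0 / (1.0 + np.sum((F - mu) ** 2) + X - np.sum(mu ** 2))

def retrieve(query, clusters):
    """Pick the winner cluster (Eq. (5)) and rank its images by City Block distance (Eq. (6)).
    clusters: list of dicts with keys 'mu', 'X', and 'features' (N_c x 2048 array)."""
    densities = [local_density(query, c["mu"], c["X"]) for c in clusters]
    winner = clusters[int(np.argmax(densities))]
    scores = np.sum(np.abs(winner["features"] - query), axis=1)   # L1 distance per image
    return winner, np.argsort(scores)   # ascending order: most similar images first
```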
4.1. Datasets
MalayaKew (MK) Leaf-Dataset: This dataset [72] consists of a collection of leaves from 44 species
classes, with 52 images in each class. The data is in the form of digital images of 256 × 256 pixels,
collected at the Royal Botanic Garden, Kew, England. The dataset has previously been used only for
supervised image classification, since it is extremely challenging: some of the classes have very similar
appearances (Figure 5), making it extremely difficult to distinguish differences between classes with a
fully unsupervised model, as presented in this study. Although the MK dataset is
not considered a big dataset, we believe the similarity between classes can be a good example to
demonstrate how discriminative the features are between the convolutional neural networks and the
hand-crafted methods.
Figure 5. Sample images of the MalayaKew 44 leaf collection.
The University of California Merced (UCM) Dataset: The UCM dataset [73] consists of 21 land-cover
classes of large-scale aerial images from the USGS national map urban area imagery. Each class contains
100 images of 256 × 256 pixels; the spatial resolution of each pixel is 30 cm, measured in the RGB spectral
space. The dataset has been widely utilized for evaluating the performance of high-resolution remote
sensing image scene classification [74–76]. The UCM dataset shows very small inter-class diversity
among some categories that share a few similar texture patterns or objects, which makes this dataset
very challenging. Some sample image scenes from the UCM dataset are shown in Figure 6.
Figure 6. Sample images of the University of California Merced (UCM) dataset.
4.2. Performance and Accuracy
Throughout this work, we use two evaluation metrics widely used to assess CBIR performance,
known as mean Average Precision (mAP) and precision at rank N (P@N). Average Precision (AP) is
one of the most frequent methods used to evaluate the retrieval quality of a single query’s retrieval
results. AP takes into consideration both Precision (Pr) and Recall (Re). Precision is the fraction of
retrieved images that are relevant, whereas Recall is the fraction of relevant images that are retrieved.
AP averages the precision values at the rank positions where relevant images are retrieved. The mean
average precision (mAP) is widely used to summarize the retrieval quality; it averages the AP over
all queries. The definition of the above metrics follows below [4]:

AP = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{R}    (7)
where P(k) denotes the precision of the top k retrieval results; rel(k) is a binary indicator function
equaling 1 if the kth retrieved result is relevant to the current query image and 0 otherwise; and R
and n denote the number of relevant results for the current query image and the total number of
retrieved results, respectively. The precision at a particular rank N (P@N) is another evaluation metric
used to evaluate CBIR performance: the P@N score refers to the average number of correctly retrieved
images within the top-N ranked images. It should be noted that although mAP and P@N are widely
used as evaluation metrics in CBIR, defining a suitable metric to measure the quality of results for an
arbitrary query image is not a trivial process. In CBIR, it is hard to define the ground-truth, since
different users might have a different measure of similarity. If the degree of similarity of some of the
images is very low, ignoring or not displaying those images is not critical and does not impact the
overall performance of the system. Labelling images as non-relevant is not always satisfactory to the
users. Any CBIR system should have a certain tolerance for false positives, which often provides
useful information.

In this study, to form a hierarchically nested pyramid, at the lower layer, images were grouped
into a fixed number of clusters, while at the second layer, the means of the clusters at the first layer
were further grouped into smaller numbers of clusters. Since the number of images in both datasets is
in the region of a few thousand, two-layer hierarchies are enough to achieve real-time image querying.
In the MK and UCM datasets, based on our experience, the number of clusters at the first layer was set
to 44 and 21 clusters (the number of categories) and to 10 and 4 clusters at the top layer, respectively.

The retrieval process begins by calculating the local recursive density estimation between the
query image and all the clusters at the top layer and selecting the winning cluster with the maximum
local Recursive Density Estimation (RDE). The search continues at the lower layers, but only with the
clusters associated with the winning cluster at the top layer. Finally, images in the winning cluster
at the lowest stage are ranked based on calculating the eigenvector distance to the query image.
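Before turning to the per-dataset results, the evaluation protocol described above can be made concrete with a short sketch computing AP as in Equation (7), its mean over queries (mAP), and precision at rank N; binary relevance labels for the ranked results are assumed.

```python
import numpy as np

def average_precision(relevant, n_relevant_total):
    """Eq. (7): relevant is a binary list over the ranked retrieval results."""
    relevant = np.asarray(relevant, dtype=float)
    precision_at_k = np.cumsum(relevant) / (np.arange(len(relevant)) + 1)
    return float(np.sum(precision_at_k * relevant) / n_relevant_total)

def mean_average_precision(per_query_relevance, per_query_totals):
    """mAP: average AP over all queries."""
    return float(np.mean([average_precision(r, t)
                          for r, t in zip(per_query_relevance, per_query_totals)]))

def precision_at_n(relevant, n):
    """P@N: fraction of relevant images within the top-N ranked results."""
    return float(np.mean(np.asarray(relevant[:n], dtype=float)))

# Example: 20 retrieved results for one query; 52 relevant images exist in total.
ranked_hits = [1, 1, 0, 1] + [0] * 16
print(average_precision(ranked_hits, n_relevant_total=52), precision_at_n(ranked_hits, 20))
```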
4.2.1. Retrieval Performance on MalayaKew Leaf-Dataset
The results of the convolutional neural network as a feature extractor (RL-CNN) are shown in
Figure 7 and Table 1. The precision accuracy at rank-20 is compared in Figure 7 based on 20 queries.
The queries were selected to cover every range of visual appearances, with either a unique shape, such as
qoxyodon, or similar appearances, like q-aff-cerris and qlaurifolia.
(Bar chart: MK dataset, retrieval Rank-20 accuracy (%) per class number.)
Figure 7. The retrieval Rank-20 accuracy between the Convolutional Neural Network (CNN) as a
feature extractor, bag of visual words, and multiple feature fusion (color and texture).
Table 1. The retrieval accuracy mAP of the convolutional neural network as a feature extractor (RL-CNN),
bag of visual words (BOVW), and multiple fused global features (MFF) on the Malaya–Kew (MK) and
University of California Merced (UCM) datasets.

Dataset      Method    mAP (%)
MalayaKew    FE-CNN    88.1
             BOVW      66.2
             MFF       52.6
UCM          FE-CNN    90.5
             BOVW      86.2
             MFF       69.8
Several observations can be achieved from the precision results. The RL-CNN method
outperformed the two state-of-the-art techniques by a large margin. The proposed method not
only performed well on classes with unique visual appearances, such as qlobata or qpetraea, but it
also distinguished categories with similar appearances, such as quercus and q-x-kewensis. In RL-CNN
method, q-x-mannifera, qboissieri, qellipsoidalis, qmacransmera, and qpetraea obtained maximum accuracy
with over 90%, whereas qlaurifolia and q-aff-cerris had the lowest value of 55% and 45%, respectively.
The qlaurifolia class achieved 55% accuracy, whereas 9 out of 20 images belong to qcanariensis,
qrhysophylla, and qtrotana categories (Figure 8A). The accuracy dropped to 35% and 30% in BOVW and
MFF, accordingly. The q-aff-cerris class obtained the lowest accuracy in RL-CNN with 45% accuracy
rate, whereas 11 out of 20 images belong to qrobur category, which is visually almost identical to the
query image.
Figure 8. Qualitative evaluation of the proposed image retrieval on the two lowest-performing classes
in the Malaya–Kew Leaf-Dataset: (A) retrieval result from the qlaurifolia class; (B) retrieval result from
the q-aff-cerris class. The first image is the query and the following images are the images most similar to
the query image. The retrieved images wrongly categorized are highlighted in red.
On the other hand, the BOVW and MFF performed poorly in identifying small differences between
leaf varieties in the MK dataset. Both methods retrieved images with visual similarity to the queries;
however, they failed to distinguish small visual differences among classes. As illustrated in Figure 7,
BOVW performed better than MFF in most cases, except classes q_rubur_f_purpubascens, qagriefolia,
qagrifolia, and qpetraea. (The results for each class are presented in the Supplementary Materials.)
Table 1 summarizes the mAP evaluation of the Malaya–Kew leaf dataset. The results are obtained
from 20 queries in which the retrieval system can be tested and evaluated. The best accuracy score is
88.1%, achieved by RL-CNN, followed by BOVW and MFF with 66.2% and 52.6%, respectively.
(Bar chart: UCM dataset, retrieval Rank-40 accuracy (%) per class for RL-CNN, BOW, and MFF.)
Figure 9. The retrieval Rank-40 accuracy between the feature extractor using a convolutional neural
network, bag of visual words, and multiple feature fusion (color and texture).
Figure 11 shows the retrieval results of the dense building class on a randomly given query. The
class achieved 35% accuracy, with 14 out of 40 retrieved images belonging to the same class as the query
image. However, the rest of the images, retrieved from the medium residential and mobile home park
classes, are still visually similar to the query. The freeway class, with 50% accuracy, has a similar
performance, whereas half of the retrieved images belong to the runway and overpass classes, which are
still visually very similar to the freeway class (Figure 12).
Figure 10. Retrieval results of the airplane category using the convolutional neural network as a feature
extractor (RL-CNN). The method obtained 100% retrieval accuracy.

Figure 11. Retrieval results of the dense-building category using the convolutional neural network as a
feature extractor (RL-CNN). The green rectangles indicate correct retrieval results.

Figure 12. Retrieval results of the freeway category using the convolutional neural network as a feature
extractor (RL-CNN). The red rectangles indicate incorrect retrieval results.
The retrieval mAP values of the different models on the UCM image dataset are listed in Table 1. As shown
in the table, the RL-CNN outperformed both of the state-of-the-art techniques. The mAP measure for
RL-CNN is 90.1%, whereas the BOW and MFF achieved 86.2% and 69.8%, respectively.
(Bar chart: retrieval time (sec) on the MK and UCM datasets for RL-CNN with the hierarchically nested
index, RL-CNN with sequential searching, BOW with an inverted index, and BOW without a hierarchy.)
A possible improvement is to employ an evolving clustering approach, removing the requirement of
pre-defining the number of clusters in advance. The advantage of using such a model is that, if new
images are added to the dataset, clustering the images and forming the hierarchical structure will not
need to be repeated from scratch.
Another improvement would be adding relevance feedback, which enables users to interact more with
the system and provide feedback on the relevance of the retrieved images. The feedback can be used for
learning and improving the performance of the CBIR system.
5. Conclusions
The research scope of this paper focused on a highly scalable and memory-efficient image retrieval
system. The aim was to overcome the limitations of conventional retrieval methods in the field of plant
biology and remote sensing to significantly boost the retrieval performance in terms of accuracy and
computational efficiency. The challenge was to preserve multi-dimensional and high discriminative
image representations derived by the CNN model and still maintain the computational efficiency of
the querying process. It is worth highlighting the following advantages of the proposed method:
• Fast Retrieval time: The proposed approach improves the retrieval process and is over 16 times
faster than the traditional brute-force sequential searching which is vital for large-scale databases.
• Scalability: The model is constructed in a hierarchical structure. The feature indexing in a
hierarchical form can handle a dynamic image database and can be easily integrated into the
server-client architecture.
• Unsupervised data mining: The proposed technique does not require any prior knowledge of
image repositories or any human intervention. However, in future work, human input/feedback
can potentially improve the performance.
• Recursive similarity measurement: The similarity measurements are done recursively,
which significantly reduces memory cost in high-scale multimedia CBIR systems.
• Discriminative power for quantifying images: Transfer learning is applied by utilizing a
pre-trained deep neural network model merely as a feature extractor. The results indicate that
the generic descriptors extracted from the CNNs are effective and powerful, and performed
consistently better than conventional content-based retrieval systems.
Furthermore, although the visual content was the main focus of this study, integrating keywords
and text into the CBIR pipeline could capture images’ semantic content and describe visually identical
images by means of linguistic cues.
Abbreviations
The following abbreviations are used in this manuscript:
References
1. Virlet, N.; Sabermanesh, K.; Sadeghi-Tehran, P.; Hawkesford, M.J. Field Scanalyzer: An automated robotic
field phenotyping platform for detailed crop monitoring. Funct. Plant Biol. 2017, 44, 143. [CrossRef]
2. Busemeyer, L.; Mentrup, D.; Möller, K.; Wunder, E.; Alheit, K.; Hahn, V.; Maurer, H.P.; Reif, J.C.; Würschum, T.;
Müller, J.; et al. BreedVision—A Multi-Sensor Platform for Non-Destructive Field-Based Phenotyping in
Plant Breeding. Sensors 2013, 13, 2830–2847. [CrossRef] [PubMed]
3. Kirchgessner, N.; Liebisch, F.; Yu, K.; Pfeifer, J.; Friedli, M.; Hund, A.; Walter, A. The ETH field phenotyping
platform FIP: A cable-suspended multi-sensor system. Funct. Plant Biol. 2017, 44, 154. [CrossRef]
4. Larson, R.R. Introduction to Information Retrieval. J. Am. Soc. Inf. Sci. 2010, 61, 852–853. [CrossRef]
5. Datta, R.; Joshi, D.; Li, J.; Wang, J.Z. Image retrieval: Ideas, influences, and trends of the new age.
ACM Comput. Surv. 2008, 40, 5–60. [CrossRef]
6. Lew, M.; Sebe, N.; Djeraba, C.; Jain, R. Content-based multimedia information retrieval: State of the art and
challenges. ACM Trans. Multimed. Comput. Commun. Appl. 2006, 2, 1–19. [CrossRef]
7. Smeulders, A.W.M.; Worring, M.; Santini, S.; Gupta, A.; Jain, R. Content-based image retrieval at the end of
the early years. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1349–1380. [CrossRef]
8. Alzu’bi, A.; Amira, A.; Ramzan, N. Semantic content-based image retrieval: A comprehensive study. J. Vis.
Commun. Image Represent. 2015, 32, 20–54. [CrossRef]
9. Yu, H.; Li, M.; Zhang, H.-J.; Feng, J. Color texture moments for content-based image retrieval. In Proceedings
of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; pp. 929–932.
10. Lin, C.-H.; Chen, R.-T.; Chan, Y.-K. A smart content-based image retrieval system based on color and texture
feature. J. Image Vis. Comput. 2009, 27, 658–665. [CrossRef]
11. Singh, S.M.; Hemachandran, K. Content-Based Image Retrieval using Color Moment and Gabor
Texture Feature. IJCSI Int. J. Comput. Sci. 2012, 9, 299–309.
12. Guo, Y.; Zhao, G.; Pietikainen, M. Discriminative features for texture description. Pattern Recognit.
2012, 45, 3834–3843. [CrossRef]
13. Ahonen, T.; Matas, J.; He, C.; Pietikainen, M. Rotation invariant image description with local binary
pattern histogram fourier features. In Proceedings of the 16th Scandinavian Conference on Image Analysis
(SCIA 2009), Oslo, Norway, 15–18 June 2009; Springer: Berlin/Heidelberg, Germany, 2009.
14. Mezaris, V.; Kompatsiaris, I.; Strintzis, M.G. An ontology approach to object-based image retrieval.
In Proceedings of the 2003 International Conference on Image Processing (Cat. No.03CH37429), Barcelona,
Spain, 14–17 September 2003.
15. Nikkam, P.S.; Reddy, B.E. A Key Point Selection Shape Technique for Content based Image Retrieval System.
Int. J. Comput. Vis. Image Process. 2016, 6, 54–70. [CrossRef]
J. Imaging 2019, 5, 33 19 of 21
16. Zhou, W.; Li, H.; Tian, Q. Recent Advance in Content-based Image Retrieval: A Literature Survey. arXiv 2017,
arXiv:1706.06064.
17. Tsai, C.F. Bag-of-words representation in image annotation: A review. ISRN Artif. Intell. 2012, 2012.
[CrossRef]
18. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
[CrossRef]
19. Bay, H.; Tuytelaars, T.; Gool, L. Surf: Speeded Up Robust Features. In Proceedings of the 9th European
Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Springer: Berlin/Heidelberg, Germany, 2006;
pp. 404–417.
20. Leutenegger, S.; Chli, M.; Siegwart, R.Y. Brisk: Binary Robust Invariant Scalable Keypoints. In Proceedings
of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011;
pp. 2548–2555.
21. Perronnin, F.; Liu, Y.; Sánchez, J. Large-scale image retrieval with compressed fisher vectors. In Proceedings
of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco,
CA, USA, 13–18 June 2010.
22. Jegou, H.; Douze, M.; Schmid, C. Aggregating local descriptors into a compact image representation. In
Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
San Francisco, CA, USA, 13–18 June 2010.
23. Bengio, Y. Learning Deep Architectures for AI. Found. Trends®Mach. Learn. 2009, 2, 1–127. [CrossRef]
24. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition.
arXiv 2014, arXiv:1409.1556.
25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.
Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9.
26. Tzelepi, M.; Tefas, A. Deep convolutional learning for Content Based Image Retrieval. Neurocomputing
2018, 275, 2467–2478. [CrossRef]
27. Schmidhuber, J. Deep learning in neural networks: An overview. Neural Netw. 2015, 61, 85–117. [CrossRef]
[PubMed]
28. Johnson, R.; Zhang, T. Semi-supervised Convolutional Neural Networks for Text Categorization via Region
Embedding. In Proceedings of the Twenty-Ninth Conference on Neural Information Processing Systems
(NIPS 2015), Montreal, QC, Canada, 7–12 December 2015.
29. Shen, Y.; He, X.; Gao, J.; Deng, L.; Mesnil, G. A Latent Semantic Model with Convolutional-Pooling Structure
for Information Retrieval. In Proceedings of the 23rd ACM International Conference on Information and
Knowledge Management, Shanghai, China, 3–7 November 2014.
30. Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional Neural Networks for
Speech Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545. [CrossRef]
31. Borji, A.; Cheng, M.-M.; Jiang, H.; Li, J. Salient Object Detection: A Benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
[CrossRef] [PubMed]
32. Tzelepi, M.; Tefas, A. Deep convolutional image retrieval: A general framework. Signal Process. Image
Commun. 2018, 63, 30–43. [CrossRef]
33. Wan, J.; Wang, D.; Hoi, S.; Wu, P.; Zhu, J. Deep learning for content-based image retrieval: A comprehensive
study. In Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, FL, USA,
3–7 November 2014; pp. 157–166.
34. Sun, S.; Zhou, W.; Tian, Q.; Li, H. Scalable Object Retrieval with Compact Image Representation from Generic
Object Regions. ACM Trans. Multimed. Comput. Commun. Appl. (TOMM) 2016, 12, 29. [CrossRef]
35. Lai, H.; Pan, Y.; Liu, Y.; Yan, S. Simultaneous Feature Learning and Hash Coding with Deep Neural Networks.
In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston,
MA, USA, 7–12 June 2015; pp. 3270–3278.
36. Gong, Y.; Wang, L.; Guo, R.; Lazebnik, S. Multi-scale Orderless Pooling of Deep Convolutional Activation
Features. In Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014;
Springer: Cham, Switzerland, 2014; Volume 8695, pp. 392–407.
37. Ng, J.Y.-H.; Yang, F.; Davis, L.S. Exploiting local features from deep networks for image retrieval.
In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), Boston, MA, USA, 7–12 June 2015; pp. 53–61.
38. Mohedano, E.; McGuinness, K.; O’Connor, N.E.; Salvador, A.; Marques, F.; Giro-i-Nieto, X. Bags of Local
Convolutional Features for Scalable Instance Search. In Proceedings of the 2016 ACM on International
Conference on Multimedia Retrieval, New York, NY, USA, 6–9 June 2016; ACM Press: New York, NY, USA,
2016; pp. 327–331.
39. Angelov, P.; Sadeghi-Tehran, P. Look-a-Like: A Fast Content-Based Image Retrieval Approach Using a
Hierarchically Nested Dynamically Evolving Image Clouds and Recursive Local Data Density. Int. J. Intell.
Syst. 2016, 32, 82–103. [CrossRef]
40. Angelov, P.; Sadeghi-Tehran, P. A Nested Hierarchy of Dynamically Evolving Clouds for Big Data Structuring
and Searching. Procedia Comput. Sci. 2015, 53, 1–8. [CrossRef]
41. Cai, J.; Liu, Q.; Chen, F.; Joshi, D.; Tian, Q. Scalable Image Search with Multiple Index Tables. In Proceedings
of the International Conference on Multimedia Retrieval, Glasgow, UK, 1–4 April 2014; p. 407.
42. Nister, D.; Stewenius, H. Scalable Recognition with a Vocabulary Tree. In Proceedings of the 2006 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, New York, NY, USA, 17–22
June 2006; pp. 2161–2168.
43. Zhou, W.; Lu, Y.; Li, H.; Song, Y.; Tian, Q. Spatial coding for large scale partial-duplicate web image search.
In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010;
pp. 511–520.
44. Wu, Z.; Ke, Q.; Isard, M.; Sun, J. Bundling features for large scale partial-duplicate web image search. In
Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA,
20–25 June 2009; pp. 25–32.
45. Bartolini, I.; Patella, M. WINDSURF: The best way to SURF. Multimed. Syst. 2018, 24, 459–476. [CrossRef]
46. Zhang, J.; Peng, Y.; Ye, Z. Deep Reinforcement Learning for Image Hashing. arXiv 2018, arXiv:1802.02904.
47. Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep Supervised Hashing for Fast Image Retrieval. In Proceedings
of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA,
27–30 June 2016; pp. 2064–2072.
48. Jiang, K.; Que, Q.; Kulis, B. Revisiting Kernelized Locality-Sensitive Hashing for Improved Large-Scale
Image Retrieval. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Boston, MA, USA, 7–12 June 2015; pp. 4933–4941.
49. Tang, J.; Li, Z.; Wang, M. Neighborhood discriminant hashing for large-scale image retrieval. IEEE Trans.
Image Process. 2015, 24, 2827–2840. [CrossRef] [PubMed]
50. Datar, M.; Immorlica, N.; Indyk, P.; Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable
distributions. In Proceedings of the Twentieth Annual Symposium on Computational Geometry, Brooklyn,
NY, USA, 8–11 June 2004; pp. 253–262.
51. Cao, Z.; Long, M.; Wang, J.; Yu, P.S. HashNet: Deep Learning to Hash by Continuation. In Proceedings
of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017;
pp. 5609–5618.
52. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks.
In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV,
USA, 3–8 December 2012; pp. 1097–1105.
53. He, K.; Zhang, X.; Ren, S.; Sun, J. Identity Mappings in Deep Residual Networks. In Proceedings of the 14th
European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016;
Springer: Cham, Switzerland, 2016; Volume 9908, pp. 630–645.
54. Sharif Razavian, A.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN Features Off-the-Shelf: An Astounding
Baseline for Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, Columbus, OH, USA, 24–27 June 2014; pp. 806–813.
55. Olivas, E.S. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and
Techniques; IGI Global: Hershey, PA, USA, 2009.
56. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June
2016; pp. 770–778.
57. Yang, L.; Qi, X.; Xing, F.; Kurc, T.; Saltz, J.; Foran, D.J. Parallel content-based sub-image retrieval using
hierarchical searching. Bioinformatics 2013, 30, 996–1002. [CrossRef] [PubMed]
58. Distasi, R.; Vitulano, D.; Vitulano, S. A Hierarchical Representation for Content-based Image Retrieval. J. Vis.
Lang. Comput. 2000, 11, 369–382. [CrossRef]
59. Jiang, F.; Hu, H.M.; Zheng, J.; Li, B. A hierarchal BoW for image retrieval by enhancing feature salience.
Neurocomputing 2016, 175, 146–154. [CrossRef]
60. You, J.; Li, Q. On hierarchical content-based image retrieval by dynamic indexing and guided search.
In Proceedings of the 2009 8th IEEE International Conference on Cognitive Informatics (ICCI’09), Hong Kong,
China, 15–17 June 2009; pp. 188–195.
61. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [CrossRef]
62. Angelov, P. Anomalous System State Identification. U.S. Patent US9390265B2, 15 May 2012.
63. Angelov, P. Evolving Rule-Based Models: A Tool for Design of Flexible Adaptive Systems; Springer:
Berlin/Heidelberg, Germany, 2002.
64. Angelov, P.; Sadeghi-Tehran, P.; Ramezani, R. A Real-time Approach to Autonomous Novelty Detection and
Object Tracking in Video Stream. Int. J. Intell. Syst. 2011, 26, 189–205. [CrossRef]
65. Zhang, C.; Huang, L. Content-Based Image Retrieval Using Multiple Features. J. Comput. Inf. Technol.
2014, 22, 1–10. [CrossRef]
66. Wang, X.-Y.; Zhang, B.-B.; Yang, H.-Y. Content-based image retrieval by integrating color and texture features.
Multimed. Tools Appl. 2012, 68, 545–569. [CrossRef]
67. Yue, J.; Li, Z.; Liu, L.; Fu, Z. Content-based image retrieval using color and texture fused features.
Math. Comput. Model. 2011, 54, 1121–1127. [CrossRef]
68. Oliva, A.; Torralba, A. Building the Gist of A Scene: The Role of Global Image Features in Recognition.
Prog. Brain Res. 2006, 155, 23–36. [PubMed]
69. Huang, J.; Kumar, S.R.; Mitra, M.; Zhu, W.-J.; Zabih, R. Image indexing using color correlograms.
In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition,
San Juan, Puerto Rico, USA, 17–19 June 1997; pp. 762–768.
70. Wang, J.; Yang, J.; Yu, K.; Lv, F.; Huang, T.; Gong, Y. Locality-Constrained Linear Coding For Image
Classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3360–3367.
71. Lazebnik, S.; Schmid, C.; Ponce, J. Beyond Bags of Features: Spatial Pyramid Matching for Recognizing
Natural Scene Categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; pp. 2169–2178.
72. Lee, S.H.; Chan, C.S.; Wilkin, P.; Remagnino, P. Deep-Plant: Plant Identification with Convolutional
Neural Networks. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP),
Quebec City, QC, Canada, 27–30 September 2015.
73. Yang, Y.; Newsam, S. Bag-of-visual-words and spatial extensions for land-use classification. In Proceedings
of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, San Jose,
CA, USA, 2–5 November 2010; ACM: New York, NY, USA, 2010; pp. 270–279.
74. Yu, H.; Yang, W.; Xia, G.-S.; Liu, G. A Color-Texture-Structure Descriptor for High-Resolution Satellite Image
Classification. Remote Sens. 2016, 8, 259. [CrossRef]
75. Li, Y.; Tao, C.; Tan, Y. Unsupervised multilayer feature learning for satellite image scene classification.
IEEE Geosci. Remote Sens. Lett. 2016, 13, 157–161. [CrossRef]
76. Romero, A.; Gatta, C.; Camps-Valls, G. Unsupervised deep feature extraction for remote sensing image classification. IEEE Trans. Geosci.
Remote Sens. 2016, 54, 1349–1362. [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).