Fig. 1. The overall structure of the proposed triplet network.

2. METHODOLOGY

Let S = {Imgi, yci} be a training set of n labelled RS images, where yci ∈ {1, 2, ..., C} is the class label of Imgi and each image Imgi ∈ S is represented by a d-dimensional feature vector VImgi. Our objective is to retrieve a set T ⊂ S of the k most similar images to a given query q ∈ S. The retrieved image set T = {Img1, Img2, ..., Imgk} contains the k nearest points to the query in the vector space Rd, i.e. D(q, Img1) ≤ D(q, Img2) ≤ ... ≤ D(q, Imgk), where D(q, Imgi) is the distance between the query feature vector Vq and the image feature vector VImgi. The smaller the distance D(q, Imgi), the more similar q and Imgi are.
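In other words, retrieval reduces to a nearest-neighbour search over the embedded feature vectors. The following is a minimal NumPy sketch of this step only (the array names and toy dimensions are placeholders, not part of the proposed method):

```python
import numpy as np

def retrieve_top_k(query_vec, db_vecs, k=10):
    """Return the indices of the k database images closest to the query,
    using the squared Euclidean distance in the embedding space."""
    dists = np.sum((db_vecs - query_vec) ** 2, axis=1)  # D(q, Img_i) for all i
    order = np.argsort(dists)                           # ascending distance
    return order[:k], dists[order[:k]]

# toy usage: 1000 database images embedded into d = 21 dimensions
db_vecs = np.random.randn(1000, 21)
query_vec = np.random.randn(21)
top_idx, top_dists = retrieve_top_k(query_vec, db_vecs, k=5)
```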
In a triplet Tr(a, p, n), a ∈ S is an anchor image, p ∈ S is a positive image with the same class label as a, and n ∈ S is a negative image with a different class label. We aim to learn an embedding function f(.) that maps an input image Imgi to a d-dimensional vector of embedding features f(Imgi) = VImgi, assigning smaller distances in the embedding space Rd to the most similar images and pushing away the dissimilar ones, such that:

D(Va, Vp) < D(Va, Vn)    (1)

For that purpose, we propose a triplet network trained end-to-end to learn fine-grained image similarity automatically, directly from the input images.

2.1. Triplet network architecture

The triplet network consists of three network instances with the same architecture, sharing the same parameters, and joined at the top by a triplet loss function. The triplet network takes as input a triplet Tr(a, p, n), where each network instance receives one of the three images.
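The weight sharing can be made explicit in code: the three instances are simply three forward passes through a single module. The sketch below is a PyTorch-style illustration only; the small placeholder CNN stands in for the LDCNN backbone of [3], and the layer sizes are not those of the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Placeholder backbone; the actual model used in this work is the LDCNN of [3]."""
    def __init__(self, out_dim=30):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pooling
        )

    def forward(self, x):
        v = self.conv(x).flatten(1)           # one value per feature map
        return F.normalize(v, p=2, dim=1)     # L2 normalisation layer (see below)

class TripletNet(nn.Module):
    """Three instances with shared parameters: the same module is applied
    to the anchor, the positive and the negative image."""
    def __init__(self, embedding_net):
        super().__init__()
        self.embedding_net = embedding_net

    def forward(self, a, p, n):
        return (self.embedding_net(a),
                self.embedding_net(p),
                self.embedding_net(n))
```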
As shown in Fig. 1, each network instance uses the Low Dimensional CNN (LDCNN) model [3], chosen for its promising results for HRRSIR and for its capability to learn discriminative low-dimensional features, which can speed up both the learning process and the matching phase. The LDCNN contains five convolutional layers, a multi-layer perceptron layer, a global average pooling layer, and a softmax layer. For more details, we refer the reader to [3].

The convolutional (Conv) layer performs a specific transformation on local regions of the input (receptive field) to obtain a useful representation; it functions as a local feature extractor. An input image is passed through a series of sliding learnable kernels (filters) to create three-dimensional convolutional feature maps (Fmaps). The generated Fmaps are taken as input to the next layer.

The multi-layer perceptron convolutional (mlpconv) layer consists of a three-layer perceptron that follows the Conv layers. To enhance the power of the model, this layer learns nonlinearly separable high-level features that are more abstract than the convolutional features. The mlpconv layer outputs one feature map per class.

The Global Average Pooling (GAP) layer performs a sub-sampling operation by vectorizing the Fmaps extracted from the last mlpconv layer and generating fixed-length embedding features. As a result, an n-dimensional feature vector is obtained by averaging each feature map, where n is the number of classes.

The n-dimensional feature vector produced by the GAP layer is normalized with an L2 normalisation layer instead of a softmax. Since the normalization function is differentiable, adding this layer does not impede end-to-end training. Another desirable property of this normalization is that the squared Euclidean distance (D) is directly related to the cosine similarity: for L2-normalized vectors a and b,

‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b = 2 − 2 (a·b) / (‖a‖ ‖b‖)

By using the normalisation layer, we boost the performance and also simplify the choice of the margin, since the squared Euclidean distance between normalized features always lies in the range [0, 4]: D(a, b) = 2 − 2 × Cos(a, b), where Cos(a, b) is in [−1, 1].
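As a quick sanity check of this property (a verification under the stated assumption of L2-normalized vectors, not part of the method itself), the snippet below confirms that the squared Euclidean distance equals 2 − 2·Cos and therefore stays within [0, 4]:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

# L2-normalize both vectors, as the normalisation layer does
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

d_sq = np.sum((a - b) ** 2)          # squared Euclidean distance
cos = float(np.dot(a, b))            # cosine similarity of unit vectors
print(d_sq, 2 - 2 * cos)             # the two values coincide (up to float error)
assert 0.0 <= d_sq <= 4.0            # distance is bounded in [0, 4]
```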
2.2. Triplet loss

To measure the similarity between two images a and p, we use the squared Euclidean distance between their feature vectors, defined as:

D(Va, Vp) = ‖Va − Vp‖₂²    (2)

With the objective of reducing the distance between similar samples and maximising the distance between dissimilar ones, we minimise the triplet loss function during training. The distance from the anchor to the negative sample must be greater than the distance from the anchor to the positive sample. Thus, the triplet loss function is defined by:

l(a, p, n) = Max(D(Va, Vp) − D(Va, Vn) + margin, 0)    (3)

where the margin separates the positive region from the negative samples relative to an anchor.
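A minimal NumPy sketch of Eqs. (2) and (3) follows. It assumes the three embedding vectors have already been computed; the function names are ours, and the default margin of 0.5 is the value reported later in Section 3.3.

```python
import numpy as np

def squared_dist(u, v):
    """Squared Euclidean distance of Eq. (2)."""
    return np.sum((u - v) ** 2)

def triplet_loss(v_a, v_p, v_n, margin=0.5):
    """Triplet loss of Eq. (3): hinge on D(a, p) - D(a, n) + margin."""
    return max(squared_dist(v_a, v_p) - squared_dist(v_a, v_n) + margin, 0.0)
```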
2.3. Training

A valid triplet Tr(a, p, n) is a set of three images in which a and p have the same class label, n belongs to a different class, and a and p are not the same image.

The network is trained progressively on valid triplets so as to increase the similarity between the anchors and the positive samples and to decrease the similarity scores between each anchor and its negative samples. Triplet selection is therefore a crucial phase of training.

In the triplet mining technique called "batch all" (i.e. selecting all valid triplets), the total number of valid triplets is given by:

NbrTr = (Nk × Nc) × (Nk − 1) × (Nk × Nc − Nk)    (4)

where Nc is the number of classes and Nk is the number of images per class. As can be seen from Eq. 4, the number of triplets grows cubically with the number of images. Thus, on large-scale datasets, this extremely large number of triplets makes the technique computationally expensive and unusable in practice.
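As a concrete illustration, plugging in the UCMD figures given later in Section 3.1 (Nc = 21 classes, Nk = 100 images per class) already yields a prohibitive count:

```python
Nc, Nk = 21, 100                                   # UCMD: 21 classes, 100 images per class
nbr_tr = (Nk * Nc) * (Nk - 1) * (Nk * Nc - Nk)     # Eq. (4)
print(nbr_tr)                                      # 415,800,000 valid triplets
```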
Besides, with the random triplet selection technique, most of the selected triplets can be easy triplets, in which the negative samples are already sufficiently far from the anchors relative to the positive ones (i.e. D(Va, Vp) + margin < D(Va, Vn)). The network is then prone to underfitting, since the learning process stops quickly. Conversely, selecting only hard triplets (i.e. D(Va, Vn) < D(Va, Vp)) can lead to fast convergence to local optima, with a high probability of overfitting.

To avoid that, in this paper we use semi-hard triplet selection, where the negatives are further from the anchor than the positives but still within the margin boundary (i.e. D(Va, Vp) < D(Va, Vn) < D(Va, Vp) + margin). We believe that combining semi-hard selection with a large batch size improves the convergence of the embedding network.
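The semi-hard condition can be expressed as a simple mask over candidate negatives for a given anchor-positive pair. The NumPy sketch below is illustrative only (the batch handling in the actual training code may differ):

```python
import numpy as np

def semi_hard_negatives(v_a, v_p, neg_vecs, margin=0.5):
    """Indices of negatives n with D(a, p) < D(a, n) < D(a, p) + margin."""
    d_ap = np.sum((v_a - v_p) ** 2)
    d_an = np.sum((neg_vecs - v_a) ** 2, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + margin)
    return np.nonzero(mask)[0]
```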
3. EXPERIMENTS AND ANALYSIS

We evaluate the retrieval performance of the proposed TLDCNN network for HRRSIR on different HRRS image datasets. All the experiments are carried out on a p2.xlarge Amazon Web Services EC2 instance. In this section, we first introduce the HRRS datasets and the performance measures used. Then, we describe the experimental setup. Finally, we analyse and discuss the experimental results.

3.1. Datasets

Three publicly available RS image datasets are used for the evaluation of our proposed model, namely the UC Merced Land-Use Dataset (UCMD) [5], the PatternNet dataset [6], and the Aerial Image Dataset (AID) [7].

UCMD is composed of 21 land-use classes with 100 images per class. The image size is 256 × 256 pixels, with a spatial resolution of 0.3 m per pixel.

PatternNet has 38 scene classes with 800 images per class. The image size is 256 × 256 pixels, with a very high spatial resolution that varies from 0.62 m to about 4.693 m.

AID contains a total of 10,000 images labelled into 30 aerial scene categories. The number of images varies from 220 to 420 per class. The image size is 600 × 600 pixels, with spatial resolutions varying from about 0.5 m to 8 m.
3.2. Evaluation metrics

To evaluate the retrieval performance of the proposed TLDCNN network, we adopt two widely used evaluation metrics, namely the precision at the top-k retrieved images (P@k) and the mean average precision (mAP).

Let Q be a set of Nq image queries. For each image query qi ∈ Q, a set of top-k similar images is retrieved. P@k is formally defined as:

P@k = (1/k) × ∑_{j=1}^{k} r(j)    (5)

where r(j) ∈ {0, 1} indicates whether the j-th retrieved image is relevant or not. The image is considered relevant (i.e. r(j) = 1) if it belongs to the same category as the given query; otherwise r(j) = 0.

The mean average precision (mAP) is an extension of the average precision (AP), obtained by averaging the AP scores of all the image queries in Q:

mAP = (1/Nq) × ∑_{i=1}^{Nq} AP(qi)    (6)

AP(qi) is the average precision score, such that

AP(qi) = (∑_{j=1}^{k} P(j) × r(j)) / (∑_{j=1}^{k} [r(j) = 1])    (7)

where P(j) is the precision at cut-off j; AP(qi) is thus the sum of the precision scores at the ranks of the relevant images, divided by the total number of relevant images.
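For clarity, the two metrics can be sketched in a few lines of NumPy. This is a simplified illustration only: relevance is assumed to be a 0/1 array per query, as in Eq. (5), and the function names are ours.

```python
import numpy as np

def precision_at_k(relevance, k):
    """P@k of Eq. (5): fraction of relevant images among the top-k results."""
    r = np.asarray(relevance[:k], dtype=float)
    return r.sum() / k

def average_precision(relevance, k):
    """AP of Eq. (7) over the top-k results of one query."""
    r = np.asarray(relevance[:k], dtype=float)
    if r.sum() == 0:
        return 0.0
    precisions = np.cumsum(r) / (np.arange(k) + 1)   # P(j) at every cut-off j
    return np.sum(precisions * r) / r.sum()

def mean_average_precision(all_relevance, k):
    """mAP of Eq. (6): mean of the AP scores over all queries."""
    return float(np.mean([average_precision(r, k) for r in all_relevance]))
```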
3.3. Experimental setup

Each dataset is randomly split into three folds: 70% and 10% of the dataset are used for training and validating the TLDCNN network, respectively, and the remaining 20% is used as the set of image queries to evaluate the retrieval performance. The images are first resized to 224 × 224 pixels to match the input size of the model.

In the experiments, the networks are trained with the Adam optimizer using a learning rate of 0.001 for a maximum of 50 epochs, with the margin set to 0.5 and the batch size to 200.

Table 1. The retrieval performances in terms of P@100 and mAP measures on the UCMD, the AID and the PatternNet datasets.

            UCMD             AID              PatternNet
            P@100   mAP      P@100   mAP      P@100   mAP
LDCNN       0.4812  0.5410   0.6598  0.6933   0.7013  0.7626
SimNet      0.5679  0.6631   0.7124  0.7266   0.8096  0.8124
DML         0.4855  0.9663   0.9657  0.9822   0.9954  0.9955
TLDCNN      0.8829  0.9738   0.9766  0.9882   0.9971  0.9973

The proposed network achieved a promising retrieval performance on the three HRRS image datasets.

In future work, we will conduct a systematic investigation of using triplet and contrastive losses for the retrieval task, and will focus on further improving the learned metric by adding an interactive relevance feedback stage.

5. REFERENCES