Fig. 1. The overall structure of the proposed triplet network.

2. METHODOLOGY

Let S = {Imgi, yci} be a training set of n labelled RS images, where yci ∈ {1, 2, ..., C} is the class label of Imgi and each image Imgi ∈ S is represented by a d-dimensional feature vector VImgi. Our objective is to retrieve a set T ⊂ S of the k most similar images to a given query q ∈ S. The retrieved image set T = {Img1, Img2, ..., Imgk} contains the k nearest points to the query in the vector space Rd, i.e. D(q, Img1) ≤ D(q, Img2) ≤ ... ≤ D(q, Imgk), where D(q, Imgi) is the distance between the query feature vector Vq and the image feature vector VImgi. The smaller the distance D(q, Imgi), the more similar q and Imgi are.
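In other words, retrieval reduces to a nearest-neighbour search over the embedded feature vectors. The following is a minimal NumPy sketch of this step only (the array names and toy dimensions are placeholders, not part of the proposed method):

```python
import numpy as np

def retrieve_top_k(query_vec, db_vecs, k=10):
    """Return the indices of the k database images closest to the query,
    using the squared Euclidean distance in the embedding space."""
    dists = np.sum((db_vecs - query_vec) ** 2, axis=1)  # D(q, Img_i) for all i
    order = np.argsort(dists)                           # ascending distance
    return order[:k], dists[order[:k]]

# toy usage: 1000 database images embedded into d = 21 dimensions
db_vecs = np.random.randn(1000, 21)
query_vec = np.random.randn(21)
top_idx, top_dists = retrieve_top_k(query_vec, db_vecs, k=5)
```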
In a triplet Tr(a, p, n), a ∈ S is an anchor image, p ∈ S is a positive image with the same class label as a, and n ∈ S is a negative image with a different class label. We aim to learn an embedding function f(.) that maps an input image Imgi to a d-dimensional vector of embedding features f(Imgi) = VImgi, assigning smaller distances in the embedding space Rd to the most similar images and pushing away the dissimilar ones, such that:

D(Va, Vp) < D(Va, Vn)    (1)

For that purpose, we propose a triplet network trained end-to-end to learn fine-grained image similarity automatically, directly from the input images.

2.1. Triplet network architecture

The triplet network consists of three network instances with the same architecture, sharing the same parameters, and joined at the top by a triplet loss function. The triplet network takes as input a triplet Tr(a, p, n), where each network instance receives one of the three images.
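The weight sharing can be made explicit in code: the three instances are simply three forward passes through a single module. The sketch below is a PyTorch-style illustration only; the small placeholder CNN stands in for the LDCNN backbone of [3], and the layer sizes are not those of the actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Placeholder backbone; the actual model used in this work is the LDCNN of [3]."""
    def __init__(self, out_dim=30):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # global average pooling
        )

    def forward(self, x):
        v = self.conv(x).flatten(1)           # one value per feature map
        return F.normalize(v, p=2, dim=1)     # L2 normalisation layer (see below)

class TripletNet(nn.Module):
    """Three instances with shared parameters: the same module is applied
    to the anchor, the positive and the negative image."""
    def __init__(self, embedding_net):
        super().__init__()
        self.embedding_net = embedding_net

    def forward(self, a, p, n):
        return (self.embedding_net(a),
                self.embedding_net(p),
                self.embedding_net(n))
```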
As shown in Fig. 1, each network instance uses the Low Dimensional CNN (LDCNN) model [3], chosen for its promising results for HRRSIR and for its capability to learn discriminative low-dimensional features, which can speed up both the learning process and the matching phase. The LDCNN contains five convolutional layers, a multi-layer perceptron layer, a global average pooling layer, and a softmax layer. For more details, we refer the reader to [3].

The convolutional (Conv) layer performs a specific transformation on local regions of the input (receptive field) to obtain a useful representation; it functions as a local feature extractor. An input image is passed through a series of sliding learnable kernels (filters) to create three-dimensional convolutional feature maps (Fmaps). The generated Fmaps are taken as input to the next layer.

The multi-layer perceptron convolutional (mlpconv) layer consists of a three-layer perceptron that follows the Conv layers. To enhance the power of the model, this layer learns nonlinearly separable high-level features that are more abstract than the convolutional features. The mlpconv layer outputs one feature map per class.

The Global Average Pooling (GAP) layer performs a sub-sampling operation by vectorizing the Fmaps extracted from the last mlpconv layer and generating fixed-length embedding features. As a result, an n-dimensional feature vector is obtained by averaging each feature map, where n is the number of classes.

The n-dimensional feature vector produced by the GAP layer is normalized with an L2 normalisation layer instead of a softmax. Since the normalization function is differentiable, adding this layer does not impede end-to-end training. Another desirable property of this normalization is that the squared Euclidean distance (D) is directly related to the cosine similarity: for L2-normalized vectors a and b,

‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b = 2 − 2 (a·b) / (‖a‖ ‖b‖)

By using the normalisation layer, we boost the performance and also simplify the choice of the margin, since the squared Euclidean distance between normalized features always lies in the range [0, 4]: D(a, b) = 2 − 2 × Cos(a, b), where Cos(a, b) is in [−1, 1].
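As a quick sanity check of this property (a verification under the stated assumption of L2-normalized vectors, not part of the method itself), the snippet below confirms that the squared Euclidean distance equals 2 − 2·Cos and therefore stays within [0, 4]:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=128)
b = rng.normal(size=128)

# L2-normalize both vectors, as the normalisation layer does
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

d_sq = np.sum((a - b) ** 2)          # squared Euclidean distance
cos = float(np.dot(a, b))            # cosine similarity of unit vectors
print(d_sq, 2 - 2 * cos)             # the two values coincide (up to float error)
assert 0.0 <= d_sq <= 4.0            # distance is bounded in [0, 4]
```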
2.2. Triplet loss

To measure the similarity between two images a and p, we use the squared Euclidean distance between their feature vectors, defined as:

D(Va, Vp) = ‖Va − Vp‖₂²    (2)

With the objective of reducing the distance between similar samples and maximising the distance between dissimilar ones, we minimise the triplet loss function during training. The distance from the anchor to the negative sample must be greater than the distance from the anchor to the positive sample. Thus, the triplet loss function is defined by:

l(a, p, n) = Max(D(Va, Vp) − D(Va, Vn) + margin, 0)    (3)

where the margin separates the positive region from the negative samples relative to an anchor.
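A minimal NumPy sketch of Eqs. (2) and (3) follows. It assumes the three embedding vectors have already been computed; the function names are ours, and the default margin of 0.5 is the value reported later in Section 3.3.

```python
import numpy as np

def squared_dist(u, v):
    """Squared Euclidean distance of Eq. (2)."""
    return np.sum((u - v) ** 2)

def triplet_loss(v_a, v_p, v_n, margin=0.5):
    """Triplet loss of Eq. (3): hinge on D(a, p) - D(a, n) + margin."""
    return max(squared_dist(v_a, v_p) - squared_dist(v_a, v_n) + margin, 0.0)
```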
2.3. Training

A valid triplet Tr(a, p, n) is a set of three images in which a and p have the same class label, n belongs to a different class, and a and p are not the same image.

The network is trained progressively on valid triplets so as to increase the similarity between the anchors and the positive samples and to decrease the similarity scores between each anchor and its negative samples. Triplet selection is therefore a crucial phase of training.

In the triplet mining technique called "batch all" (i.e. selecting all valid triplets), the total number of valid triplets is given by:

NbrTr = (Nk × Nc) × (Nk − 1) × (Nk × Nc − Nk)    (4)

where Nc is the number of classes and Nk is the number of images per class. As can be seen from Eq. 4, the number of triplets grows cubically with the number of images. Thus, on large-scale datasets, this extremely large number of triplets makes the technique computationally expensive and unusable in practice.
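As a concrete illustration, plugging in the UCMD figures given later in Section 3.1 (Nc = 21 classes, Nk = 100 images per class) already yields a prohibitive count:

```python
Nc, Nk = 21, 100                                   # UCMD: 21 classes, 100 images per class
nbr_tr = (Nk * Nc) * (Nk - 1) * (Nk * Nc - Nk)     # Eq. (4)
print(nbr_tr)                                      # 415,800,000 valid triplets
```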
Besides, with the random triplet selection technique, most of the selected triplets can be easy triplets, in which the negative samples are already sufficiently far from the anchors relative to the positive ones (i.e. D(Va, Vp) + margin < D(Va, Vn)). The network is then prone to underfitting, since the learning process stops quickly. Conversely, selecting only hard triplets (i.e. D(Va, Vn) < D(Va, Vp)) can lead to fast convergence to local optima, with a high probability of overfitting.

To avoid that, in this paper we use semi-hard triplet selection, where the negatives are further from the anchor than the positives but still within the margin boundary (i.e. D(Va, Vp) < D(Va, Vn) < D(Va, Vp) + margin). We believe that combining semi-hard selection with a large batch size improves the convergence of the embedding network.
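The semi-hard condition can be expressed as a simple mask over candidate negatives for a given anchor-positive pair. The NumPy sketch below is illustrative only (the batch handling in the actual training code may differ):

```python
import numpy as np

def semi_hard_negatives(v_a, v_p, neg_vecs, margin=0.5):
    """Indices of negatives n with D(a, p) < D(a, n) < D(a, p) + margin."""
    d_ap = np.sum((v_a - v_p) ** 2)
    d_an = np.sum((neg_vecs - v_a) ** 2, axis=1)
    mask = (d_an > d_ap) & (d_an < d_ap + margin)
    return np.nonzero(mask)[0]
```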
3. EXPERIMENTS AND ANALYSIS

We evaluate the retrieval performance of the proposed TLDCNN network for HRRSIR on different HRRS image datasets. All the experiments are carried out on a p2.xlarge Amazon Web Services EC2 instance. In this section, we first introduce the HRRS datasets and the performance measures used. Then, we describe the experimental setup. Finally, we analyse and discuss the experimental results.

3.1. Datasets

Three publicly available RS image datasets are used for the evaluation of our proposed model, namely the UC Merced Land-Use Dataset (UCMD) [5], the PatternNet dataset [6], and the Aerial Image Dataset (AID) [7].

UCMD is composed of 21 land-use classes with 100 images per class. The image size is 256 × 256 pixels, with a spatial resolution of 0.3 m per pixel.

PatternNet has 38 scene classes with 800 images per class. The image size is 256 × 256 pixels, with a very high spatial resolution that varies from 0.62 m to about 4.693 m.

AID contains a total of 10,000 images labelled into 30 aerial scene categories. The number of images varies from 220 to 420 per class. The image size is 600 × 600 pixels, with spatial resolutions varying from about 0.5 m to 8 m.
3.2. Evaluation metrics

To evaluate the retrieval performance of the proposed TLDCNN network, we adopt two widely used evaluation metrics, namely the precision at the top-k retrieved images (P@k) and the mean average precision (mAP).

Let Q be a set of Nq image queries. For each image query qi ∈ Q, a set of top-k similar images is retrieved. P@k is formally defined as:

P@k = (1/k) × ∑_{j=1}^{k} r(j)    (5)

where r(j) ∈ {0, 1} indicates whether the j-th retrieved image is relevant or not. The image is considered relevant (i.e. r(j) = 1) if it belongs to the same category as the given query; otherwise r(j) = 0.

The mean average precision (mAP) is an extension of the average precision (AP), obtained by averaging the AP scores of all the image queries in Q:

mAP = (1/Nq) × ∑_{i=1}^{Nq} AP(qi)    (6)

AP(qi) is the average precision score, such that

AP(qi) = (∑_{j=1}^{k} P(j) × r(j)) / (∑_{j=1}^{k} [r(j) = 1])    (7)

where P(j) is the precision at cut-off j; AP(qi) is thus the sum of the precision scores at the ranks of the relevant images, divided by the total number of relevant images.
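For clarity, the two metrics can be sketched in a few lines of NumPy. This is a simplified illustration only: relevance is assumed to be a 0/1 array per query, as in Eq. (5), and the function names are ours.

```python
import numpy as np

def precision_at_k(relevance, k):
    """P@k of Eq. (5): fraction of relevant images among the top-k results."""
    r = np.asarray(relevance[:k], dtype=float)
    return r.sum() / k

def average_precision(relevance, k):
    """AP of Eq. (7) over the top-k results of one query."""
    r = np.asarray(relevance[:k], dtype=float)
    if r.sum() == 0:
        return 0.0
    precisions = np.cumsum(r) / (np.arange(k) + 1)   # P(j) at every cut-off j
    return np.sum(precisions * r) / r.sum()

def mean_average_precision(all_relevance, k):
    """mAP of Eq. (6): mean of the AP scores over all queries."""
    return float(np.mean([average_precision(r, k) for r in all_relevance]))
```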
3.3. Experimental setup

Each dataset is randomly split into three folds: 70% and 10% of the dataset are used for training and validating the TLDCNN network, respectively, and the remaining 20% is used as the set of image queries to evaluate the retrieval performance. The images are first resized to 224 × 224 pixels to match the input size of the model.

In the experiments, the networks are trained with the Adam optimizer using a learning rate of 0.001 for a maximum of 50 epochs, with the margin set to 0.5 and the batch size to 200.

Table 1. The retrieval performances in terms of P@100 and mAP measures on the UCMD, the AID and the PatternNet datasets.

            UCMD             AID              PatternNet
            P@100   mAP      P@100   mAP      P@100   mAP
LDCNN       0.4812  0.5410   0.6598  0.6933   0.7013  0.7626
SimNet      0.5679  0.6631   0.7124  0.7266   0.8096  0.8124
DML         0.4855  0.9663   0.9657  0.9822   0.9954  0.9955
TLDCNN      0.8829  0.9738   0.9766  0.9882   0.9971  0.9973

The proposed network achieved a promising retrieval performance on the three HRRS image datasets.

In future work, we will conduct a systematic investigation of using triplet and contrastive losses for the retrieval task, and will focus on further improving the learned metric by adding an interactive relevance feedback stage.

5. REFERENCES