Learning To Compare: Relation Network For Few-Shot Learning
Flood Sung    Yongxin Yang 3    Li Zhang 2    Tao Xiang 1    Philip H.S. Torr 2    Timothy M. Hospedales 3
1 Queen Mary University of London    2 University of Oxford    3 The University of Edinburgh
[email protected] [email protected]
{lz, phst}@robots.ox.ac.uk {yongxin.yang, t.hospedales}@ed.ac.uk
arXiv:1711.06025v2 [cs.CV] 27 Mar 2018
We consider the task of few-shot classifier learning. Formally, we have three datasets: a training set, a support set, and a testing set. The support set and testing set share the same label space, but the training set has its own label space that is disjoint from the support/testing set. If the support set contains K labelled examples for each of C unique classes, the target few-shot problem is called C-way K-shot.

With the support set only, we can in principle train a classifier to assign a class label ŷ to each sample x̂ in the test set. However, due to the lack of labelled samples in the support set, the performance of such a classifier is usually not satisfactory. Therefore we aim to perform meta-learning on the training set, in order to extract transferrable knowledge that will allow us to perform better few-shot learning on the support set and thus classify the test set more successfully.

An effective way to exploit the training set is to mimic the few-shot learning setting via episode-based training, as proposed in [39]. In each training iteration, an episode is formed by randomly selecting C classes from the training set with K labelled samples from each of the C classes to act as the sample set S = {(x_i, y_i)}_{i=1}^{m} (m = K × C), as well as a fraction of the remainder of those C classes' samples to serve as the query set Q = {(x_j, y_j)}_{j=1}^{n}. This sample/query set split is designed to simulate the support/test set that will be encountered at test time. A model trained on the sample/query sets can be further fine-tuned using the support set, if desired. In this work we adopt such an episode-based training strategy. In our few-shot experiments (see Section 4.1) we consider one-shot (K = 1, Figure 1) and five-shot (K = 5) settings. We also address the K = 0 zero-shot learning case, as explained in Section 3.3.

K-shot   For K-shot, where K > 1, we element-wise sum over the embedding module outputs of all samples from each training class to form this class's feature map. This pooled class-level feature map is combined with the query image feature map as above. Thus the number of relation scores for one query is always C, in both the one-shot and few-shot settings.

Objective function   We use mean square error (MSE) loss (Eq. (2)) to train our model, regressing the relation score r_{i,j} to the ground truth: matched pairs have similarity 1 and mismatched pairs have similarity 0.

    ϕ, φ ← argmin_{ϕ,φ} Σ_{i=1}^{m} Σ_{j=1}^{n} (r_{i,j} − 1(y_i == y_j))²    (2)

The choice of MSE is somewhat non-standard. Our problem may seem to be a classification problem with a label space {0, 1}. However, conceptually we are predicting relation scores, which can be considered a regression problem, despite the fact that for ground truth we can only automatically generate {0, 1} targets.

3.3. Zero-shot Learning

Zero-shot learning is analogous to one-shot learning in that one datum is given to define each class to recognise. However, instead of being given a support set with a one-shot image for each of C training classes, it contains a semantic class embedding vector v_c for each. Modifying our framework to deal with the zero-shot case is straightforward: as a different modality of semantic vectors is used for the support set (e.g. attribute vectors instead of images), we use a …
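The episode construction and the MSE objective in Eq. (2) can be sketched in a few lines. This is our own illustrative stand-in, not the paper's implementation: `sample_episode` is a hypothetical helper name, and the "images" are random 8-d feature vectors.

```python
import random
import numpy as np

def sample_episode(dataset, C=5, K=1, n_query=1):
    """Form one training episode: pick C classes, take K labelled samples
    per class as the sample set S (m = K * C) and n_query held-out samples
    per class as the query set Q, mimicking the support/test split."""
    classes = random.sample(sorted(dataset), C)
    S, Q = [], []
    for c in classes:
        picks = random.sample(dataset[c], K + n_query)
        S += [(x, c) for x in picks[:K]]   # sample set
        Q += [(x, c) for x in picks[K:]]   # query set from the remainder
    return S, Q

def mse_relation_loss(scores, y_support, y_query):
    """Eq. (2): squared error between relation scores and the {0, 1}
    indicator 1(y_i == y_j), summed over all support/query pairs."""
    targets = np.asarray(y_support)[:, None] == np.asarray(y_query)[None, :]
    return float(((scores - targets.astype(float)) ** 2).sum())

# Toy data: 20 training classes with 25 random 8-d feature vectors each.
rng = np.random.default_rng(0)
data = {f"class{k}": [rng.normal(size=8) for _ in range(25)] for k in range(20)}
S, Q = sample_episode(data, C=5, K=5, n_query=10)
print(len(S), len(Q))  # 25 50
```

A real implementation would feed images through the embedding and relation modules to produce the scores; here the loss function only shows how the {0, 1} regression targets are derived from the labels.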
Figure 1: Relation Network architecture for a 5-way 1-shot problem with one query example.
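As an illustrative sketch of the data flow in Figure 1 (with our own simplifications: `embed` is a plain tanh on vectors rather than the convolutional embedding module, and the relation module is a tiny random-weight MLP), one query is compared against all C class embeddings to produce C relation scores:

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H = 5, 16, 8                     # classes, feature dim, hidden units

def embed(x):
    # Stand-in for the embedding module f_phi (really conv blocks on images).
    return np.tanh(x)

W1 = rng.normal(size=(2 * D, H))       # relation module g_phi: 2-layer MLP
W2 = rng.normal(size=(H, 1))

def relation_scores(support, query):
    """Concatenate each class embedding with the query embedding and map
    each pair to a scalar relation score in [0, 1] via a sigmoid."""
    q = embed(query)
    out = []
    for s in support:                  # one embedding per class -> C scores
        pair = np.concatenate([embed(s), q])        # feature concatenation
        h = np.maximum(pair @ W1, 0.0)              # ReLU
        out.append(1.0 / (1.0 + np.exp(-(h @ W2)[0])))  # sigmoid score
    return np.array(out)

support = [rng.normal(size=D) for _ in range(C)]   # 5-way 1-shot: one per class
query = rng.normal(size=D)
print(relation_scores(support, query).shape)  # (5,)
```

For K > 1, the per-class support embeddings would simply be element-wise summed before concatenation, as described in the K-shot paragraph above.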
Figure 2: Relation Network architecture for few-shot learning (b), which is composed of elements including the convolutional block (a).

Figure 3: Relation Network architecture for zero-shot learning.

…for example, that there are 19 × 5 + 1 × 5 = 100 images in one training episode/mini-batch for the 5-way 1-shot experiments.

Results   Following [36], we computed few-shot classification accuracies on Omniglot by averaging over 1000 randomly generated episodes from the testing set. For the 1-shot and 5-shot experiments, we batch one and five query images per class respectively for evaluation during testing. The results are shown in Table 1. We achieved state-of-the-art performance under all experimental settings, with higher average accuracies and lower standard deviations, except 5-way 5-shot, where our model is 0.1% lower in accuracy than [10]. This is despite the fact that many alternatives have significantly more complicated machinery [27, 8] or fine-tune on the target problem [10, 39], while we do not.

The 5-way 1-shot setting contains 15 query images, and the 5-way 5-shot setting 10 query images, for each of the C sampled classes in each training episode. This means, for example, that there are 15 × 5 + 1 × 5 = 80 images in one training episode/mini-batch for 5-way 1-shot experiments. We resize input images to 84 × 84. Our model is trained end-to-end from scratch, with random initialisation and no additional training set.

Results   Following [36], we batch 15 query images per class in each episode for evaluation in both 1-shot and 5-shot scenarios, and the few-shot classification accuracies are computed by averaging over 600 randomly generated episodes from the test set.

From Table 2, we can see that our model achieved state-of-the-art performance in the 5-way 1-shot setting and competitive results on 5-way 5-shot. However, the 1-shot result reported by prototypical networks [36] required training on 30-way, 15-query episodes, and the 5-shot result was trained on 20-way, 15-query episodes. When trained with 5-way, 15-query episodes, [36] obtained only 46.14 ± 0.77% for 1-shot evaluation, clearly weaker than ours. In contrast, all our models are trained 5-way, with 1 query per class for 1-shot and 5 queries for 5-shot per training episode, i.e. with far fewer training queries than [36].
Table 1: Omniglot few-shot classification. Results are accuracies averaged over 1000 test episodes, with 95% confidence intervals where reported. The best-performing method is highlighted, along with any others whose confidence intervals overlap. '-': not reported.
Table 4: Comparative results under the GBU setting. Under the conventional ZSL setting, the performance is evaluated using per-class
average Top-1 (T1) accuracy (%), and under GZSL, it is measured using u = T1 on unseen classes, s = T1 on seen classes, and H =
harmonic mean.
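The harmonic mean H in Table 4 combines the seen- and unseen-class accuracies, penalising methods that do well on only one of the two. A minimal helper (function name is our own):

```python
def gzsl_harmonic_mean(u, s):
    """H = 2us / (u + s): harmonic mean of unseen-class (u) and seen-class
    (s) per-class Top-1 accuracies under the GZSL protocol."""
    return 0.0 if u + s == 0 else 2.0 * u * s / (u + s)

print(gzsl_harmonic_mean(40.0, 60.0))  # 48.0
```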