
Learning to Compare: Relation Network for Few-Shot Learning

Flood Sung    Yongxin Yang³    Li Zhang²    Tao Xiang¹    Philip H.S. Torr²    Timothy M. Hospedales³
¹Queen Mary University of London    ²University of Oxford    ³The University of Edinburgh
[email protected]    [email protected]    {lz, phst}@robots.ox.ac.uk    {yongxin.yang, t.hospedales}@ed.ac.uk

arXiv:1711.06025v2 [cs.CV] 27 Mar 2018

Abstract

We present a conceptually simple, flexible, and general framework for few-shot learning, where a classifier must learn to recognise new classes given only a few examples from each. Our method, called the Relation Network (RN), is trained end-to-end from scratch. During meta-learning, it learns to learn a deep distance metric to compare a small number of images within episodes, each of which is designed to simulate the few-shot setting. Once trained, an RN is able to classify images of new classes by computing relation scores between query images and the few examples of each new class without further updating the network. Besides providing improved performance on few-shot learning, our framework is easily extended to zero-shot learning. Extensive experiments on five benchmarks demonstrate that our simple approach provides a unified and effective approach for both of these two tasks.

1. Introduction

Deep learning models have achieved great success in visual recognition tasks [22, 15, 35]. However, these supervised learning models need large amounts of labelled data and many iterations to train their large number of parameters. This severely limits their scalability to new classes due to annotation cost, but more fundamentally limits their applicability to newly emerging (e.g. new consumer devices) or rare (e.g. rare animals) categories where numerous annotated images may simply never exist. In contrast, humans are very good at recognising objects with very little direct supervision, or none at all, i.e., few-shot [23, 9] or zero-shot [24] learning. For example, children have no problem generalising the concept of "zebra" from a single picture in a book, or from hearing its description as looking like a stripy horse. Motivated by the failure of conventional deep learning methods to work well on one or few examples per class, and inspired by the few- and zero-shot learning ability of humans, there has been a recent resurgence of interest in machine one/few-shot [8, 39, 32, 18, 20, 10, 27, 36, 29] and zero-shot [11, 3, 24, 45, 25, 31] learning.

Few-shot learning aims to recognise novel visual categories from very few labelled examples. The availability of only one or very few examples challenges the standard 'fine-tuning' practice in deep learning [10]. Data augmentation and regularisation techniques can alleviate overfitting in such a limited-data regime, but they do not solve it. Therefore contemporary approaches to few-shot learning often decompose training into an auxiliary meta-learning phase where transferrable knowledge is learned in the form of good initial conditions [10], embeddings [36, 39] or optimisation strategies [29]. The target few-shot learning problem is then learned by fine-tuning [10] with the learned optimisation strategy [29], or computed in a feed-forward pass [36, 39, 4, 32] without updating network weights. Zero-shot learning also suffers from a related challenge. Recognisers are trained by a single example in the form of a class description (c.f., a single exemplar image in one-shot), making data insufficiency for gradient-based learning a challenge.

While promising, most existing few-shot learning approaches either require complex inference mechanisms [23, 9], complex recurrent neural network (RNN) architectures [39, 32], or fine-tuning on the target problem [10, 29]. Our approach is most related to others that aim to train an effective metric for one-shot learning [39, 36, 20]. Where they focus on the learning of the transferrable embedding and pre-define a fixed metric (e.g., Euclidean [36]), we further aim to learn a transferrable deep metric for comparing the relation between images (few-shot learning), or between images and class descriptions (zero-shot learning). By expressing the inductive bias of a deeper solution (multiple non-linear learned stages at both embedding and relation modules), we make it easier to learn a generalisable solution to the problem.

Specifically, we propose a two-branch Relation Network (RN) that performs few-shot recognition by learning to compare query images against few-shot labelled sample images. First an embedding module generates representations of the query and training images. Then these embeddings are compared by a relation module that determines if they are from matching categories or not. Defining an episode-based strategy inspired by [39, 36], the embedding and relation modules are meta-learned end-to-end to support few-shot learning. This can be seen as extending the strategy of [39, 36] to include a learnable non-linear comparator, instead of a fixed linear comparator. Our approach outperforms prior approaches, while being simpler (no RNNs [39, 32, 29]) and faster (no fine-tuning [29, 10]). Our proposed strategy also directly generalises to zero-shot learning. In this case the sample branch embeds a single-shot category description rather than a single exemplar training image, and the relation module learns to compare query image and category description embeddings.

Overall our contribution is to provide a clean framework that elegantly encompasses both few- and zero-shot learning. Our evaluation on four benchmarks shows that it provides compelling performance across the board while being simpler and faster than the alternatives.

2. Related Work

The study of one- or few-shot object recognition has been of interest for some time [9]. Earlier work on few-shot learning tended to involve generative models with complex iterative inference strategies [9, 23]. With the success of discriminative deep learning-based approaches in the data-rich many-shot setting [22, 15, 35], there has been a surge of interest in generalising such deep learning approaches to the few-shot learning setting. Many of these approaches use a meta-learning or learning-to-learn strategy in the sense that they extract some transferrable knowledge from a set of auxiliary tasks (meta-learning, learning-to-learn), which then helps them to learn the target few-shot problem well without suffering from the overfitting that might be expected when applying deep models to sparse data problems.

Learning to Fine-Tune. The successful MAML approach [10] aimed to meta-learn an initial condition (set of neural network weights) that is good for fine-tuning on few-shot problems. The strategy here is to search for the weight configuration of a given neural network such that it can be effectively fine-tuned on a sparse data problem within a few gradient-descent update steps. Many distinct target problems are sampled from a multiple-task training set; the base neural network model is then fine-tuned to solve each of them, and the success at each target problem after fine-tuning drives updates in the base model, thus driving the production of an easy-to-fine-tune initial condition. The few-shot optimisation approach [29] goes further in meta-learning not only a good initial condition but an LSTM-based optimizer that is trained to be specifically effective for fine-tuning. However, both of these approaches suffer from the need to fine-tune on the target problem. In contrast, our approach solves target problems in an entirely feed-forward manner with no model updates required, making it more convenient for low-latency or low-power applications.

RNN Memory Based. Another category of approaches leverages recurrent neural networks with memories [27, 32]. Here the idea is typically that an RNN iterates over the examples of a given problem and accumulates the knowledge required to solve that problem in its hidden activations or external memory. New examples can be classified, for example, by comparing them to historic information stored in the memory. So 'learning' a single target problem can occur in unrolling the RNN, while learning-to-learn means training the weights of the RNN by learning many distinct problems. While appealing, these architectures face issues in ensuring that they reliably store all the, potentially long-term, historical information of relevance without forgetting. In our approach we avoid the complexity of recurrent networks, and the issues involved in ensuring the adequacy of their memory. Instead our learning-to-learn approach is defined entirely with simple and fast feed-forward CNNs.

Embedding and Metric Learning Approaches. The prior approaches entail some complexity when learning the target few-shot problem. Another category of approach aims to learn a set of projection functions that take query and sample images from the target problem and classify them in a feed-forward manner [39, 36, 4]. One approach is to parameterise the weights of a feed-forward classifier in terms of the sample set [4]. The meta-learning here is to train the auxiliary parameterisation net that learns how to parameterise a given feed-forward classification problem in terms of a few-shot sample set. Metric-learning based approaches aim to learn a set of projection functions such that, when represented in this embedding, images are easy to recognise using simple nearest-neighbour or linear classifiers [39, 36, 20]. In this case the meta-learned transferrable knowledge is the projection functions, and the target problem is a simple feed-forward computation.

The most related methodologies to ours are the prototypical networks of [36] and the siamese networks of [20]. These approaches focus on learning embeddings that transform the data such that it can be recognised with a fixed nearest-neighbour [36] or linear [20, 36] classifier. In contrast, our framework further defines a relation classifier CNN, in the style of [33, 44, 14] (while [33] focuses on reasoning about the relation between two objects in the same image, which addresses a different problem). Compared to [20, 36], this can be seen as providing a learnable rather than fixed metric, or a non-linear rather than linear classifier. Compared to [20], we benefit from an episodic training strategy trained end-to-end from scratch, and compared to [32] we avoid the complexity of set-to-set RNN embedding of the sample set, and simply rely on pooling [33].

Zero-Shot Learning. Our approach is designed for few-shot learning, but elegantly spans the space into zero-shot learning (ZSL) by modifying the sample branch to input a single category description rather than a single training image. When applied to ZSL, our architecture is related to methods that learn to align images and category embeddings and perform recognition by predicting if an image and category embedding pair match [11, 3, 43, 46]. Similarly to the prior metric-based few-shot approaches, most of these apply a fixed, manually defined similarity metric or linear classifier after combining the image and category embedding. In contrast, we again benefit from a deeper end-to-end architecture including a learned non-linear metric in the form of our learned convolutional relation network, as well as from an episode-based training strategy.
3. Methodology

3.1. Problem Definition

We consider the task of few-shot classifier learning. Formally, we have three datasets: a training set, a support set, and a testing set. The support set and testing set share the same label space, but the training set has its own label space that is disjoint from the support/testing set. If the support set contains K labelled examples for each of C unique classes, the target few-shot problem is called C-way K-shot.

With the support set only, we can in principle train a classifier to assign a class label ŷ to each sample x̂ in the test set. However, due to the lack of labelled samples in the support set, the performance of such a classifier is usually not satisfactory. Therefore we aim to perform meta-learning on the training set, in order to extract transferrable knowledge that will allow us to perform better few-shot learning on the support set and thus classify the test set more successfully.

An effective way to exploit the training set is to mimic the few-shot learning setting via episode-based training, as proposed in [39]. In each training iteration, an episode is formed by randomly selecting C classes from the training set with K labelled samples from each of the C classes to act as the sample set S = {(x_i, y_i)}_{i=1}^{m} (m = K × C), as well as a fraction of the remainder of those C classes' samples to serve as the query set Q = {(x_j, y_j)}_{j=1}^{n}. This sample/query set split is designed to simulate the support/test set that will be encountered at test time. A model trained from the sample/query sets can be further fine-tuned using the support set, if desired. In this work we adopt such an episode-based training strategy. In our few-shot experiments (see Section 4.1) we consider one-shot (K = 1, Figure 1) and five-shot (K = 5) settings. We also address the K = 0 zero-shot learning case, as explained in Section 3.3.
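To make the episode construction concrete, the following is a minimal PyTorch sketch of sampling one C-way K-shot episode. It assumes the training set is held as a mapping from class label to an image tensor; all names (`images_by_class`, `sample_episode`) are illustrative, not taken from the paper's released code.

```python
import random
import torch

# `images_by_class` is assumed to map each training class label to a tensor
# of that class's images, e.g. {label: Tensor[20, 1, 28, 28]} for Omniglot.
def sample_episode(images_by_class, C=5, K=1, n_query=19):
    classes = random.sample(list(images_by_class), C)
    sample_x, sample_y, query_x, query_y = [], [], [], []
    for label, cls in enumerate(classes):
        perm = torch.randperm(len(images_by_class[cls]))
        sample_idx, query_idx = perm[:K], perm[K:K + n_query]
        sample_x.append(images_by_class[cls][sample_idx])   # K sample images
        sample_y += [label] * K
        query_x.append(images_by_class[cls][query_idx])     # query images
        query_y += [label] * len(query_idx)
    return (torch.cat(sample_x), torch.tensor(sample_y),
            torch.cat(query_x), torch.tensor(query_y))
```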
3.2. Model

One-Shot. Our Relation Network (RN) consists of two modules: an embedding module f_ϕ and a relation module g_φ, as illustrated in Figure 1. Samples x_j in the query set Q and samples x_i in the sample set S are fed through the embedding module f_ϕ, which produces feature maps f_ϕ(x_i) and f_ϕ(x_j). The feature maps f_ϕ(x_i) and f_ϕ(x_j) are combined with an operator C(f_ϕ(x_i), f_ϕ(x_j)). In this work we assume C(·, ·) to be concatenation of feature maps in depth, although other choices are possible.

Figure 1: Relation Network architecture for a 5-way 1-shot problem with one query example.

The combined feature map of the sample and query is fed into the relation module g_φ, which eventually produces a scalar in the range 0 to 1 representing the similarity between x_i and x_j, called the relation score. Thus, in the C-way one-shot setting, we generate C relation scores r_{i,j} for the relation between one query input x_j and the training sample set examples x_i:

    r_{i,j} = g_φ(C(f_ϕ(x_i), f_ϕ(x_j))),  i = 1, 2, …, C    (1)

K-shot. For K-shot, where K > 1, we element-wise sum over the embedding module outputs of all samples from each training class to form that class's feature map. This pooled class-level feature map is combined with the query image feature map as above. Thus, the number of relation scores for one query is always C, in both the one-shot and few-shot settings.

Objective function. We use mean square error (MSE) loss (Eq. (2)) to train our model, regressing the relation score r_{i,j} to the ground truth: matched pairs have similarity 1 and mismatched pairs have similarity 0.

    ϕ, φ ← argmin_{ϕ,φ} Σ_{i=1}^{m} Σ_{j=1}^{n} (r_{i,j} − 1(y_i == y_j))²    (2)

The choice of MSE is somewhat non-standard. Our problem may seem to be a classification problem with a label space {0, 1}. However, conceptually we are predicting relation scores, which can be considered a regression problem, despite the fact that for ground truth we can only automatically generate {0, 1} targets.
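The following sketch shows how Eqs. (1) and (2) could be computed in PyTorch for one episode, under the assumption that `f_embed` and `g_relation` stand in for the embedding module f_ϕ and relation module g_φ; the K-shot element-wise sum pooling described above is included. This is our illustrative rendering, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def episode_loss(f_embed, g_relation, sample_x, sample_y, query_x, query_y, C):
    feats = f_embed(sample_x)                               # [C*K, 64, h, w]
    # K-shot: element-wise sum of each class's K sample embeddings
    class_feats = torch.stack(
        [feats[sample_y == c].sum(0) for c in range(C)])    # [C, 64, h, w]
    q_feats = f_embed(query_x)                              # [Q, 64, h, w]
    Q = q_feats.size(0)
    # form all C x Q pairs, concatenating along the channel (depth) axis
    pairs = torch.cat([class_feats.unsqueeze(1).expand(-1, Q, -1, -1, -1),
                       q_feats.unsqueeze(0).expand(C, -1, -1, -1, -1)],
                      dim=2)                                # [C, Q, 128, h, w]
    scores = g_relation(pairs.flatten(0, 1)).view(C, Q)     # Eq. (1)
    # matched pairs regress to 1, mismatched pairs to 0
    targets = (torch.arange(C).unsqueeze(1) == query_y.unsqueeze(0)).float()
    return F.mse_loss(scores, targets)                      # Eq. (2)
```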
3.3. Zero-shot Learning

Zero-shot learning is analogous to one-shot learning in that one datum is given to define each class to recognise. However, instead of being given a support set with a one-shot image for each of the C training classes, the support set contains a semantic class embedding vector v_c for each class. Modifying our framework to deal with the zero-shot case is straightforward: as a different modality of semantic vectors is used for the support set (e.g. attribute vectors instead of images), we use a second, heterogeneous embedding module f_{ϕ1} besides the embedding module f_{ϕ2} used for the image query set. Then the relation net g_φ is applied as before. Therefore, the relation score for each query input x_j will be:

    r_{i,j} = g_φ(C(f_{ϕ1}(v_c), f_{ϕ2}(x_j))),  i = 1, 2, …, C    (3)

The objective function for zero-shot learning is the same as that for few-shot learning.
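As a sketch of Eq. (3): the zero-shot variant only swaps the sample branch. Here a hypothetical `f_embed_attr` MLP stands in for f_{ϕ1} and embeds attribute vectors, while `f_embed_img` (e.g. a pretrained CNN) stands in for f_{ϕ2}; the pair is concatenated as vectors before the relation module. All names are our assumptions.

```python
import torch

def zero_shot_scores(g_relation, f_embed_attr, f_embed_img, class_attrs, query_x):
    v = f_embed_attr(class_attrs)             # [C, D] embedded attribute vectors
    q = f_embed_img(query_x)                  # [Q, D] embedded query images
    C, Q = v.size(0), q.size(0)
    # concatenate every (class, query) pair along the feature dimension
    pairs = torch.cat([v.unsqueeze(1).expand(-1, Q, -1),
                       q.unsqueeze(0).expand(C, -1, -1)], dim=2)   # [C, Q, 2D]
    return g_relation(pairs.flatten(0, 1)).view(C, Q)              # Eq. (3)
```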
3.4. Network Architecture

As most few-shot learning models utilise four convolutional blocks for the embedding module [39, 36], we follow the same architecture setting for fair comparison; see Figure 2. More concretely, each convolutional block contains a 64-filter 3 × 3 convolution, a batch normalisation and a ReLU nonlinearity layer. The first two blocks also contain a 2 × 2 max-pooling layer, while the latter two do not: we need the output feature maps for the further convolutional layers in the relation module. The relation module consists of two convolutional blocks and two fully-connected layers. Each convolutional block is a 3 × 3 convolution with 64 filters followed by batch normalisation, ReLU non-linearity and 2 × 2 max-pooling. The output size of the last max-pooling layer is H = 64 for Omniglot and H = 64 ∗ 3 ∗ 3 = 576 for miniImageNet. The two fully-connected layers are 8- and 1-dimensional, respectively. All fully-connected layers are ReLU except the output layer, which is Sigmoid in order to generate relation scores in a reasonable range for all versions of our network architecture.

Figure 2: Relation Network architecture for few-shot learning (b), which is composed of elements including the convolutional block (a).

The zero-shot learning architecture is shown in Figure 3. In this architecture, the DNN subnet is an existing network (e.g., Inception or ResNet) pretrained on ImageNet.

Figure 3: Relation Network architecture for zero-shot learning.
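Below is a minimal PyTorch sketch of the architecture just described, assuming Omniglot-sized 28 × 28 inputs (so the relation module's flattened feature size is H = 64). The padding choice and class names are our assumptions, since the text specifies only filter counts, kernel sizes and pooling.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, pool):
    # 3x3 conv with 64 filters + batch norm + ReLU, optionally 2x2 max-pool
    layers = [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_ch),
              nn.ReLU()]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class EmbeddingModule(nn.Module):
    """f_phi: four conv blocks; the last two keep full-size feature maps."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(1, 64, pool=True),
            conv_block(64, 64, pool=True),
            conv_block(64, 64, pool=False),   # no pooling, so the relation
            conv_block(64, 64, pool=False))   # module still sees feature maps

    def forward(self, x):
        return self.net(x)

class RelationModule(nn.Module):
    """g_phi: two conv blocks, then FC-8 (ReLU) and FC-1 (Sigmoid)."""
    def __init__(self, H=64):                 # H = 64 for 28x28 Omniglot inputs
        super().__init__()
        self.conv = nn.Sequential(
            conv_block(128, 64, pool=True),   # 128 = two concatenated 64-ch maps
            conv_block(64, 64, pool=True))
        self.fc = nn.Sequential(nn.Linear(H, 8), nn.ReLU(),
                                nn.Linear(8, 1), nn.Sigmoid())

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x.flatten(1))
```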
4. Experiments

We evaluate our approach on two related tasks: few-shot classification on Omniglot and miniImagenet, and zero-shot classification on Animals with Attributes (AwA) and Caltech-UCSD Birds-200-2011 (CUB). All the experiments are implemented in PyTorch [1].

4.1. Few-shot Recognition

Settings. Few-shot learning in all experiments uses Adam [19] with initial learning rate 10⁻³, annealed by half every 100,000 episodes. All our models are trained end-to-end from scratch with no additional dataset.
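A sketch of this optimisation setup using the standard `torch.optim` APIs; the module and helper names refer to the earlier sketches, and the total episode count is illustrative, since the paper does not state one.

```python
import itertools
import torch

# modules/helpers from the earlier sketches (Secs. 3.1, 3.2, 3.4)
embedding_module, relation_module = EmbeddingModule(), RelationModule()
params = itertools.chain(embedding_module.parameters(),
                         relation_module.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100_000, gamma=0.5)

for episode in range(1_000_000):
    sx, sy, qx, qy = sample_episode(images_by_class, C=5, K=1, n_query=19)
    loss = episode_loss(embedding_module, relation_module, sx, sy, qx, qy, C=5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # halves the learning rate every 100,000 episodes
```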
Baselines. We compare against various state-of-the-art baselines for few-shot recognition, including the neural statistician [8], Matching Nets with and without fine-tuning [39], MANN [32], Siamese Nets with Memory [18], Convolutional Siamese Nets [20], MAML [10], Meta Nets [27], Prototypical Nets [36] and Meta-Learner LSTM [29].

4.1.1 Omniglot

Dataset. Omniglot [23] contains 1623 characters (classes) from 50 different alphabets. Each class contains 20 samples drawn by different people. Following [32, 39, 36], we augment new classes through 90°, 180° and 270° rotations of existing data, and use 1200 original classes plus rotations for training and the remaining 423 classes plus rotations for testing. All input images are resized to 28 × 28.

Training. Besides the K sample images, the 5-way 1-shot setting contains 19 query images, the 5-way 5-shot has 15 query images, the 20-way 1-shot has 10 query images and the 20-way 5-shot has 5 query images for each of the C sampled classes in each training episode. This means, for example, that there are 19 × 5 + 1 × 5 = 100 images in one training episode/mini-batch for the 5-way 1-shot experiments.

Results. Following [36], we computed few-shot classification accuracies on Omniglot by averaging over 1000 randomly generated episodes from the testing set. For the 1-shot and 5-shot experiments, we batch one and five query images per class respectively for evaluation during testing. The results are shown in Table 1. We achieved state-of-the-art performance under all experiment settings, with higher averaged accuracies and lower standard deviations, except 5-way 5-shot, where our model is 0.1% lower in accuracy than [10]. This is despite the fact that many alternatives have significantly more complicated machinery [27, 8], or fine-tune on the target problem [10, 39], while we do not.

4.1.2 miniImageNet

Dataset. The miniImagenet dataset, originally proposed by [39], consists of 60,000 colour images with 100 classes, each having 600 examples. We followed the split introduced by [29], with 64, 16, and 20 classes for training, validation and testing, respectively. The 16 validation classes are used for monitoring generalisation performance only.

Training. Following the standard setting adopted by most existing few-shot learning work, we conducted 5-way 1-shot and 5-shot classification. Besides the K sample images, the 5-way 1-shot setting contains 15 query images, and the 5-way 5-shot has 10 query images for each of the C sampled classes in each training episode. This means, for example, that there are 15 × 5 + 1 × 5 = 80 images in one training episode/mini-batch for the 5-way 1-shot experiments. We resize input images to 84 × 84. Our model is trained end-to-end from scratch, with random initialisation, and no additional training set.

Results. Following [36], we batch 15 query images per class in each episode for evaluation in both 1-shot and 5-shot scenarios, and the few-shot classification accuracies are computed by averaging over 600 randomly generated episodes from the test set.

From Table 2, we can see that our model achieved state-of-the-art performance on the 5-way 1-shot setting and competitive results on 5-way 5-shot. However, the 1-shot result reported by prototypical networks [36] required training on 30-way episodes with 15 queries per training episode, and the 5-shot result was trained on 20-way episodes with 15 queries per training episode. When trained with 5-way, 15-query training episodes, [36] only obtained 46.14 ± 0.77% for 1-shot evaluation, clearly weaker than ours. In contrast, all our models are trained on 5-way episodes, with 1 query for 1-shot and 5 queries for 5-shot per training episode, i.e., with far fewer training queries than [36].
Model | Fine Tune | 5-way 1-shot | 5-way 5-shot | 20-way 1-shot | 20-way 5-shot
MANN [32] | N | 82.8% | 94.9% | - | -
Convolutional Siamese Nets [20] | N | 96.7% | 98.4% | 88.0% | 96.5%
Convolutional Siamese Nets [20] | Y | 97.3% | 98.4% | 88.1% | 97.0%
Matching Nets [39] | N | 98.1% | 98.9% | 93.8% | 98.5%
Matching Nets [39] | Y | 97.9% | 98.7% | 93.5% | 98.7%
Siamese Nets with Memory [18] | N | 98.4% | 99.6% | 95.0% | 98.6%
Neural Statistician [8] | N | 98.1% | 99.5% | 93.2% | 98.1%
Meta Nets [27] | N | 99.0% | - | 97.0% | -
Prototypical Nets [36] | N | 98.8% | 99.7% | 96.0% | 98.9%
MAML [10] | Y | 98.7 ± 0.4% | 99.9 ± 0.1% | 95.8 ± 0.3% | 98.9 ± 0.2%
Relation Net (ours) | N | 99.6 ± 0.2% | 99.8 ± 0.1% | 97.6 ± 0.2% | 99.1 ± 0.1%

Table 1: Omniglot few-shot classification. Results are accuracies averaged over 1000 test episodes and with 95% confidence intervals where reported. The best-performing method is highlighted, along with others whose confidence intervals overlap. '-': not reported.

Model | FT | 5-way 1-shot | 5-way 5-shot
Matching Nets [39] | N | 43.56 ± 0.84% | 55.31 ± 0.73%
Meta Nets [27] | N | 49.21 ± 0.96% | -
Meta-Learn LSTM [29] | N | 43.44 ± 0.77% | 60.60 ± 0.71%
MAML [10] | Y | 48.70 ± 1.84% | 63.11 ± 0.92%
Prototypical Nets [36] | N | 49.42 ± 0.78% | 68.20 ± 0.66%
Relation Net (ours) | N | 50.44 ± 0.82% | 65.32 ± 0.70%

Table 2: Few-shot classification accuracies on miniImagenet. All accuracy results are averaged over 600 test episodes and are reported with 95% confidence intervals, same as [36]. For each task, the best-performing method is highlighted, along with any others whose confidence intervals overlap. '-': not reported.

4.2. Zero-shot Recognition

Datasets and settings. We follow two ZSL settings: the old setting and the new GBU setting provided by [42] for training/test splits. Under the old setting, adopted by most existing ZSL works before [42], some of the test classes also appear in the ImageNet 1K classes, which have been used to pretrain the image embedding network, thus violating the zero-shot assumption. In contrast, the new GBU setting ensures that none of the test classes of the datasets appear in the ImageNet 1K classes. Under both settings, the test set can comprise only the unseen class samples (the conventional test set setting) or a mixture of seen and unseen class samples. The latter, termed generalised zero-shot learning (GZSL), is more realistic in practice.

Two widely used ZSL benchmarks are selected for the old setting: AwA (Animals with Attributes) [24] consists of 30,745 images of 50 classes of animals. It has a fixed split for evaluation with 40 training classes and 10 test classes. CUB (Caltech-UCSD Birds-200-2011) [40] contains 11,788 images of 200 bird species with 150 seen classes and 50 disjoint unseen classes. Three datasets [42] are selected for the GBU setting: AwA1, AwA2 and CUB. The newly released AwA2 [42] consists of 37,322 images of 50 classes and is an extension of AwA, while AwA1 is the same as AwA but under the GBU setting.

Semantic representation. For AwA, we use the continuous 85-dimension class-level attribute vector from [24], which has been used by all recent works. For CUB, a continuous 312-dimension class-level attribute vector is used.

Implementation details. Two different embedding modules are used for the two input modalities in zero-shot learning. Unless otherwise specified, we use Inception-V2 [38, 17] as the query image embedding DNN in the old and conventional setting and ResNet101 [16] for the GBU and generalised setting, taking the top pooling units as the image embedding, with dimension D = 1024 and 2048 respectively. This DNN is pre-trained on ILSVRC 2012 1K classification without fine-tuning, as in recent deep ZSL works [25, 30, 45]. An MLP network is used for embedding semantic attribute vectors. The size of hidden layer FC1 (Figure 3) is set to 1024 and 1200 for AwA and CUB respectively, and the output size of FC2 is set to the same dimension as the image embedding for both datasets. For the relation module, the image and semantic embeddings are concatenated before being fed into MLPs with hidden layer FC3 of size 400 and 1200 for AwA and CUB, respectively. We add weight decay (L2 regularisation) in FC1 & FC2, as there is a hubness problem [45] in cross-modal mapping for ZSL which is best alleviated by mapping the semantic feature vector to the visual feature space with regularisation. After that, FC3 & FC4 (the relation module) are used to compute the relation between the semantic representation (in the visual feature space) and the visual representation. Since the hubness problem does not exist in this step, no L2 regularisation/weight decay is needed. All the ZSL models are trained with weight decay 10⁻⁵ in the embedding network. The learning rate is initialised to 10⁻⁵ with Adam [19] and then annealed by half every 200,000 iterations.
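The selective weight decay described above maps naturally onto per-parameter-group options in `torch.optim`. A sketch, assuming `attribute_mlp` holds FC1-FC2 and `relation_mlp` holds FC3-FC4 (our illustrative names):

```python
import torch

# Weight decay only where the hubness problem arises (the semantic-to-visual
# mapping, FC1 & FC2); none on the relation layers (FC3 & FC4).
optimizer = torch.optim.Adam(
    [{"params": attribute_mlp.parameters(), "weight_decay": 1e-5},  # FC1 & FC2
     {"params": relation_mlp.parameters(), "weight_decay": 0.0}],   # FC3 & FC4
    lr=1e-5)
# halve the learning rate every 200,000 iterations, as in the text
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
```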
Results under the old setting. The conventional evaluation for ZSL, followed by the majority of prior work, is to assume that the test data all come from unseen classes. We evaluate this setting first. We compare 15 alternative approaches in Table 3. With only the attribute vector used as the sample class embedding, our model achieves a competitive result on AwA and state-of-the-art performance on the more challenging CUB dataset, outperforming the most related alternative, prototypical networks [36], by a big margin. Note that only inductive methods are considered. Some recent methods [48, 12, 13] are transductive in that they use all test data at once for model training, which gives them a big advantage at the cost of making a very strong assumption that may not be met in practical applications, so we do not compare with them here.

Model | F | SS | AwA (10-way 0-shot) | CUB (50-way 0-shot)
SJE [3] | FG | A | 66.7 | 50.1
ESZSL [31] | FG | A | 76.3 | 47.2
SSE-ReLU [46] | FV | A | 76.3 | 30.4
JLSE [47] | FV | A | 80.5 | 42.1
Sync-struct [6] | FG | A | 72.9 | 54.5
Sec-ML [5] | FV | A | 77.3 | 43.3
Proto. Nets [36] | FG | A | - | 54.6
DeViSE [11] | NG | A/W | 56.7/50.4 | 33.5
Socher et al. [37] | NG | A/W | 60.8/50.3 | 39.6
MTMDL [43] | NG | A/W | 63.7/55.3 | 32.3
Ba et al. [25] | NG | A/W | 69.3/58.7 | 34.0
DS-SJE [30] | NG | A/D | - | 50.4/56.8
SAE [21] | NG | A | 84.7 | 61.4
DEM [45] | NG | A/W | 86.7/78.8 | 58.3
Relation Net (ours) | NG | A | 84.5 | 62.0

Table 3: Zero-shot classification accuracy (%) comparison on AwA and CUB (hit@1 accuracy over all samples) under the old and conventional setting. SS: semantic space; A: attribute space; W: semantic word vector space; D: sentence description (only available for CUB). F: how the visual feature space is computed; for non-deep models: FO if OverFeat [34] is used, FG for GoogLeNet [38], and FV for VGG net [35]. Neural network based methods all use Inception-V2 (GoogLeNet with batch normalisation) [38, 17] as the DNN image embedding subnet, indicated as NG.

Results under the GBU setting. We follow the evaluation setting of [42]. We compare our model with 11 alternative ZSL models in Table 4. The 10 shallow model results are from [42], and the result of the state-of-the-art method DEM [45] is from the authors' GitHub page (https://github.com/lzrobots/DeepEmbeddingModel_ZSL). We can see that on AwA2 and CUB our model is particularly strong under the more realistic GZSL setting, measured using the harmonic mean (H) metric. On AwA1, our method is only outperformed by DEM [45].
Model | AwA1 T1 | u | s | H | AwA2 T1 | u | s | H | CUB T1 | u | s | H
DAP [24] | 44.1 | 0.0 | 88.7 | 0.0 | 46.1 | 0.0 | 84.7 | 0.0 | 40.0 | 1.7 | 67.9 | 3.3
ConSE [28] | 45.6 | 0.4 | 88.6 | 0.8 | 44.5 | 0.5 | 90.6 | 1.0 | 34.3 | 1.6 | 72.2 | 3.1
SSE [46] | 60.1 | 7.0 | 80.5 | 12.9 | 61.0 | 8.1 | 82.5 | 14.8 | 43.9 | 8.5 | 46.9 | 14.4
DeViSE [11] | 54.2 | 13.4 | 68.7 | 22.4 | 59.7 | 17.1 | 74.7 | 27.8 | 52.0 | 23.8 | 53.0 | 32.8
SJE [3] | 65.6 | 11.3 | 74.6 | 19.6 | 61.9 | 8.0 | 73.9 | 14.4 | 53.9 | 23.5 | 59.2 | 33.6
LatEm [41] | 55.1 | 7.3 | 71.7 | 13.3 | 55.8 | 11.5 | 77.3 | 20.0 | 49.3 | 15.2 | 57.3 | 24.0
ESZSL [31] | 58.2 | 6.6 | 75.6 | 12.1 | 58.6 | 5.9 | 77.8 | 11.0 | 53.9 | 12.6 | 63.8 | 21.0
ALE [2] | 59.9 | 16.8 | 76.1 | 27.5 | 62.5 | 14.0 | 81.8 | 23.9 | 54.9 | 23.7 | 62.8 | 34.4
Sync [6] | 54.0 | 8.9 | 87.3 | 16.2 | 46.6 | 10.0 | 90.5 | 18.0 | 55.6 | 11.5 | 70.9 | 19.8
SAE [21] | 53.0 | 1.8 | 77.1 | 3.5 | 54.1 | 1.1 | 82.2 | 2.2 | 33.3 | 7.8 | 57.9 | 29.2
DEM [45] | 68.4 | 32.8 | 84.7 | 47.3 | 67.1 | 30.5 | 86.4 | 45.1 | 51.7 | 19.6 | 54.0 | 13.6
Relation Net (ours) | 68.2 | 31.4 | 91.3 | 46.7 | 64.2 | 30.0 | 93.4 | 45.3 | 55.6 | 38.1 | 61.1 | 47.0

Table 4: Comparative results under the GBU setting. Under the conventional ZSL setting, performance is evaluated using per-class average Top-1 (T1) accuracy (%); under GZSL, it is measured using u = T1 on unseen classes, s = T1 on seen classes, and H = harmonic mean.

5. Why does Relation Network Work?

5.1. Relationship to existing models

Related prior few-shot work uses fixed, pre-specified distance metrics such as Euclidean or cosine distance to perform classification [39, 36]. These studies can be seen as distance metric learning, but where all the learning occurs in the feature embedding, and a fixed metric is used given the learned embedding. Also related are conventional metric learning approaches [26, 7] that focus on learning a shallow (linear) Mahalanobis metric for a fixed feature representation. In contrast to prior work's fixed metric or fixed features and shallow learned metric, Relation Network can be seen as both learning a deep embedding and learning a deep non-linear metric (similarity function). (Our architecture does not guarantee the self-similarity and symmetry properties of a formal similarity function, but empirically we find these properties hold numerically for a trained Relation Network.) These are mutually tuned end-to-end to support each other in few-shot learning.

Why might this be particularly useful? By using a flexible function approximator to learn similarity, we learn a good metric in a data-driven way and do not have to manually choose the right metric (Euclidean, cosine, Mahalanobis). Fixed metrics like [39, 36] assume that features are solely compared element-wise, and the most related [36] assumes linear separability after the embedding. These are thus critically dependent on the efficacy of the learned embedding network, and hence limited by the extent to which the embedding networks generate inadequately discriminative representations. In contrast, by deep-learning a non-linear similarity metric jointly with the embedding, Relation Network can better identify matching/mismatching pairs.

5.2. Visualisation

To illustrate the previous point about the adequacy of learned input embeddings, we show a synthetic example where existing approaches definitely fail and our Relation Network can succeed due to using a deep relation module. Assuming 2D query and sample input embeddings to a relation module, Fig. 4(a) shows the space of 2D sample inputs for a fixed 2D query input. Each sample input (pixel) is colored according to whether it matches the fixed query or not. This represents a case where the output of the embedding modules is not discriminative enough for trivial (Euclidean NN) comparison between query and sample set. In Fig. 4(c) we attempt to learn matching via a Mahalanobis metric-learning relation module, and we can see the result is inadequate. In Fig. 4(d) we learn a further 2-hidden-layer MLP embedding of query and sample inputs as well as the subsequent Mahalanobis metric, which is also not adequate. Only by learning the full deep relation module for similarity can we solve this problem, in Fig. 4(b).

Figure 4: An example relation learnable by Relation Network and not by non-linear embedding + metric learning. (a) Ground Truth; (b) Relation Network; (c) Metric Learning; (d) Metric + Embedding.

In a real problem the difficulty of comparing embeddings may not be this extreme, but it can still be challenging. We qualitatively illustrate the challenge of matching two example Omniglot query images (embeddings projected to 2D, Figure 5 (left)) by showing an analogous plot of real sample images colored by match (cyan) or mismatch (magenta) to two example queries (yellow). Under standard assumptions [39, 36, 26, 7] the cyan matching samples should be nearest neighbours to the yellow query image under some metric (Euclidean, cosine, Mahalanobis). But we can see that the match relation is more complex than this. In Figure 5 (right), we instead plot the same two example queries in terms of a 2D PCA representation of each query-sample pair, as represented by the relation module's penultimate layer. We can see that the relation network has mapped the data into a space where the (mis)matched pairs are linearly separable.

Figure 5: Example Omniglot few-shot problem visualisations. Left: Matched (cyan) and mismatched (magenta) sample embeddings for a given query (yellow) are not straightforward to differentiate. Right: Matched (yellow) and mismatched (magenta) relation module pair representations are linearly separable.
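The Figure 5 (right) plot can be reproduced in outline as follows: collect the relation module's penultimate-layer activation for every query-sample pair and project to 2D with PCA. A sketch using `torch.pca_lowrank`, where `pair_feats` is an assumed [N, D] matrix of pair representations:

```python
import torch

def pca_2d(pair_feats):
    # pair_feats: [N, D] penultimate-layer activations, one row per
    # query-sample pair; returns [N, 2] coordinates for a scatter plot,
    # coloured by whether each pair is matched or mismatched.
    U, S, V = torch.pca_lowrank(pair_feats, q=2)   # data are centred internally
    return (pair_feats - pair_feats.mean(0, keepdim=True)) @ V
```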
6. Conclusion

We proposed a simple method called the Relation Network for few-shot and zero-shot learning. The Relation Network learns an embedding and a deep non-linear distance metric for comparing query and sample items. Training the network end-to-end with episodic training tunes the embedding and distance metric for effective few-shot learning. This approach is far simpler and more efficient than recent few-shot meta-learning approaches, and produces state-of-the-art results. It further proves effective at both the conventional and generalised zero-shot settings.

Acknowledgements. This work was supported by the ERC grant ERC-2012-AdG 321162-HELIOS, EPSRC grant Seebibyte EP/M013774/1, EPSRC/MURI grant EP/N019474/1, EPSRC grant EP/R026173/1, and the European Union's Horizon 2020 research and innovation programme (grant agreement no. 640891). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU, and the EPSRC-funded Tier 2 facility JADE used for this research.
References

[1] PyTorch. https://github.com/pytorch/pytorch.
[2] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. TPAMI, 2016.
[3] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In CVPR, 2015.
[4] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016.
[5] M. Bucher, S. Herbin, and F. Jurie. Improving semantic embedding consistency by metric learning for zero-shot classification. In ECCV, 2016.
[6] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In CVPR, 2016.
[7] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun. Bayesian face revisited: A joint formulation. In ECCV, 2012.
[8] H. Edwards and A. Storkey. Towards a neural statistician. In ICLR, 2017.
[9] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 2006.
[10] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
[11] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[12] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, 2014.
[13] Y. Fu and L. Sigal. Semi-supervised vocabulary-informed learning. In CVPR, 2016.
[14] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg. MatchNet: Unifying feature and metric learning for patch-based matching. In CVPR, 2015.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[18] Ł. Kaiser, O. Nachum, A. Roy, and S. Bengio. Learning to remember rare events. In ICLR, 2017.
[19] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[20] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Workshop, 2015.
[21] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. In CVPR, 2017.
[22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[23] B. Lake, R. Salakhutdinov, J. Gross, and J. Tenenbaum. One shot learning of simple visual concepts. In CogSci, 2011.
[24] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. PAMI, 2014.
[25] J. Lei Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[26] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
[27] T. Munkhdalai and H. Yu. Meta networks. In ICML, 2017.
[28] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
[29] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017.
[30] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
[31] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, 2015.
[32] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In ICML, 2016.
[33] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In NIPS, 2017.
[34] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.
[35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[36] J. Snell, K. Swersky, and R. S. Zemel. Prototypical networks for few-shot learning. In NIPS, 2017.
[37] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
[38] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[39] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In NIPS, 2016.
[40] C. Wah, S. Branson, P. Perona, and S. Belongie. Multiclass recognition and part localization with humans in the loop. In ICCV, 2011.
[41] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In CVPR, 2016.
[42] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600, 2017.
[43] Y. Yang and T. M. Hospedales. A unified perspective on multi-domain and multi-task learning. In ICLR, 2015.
[44] S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural networks. In CVPR, 2015.
[45] L. Zhang, T. Xiang, and S. Gong. Learning a deep embedding model for zero-shot learning. In CVPR, 2017.
[46] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In ICCV, 2015.
[47] Z. Zhang and V. Saligrama. Zero-shot learning via joint latent similarity embedding. In CVPR, 2016.
[48] Z. Zhang and V. Saligrama. Zero-shot recognition via structured prediction. In ECCV, 2016.
