
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 3, MARCH 2014

Attribute-Based Classification for Zero-Shot Visual Object Categorization
Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling

Abstract—We study the problem of object recognition for categories for which we have no training examples, a task also called
zero-data or zero-shot learning. This situation has hardly been studied in computer vision research, even though it occurs frequently;
the world contains tens of thousands of different object classes, and image collections have been formed and suitably annotated for
only a few of them. To tackle the problem, we introduce attribute-based classification: Objects are identified based on a high-level
description that is phrased in terms of semantic attributes, such as the object’s color or shape. Because the identification of each such
property transcends the specific learning task at hand, the attribute classifiers can be prelearned independently, for example, from
existing image data sets unrelated to the current task. Afterward, new classes can be detected based on their attribute representation,
without the need for a new training phase. In this paper, we also introduce a new data set, Animals with Attributes, of over 30,000
images of 50 animal classes, annotated with 85 semantic attributes. Extensive experiments on this and two more data sets show that
attribute-based classification indeed is able to categorize images without access to any training images of the target classes.

Index Terms—Object recognition, vision and scene understanding

1 INTRODUCTION

THE field of object recognition in natural images has made tremendous progress over the last decade. For specific object classes, in particular faces, pedestrians, and vehicles, reliable and efficient detectors are available, based on the combination of powerful low-level features, such as SIFT [1] or HoG [2], with modern machine learning techniques, such as support vector machines (SVMs) [3], [4] or boosting [5]. However, to achieve good classification accuracy, these systems require a lot of manually labeled training data, typically several thousand example images for each class to be learned.

While building recognition systems this way is feasible for categories of large common or commercial interest, one cannot expect it to solve object recognition for all natural categories. It has been estimated that humans distinguish between approximately 30,000 basic object categories [6], and many more subordinate ones, such as different breeds of dogs or different car models [7]. It has even been argued that there are infinitely many potentially relevant categorization tasks because humans can create new categories on the fly, for example, "things to bring to a camping trip" [8]. Training conventional object detectors for all these would require millions or billions of well-labeled training images and is likely out of reach for many years, if it is possible at all. Therefore, numerous techniques for reducing the number of necessary training images have been developed, some of which we will discuss in Section 3. However, all of these techniques still require at least some labeled training examples to detect future object instances.

Human learning works differently: Although humans can, of course, learn and generalize well from examples, they are also capable of identifying completely new classes when provided with a high-level description. For example, from the phrase "eight-sided red traffic sign with white writing," we will be able to detect stop signs, and when looking for "large gray animals with long trunks," we will reliably identify elephants. In this work, which extends our original publication [9], we build on this observation and propose a system that is able to classify objects from a list of high-level semantically meaningful properties that we call attributes. The attributes serve as an intermediate layer in a classifier cascade and they enable the system to recognize object classes for which it had not seen a single training example.

Clearly, a large number of potential attributes exist and collecting separate training material to learn an ordinary classifier for each of them would be as tedious as doing so for all object classes. Therefore, one of our main contributions in this work is to show how, instead of creating a separate training set for each attribute, we can exploit the fact that meaningful high-level concepts transcend class boundaries. To learn such attributes, we can make use of existing training data by merging images of several object classes. To learn, for example, the attribute striped, we can use images of zebras, bees, and tigers. For the attribute yellow, zebras would not be included, but bees and tigers would still prove useful, possibly together with canary birds. It is this possibility to obtain knowledge about attributes from different object classes and, vice versa, the fact that each attribute can be used for the detection of many object classes that makes our proposed learning method statistically efficient.

• C.H. Lampert is with the Institute of Science and Technology Austria, Am Campus 1, Klosterneuburg 3400, Austria. E-mail: [email protected].
• H. Nickisch is with Philips Research, Röntgenstrasse 24-26, 22335 Hamburg, Germany. E-mail: [email protected].
• S. Harmeling is with the Max Planck Institute for Intelligent Systems, 72076 Tübingen, Germany. E-mail: [email protected].

Manuscript received 5 Sept. 2012; revised 15 Mar. 2013; accepted 12 July 2013; published online 29 July 2013. Recommended for acceptance by D. Forsyth. Digital Object Identifier no. 10.1109/TPAMI.2013.140.

2 INFORMATION TRANSFER BY ATTRIBUTE SHARING


We begin by formalizing the problem and our intuition from the previous section that the use of attributes allows us to transfer information between object classes. We first define the exact situation of our interest:

Learning with disjoint training and test classes. Let $\mathcal{X}$ be an arbitrary feature space and let $\mathcal{Y} = \{y_1, \ldots, y_K\}$ and $\mathcal{Z} = \{z_1, \ldots, z_L\}$ be sets of object categories, also called classes. The task of learning with disjoint training and test classes is to construct a classifier $f : \mathcal{X} \to \mathcal{Z}$ by making use of training examples $(x_1, l_1), \ldots, (x_n, l_n) \in \mathcal{X} \times \mathcal{Y}$, even if $\mathcal{Y} \cap \mathcal{Z} = \emptyset$.¹

Fig. 2a illustrates graphically why this task cannot be solved by ordinary multiclass classification: Standard classifiers learn one parameter vector (or other representation) $\alpha_k$ for each training class $y_1, \ldots, y_K$. Because the classes $z_1, \ldots, z_L$ are not present during the training step, no parameter vector can be derived for them, and it is impossible to make predictions about these classes for future samples.

To make predictions about classes for which no training data are available, one needs to introduce a coupling between the classes in $\mathcal{Y}$ and $\mathcal{Z}$. Since no training data for the unobserved classes are available, this coupling cannot be learned from samples, but it has to be inserted into the system by human effort. Preferably, the amount of human effort to specify new classes should be small because otherwise collecting and labeling training samples might be a simpler solution.

2.1 Attribute-Based Classification
We propose a solution for learning with disjoint training and test classes by introducing a small set of high-level semantic attributes that can be specified either on a per-class or on a per-image level. While we currently have no formal definition of what should count as an attribute, in the rest of the manuscript, we rely on the following characterization.

Attributes. We call a property of an object an attribute if a human has the ability to decide whether the property is present or not for a certain object.²

Attributes are typically nameable properties, for example, the color of an object, or the presence or absence of a certain body part. Note that the definition allows properties that are not directly visible but related to visual information, such as an animal's natural habitat. Fig. 1 shows examples of classes and attributes.

Fig. 1. Examples from the Animals with Attributes data set: object classes with per-class attribute annotation.

An important distinction between attributes and arbitrary features is the aspect of semantics: Humans associate a meaning with a given attribute name. This allows them to create annotation directly in the form of attribute values, which can then be used by the computer. Ordinary image features, on the other hand, are typically computable, but they lack human interpretability.

It is possible to assign attributes on a per-image basis, or on a per-class basis. The latter is particularly helpful, since it allows the creation of attribute annotation for new classes with minimal effort. To make use of such attribute annotation, we propose attribute-based classification.

Attribute-based classification. Assume the situation of learning with disjoint training and test classes. If for each class $z \in \mathcal{Z}$ and $y \in \mathcal{Y}$ an attribute representation $a^z, a^y \in \mathcal{A}$ is available, then we can learn a nontrivial classifier $f : \mathcal{X} \to \mathcal{Z}$ by transferring information between $\mathcal{Y}$ and $\mathcal{Z}$ through $\mathcal{A}$.

In the rest of this paper, we will demonstrate that attribute-based classification indeed offers a solution to the problem of learning with disjoint training and test classes, and how it can be practically used for object classification. For this, we introduce and compare two generic methods to integrate attributes into multiclass classification.

Direct attribute prediction (DAP), illustrated in Fig. 2b, uses an in-between layer of attribute variables to decouple the images from the layer of labels. During training, the output class label of each sample induces a deterministic labeling of the attribute layer. Consequently, any supervised learning method can be used to learn per-attribute parameters $\beta_m$. At test time, these allow the prediction of attribute values for each test sample, from which the test class labels are inferred. Note that the classes during testing can differ from the classes used for training, as long as the coupling attribute layer is determined in a way that does not require a training phase.

Indirect attribute prediction (IAP), depicted in Fig. 2c, also uses the attributes to transfer knowledge between classes, but the attributes form a connecting layer between two layers of labels: one for classes that are known at training time and one for classes that are not. The training phase of IAP consists of learning a classifier for each training class, as it would be the case in ordinary multiclass classification. At test time, the predictions for all training classes induce a labeling of the attribute layer, from which a labeling over the test classes is inferred.

1. It is not necessary for $\mathcal{Y}$ and $\mathcal{Z}$ to be disjoint for the problems described to occur; $\mathcal{Z} \not\subseteq \mathcal{Y}$ is sufficient. However, for the sake of clarity, we only treat the case of disjoint class sets in this work.
2. In this manuscript, we only consider binary-valued attributes. More general forms of attributes have already appeared in the literature; see Section 3.
Fig. 2. Graphical representation of the proposed across-class learning task: Dark gray nodes are always observed; light gray nodes are observed only during training. White nodes are never observed but must be inferred. An ordinary, flat, multiclass classifier (left) learns one parameter set $\alpha_k$ for each training class. It cannot generalize to classes $(z_l)_{l=1,\ldots,L}$ that are not part of the training set. In an attribute-based classifier (middle) with fixed class-attribute relations (thick lines), training labels $(y_k)_{k=1,\ldots,K}$ imply training values for the attributes $(a_m)_{m=1,\ldots,M}$, from which parameters $\beta_m$ are learned. At test time, attribute values can directly be inferred, and these imply output class labels even for previously unseen classes. A multiclass-based attribute classifier (right) combines both ideas: Multiclass parameters $\alpha_k$ are learned for each training class. At test time, the posterior distribution of the training class labels induces a distribution over the labels of unseen classes by means of the class-attribute relationship.

The major difference between both approaches lies in the relationship between training classes and test classes. Directly learning the attributes results in a network where all classes are treated equally. When class labels are inferred at test time, the decision for all classes is based only on the attribute layer. We can expect it, therefore, to also handle the situation where training and test classes are not disjoint. In contrast, when predicting the attribute values indirectly, the training classes occur also at test time as an intermediate feature layer. On the one hand, this can introduce a bias, if training classes are also potential output classes during testing. On the other hand, one can argue that deriving the attribute layer from the label layer instead of from the samples will act as a regularization step that creates only sensible attribute combinations and, therefore, makes the system more robust. In the following, we will develop realizations for both methods and benchmark their performance.

2.2 A Probabilistic Realization
Both classification methods, DAP and IAP, are essentially metastrategies that can be realized by combining existing learning tools: a supervised classifier or regressor for the image-attribute or image-class prediction with a parameter-free inference method to channel the information through the attribute layer. In the following, we use a probabilistic model that reflects the graphical structures in Figs. 2b and 2c. For simplicity, we assume that all attributes have binary values such that the attribute representation $a = (a_1, \ldots, a_M)$ for any class is a fixed-length binary vector. Continuous attributes can, in principle, be handled in the same way by using regression instead of classification. A generalization to relative attributes [10] or variable-length descriptions should also be possible, but lies beyond the scope of this paper.

2.2.1 Direct Attribute Prediction
For DAP, we start by learning probabilistic classifiers for each attribute $a_m$. As training samples, we can use all images from all training classes; as labels, we use either per-image attribute annotations, if available, or we infer the labels from the entry of the attribute vector corresponding to the sample's label, i.e., all samples of class $y$ have the binary label $a^y_m$. The trained classifiers provide us with estimates of $p(a_m \mid x)$, from which we form a model for the complete image-attribute layer as $p(a \mid x) = \prod_{m=1}^{M} p(a_m \mid x)$.

At test time, we assume that every class $z$ induces its attribute vector $a^z$ in a deterministic way, i.e., $p(a \mid z) = [\![a = a^z]\!]$, where we have made use of Iverson's bracket notation [11]: $[\![P]\!] = 1$ if the condition $P$ is true and it is 0 otherwise. By applying Bayes' rule, we obtain $p(z \mid a) = \frac{p(z)}{p(a^z)} [\![a = a^z]\!]$ as the representation of the attribute-class layer. Combining both layers, we can calculate the posterior of a test class given an image:

$$p(z \mid x) = \sum_{a \in \{0,1\}^M} p(z \mid a)\, p(a \mid x) = \frac{p(z)}{p(a^z)} \prod_{m=1}^{M} p\left(a^z_m \mid x\right). \tag{1}$$

In the absence of more specific knowledge, we assume identical test class priors, which allows us to ignore the factor $p(z)$ in the following. For the factor $p(a)$, we assume a factorial distribution $p(a) = \prod_{m=1}^{M} p(a_m)$, using the empirical means $p(a_m) = \frac{1}{K} \sum_{k=1}^{K} a^{y_k}_m$ over the training classes as attribute priors.³ As decision rule $f : \mathcal{X} \to \mathcal{Z}$ that assigns the best output class from all test classes $z_1, \ldots, z_L$ to a test sample $x$, we then use MAP prediction:

$$f(x) = \operatorname*{argmax}_{l=1,\ldots,L}\, p(z_l \mid x) = \operatorname*{argmax}_{l=1,\ldots,L} \prod_{m=1}^{M} \frac{p\left(a^{z_l}_m \mid x\right)}{p\left(a^{z_l}_m\right)}. \tag{2}$$

3. In practice, the prior $p(a)$ is not crucial to the procedure, and setting $p(a_m) = \frac{1}{2}$ yields comparable results.

2.2.2 Indirect Attribute Prediction
To realize IAP, we only modify the image-attribute stage: As a first step, we learn a probabilistic multiclass classifier estimating $p(y_k \mid x)$ for each training class $y_k$, $k = 1, \ldots, K$. As for DAP, we assume a deterministic dependence between attributes and classes, setting $p(a_m \mid y) = [\![a_m = a^y_m]\!]$. The combination of both steps yields

$$p(a_m \mid x) = \sum_{k=1}^{K} p(a_m \mid y_k)\, p(y_k \mid x), \tag{3}$$

so in comparison to DAP, we only perform an additional matrix-vector multiplication after evaluating the classifiers. With the estimate of $p(a \mid x)$ obtained from (3), we proceed in the same way as for DAP, i.e., we classify test samples using (2).
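To make the two inference rules concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3). The array names and the log-space evaluation of the product are our own choices, not part of the original method description; the sketch assumes that the per-attribute probabilities (for DAP) or the multiclass posterior (for IAP) have already been estimated, e.g., by the classifiers described in Section 6.

```python
import numpy as np

def dap_predict(attr_prob, attr_test, attr_prior):
    """MAP prediction for DAP, following Eq. (2).

    attr_prob:  (M,) estimates p(a_m = 1 | x) for one test image
    attr_test:  (L, M) binary attribute vectors a^z of the test classes
    attr_prior: (M,) empirical attribute priors p(a_m = 1)
    Returns the index of the predicted test class.
    """
    # p(a_m = a^z_m | x): the estimate where a^z_m = 1, its complement otherwise
    p_x = np.where(attr_test == 1, attr_prob, 1.0 - attr_prob)
    p_a = np.where(attr_test == 1, attr_prior, 1.0 - attr_prior)
    # evaluate the product over attributes in log space for numerical stability
    scores = np.sum(np.log(p_x + 1e-12) - np.log(p_a + 1e-12), axis=1)
    return int(np.argmax(scores))

def iap_attr_prob(class_prob, attr_train):
    """IAP attribute estimate, following Eq. (3): one matrix-vector product.

    class_prob: (K,) multiclass posterior p(y_k | x) for one test image
    attr_train: (K, M) binary attribute vectors a^y of the training classes
    Returns (M,) estimates p(a_m = 1 | x).
    """
    return class_prob @ attr_train
```

For IAP, the vector returned by iap_attr_prob simply replaces the direct estimates in dap_predict, mirroring the statement in the text that IAP differs from DAP only by one additional matrix-vector multiplication.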
3 RELATED WORK

Multilayer or cascaded classifiers have a long tradition in pattern recognition and computer vision: multilayer perceptrons [12], decision trees [13], mixtures of experts [14], and boosting [15] are prominent examples of classification systems built as feed-forward architectures with several stages. Multiclass classifiers are also often constructed as layers of binary decisions from which the final output is inferred, for example, [16], [17]. These methods differ in their training methodologies, but they share the goal of decomposing a difficult classification problem into a collection of simpler ones. However, their emphasis lies on the classification performance in a fully supervised scenario, so the methods are not capable of generalizing across class boundaries.

Especially in the area of computer vision, multilayered classification systems have been constructed in which intermediate layers have interpretable properties: Artificial neural networks or deep belief networks have been shown to learn interpretable filters, but these are typically restricted to low-level properties, such as edge and corner detectors [18]. Popular local feature descriptors, such as SIFT [1] or HoG [2], can be seen as hand-crafted stages in a feed-forward architecture that transform an image from the pixel domain into a representation invariant to noninformative image variations. Similarly, image segmentation has been formulated as an unsupervised method to extract contours that are discriminative for object classes [19]. Such preprocessing steps are generic in the sense that they still allow the subsequent detection of arbitrary object classes. However, the basic elements, local image descriptors or segment shapes, alone are not reliable enough indicators of generic visual object classes, unless they are used as input to a subsequent statistical learning step.

On a higher level of abstraction, pictorial structures [20], the constellation model [21], and recent discriminatively trained deformable part models [22] are examples of the many methods that recognize objects in images by detecting discriminative parts. In principle, humans can give descriptions of object classes in terms of such parts, for example, arms or wheels. However, it is a difficult problem to build a system that learns to detect exactly the parts described. Instead, the above methods identify parts in an unsupervised way during training, which often reduces the parts to reproducible patterns of local feature points, not to units with a semantic meaning. In general, parts learned this way do not generalize across class boundaries.

3.1 Sharing Information between Classes
The aspect of sharing information between classes has attracted the attention of many researchers. A common idea is to construct multiclass classifiers in a cascaded way. By making similar classes share large parts of their decision paths, fewer classification functions need to be learned, thereby increasing the system's prediction speed [23]. Similarly, one can reduce the number of feature calculations by actively selecting low-level features that help discrimination for many classes simultaneously [24]. Combinations of both approaches are also possible [25].

In contrast, interclass transfer does not aim at higher speed, but at better generalization performance, typically for object classes with only few available training instances. From known object classes, one infers prior distributions over the expected intraclass variance in terms of distortions [26] or shapes and appearances [27]. Alternatively, features that are known to be discriminative for some classes can be reused and adapted to support the detection of new classes [28]. To our knowledge, no previous approach allows the direct incorporation of human prior knowledge. Also, all above methods require at least some training examples of the target classes and cannot handle completely new objects.

A notable exception is [29] that, like DAP and IAP, aims at classification with disjoint train and test sets. It assumes that each class has a description vector, which can be used to transfer between classes. However, because these description vectors do not necessarily have a semantic meaning, they cannot be obtained from human prior knowledge. Instead, an additional data source is needed to create them, for example, data samples in a different representation.

3.2 Predicting Semantic Attributes
A second relevant line of related work is the prediction of high-level semantic attributes for images. Prior work in the area of computer vision has mainly studied elementary properties, such as colors and geometric patterns [30], [31], [32], achieving high accuracy by developing task-specific features and representations. In the field of multimedia retrieval, similar tasks occur. For example, the TRECVID contest [33] contains a task of high-level feature extraction, which consists of predicting semantic concepts, in particular scene types, for example, outdoor, urban, and high-level actions, for example, sports. It has been shown that by combining searches for several such attributes, one can build more powerful retrieval mechanisms, for example, for faces [34], [35].

Instead of relying on manually defined attributes, it has recently been proposed to identify attributes automatically. Parikh and Grauman [36] introduced a semiautomatic technique for this that combines classifier outputs with human feedback. Sharmanska et al. [37] propose an unsupervised technique for augmenting existing attribute representations with additional nonsemantic binary features to make them more discriminative. It has also been shown that new attributes can be found by text mining [38], [39], [40], and that object classes themselves can act as attributes for other tasks [41]. Berg et al. [40] showed that instead of predicting only the presence or absence of an attribute, their occurrence can also be localized within the image.
Other alternative models for predicting attributes from images include conditional random fields [42] and probabilistic topic models [43]. Scheirer et al. [44] introduced an alternative technique for turning the output of attribute classifiers into probability estimates based on extremal value theory. The concept that attributes are properties of single images has also been generalized: Parikh and Grauman [10] introduced relative attributes, which encode a comparison between two images instead of specifying an absolute property, for example, is larger than instead of is large.

3.3 Other Uses of Semantic Attributes
In parallel to our original work [9], Farhadi et al. [45] introduced the concept of predicting high-level semantic attributes of objects with the objective of being able to describe objects, even if their class membership is unknown. Numerous follow-up papers have explored even more applications of attributes in computer vision tasks, for example, for scene classification [46], face verification [35], action recognition [47], and surveillance [48]. Rohrbach et al. [49] performed an in-depth analysis of attribute-based classification for transfer learning. Kulkarni et al. [50] used attribute predictions in combination with object detection and techniques from natural language processing to automatically create natural-language descriptions of images. Attributes have also been suggested as feedback mechanisms to improve image retrieval [51] and categorization [52].

3.4 Related Work Outside of Computer Science
In comparison to computer science, cognitive science research started much earlier to study the relations between object recognition and attributes. Typical questions in the field are how human judgements are influenced by characteristic object attributes [53], [54], and how human performance in object detection tasks depends on the presence or absence of object properties and contextual cues [55]. Since one of our goals is to integrate human knowledge into a computer vision task, we would like to benefit from the prior work in this field, at least as a source of high-quality data that, so far, cannot be obtained by an automatic process. In the following section, we describe a data set of animal images that allows us to leverage established class-attribute association data from the cognitive science research community.

TABLE 1: Animal Classes of the Animals with Attributes Data Set. The 40 classes of the first four columns are used for training; the 10 classes of the last column (in italics) are the test classes.

TABLE 2: Eighty-Five Semantic Attributes of the Animals with Attributes Data Set in Short Form. Longer forms given to human subjects for annotation were complete phrases, such as has flippers, eats plankton, or lives in water.

4 THE ANIMALS WITH ATTRIBUTES (AwA) DATA SET

In the early 1990s, Osherson et al. [56] collected judgements from human subjects on the "relative strength of association" between 85 semantic attributes and 48 mammals. Kemp et al. [57] later added two more classes and their attributes for a total of 50 × 85 class-attribute associations.⁴ The full list of classes and attributes can be found in Tables 1 and 2. Besides the original continuous-valued matrix, a binary version was also created by thresholding the original matrix at its overall mean value; see Fig. 3 for excerpts from both matrices. Note that because of the data collection process, the data are not completely error free. For example, contrary to what is specified in the binary matrix, panda bears do not have buck teeth, and walruses do have tails.

Our goal in creating the Animals with Attributes data set⁵ was to make this attribute matrix accessible for computer vision experiments. We collected images by querying the image search engines of Google, Microsoft, Yahoo, and Flickr for each of the 50 animal classes. We manually removed outliers and duplicates as well as images in which the target animals were not in prominent enough view to be recognizable. The remaining image set has 30,475 images, where the minimum number of images for any class is 92 (mole) and the maximum is 1,168 (collie). Fig. 1 shows exemplary images and their attribute annotation.

4. http://www.psy.cmu.edu/ckemp/code/irm.html.
5. http://www.ist.ac.at/~chl/AwA/.
To facilitate use by researchers from outside of computer vision, and to increase the reproducibility of results, we provide precomputed feature vectors for all images of the data set. The representations were chosen to reflect different aspects of the images (color, texture, shape) and to allow easy use with off-the-shelf classifiers: HSV color histograms, SIFT [1], rgSIFT [58], PHOG [59], SURF [60], and local self-similarity histograms [61]. The color histograms and PHOG feature vectors are extracted separately for all 21 cells of a three-level spatial pyramid (1×1, 2×2, 4×4). For each cell, 128-dimensional color histograms are extracted and concatenated to form a 2,688-dimensional feature vector. For PHOG, the same pyramid is used, but with 12-dimensional base histograms in each cell. The other feature vectors are bag-of-visual-words histograms, obtained by quantizing the original descriptors with 2,000-element codebooks that were obtained by k-means clustering on 250,000-element subsets of the descriptors.

Fig. 3. Real-valued (left) and binary-valued (right) class-attribute matrices of the Animals with Attributes data set. Shown are 13 × 33 excerpts of the complete 50 × 85 matrices.

We define a fixed split of the data set into 40 classes (24,295 images) to be used for training and 10 classes (6,180 images) to be used for testing; see Table 1. This split was not done randomly, but much of the diversity of the animals in the data set (water/land-based, wild/domestic, etc.) is reflected in the training as well as in the test set of classes. The assignments were based only on the class names and made before any experiments were performed, so in particular, the split was not designed for best zero-shot classification performance. Random train-test splits of similar characteristics can be created by fivefold cross validation (CV) over the classes.

5 OTHER DATA SETS FOR ATTRIBUTE-BASED CLASSIFICATION

Besides the Animals with Attributes data set, we also perform experiments on two other data sets of natural images for which attribute annotations have been released. We briefly summarize their characteristics here. An overview is also provided in Table 3.

5.1 aPascal-aYahoo
The aPascal-aYahoo data set⁶ was introduced by Farhadi et al. [45]. It consists of a 12,695-image subset of the PASCAL VOC 2008 data set⁷ and 2,644 images that were collected using the Yahoo image search engine. The PASCAL part serves as training data, and the Yahoo part as test data. Both sets have disjoint classes (20 classes for PASCAL, 12 for Yahoo), so learning with disjoint training and test classes is unavoidable. Attribute annotation is available on the image level: Each image has been annotated with 64 binary attributes that characterize shape, material, and the presence of important parts of the visible object. As image representation, we rely on the precomputed color, texture, edge orientation, and HoG features that the authors of [45] extracted from the objects' bounding boxes (as provided by the PASCAL VOC annotation) and released as part of the data set.

5.2 SUN Attributes
The SUN Attributes⁸ data set was introduced by Patterson and Hays [46]. It is a subset of the SUN Database [62] for fine-grained scene categorization and consists of 14,340 images from 717 classes (20 images per class). Each image is annotated with 102 binary attributes that describe the scenes' material and surface properties as well as lighting conditions, functions, affordances, and general image layout. For our experiments, we rely on the feature vectors that are provided by the authors of [46] as part of the data set. These consist of GIST, HOG, self-similarity, and geometric color histograms.

6 EXPERIMENTAL EVALUATION

In this section, we perform an experimental evaluation of the DAP and the IAP model on the Animals with Attributes data set as well as the other data sets described above. Since our goal is the categorization of classes for which no training samples are available, we always use training and test sets with disjoint class structure.

For DAP, we train one nonlinear support vector machine for each binary attribute, $a_1, \ldots, a_M$. In each case, we use 90 percent of the images of the training classes for training, with binary labels for the attribute, which are either obtained from the class-attribute matrix by assigning each image the attribute value of its class, or by per-image attribute annotation, where available. We use the remaining 10 percent of training images to estimate the parameters of a sigmoid curve for Platt scaling, to convert the SVM outputs into probability estimates [63].

At test time, we apply the trained SVMs with Platt scaling to each test image and make test class predictions using (2).

For IAP, we train one-versus-rest SVMs for each training class, again using a 90/10 percent split for training of the decision functions and of the sigmoid coefficients for Platt scaling. At test time, we predict a vector of class probabilities for each test image. We $L_1$-normalize this vector such that we can interpret it as a posterior distribution over the training classes. We then use (3) to predict attribute values, from which we obtain test class predictions by (2) as above.

6. http://vision.cs.uiuc.edu/attributes/.
7. http://www.pascal-network.org/challenges/VOC/.
8. http://cs.brown.edu/~gen/sunattributes.html.
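As an illustration of this training stage, here is a sketch using scikit-learn. The χ² kernel anticipates Section 6.1; gamma and C are placeholder values, and scikit-learn's probability=True fits the Platt sigmoid by internal cross validation rather than by the explicit 90/10 split described above, so this is an approximation of the procedure, not a reimplementation of it. For multiple feature channels, the per-feature kernel matrices would be summed before training, as described in Section 6.1.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_attribute_classifier(hists, attr_labels, gamma=1.0, C=10.0):
    """Train one probabilistic attribute classifier p(a_m | x) for DAP.

    hists:       (n, D) nonnegative histogram features of training images
    attr_labels: (n,) binary labels for attribute m, inherited from each
                 image's class via the class-attribute matrix
    """
    K = chi2_kernel(hists, gamma=gamma)       # exp(-gamma * chi2(h, h'))
    clf = SVC(C=C, kernel="precomputed", probability=True)
    clf.fit(K, attr_labels)                   # Platt sigmoid fitted internally
    return clf

def attribute_probabilities(clf, train_hists, test_hists, gamma=1.0):
    """Return p(a_m = 1 | x) for each test image."""
    K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)
    return clf.predict_proba(K_test)[:, 1]
```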
TABLE 3: Characteristics of Data Sets with Attribute Annotation: Animals with Attributes [9], aPascal/aYahoo (aP/aY) [45], SUN Attributes (SUN) [46].

6.1 SVM Kernels and Model Selection
To achieve optimal performance of the SVM classifiers, we use established kernel functions and perform thorough model selection. All SVMs are trained with linearly combined χ²-kernels: For any D-dimensional feature vectors, $h(x) \in \mathbb{R}^D$ and $h(\bar{x}) \in \mathbb{R}^D$, of images $x$ and $\bar{x}$, we set $k(x, \bar{x}) = \exp\left(-\gamma\, \chi^2(h(x), h(\bar{x}))\right)$ with $\chi^2(h, \bar{h}) = \sum_{i=1}^{D} \frac{(h_i - \bar{h}_i)^2}{h_i + \bar{h}_i}$. For DAP, the bandwidth parameter $\gamma$ is selected in the following way: For each attribute, we perform fivefold cross validation, computing the receiver operating characteristic (ROC) curve of each predictor and averaging the areas under the curves (AUCs) over the attributes. The result is a single mean attrAUC score for any value of the bandwidth. We perform this estimation for $\gamma = \eta / \bar{\chi}^2$ with $\eta \in \{0.01, 0.03, 0.1, 0.3, 1, 3, 10\}$, where $\bar{\chi}^2 = \frac{1}{n^2} \sum_{i,j=1}^{n} \chi^2(h(x_i), h(x_j))$, i.e., we parameterize $\gamma$ relative to the average χ²-distance of all points in the training set. $\eta = 3$ was consistently found to be the best value.

Given $L$ different feature functions, $h_1, \ldots, h_L$, we obtain $L$ kernel functions $k_1, \ldots, k_L$, and we use their unnormalized sum as the final SVM kernel, $k(x, \bar{x}) = \sum_{l=1}^{L} k_l(x, \bar{x})$. Once we have fixed the kernel, we identify the SVM's $C$ parameter among the values $\{0.01, 0.03, 0.1, \ldots, 30, 100, 300, 1{,}000\}$ in an analogous procedure. We perform fivefold cross validation for each attribute, and we pick the $C$ that achieves the highest mean attrAUC. Note that we use the same $C$ value for all attribute classifiers. Technically, this would not be necessary, but we prefer it to avoid large scaling differences between the SVM outputs of different attribute predictors. Also, one can expect the optimal $C$ values to not vary strongly between different attributes, because all classifiers use the same kernel matrix and differ only in their label annotation.

For IAP, we use the same kernel as for DAP and determine $C$ using fivefold cross validation similar to the procedure described above, except that we use the mean area under the ROC curve of class predictions (mean classAUC) as selection criterion.

Fig. 4. Confusion matrices between 10 test classes of the Animals with Attributes data set. Left: Indirect attribute prediction. Right: Direct attribute prediction.

6.2 Results
We use the above-described procedures to train DAP and IAP models for all data sets. For DAP, where applicable, we use both per-image and per-class annotation to find out whether the time-consuming per-image annotation is necessary. For the data sets with per-image attribute annotation, we create class-attribute matrices by averaging all attribute vectors of each class and thresholding the resulting real-valued matrix at its global mean value. Besides experiments with fixed train/test splits of classes, we also perform experiments with random class splits using fivefold cross validation for Animals with Attributes (i.e., 40 training classes, 10 test classes), and 10-fold cross validation for SUN Attributes (approximately 637 ± 1 classes for training and 70 ± 1 classes for testing). We measure the quality of the prediction steps in terms of normalized multiclass accuracy (MC acc.) on the test set (the mean of the diagonal of the confusion matrix). We also report areas under the ROC curve for each test class $z$ and attribute $a$, when their posterior probabilities $p(z \mid x)$ and $p(a \mid x)$, respectively, are treated as ranking measures over all test images.

In the following, we show detailed results for Animals with Attributes and summaries of the results for the other data sets.

6.2.1 Results—Animals with Attributes
The Animals with Attributes data set comes only with per-class annotation, so there are two models to compare: per-class DAP and per-class IAP. Fig. 4 shows the resulting confusion matrices for both methods. The class-normalized multiclass accuracy can be read off from the mean value of the diagonal as 41.4 percent for DAP and 42.2 percent for IAP. While the results are not as high as a supervised method could achieve, it nevertheless clearly proves our original claim about attribute-based classification: By sharing information via an attribute layer, it is possible to classify images of classes for which we had no training examples. As a baseline, we compare against a zero-shot classifier, where for each test class, we identify the most similar training class and predict using a classifier for it trained on all training data. We use two different methods to define the similarity between the classes' attribute representations: Hamming distance or cross correlation. As it turns out, both variants make almost identical decisions, resulting in multiclass accuracies of 30.7 and 30.8 percent. This is clearly better than chance performance, but below the results of DAP and IAP.
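A sketch of this class-transfer baseline follows, with hypothetical array names; the plain inner product below stands in for cross correlation of the binary attribute vectors.

```python
import numpy as np

def most_similar_training_class(attr_test, attr_train, method="hamming"):
    """For each test class, find the training class with the most similar
    binary attribute vector; its supervised classifier is then used to
    predict the test class.

    attr_test:  (L, M) attribute vectors of the test classes
    attr_train: (K, M) attribute vectors of the training classes
    Returns (L,) indices into the training classes.
    """
    if method == "hamming":
        # number of disagreeing attributes; smaller means more similar
        d = (attr_test[:, None, :] != attr_train[None, :, :]).sum(axis=2)
        return d.argmin(axis=1)
    else:
        # inner product as a stand-in for cross correlation; larger is better
        s = attr_test @ attr_train.T
        return s.argmax(axis=1)
```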
TABLE 4: Numeric Results on the Animals with Attributes Data Set in Percent: Multiclass Accuracy for DAP, IAP, Class-Transfer Classifier Using Cross Correlation (CT-cc) or Hamming Distance (CT-H) of Class Attributes, and Chance Performance (rnd).

Fig. 5. Retrieval performance of attribute-based classification (DAP method): ROC curves and area under curve for the 10 Animals with Attributes test classes.

Using random class splits instead of the predefined one, we obtain slightly lower multiclass accuracies of 34.8/44.8/34.7/35.1/36.3 percent (average 37.1 percent) for DAP, and 33.4/42.8/27.3/31.9/35.3 percent (average 34.1 percent) for IAP. Again, the baselines achieve clearly lower results: 32.4/31.9/28.1/25.3/20.9 percent (average 27.7 percent) for the cross-correlation version, and 33.0/29.0/28.4/25.3/20.9 percent (average 27.3 percent) for the version based on Hamming distance.

The quantitative results for all methods are summarized in Table 4. One can see that the differences between the two approaches, DAP and IAP, are relatively small. One might see a slight overall advantage for DAP, but as the variance between class splits is rather high, this could also be explained by random fluctuations. To avoid redundancy, we give detailed results only for the DAP model in the rest of this section.

Another measure of prediction performance besides multiclass accuracy is how well the predicted posterior probability of any of the test classes can be used to retrieve images of this class from the set of all test images. We evaluate this by plotting the corresponding ROC curves in Fig. 5 and report their AUC. One can see that for all classes, reasonable classifiers have been learned with AUCs clearly higher than the chance level of 0.5. With an AUC of 0.99, the performance for humpback whale is even on par with what we can expect to achieve with fully supervised learning techniques. Fig. 6 shows the five images with the highest posterior score for each test class, therefore allowing us to judge the quality of a hypothetical image retrieval system based on (1). One can see that the rankings for humpback whales, leopards, and hippopotamuses are very reliable. Confusions that occur are typically between animal classes with similar characteristics, such as a whale mistaken for a seal, or a racoon mistaken for a rat.

Because all classifiers base their decisions on the same learned attribute classifiers, one can presume that the easier classes are characterized either by more distinctive attribute vectors or by attributes that are easier to learn from visual data. We believe that the first explanation is not correct, since the matrix of pairwise distances between attribute vectors does not resemble the confusion matrices in Table 4. We, therefore, analyze the quality of the individual attribute predictors in more detail. Fig. 7 summarizes their quality in terms of the area under the ROC curve (attrAUC).

Fig. 6. Highest ranking results for each test class in the Animals with Attributes data set. Classes with unique characteristics are identified well, for example, humpback whales and leopards. Confusions occur between visually similar categories, for example, pigs and hippopotamuses.
Missing entries indicate that all images in the test set coincided in their value for this attribute, so no ROC curve can be computed. Fig. 8 shows, for a selection of attributes, the five images of highest posterior score within the test set.

Fig. 7. Quality of individual attribute predictors (trained on train classes, tested on test classes), as measured by the area under the ROC curve. Attributes without entries have constant values for all test classes, so their ROC curve cannot be computed.

On average, attributes can be predicted clearly better than random (the average AUC is 72.4 percent, whereas random prediction would have 50 percent). However, the variance within the predictions is large, ranging from near perfect prediction, for example, for is yellow and eats plankton, to essentially random performance, for example, on has buckteeth or is timid. Contrary to what one might expect, attributes that refer to visual properties are not automatically predicted more accurately than others. For example, is blue is identified reliably, but is brown is not. Overall good performance is also achieved on several attributes that describe body parts, such as has paws, or the natural habitat, such as lives in trees, and even on nonvisual properties, such as is smelly. There are two explanations for this effect: On the one hand, attributes that are clearly visual, such as colors, can still be hard to predict from a global image representation because they typically reflect information that is localized within only the object region. On the other hand, nonvisual attributes can often still be predicted from image information because they occur correlated with visual properties, for example, characteristic texture. It is known that the integration of such contextual information can improve the accuracy of visual classifiers, for example, road regions help the detection of cars. However, it remains to be seen if this effect will be sufficient for purely nonvisual attributes, or whether it would be better in the long run to replace nonvisual attributes by the visual counterparts they are correlated with.

Another interesting observation is that the system learned to correctly predict attributes such as is big and is small, which are ultimately defined only by context. While this is desirable in our setup, where the context is consistent, it also suggests that the learned attribute predictors themselves are context dependent and cannot be expected to generalize to object classes very different from the training classes.

Fig. 8. Highest ranking results for a selection of attribute predictors (see Section 6.2) learned by DAP on the Animals with Attributes data set.
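The attrAUC measurement underlying Fig. 7 can be sketched as follows; attr_probs and attr_true are hypothetical names for the predicted attribute posteriors and the class-derived ground truth, and scikit-learn's roc_auc_score performs the per-attribute ROC analysis.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_attribute_auc(attr_probs, attr_true):
    """AUC of each attribute predictor over the test images.

    attr_probs: (n, M) predicted p(a_m = 1 | x) for the test images
    attr_true:  (n, M) ground-truth binary attribute labels, derived from
                each test image's class via the class-attribute matrix
    Returns one AUC per attribute; None where the attribute takes the
    same value for all test classes, matching the missing bars in Fig. 7.
    """
    aucs = []
    for m in range(attr_true.shape[1]):
        labels = attr_true[:, m]
        if labels.min() == labels.max():   # constant labels: ROC undefined
            aucs.append(None)
        else:
            aucs.append(roc_auc_score(labels, attr_probs[:, m]))
    return aucs
```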
TABLE 5: Numeric Results on the aPascal/aYahoo and the SUN Attributes Data Sets in Percent: DAP with Per-Image Annotation (DAP-I), DAP with Per-Class Annotation (DAP-C), IAP, Class-Transfer Classifier (CT-H), and Chance Performance (rnd).

TABLE 6: Mean Mutual Information between Individual Attributes and Class Labels with Per-Class or Per-Image Annotation.

6.2.2 Results—Other Data Sets
We performed the same evaluation as for the Animals with Attributes data set also for the other data sets. Since these data sets have per-image attribute annotation in addition to per-class attribute annotation, we obtain results for two variants of DAP: trained with per-image labels and trained with per-class labels. In both cases, test time inference is done with per-class labels, since we still assume that no examples of the test classes are available. As an additional baseline, we use the class-transfer classifier as in Section 6.2.1. Since both variants perform almost identically, we report only results for the one based on Hamming distance. The results are summarized in Tables 5a and 5b.

For the SUN data set, we measure the classification performance based on the three-level SUN hierarchy suggested in [62]. At test time, the ground truth label and the predicted class label each correspond to one path in the hierarchy (or multiple paths, since the hierarchy is not a tree, but a directed acyclic graph). A prediction is considered correct at a certain level if both paths run through a common node in that level. At the third level, each class is a separate leaf, so level-3 accuracy is identical to the unnormalized multiclass accuracy, which coincides with the diagonal of the confusion matrix in this case, since all classes have the same number of images. However, at levels 1 and 2, semantically similar classes are mapped to the same node, and confusions between these classes are, therefore, disregarded. Note that the values obtained for the SUN data set are not directly comparable to earlier supervised work using these data. Because we split the data into disjoint train (90 percent) and test classes (10 percent), fewer classes of the data set are present at test time.

The results for both data sets confirm our observations from the Animals with Attributes data set. DAP (both variants) as well as IAP achieves far better than random performance in terms of multiclass accuracy, mean per-class AUC, and mean per-attribute AUC, and also better than the Hamming distance-based baseline classifier.

A more surprising observation is that per-image attribute annotation, as it is available for the aPascal and SUN Attributes data sets, does not improve the prediction accuracy compared to the per-class annotation, which is much easier to create. We currently do not have a definite explanation for this. However, two additional observations suggest that the reason might be a bias-variance effect: First, per-image attribute annotation does not follow class boundaries, so the mutual information of the ground truth attribute annotation with the class labels is lower than for per-class annotation (see Table 6). Second, the visual learning tasks defined by per-image annotation do not seem easier to learn than the per-class counterparts, as indicated by the reduced mean attribute AUC in Tables 4, 5a, and 5b. Likely this is because per-class annotation is correlated with many other visual properties in the images and therefore often easy to predict, whereas per-image annotation singles out the actual attribute in question.

In combination, we expect the per-image annotation to lead to less bias in the training problem, therefore having the potential for better attribute classifiers given enough data. However, because of the harder learning problem, the resulting classifiers have higher variance when trained on a fixed amount of data. We take the results as a sign that the second effect is currently the dominant source of errors. We plan to explore this hypothesis in future work by studying the learning curves of attribute learning with per-class and per-image annotation for varying amounts of training data.

There is also a second, more empirical, explanation: Per-class training of attribute classifiers resembles recent work on discriminatively learned image representations, such as classemes [64]. These have been found to work well for image categorization tasks, even for categories that are not part of the classemes set. A similar effect might hold for per-class trained attribute representations: Even if their interpretation as semantic image properties is not as straightforward as for classifiers trained with per-image annotation, they might simply lead to a good image representation.

6.2.3 Comparison to the Supervised Setup
Besides the relative comparison of the different methods to each other, we also try to highlight how DAP and IAP perform on an absolute scale. We, therefore, compare our method to ordinary multiclass classification with a small number of training examples. For each test class, we randomly pick a fixed number of training examples and use them to train a one-versus-rest multiclass SVM, which we evaluate using the remaining images of the test classes. The kernel function and parameters are the same as for the IAP model. Table 7 summarizes the results in the form of the mean over 10 such splits. For an easier comparison, we also repeat the range of values that zero-shot learning with DAP or IAP achieved (see Tables 4, 5a, and 5b).
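A sketch of this supervised baseline under stated assumptions: a single random split with a fixed seed (whereas the evaluation above averages over 10 such splits), placeholder gamma and C values, and scikit-learn's OneVsRestClassifier in place of the authors' own SVM setup.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def supervised_baseline(feats, labels, n_train, gamma=1.0, C=10.0, seed=0):
    """Few-shot supervised comparison of Section 6.2.3 (one split).

    feats:   (n, D) histogram features of all test-class images
    labels:  (n,) class labels
    n_train: number of randomly picked training examples per class
    Returns the class-normalized multiclass accuracy on held-out images.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), n_train, replace=False)
        for c in classes])
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)

    clf = OneVsRestClassifier(SVC(C=C, kernel="precomputed"))
    clf.fit(chi2_kernel(feats[train_idx], gamma=gamma), labels[train_idx])
    pred = clf.predict(
        chi2_kernel(feats[test_idx], feats[train_idx], gamma=gamma))

    # mean of per-class accuracies, i.e., the mean confusion-matrix diagonal
    true = labels[test_idx]
    return float(np.mean([np.mean(pred[true == c] == c) for c in classes]))
```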
TABLE 7: Numeric Results of One-versus-Rest Multiclass SVMs Trained with n ∈ {1, 2, 3, 4, 5, 10, 15, 20} Training Examples from Each Test Class in Comparison to the Results Achieved by Zero-Shot Learning with DAP and IAP (in Percent).

By comparing the last column to the others, one sees that on the Animals with Attributes data set, attribute-based classification achieves results on par with supervised training with 10-15 training examples per test class, i.e., 100-150 training images in total. On the aPascal data set, the attribute representations perform worse. Their results are comparable to supervised training with at most one example per class, if judged by multiclass accuracy, and two to three examples per class, if judged by mean classAUC. On the SUN data set, approximately two examples per class (142 in total) are necessary for equal mean class accuracy, and 5-10 examples per class (355 to 710 in total) for equal mean AUC. Note, however, that all the above comparisons may overestimate the power of the supervised classifiers: In a realistic setup with so few training examples, model selection is problematic, whereas to create Table 7, we just reused the parameters obtained by thorough model selection for the IAP model.

Interpreting the low performance on the aPascal-aYahoo data set, one has to take the background of this data set into account. Its attributes were selected to provide additional information about object classes, not to discriminate between them. While the resulting attribute set is comparably difficult to learn (see Table 5a), each attribute on average contains less information about the class labels (see Table 6), mainly because several of the attributes are meaningful only for a small subset of the categories. We conclude from this that attributes that are useful to describe objects from different categories are not automatically also useful to distinguish between the categories, a fact that should be taken into account in the future creation of attribute annotation for image data sets.

Overall, we do not think that the experiments we presented are sufficient to make a definite statement about the quality of attribute-based versus supervised classification. However, we believe that the results confirm the intuition that a larger ratio of attributes to classes improves the prediction performance. However, not only the number of attributes matters, but also how informative the chosen attributes are about the classes.

7 CONCLUSION

In this paper, we introduced learning with disjoint training and test classes. It formalizes the problem of learning an object classification system for classes for which no training images are available. We proposed two methods for attribute-based classification that solve this problem by transferring information between classes. In both cases, the transfer is achieved by an intermediate representation that consists of high-level semantic attributes that provide a fast and simple way to include human knowledge into the system. To predict the attribute level, we either rely on classifiers trained directly on attribute annotation (DAP), or we infer the attribute layer from classifiers trained to identify other classes (IAP). Once trained, the system can detect new object categories, if a suitable characterization in terms of attributes is available for them, and it does not require retraining.

As a second contribution, we introduced the Animals with Attributes data set: It consists of over 30,000 images with precomputed reference features for 50 animal classes, for which a semantic attribute annotation is available that has been used in earlier cognitive science work. We hope that this data set will foster research and serve as a testbed for attribute-based classification.

7.1 Open Questions and Future Work
Despite the promising results of the proposed system, several questions remain open and require future work. For example, the assumption of disjoint training and test classes is clearly artificial. It has been observed, for example, in [65], that existing methods, including DAP and IAP, do not work well if this assumption is violated, since their decisions become biased toward the previously seen classes. In the supervised scenario, methods to overcome this limitation have been suggested, for example, [66], [67], but a unified framework that includes the possibility of zero-shot learning is still missing.

A related open problem is how zero-shot learning can be unified with supervised learning when a small number of labeled training examples is available. While some work in this direction exists (see our discussion in Section 3), we believe that it will also be possible to extend DAP and IAP for this purpose. For example, one could make use of their probabilistic formulation to define an attribute-based prior that is combined with a likelihood term derived from the training examples.

Beyond the specific task of multiclass classification, there are many other open questions that will need to be tackled if we want to make true progress in solving the grand tasks of computer vision: How do we handle the problem that many object categories are rare? How can we build object recognition systems that adapt and incorporate new categories that they encounter? How can we integrate human knowledge about the visual world besides specifying training examples? We believe that attribute-based classification will be able to help in answering at least some of these questions.

ACKNOWLEDGMENTS

This work was in part funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no 308036. The authors would like to thank Charles Kemp for providing the Osherson/Wilkie class-attribute matrix and Jens Weidmann for his help on creating the Animals with Attributes data set.
REFERENCES

[1] D.G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," Int'l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[2] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[3] B. Schölkopf and A.J. Smola, Learning with Kernels. MIT Press, 2002.
[4] C.H. Lampert, "Kernel Methods in Computer Vision," Foundations and Trends in Computer Graphics and Vision, vol. 4, no. 3, pp. 193-285, 2009.
[5] R.E. Schapire and Y. Freund, Boosting: Foundations and Algorithms. MIT Press, 2012.
[6] I. Biederman, "Recognition by Components: A Theory of Human Image Understanding," Psychological Rev., vol. 94, no. 2, pp. 115-147, 1987.
[7] B. Yao, A. Khosla, and L. Fei-Fei, "Combining Randomization and Discrimination for Fine-Grained Image Categorization," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011.
[8] G.L. Murphy, The Big Book of Concepts. MIT Press, 2004.
[9] C.H. Lampert, H. Nickisch, and S. Harmeling, "Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[10] D. Parikh and K. Grauman, "Relative Attributes," Proc. IEEE Int'l Conf. Computer Vision (ICCV), 2011.
[11] D.E. Knuth, "Two Notes on Notation," Am. Math. Monthly, vol. 99, no. 5, pp. 403-422, 1992.
[12] D.E. Rumelhart, G.E. Hinton, and R.J. Williams, "Learning Internal Representations by Error Propagation," Parallel Distributed Processing, MIT Press, 1986.
[13] L. Breiman, J.J. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, 1984.
[14] M.I. Jordan and R.A. Jacobs, "Hierarchical Mixtures of Experts and the EM Algorithm," Neural Computation, vol. 6, no. 2, pp. 181-214, 1994.
[15] Y. Freund and R.E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," J. Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[16] T.G. Dietterich and G. Bakiri, "Solving Multiclass Learning Problems via Error-Correcting Output Codes," J. Artificial Intelligence Research, vol. 2, pp. 263-286, 1995.
[17] R. Rifkin and A. Klautau, "In Defense of One-vs-All Classification," J. Machine Learning Research, vol. 5, pp. 101-141, 2004.
[18] M. Ranzato, F.J. Huang, Y.-L. Boureau, and Y. LeCun, "Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[19] J. Winn and N. Jojic, "LOCUS: Learning Object Classes with Unsupervised Segmentation," Proc. IEEE Int'l Conf. Computer Vision (ICCV), vol. 1, 2005.
[20] M.A. Fischler and R.A. Elschlager, "The Representation and Matching of Pictorial Structures," IEEE Trans. Computers, vol. 22, no. 1, pp. 67-92, Jan. 1973.
[21] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2003.
[22] P.F. Felzenszwalb, D. McAllester, and D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[23] J.C. Platt, N. Cristianini, and J. Shawe-Taylor, "Large Margin DAGs for Multiclass Classification," Proc. Advances in Neural Information Processing Systems (NIPS), 1999.
[24] A. Torralba and K.P. Murphy, "Sharing Visual Features for Multiclass and Multiview Object Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 5, pp. 854-869, May 2007.
[25] P. Zehnder, E.K. Meier, and L.J.V. Gool, "An Efficient Shared Multi-Class Detection Cascade," Proc. British Machine Vision Conf. (BMVC), 2008.
[26] E. Miller, N. Matsakis, and P. Viola, "Learning from One Example through Shared Densities on Transforms," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2000.
[27] F.F. Li, R. Fergus, and P. Perona, "One-Shot Learning of Object Categories," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 594-611, Apr. 2006.
[28] E. Bart and S. Ullman, "Cross-Generalization: Learning Novel Classes from a Single Example by Feature Replacement," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[29] H. Larochelle, D. Erhan, and Y. Bengio, "Zero-Data Learning of New Tasks," Proc. 23rd Nat'l Conf. Artificial Intelligence, vol. 1, no. 2, pp. 646-651, 2008.
[30] K. Yanai and K. Barnard, "Image Region Entropy: A Measure of Visualness of Web Images Associated with One Concept," Proc. 13th Ann. ACM Int'l Conf. Multimedia, pp. 419-422, 2005.
[31] J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning Color Names for Real-World Applications," IEEE Trans. Image Processing, vol. 18, no. 7, pp. 1512-1523, July 2009.
[32] V. Ferrari and A. Zisserman, "Learning Visual Attributes," Proc. Advances in Neural Information Processing Systems (NIPS), 2008.
[33] A.F. Smeaton, P. Over, and W. Kraaij, "Evaluation Campaigns and TRECVid," Proc. Eighth ACM Int'l Workshop Multimedia Information Retrieval, 2006.
[34] N. Kumar, P.N. Belhumeur, and S.K. Nayar, "FaceTracer: A Search Engine for Large Collections of Images with Faces," Proc. European Conf. Computer Vision (ECCV), 2008.
[35] N. Kumar, A. Berg, P. Belhumeur, and S. Nayar, "Describable Visual Attributes for Face Verification and Image Search," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 33, no. 10, pp. 1962-1977, Oct. 2011.
[36] D. Parikh and K. Grauman, "Interactively Building a Discriminative Vocabulary of Nameable Attributes," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1681-1688, 2011.
[37] V. Sharmanska, N. Quadrianto, and C.H. Lampert, "Augmented Attribute Representations," Proc. European Conf. Computer Vision (ECCV), 2012.
[38] K. Yanai and K. Barnard, "Image Region Entropy: A Measure of 'Visualness' of Web Images Associated with One Concept," Proc. 13th Ann. ACM Int'l Conf. Multimedia, 2005.
[39] J. Wang, K. Markert, and M. Everingham, "Learning Models for Object Recognition from Natural Language Descriptions," Proc. British Machine Vision Conf. (BMVC), 2009.
[40] T.L. Berg, A.C. Berg, and J. Shih, "Automatic Attribute Discovery and Characterization from Noisy Web Images," Proc. European Conf. Computer Vision (ECCV), 2010.
[41] L.J. Li, H. Su, Y. Lim, and L. Fei-Fei, "Objects as Attributes for Scene Classification," Proc. First Int'l Workshop Parts and Attributes at European Conf. Computer Vision, 2010.
[42] Y. Wang and G. Mori, "A Discriminative Latent Model of Object Classes and Attributes," Proc. European Conf. Computer Vision (ECCV), pp. 155-168, 2010.
[43] X. Yu and Y. Aloimonos, "Attribute-Based Transfer Learning for Object Categorization with Zero/One Training Example," Proc. European Conf. Computer Vision (ECCV), pp. 127-140, 2010.
[44] W.J. Scheirer, N. Kumar, P.N. Belhumeur, and T.E. Boult, "Multi-Attribute Spaces: Calibration for Attribute Fusion and Similarity Search," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[45] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth, "Describing Objects by their Attributes," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[46] G. Patterson and J. Hays, "SUN Attribute Database: Discovering, Annotating, and Recognizing Scene Attributes," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[47] J. Liu, B. Kuipers, and S. Savarese, "Recognizing Human Actions by Attributes," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011.
[48] R. Feris, B. Siddiquie, Y. Zhai, J. Petterson, L. Brown, and S. Pankanti, "Attribute-Based Vehicle Search in Crowded Surveillance Videos," Proc. ACM Int'l Conf. Multimedia Retrieval (ICMR), article 18, 2011.
[49] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele, "What Helps Where—and Why? Semantic Relatedness for Knowledge Transfer," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[50] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A.C. Berg, and T.L. Berg, "Baby Talk: Understanding and Generating Simple Image Descriptions," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 1601-1608, 2011.
[51] A. Kovashka, D. Parikh, and K. Grauman, "Whittlesearch: Image Search with Relative Attribute Feedback," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.
[52] A. Parkash and D. Parikh, "Attributes for Classifier Feedback," Proc. European Conf. Computer Vision (ECCV), 2012.
[53] D. Osherson, E.E. Smith, T.S. Myers, E. Shafir, and M. Stob, "Extrapolating Human Probability Judgment," Theory and Decision, vol. 36, no. 2, pp. 103-129, 1994.
[54] S.A. Sloman, "Feature-Based Induction," Cognitive Psychology, vol. 25, pp. 231-280, 1993.
[55] T. Hansen, M. Olkkonen, S. Walter, and K.R. Gegenfurtner, "Memory Modulates Color Appearance," Nature Neuroscience, vol. 9, pp. 1367-1368, 2006.
[56] D.N. Osherson, J. Stern, O. Wilkie, M. Stob, and E.E. Smith, "Default Probability," Cognitive Science, vol. 15, no. 2, pp. 251-269, 1991.
[57] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda, "Learning Systems of Concepts with an Infinite Relational Model," Proc. Nat'l Conf. Artificial Intelligence (AAAI), 2006.
[58] K.E.A. van de Sande, T. Gevers, and C.G.M. Snoek, "Evaluation of Color Descriptors for Object and Scene Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[59] A. Bosch, A. Zisserman, and X. Muñoz, "Representing Shape with a Spatial Pyramid Kernel," Proc. Int'l Conf. Content-Based Image and Video Retrieval (CIVR), 2007.
[60] H. Bay, A. Ess, T. Tuytelaars, and L.J.V. Gool, "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[61] E. Shechtman and M. Irani, "Matching Local Self-Similarities across Images and Videos," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[62] J. Xiao, J. Hays, K.A. Ehinger, A. Oliva, and A. Torralba, "SUN Database: Large-Scale Scene Recognition from Abbey to Zoo," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3485-3492, 2010.
[63] J.C. Platt, "Probabilities for SV Machines," Advances in Large Margin Classifiers. MIT Press, 2000.
[64] L. Torresani, M. Szummer, and A. Fitzgibbon, "Efficient Object Category Recognition Using Classemes," Proc. European Conf. Computer Vision (ECCV), pp. 776-789, Sept. 2010.
[65] K.D. Tang, M.F. Tappen, R. Sukthankar, and C.H. Lampert, "Optimizing One-Shot Recognition with Micro-Set Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[66] W.J. Scheirer, A. Rocha, A. Sapkota, and T.E. Boult, "Toward Open Set Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757-1772, July 2013.
[67] T. Tommasi, N. Quadrianto, B. Caputo, and C.H. Lampert, "Beyond Data Set Bias: Multi-Task Unaligned Shared Knowledge Transfer," Proc. Asian Conf. Computer Vision (ACCV), 2012.
Christoph H. Lampert received the PhD degree in mathematics from the University of Bonn in 2003. He was a senior researcher at the German Research Center for Artificial Intelligence in Kaiserslautern and a senior research scientist at the Max Planck Institute for Biological Cybernetics in Tübingen. He is currently an assistant professor at the Institute of Science and Technology Austria, where he heads a research group for computer vision and machine learning. He has received several international and national awards for his research, including the Best Paper Prize of CVPR 2008 and the Best Student Paper Award of ECCV 2008. In 2012, he was awarded an ERC Starting Grant by the European Research Council. He is an associate editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence and an action editor for the Journal of Machine Learning Research.

Hannes Nickisch received degrees from the Université de Nantes, France, in 2004, and the Technical University Berlin, Germany, in 2006. During his PhD work at the Max Planck Institute for Biological Cybernetics, Tübingen, Germany, he worked on large-scale approximate Bayesian inference and magnetic resonance image reconstruction. Since 2011, he has been with Philips Research, Hamburg, Germany. His research interests include medical image processing, machine learning, and biophysical modeling.

Stefan Harmeling studied mathematics and logic at the University of Münster (Dipl. Math. 1998) and computer science with an emphasis on artificial intelligence at Stanford University (MSc 2000). During his doctoral studies, he was a member of Prof. Klaus-Robert Müller's research group at the Fraunhofer Institute FIRST (Dr. rer. nat. 2004). Thereafter, he was a Marie Curie fellow at the University of Edinburgh from 2005 to 2007, before joining the Max Planck Institute for Biological Cybernetics/Intelligent Systems. He is currently a senior research scientist in Prof. Bernhard Schölkopf's Department of Empirical Inference at the Max Planck Institute for Intelligent Systems (formerly Biological Cybernetics). His research interests include machine learning, image processing, computational photography, and probabilistic inference. In 2011, he received the DAGM Paper Prize, and in 2012 the Günter Petzow Prize for outstanding work at the Max Planck Institute for Intelligent Systems.