Attribute-Based Classification For Zero-Shot Visual Object Categorization
Abstract—We study the problem of object recognition for categories for which we have no training examples, a task also called
zero-data or zero-shot learning. This situation has hardly been studied in computer vision research, even though it occurs frequently;
the world contains tens of thousands of different object classes, and image collections have been formed and suitably annotated for
only a few of them. To tackle the problem, we introduce attribute-based classification: Objects are identified based on a high-level
description that is phrased in terms of semantic attributes, such as the object’s color or shape. Because the identification of each such
property transcends the specific learning task at hand, the attribute classifiers can be prelearned independently, for example, from
existing image data sets unrelated to the current task. Afterward, new classes can be detected based on their attribute representation,
without the need for a new training phase. In this paper, we also introduce a new data set, Animals with Attributes, of over 30,000
images of 50 animal classes, annotated with 85 semantic attributes. Extensive experiments on this and two more data sets show that
attribute-based classification indeed is able to categorize images without access to any training images of the target classes.
1 INTRODUCTION
Fig. 2. Graphical representation of the proposed across-class learning task: Dark gray nodes are always observed; light gray nodes are observed only during training. White nodes are never observed but must be inferred. An ordinary, flat, multiclass classifier (left) learns one parameter set $\theta_k$ for each training class. It cannot generalize to classes $(z_l)_{l=1,\dots,L}$ that are not part of the training set. In an attribute-based classifier (middle) with fixed class-attribute relations (thick lines), training labels $(y_k)_{k=1,\dots,K}$ imply training values for the attributes $(a_m)_{m=1,\dots,M}$, from which parameters $\beta_m$ are learned. At test time, attribute values can directly be inferred, and these imply output class labels even for previously unseen classes. A multiclass-based attribute classifier (right) combines both ideas: Multiclass parameters $\theta_k$ are learned for each training class. At test time, the posterior distribution of the training class labels induces a distribution over the labels of unseen classes by means of the class-attribute relationship.
The major difference between both approaches lies in the relationship between training classes and test classes. Directly learning the attributes results in a network where all classes are treated equally. When class labels are inferred at test time, the decision for all classes is based only on the attribute layer. We can therefore expect it to also handle the situation where training and test classes are not disjoint. In contrast, when predicting the attribute values indirectly, the training classes occur also at test time as an intermediate feature layer. On the one hand, this can introduce a bias if training classes are also potential output classes during testing. On the other hand, one can argue that deriving the attribute layer from the label layer instead of from the samples will act as a regularization step that creates only sensible attribute combinations and, therefore, makes the system more robust. In the following, we will develop realizations for both methods and benchmark their performance.

To train the attribute classifiers, we use per-image attribute annotations, if available, or we infer the labels from the entry of the attribute vector corresponding to the sample's label, i.e., all samples of class $y$ have the binary label $a_m^y$. The trained classifiers provide us with estimates of $p(a_m \mid x)$, from which we form a model for the complete image-attribute layer as $p(a \mid x) = \prod_{m=1}^{M} p(a_m \mid x)$. At test time, we assume that every class $z$ induces its attribute vector $a^z$ in a deterministic way, i.e., $p(a \mid z) = [\![a = a^z]\!]$, where we have made use of Iverson's bracket notation [11]: $[\![P]\!] = 1$ if the condition $P$ is true and it is 0 otherwise. By applying Bayes' rule, we obtain $p(z \mid a) = \frac{p(z)}{p(a^z)} [\![a = a^z]\!]$ as the representation of the attribute-class layer. Combining both layers, we can calculate the posterior of a test class given an image:

$$p(z \mid x) = \sum_{a \in \{0,1\}^M} p(z \mid a)\, p(a \mid x) = \frac{p(z)}{p(a^z)} \prod_{m=1}^{M} p\big(a_m^z \mid x\big). \qquad (1)$$
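To make (1) concrete, here is a minimal NumPy sketch of the DAP decision step (our illustration under the stated assumptions, not the authors' released code). It assumes $M$ attribute posteriors from the trained classifiers, a binary class-attribute matrix for the unseen classes, and a factorizing attribute prior; the final argmax corresponds to the MAP prediction rule (2) referenced below, with a uniform prior over unseen classes.

```python
import numpy as np

def dap_log_scores(p_attr, A_test, p_prior):
    """Log of the unnormalized posterior (1) for every unseen class.

    p_attr:  (M,)   attribute posteriors p(a_m = 1 | x) for one image
    A_test:  (L, M) binary attribute signature a^z of each unseen class z
    p_prior: (M,)   attribute priors p(a_m = 1), e.g., empirical class means
    """
    # p(a_m^z | x): the classifier output if class z has the attribute, else its complement
    p_match = np.where(A_test == 1, p_attr, 1.0 - p_attr)        # (L, M)
    prior_match = np.where(A_test == 1, p_prior, 1.0 - p_prior)  # factorized p(a^z)
    eps = 1e-12                                                  # numerical safety
    return (np.log(p_match + eps) - np.log(prior_match + eps)).sum(axis=1)

# MAP decision, cf. rule (2): z_hat = dap_log_scores(p_attr, A_test, p_prior).argmax()
```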
For IAP, we instead assume a deterministic dependence between attributes and classes, setting $p(a_m \mid y) = [\![a_m = a_m^y]\!]$. The combination of both steps yields

$$p(a_m \mid x) = \sum_{k=1}^{K} p(a_m \mid y_k)\, p(y_k \mid x), \qquad (3)$$

so in comparison to DAP, we only perform an additional matrix-vector multiplication after evaluating the classifiers. With the estimate of $p(a \mid x)$ obtained from (3), we proceed in the same way as for DAP, i.e., we classify test samples using (2).
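In code, (3) is a single matrix-vector product between the training classes' binary attribute signatures and the multiclass posterior; a sketch in the same conventions as above (names hypothetical):

```python
import numpy as np

def iap_attribute_posterior(p_class, A_train):
    """Attribute posteriors via (3): p(a_m = 1 | x) = sum_k [[a_m^{y_k} = 1]] p(y_k | x).

    p_class: (K,)   L1-normalized multiclass posterior p(y_k | x)
    A_train: (K, M) binary attribute signatures a^{y_k} of the training classes
    """
    return A_train.T @ p_class  # (M,), then classify exactly as in DAP
```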
3 RELATED WORK

Multilayer or cascaded classifiers have a long tradition in pattern recognition and computer vision: multilayer perceptrons [12], decision trees [13], mixtures of experts [14], and boosting [15] are prominent examples of classification systems built as feed-forward architectures with several stages. Multiclass classifiers are also often constructed as layers of binary decisions from which the final output is inferred, for example, [16], [17]. These methods differ in their training methodologies, but they share the goal of decomposing a difficult classification problem into a collection of simpler ones. However, their emphasis lies on the classification performance in a fully supervised scenario, so the methods are not capable of generalizing across class boundaries.

Especially in the area of computer vision, multilayered classification systems have been constructed in which intermediate layers have interpretable properties: Artificial neural networks or deep belief networks have been shown to learn interpretable filters, but these are typically restricted to low-level properties, such as edge and corner detectors [18]. Popular local feature descriptors, such as SIFT [1] or HoG [2], can be seen as hand-crafted stages in a feed-forward architecture that transform an image from the pixel domain into a representation invariant to noninformative image variations. Similarly, image segmentation has been formulated as an unsupervised method to extract contours that are discriminative for object classes [19]. Such preprocessing steps are generic in the sense that they still allow the subsequent detection of arbitrary object classes. However, the basic elements, local image descriptors or segment shapes, alone are not reliable enough indicators of generic visual object classes, unless they are used as input to a subsequent statistical learning step.

On a higher level of abstraction, pictorial structures [20], the constellation model [21], and recent discriminatively trained deformable part models [22] are examples of the many methods that recognize objects in images by detecting discriminative parts. In principle, humans can give descriptions of object classes in terms of such parts, for example, arms or wheels. However, it is a difficult problem to build a system that learns to detect exactly the parts described. Instead, the above methods identify parts in an unsupervised way during training, which often reduces the parts to reproducible patterns of local feature points, not to units with a semantic meaning. In general, parts learned this way do not generalize across class boundaries.

3.1 Sharing Information between Classes

The aspect of sharing information between classes has attracted the attention of many researchers. A common idea is to construct multiclass classifiers in a cascaded way. By making similar classes share large parts of their decision paths, fewer classification functions need to be learned, thereby increasing the system's prediction speed [23]. Similarly, one can reduce the number of feature calculations by actively selecting low-level features that help discrimination for many classes simultaneously [24]. Combinations of both approaches are also possible [25].

In contrast, interclass transfer does not aim at higher speed, but at better generalization performance, typically for object classes with only few available training instances. From known object classes, one infers prior distributions over the expected intraclass variance in terms of distortions [26] or shapes and appearances [27]. Alternatively, features that are known to be discriminative for some classes can be reused and adapted to support the detection of new classes [28]. To our knowledge, no previous approach allows the direct incorporation of human prior knowledge. Also, all above methods require at least some training examples of the target classes and cannot handle completely new objects.

A notable exception is [29] that, like DAP and IAP, aims at classification with disjoint training and test sets. It assumes that each class has a description vector, which can be used to transfer between classes. However, because these description vectors do not necessarily have a semantic meaning, they cannot be obtained from human prior knowledge. Instead, an additional data source is needed to create them, for example, data samples in a different representation.

3.2 Predicting Semantic Attributes

A second relevant line of related work is the prediction of high-level semantic attributes for images. Prior work in the area of computer vision has mainly studied elementary properties, such as colors and geometric patterns [30], [31], [32], achieving high accuracy by developing task-specific features and representations. In the field of multimedia retrieval, similar tasks occur. For example, the TRECVID contest [33] contains a task of high-level feature extraction, which consists of predicting semantic concepts, in particular scene types, for example, outdoor, urban, and high-level actions, for example, sports. It has been shown that by combining searches for several such attributes, one can build more powerful retrieval mechanisms, for example, for faces [34], [35].

Instead of relying on manually defined attributes, it has recently been proposed to identify attributes automatically. Parikh and Grauman [36] introduced a semiautomatic technique for this that combines classifier outputs with human feedback. Sharmanska et al. [37] propose an unsupervised technique for augmenting existing attribute representations with additional nonsemantic binary features to make them more discriminative. It has also been shown that new attributes can be found by text mining [38], [39], [40], and that object classes themselves can act as attributes for other tasks [41]. Berg et al. [40] showed that instead of predicting only the presence or absence of an attribute, its occurrence can also be localized within the image.
TABLE 1
Animal Classes of the Animals with Attributes Data Set. The 40 classes of the first four columns are used for training; the 10 classes of the last column (in italics) are the test classes.

TABLE 2
Eighty-Five Semantic Attributes of the Animals with Attributes Data Set in Short Form

Fig. 3. Real-valued (left) and binary-valued (right) class-attribute matrices of the Animals with Attributes data set. Shown are 13 × 33 excerpts of the complete 50 × 85 matrices.
base histograms in each cell. The other feature vectors are each bag-of-visual-words histograms, obtained by quantizing the original descriptors with 2,000-element codebooks that were created by k-means clustering on 250,000-element subsets of the descriptors.

We define a fixed split of the data set into 40 classes (24,295 images) to be used for training and 10 classes (6,180 images) to be used for testing; see Table 1. This split was not done randomly; rather, much of the diversity of the animals in the data set (water/land-based, wild/domestic, etc.) is reflected in the training as well as in the test set of classes. The assignments were based only on the class names and made before any experiments were performed, so in particular, the split was not designed for best zero-shot classification performance. Random train-test splits of similar characteristics can be created by fivefold cross validation (CV) over the classes.
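As an illustration of the codebook construction described above, a minimal sketch; the 2,000-word codebook and the 250,000-element descriptor subset follow the text, while MiniBatchKMeans and all function names are our own stand-ins, not the authors' original tooling:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptors, n_words=2000, subset_size=250_000, seed=0):
    """Cluster a random subset of local descriptors into a visual-word codebook."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=min(subset_size, len(descriptors)), replace=False)
    return MiniBatchKMeans(n_clusters=n_words, random_state=seed).fit(descriptors[idx])

def bovw_histogram(codebook, image_descriptors):
    """Quantize one image's descriptors and return its normalized word histogram."""
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```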
5 OTHER DATA SETS FOR ATTRIBUTE-BASED CLASSIFICATION

Besides the Animals with Attributes data set, we also perform experiments on two other data sets of natural images for which attribute annotations have been released. We briefly summarize their characteristics here. An overview is also provided in Table 3.

5.1 aPascal-aYahoo

The aPascal-aYahoo data set (http://vision.cs.uiuc.edu/attributes/) was introduced by Farhadi et al. [45]. It consists of a 12,695-image subset of the PASCAL VOC 2008 data set (http://www.pascal-network.org/challenges/VOC/) and 2,644 images that were collected using the Yahoo image search engine. The PASCAL part serves as training data and the Yahoo part as test data. Both sets have disjoint classes (20 classes for PASCAL, 12 for Yahoo), so learning with disjoint training and test classes is unavoidable. Attribute annotation is available on the image level: Each image has been annotated with 64 binary attributes that characterize shape, material, and the presence of important parts of the visible object. As image representation, we rely on the precomputed color, texture, edge orientation, and HoG features that the authors of [45] extracted from the objects' bounding boxes (as provided by the PASCAL VOC annotation) and released as part of the data set.

5.2 SUN Attributes

The SUN Attributes data set (http://cs.brown.edu/~gen/sunattributes.html) was introduced by Patterson and Hays [46]. It is a subset of the SUN Database [62] for fine-grained scene categorization and consists of 14,340 images from 717 classes (20 images per class). Each image is annotated with 102 binary attributes that describe the scenes' material and surface properties as well as lighting conditions, functions, affordances, and general image layout. For our experiments, we rely on the feature vectors that are provided by the authors of [46] as part of the data set. These consist of GIST, HOG, self-similarity, and geometric color histograms.

6 EXPERIMENTAL EVALUATION

In this section, we perform an experimental evaluation of the DAP and the IAP model on the Animals with Attributes data set as well as the other data sets described above. Since our goal is the categorization of classes for which no training samples are available, we always use training and test sets with disjoint class structure.

For DAP, we train one nonlinear support vector machine for each binary attribute, $a_1, \dots, a_M$. In each case, we use 90 percent of the images of the training classes for training, with binary labels for the attribute, which are either obtained from the class-attribute matrix by assigning each image the attribute value of its class, or by per-image attribute annotation, where available. We use the remaining 10 percent of training images to estimate the parameters of a sigmoid curve for Platt scaling, to convert the SVM outputs into probability estimates [63].

At test time, we apply the trained SVMs with Platt scaling to each test image and make test class predictions using (2).

For IAP, we train one-versus-rest SVMs for each training class, again using a 90/10 percent split for training of the decision functions and of the sigmoid coefficients for Platt scaling. At test time, we predict a vector of class probabilities for each test image. We $L_1$-normalize this vector such that we can interpret it as a posterior distribution over the training classes. We then use (3) to predict attribute values, from which we obtain test class predictions by (2) as above.
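The per-attribute training procedure can be sketched as follows (a simplification: an RBF kernel stands in for the combined χ²-kernels of Section 6.1, and the sigmoid fit stands in for Platt scaling [63]; names and defaults are ours):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def train_attribute_svm(X, y, C=1.0):
    """Train one attribute SVM on 90 percent of the data and fit a Platt
    sigmoid p(a = 1 | x) = 1 / (1 + exp(A * f(x) + B)) on the held-out 10 percent."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
    svm = SVC(C=C, kernel="rbf").fit(X_tr, y_tr)   # stand-in for the chi^2 kernel
    f_val = svm.decision_function(X_val)

    def neg_log_lik(ab):                           # maximum likelihood sigmoid fit
        p = 1.0 / (1.0 + np.exp(ab[0] * f_val + ab[1]))
        p = np.clip(p, 1e-9, 1.0 - 1e-9)
        return -np.sum(y_val * np.log(p) + (1 - y_val) * np.log(1.0 - p))

    A, B = minimize(neg_log_lik, x0=np.array([-1.0, 0.0])).x
    return svm, (A, B)   # p(a=1|x) = 1 / (1 + exp(A * svm.decision_function(x) + B))
```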
TABLE 3
Characteristics of Data Sets with Attribute Annotation:
Animals with Attributes [9], aPascal/aYahoo (aP/aY) [45],
SUN Attributes (SUN) [46]
6.1 SVM Kernels and Model Selection

To achieve optimal performance of the SVM classifiers, we use established kernel functions and perform thorough model selection. All SVMs are trained with linearly combined $\chi^2$-kernels: For any $D$-dimensional feature vectors $h(x) \in \mathbb{R}^D$ and $h(\bar{x}) \in \mathbb{R}^D$ of images $x$ and $\bar{x}$, we set $k(x, \bar{x}) = \exp\big(-\gamma\, \chi^2(h(x), h(\bar{x}))\big)$ with $\chi^2(h, \bar{h}) = \sum_{i=1}^{D} \frac{(h_i - \bar{h}_i)^2}{h_i + \bar{h}_i}$. For DAP, the bandwidth parameter $\gamma$ is selected in the following way: For each attribute, we perform fivefold cross validation, computing the receiver operating characteristic (ROC) curve of each predictor and averaging the areas under the curves (AUCs) over the attributes. The result is a single mean attrAUC score for any value of the bandwidth. We perform this estimation for $\gamma = c\,\gamma_0$ with $c \in \{0.01, 0.03, 0.1, 0.3, 1, 3, 10\}$, where $\gamma_0 = \big(\frac{1}{n^2}\sum_{i,j=1}^{n} \chi^2(h(x_i), h(x_j))\big)^{-1}$, i.e., we parameterize $\gamma$ relative to the average $\chi^2$-distance of all points in the training set. $c = 3$ was consistently found to be the best value.
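In code, the kernel and the bandwidth heuristic read as follows (a sketch; the broadcasting form assumes feature histograms stored as matrix rows):

```python
import numpy as np

def chi2_distances(H1, H2, eps=1e-12):
    """Pairwise chi^2 distances between the rows of H1 (n1, D) and H2 (n2, D)."""
    diff = H1[:, None, :] - H2[None, :, :]
    return np.sum(diff ** 2 / (H1[:, None, :] + H2[None, :, :] + eps), axis=2)

def chi2_kernel(H1, H2, gamma):
    """k(x, x_bar) = exp(-gamma * chi^2(h(x), h(x_bar)))."""
    return np.exp(-gamma * chi2_distances(H1, H2))

# Bandwidth heuristic from the text: gamma = c / (mean chi^2 distance), with c = 3:
# D_train = chi2_distances(H_train, H_train); gamma = 3.0 / D_train.mean()
```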
Given $L$ different feature functions, $h_1, \dots, h_L$, we obtain $L$ kernel functions $k_1, \dots, k_L$, and we use their unnormalized sum as the final SVM kernel, $k(x, \bar{x}) = \sum_{l=1}^{L} k_l(x, \bar{x})$. Once the kernel is fixed, we identify the SVM's $C$ parameter among the values $\{0.01, 0.03, 0.1, \dots, 30, 100, 300, 1{,}000\}$ in an analogous procedure: We perform fivefold cross validation for each attribute, and we pick the $C$ that achieves the highest mean attrAUC. Note that we use the same $C$ value for all attribute classifiers. Technically, this would not be necessary, but we prefer it to avoid large scaling differences between the SVM outputs of different attribute predictors. Also, one can expect the optimal $C$ values to not vary strongly between different attributes, because all classifiers use the same kernel matrix and differ only in their label annotation.

For IAP, we use the same kernel as for DAP and determine $C$ using fivefold cross validation similar to the procedure described above, except that we use the mean area under the ROC curve of class predictions (mean classAUC) as the selection criterion.
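The model-selection loop for $C$ can be sketched as follows (our illustration with scikit-learn stand-ins; for IAP, replace the per-attribute AUCs by the mean classAUC of the class predictors):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def select_C(K_gram, Y_attr, grid=(0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000)):
    """Pick C maximizing mean attrAUC under fivefold CV on a precomputed kernel.

    K_gram: (n, n) combined kernel matrix of the training images
    Y_attr: (n, M) binary per-sample attribute labels
    """
    best_C, best_auc = None, -np.inf
    for C in grid:                        # note: slow; shown for clarity only
        aucs = []
        for tr, va in KFold(n_splits=5, shuffle=True, random_state=0).split(K_gram):
            for m in range(Y_attr.shape[1]):
                y = Y_attr[:, m]
                if len(np.unique(y[va])) < 2:   # AUC undefined for constant labels
                    continue
                svm = SVC(C=C, kernel="precomputed").fit(K_gram[np.ix_(tr, tr)], y[tr])
                scores = svm.decision_function(K_gram[np.ix_(va, tr)])
                aucs.append(roc_auc_score(y[va], scores))
        if np.mean(aucs) > best_auc:
            best_C, best_auc = C, float(np.mean(aucs))
    return best_C
```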
6.2 Results

We use the above-described procedures to train DAP and IAP models for all data sets. For DAP, where applicable, we use both per-image and per-class annotation to find out whether the time-consuming per-image annotation is necessary. For the data set with per-image attribute annotation, we create class-attribute matrices by averaging all attribute vectors of each class and thresholding the resulting real-valued matrix at its global mean value.

Besides experiments with fixed train/test splits of classes, we also perform experiments with random class splits, using fivefold cross validation for Animals with Attributes (i.e., 40 training classes, 10 test classes) and 10-fold cross validation for SUN Attributes (approximately 637 ± 1 classes for training and 70 ± 1 classes for testing). We measure the quality of the prediction steps in terms of normalized multiclass accuracy (MC acc.) on the test set (the mean of the diagonal of the confusion matrix). We also report areas under the ROC curve for each test class $z$ and attribute $a$, when their posterior probabilities $p(z \mid x)$ and $p(a \mid x)$, respectively, are treated as ranking measures over all test images.

In the following, we show detailed results for Animals with Attributes and summaries of the results for the other data sets.

6.2.1 Results—Animals with Attributes

The Animals with Attributes data set comes only with per-class annotation, so there are two models to compare: per-class DAP and per-class IAP. Fig. 4 shows the resulting confusion matrices for both methods.

Fig. 4. Confusion matrices between 10 test classes of the Animals with Attributes data set. Left: Indirect attribute prediction. Right: Direct attribute prediction.

The class-normalized multiclass accuracy can be read off from the mean value of the diagonal as 41.4 percent for DAP and 42.2 percent for IAP. While the results are not as high as a supervised method could achieve, this nevertheless clearly proves our original claim about attribute-based classification: By sharing information via an attribute layer, it is possible to classify images of classes for which we had no training examples.

As a baseline, we compare against a zero-shot classifier where, for each test class, we identify the most similar training class and predict using a classifier for it trained on all training data. We use two different methods to define the similarity between the classes' attribute representations: Hamming distance or cross correlation. As it turns out, both variants make almost identical decisions, resulting in multiclass accuracies of 30.7 and 30.8 percent. This is clearly better than chance performance, but below the results of DAP and IAP.

Using random class splits instead of the predefined one, we obtain slightly lower multiclass accuracies of 34.8/44.8/34.7/35.1/36.3 percent (average 37.1 percent) for DAP, and 33.4/42.8/27.3/31.9/35.3 percent (average 34.1 percent) for IAP. Again, the baselines achieve clearly lower results: 32.4/31.9/28.1/25.3/20.9 percent (average 27.7 percent) for the class-transfer baseline.
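The class-transfer baseline amounts to a nearest-neighbor lookup in attribute space; a sketch of the Hamming-distance variant (the cross-correlation variant only changes the similarity measure):

```python
import numpy as np

def proxy_training_classes(A_train, A_test):
    """For each test class, return the index of the training class whose binary
    attribute signature has the smallest Hamming distance."""
    dists = (A_test[:, None, :] != A_train[None, :, :]).sum(axis=2)  # (L_test, K_train)
    return dists.argmin(axis=1)

# A test image is then assigned the test class whose proxy training classifier
# (trained on all training data) responds most strongly.
```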
TABLE 4
Numeric Results on the Animals with Attributes Data Set in Percent: Multiclass Accuracy for DAP, IAP, Class-Transfer Classifier Using Cross Correlation (CT-cc) or Hamming Distance (CT-H) of Class Attributes, and Chance Performance (rnd)
Fig. 6. Highest ranking results for each test class in the Animals with Attributes data set. Classes with unique characteristics are identified well, for
example, humpback whales and leopards. Confusions occur between visually similar categories, for example, pigs and hippopotamuses.
Fig. 7. Quality of individual attribute predictors (trained on train classes, tested on test classes), as measured by the area under the ROC curve.
Attributes without entries have constant values for all test classes, so their ROC curve cannot be computed.
Fig. 7 shows the quality of the individual attribute predictors in terms of the area under the ROC curve (attrAUC). Missing entries indicate that all images in the test set coincided in their value for this attribute, so no ROC curve can be computed. Fig. 8 shows, for a selection of attributes, the five images of highest posterior score within the test set.

On average, attributes can be predicted clearly better than random (the average AUC is 72.4 percent, whereas random prediction would yield 50 percent). However, the variance within the predictions is large, ranging from near perfect prediction, for example, for is yellow and eats plankton, to essentially random performance, for example, on has buckteeth or is timid. Contrary to what one might expect, attributes that refer to visual properties are not automatically predicted more accurately than others. For example, is blue is identified reliably, but is brown is not. Overall good performance is also achieved on several attributes that describe body parts, such as has paws, or the natural habitat, such as lives in trees, and even on nonvisual properties, such as is smelly. There are two explanations for this effect: On the one hand, attributes that are clearly visual, such as colors, can still be hard to predict from a global image representation because they typically reflect information that is localized within only the object region. On the other hand, nonvisual attributes can often still be predicted from image information because they occur correlated with visual properties, for example, a characteristic texture. It is known that the integration of such contextual information can improve the accuracy of visual classifiers; for example, road regions help the detection of cars. However, it remains to be seen if this effect will be sufficient for purely nonvisual attributes, or whether it would be better in the long run to replace nonvisual attributes by the visual counterparts they are correlated with.

Another interesting observation is that the system learned to correctly predict attributes such as is big and is small, which are ultimately defined only by context. While this is desirable in our setup, where the context is consistent, it also suggests that the learned attribute predictors themselves are context dependent and cannot be expected to generalize to object classes very different from the training classes.
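The attrAUC numbers above follow from treating each attribute posterior as a ranking score over the test images; a sketch of the evaluation, skipping attributes that are constant on the test set:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def attribute_aucs(P_attr, A_true):
    """P_attr: (n_test, M) posteriors p(a_m | x); A_true: (n_test, M) binary labels."""
    aucs = {}
    for m in range(A_true.shape[1]):
        if len(np.unique(A_true[:, m])) < 2:   # constant on the test set: no ROC curve
            continue
        aucs[m] = roc_auc_score(A_true[:, m], P_attr[:, m])
    return aucs   # averaging the values reproduces the mean attrAUC reported above
```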
Fig. 8. Highest ranking results for a selection of attribute predictors (see Section 6.2) learned by DAP on the Animals with Attributes data set.
TABLE 5
Numeric Results on the aPascal/aYahoo and the SUN Attributes Data Sets in Percent: DAP with Per-Image Annotation (DAP-I), DAP with Per-Class Annotation (DAP-C), IAP, Class-Transfer Classifier (CT-H), and Chance Performance (rnd)

TABLE 6
Mean Mutual Information between Individual Attributes and Class Labels with Per-Class or Per-Image Annotation
TABLE 7
Numeric Results of One-versus-Rest Multiclass SVMs Trained with $n \in \{1, 2, 3, 4, 5, 10, 15, 20\}$ Training Examples from Each Test Class in Comparison to the Results Achieved by Zero-Shot Learning with DAP and IAP (in Percent)

On the Animals with Attributes data set, zero-shot learning achieves results comparable to supervised training with 10-15 training examples per test class, i.e., 100-150 training images in total. On the aPascal, the attribute representations perform worse. Their results are comparable to supervised training with at most one example per class, if judged by multiclass accuracy, and two to three examples per class, if judged by mean classAUC. On the SUN data set, approximately two examples per class (142 total) are necessary for equal mean class accuracy, and 5-10 examples per class (355 to 710 total) for equal mean AUC. Note, however, that all the above comparisons may overestimate the power of the supervised classifiers: In a realistic setup with so few training examples, model selection is problematic, whereas to create Table 7, we just reused the parameters obtained by thorough model selection for the IAP model.

Interpreting the low performance on the aPascal-aYahoo data set, one has to take the background of this data set into account. Its attributes were selected to provide additional information about object classes, not to discriminate between them. While the resulting attribute set is comparably difficult to learn (see Table 5(a)), each attribute on average contains less information about the class labels (see Table 6), mainly because several of the attributes are meaningful only for a small subset of the categories. We conclude from this that attributes that are useful to describe objects from different categories are not automatically also useful to distinguish between the categories, a fact that should be taken into account in the future creation of attribute annotation for image data sets.

Overall, we do not think that the experiments we presented are sufficient to make a definite statement about the quality of attribute-based versus supervised classification. However, we believe that the results confirm the intuition that a larger ratio of attributes to classes improves the prediction performance. At the same time, not only the number of attributes matters, but also how informative the chosen attributes are about the classes.

7 CONCLUSION

In this paper, we introduced learning with disjoint training and test classes. It formalizes the problem of learning an object classification system for classes for which no training images are available. We proposed two methods for attribute-based classification that solve this problem by transferring information between classes. In both cases, the transfer is achieved by an intermediate representation that consists of high-level semantic attributes that provide a fast and simple way to include human knowledge into the system. To predict the attribute level, we either rely on classifiers trained directly on attribute annotation (direct attribute prediction), or we infer the attribute layer from classifiers trained to identify other classes (indirect attribute prediction). Once trained, the system can detect new object categories, if a suitable characterization in terms of attributes is available for them, and it does not require retraining.

As a second contribution, we introduced the Animals with Attributes data set: It consists of over 30,000 images with precomputed reference features for 50 animal classes, for which a semantic attribute annotation is available that has been used in earlier cognitive science work. We hope that this data set will foster research and serve as a testbed for attribute-based classification.

7.1 Open Questions and Future Work

Despite the promising results of the proposed system, several questions remain open and require future work. For example, the assumption of disjoint training and test classes is clearly artificial. It has been observed, for example, in [65], that existing methods, including DAP and IAP, do not work well if this assumption is violated, since their decisions become biased toward the previously seen classes. In the supervised scenario, methods to overcome this limitation have been suggested, for example, [66], [67], but a unified framework that includes the possibility of zero-shot learning is still missing.

A related open problem is how zero-shot learning can be unified with supervised learning when a small number of labeled training examples are available. While some work in this direction exists (see our discussion in Section 3), we believe that it will also be possible to extend DAP and IAP for this purpose. For example, one could make use of their probabilistic formulation to define an attribute-based prior that is combined with a likelihood term derived from the training examples.
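As a sketch of what such a combination could look like (our illustration, not a formulation from the paper): for a test image $x$ and a small set $\mathcal{D}_z$ of labeled examples of each new class $z$, one could score

$$\hat{z} \;=\; \operatorname*{argmax}_{z}\Big[\, \log p_{\mathrm{DAP}}(z \mid x) \;+\; \sum_{x' \in \mathcal{D}_z} \log p(x' \mid z) \Big],$$

where the first term plays the role of the attribute-based prior and the second that of a likelihood under some class-conditional model of the labeled examples.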
Beyond the specific task of multiclass classification, there are many other open questions that will need to be tackled if we want to make true progress in solving the grand tasks of computer vision: How do we handle the problem that many object categories are rare? How can we build object recognition systems that adapt to and incorporate new categories that they encounter? How can we integrate human knowledge about the visual world besides specifying training examples? We believe that attribute-based classification will be able to help in answering at least some of these questions.

ACKNOWLEDGMENTS

This work was in part funded by the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement no. 308036. The authors would like to thank Charles Kemp for providing the Osherson/Wilkie class-attribute matrix and Jens Weidmann for his help on creating the Animals with Attributes data set.
[52] A. Parkash and D. Parikh, "Attributes for Classifier Feedback," Proc. European Conf. Computer Vision (ECCV), 2012.
[53] D. Osherson, E.E. Smith, T.S. Myers, E. Shafir, and M. Stob, "Extrapolating Human Probability Judgment," Theory and Decision, vol. 36, no. 2, pp. 103-129, 1994.
[54] S.A. Sloman, "Feature-Based Induction," Cognitive Psychology, vol. 25, pp. 231-280, 1993.
[55] T. Hansen, M. Olkkonen, S. Walter, and K.R. Gegenfurtner, "Memory Modulates Color Appearance," Nature Neuroscience, vol. 9, pp. 1367-1368, 2006.
[56] D.N. Osherson, J. Stern, O. Wilkie, M. Stob, and E.E. Smith, "Default Probability," Cognitive Science, vol. 15, no. 2, pp. 251-269, 1991.
[57] C. Kemp, J.B. Tenenbaum, T.L. Griffiths, T. Yamada, and N. Ueda, "Learning Systems of Concepts with an Infinite Relational Model," Proc. Nat'l Conf. Artificial Intelligence (AAAI), 2006.
[58] K.E.A. van de Sande, T. Gevers, and C.G.M. Snoek, "Evaluation of Color Descriptors for Object and Scene Recognition," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[59] A. Bosch, A. Zisserman, and X. Muñoz, "Representing Shape with a Spatial Pyramid Kernel," Proc. Int'l Conf. Content-Based Image and Video Retrieval (CIVR), 2007.
[60] H. Bay, A. Ess, T. Tuytelaars, and L.J.V. Gool, "Speeded-Up Robust Features (SURF)," Computer Vision and Image Understanding, vol. 110, no. 3, pp. 346-359, 2008.
[61] E. Shechtman and M. Irani, "Matching Local Self-Similarities across Images and Videos," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[62] J. Xiao, J. Hays, K.A. Ehinger, A. Oliva, and A. Torralba, "SUN Database: Large-Scale Scene Recognition from Abbey to Zoo," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 3485-3492, 2010.
[63] J.C. Platt, "Probabilities for SV Machines," Advances in Large Margin Classifiers, MIT Press, 2000.
[64] L. Torresani, M. Szummer, and A. Fitzgibbon, "Efficient Object Category Recognition Using Classemes," Proc. European Conf. Computer Vision (ECCV), pp. 776-789, Sept. 2010.
[65] K.D. Tang, M.F. Tappen, R. Sukthankar, and C.H. Lampert, "Optimizing One-Shot Recognition with Micro-Set Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[66] W.J. Scheirer, A. Rocha, A. Sapkota, and T.E. Boult, "Toward Open Set Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757-1772, July 2013.
[67] T. Tommasi, N. Quadrianto, B. Caputo, and C.H. Lampert, "Beyond Data Set Bias: Multi-Task Unaligned Shared Knowledge Transfer," Proc. Asian Conf. Computer Vision (ACCV), 2012.

Hannes Nickisch received degrees from the Université de Nantes, France, in 2004, and the Technical University Berlin, Germany, in 2006. During his PhD work at the Max Planck Institute for Biological Cybernetics, Tübingen, Germany, he worked on large-scale approximate Bayesian inference and magnetic resonance image reconstruction. Since 2011, he has been with Philips Research, Hamburg, Germany. His research interests include medical image processing, machine learning, and biophysical modeling.

Stefan Harmeling studied mathematics and logic at the University of Münster (Dipl Math 1998) and computer science with an emphasis on artificial intelligence at Stanford University (MSc 2000). During his doctoral studies, he was a member of Prof. Klaus-Robert Müller's research group at the Fraunhofer Institute FIRST (Dr rer nat 2004). Thereafter, he was a Marie Curie fellow at the University of Edinburgh from 2005 to 2007, before joining the Max Planck Institute for Biological Cybernetics/Intelligent Systems. He is currently a senior research scientist in Prof. Bernhard Schölkopf's Department of Empirical Inference at the Max Planck Institute for Intelligent Systems. His research interests include machine learning, image processing, computational photography, and probabilistic inference. In 2011, he received the DAGM Paper Prize, and in 2012 the Günter Petzow Prize for outstanding work at the Max Planck Institute for Intelligent Systems.