that this sharing leads to a sublinear growth of required alphabet entries / detectors, but maintains excellent detection performance.
2 The boundary fragment model (BFM)

We present a very brief overview of our previous work, which introduced a boundary fragment model (BFM) detector (see [13] for details). The BFM consists of a set of curve fragments representing the edges of the object, both internal and external (silhouette), with additional geometric information about the object centroid (in the manner of [10]). A BFM is learnt in two stages. First, random boundary fragments γi are extracted from the training images, and costs K(γi) are calculated for each fragment on a validation set. Low costs are achieved by boundary fragments that match well on the positive validation images, not so well on the negative ones, and give good centroid predictions on the positive validation images. Second, combinations of k = 2 boundary fragments are learnt as weak detectors (not just classifiers) within an AdaBoost [6] framework. Detecting instances of the object category in a new test image is done by applying the weak detectors and collecting their votes in a Hough voting space. An object is detected if a mode (obtained using Mean-Shift mode estimation) is above a detection threshold. Following the detection, the boundary fragments that contributed to that mode are backprojected into the test image and provide an object segmentation. An overview of the detection method is shown in figure 1.
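As a sketch of how this detection stage can be realized, the following Python fragment accumulates centroid votes and finds the strongest mode with a flat-kernel mean-shift; the bandwidth and threshold values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def detect_from_votes(votes, bandwidth=20.0, min_score=5.0, iters=50):
    """Find the strongest centroid hypothesis in a Hough voting space.

    votes: (N, 2) array of centroid votes (x, y), one per matched
    weak-detector firing. Returns (mode, score) or None if the mode
    is below the detection threshold.
    """
    votes = np.asarray(votes, dtype=float)
    # Seed mean-shift at the vote with the densest neighbourhood.
    d2 = ((votes[:, None, :] - votes[None, :, :]) ** 2).sum(-1)
    mode = votes[np.argmax((d2 < bandwidth ** 2).sum(axis=1))].copy()
    for _ in range(iters):  # flat-kernel mean-shift iterations
        near = votes[((votes - mode) ** 2).sum(axis=1) < bandwidth ** 2]
        shifted = near.mean(axis=0)
        if np.allclose(shifted, mode):
            break
        mode = shifted
    # Score the mode by the number of votes it gathers.
    score = int((((votes - mode) ** 2).sum(axis=1) < bandwidth ** 2).sum())
    return (mode, score) if score >= min_score else None
```

The backprojection step for segmentation would then keep exactly those fragments whose votes fall within the bandwidth of the selected mode.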
3 On multiple aspects

We want to enable an object to be detected over several visual aspects. The BFM implicitly couples fragments via the centroid, and so is not as flexible as, say, a "bag of features" model where feature position is not constrained. In this section we investigate qualitatively the tolerance of the model to viewpoint change. The evaluation is carried out on the ETH-80 dataset. This is a toy dataset (pun intended), but is useful here for illustration because it contains image sets of various instances of categories at controlled viewpoints.

We carry out the following experiment: a BFM is learnt from instances of the cow category in side views. The model is then used to detect cows in test images which vary in two ways: (i) they contain cows (seven different object instances) over varying viewpoints – object rotation about a vertical and a horizontal axis (see figure 2); (ii) they contain instances of other categories (horses, apples, cars, ...), again over varying viewpoints.

Figure 2: Hough votes on the object centroid under viewpoint change; panels at H:90 with V ranging over 90, 112, 135, 158, 180, 202, 225, 248, 270.

Figure 2 shows the resulting Hough votes on the centroid, averaged over the seven cow instances, for a number of rotations. It can be seen that the BFM is robust to significant viewpoint changes, with the mode still clearly defined (though elongated). The graph in figure 3 summarizes the change in the detection response, averaged over the different cows or other objects, under rotation about a vertical axis (as in the top row of figure 2). Note that the cow detection response is above that of the other, non-cow category objects. The side-trained BFM can still discriminate the object class based on detection responses at rotations of up to 45 degrees in both directions. In summary: the BFM trained on one visual aspect can correctly detect the object class over a wide range of viewpoints, with little confusion with other object classes. Similar results are obtained for BFM detectors learnt for other object categories (e.g. horses), whilst for some categories with greater invariance to viewpoint (e.g. bottles) the response is even more stable. These results allow us to cut down the bi-infinite space of different viewpoints to a few category-relevant aspects. These aspects allow the object to be categorized and also allow its viewpoint to be predicted.
Figure 3: The detection response of a BFM trained on cows-side, and tested on cows rotated about a vertical axis and on other objects. (Plot: detection confidence versus degree of rotation, from −100 to 100 degrees; curves for apple, car, cup, dog, horse, pear, tomato, the average over cows, and the average over other objects.)

4 Learning the shape based alphabet incrementally
In this section we describe how the basic alphabet is assembled for a set of classes. Each entry in the alphabet consists of three elements: (i) a curve fragment, (ii) associated vectors specifying the object's centroid, and (iii) the set of categories to which the vectors apply. The alphabet can be enlarged in two ways: (i) adding additional curve fragments, or (ii) adding additional vectors to existing curve fragments – so that a fragment can vote for additional objects' centroids. Pairs of curve fragments are used to construct the weak detectors of section 5.

We start from a set of boundary fragments for each category. This set is obtained from the fragment extraction stage (see section 2 or [13]) by choosing fragments whose costs on the validation set of the category are below a given threshold thK. Typically this threshold is chosen so that about 100 fragments are available per category. Our aim is to learn a common alphabet from these pooled individual sets that is suitable for all the categories one wants to learn.
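The three elements of an alphabet entry map directly onto a small data structure. The following is a minimal Python sketch under our own naming conventions (the paper does not prescribe an implementation):

```python
from dataclasses import dataclass, field

@dataclass
class AlphabetEntry:
    # (i) the curve fragment, stored here as a polyline of edge points
    fragment: list
    # (ii) centroid vectors, keyed by (iii) the categories they apply to
    centroid_vectors: dict = field(default_factory=dict)

    def add_vector(self, category, vector):
        """Enlargement path (ii): add a centroid vote, possibly making
        the entry usable for an additional category."""
        self.centroid_vectors.setdefault(category, []).append(vector)

    @property
    def categories(self):
        """(iii) the set of categories to which the vectors apply."""
        return set(self.centroid_vectors)
```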
4.1 Building the alphabet and sharing of boundary fragments
In a sequential way, each boundary fragment from each category is compared (using the Chamfer distance) to all existing alphabet entries. If the distance to a certain alphabet entry is below a similarity threshold thsim, the geometric information (for the centroid vote) of that entry is updated. If the existing alphabet entry originates from a different category than the boundary fragment we are currently processing, we also update the information about which categories the entry is suitable for. This is the first case where boundary fragments are shared; this sharing is based purely on boundary fragment similarity.

But there is more information that can be used for sharing. The second possibility for sharing is achieved by evaluating each boundary fragment on the validation sets of all other categories. This results in average matching costs for the boundary fragment on all these other categories, which indicate how suitable the boundary fragment is for each of them. The straightforward way of sharing is then that each alphabet entry whose boundary fragment has costs below thK on a certain category is also shared for that category. However, costs are only low if the boundary fragment matches well on the validation images of that category and gives a reliable centroid prediction. The final possibility for sharing is where the boundary fragment matches well, but additional centroid vectors are associated with the fragment for the new category. Figure 4 shows an example of a boundary fragment extracted from one category also matching on images of another class (or aspect). The first column shows the original boundary fragment (in red/bold) on the training image from which it was learnt (the green/bold cross showing the true object centroid, and blue/bold the centroid vote of this boundary fragment). The other columns show sharing on another category (first row), and within aspects of the same category (second row). Note that we share the curve fragment and update the geometric information.

Figure 4: Sharing of boundary fragments over categories (first row) and aspects (second row).
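The sequential comparison loop can be sketched as follows, reusing the AlphabetEntry type from above; chamfer is assumed to be a user-supplied Chamfer-distance function, and th_sim the similarity threshold:

```python
def build_alphabet(fragment_sets, chamfer, th_sim):
    """Sequentially merge per-category fragment sets into one alphabet.

    fragment_sets: {category: [(fragment, centroid_vector), ...]},
    already filtered by the cost threshold th_K.
    """
    alphabet = []
    for category, fragments in fragment_sets.items():
        for fragment, vector in fragments:
            # Compare the fragment against all existing entries.
            entry = next((e for e in alphabet
                          if chamfer(fragment, e.fragment) < th_sim), None)
            if entry is not None:
                # Update the geometric information; if the entry came from
                # another category, this is the first case of sharing.
                entry.add_vector(category, vector)
            else:
                alphabet.append(
                    AlphabetEntry(fragment, {category: [vector]}))
    return alphabet
```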
4.2 Class similarities on the alphabet level

We now have alphabet entries for a number of classes. Using this information we can preview class similarities before training the final detector. A class similarity matrix is calculated, where each element counts the number of alphabet entries two classes have in common. In turn, the classes can be agglomeratively clustered based on their similarity. For this clustering, the normalized columns of the similarity matrix provide feature vectors and Euclidean distance is used as the distance measure. An example similarity matrix and dendrogram (representing the clustering) are shown in figures 8(a) and (b) respectively.
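A possible realization of this preview step, using SciPy's hierarchical clustering (the choice of linkage method is our assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

def class_similarity(alphabet, classes):
    """S[i, j]: number of alphabet entries classes i and j have in common."""
    S = np.zeros((len(classes), len(classes)))
    for entry in alphabet:
        for i, ci in enumerate(classes):
            for j, cj in enumerate(classes):
                if ci in entry.categories and cj in entry.categories:
                    S[i, j] += 1
    return S

def cluster_classes(S, classes):
    # Normalized columns as feature vectors, Euclidean distance.
    F = S / np.maximum(np.linalg.norm(S, axis=0), 1e-12)
    Z = linkage(F.T, method="average", metric="euclidean")
    dendrogram(Z, labels=list(classes))  # as in figure 8(b)
    return Z
```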
5 Incremental Joint-Adaboost Learning

In this section we describe the new Adaboost-based algorithm for learning the strong object detectors. It is designed to scale well for many categories and to enable incremental and/or joint learning. It has to do two jobs: (i) select pairs of fragments to form the weak detectors (see section 2); and (ii) select weak detectors to form the strong detector for each object category. Sharing occurs at two levels: first, at the alphabet level, where an alphabet entry may be applicable to several categories; second, at the weak detector level, where weak detectors are shared across strong detectors.

The algorithm can operate in two modes: either joint learning (as in [16]) or incremental learning. In both cases our aim is a reduction in the total number of weak detectors required compared to learning each class independently. For C classes this gain can be measured by $\sum_{i=1}^{C} T_{c_i} - T_s$ (as suggested in [16]), where Tci is the number of weak detectors required for each class trained separately (to achieve a certain error on the validation set) and Ts is the number of weak detectors required when sharing is used. In the separate training case this sum is O(C), whereas in the sharing case it should grow sub-linearly with the number of classes. The algorithm optimizes an error rate En over all classes.

Joint learning: involves, for each iteration, searching for the weak detector for a subset Sn ⊆ C that has the lowest accumulated error En on all classes C. Subsets might be e.g. S1 = {c2} or S3 = {c1, c2, c4}. A weak detector only fits a category ci if its error εci on this category is below 0.5 (and is rejected otherwise). En is the sum of all class-specific errors εci for ci ∈ Sn, plus a penalty error εp (0.6 in our implementation) for each class not in Sn. Searching for a minimum of En over a set of subsets Sn guides the learning towards sharing weak detectors over several categories. We give a brief example of this behavior: imagine we learn three categories c1, c2 and c3. There is one weak detector with εc1 = 0.1, but this weak detector does not fit any other category (εc2 > 0.5 and εc3 > 0.5). Another weak detector can be found with εc1 = 0.2, εc2 = 0.4 and εc3 = 0.4. In this case the algorithm would select the second weak detector, as its accumulated error En = 1.0 is smaller than the first weak detector's En = 1.3 (note that εp is added for each category not shared). This makes the measure En useful for finding detectors that are suitable both for distinguishing a class from the background and for distinguishing a class from other classes. Clearly, the amount of sharing is influenced by the parameter εp, which enables us to control the degree of sharing in this algorithm. Instead of exploring all 2^C − 1 possible subsets Sn of the jointly trained classes C, we employ the maximally greedy strategy from [16]. This starts with the single class that achieves the lowest error on the validation set, and then incrementally adds the class with the next lowest training error. The combination which achieves the best overall detection performance over all classes is then selected. [16] showed that this approximation does not reduce the performance much.
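The accumulated error and the greedy subset search can be written down compactly. This sketch assumes that, for one candidate weak detector, its validation error per class is already known; it reproduces the numeric example above (En = 1.0 beats En = 1.3):

```python
def accumulated_error(errors, subset, classes, eps_p=0.6):
    """E_n: class error inside the subset (if usable, i.e. < 0.5),
    penalty eps_p for every class outside the subset."""
    E = 0.0
    for c in classes:
        if c in subset and errors[c] < 0.5:
            E += errors[c]
        else:
            E += eps_p
    return E

def greedy_best_subset(errors, classes, eps_p=0.6):
    """Maximally greedy strategy (after [16]): grow the subset one class
    at a time instead of scoring all 2^C - 1 subsets."""
    subset, scored = set(), []
    remaining = set(classes)
    while remaining:
        c = min(remaining, key=lambda k: errors[k])  # next-best class
        subset.add(c)
        remaining.remove(c)
        scored.append((frozenset(subset),
                       accumulated_error(errors, subset, classes, eps_p)))
    return min(scored, key=lambda t: t[1])           # (subset, E_n)

# The example from the text:
# accumulated_error({'c1': 0.1, 'c2': 0.6, 'c3': 0.7}, {'c1'},
#                   ['c1', 'c2', 'c3'])                        -> 1.3
# accumulated_error({'c1': 0.2, 'c2': 0.4, 'c3': 0.4},
#                   {'c1', 'c2', 'c3'}, ['c1', 'c2', 'c3'])    -> 1.0
```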
Incremental learning: implements the following idea. Suppose our model was jointly trained on a set of categories C^L = {c1, c2, c3}. The "knowledge" learnt is then contained in a set of three strong detectors H^L = {H1, H2, H3}, which are composed from a set of weak detectors h^L. The number of these weak detectors depends on the degree of sharing and satisfies $T_s \leq \sum_{i=1}^{C} T_{c_i}$ (C = 3 here). Now we want to use this existing information to learn a detector for a new class cnew (or several classes) incrementally. To achieve this, one can search the already learnt weak detectors h^L to see whether they are also suitable (εcnew < 0.5) for the new class. If so, these existing weak detectors are also used to form a detector for the new category, and only a reduced number of new weak detectors has to be learnt using the joint learning procedure. Note that joint and incremental training reduce to standard Boosting if there is only one category.
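The reuse test for incremental learning is simple; this sketch assumes evaluate_error(h) returns the validation error of an existing weak detector h on the new class:

```python
def reuse_for_new_class(existing_weak_detectors, evaluate_error, T=100):
    """Keep every previously learnt weak detector that also fits the new
    class (error < 0.5); only the remaining budget of T weak detectors
    has to be learnt with the joint procedure."""
    reused = [h for h in existing_weak_detectors if evaluate_error(h) < 0.5]
    still_to_learn = max(0, T - len(reused))
    return reused, still_to_learn
```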
Weak detectors: are formed from pairs of fragments. The possible combinations of k fragments define the feature pool (the size of this set is the binomial coefficient of the number of alphabet entries over k). This means that in each iteration, for each sharing candidate, we must search over all these possibilities to find the best weak detector. We can reduce the size of this feature pool by using as candidates for weak detectors only combinations of boundary fragments which can be shared over the same categories. E.g. it does not make much sense to test a weak detector combining a boundary fragment representing a horse's leg with one representing a bicycle wheel if the horse's leg never matches in the bike images.
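Pruning the feature pool to fragment pairs with compatible category sets can be expressed as a generator; requiring a non-empty intersection of category sets is one reading of "shared over the same categories":

```python
from itertools import combinations

def candidate_pairs(alphabet):
    """Yield only fragment pairs that share at least one category,
    pruning e.g. a horse-leg/bicycle-wheel combination."""
    for a, b in combinations(alphabet, 2):
        shared = a.categories & b.categories
        if shared:
            yield a, b, shared
```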
Details of the algorithm: The algorithm is summarized in figure 5. We train on C different classes, where each class ci consists of Nci validation images, plus a set of Nbg background validation images (which are shared across all classes and are labeled 0). The total number of validation images for all classes and background is denoted by N. The weights are initialized for each class separately. This results in a weight vector w^ci of length N for each class ci, normalized with respect to the varying number of positive validation images Nci. In each iteration a weak detector for a subset Sn is learnt. To encourage the algorithm to also focus on the categories which were not included in Sn,
we vary the weights of these categories slightly for the next iteration (εc = εp, ∀c ∉ Sn, with εp = 0.47 in our implementation).

Figure 5: Incremental joint-Adaboost learning algorithm. (Input: validation images (I1, ·), . . . , (IN, ·); final step: update Tci and, if Tci ≥ T ∀ci, STOP.)
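Since only fragments of the figure 5 listing are legible, the following is a hedged skeleton of the outer training loop implied by the text; select_weak_detector and update_weights are caller-supplied placeholders, not the paper's procedures:

```python
import numpy as np

def joint_training_loop(classes, N, select_weak_detector, update_weights,
                        T=100):
    """Per-class weight vectors of length N, one (possibly shared) weak
    detector selected per iteration, stopping once every class has
    accumulated T weak detectors (Tci >= T for all ci)."""
    w = {c: np.full(N, 1.0 / N) for c in classes}  # initialized per class
    T_c = {c: 0 for c in classes}
    strong = {c: [] for c in classes}
    while not all(T_c[c] >= T for c in classes):
        h, subset = select_weak_detector(w)        # weak detector for S_n
        for c in subset:
            strong[c].append(h)
            T_c[c] += 1
            w[c] = update_weights(w[c], h)         # boosting-style update
        # Classes outside S_n have their weights varied slightly in the
        # paper (eps_p = 0.47); that step is omitted from this sketch.
    return strong
```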
6 Experiments

We measure detector performance in two ways: first, by applying the detector to a category-specific test set (positive images vs. background images). The measure used is the Recall-Precision-Curve (RPC) equal-error rate; this rate is commonly used for detection and takes false positive detections into account (see [1] for more details). Second, by a confusion table computed on a multi-class test set. Note that a detection is counted as correct if area(boxpred ∩ boxgt)/area(boxpred ∪ boxgt) ≥ 0.5, where boxpred is the predicted bounding box and boxgt the ground-truth bounding box.
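This correctness criterion is the standard intersection-over-union test; a direct implementation for axis-aligned boxes (x0, y0, x1, y1):

```python
def detection_is_correct(box_pred, box_gt, threshold=0.5):
    """area(pred ∩ gt) / area(pred ∪ gt) >= threshold."""
    ix0 = max(box_pred[0], box_gt[0]); iy0 = max(box_pred[1], box_gt[1])
    ix1 = min(box_pred[2], box_gt[2]); iy1 = min(box_pred[3], box_gt[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_pred) + area(box_gt) - inter
    return union > 0 and inter / union >= threshold
```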
The detectors are trained in three ways: (i) independently, using the category's validation set (images with the object, and background images); (ii) jointly over multiple categories; and (iii) incrementally. We compare performance, learning complexity, and efficiency of the final strong detectors over these three methods.

For all experiments, training is over a fixed number of weak detectors T = 100 per class (for C classes the maximum number of weak detectors is Tmax = T · C). This means we are not searching for and comparing the learning effort needed to reach a certain error rate (as is done in [16]); rather, we report the RPC-equal-error rate for a certain learning effort (namely T weak detectors). Keeping track of the training error is more difficult in our model, as we detect in the …

Dataset: we have combined different categories from several available datasets (at [8]), together with new images from Google Image Search, in order to assemble a dataset containing 17 categories of varying complexity and aspect. Figure 6 overviews the dataset, giving an example image for each of the 17 categories. Table 1 summarizes the data used for training, validation and testing.

We use the same test set as [5] for the first four categories, so that our performance can be compared to others (although fewer training images are used). The same is done for category 11 (CowSide), so that performance can be compared with [10]. For the other categories we are not directly comparable, as subsets of the training and test data have been selected. As background images we used a subset of the background images used in [5] and [12] (the same number of background as positive training images). To determine to what extent the model confuses categories, we select a multiclass test dataset M which consists of the first 10 test images from each category¹.

¹The whole dataset is available at [7].
Figure 6: Example images of the 17 different categories (or aspects) used in the experiments.
Figure 8: (a) Similarity matrix of alphabet entries for the different categories (brighter is more similar). (b) Dendrogram generated from this similarity matrix. (c) The increase in the number of alphabet entries and weak detectors when adding new classes incrementally or training a set of classes jointly. The values are compared to the worst case (linear growth, dotted line). For weak detectors the worst case is independent training, given by $\sum_{i=1}^{C} T_{c_i}$; for the alphabet we approximate the worst case by assuming an addition of 100 boundary fragments per category. Classes are taken sequentially (Planes(1), CarRear(2), Motorbike(3), ...). Note the sublinear growth. (d) Error averaged over 6 categories (Planes, CarRear, Motorbike, Face, BikeSide and HorseSide), either learnt independently or jointly, with a varying number of training images per category.
The alphabet: Figure 7 shows entries of the alphabet trained on horses only. This nicely illustrates the different properties of each entry: shape and geometric information for the centroid. When we train on 17 categories, each of the … weak classifiers) that can be shared. One could use the information from the dendrogram in figure 8(b) to find the optimal order of the classes for incremental learning, but this is future work.

Joint learning: First we learn detectors for different aspects of cows, namely the categories CowSide and CowFront, independently, and then compare this performance with joint learning. For CowSide the RPC-equal-error is 0% in both cases. For CowFront the error is reduced from 18% (independent learning) to 12% (joint learning). At the same time the number of learnt weak hypotheses is reduced from 200 to 171. We have carried out a similar comparison for horses, which again shows the same behavior. This is due to the reuse of some information gathered from the side-aspect images to detect instances from the front; information shared here includes e.g. legs, or parts of the head. This is precisely what the algorithm should achieve – fewer weak detectors with the same or superior performance. The joint algorithm has the opportunity of selecting and sharing a weak detector that can separate both classes from the background; this only has to be done once. Independent learning, on the other hand, does not have this opportunity, and so has to find such a weak detector for each class.

In figure 8(d) we show that joint learning can achieve better performance with less training data, as a result of sharing information over several categories (we use 6 categories in this specific experiment).

Finally we focus on many categories, and compare independent learning performance to that achieved by learning jointly. Table 2 shows the detection results on the categories' test sets (category images and background images), denoted by T, and on the multiclass test set (M), for both cases. It also gives comparisons to some other methods that used this data in the single-category case where we used the same test data. The joint learning procedure does not significantly reduce the detection error (although we gain more than we lose), but we gain in requiring just 623 weak detectors instead of the straightforward 1700 (i.e. 100 times the number of classes for independent learning). Errors are more often due to false positives than false negatives. Our performance is superior or similar compared to state-of-the-art approaches (note that classification is easier than detection), as shown in table 2. Looking at the multiclass case (I,M and J,M, in error per image), we obtain comparable error rates for independent and joint learning. Figure 9 shows examples of weak detectors learnt in this experiment, and their sharing over various categories.

7 Discussion

It is worth comparing our algorithm and results to those of Torralba et al. [16]. We have used AdaBoost instead of GentleBoost (used in [16]), as in our experiments it gave superior performance and proved more suitable for our type of weak detectors. Compared to [16] we share significantly fewer entries: they report a 4-fold reduction, compared to our 2-fold reduction. This is mainly caused by their type of basic features, which are much less complex and thus more common across different categories than ours.

Initial experiments show that a combination of our model with appearance patches increases the detection performance, but this is the subject of future work.
Class  Plane  CarR   Mb     Face   B-S   B-R   B-F   Car23  CarF  Bottle  CowS   H-S   H-F   CowF  Pers.  Mug  Cup
Ref.   6.3    6.1    7.6    6.0    –     –     –     –      –     –       0.0    –     –     –     –      –    –
       [5],C  [10],D [15],D [15],D                                        [10],D
I,T    7.4    2.3    4.4    3.6    28.0  25.0  41.7  12.5   10.0  9.0     0.0    8.2   13.8  18.0  47.4   6.7  18.8
J,T    7.4    3.2    3.9    3.7    22.4  20.8  31.3  12.5   7.6   10.7    0.0    7.8   11.5  12.0  42.0   6.7  12.5
I,M    1.1    7.0    6.2    1.4    10.3  7.7   8.5   5.2    7.6   7.1     1.6    10.0  8.2   9.5   29.1   5.1  8.0
J,M    1.5    4.3    4.5    1.6    8.9   5.9   7.7   3.8    8.5   6.1     1.3    11.0  4.7   6.8   27.7   5.8  8.3

Table 2: Detection results. In the first row we compare categories to previously published results; we distinguish between detection D (RPC-eq.-err.) and classification C (ROC-eq.-err.). We then compare our model, trained either by the independent method (I) or by the joint method (J), and tested on the class test set T or on the multiclass test set M. On the multiclass set we count the best detection in an image (over all classes) as the object category. Abbreviations: B=Bike, H=Horse, Mb=Motorbike, F=Front, R=Rear, S=Side.
Figure 9: Examples of weak detectors that have been learnt for the whole dataset (resized to the same width for this illus-
tration). The black rectangles indicate which classes share a detector. Rather basic structures are shared over many classes
(e.g. column 2). Similar classes (e.g. rows 5, 6, 7) share more specific weak detectors (e.g. column 12, indicated by the arrow,
where parts of the bike’s wheel are shared).