Zeni_Distilling_Knowledge_From_Refinement_in_Multiple_Instance_Detection_Networks_CVPRW_2020_paper

This document discusses advancements in Weakly Supervised Object Detection (WSOD) using Multiple Instance Detection Networks (MIDN) to improve object detection accuracy with minimal annotation. The authors propose a method called Boosted-OICR, which incorporates knowledge distillation and an adaptive supervision aggregation function to enhance the performance of existing models. Experimental results demonstrate significant improvements in detection metrics on the Pascal VOC 2007 dataset, making the proposed approach competitive with state-of-the-art methods.

Uploaded by

luis.zeni

Distilling Knowledge from Refinement in Multiple Instance Detection Networks

Luis Felipe Zeni Claudio R. Jung


[email protected] [email protected]

Institute of Informatics, Federal University of Rio Grande do Sul, Brazil

Abstract

Weakly supervised object detection (WSOD) aims to tackle the object detection problem using only labeled image categories as supervision. A common approach used in WSOD to deal with the lack of localization information is Multiple Instance Learning, and in recent years methods started adopting Multiple Instance Detection Networks (MIDN), which allow training in an end-to-end fashion. In general, these methods work by selecting the best instance from a pool of candidates and then aggregating other instances based on similarity. In this work, we claim that carefully selecting the aggregation criteria can considerably improve the accuracy of the learned detector. We start by proposing an additional refinement step to an existing approach (OICR), which we call refinement knowledge distillation. Then, we present an adaptive supervision aggregation function that dynamically changes the aggregation criteria for selecting boxes related to one of the ground-truth classes, background, or even ignored during the generation of each refinement module supervision. Experiments in Pascal VOC 2007 demonstrate that our Knowledge Distillation and smooth aggregation function significantly improve the performance of OICR in the weakly supervised object detection and weakly supervised object localization tasks. These improvements make the Boosted-OICR competitive again versus other state-of-the-art approaches.

1. Introduction

Supervised object detection has been achieving increasingly better results in terms of accuracy and speed over the past years [13, 11]. The main drawback of these methods is the need for annotated bounding boxes, which is a tedious, error-prone, time-consuming, and expensive task. The annotation cost directly impacts the viability of deployment of these detectors in real-world applications, particularly when starting from scratch for a specific application. One approach that researchers are exploring to alleviate the annotation cost is Weakly Supervised Object Detection (WSOD), where the object detector is trained using only image category annotations (presence or absence of interest classes in the image), which is much easier and faster to generate.

Most WSOD methods [2, 1, 19, 5, 18, 23] follow the Multiple Instance Learning (MIL) pipeline [6] to train detectors using only image category level annotations. In the adaptation of MIL to the WSOD task, each image is considered a bag of positive and negative object proposals generated by object proposal methods such as Selective Search [22] or Edge Boxes [27]. The training process in the MIL framework encompasses two steps: (i) to train an instance selector to compute the object score of each object proposal; (ii) to select the proposal with the highest score and use it to mine positive instances and train detector estimators. The majority of recent methods explore features extracted by Convolutional Neural Networks (CNN) as an off-the-shelf feature extractor [2, 10] or train an end-to-end Multiple Instance Detection Network (MIDN) [1].

The lack of localization supervision during the training process, as expected, makes the detection accuracy of WSOD methods worse than that of their supervised counterparts. However, the promise of a lower annotation cost attracted the efforts of many researchers to WSOD, and significant improvements were achieved in recent years exploring a variety of strategies [2, 19, 20, 18, 23].

In this paper, we focused on the instance mining step of MIL-based methods, and used a modification of an existing baseline approach as a proof-of-concept. More precisely, we propose improvements to boost the performance of OICR, which we call Boosted-OICR (BOICR). We first observed that it is possible to extract extra information from the refinement modules to boost the detection mAP of OICR, which we call refinement knowledge distillation. We also propose an adaptive supervision aggregation function that dynamically changes the IoU threshold to select boxes that will be aggregated as belonging to one of the ground-truth classes, as background, or ignored during the generation of each refinement module supervision. The selection process follows the principle that at the beginning of the training it is better to aggregate boxes with small IoU (since the best instance is typically small and comprehends a small

portion of the object, such as the face for a person or cat). To avoid an overgrowth of the object-related proposals, the IoU threshold is tightened as the training phase advances. We also embedded an adapted version of the “trick” proposed in [20], which ignores boxes with small intersection in the refinement losses. We evaluate our method in Pascal VOC 2007, and our approach presents competitive state-of-the-art results both in detection mAP and in CorLoc.

Our main contributions in this paper are the introduction of: i) a module to distill extra knowledge from refinement agents; and ii) an adaptive supervision aggregation function to mine candidate instances. Next, we present the state-of-the-art on WSOD, and then describe the proposed methodology with the experimental results and conclusions.

2. Related Work

There is a considerable number of WSOD works that precede the CNN era [14, 16, 17]. However, we focus on CNN-based methods as all state-of-the-art methods rely on CNN architectures. The adoption of CNN features was not immediate, and initial works started combining the CNN features with features extracted by other kinds of feature descriptors. Cinbis et al. [2] proposed a multi-fold multiple instance learning training procedure, which splits the positive instances in K training folds. The method combines the Fisher Vector with CNN features as descriptors, and an objectness refinement is proposed to improve localization accuracy. Since a pre-trained CNN is only used as a feature extractor, its weights are not fine-tuned, which can lead to lower accuracy. Li et al. [10] introduced a two-stage adaptation algorithm. The first stage fine-tunes the network to collect class-specific object proposals with higher precision; the second uses confident object candidates to optimize the CNN representations to gradually turn image classifiers into object detectors. A drawback of the method is the need for individually forwarding each region proposal into the CNN to extract features, making the whole process very slow. This problem is solved in more recent methods using Spatial Pyramidal Pooling (SPP) [9].

Bilen et al. [1] proposed a two-stream method, where one stream performs classification and the other detection. The output of both streams is combined into a global scoring matrix by taking the Hadamard product of the two streams. The classification scores are calculated by summing the values in the proposals dimension of this matrix. Tang et al. [19] improved the smoothed version of MIL proposed by [1] using an online instance classification refinement that utilizes cascaded refinement modules to increase the detection performance, where each refinement step makes the detector able to gradually detect larger object parts during training. In [18], the refinement process of [19] is further improved, adding proposal clusters to select one or more supervision boxes during the training. Selecting more than one supervision box is interesting because, usually, objects can have multiple parts and also have multiple instances present in the image. However, a limitation of the clustering process is that it increases the computational cost, making the whole training process slower. Our Boosted-OICR has a better mAP result than [18] without using the clustering process.

Diba et al. [5] proposed a three-stage cascaded method that mines boxes from Class Activation Maps (CAM). The first stage is inspired by [26], which uses a fully convolutional CNN with global average pooling (GAP) to create the CAMs in conjunction with the classification scores. The second stage uses the CAM from the first stage as supervision to generate a segmentation map that is used to select a set of candidate bounding boxes using the connective algorithm from [26]. Finally, the features of the candidate boxes are extracted by an SPP layer [9], and a MIL algorithm is applied to select the best candidate boxes for each class. In the same direction, Wei et al. [25] introduced a method that uses CAMs to mine tight object boxes by exploiting segmentation confidence maps. The segmentation confidence maps are employed to evaluate the objectness scores of proposals according to two properties, purity and completeness, and the detection process is based on [19]. Although the idea of using CAMs to guide the selection of the supervision boxes is interesting, the training process of [5, 25] is overly complex.

Wan et al. [24] proposed a min-entropy latent model to measure the randomness of object localization. The learning process operates with two network branches. The first branch is designated for discovering objects using a global min-entropy layer that defines the distribution of object probability. This discovery process targets finding candidate object cliques, which are proposals with high object confidence. The second branch is designated to localize objects using a local min-entropy layer and a softmax layer. The local min-entropy layer classifies the object candidates in a clique into pseudo objects and hard negatives by optimizing the local entropy.

Non-convexity is also a common problem in multiple instance learning, which might lead to sub-optimal results. Wan et al. [23] introduced a continuation optimization method that uses a series of smoothed loss functions to approximate the target (desired) loss, claiming that this smoothed process alleviates the non-convexity problem in MIL. The authors also propose a parametric strategy, for instance, subset partition, which is combined with a deep neural network to activate a full object extent. In contrast, Tang et al. [20] proposed a two-stage region proposal network that explores the responses in mid-layers of a network to create object proposals. The process creates coarse proposals using an objectness score metric and sliding window boxes. Later, the coarse proposals are refined using a region-based CNN classifier, and the refined proposals are used to train the network proposed in [19].

In summary, existing WSOD approaches vary regarding the selection of candidate proposals, the strategy for mining instances, and the underlying classification network that guides the supervision, which leads to different levels of complexity for both implementation and training times. This paper focuses mostly on the instance selection part, and we used the continuation function proposed in [23] as inspiration to adaptively select positive and negative instances. We also present an additional step to the refinement supervision of [19]. The proposed method is presented next.

3. The Proposed Approach

Since we propose improvements to boost OICR’s pipeline [19], we will try to follow the same notation of the original paper, and Fig. 1 shows a high-level diagram of all stages of the proposed architecture. The first stage aims to extract feature vectors from a given image, and candidate proposals are extracted using selective search [22]. The image and the extracted proposals feed a CNN backbone with SPP to produce a fixed-size feature map for each proposal. The proposal feature maps are converted to proposal feature vectors using two fully connected (fc) layers, which are branched into three different stages. The first two stages are similar to those of [19], where the first one trains a basic instance classifier, and the second stage trains a set of K refinement agents. The k-th refinement agent uses as supervision the output from the previous agent k − 1, and the supervision for the first refinement agent (k = 1) comes from the instance classifier branch. The third stage, proposed by us, utilizes the knowledge of all K refinement agents to train a new agent. We call this process knowledge distillation as it aims to extract extra knowledge during the refinement process.

Figure 1: The proposed architecture and its four modules. The proposals feature extraction module uses an SPP layer to extract features from proposals generated by selective search. The multiple instance detection network module learns to select the best proposal instance and generates an image classification score. The instance refinement modules have K instances, and each one learns to refine instances from its predecessor result. Finally, the knowledge distillation module aggregates all the knowledge learned by all the K refinement agents.

In this section, we will explain all the employed stages in detail. Also, in Section 3.4, we explain the adaptive supervision aggregation function that is employed by all refinement agents during the learning process.

3.1. Instance selection

Following [19], we use the method proposed by [1] because of its effectiveness and implementation convenience. The instance selection works by branching the proposal feature vectors into two streams, and each stream starts with an fc layer to produce two matrices $x^c, x^d \in \mathbb{R}^{C \times |R|}$, where C is the number of classes and |R| is the number of proposals. A softmax function is applied to both matrices along different dimensions, yielding

$$[\sigma(x^c)]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{kj}}}, \qquad [\sigma(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{|R|} e^{x^d_{ik}}}. \tag{1}$$

The two streams are then combined to generate proposal scores using the Hadamard (element-wise) matrix product, yielding $x^R = \sigma(x^c) \odot \sigma(x^d)$. Finally, the classification score $\phi_c \in (0, 1)$ for class c is obtained by summing over the proposal dimension, i.e., $\phi_c = \sum_{r=1}^{|R|} x^R_{cr}$. We train the instance classifier using multi-class cross entropy loss, defined as

$$L_{class} = -\sum_{c=1}^{C} \left[ y_c \log \phi_c + (1 - y_c) \log(1 - \phi_c) \right], \tag{2}$$

where $y_c \in \{0, 1\}$ indicates if the image contains any instance of class c. More details can be found in [1, 19].

Figure 2 (panels: λ = 0.5, λ = 0.25, λ = 0.1, λ = 0.01): Effect of changing the IoU threshold λ for instance selection. Green boxes denote the supervision, blue boxes pass the threshold (selected), and red boxes fail (not selected).

3.2. Classifier refinement agents

To refine the outputs of the instance classifier, we use the online labeling and refinement strategy proposed by [19]. Here we refer to the k-th refinement pass as the k-th refinement agent. In contrast with the instance classifier, each refinement agent outputs an additional dimension for background in its score vector $x^{Rk}_j \in \mathbb{R}^{(C+1) \times 1}$, $k \in \{1, 2, \ldots, K\}$, where k is the index of the agent, K is the total number of agents, and the (C+1)-th dimension relates to the background. The score vector from the instance classifier is represented here as $x^{R0}_j \in \mathbb{R}^{C \times 1}$, and is used to initialize the refinements. To obtain $x^{Rk}_j$ for k > 0, the feature vector related to the proposals is passed through a single fc layer, and a softmax layer is applied over the class dimension.

Each agent needs some kind of supervision to learn how to separate the proposals related to the background from those related to ground-truth classes. Thus, the supervision for agent k is obtained from the previous agent $x^{R(k-1)}$ and a supervision label vector is created for each proposal j in the format $Y^k_j = [y^k_{1j}, y^k_{2j}, \cdots, y^k_{(C+1)j}]^T \in \mathbb{R}^{(C+1) \times 1}$. To build $Y^k_j$, first the proposal with the highest score is selected from the supervision of agent k − 1, as given in Eq. (3):

$$j^{k-1}_c = \arg\max_r \, x^{R(k-1)}_{cr}. \tag{3}$$

The highest score proposal is labeled as belonging to class c, i.e., $y^k_{c j^{k-1}_c} = 1$ and $y^k_{c' j^{k-1}_c} = 0$, $c' \neq c$. Next, proposals with high overlap with $j^{k-1}_c$ are labeled as belonging to the same class as $j^{k-1}_c$, otherwise the adjacent proposals are labeled as background. More precisely, this assignment is given by

$$c^{*k}_j = \begin{cases} c, & \text{if } IoU(j^{k-1}_c, j^k_{cj}) \geq \lambda \\ C + 1, & \text{otherwise} \end{cases} \tag{4}$$

where λ is the IoU threshold. We claim in this work that selecting a fixed value for λ might not be the best choice, and present our dynamic threshold in Section 3.4. Each $y^k_{cj}$ is updated using $c^{*k}_j$, that is, $y^k_{c^{*k}_j j} = 1$. Meanwhile, if there is no object c in the image, all values are set to zero, i.e., $y^k_{cj} = 0$.

Now that $y^k_{cj}$ is ready, it can be used as supervision to train the k-th refinement agent using the loss function in Eq. (5):

$$L^k_{agent} = -\frac{1}{|R|} \sum_{r=1}^{|R|} \sum_{c=1}^{C+1} w^k_r \, y^k_{cr} \log x^{Rk}_{cr}, \tag{5}$$

where $w^k_r$ is a weight term introduced to reduce noise during the supervision and is obtained as $w^k_r = x^{R(k-1)}_{c j^{k-1}_c}$. More details can be found in [19].

3.3. Knowledge distillation module

The motivation behind cascading K refinement agents in [19] is that it allows the detector to gradually learn larger parts of objects, starting from the best instance only. However, we can observe that the supervision generated by the k-th agent will not be directly used by the (k + 2)-th agent. This happens because agent k + 1 will learn with the supervision k and will pass its own supervision to the next agent k + 2. In other words, during the agent supervision process, some knowledge could be lost between the connections of the agents. We try to recover this information loss using our knowledge distillation module. The distillation agent is a special kind of agent that learns using all the K outputs as supervision. In reality, this agent only differs in the supervision part when compared with a standard refinement agent.

The distillation agent also outputs a score vector in the format $x^{D}_j \in \mathbb{R}^{(C+1) \times 1}$. To obtain $x^{D}_j$, the proposals-related feature vector is passed through a single fc layer, and a softmax layer is applied over the class dimension.

The supervision process of the distillation agent, instead of getting the supervision from a previous agent, uses all
refinement agent outputs as supervision. More precisely, it is computed by averaging the outputs of the K refinement agents:

$$x^{D}_{cj} = \frac{1}{K} \sum_{k=1}^{K} x^{Rk}_{cj}. \tag{6}$$

Using $x^{D}_{cj}$ as the input to the supervision, the remaining process is similar to the one described in Section 3.2, and the loss function $L_{distill}$ is the same as the weighted softmax loss in Eq. (5).

3.4. Adaptive supervision aggregation function

In [19], the authors experimentally chose λ = 0.5 as the proposal selection scheme in Eq. (4) to create the supervision matrices $w^k_r$ and $y^k_{cr}$. The interpretation of this value is that only boxes with IoU > 0.5 w.r.t. the best overall proposal are selected as belonging to the ground-truth class c. The problem with using a fixed value is that at the beginning of training, the instance selection module tends to select only small boxes as top score proposals, typically related to discriminant features of the objects (e.g., the face of a person or animal, as shown in Fig. 2). As a consequence, only other small boxes will have IoU > 0.5 w.r.t. this box, and hence only small boxes will be considered as belonging to the class c. Figure 2 shows the effect of changing λ, where green denotes the best proposal, and blue the similar proposals according to the selected threshold.

Although the goal of refinement agents is to gradually improve the detectors to find larger parts of objects, starting with a larger value for λ causes each agent to highlight only small boxes in the beginning of the process, and in some cases, the optimization will be stuck in small boxes during all training (especially for deformable objects). Relaxing λ alleviates this issue, but it also tends to include proposals that are not related to the correct class.

Instead of using a fixed value for λ, we use an adaptive supervision function that changes λ during the training process. The function should be monotonically increasing, such that more candidates are aggregated in the beginning and fewer at the end. During our experiments, we evaluated a set of different adaptive supervision aggregation functions, and the best results were achieved using the following function, also explored by C-MIL in a different context [23]:

$$\lambda = \frac{1}{2} \cdot \frac{\log(s + l_b) - \log l_b}{\log(S + l_b) - \log l_b}, \tag{7}$$

where s is the current training step, S is the total number of training steps, and $l_b$ defines the velocity at which the curve grows.

Figure 4: A visual interpretation of the proposed adaptive supervision aggregation function. X-axis shows the iteration step number, and Y-axis shows the IoU with the box of the highest score.

Another deficiency of the supervision selection approach given by Eq. (4) is that when more than one instance of a class is present in the image, it will necessarily label all other instances as background during the supervision (since their IoU with the best instance is small, and in general null). This is a bad decision, as we do not want to lower the scoring of these instances. In Fig. 3, we present a visual example of this problem, considering the “chicken” class. In the figure, the rectangles are the candidate proposals, with the best one shown in green. Boxes shown in blue indicate proposals considered similar to the best one, according to Eq. (4), which leaves several proposals related to the chicken class (in yellow) marked as background, which is not desirable.

Figure 3: A visual example of instance mining for the “chicken” class, where the green box is the best instance. Boxes in blue present large IoU, boxes in red present small (but not zero) IoU, and boxes in yellow have zero IoU.

One solution to the penalization of other instances in the loss is to include the “trick” proposed by [20], where a threshold value $\lambda_{ign}$ is used to ignore boxes with a low IoU w.r.t. $j^{k-1}_c$ in the loss. With the trick, all the instances of Fig. 3 in yellow would be ignored, and the ones in red would be marked as background.

In contrast to [20], where $\lambda_{ign}$ has a fixed value, we propose to use an adaptive value similar to the scheme used for mining positive instances. Although the choice for $\lambda_{ign}$ could be independent from λ, we propose a “complementary” threshold selection scheme given by

$$\lambda_{ign} = \lambda_{max} - \lambda, \tag{8}$$
where $\lambda_{max}$ defines the starting point of the adaptive trick. Fig. 4 presents the visual interpretation of λ and $\lambda_{ign}$ during the supervision process. Thus, we can adapt Eq. (4) to include the trick as defined in Eq. (9), leading to

$$c^{*k}_j = \begin{cases} c, & \text{if } IoU(j^{k-1}_c, j^k_{cj}) \geq \lambda, \\ C + 1, & \text{if } IoU(j^{k-1}_c, j^k_{cj}) \geq \lambda_{ign}, \\ -1, & \text{otherwise} \end{cases} \tag{9}$$

where −1 defines indices to be ignored in the agent loss functions.

3.5. Final loss function

The classification, refinement, and distillation modules present individual loss functions. However, we train our model using a single loss that combines the individual loss functions, given by

$$L = L_{class} + L_{distill} + \sum_{k=1}^{K} L^k_{agent}. \tag{10}$$

4. Experiments

Boosted-OICR was evaluated on the challenging PASCAL VOC 2007 and 2012 datasets [7]. Although the ground truth bounding box annotations are present in these datasets, we only use the (weak) classification annotations (presence or absence of a class in the given image). The performed evaluation is based on the two standard metrics in WSOD, that is, mean average precision (mAP) [7] and correct localization (CorLoc) [4]. The former provides a measure of how well the detector adapts to all instances, while the latter indicates if the best detection is a good match. Both metrics use the PASCAL criterion of IoU > 0.5 between ground truths and predicted boxes.

4.1. Implementation Details

All experiments were performed using PyTorch 1.2 [12] (source code available at: http://github.com/luiszeni/Boosted-OICR). Our method uses VGG16 [15] pre-trained on ImageNet [3] as backbone. We replaced the last max-pooling layer by the SPP layer, and the last FC layer and softmax loss layer by the layers described in Section 3. The new layers are initialized using Gaussian distributions with zero mean and standard deviation 0.01. Biases are initialized to 0. The object proposals are extracted using Selective Search [22]. For data augmentation, the input images were re-sized into five scales {480, 576, 688, 864, 1200} concerning the smallest image dimension. During training time, the scale of the image was randomly selected, and the image was randomly horizontally flipped, which is a standard approach among WSOD methods [23, 19, 24, 18] and creates a total of ten augmented images. The learning process was done using the SGD algorithm with momentum 0.9, weight decay 5e-4, and batch size 2. We set $l_b = 100$ and $\lambda_{max} = 0.51$. The learning rate is set to 0.001 for the first 30K and 60K iterations and then decreases to 0.0001 in the following 20K and 30K iterations, respectively, for Pascal VOC 2007 and 2012. During test time, all ten images are passed through the network, and the outputs are averaged. As an additional result, we also trained a supervised object detector by choosing top-scoring proposals as ground truth labels, as done in [19, 18, 23]. To make a fair comparison, we also trained a Fast RCNN (FRCNN) [8] detection network using the five image scales. The supervision boxes are chosen by their score (larger than 0.3) and using non-maxima suppression (with 30% IoU threshold).

4.2. Ablation experiments

We conduct some ablation experiments to illustrate the effectiveness of the proposed improvements over the baseline method OICR [19].

We first study the impact of using the adaptive supervision aggregation function instead of fixed IoU thresholds for proposal mining. We display the different scenarios in Table 1. The experiment with ID=1 presents the results using the standard OICR pipeline. In the experiment ID=2 we replace the fixed λ value by the proposed adaptive aggregation function defined in Eq. (7); in this experiment all boxes with IoU < λ are considered as background. As the experiment suggests, using the adaptive supervision aggregation function alone without the adaptive trick makes the results worse than the OICR baseline. However, adding the adaptive trick (experiment ID=3) leads to an improvement of 4.3% in the final mAP, suggesting that using our adaptive supervision aggregation function can boost the OICR detection mAP significantly.

ID  K  λ         λign      distillation  mAP
1   3  0.5       0         No            42.3
2   3  adaptive  0         No            41.6
3   3  adaptive  adaptive  No            46.6
4   3  adaptive  adaptive  Yes           49.7
5   4  adaptive  adaptive  No            48.1

Table 1: Ablation study performance (%) on the VOC 2007.

We also evaluated the effect of including the distillation refinement module. In fact, one could argue that using such a module could produce the same result as cascading one more refinement agent. To show the difference, we tested our method using K = 4 (and no distillation) vs. K = 3 with distillation, and results with distillation were considerably better (see experiments ID=4 vs. ID=5 in Table 1). As we can see, adding the knowledge distillation improves
Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
Network: VGG16
WSDDN [1] 46.4 58.3 35.5 25.9 14.0 66.7 53.0 39.2 8.9 41.8 26.6 38.6 44.7 59.0 10.8 17.3 40.7 49.6 56.9 50.8 39.2
OICR [19] 58.0 62.4 31.1 19.4 13.0 65.1 62.2 28.4 24.8 44.7 30.6 25.3 37.8 65.5 15.7 24.1 41.7 46.9 64.3 62.6 42.0
WCCN [5] 49.5 60.6 38.6 29.2 16.2 70.8 56.9 42.5 10.9 44.1 29.9 42.2 47.9 64.1 13.8 23.5 45.9 54.1 60.8 54.5 42.8
TS2C [25] 59.3 57.5 43.7 27.3 13.5 63.9 61.7 59.9 24.1 46.9 36.7 45.6 39.9 62.6 10.3 23.6 41.7 52.4 58.7 56.6 44.3
WeakRPN [21] 57.9 70.5 37.8 5.7 21.0 66.1 69.2 59.4 3.4 57.1 57.3 35.2 64.2 68.6 32.8 28.6 50.8 49.5 41.1 30.0 45.3
PCL [18] 54.4 69.0 39.3 19.2 15.7 62.9 64.4 30.0 25.1 52.5 44.4 19.6 39.3 67.7 17.8 22.9 46.6 57.5 58.6 63.0 43.5
MELM [24] 55.6 66.9 34.2 29.1 16.4 68.8 68.1 43.0 25.0 65.6 45.3 53.2 49.6 68.6 2.0 25.4 52.5 56.8 62.1 57.1 47.3
C-MIL [23] 62.5 58.4 49.5 32.1 19.8 70.5 66.1 63.4 20.0 60.5 52.9 53.5 57.4 68.9 8.4 24.6 51.8 58.7 66.7 63.5 50.5
Ours 68.6 62.4 55.5 27.2 21.4 71.1 71.6 56.7 24.7 60.3 47.4 56.1 46.4 69.2 2.7 22.9 41.5 47.7 71.1 69.8 49.7
Network: FRCNN Re-train
OICR [19] 65.5 67.2 47.2 21.6 22.1 68.0 68.5 35.9 5.7 63.1 49.5 30.3 64.7 66.1 13.0 25.6 50.0 57.1 60.2 59.0 47.0
TS2C [25] - - - - - - - - - - - - - - - - - - - - 48.0
PCL [18] 63.2 69.9 47.9 22.6 27.3 71.0 69.1 49.6 12.0 60.1 51.5 37.3 63.3 63.9 15.8 23.6 48.8 55.3 61.2 62.1 48.8
WeakRPN [21] 63.0 69.7 40.8 11.6 27.7 70.5 74.1 58.5 10.0 66.7 60.6 34.7 75.7 70.3 25.7 26.5 55.4 56.4 55.5 54.9 50.4
C-MIL [23] 61.8 60.9 56.2 28.9 18.9 68.2 69.6 71.4 18.5 64.3 57.2 66.9 65.9 65.7 13.8 22.9 54.1 61.9 68.2 66.1 53.1
Ours 65.8 58.6 55.0 32.4 19.5 74.2 71.4 70.9 19.2 54.8 46.2 67.5 57.0 65.6 1.4 16.7 40.4 53.0 69.5 61.1 50.0

Table 2: Detection performance (%) on the VOC 2007 test set. Comparison to the state of the art.
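
The “Ours” rows in Table 2 rely on the adaptive thresholds of Eqs. (7)-(9), with lb = 100 and λmax = 0.51 as reported in Section 4.1. As a rough illustration of how these thresholds evolve and how they label proposals, the sketch below implements the three equations in plain Python. It is not the authors' implementation: the function names, the use of precomputed IoU values as input, and the 1-based background index C + 1 returned as an integer are assumptions made for this example.

```python
import math

IGNORE = -1  # label for proposals left out of the agent loss, as in Eq. (9)


def adaptive_lambda(step, total_steps, lb=100.0):
    """Eq. (7): IoU threshold that grows from 0 (step 0) to 0.5 (last step)."""
    num = math.log(step + lb) - math.log(lb)
    den = math.log(total_steps + lb) - math.log(lb)
    return 0.5 * num / den


def ignore_lambda(lam, lam_max=0.51):
    """Eq. (8): complementary threshold below which proposals are ignored."""
    return lam_max - lam


def assign_labels(ious, cls, num_classes, lam, lam_ign):
    """Eq. (9): label proposals from their IoU with the top-scoring proposal.

    ious        -- IoU of each proposal with the top-scoring box (assumed precomputed)
    cls         -- class index c of the top-scoring box
    num_classes -- C; the background label is C + 1
    """
    labels = []
    for overlap in ious:
        if overlap >= lam:
            labels.append(cls)              # same class as the best box
        elif overlap >= lam_ign:
            labels.append(num_classes + 1)  # background
        else:
            labels.append(IGNORE)           # excluded from the loss
    return labels
```

Under Eq. (9) as written here, the background case can only trigger once λ has grown past λign; earlier in training every proposal is either aggregated as class c or ignored. For instance, calling `assign_labels([1.0, 0.3, 0.0], cls=7, num_classes=20, lam=0.5, lam_ign=0.01)` labels the three proposals as class 7, background (21), and ignored (−1), respectively.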

Method aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv mAP
Network: VGG16
WSDDN [1] 65.1 58.8 58.5 33.1 39.8 68.3 60.2 59.6 34.8 64.5 30.5 43.0 56.8 82.4 25.5 41.6 61.5 55.9 65.9 63.7 53.5
OICR [19] 81.7 80.4 48.7 49.5 32.8 81.7 85.4 40.1 40.6 79.5 35.7 33.7 60.5 88.8 21.8 57.9 76.3 59.9 75.3 81.4 60.6
WCCN [5] 83.9 72.8 64.5 44.1 40.1 65.7 82.5 58.9 33.7 72.5 25.6 53.7 67.4 77.4 26.8 49.1 68.1 27.9 64.5 55.7 56.7
TS2C [25] 84.2 74.1 61.3 52.1 32.1 76.7 82.9 66.6 42.3 70.6 39.5 57.0 61.2 88.4 9.3 54.6 72.2 60.0 65.0 70.3 61.0
WeakRPN [21] 77.5 81.2 55.3 19.7 44.3 80.2 86.6 69.5 10.1 87.7 68.4 52.1 84.4 91.6 57.4 63.4 77.3 58.1 57.0 53.8 63.8
PCL [18] 79.6 85.5 62.2 47.9 37.0 83.8 83.4 43.0 38.3 80.1 50.6 30.9 57.8 90.8 27.0 58.2 75.3 68.5 75.7 78.9 62.7
MELM [24] - - - - - - - - - - - - - - - - - - - - 61.4
C-MIL [23] - - - - - - - - - - - - - - - - - - - - 65.0
Ours 86.7 73.3 72.4 55.3 46.9 83.2 87.5 64.5 44.6 76.7 46.4 70.9 67.0 88.0 9.6 56.4 69.1 52.4 79.8 82.8 65.7

Table 3: Localization performance (%) on the VOC 2007 trainval set. Comparison to the state of the art.
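
The localization figures in Table 3 are CorLoc values, which Section 4 describes as checking whether the best detection is a good match under the PASCAL criterion of IoU > 0.5. A minimal sketch of that check for a single class is given below. It is only an illustration under stated assumptions: boxes are taken to be (x1, y1, x2, y2) tuples, each image is assumed to contribute exactly one top-scoring box, and the function names are hypothetical rather than taken from the released code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(top_boxes, gt_boxes, threshold=0.5):
    """Percentage of images whose top-scoring box matches a ground-truth box.

    top_boxes -- one top-scoring box per image that contains the class
    gt_boxes  -- for each of those images, the list of ground-truth boxes
                 of the same class
    """
    hits = 0
    for pred, gts in zip(top_boxes, gt_boxes):
        # A hit requires IoU strictly above the threshold, following the
        # "IoU > 0.5" wording used in Section 4.
        if any(iou(pred, gt) > threshold for gt in gts):
            hits += 1
    return 100.0 * hits / len(top_boxes)
```

The per-class values would then be averaged over the 20 VOC classes to obtain the last column of the table; that averaging step is omitted here.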

Method          mAP    CorLoc
WCCN [5]        37.9   -
OICR [19]       37.9   62.1
TS2C [25]       40.0   64.4
WeakRPN [21]    40.8   64.9
PCL [18]        40.6   63.2
MELM [24]       42.4   -
C-MIL [23]      46.6   67.4
Ours            *      66.3

Table 4: Detection (test set) and localization (trainval set) performance (%) on the VOC 2012 dataset using VGG16.

the results by 1.6% mAP more than adding an extra refinement agent. We select the model used in experiment ID=4 as the default for the following experiments.

4.3. Comparison with state-of-the-art

We compare our results with other state-of-the-art (SOTA) methods on the Pascal VOC 2007 and 2012 datasets. Table 2 compares the detection performance of our method and SOTA methods on the Pascal VOC 2007 test set. Boosted-OICR improves on the original OICR paper [19] by 7.7% mAP and outperforms other approaches such as WCCN [5] (6.9%), TS2C [25] (5.4%), WeakRPN [21] (4.4%), PCL [18] (6.2%), and MELM [24] (2.4%). Boosted-OICR is inferior only to C-MIL [23], by a small margin (0.8% mAP). However, Boosted-OICR achieves the highest AP in 9 of the 20 classes (aeroplane, bird, bottle, bus, car, dog, motorbike, train, and tv). Figure 5 presents some results generated by our WSOD method. We also re-trained a Fast-RCNN detector using the learned pseudo objects as ground truth and achieved 50% mAP, as shown in Table 2, which improved our method by 0.3%.

Figure 5: Detection examples for the Pascal VOC 2007 dataset. Blue rectangles are ground-truth boxes that have at least one detection with IoU > 0, and yellow ones are ground-truth boxes with no intersecting detection. Green boxes are correct detections (IoU > 0.5 with ground truth), and red boxes are wrong detections. The label in each detection box is the class label and confidence score of the detection.

Table 3 compares the localization performance of our method and SOTA methods on the Pascal VOC 2007 trainval set. Boosted-OICR outperformed OICR [19] (5.1%), WCCN [5] (9.0%), TS2C [25] (4.7%), WeakRPN [21] (1.9%), PCL [18] (3.0%), MELM [24] (4.3%), and C-MIL [23] (0.7%). The better CorLoc result of our method in comparison with C-MIL suggests that C-MIL is only slightly better at dealing with images with more than one instance (which impacts the final detection mAP). We also compare the localization performance of our method on Pascal VOC 2012² in Table 4. Boosted-OICR presents a competitive CorLoc on VOC 2012, outperforming OICR [19] (4.2%), TS2C [25] (1.9%), WeakRPN [21] (1.4%), and PCL [18] (3.1%), while being inferior to C-MIL [23] by 1.1%.

² We submitted our results for VOC 2012 to the evaluation server, but still did not get the feedback. The anonymous submission link is http://host.robots.ox.ac.uk:8080/anonymous/E7JSMD.html

5. Conclusions

In this paper, we propose two improvements to boost the online instance classifier refinement. First, we propose a knowledge distillation methodology that extracts extra knowledge from the refinement agents. Second, we propose an adaptive supervision aggregation function that improves the way that each refinement agent learns to separate

class-related instances, background instances, and instances to ignore. Both contributions were built using OICR as the baseline approach, and together they provide a 7.4 mAP boost over the OICR baseline method. Boosted-OICR presents competitive SOTA results on the Pascal VOC 2007 dataset, being inferior only to [23] by a small margin (0.8% mAP). Also, Boosted-OICR achieves the highest AP in 9 of the 20 classes, such as aeroplane, bird, bottle, and train. Although Boosted-OICR has the best performance in these classes, it fails on deformable objects such as the person class. In fact, the person class is very challenging, since the GT annotations might contain only the face or upper body (when there are occlusions), or the whole body.

In the future, we intend to explore improvements that keep WSOD methods from focusing on the most discriminative part of deformable objects, such as the human face. We further plan to explore mid-layers of the network and class activation maps to create object proposals as an alternative to the selective search module.

Acknowledgments

The authors would like to thank Brazilian funding agencies CNPq and CAPES (Finance Code 001), as well as NVIDIA Corporation for the donation of a Titan Xp Pascal GPU used for this research.
References

[1] Hakan Bilen and Andrea Vedaldi. Weakly Supervised Deep Detection Networks. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016-Decem:2846–2854, 2016.
[2] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 39(1):189–203, 2016.
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
[4] Thomas Deselaers, Bogdan Alexe, and Vittorio Ferrari. Weakly supervised localization and learning with generic knowledge. International journal of computer vision, 100(3):275–293, 2012.
[5] Ali Diba, Vivek Sharma, Ali Pazandeh, Hamed Pirsiavash, and Luc Van Gool. Weakly supervised cascaded convolutional networks. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-Janua:5131–5139, 2017.
[6] Thomas G. Dietterich, Richard H. Lathrop, and Tomas Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International journal of computer vision, 88(2):303–338, 2010.
[8] Ross Girshick. Fast R-CNN. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
[10] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly Supervised Object Localization with Progressive Domain Adaptation. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3512–3520, 2016.
[11] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[12] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
[13] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7263–7271, 2017.
[14] Olga Russakovsky, Yuanqing Lin, Kai Yu, and Li Fei-Fei. Object-centric spatial pooling for image classification. Computer Vision–ECCV 2012, pages 1–15, 2012.
[15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[16] Parthipan Siva, Chris Russell, Tao Xiang, and Lourdes Agapito. Looking beyond the image: Unsupervised learning for object saliency and detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3238–3245, 2013.
[17] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems, pages 1637–1645, 2014.
[18] Peng Tang, Xinggang Wang, Song Bai, Wei Shen, Xiang Bai, Wenyu Liu, and Alan Loddon Yuille. PCL: Proposal cluster learning for weakly supervised object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
[19] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple instance detection network with online instance classifier refinement. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-January:3059–3067, 2017.
[20] Peng Tang, Xinggang Wang, Angtian Wang, Yongluan Yan, Wenyu Liu, Junzhou Huang, and Alan Yuille. Weakly supervised region proposal network and object detection. In Proceedings of the European conference on computer vision (ECCV), pages 352–368, 2018.
[21] Peng Tang, Xinggang Wang, Angtian Wang, Yongluan Yan, Wenyu Liu, Junzhou Huang, and Alan Yuille. Weakly Supervised Region Proposal Network and Object Detection. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11215 LNCS:370–386, 2018.
[22] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International journal of computer vision, 104(2):154–171, 2013.
[23] Fang Wan, Chang Liu, Wei Ke, Xiangyang Ji, Jianbin Jiao, and Qixiang Ye. C-MIL: Continuation Multiple Instance Learning for Weakly Supervised Object Detection. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1:2199–2208, 2019.
[24] Fang Wan, Pengxu Wei, Zhenjun Han, Jianbin Jiao, and Qixiang Ye. Min-Entropy Latent Model for Weakly Supervised Object Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2019.
[25] Yunchao Wei, Zhiqiang Shen, Bowen Cheng, Honghui Shi, Jinjun Xiong, Jiashi Feng, and Thomas Huang. TS2C: Tight box mining with surrounding segmentation context for weakly supervised object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pages 434–450, 2018.
[26] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2921–2929. IEEE, 2016.
[27] C Lawrence Zitnick and Piotr Dollár. Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer, 2014.
