YOLO V2 For Object Detection
YOLOv2
Agenda: Topics Covered
Why YOLOv2?
Batch Normalization
Batch Norm is a normalization technique applied between the layers of a neural network rather than to the raw input data. YOLOv2 adds it to all the convolutional layers, which improved mAP by more than 2%. Because Batch Norm also acts as a regularizer, they removed the Dropout layer which was used in YOLOv1.
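As a rough illustration, a Darknet-style convolutional block with Batch Norm (and no Dropout) might look like this minimal PyTorch sketch; the channel sizes and layer choices here are placeholders, not the exact Darknet-19 configuration:

```python
import torch.nn as nn

def conv_bn_block(in_ch, out_ch, kernel_size=3):
    # Conv -> BatchNorm -> LeakyReLU: Batch Norm sits between the layers,
    # and its regularizing effect is what lets Dropout be removed.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),  # bias folded into BN
        nn.BatchNorm2d(out_ch),  # normalizes activations, not the raw data
        nn.LeakyReLU(0.1),
    )
```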
YOLOv1 used a fixed 448 x 448 input resolution. In YOLOv2, they resize the input image randomly to different resolutions between 320 x 320 and 608 x 608 (the resolution is always a multiple of 32), choosing a new size every 10 batches.
This multi-scale training can be thought of as a form of augmentation: it forces the network to learn to predict well across a variety of input dimensions. This increased the mAP by 1.5%.
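A minimal sketch of such a multi-scale training loop (PyTorch assumed; the loader, loss function, and target rescaling are simplified placeholders):

```python
import random
import torch.nn.functional as F

# Resolutions from 320 to 608 in steps of 32, as described above.
SCALES = [320 + 32 * i for i in range(10)]

def train_one_epoch(model, loader, optimizer, loss_fn):
    size = random.choice(SCALES)
    for step, (images, targets) in enumerate(loader):
        if step % 10 == 0:  # pick a new resolution every 10 batches
            size = random.choice(SCALES)
        images = F.interpolate(images, size=(size, size), mode="bilinear",
                               align_corners=False)
        # NOTE: a real pipeline would rescale the box targets to match.
        loss = loss_fn(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```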
Anchor Boxes
YOLOv1 assigns each object to the grid cell that contains the center of that object.
Using this idea, the red cell in the image to the right must detect both the man and his necktie; but since any grid cell can only detect one object, a problem arises here.
YOLOv1 predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor.
But this produces a significant amount of localization error. It is easier to predict offsets relative to anchor boxes than to predict the coordinates directly.
Anchor Boxes
In this image, we have a grid cell (red) and 5 anchor boxes (yellow) with different shapes (aspect ratios).
Someone may ask: how and why did they choose these 5 boxes? Rather than picking the priors by hand, the authors ran k-means clustering on the bounding boxes of the training set and used the resulting dimension clusters as anchors. Using these cluster centers helped increase mAP by 5%.
[Figure: left, the training-set bounding boxes without any manipulation; right, the dimension clusters obtained from the training images.]
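A minimal NumPy sketch of this clustering, using the paper's distance metric d(box, centroid) = 1 - IoU(box, centroid); the boxes are assumed to be (width, height) pairs, and edge cases such as empty clusters are ignored:

```python
import numpy as np

def iou_wh(boxes, centroids):
    # IoU between (w, h) pairs, treating every box as centered at the origin.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0:1] * boxes[:, 1:2] +
             (centroids[:, 0] * centroids[:, 1])[None, :] - inter)
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100):
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Minimizing d = 1 - IoU is the same as maximizing IoU.
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.array([boxes[assign == j].mean(axis=0)
                              for j in range(k)])
    return centroids
```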
Direct location prediction
When they used anchor boxes with YOLO, they encountered model instability, especially during the initial iterations. Most of this instability comes from predicting the (x, y) location of the box: with the unconstrained anchor-box parameterization, a predicted box can end up anywhere in the image, regardless of which cell made the prediction.
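To fix this, YOLOv2 predicts the location relative to the grid cell and squashes the prediction with a logistic (sigmoid) activation. For a cell whose offset from the image's top-left corner is (c_x, c_y), and a prior with width p_w and height p_h, the paper computes the box as:

```latex
b_x = \sigma(t_x) + c_x \qquad b_y = \sigma(t_y) + c_y \qquad
b_w = p_w e^{t_w} \qquad b_h = p_h e^{t_h} \qquad
\Pr(\text{object}) \cdot \text{IOU}(b, \text{object}) = \sigma(t_o)
```

Since sigma(t) lies in (0, 1), the predicted center can never leave its grid cell, which makes training much more stable.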
Comparison to Other Detection Systems
The graph and table below show how different methods perform with respect to precision and speed on the VOC 2007 dataset and the VOC 2007 + 2012 dataset, respectively.
YOLO9000
Sometimes we need a model that can detect more than 20 classes (the PASCAL VOC label set), and that is what YOLO9000 does. It is a real-time framework for detecting more than 9000 object categories by jointly optimizing detection and classification.
As we mentioned previously, YOLOv2 was trained first for classification and then for detection. This is because the dataset for classification (which contains one object per image) is different from the dataset for detection. For YOLO9000, the authors propose a mechanism for jointly training on classification and detection data.
During training, they mix images from both the detection and classification datasets. When the network sees an image labeled for detection, the full YOLOv2 loss function is backpropagated. When it sees a classification image, only the loss from the classification-specific parts of the architecture is backpropagated.
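A minimal sketch of that routing logic (PyTorch-style; the loss functions and the per-batch dataset flag are illustrative assumptions, not the paper's code):

```python
def joint_training_step(model, batch, optimizer, det_loss_fn, cls_loss_fn):
    images, targets, is_detection = batch  # flag: which dataset the batch came from
    preds = model(images)
    if is_detection:
        # Detection image: backpropagate the full YOLOv2 loss
        # (objectness + box coordinates + classification).
        loss = det_loss_fn(preds, targets)
    else:
        # Classification image: backpropagate only the loss from the
        # classification-specific parts of the architecture.
        loss = cls_loss_fn(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```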
YOLO9000
The idea of mixing detection and classification data faces a key challenge in merging the labels: detection datasets have only common objects and general labels, like "dog" or "boat", while classification datasets have a much wider and deeper range of labels. For example, the ImageNet dataset has more than a hundred breeds of dog, like "German shepherd" and "Bedlington terrier."
YOLO9000
To merge these two datasets, the authors created a hierarchical model of visual concepts, built from WordNet, and called it WordTree.
As we can see, all the classes sit under the root (physical object). They trained the Darknet-19 model on WordTree: they took the 1000 classes of the ImageNet dataset and added all of their intermediate nodes from WordTree, which expands the label space from 1000 to 1369; they called this WordTree1k. The size of the output layer of Darknet-19 therefore became 1369 instead of 1000.
YOLO9000
For these 1369 predictions, we don't compute one softmax over everything; instead, we compute a separate softmax over all the synsets that are hyponyms of the same concept, i.e. the children of the same parent node.
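A minimal NumPy sketch of this idea; the `groups` structure (lists of output indices whose classes share a parent) is an assumed representation of WordTree, not the paper's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def wordtree_probs(logits, groups):
    # One softmax per group of co-hyponyms (children of the same parent),
    # instead of a single softmax over all 1369 outputs.
    probs = np.empty_like(logits)
    for idx in groups:  # e.g. [[0, 1, 2], [3, 4], ...]
        probs[idx] = softmax(logits[idx])
    return probs
```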
Despite adding 369 additional concepts, Darknet-19 still achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy.
The detector predicts a bounding box and the tree of probabilities, but since we use more than one softmax, we need to traverse the tree to find the predicted class.
YOLO9000
We traverse the tree from top to bottom, taking the highest-confidence path at every split, until the next node's probability falls below a threshold; we then predict the class of the node where we stopped.
For example, if the input image contains a dog, the tree of probabilities will look like the tree below.
Instead of assuming every image contains an object, we use YOLOv2's objectness predictor to give us the value of Pr(physical object), which is the root of the tree.
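Because each softmax is taken over siblings, every node's score is a probability conditioned on its parent, and the absolute probability of any node is the product of the conditional probabilities along its path from the root. The paper's example:

```latex
\Pr(\text{Norfolk terrier}) = \Pr(\text{Norfolk terrier} \mid \text{terrier})
\cdot \Pr(\text{terrier} \mid \text{hunting dog}) \cdots
\Pr(\text{animal} \mid \text{physical object}) \cdot \Pr(\text{physical object})
```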
YOLO9000
The model outputs a softmax for each branch level. Moving from top to bottom, we choose the node with the highest probability, as long as that probability is above a threshold value. The prediction is the node where we stop.
In the tree above, the model goes through physical object => dog => hunting dog. It stops at 'hunting dog' and does not go down to sighthound (a type of hunting dog) because sighthound's confidence is below the confidence threshold, so the model predicts hunting dog, not sighthound.
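A minimal Python sketch of this traversal; the `children` map and the probabilities below are made-up illustrations of the dog example, not real model output:

```python
def predict_class(probs, children, threshold=0.5):
    # Walk down the WordTree from the root, always taking the most
    # confident child; stop when that confidence drops below the threshold.
    node = "physical object"  # root: probs[node] = Pr(physical object)
    while children.get(node):
        best = max(children[node], key=lambda c: probs[c])
        if probs[best] < threshold:
            break  # children too uncertain: predict the current node
        node = best
    return node

# Dog example from above (probabilities are invented for illustration):
probs = {"physical object": 0.95, "dog": 0.9,
         "hunting dog": 0.8, "sighthound": 0.3}
children = {"physical object": ["dog"], "dog": ["hunting dog"],
            "hunting dog": ["sighthound"]}
print(predict_class(probs, children))  # -> "hunting dog"
```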
When the network sees a detection image, we backpropagate the loss as normal. When it sees a classification image, we only backpropagate the classification loss. YOLO9000 uses the base YOLOv2 architecture, but with only 3 priors instead of 5 to limit the output size.
Since COCO does not have bounding-box labels for many categories, YOLO9000 struggles to model some categories like "sunglasses" or "swimming trunks." When evaluated on the ImageNet detection task, YOLO9000 gets 19.7 mAP overall, and 16.0 mAP on the disjoint 156 object classes for which it has never seen any labelled detection data.