YOLO V2 For Object Detection

The document discusses the YOLOv2 object detection model. YOLOv2 improves on YOLOv1 by adding batch normalization, using a deeper backbone network called Darknet-19, training the classifier at a higher resolution before detection, adding multi-scale training, and using k-means clustering to select anchor boxes rather than predefined boxes. YOLOv2 achieves state-of-the-art accuracy while maintaining most of YOLOv1's speed.


YOLO FAMILY

YOLOv2
Agenda: Topics Covered

Why YOLOv2?

How does YOLOv2 Work?

Comparison with other methods

YOLO 9000: Better, Faster, Stronger


YOLOv2
Why YOLOv2?
Why YOLOv2?
Clearly, YOLOv1 performed a lot faster than the other methods, but its detections suffered by roughly 10 mAP compared to Faster R-CNN with VGG-16.

YOLOv1 makes a significant number of localization errors. Furthermore, it has relatively low recall.

So, the authors focused mainly on improving recall and localization while maintaining classification accuracy.
Why YOLOv2?
This was the main motive behind the 2nd version of YOLO, introduced in late 2016.

YOLOv2 outperforms the other methods in both speed and detection accuracy.

At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007.

At 40 FPS, it gets 78.6 mAP (mean Average Precision).
YOLOv2
How does YOLOv2 work?
How does YOLOv2 Work?
Computer vision generally trends towards larger, deeper
networks. Better performance often hinges on training larger
networks or ensembling multiple models together.

However, with YOLOv2, the authors wanted a more accurate detector that is still fast. Instead of scaling up the network, they simplified it to make the representation easier to learn. Here's how they improved YOLO's performance:
Batch Normalization
They added batch normalization after every convolutional layer (YOLOv1 had none).

This alone resulted in a 2% increase in mAP.

Batch Normalization improves model convergence while also regularizing it.

Because of this regularization effect, they removed the dropout layer that was used in YOLOv1.
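As a minimal sketch of this design choice (assuming PyTorch; the function name and sizes are illustrative, not taken from the paper), a Darknet-style convolutional block then looks like:

import torch.nn as nn

# Sketch of a Conv -> BatchNorm -> LeakyReLU block of the kind used in
# Darknet-19. The convolution bias is disabled: Batch Norm subtracts the
# batch mean, so a constant bias would be cancelled anyway.
def conv_bn_leaky(in_ch, out_ch, kernel_size):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )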
Batch Normalization
Batch Norm is a normalization technique done between the
layers of a Neural Network instead of in the raw data.

It is done along mini-batches instead of the full data set.

Batch Norm – represented with a red line in the slide's figure – is applied to the neurons' output just before the activation function.
Batch Normalization
Usually, a neuron without Batch Norm
would be computed as follows:
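In symbols, consistent with the definitions below:

z = g(w, b, x) = w · x + b,        a = f(z)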

with g() the linear transformation of the neuron, w the weights of the neuron, b the bias of the neuron, and f() the activation function.
Batch Normalization
Adding Batch Norm, it looks as follows:
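In symbols, consistent with the definitions below:

z^N = gamma * (z - m_z) / s_z + beta,        a = f(z^N)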

with z^N the output of Batch Norm, m_z the mean of the neurons' output, s_z the standard deviation of the neurons' output, and gamma and beta the learnable parameters of Batch Norm.

Note that the bias of the neurons (b) is removed. This is because, as we subtract the mean m_z, any constant added to z – such as b – cancels out.
Architecture
To address complexity and accuracy, the authors propose a new classification model called Darknet-19 to be used as the backbone for YOLOv2.

Darknet-19 has 19 convolutional layers and 5 max-pooling layers, with 11 more layers added for object detection. It achieved 91.2% top-5 accuracy on ImageNet for classification, which is better than VGG-16 (90%) and the original YOLO network (88%).
High Resolution Classifier
The general way object detection works is that the model is first pretrained on ImageNet for classification.

Then, for detection, the network input is resized to a higher resolution, especially to detect smaller objects in a scene.

The original YOLO was trained as follows:

i. They trained the classifier network at a 224×224 input size.

ii. Then they increased the resolution to 448 for detection.

High Resolution Classifier
This means when switching to detection the network
has to simultaneously switch to learning object
detection and adjust to the new input resolution.

For YOLOv2, they initially trained the model on images at 224×224, then fine-tuned the classification network at the full 448×448 resolution for 10 epochs on ImageNet before training for detection.

This gives the network time to adjust its filters to work better on higher-resolution input. This high-resolution classification network gives an increase of almost 4% mAP.
Training
The model was first trained for classification then it
was trained for detection.

1 - Classification: they trained the Darknet-19 network on the standard ImageNet 1000-class classification dataset with input shape 224x224 for 160 epochs.

After that, they fine-tuned the network at the larger input size of 448x448 for 10 epochs. This gives a top-1 accuracy of 76.5% and a top-5 accuracy of 93.3%.
Training
The model was first trained for classification then it
was trained for detection.

2 - Detection: After training for classification, they removed the last convolutional layer from Darknet-19 and instead added three 3x3 convolutional layers and a 1x1 convolutional layer with the number of outputs they need for detection (13x13x125).

Also, a passthrough layer was added so that the model can use fine-grained features from earlier layers.

Then they trained the network for 160 epochs on detection datasets (VOC and COCO).
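A rough sketch of the detection head described above (assuming PyTorch; Batch Norm and the passthrough concatenation covered on the next slides are omitted for brevity, and the 1024-channel widths follow the paper's description):

import torch
import torch.nn as nn

# Three 3x3 conv layers followed by a 1x1 conv that outputs
# 5 anchors x (4 box coords + 1 objectness + 20 classes) = 125 channels
# on the 13x13 grid.
def detection_head(in_channels=1024, num_anchors=5, num_classes=20):
    out_channels = num_anchors * (5 + num_classes)          # 5 * 25 = 125
    return nn.Sequential(
        nn.Conv2d(in_channels, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, 1024, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv2d(1024, out_channels, 1),
    )

features = torch.randn(1, 1024, 13, 13)    # Darknet-19 features for a 416x416 input
print(detection_head()(features).shape)    # torch.Size([1, 125, 13, 13])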
Fine Grained Features
YOLOv2 predicts the detections using the 13 x 13 feature
map.

This is sufficient for identifying large objects, but not smaller ones.

To better localize smaller objects, a passthrough layer takes features from an earlier layer at 26 x 26 resolution and concatenates them with the lower-resolution 13 x 13 features.

This gives a 1% increase in performance.
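A minimal sketch of the passthrough idea (assuming PyTorch; the exact channel reordering in Darknet's reorg layer differs slightly, so treat this as illustrative): the 26x26x512 map is reorganized into 13x13x2048 by stacking each 2x2 spatial block into the channel dimension, then concatenated with the 13x13 features.

import torch

# Space-to-depth style reorganization: (N, C, H, W) -> (N, C*s*s, H/s, W/s).
def passthrough(x, stride=2):
    n, c, h, w = x.shape
    x = x.view(n, c, h // stride, stride, w // stride, stride)
    x = x.permute(0, 1, 3, 5, 2, 4).contiguous()
    return x.view(n, c * stride * stride, h // stride, w // stride)

fine = torch.randn(1, 512, 26, 26)      # earlier, higher-resolution features
coarse = torch.randn(1, 1024, 13, 13)   # final detection features
merged = torch.cat([passthrough(fine), coarse], dim=1)
print(merged.shape)                      # torch.Size([1, 3072, 13, 13])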


Multiscale Training
To make YOLOv2 robust to running on images of different sizes, they trained the model on different input sizes.

Since the model uses only convolutional and pooling layers, the input can be resized on the fly.

YOLOv1 used a 448 x 448 input resolution. In YOLOv2, they resize the input image randomly to different resolutions between 320 x 320 and 608 x 608 (the resolution is always a multiple of 32).

This multi-scale training can be thought of as augmentation: it forces the network to learn to predict well across a variety of input dimensions. It increased the mAP by 1.5%.
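A small sketch of this schedule (assuming PyTorch; the stand-in loader and variable names are illustrative): every 10 batches a new size is drawn from the multiples of 32 between 320 and 608, and the batch is resized on the fly.

import random
import torch
import torch.nn.functional as F

SIZES = list(range(320, 609, 32))                           # 320, 352, ..., 608
loader = [torch.randn(8, 3, 416, 416) for _ in range(20)]   # stand-in data loader

size = 416
for step, batch in enumerate(loader):
    if step % 10 == 0:                                      # new resolution every 10 batches
        size = random.choice(SIZES)
    batch = F.interpolate(batch, size=(size, size),
                          mode="bilinear", align_corners=False)
    # ... forward pass, loss and backprop on the resized batch ...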
Anchor Boxes
YOLO (v1) tries to assign the object to the grid cell that
contains the middle of the object.

Using this idea, the red cell in the image to the right must detect both the man and his necktie, but since a grid cell can only detect one object, a problem arises here.

To solve this, the authors allow each grid cell to detect more than one object using k bounding boxes.
Anchor Boxes
There are two ways of predicting bounding boxes:

1. Directly predicting the bounding box of the object.

2. Using a set of pre-defined bounding boxes (anchor boxes) to predict the actual bounding box of the object.

YOLOv1 predicts the coordinates of bounding boxes directly, using fully connected layers on top of the convolutional feature extractor.

But it makes a significant amount of localization error. It is easier to predict offsets relative to anchor boxes than to predict the coordinates directly.
Anchor Boxes
In this image, we have a grid cell (red) and 5 anchor boxes (yellow) with different shapes (aspect ratios).

In the paper, anchor boxes are called prior boxes.

YOLOv2 uses the idea of anchor boxes, but instead of picking the k anchor boxes by hand, it tries to find the anchor box shapes that make it easiest for the network to learn detection.

We can predict the bounding box relative to the anchor box instead of relative to the whole image. With this formulation, it is easier for the network to learn.
Anchor Boxes
In this image, the 5 red boxes represent the average dimensions and locations of objects in the VOC 2007 dataset.

How and why did they choose these 5 boxes?

Instead of using predefined anchor boxes, they looked at the bounding boxes in the training data (VOC 2007, COCO), ran k-means clustering on the training-set bounding boxes for various values of k, and plotted the average IOU with the closest centroid.

But instead of using Euclidean distance, they used the IOU between the bounding box and the centroid. This resulted in the dimension clusters.
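A minimal sketch of this clustering (assuming NumPy; helper names are illustrative): boxes are represented only by (width, height), compared as if they shared the same center, and the k-means distance is d(box, centroid) = 1 - IOU(box, centroid).

import numpy as np

def iou_wh(boxes, centroids):
    # IOU between (w, h) pairs, assuming the boxes share the same center.
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_iou(boxes, k=5, iters=100):
    # boxes: (N, 2) array of box widths and heights from the training set.
    centroids = boxes[np.random.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the centroid with the smallest 1 - IOU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        centroids = np.stack([
            boxes[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
            for i in range(k)
        ])
    return centroids    # k prior-box shapes (width, height)

anchors = kmeans_iou(np.random.rand(1000, 2) * 13)   # stand-in for real training boxes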
Anchor Boxes
They chose k = 5 as a good trade-off between model complexity and high recall.

When they ran k-means clustering on the VOC and COCO training data, they obtained the clusters shown in the figure to the right (Left: Avg. IOU vs. number of clusters. Right: dimension clusters obtained from the training images).

The graph shows how well the dimension clusters overlap with the training data's bounding boxes. Using these cluster centers helped increase mAP by 5%.
Direct location prediction
When they used anchor boxes with YOLO, they encountered model instability, especially during the initial iterations.

Instead of predicting offsets relative to the anchor box, YOLOv2 follows the approach of YOLOv1 and predicts location coordinates relative to the location of the grid cell.

This bounds the ground truth to fall between 0 and 1. The network predicts 5 bounding boxes for each cell, and 5 coordinates for each bounding box: tx, ty, tw, th, and to.
Direct location prediction
Suppose the cell is offset from the top-left corner of the image by (cx, cy), the bounding box prior (anchor box) has width and height pw, ph, and the final predictions are b_x, b_y, b_w, b_h.
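The equations from the paper are (σ is the logistic sigmoid, which keeps the predicted box center inside its grid cell):

b_x = σ(t_x) + c_x
b_y = σ(t_y) + c_y
b_w = p_w * e^(t_w)
b_h = p_h * e^(t_h)
Pr(object) * IOU(b, object) = σ(t_o)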

For example, if we use 2 anchor boxes, the grid cell (2,2) in the image will output 2 boxes (the blue and the yellow boxes). Let the black dotted boxes represent the 2 anchor boxes for that cell.
Direct location prediction

Now consider only the blue box. Instead of assigning the predicted blue box to the grid cell only, as in YOLOv1, YOLOv2 assigns the blue box not only to the grid cell but also to one of the anchor boxes, namely the one with the highest IOU with the ground-truth box.

YOLOv2 uses the equations shown above to tie the blue box to the grid cell and the anchor box.
Output shape
The YOLOv2 output shape is 13x13x(k x (1+4+20)), where k is the number of anchor boxes, 1+4 covers the objectness score and the 4 box coordinates, and 20 is the number of classes (for VOC). For k=5 the output shape is 13x13x125.
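A small sketch of how that tensor can be split up (assuming PyTorch; the exact channel ordering depends on the implementation):

import torch

out = torch.randn(1, 125, 13, 13)            # raw network output
out = out.view(1, 5, 25, 13, 13)             # 5 anchors x (tx, ty, tw, th, to + 20 classes)
box_params, class_scores = out[:, :, :5], out[:, :, 5:]
print(box_params.shape, class_scores.shape)  # [1, 5, 5, 13, 13] and [1, 5, 20, 13, 13]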
Loss Function
The loss function is defined per iteration t.

If a bounding box does not contain any object, its objectness confidence needs to be pushed down; this is the first loss term.

Because the early bounding box coordinate predictions should align with the prior (anchor) boxes, a loss term reducing the difference between the prior and the prediction is added for the first iterations (t < 12800).

If a bounding box k is responsible for a ground-truth box, then its predictions need to be aligned with the truth values; this is the third loss term. The λ values are pre-defined weights for each of the loss terms.
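The paper does not write the loss out in closed form; a schematic version matching the three terms described above (the symbols and exact weightings here are illustrative, not the reference implementation) is:

L_t =   λ_noobj * Σ_{boxes with no object} (0 - objectness)²
      + λ_prior * 1[t < 12800] * Σ_{all boxes} Σ_{r in {x,y,w,h}} (prior_r - pred_r)²
      + Σ_{responsible boxes} [ λ_coord * Σ_{r in {x,y,w,h}} (truth_r - pred_r)²
                                + λ_obj * (IOU(pred, truth) - objectness)²
                                + λ_class * Σ_c (truth_c - pred_c)² ]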
YOLOv2
Comparison to Other
Detection Systems
Comparison to Other Detection Systems
YOLOv2 is state-of-the-art and faster than other detection systems across a variety of detection
datasets. Furthermore, it can be run at a variety of image sizes to provide a smooth trade-off
between speed and accuracy.

The graph and table below show how different methods perform with respect to precision and
speed on VOC 2007 dataset and VOC 2007 + 2012 dataset respectively.
YOLO9000
Better, Faster, Stronger
YOLO9000
Sometimes we need a model that can detect more than 20 classes, and that is what YOLO9000
does. It is a real-time framework for detecting more than 9000 object categories by jointly
optimizing detection and classification.

As we mentioned previously, YOLOv2 was trained first for classification and then for detection. This is because the dataset for classification, which contains one object per image, is different from the dataset for detection. For YOLO9000, the authors propose a mechanism for jointly training on classification and detection data.

During training, they mix images from both detection and classification datasets. When the network sees an image labeled for detection, the full YOLOv2 loss function is backpropagated. When it sees a classification image, only the loss from the classification-specific parts of the architecture is backpropagated.
YOLO9000
The idea of mixing detection and classification data faces a few challenges:

1- Detection datasets are small compared to classification datasets.

2- Detection datasets have only common objects and general labels, like “dog” or “boat,” while classification datasets have a much wider and deeper range of labels. For example, the ImageNet dataset has more than a hundred breeds of dog, such as “German shepherd” and “Bedlington terrier.”
YOLO9000
To merge these two datasets the authors created a hierarchical model of visual concepts and called it
WordTree.

As we can see, all the classes sit under the root (physical object). They trained the Darknet-19 model on WordTree: they took the 1000 ImageNet classes and added all the intermediate nodes from WordTree, which expands the label space from 1000 to 1369; they call this WordTree1k. The output layer of Darknet-19 therefore grows from 1000 to 1369.
YOLO9000
For these 1369 predictions, we don't compute one softmax over everything; instead, we compute a separate softmax over all synsets that are hyponyms of the same concept (i.e., over each group of siblings).

Despite adding 369 additional concepts, Darknet-19 still achieves 71.9% top-1 accuracy and 90.4% top-5 accuracy.

The detector predicts a bounding box and the tree of probabilities, but since we use more than one
softmax we need to traverse the tree to find the predicted class.
YOLO9000
We traverse the tree from top to bottom, taking the highest-confidence path at every split, until the probability drops below the threshold; we then predict the class of the last confident node.

For example, if the input image contains a dog, the tree of probabilities will look like the tree below.

Instead of assuming every image contains an object, we use YOLOv2's objectness predictor to provide the value of Pr(physical object), which is the root of the tree.
YOLO9000
The model outputs a softmax for each branch level. We choose the node with the highest probability (if it is higher than a threshold value) as we move from top to bottom. The prediction is the node where we stop.

In the tree above, the model goes through physical object => dog => hunting dog. It stops at 'hunting dog' and does not go down to 'sighthound' (a type of hunting dog) because its confidence is less than the confidence threshold, so the model predicts hunting dog, not sighthound.

Performing classification in this manner also has benefits. Performance degrades gracefully on new or unknown object categories. For example, if the network sees a picture of a dog but is uncertain which type of dog it is, it stops at 'dog' with high confidence and the output will be 'dog'.
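A minimal sketch of this prediction walk (hypothetical data structures; the real implementation works on the flat 1369-way output): each node stores its conditional probability given its parent, taken from the softmax over its sibling group, and we walk down the tree while the path probability stays above the threshold.

def predict_wordtree(objectness, children, cond_prob, threshold=0.5):
    # children: node -> list of child nodes; cond_prob: node -> P(node | parent).
    node, prob = "physical object", objectness         # root probability = objectness
    while children.get(node):
        best = max(children[node], key=lambda c: cond_prob[c])
        if prob * cond_prob[best] < threshold:
            break                                      # not confident enough to go deeper
        node, prob = best, prob * cond_prob[best]
    return node, prob

children = {"physical object": ["dog"], "dog": ["hunting dog"],
            "hunting dog": ["sighthound"]}
cond_prob = {"dog": 0.95, "hunting dog": 0.85, "sighthound": 0.4}
print(predict_wordtree(0.9, children, cond_prob))      # stops at 'hunting dog'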
YOLO9000
The combined dataset was created using the COCO detection dataset and the top 9000 classes from the full ImageNet release. YOLO9000 uses the base YOLOv2 architecture but only 3 priors (anchor boxes) instead of 5, to limit the output size. It learns to find objects in images using the detection data from COCO, and it learns to classify a wide variety of these objects using data from ImageNet.

When the network sees a detection image, the loss is backpropagated as normal. When it sees a classification image, only the classification loss is backpropagated.

Since COCO does not have bounding box labels for many categories, YOLO9000 struggles to model some categories like "sunglasses" or "swimming trunks." Evaluated on the ImageNet detection task, YOLO9000 gets 19.7 mAP overall, and 16.0 mAP on the disjoint 156 object classes for which it has never seen any labeled detection data.
