YOLO

Uploaded by

Manel Lnsry

You Only Look Once

path to design a detector


Feng Wang

AIRD, Coretronic Co.


Apr 17, 2019
The slides and a list of references can be found at
https://github.com/fwcore/object-detection
Outline

• Concepts in object detection

• A brief history of object detection

• YOLO
  • design
  • loss function
  • training
  • weaknesses
Classification vs detection/recognition
Common tasks on images

https://medium.com/@nikasa1889/the-modern-history-of-object-recognition-infographic-aea18517c318
Bounding box proposal
Region of interest, region proposal, box proposal
Ground truth

Proposed bounding box

5 parameters
• w, h
• x, y
• confidence score: how likely it contains an object & accuracy of the box
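How good: Intersection over Union (IOU)

$$\mathrm{IOU} = \frac{\text{Overlap Area}}{\text{Union Area}}$$

Examples
• IOU = 0: the two boxes do not overlap
• IOU = 1: the two boxes coincide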
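A minimal sketch of the IOU computation, assuming boxes are given by their corners as (x_min, y_min, x_max, y_max). The function name `iou` and the corner convention are choices made for this illustration, not taken from the slides.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    overlap = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - overlap
    return overlap / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (3, 3, 5, 5)))  # disjoint boxes
print(iou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(iou((0, 0, 2, 2), (1, 0, 3, 2)))  # second box shifted by half a width
```

The three calls cover the two extreme cases and a partial overlap in between.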
A brief history of object detection

https://stats385.github.io
A brief history of object detection

• Before CNNs, people used handcrafted features to locate and classify objects. (not too bad)

• CNNs boost the accuracy of classification

ImageNet
A brief history of object detection

Region proposal -> classification
• e.g. RCNN
• accurate
• slow

Single shot: region proposal + classification
• e.g. YOLO, SSD
• fast
• less accurate
YOLO: you only look once

Look once -> Results
• x, y, w, h
• confidence score: contain an object & box accuracy
• class score: belong to a class

Let's use CNN, since it's good.
Why not regress? They are just numbers.
Let's go to CNN

YOLO v1's CNN: GoogLeNet variant, 24 layers

YOLO v2's CNN: darknet-19, 19 layers

YOLO v3's CNN: darknet-53

Let's do regression
-- wait, wait, how many bounding boxes? Where are they initially?
• Maybe set N as a large number?
• Maybe initially put them randomly?

Better solution: using grids

Results for one box
• x, y, w, h
• confidence score: contain an object & box accuracy
• class score: belong to a class

Note: N is large, but much smaller than the number of region proposals in R-CNN.
Let's do regression with non-maximal suppression

We can use CNN to extract features, and finally perform a regression to detect objects.
• YOLO v1: fully connected layers
• v2 & v3: convolutional layers

Grid   Proposed box 1                  Proposed box 2                  Class scores
1      x, y, w, h, confidence score    x, y, w, h, confidence score    class 1, class 2, ..., class 20
...    ...                             ...                             ...
SxS    x, y, w, h, confidence score    x, y, w, h, confidence score    class 1, class 2, ..., class 20

vector size: SxSx(5x2+20)

arXiv: 1506.02640, 1612.08242, 1804.02767
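A small sketch of how the output vector size comes about and how one grid cell's slice could be split. It assumes S = 7, B = 2 and C = 20 (the YOLO v1 settings for PASCAL VOC) and assumes the per-cell layout "boxes first, then class scores" shown in the table above; the actual memory layout in a given implementation may differ, and the helper names are hypothetical.

```python
S, B, C = 7, 2, 20  # assumed: grid size, boxes per cell, number of classes


def output_size(s, b, c):
    """Length of the regression target: S x S x (5 x B + C)."""
    return s * s * (5 * b + c)


def split_cell(cell_vector, b, c):
    """Split one cell's vector into b boxes (x, y, w, h, confidence) and c class scores."""
    boxes = [cell_vector[5 * k: 5 * k + 5] for k in range(b)]
    class_scores = cell_vector[5 * b: 5 * b + c]
    return boxes, class_scores


cell = [float(i) for i in range(5 * B + C)]  # dummy values for one cell
boxes, class_scores = split_cell(cell, B, C)
print(output_size(S, B, C), len(boxes), len(class_scores))
```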
Loss function
Problems
• One object is partially/fully covered by several boxes.
• Most boxes have no objects.
• Multi-task training problem: location & class
• Small objects need more accurate location & box size.

Solution
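For reference, a sketch of the loss as given in the YOLO v1 paper (arXiv:1506.02640), for an S x S grid with B boxes per cell. The symbols follow the paper rather than these slides: hatted quantities are predictions, C is the confidence score and p(c) the class score.

```latex
$$
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2
 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$
```

Here the indicator for (i, j) with "obj" is 1 when box j in cell i is responsible for an object, and the paper sets the weights to 5 for the coordinate terms and 0.5 for the no-object confidence term.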
Oh, no math please. Let's speak human language

Problem 1:
One object is partially/fully covered by several boxes.

• Each true object has one proposed box “responsible” for it.
  Rule: the one with the highest overlap with the ground-truth box.
• At inference, we use non-maximal suppression to select the best among the proposals.
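A minimal sketch of greedy non-maximal suppression, assuming detections are (score, box) pairs with boxes as (x_min, y_min, x_max, y_max) and an IOU threshold of 0.5. The threshold, the data layout and the function names are assumptions for this illustration; in practice the procedure is usually applied per class.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    overlap = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - overlap
    return overlap / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),    # heavily overlaps the first box
    (0.7, (20, 20, 30, 30)),  # a separate object
]
kept = non_max_suppression(detections)
print([score for score, _ in kept])
```

In this example the second box overlaps the first with an IOU of about 0.68, so it is suppressed, while the third box is far away and survives.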
Human language

Problem 2:
Most boxes have no objects.

• The confidence loss of boxes without objects is down-weighted by a factor of 0.5.
Human language

Problem 3:
Multi-task training problem: location & class.

• Weighted sum: here the problem is left untouched.
Human language

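Problem 4:
Small objects need more accurate location & box size.

• sqrt: regress the square roots of w and h, so that the same absolute error costs more for a small box than for a large one.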
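A short numerical illustration of the square-root idea, assuming box widths normalized to [0, 1]. The numbers and the helper name `size_error` are made up for this example.

```python
import math


def size_error(w_true, w_pred, use_sqrt):
    """Squared error on a box width, optionally computed on square roots."""
    if use_sqrt:
        return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    return (w_true - w_pred) ** 2


small = (0.10, 0.15)  # small box, width off by 0.05
large = (0.80, 0.85)  # large box, width off by the same 0.05

for name, (w_true, w_pred) in (("small", small), ("large", large)):
    print(name, size_error(w_true, w_pred, False), size_error(w_true, w_pred, True))
```

Without the square root both boxes are penalized (almost exactly) equally; with it, the error on the small box is penalized several times more than the same error on the large box.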
Other problems
• x, y can fall outside the grid cell
• smaller objects can be localized worse than larger ones
• probability can fall outside [0, 1]

These are fixed in YOLO v2
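A sketch of how the first and third problems can be handled, following the parameterization described in the YOLO v2 paper (arXiv:1612.08242): the raw outputs for the center and the confidence go through a sigmoid, so the center stays inside its grid cell and the confidence stays in (0, 1). The function and argument names are hypothetical, and coordinates are in grid units.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_center_and_confidence(t_x, t_y, t_o, cell_col, cell_row):
    """Map raw network outputs to a box center (grid units) and a confidence."""
    b_x = cell_col + sigmoid(t_x)  # always within [cell_col, cell_col + 1]
    b_y = cell_row + sigmoid(t_y)  # always within [cell_row, cell_row + 1]
    confidence = sigmoid(t_o)      # always within [0, 1]
    return b_x, b_y, confidence


# Even large raw outputs cannot push the center out of cell (3, 5).
b_x, b_y, confidence = decode_center_and_confidence(8.0, -8.0, 0.0, cell_col=3, cell_row=5)
print(b_x, b_y, confidence)
```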

Pre-defined box size

Pre-defined box: anchor
• Naturally, objects have characteristic aspect ratios and sizes.
  • This can be a good starting point.
  • We don't need randomly initialized box shapes.

• Handcrafted box sizes vs clustering algorithms

• Boxes can reshape during training.

• The number of pre-defined boxes is a hyperparameter
  • v2 uses 5
  • v3 uses 9

[Figure: anchors used in YOLO v2]

Anchor-free detection is a research topic, see https://arxiv.org/abs/1904.01355 for an instance.
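A minimal sketch of the clustering option, assuming k-means on ground-truth (w, h) pairs where "closest" means highest IOU between boxes sharing the same center (the YOLO v2 paper uses the distance 1 - IOU). The naive initialization from the first k boxes and the toy data are assumptions made to keep the example deterministic.

```python
def wh_iou(a, b):
    """IOU of two boxes given as (w, h), assuming they share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iterations=20):
    """Cluster (w, h) pairs into k anchors using IOU as the similarity."""
    centroids = list(boxes[:k])  # naive initialization, for illustration only
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            nearest = max(range(k), key=lambda j: wh_iou(box, centroids[j]))
            clusters[nearest].append(box)
        centroids = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids


# Toy data: three tall, narrow boxes and three wide boxes.
boxes = [(10, 20), (100, 40), (12, 22), (90, 50), (11, 18), (110, 45)]
anchors = kmeans_anchors(boxes, k=2)
print(anchors)
```

On this toy data the two anchors settle on one tall, narrow shape and one wide shape.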
Improvements (in v2)
• Multi-scale training
  • Resize input images randomly during training: {320, 352, ..., 608}
  • The CNN only reduces an image by a constant factor (here 32), hence it is robust to input image size
  • Resize every 10 batches.
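A small sketch of the size selection, assuming the candidate sizes are the multiples of 32 from 320 to 608 and that the output grid is the input size divided by the downsampling factor. The helper names and the fixed random seed are assumptions for this illustration.

```python
import random

STRIDE = 32                                 # total downsampling factor of the CNN
SIZES = list(range(320, 608 + 1, STRIDE))   # 320, 352, ..., 608


def pick_input_size(rng):
    """Randomly choose the next training resolution."""
    return rng.choice(SIZES)


def grid_size(input_size):
    """Number of grid cells per side for a given input resolution."""
    return input_size // STRIDE


rng = random.Random(0)
for _ in range(3):
    size = pick_input_size(rng)
    print(size, grid_size(size))
```

Because the network is fully convolutional, only the grid size changes with the input resolution; the weights are shared across all sizes.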

• Passthrough layer
  • Reshaping loses no information

• Odd number of grid cells

[Figure: feature map comparison]
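A sketch of the reshaping behind a passthrough layer, assuming it works like a space-to-depth rearrangement: each 2 x 2 spatial block is moved into the channel dimension, so a 26 x 26 x 512 map becomes 13 x 13 x 2048 (the sizes quoted in the YOLO v2 paper) without discarding any value. The nested-list layout [channels][height][width] and the channel ordering are assumptions; real implementations order the channels in their own way.

```python
def space_to_depth(feature_map, block=2):
    """Rearrange [C][H][W] into [C * block * block][H / block][W / block]."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    out = []
    for dy in range(block):
        for dx in range(block):
            for c in range(channels):
                # Take every block-th pixel, starting at offset (dy, dx).
                out.append([
                    [feature_map[c][y][x] for x in range(dx, width, block)]
                    for y in range(dy, height, block)
                ])
    return out


# One channel, 4 x 4, holding the values 0..15.
fmap = [[[4 * y + x for x in range(4)] for y in range(4)]]
stacked = space_to_depth(fmap)
print(len(stacked), len(stacked[0]), len(stacked[0][0]))
```

The result can then be concatenated channel-wise with the coarser feature map of the same spatial size.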
Training

Step 1 (ImageNet: classification dataset):
• train classification backbone

Step 2 (transfer learning; COCO/PASCAL VOC: detection dataset):
• remove head layers
• add regression as new head
• fine-tune backbone & train head

Training tricks
• decaying learning rate
• batch normalization
• data augmentation
Performance
Generalizability

Picasso & People-Art dataset


But ... no free lunch
• YOLO is not as accurate as RCNN-series models
  • multi-task problem: YOLO makes fewer background errors, but more localization errors.

• YOLO is poor at detecting small objects
  • CNN: training on ImageNet may not generalize well to small objects (classification)
  • loss function equalizes location weights for small & large objects (localization)

• YOLO is not good at crowded objects
  • cause: non-maximal suppression. See an improvement: Adaptive NMS (arXiv:1904.03629)

• YOLO is bad at objects with unusual aspect ratios
  • cause: pre-defined anchors, or anchors learned from data. Go anchor-free (arXiv:1904.01355).
Security
CNN (classification) can be fooled, and so can YOLO; the issues can be even worse.

Non-maximal suppression can be fooled.

Daedalus: Breaking Non-Maximum Suppression in Object Detection via Adversarial Examples. arXiv:1902.02067
Is there anything helpful to improve?
Darwin's evolution

arXiv: 1807.05511
