Fast Methods For Deep Learning Based Object Detection

This document summarizes problems with the R-CNN object detection method and introduces Fast R-CNN and Faster R-CNN as improved methods. R-CNN training is slow and requires extracting deep learning features for each object proposal. Fast R-CNN improves on this by only extracting features once per image and using ROI pooling to classify and regress proposals. Faster R-CNN further speeds up detection by adding a Region Proposal Network to generate proposals, removing the need for an external proposal method. It enables end-to-end training of the whole system.

Uploaded by

seul alone

R-CNN: Problems

● Training is a multi-stage pipeline.
○ R-CNN first fine-tunes a ConvNet on object proposals using log loss.
○ Then, it fits SVMs to ConvNet features. These SVMs act as object detectors, replacing the softmax
classifier learnt by fine-tuning.
○ In the third training stage, bounding-box regressors are learned.
● Training is expensive in space and time.
○ For SVM and bounding-box regressor training, features are extracted from each object proposal in
each image and written to disk.
○ With very deep networks, such as VGG16, this process takes 2.5 GPU-days for the 5k images of the
VOC07 trainval set. These features require hundreds of gigabytes of storage.
● Object detection is slow.
○ At test-time, features are extracted from each object proposal in each test image.
○ Detection with VGG16 takes 47s / image (on a GPU).
Fast R-CNN

● Only calculate features once per image.
● The ROI Pooling layer extracts a fixed-length feature vector for each proposal.
● Classify and regress bounding boxes with a multi-task loss for end-to-end
training.
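A minimal NumPy sketch of the ROI pooling idea above, assuming the proposal has already been projected onto the feature map and rounded to integer coordinates. The function name `roi_max_pool`, the 2x2 output grid, and the way bins are split are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one ROI of a (C, H, W) feature map into a (C, out_h, out_w) grid.

    roi is (x1, y1, x2, y2) in feature-map cells, inclusive (an assumption
    made for this sketch).
    """
    x1, y1, x2, y2 = roi
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    channels, height, width = region.shape
    out_h, out_w = output_size

    # Split the region into out_h x out_w roughly equal bins.
    y_edges = np.linspace(0, height, out_h + 1)
    x_edges = np.linspace(0, width, out_w + 1)

    pooled = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        ys = int(np.floor(y_edges[i]))
        ye = max(int(np.ceil(y_edges[i + 1])), ys + 1)  # never an empty bin
        for j in range(out_w):
            xs = int(np.floor(x_edges[j]))
            xe = max(int(np.ceil(x_edges[j + 1])), xs + 1)
            pooled[:, i, j] = region[:, ys:ye, xs:xe].max(axis=(1, 2))
    return pooled


# Two proposals of different sizes give outputs of the same shape.
features = np.arange(2 * 6 * 8, dtype=np.float64).reshape(2, 6, 8)
print(roi_max_pool(features, (1, 1, 4, 5)).shape)
print(roi_max_pool(features, (0, 0, 7, 2)).shape)
```

Whatever the proposal size, the output shape is fixed, which is what lets every proposal share the same classifier and regressor layers.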
Fast R-CNN: ROI Pooling
Fast R-CNN

● Instead of separately trained SVMs and bounding box regressors, the network has two sibling outputs:
○ Softmax classifier output
○ Bounding box regression output
● Multi-task training:
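The multi-task objective in the Fast R-CNN paper adds a classification log loss to a smooth L1 box loss that is only active for non-background ROIs. Below is a minimal NumPy sketch for a single ROI. The function names, the argument layout, and the default weight `lam=1.0` are assumptions made for illustration.

```python
import numpy as np


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)


def multi_task_loss(class_scores, true_class, box_pred, box_target, lam=1.0):
    """Classification log loss plus box regression loss for one ROI.

    class_scores: (K + 1,) raw scores, class 0 is background.
    box_pred:     (K + 1, 4) predicted box offsets, one row per class.
    box_target:   (4,) regression target for the true class.
    """
    # Numerically stable log-softmax.
    shifted = class_scores - class_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[true_class]

    # Background ROIs have no ground-truth box, so they skip the box loss.
    if true_class >= 1:
        loc_loss = smooth_l1(box_pred[true_class] - box_target).sum()
    else:
        loc_loss = 0.0
    return cls_loss + lam * loc_loss


scores = np.array([0.2, 2.5, -1.0])
offsets = np.zeros((3, 4))
offsets[1] = [0.1, -0.2, 0.05, 0.3]
print(multi_task_loss(scores, 1, offsets, np.zeros(4)))
```

Because both terms come from one forward pass, a single backward pass updates the shared convolutional layers for both tasks.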
Fast R-CNN

● Advantages
○ Training is single-stage, using a multi-task loss
○ Training can update all network layers
○ No disk storage is required for feature caching
○ More accurate: 66.9 mAP vs. 66.0 mAP for R-CNN
○ Faster training: 9.5 h vs. 84 h (8.8x)
○ Faster test time per image: 0.32 s vs. 47 s (146x)
● Problem
○ The test times above do not include region proposals.
○ Test time with region proposals: 2 s vs. 50 s (25x)
● Solution
○ Make the CNN do region proposals too!
Faster R-CNN
● Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks (2015)
○ Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun
● Insert a Region Proposal Network (RPN) after the
last convolutional layer.
● RPN trained to produce region proposals directly;
no need for external region proposals!
● After RPN, use RoI Pooling and a downstream
classifier and bbox regressor just like Fast R-CNN.
Faster R-CNN: RPN
● Slide a small window on the already computed
feature map (FREE!).
● Build a small network for:
○ Classifying object or not-object, and
○ Regressing bbox locations
● Position of the sliding window provides
localization information with reference to the
image.
● Box regression provides finer localization
information with reference to this sliding
window
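The two localization cues above can be sketched in NumPy: the sliding-window position fixes a reference box (an anchor) in image coordinates, and the regressed offsets refine it. The stride of 16, the three scales, and the three aspect ratios follow the defaults reported in the Faster R-CNN paper. The function names and the (cx, cy, w, h) layout are assumptions made for this sketch.

```python
import numpy as np


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (cx, cy, w, h) in image pixels, one set per position."""
    shapes = []
    for s in scales:
        for r in ratios:
            # r is height / width; the area stays s * s.
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)  # (A, 2)

    # Centre of each sliding-window position, mapped back to the image.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H * W, 2)

    num_shapes = len(shapes)
    return np.concatenate(
        [np.repeat(centers, num_shapes, axis=0),
         np.tile(shapes, (len(centers), 1))],
        axis=1,
    )


def decode(anchors, deltas):
    """Apply regressed (tx, ty, tw, th) to anchors; return (x1, y1, x2, y2)."""
    cx = anchors[:, 0] + deltas[:, 0] * anchors[:, 2]
    cy = anchors[:, 1] + deltas[:, 1] * anchors[:, 3]
    w = anchors[:, 2] * np.exp(deltas[:, 2])
    h = anchors[:, 3] * np.exp(deltas[:, 3])
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)


anchors = make_anchors(feat_h=2, feat_w=3)
proposals = decode(anchors, np.zeros_like(anchors))  # zero offsets keep anchors
print(anchors.shape, proposals.shape)
```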
Faster R-CNN: Training
● In the paper: Ugly pipeline
○ Use alternating optimization to train RPN, then Fast
R-CNN with RPN proposals, etc.
○ More complex than it has to be
● Since publication: Joint training!
○ One network, four losses
■ RPN classification (anchor good / bad)
■ RPN regression (anchor -> proposal)
■ Fast R-CNN classification (over classes)
■ Fast R-CNN regression (proposal -> box)
How Many Anchors Do We Need?
How Many Proposals Do We Need?

● Fast R-CNN used 2000 proposals from selective search.
● Faster R-CNN needs only 300 proposals from the RPN.
● RPN is better than selective search
○ Deep learning vs. classical computer vision
○ Optimized for this task
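One way to cut the RPN output down to 300 proposals is to rank boxes by objectness score, drop near-duplicates with non-maximum suppression (NMS), and keep the best survivors. The slides do not spell this step out. The sketch below assumes greedy NMS with an IoU threshold of 0.7, as described in the Faster R-CNN paper, and the function names are hypothetical.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def select_proposals(boxes, scores, top_n=300, iou_threshold=0.7):
    """Greedy NMS by score, then keep at most top_n box indices."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0 and len(keep) < top_n:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # Drop boxes that overlap the chosen one too much.
        ious = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]
    return keep


boxes = np.array([[0, 0, 10, 10], [0, 0, 10, 9], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(select_proposals(boxes, scores))
```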
How Much Data Do We Need?
Also Read:
R-FCN: Object Detection via Region-based Fully Convolutional Networks
https://arxiv.org/abs/1605.06409
Another Approach for Speeding Up Proposals
Just Don’t Do It
Just RPN From Faster R-CNN

● Much faster than Faster R-CNN!
● But the RPN only has an object / not-object classifier.
Add Classification!

● What about accuracy?
● How well does it handle different object scales?
Add More Scales!
Add More Classifiers
SSD: Single Shot MultiBox Detector
Why Does Stride Matter?
● Smaller stride means more scanned
windows.
● Handles close objects better.
○ Need enough default boxes to match each
of the close objects accurately.
● Handles small objects better.
○ Better IoU with objects.
○ More positive windows per object.
● Too small a stride is bad
○ Too many windows means too many false
positives to filter.
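The trade-off can be made concrete by counting windows. This is a rough sketch, assuming a square input, one window position every `stride` pixels, and a fixed number of default boxes per position. The input size of 300 and the 6 boxes per position are illustrative values only.

```python
import math


def num_windows(image_size, stride, boxes_per_location=1):
    """Count scanned windows for a square image at a given stride."""
    positions_per_side = math.ceil(image_size / stride)
    return positions_per_side ** 2 * boxes_per_location


# Halving the stride roughly quadruples the number of windows to score.
for stride in (64, 32, 16, 8):
    print(stride, num_windows(300, stride, boxes_per_location=6))
```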
Improving Accuracy

● Object detection data is unbalanced
○ 1-30 positive windows (objects) per image.
○ 8,000 - 25,000 negative (background) windows per image.
● Solution
○ Resample positives and negatives at a fixed ratio (1:3)
● Not all negatives are equal!
○ Some are harder than others
● Better Solution
○ Hard negative mining: keep only the negatives the model misclassifies worst (highest loss), at a fixed ratio.
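Hard negative mining at a 1:3 ratio can be sketched as a mask over per-window losses: keep every positive, then keep the highest-loss negatives up to three per positive. The function name and the array layout are assumptions made for this sketch.

```python
import numpy as np


def hard_negative_mask(losses, is_positive, neg_pos_ratio=3):
    """Return a boolean mask of windows that contribute to the loss.

    losses:      (N,) classification loss per window.
    is_positive: (N,) boolean, True where the window matches an object.
    """
    num_pos = int(is_positive.sum())
    num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))

    # Rank negatives only: positives get -inf so they sort last.
    neg_losses = np.where(is_positive, -np.inf, losses)
    hardest = np.argsort(-neg_losses)[:num_neg]

    mask = is_positive.copy()
    mask[hardest] = True
    return mask


losses = np.array([0.1, 2.0, 0.3, 5.0, 0.2, 1.0, 0.05, 0.4])
is_positive = np.array([True, False, False, False, False, False, False, False])
print(hard_negative_mask(losses, is_positive))
```

Only the masked windows would be summed into the classification loss, so the easy background windows stop dominating the gradient.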
Improving Accuracy

● Not enough data?
● Solution: Data augmentation
○ Random horizontal flip
○ Random crop
○ Random color distortion
○ Random expansion
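For detection, an augmentation has to transform the boxes together with the pixels. Here is a minimal sketch of the random horizontal flip, assuming an (H, W, C) image array and boxes given as (x1, y1, x2, y2) in pixels. The function name and signature are hypothetical.

```python
import numpy as np


def random_horizontal_flip(image, boxes, rng, p=0.5):
    """Flip the image left-right with probability p and mirror its boxes."""
    if rng.random() >= p:
        return image, boxes
    width = image.shape[1]
    flipped = image[:, ::-1]
    new_boxes = boxes.copy()
    # The old right edge becomes the new left edge, and vice versa.
    new_boxes[:, 0] = width - boxes[:, 2]
    new_boxes[:, 2] = width - boxes[:, 0]
    return flipped, new_boxes


rng = np.random.default_rng(0)
image = np.zeros((4, 10, 3))
boxes = np.array([[1.0, 0.0, 4.0, 3.0]])
flipped_image, flipped_boxes = random_horizontal_flip(image, boxes, rng, p=1.0)
print(flipped_boxes)
```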
How Much Does It Help?
Also Read:
YOLO9000: Better, Faster, Stronger
https://arxiv.org/abs/1612.08242
Speed/accuracy factors in object detectors

● Algorithm: Faster R-CNN / SSD / R-FCN / YOLO / ...
● Backbone: VGG16 / ResNet / MobileNet / ...
● Input size
● Many other hyperparameters...
Speed/accuracy trade-offs for modern convolutional object detectors (Google)
Frameworks

● Caffe
○ Faster R-CNN: https://github.com/rbgirshick/py-faster-rcnn
○ SSD: https://github.com/weiliu89/caffe/tree/ssd
● TensorFlow Object Detection API:
○ https://github.com/tensorflow/models/tree/master/research/object_detection
● Detectron:
○ https://github.com/facebookresearch/Detectron
● Many more re-implementations in different languages...
Honorable mentions

● VGG16: https://arxiv.org/abs/1409.1556
● ResNet: https://arxiv.org/abs/1512.03385
● Inception-ResNet: https://arxiv.org/abs/1602.07261
● ResNeXt: https://arxiv.org/abs/1611.05431
● Xception: https://arxiv.org/abs/1610.02357
● DenseNet: https://arxiv.org/abs/1608.06993
● MobileNet: https://arxiv.org/abs/1704.04861
● SqueezeNet: https://arxiv.org/abs/1602.07360
