You Only Look Once: Unified, Real-Time Object Detection
University of Washington
http://pjreddie.com/yolo/
Abstract

Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model's confidence.
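For concreteness, the whole pipeline fits in a few lines. The sketch below is ours, not the paper's Darknet code: `dummy_network` stands in for the trained model, and the prediction layout (B boxes of (x, y, w, h, confidence) followed by C class probabilities per cell) is an assumed convention.

```python
# Minimal sketch of the three-step pipeline in Figure 1.
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, PASCAL VOC classes

def dummy_network(image):
    # Stand-in for the trained convolutional network.
    return np.random.default_rng(0).random((S, S, B * 5 + C))

def detect(image, conf_threshold=0.2):
    # 1. Resize the image to 448 x 448 (nearest-neighbor for brevity).
    ys = np.linspace(0, image.shape[0] - 1, 448).astype(int)
    xs = np.linspace(0, image.shape[1] - 1, 448).astype(int)
    resized = image[ys][:, xs]
    # 2. Run a single convolutional network.
    preds = dummy_network(resized)
    # 3. Threshold the resulting detections by the model's confidence.
    detections = []
    for i in range(S):
        for j in range(S):
            for b in range(B):
                x, y, w, h, conf = preds[i, j, 5 * b:5 * b + 5]
                if conf > conf_threshold:
                    detections.append((i, j, x, y, w, h, conf))
    return detections

dets = detect(np.zeros((480, 640, 3)))  # example call on a blank image
```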
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors.

Approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene.
Second, YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method [14], mistakes background patches in an image for objects because it can't see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin. Since YOLO is highly generalizable it is less likely to break down when applied to new domains or unexpected inputs.
YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones. We examine these tradeoffs further in our experiments.

All of our training and testing code is open source. A variety of pretrained models are also available to download.

[Figure 2 panels: S × S grid on input; bounding boxes + confidence; final detections.]

We only predict one set of class probabilities per grid cell, regardless of the number of boxes B. At test time we multiply the conditional class probabilities and the individual box confidence predictions,

$$\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) \times \text{IOU}^{\text{truth}}_{\text{pred}} \tag{1}$$

which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
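A minimal NumPy sketch of this scoring step, reusing the assumed tensor layout from the sketch above (boxes first, then class probabilities; the real Darknet layout may differ):

```python
# Class-specific confidence scores, Eq. (1).
import numpy as np

S, B, C = 7, 2, 20
preds = np.random.default_rng(1).random((S, S, B * 5 + C))

box_conf = preds[..., [4, 9]]     # Pr(Object) * IOU for each of B = 2 boxes
class_prob = preds[..., B * 5:]   # Pr(Class_i | Object), shape (S, S, C)

# Pr(Class_i) * IOU for every box and class, shape (S, S, B, C).
scores = box_conf[..., :, None] * class_prob[..., None, :]

# The model predicts S*S*B = 7*7*2 = 98 boxes per image, far fewer than
# the roughly 2000 proposals Selective Search generates.
print(scores.shape, S * S * B)
```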
[Figure 3 schematic. Feature maps shrink 448 → 112 → 56 → 28 → 14 → 7 through the following stages:]

Conv. Layer: 7×7×64-s-2; Maxpool Layer: 2×2-s-2
Conv. Layer: 3×3×192; Maxpool Layer: 2×2-s-2
Conv. Layers: 1×1×128, 3×3×256, 1×1×256, 3×3×512; Maxpool Layer: 2×2-s-2
Conv. Layers: {1×1×256, 3×3×512}×4, 1×1×512, 3×3×1024; Maxpool Layer: 2×2-s-2
Conv. Layers: {1×1×512, 3×3×1024}×2, 3×3×1024, 3×3×1024-s-2
Conv. Layers: 3×3×1024, 3×3×1024
Conn. Layer; Conn. Layer
Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.
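Read off the figure, the configuration can be written as data. The following PyTorch sketch is our construction, not the paper's Darknet implementation; the kernel//2 padding and the 4096-unit hidden layer follow common reimplementations and should be treated as assumptions.

```python
# Sketch of the Figure 3 configuration. 'M' is a 2x2-s-2 maxpool;
# tuples are (kernel, out_channels, stride).
import torch
import torch.nn as nn

CFG = [
    (7, 64, 2), 'M',
    (3, 192, 1), 'M',
    (1, 128, 1), (3, 256, 1), (1, 256, 1), (3, 512, 1), 'M',
    *[(1, 256, 1), (3, 512, 1)] * 4, (1, 512, 1), (3, 1024, 1), 'M',
    *[(1, 512, 1), (3, 1024, 1)] * 2, (3, 1024, 1), (3, 1024, 2),
    (3, 1024, 1), (3, 1024, 1),
]  # 24 convolutional layers in total

def build_yolo(S=7, B=2, C=20):
    layers, in_ch = [], 3
    for item in CFG:
        if item == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            k, out_ch, s = item
            layers += [nn.Conv2d(in_ch, out_ch, k, s, k // 2),
                       nn.LeakyReLU(0.1)]
            in_ch = out_ch
    layers += [nn.Flatten(),
               nn.Linear(1024 * S * S, 4096), nn.LeakyReLU(0.1),
               nn.Linear(4096, S * S * (B * 5 + C))]  # linear final layer
    return nn.Sequential(*layers)

net = build_yolo()
out = net(torch.zeros(1, 3, 448, 448)).view(1, 7, 7, 30)  # 7 x 7 x 30
```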
The final output of our network is the 7 × 7 × 30 tensor of predictions.

2.2. Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [30]. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We train this network for approximately a week and achieve a single crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo [24]. We use the Darknet framework for all training and inference [26].

We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance [29]. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases} \tag{2}$$
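A short sketch of this coordinate parametrization and of the activation in Eq. (2); the function names are ours:

```python
# Targets in [0, 1]: (x, y) as offsets within the responsible grid cell,
# (w, h) normalized by the image size.
import numpy as np

def leaky(x):
    # Eq. (2): identity for x > 0, slope 0.1 otherwise.
    return np.where(x > 0, x, 0.1 * x)

def encode_box(cx, cy, w, h, img_w, img_h, S=7):
    # (cx, cy): box center in pixels; (w, h): box size in pixels.
    col = int(cx / img_w * S)       # grid cell containing the center
    row = int(cy / img_h * S)
    x_off = cx / img_w * S - col    # offset within the cell, in [0, 1)
    y_off = cy / img_h * S - row
    return row, col, x_off, y_off, w / img_w, h / img_h

print(encode_box(224, 224, 100, 50, 448, 448))  # center cell, offsets 0.5
```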
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal.

Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.

To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, λ_coord and λ_noobj, to accomplish this. We set λ_coord = 5 and λ_noobj = .5.

Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be "responsible" for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.

During training we optimize the following, multi-part loss function:

$$
\begin{aligned}
&\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned} \tag{3}
$$

Some objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

2.4. Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
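Spelling out Eq. (3) as code makes the masks explicit. A NumPy sketch, with array names and shapes of our choosing (`obj` is the responsibility indicator 1_ij^obj, `obj_cell` is 1_i^obj):

```python
# Multi-part YOLO loss, Eq. (3). Assumes w and h entries are non-negative.
import numpy as np

def yolo_loss(pred, target, obj, obj_cell, lam_coord=5.0, lam_noobj=0.5):
    # pred/target: dicts of arrays with shapes
    #   x, y, w, h, conf: (S*S, B);  probs: (S*S, C)
    # obj: (S*S, B) responsibility mask; obj_cell: (S*S,) any-object mask.
    noobj = 1.0 - obj
    # Localization: center coordinates, weighted up by lambda_coord.
    loss = lam_coord * np.sum(obj * ((pred['x'] - target['x']) ** 2 +
                                     (pred['y'] - target['y']) ** 2))
    # Localization: square roots of width and height.
    loss += lam_coord * np.sum(obj * (
        (np.sqrt(pred['w']) - np.sqrt(target['w'])) ** 2 +
        (np.sqrt(pred['h']) - np.sqrt(target['h'])) ** 2))
    # Confidence for responsible boxes, and down-weighted for empty ones.
    loss += np.sum(obj * (pred['conf'] - target['conf']) ** 2)
    loss += lam_noobj * np.sum(noobj * (pred['conf'] - target['conf']) ** 2)
    # Classification, only for cells that contain an object.
    loss += np.sum(obj_cell[:, None] * (pred['probs'] - target['probs']) ** 2)
    return loss
```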
Selective Search [35] generates potential bounding boxes, a convolutional network extracts features, an SVM scores the boxes, a linear model adjusts the bounding boxes, and non-max suppression eliminates duplicate detections. Each stage of this complex pipeline must be precisely tuned independently and the resulting system is very slow, taking more than 40 seconds per image at test time [14].

YOLO shares some similarities with R-CNN. Each grid cell proposes potential bounding boxes and scores those boxes using convolutional features. However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object. Our system also proposes far fewer bounding boxes, only 98 per image compared to about 2000 from Selective Search. Finally, our system combines these individual components into a single, jointly optimized model.

Other Fast Detectors. Fast and Faster R-CNN focus on speeding up the R-CNN framework by sharing computation and using neural networks to propose regions instead of Selective Search [14] [28]. While they offer speed and accuracy improvements over R-CNN, both still fall short of real-time performance.

Many research efforts focus on speeding up the DPM pipeline [31] [38] [5]. They speed up HOG computation, use cascades, and push computation to GPUs. However, only 30Hz DPM [31] actually runs in real-time.

Instead of trying to optimize individual components of a large detection pipeline, YOLO throws out the pipeline entirely and is fast by design.

Detectors for single classes like faces or people can be highly optimized since they have to deal with much less variation [37]. YOLO is a general purpose detector that learns to detect a variety of objects simultaneously.

Deep MultiBox. Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [8] instead of using Selective Search. MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction. However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification. Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection system.

OverFeat. Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [32]. OverFeat efficiently performs sliding window detection but it is still a disjoint system. OverFeat optimizes for localization, not detection performance. Like DPM, the localizer only sees local information when making a prediction. OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp. Our work is similar in design to work on grasp detection by Redmon et al. [27]. Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps. However, grasp detection is a much simpler task than object detection. MultiGrasp only needs to predict a single graspable region for an image containing one object. It doesn't have to estimate the size, location, or boundaries of the object or predict its class, only find a region suitable for grasping. YOLO predicts both bounding boxes and class probabilities for multiple objects of multiple classes in an image.

4. Experiments

First we compare YOLO with other real-time detection systems on PASCAL VOC 2007. To understand the differences between YOLO and R-CNN variants we explore the errors on VOC 2007 made by YOLO and Fast R-CNN, one of the highest performing versions of R-CNN [14]. Based on the different error profiles we show that YOLO can be used to rescore Fast R-CNN detections and reduce the errors from background false positives, giving a significant performance boost. We also present VOC 2012 results and compare mAP to current state-of-the-art methods. Finally, we show that YOLO generalizes to new domains better than other detectors on two artwork datasets.

4.1. Comparison to Other Real-Time Systems

Many research efforts in object detection focus on making standard detection pipelines fast [5] [38] [31] [14] [17] [28]. However, only Sadeghi et al. actually produce a detection system that runs in real-time (30 frames per second or better) [31]. We compare YOLO to their GPU implementation of DPM which runs either at 30Hz or 100Hz. While the other efforts don't reach the real-time milestone we also compare their relative mAP and speed to examine the accuracy-performance tradeoffs available in object detection systems.

Fast YOLO is the fastest object detection method on PASCAL; as far as we know, it is the fastest extant object detector. With 52.7% mAP, it is more than twice as accurate as prior work on real-time detection. YOLO pushes mAP to 63.4% while still maintaining real-time performance.

We also train YOLO using VGG-16. This model is more accurate but also significantly slower than YOLO. It is useful for comparison to other detection systems that rely on VGG-16 but since it is slower than real-time the rest of the paper focuses on our faster models.

Fastest DPM effectively speeds up DPM without sacrificing much mAP but it still misses real-time performance by a factor of 2 [38]. It also is limited by DPM's relatively low accuracy on detection compared to neural network approaches.

R-CNN minus R replaces Selective Search with static bounding box proposals [20]. While it is much faster than R-CNN, it still falls short of real-time and takes a big accuracy hit from not having good proposals.
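Both the classical pipeline above and YOLO's own inference step use non-max suppression to prune duplicate detections. A minimal greedy sketch, ours rather than any reference implementation:

```python
# Greedy non-max suppression over (x1, y1, x2, y2) boxes.
import numpy as np

def iou(a, b):
    # Intersection over union of two boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, thresh=0.5):
    # Keep the highest-scoring box, drop any box overlapping it too much.
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```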
We compare YOLO to Fast R-CNN since Fast R-CNN is one of the highest performing detectors on PASCAL and its detections are publicly available.

[Table 2 fragment: Fast R-CNN (2007 data): 66.9 mAP; combined with YOLO: 72.4 mAP (+.6).]
VOC 2012 test. Columns: mAP, then per-class AP in the order aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, table, dog, horse, mbike, person, plant, sheep, sofa, train, tv.
MR CNN MORE DATA [11] 73.9 85.5 82.9 76.6 57.8 62.7 79.4 77.2 86.6 55.0 79.1 62.2 87.0 83.4 84.7 78.9 45.3 73.4 65.8 80.3 74.0
HyperNet VGG 71.4 84.2 78.5 73.6 55.6 53.7 78.7 79.8 87.7 49.6 74.9 52.1 86.0 81.7 83.3 81.8 48.6 73.5 59.4 79.9 65.7
HyperNet SP 71.3 84.1 78.3 73.3 55.5 53.6 78.6 79.6 87.5 49.5 74.9 52.1 85.6 81.6 83.2 81.6 48.4 73.2 59.3 79.7 65.6
Fast R-CNN + YOLO 70.7 83.4 78.5 73.5 55.8 43.4 79.1 73.1 89.4 49.4 75.5 57.0 87.5 80.9 81.0 74.7 41.8 71.5 68.5 82.1 67.2
MR CNN S CNN [11] 70.7 85.0 79.6 71.5 55.3 57.7 76.0 73.9 84.6 50.5 74.3 61.7 85.5 79.9 81.7 76.4 41.0 69.0 61.2 77.7 72.1
Faster R-CNN [28] 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
DEEP ENS COCO 70.1 84.0 79.4 71.6 51.9 51.1 74.1 72.1 88.6 48.3 73.4 57.8 86.1 80.0 80.7 70.4 46.6 69.6 68.8 75.9 71.4
NoC [29] 68.8 82.8 79.0 71.6 52.3 53.7 74.1 69.0 84.9 46.9 74.3 53.1 85.0 81.3 79.5 72.2 38.9 72.4 59.5 76.7 68.1
Fast R-CNN [14] 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
UMICH FGS STRUCT 66.4 82.9 76.1 64.1 44.6 49.4 70.3 71.2 84.6 42.7 68.6 55.8 82.7 77.1 79.9 68.7 41.4 69.0 60.0 72.0 66.2
NUS NIN C2000 [7] 63.8 80.2 73.8 61.9 43.7 43.0 70.3 67.6 80.7 41.9 69.7 51.7 78.2 75.2 76.9 65.1 38.6 68.3 58.0 68.7 63.3
BabyLearning [7] 63.2 78.0 74.2 61.3 45.7 42.7 68.2 66.8 80.2 40.6 70.0 49.8 79.0 74.5 77.9 64.0 35.3 67.9 55.7 68.7 62.6
NUS NIN 62.4 77.9 73.1 62.6 39.5 43.3 69.1 66.4 78.9 39.1 68.1 50.0 77.2 71.3 76.1 64.7 38.4 66.9 56.2 66.9 62.7
R-CNN VGG BB [13] 62.4 79.6 72.7 61.9 41.2 41.9 65.9 66.4 84.6 38.5 67.2 46.7 82.0 74.8 76.0 65.2 35.6 65.4 54.2 67.4 60.3
R-CNN VGG [13] 59.2 76.8 70.9 56.6 37.5 36.9 62.9 63.6 81.1 35.7 64.3 43.9 80.4 71.6 74.0 60.0 30.8 63.4 52.0 63.5 58.7
YOLO 57.9 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8
Feature Edit [33] 56.3 74.6 69.1 54.4 39.1 33.1 65.2 62.7 69.7 30.8 56.0 44.6 70.0 64.4 71.1 60.2 33.3 61.3 46.4 61.7 57.8
R-CNN BB [13] 53.3 71.8 65.8 52.0 34.1 32.6 59.6 60.0 69.8 27.6 52.0 41.7 69.6 61.3 68.3 57.8 29.6 57.8 40.9 59.3 54.1
SDS [16] 50.7 69.7 58.4 48.5 28.3 28.8 61.3 57.5 70.8 24.1 50.7 35.9 64.9 59.1 65.8 57.1 26.0 58.8 38.6 58.9 50.7
R-CNN [13] 49.6 68.1 63.8 46.1 29.4 27.9 56.6 57.0 65.9 26.5 48.7 39.5 66.2 57.3 65.4 53.2 26.2 54.5 38.1 50.6 51.6
Table 3: PASCAL VOC 2012 Leaderboard. YOLO compared with the full comp4 (outside data allowed) public leaderboard as of November 6th, 2015. Mean average precision and per-class average precision are shown for a variety of detection methods. YOLO is the only real-time detector. Fast R-CNN + YOLO is the fourth highest scoring method, with a 2.3% boost over Fast R-CNN.
mAP increases by 3.2% to 75.0%. We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN. Those ensembles produced small increases in mAP between .3 and .6%, see Table 2 for details.

The boost from YOLO is not simply a byproduct of model ensembling since there is little benefit from combining different versions of Fast R-CNN. Rather, it is precisely because YOLO makes different kinds of mistakes at test time that it is so effective at boosting Fast R-CNN's performance.

Unfortunately, this combination doesn't benefit from the speed of YOLO since we run each model separately and then combine the results. However, since YOLO is so fast it doesn't add any significant computational time compared to Fast R-CNN.

4.4. VOC 2012 Results

On the VOC 2012 test set, YOLO scores 57.9% mAP. This is lower than the current state of the art, closer to the original R-CNN using VGG-16, see Table 3. Our system struggles with small objects compared to its closest competitors. On categories like bottle, sheep, and tv/monitor YOLO scores 8-10% lower than R-CNN or Feature Edit. However, on other categories like cat and train YOLO achieves higher performance.

Our combined Fast R-CNN + YOLO model is one of the highest performing detection methods. Fast R-CNN gets a 2.3% improvement from the combination with YOLO, boosting it 5 spots up on the public leaderboard.

4.5. Generalizability: Person Detection in Artwork

Academic datasets for object detection draw the training and testing data from the same distribution. In real-world applications it is hard to predict all possible use cases and the test data can diverge from what the system has seen before [3]. We compare YOLO to other detection systems on the Picasso Dataset [12] and the People-Art Dataset [3], two datasets for testing person detection on artwork.

Figure 5 shows comparative performance between YOLO and other detection methods. For reference, we give VOC 2007 detection AP on person where all models are trained only on VOC 2007 data. On Picasso models are trained on VOC 2012 while on People-Art they are trained on VOC 2010.

R-CNN has high AP on VOC 2007. However, R-CNN drops off considerably when applied to artwork. R-CNN uses Selective Search for bounding box proposals which is tuned for natural images. The classifier step in R-CNN only sees small regions and needs good proposals.

DPM maintains its AP well when applied to artwork. Prior work theorizes that DPM performs well because it has strong spatial models of the shape and layout of objects. Though DPM doesn't degrade as much as R-CNN, it starts from a lower AP.

YOLO has good performance on VOC 2007 and its AP degrades less than other methods when applied to artwork. Like DPM, YOLO models the size and shape of objects, as well as relationships between objects and where objects commonly appear. Artwork and natural images are very different on a pixel level but they are similar in terms of the size and shape of objects, thus YOLO can still predict good bounding boxes and detections.
Figure 5: (a) Picasso Dataset precision-recall curves. The Picasso Dataset evaluates on both AP and best F1 score. (b) Quantitative results on the VOC 2007, Picasso, and People-Art Datasets.

Figure 6: Qualitative Results. YOLO running on sample artwork and natural images from the internet. It is mostly accurate although it does think one person is an airplane.
5. Real-Time Detection In The Wild

YOLO is a fast, accurate object detector, making it ideal for computer vision applications. We connect YOLO to a webcam and verify that it maintains real-time performance, including the time to fetch images from the camera and display the detections.

The resulting system is interactive and engaging. While YOLO processes images individually, when attached to a webcam it functions like a tracking system, detecting objects as they move around and change in appearance. A demo of the system and the source code can be found on our project website: http://pjreddie.com/yolo/.

6. Conclusion

We introduce YOLO, a unified model for object detection. Our model is simple to construct and can be trained directly on full images. Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and the entire model is trained jointly.

Fast YOLO is the fastest general-purpose object detector in the literature and YOLO pushes the state-of-the-art in real-time object detection. YOLO also generalizes well to new domains making it ideal for applications that rely on fast, robust object detection.

Acknowledgements: This work is partially supported by ONR N00014-13-1-0720, NSF IIS-1338054, and The Allen Distinguished Investigator Award.