Fast Feature Pyramids For Object Detection
Abstract—Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed
explicitly. This fundamental insight allows us to design object detection algorithms that are as accurate as, and considerably faster than, the state of the art. The computational bottleneck of many modern detectors is the computation of features at every scale of a finely-sampled image pyramid. Our key insight is that one may compute finely sampled feature pyramids at a fraction of the cost, without
sacrificing performance: for a broad family of features we find that features computed at octave-spaced scale intervals are sufficient to
approximate features on a finely-sampled pyramid. Extrapolation is inexpensive as compared to direct feature computation. As a result,
our approximation yields considerable speedups with negligible loss in detection accuracy. We modify three diverse visual recognition
systems to use fast feature pyramids and show results on both pedestrian detection (measured on the Caltech, INRIA, TUD-Brussels
and ETH data sets) and general object detection (measured on the PASCAL VOC). The approach is general and is widely applicable to
vision algorithms requiring fine-grained multi-scale analysis. Our approximation is valid for images with broad spectra (most natural
images) and fails for images with narrow band-pass spectra (e.g., periodic textures).
Index Terms—Visual features, object detection, image pyramids, pedestrian detection, natural image statistics, real-time systems
1 INTRODUCTION
image structure across scales. Our analysis and experiments show that this makes it possible to inexpensively estimate features at a dense set of scales by extrapolating computations carried out expensively, but infrequently, at a coarsely sampled set of scales.

Our insight leads to considerably decreased run-times for state-of-the-art object detectors that rely on rich representations, including histograms of gradients [21], with negligible impact on their detection rates. We demonstrate the effectiveness of our proposed fast feature pyramids with three distinct detection frameworks including integral channel features (ICF) [29], aggregate channel features (a novel variant of integral channel features), and deformable part models (DPM) [35]. We show results for both pedestrian detection (measured on the Caltech [31], INRIA [21], TUD-Brussels [36] and ETH [37] data sets) and general object detection (measured on the PASCAL VOC [38]). Demonstrated speedups are significant and impact on accuracy is relatively minor.

Building on our work on fast feature pyramids (first presented in [39]), a number of systems show state-of-the-art accuracy while running at frame rate on 640x480 images. Aggregate channel features, described in this paper, operate at over 30 fps while achieving top results on pedestrian detection. Crosstalk cascades [40] use fast feature pyramids and couple detector evaluations of nearby windows to achieve speeds of 35-65 fps. Benenson et al. [30] implemented fast feature pyramids on a GPU, and with additional innovations achieved detection rates of over 100 fps. In this work we examine and analyze feature scaling and its effect on object detection in far more detail than in our previous work [39].

The remainder of this paper is organized as follows. We review related work in Section 2. In Section 3 we show that it is possible to create high fidelity approximations of multiscale gradient histograms using gradients computed at a single scale. In Section 4 we generalize this finding to a broad family of feature types. We describe our efficient scheme for computing finely sampled feature pyramids in Section 5. In Section 6 we show applications of fast feature pyramids to object detection, resulting in considerable speedups with minor loss in accuracy. We conclude in Section 7.

2 RELATED WORK

Significant research has been devoted to scale space theory [41], including real time implementations of octave and half-octave image pyramids [42], [43]. Sparse image pyramids often suffice for certain approximations, e.g., [42] shows how to recover a disk's characteristic scale using half-octave pyramids. Although only loosely related, these ideas provide the intuition that finely sampled feature pyramids can perhaps be approximated.

Fast object detection has been of considerable interest in the community. Notable recent efforts for increasing detection speed include work by Felzenszwalb et al. [44] and Pedersoli et al. [45] on cascaded and coarse-to-fine deformable part models, respectively, Lampert et al.'s [46] application of branch and bound search for detection, and Dollár et al.'s work on crosstalk cascades [40]. Cascades [27], [47], [48], [49], [50], coarse-to-fine search [51], distance transforms [52], etc., all focus on optimizing classification speed given precomputed image features. Our work focuses on fast feature pyramid construction and is thus complementary to such approaches.

An effective framework for object detection is the sliding window paradigm [27], [53]. Top performing methods on pedestrian detection [31] and the PASCAL VOC [38] are based on sliding windows over multiscale feature pyramids [21], [29], [35]; fast feature pyramids are well suited for such sliding window detectors. Alternative detection paradigms have been proposed [54], [55], [56], [57], [58], [59]. Although a full review is outside the scope of this work, the approximations we propose could potentially be applicable to such schemes as well.

As mentioned, a number of state-of-the-art detectors have recently been introduced that exploit our fast feature pyramid construction to operate at frame rate, including [40] and [30]. Alternatively, parallel implementations using GPUs [60], [61], [62] can achieve fast detection while using rich representations, but at the cost of added complexity and hardware requirements. Zhu et al. [63] proposed fast computation of gradient histograms using integral histograms [64]; the proposed system was real time for single-scale detection only. In scenarios such as automotive applications, real time systems have also been demonstrated [65], [66]. The insights outlined in this paper allow for real time multiscale detection in general, unconstrained settings.

3 MULTISCALE GRADIENT HISTOGRAMS

We begin by exploring a simple question: given image gradients computed at one scale, is it possible to approximate gradient histograms at a nearby scale solely from the computed gradients? If so, then we can avoid computing gradients over a finely sampled image pyramid. Intuitively, one would expect this to be possible, as significant image structure is preserved when an image is resampled. We begin with an in-depth look at a simple form of gradient histograms and develop a more general theory in Section 4.

A gradient histogram measures the distribution of the gradient angles within an image. Let $I(x, y)$ denote an $m \times n$ discrete signal, and $\partial I/\partial x$ and $\partial I/\partial y$ denote the discrete derivatives of $I$ (typically 1D centered first differences are used). Gradient magnitude and orientation are defined by $M(i,j)^2 = \frac{\partial I}{\partial x}(i,j)^2 + \frac{\partial I}{\partial y}(i,j)^2$ and $O(i,j) = \arctan\left(\frac{\partial I}{\partial y}(i,j) / \frac{\partial I}{\partial x}(i,j)\right)$. To compute the gradient histogram of an image, each pixel casts a vote, weighted by its gradient magnitude, for the bin corresponding to its gradient orientation. After the orientation $O$ is quantized into $Q$ bins so that $O(i,j) \in \{1, \ldots, Q\}$, the $q$th bin of the histogram is defined by $h_q = \sum_{i,j} M(i,j)\,\mathbf{1}[O(i,j) = q]$, where $\mathbf{1}$ is the indicator function. In the following, everything that holds for global histograms also applies to local histograms (defined identically except for the range of the indices $i$ and $j$).
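The definition above maps directly to a few lines of code. The following is a minimal NumPy sketch, not from the paper (whose reference implementation is the MATLAB toolbox mentioned in Fig. 3); it assumes unsigned orientations folded into $[0, \pi)$ and uses np.gradient's centered differences, matching the 1D centered first differences above.

```python
import numpy as np

def gradient_histogram(I, Q=6):
    """Global gradient histogram with Q orientation bins (Section 3):
    each pixel votes for the bin of its quantized orientation,
    weighted by its gradient magnitude."""
    dy, dx = np.gradient(I.astype(np.float64))      # centered first differences
    M = np.sqrt(dx ** 2 + dy ** 2)                  # gradient magnitude M(i,j)
    O = np.mod(np.arctan2(dy, dx), np.pi)           # orientation O(i,j) in [0, pi)
    bins = np.minimum((O / np.pi * Q).astype(int), Q - 1)  # quantize into Q bins
    return np.array([M[bins == q].sum() for q in range(Q)])  # h_q, q = 1..Q
```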
Fig. 1. Behavior of gradient histograms in images resampled by a factor of two. (a) Upsampling gradients. Given images $I$ and $I'$ where $I'$ denotes $I$ upsampled by two, and corresponding gradient magnitude images $M$ and $M'$, the ratio $\Sigma M' / \Sigma M$ should be approximately 2. The middle/bottom panels show the distribution of this ratio for gradients at fixed orientation over pedestrian/natural images. In both cases the mean $\mu \approx 2$, as expected, and the variance is relatively small. (b) Downsampling gradients. Given images $I$ and $I'$ where $I'$ denotes $I$ downsampled by two, the ratio $\Sigma M' / \Sigma M \approx 0.34$, not 0.5 as might be expected from (a), as downsampling results in loss of high frequency content. (c) Downsampling normalized gradients. Given normalized gradient magnitude images $\widetilde{M}$ and $\widetilde{M}'$, the ratio $\Sigma \widetilde{M}' / \Sigma \widetilde{M} \approx 0.27$. Instead of trying to derive analytical expressions governing the scaling properties of various feature types under different resampling factors, in Section 4 we describe a general law governing feature scaling.
continuous signal, and let $I'$ denote $I$ upsampled by a factor of $k$: $I'(x, y) \equiv I(x/k, y/k)$. Using the definition of a derivative, one can show that $\frac{\partial I'}{\partial x}(i,j) = \frac{1}{k}\frac{\partial I}{\partial x}(i/k, j/k)$, and likewise for $\frac{\partial I'}{\partial y}$, which simply states the intuitive fact that the rate of change in the upsampled image is $k$ times slower than the rate of change in the original image. While not exact, the above also holds approximately for interpolated discrete signals. Let $M'(i,j) \approx \frac{1}{k} M(\lceil i/k \rceil, \lceil j/k \rceil)$ denote the gradient magnitude in an upsampled discrete image. Then:

$$\sum_{i=1}^{kn} \sum_{j=1}^{km} M'(i,j) \approx \sum_{i=1}^{kn} \sum_{j=1}^{km} \frac{1}{k} M(\lceil i/k \rceil, \lceil j/k \rceil) = k^2 \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{k} M(i,j) = k \sum_{i=1}^{n} \sum_{j=1}^{m} M(i,j). \quad (1)$$

Thus, the sum of gradient magnitudes in the original and upsampled image should be related by about a factor of $k$. Angles should also be mostly preserved since $\frac{\partial I'}{\partial x}(i,j) / \frac{\partial I'}{\partial y}(i,j) \approx \frac{\partial I}{\partial x}(i/k, j/k) / \frac{\partial I}{\partial y}(i/k, j/k)$. Therefore, according to the definition of gradient histograms, we expect the relationship between $h_q$ (computed over $I$) and $h'_q$ (computed over $I'$) to be $h'_q \approx k h_q$. This allows us to approximate gradient histograms in an upsampled image using gradients computed at the original scale.

Experiments. One may verify experimentally that in images of natural scenes, upsampled using bilinear interpolation, the approximation $h'_q \approx k h_q$ is reasonable. We use two sets of images for these experiments, one class specific and one class independent. First, we use the 1,237 cropped pedestrian images from the INRIA pedestrians training data set [21]. Each image is 128x64 and contains a pedestrian approximately 96 pixels tall. The second image set contains 128x64 windows cropped at random positions from the 1,218 images in the INRIA negative training set. We sample 5,000 windows but exclude nearly uniform windows, i.e., those with average gradient magnitude under 0.01, resulting in 4,280 images. We refer to the two sets as 'pedestrian images' and 'natural images,' although the latter is biased toward scenes that may (but do not) contain pedestrians.

In order to measure the fidelity of this approximation, we define the ratio $r_q = h'_q / h_q$ and quantize orientation into $Q = 6$ bins. Fig. 1a shows the distribution of $r_q$ for one bin on the 1,237 pedestrian and 4,280 natural images given an upsampling of $k = 2$ (results for other bins were similar). In both cases the mean is $\mu \approx 2$, as expected, and the variance is relatively small, meaning the approximation is unbiased and reasonable.

Thus, although individual gradients may change, gradient histograms in an upsampled and original image will be related by a multiplicative constant roughly equal to the scale change between them. We examine gradient histograms in downsampled images next.
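Both the upsampling experiment above and the downsampling experiment of the next subsection reduce to measuring $r_q$ under resampling. Below is a minimal sketch, reusing the gradient_histogram sketch above and assuming scipy.ndimage.zoom with order=1 as the bilinear resampling routine (the exact interpolation method is an implementation choice, not prescribed by the paper).

```python
import numpy as np
from scipy.ndimage import zoom

def resampling_ratios(I, k=2.0, Q=6):
    """Per-bin ratios r_q = h'_q / h_q between an image and the image
    resampled by factor k (k > 1 upsamples, k < 1 downsamples).
    Averaged over many images, r_q should be near 2 for k = 2 and
    near 0.34 for k = 0.5 (Sections 3.1-3.2)."""
    I = I.astype(np.float64)
    h = gradient_histogram(I, Q)                    # from the sketch above
    h_resampled = gradient_histogram(zoom(I, k, order=1), Q)
    return h_resampled / h
```

Averaging these ratios over an ensemble of windows (excluding nearly uniform ones, as in the text) should reproduce the $\mu \approx 2$ behavior of Fig. 1a.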
Fig. 2. Approximating gradient histograms in images resampled by a factor of two. For each image set, we take the original image (green border) and generate an upsampled (blue) and downsampled (orange) version. At each scale we compute a gradient histogram with eight bins, multiplying each bin by 0.5 and 1/0.34 in the upsampled and downsampled histogram, respectively. Assuming the approximations from Section 3 hold, the three normalized gradient histograms should be roughly equal (the blue, green, and orange bars should have the same height at each orientation). For the first four cases, the approximations are fairly accurate. In the last two cases, showing highly structured Brodatz textures with significant high frequency content, the downsampling approximation fails. The first four images are representative; the last two are carefully selected to demonstrate images with atypical statistics.
3.2 Gradient Histograms in Downsampled Images

While the information content of an upsampled image is roughly the same as that of the original image, information is typically lost during downsampling. However, we find that the information loss is consistent and the resulting approximation takes on a similarly simple form.

If $I$ contains little high frequency energy, then the approximation $h'_q \approx k h_q$ derived in Section 3.1 should apply. In general, however, downsampling results in loss of high frequency content, which can lead to measured gradients undershooting the extrapolated gradients. Let $I'$ now denote $I$ downsampled by a factor of $k$. We expect that $h_q$ (computed over $I$) and $h'_q$ (computed over $I'$) will satisfy $h'_q \leq h_q / k$. The question we seek to answer here is whether the information loss is consistent.

Experiments. As before, define $r_q = h'_q / h_q$. In Fig. 1b we show the distribution of $r_q$ for a single bin on the pedestrian and natural images given a downsampling factor of $k = 2$. Observe that the information loss is consistent: $r_q$ is normally distributed around $\mu \approx 0.34 < 0.5$ for natural images (and similarly $\mu \approx 0.33$ for pedestrians). This implies that $h'_q \approx \mu h_q$ could serve as a reasonable approximation for gradient histograms in images downsampled by $k = 2$.

In other words, similarly to upsampling, gradient histograms computed over original and half resolution images tend to differ by a multiplicative constant (although the constant is not the inverse of the sampling factor). In Fig. 2 we show the quality of the above approximations on example images. The agreement between predictions and observations is accurate for typical images (but fails for images with atypical statistics).

3.3 Histograms of Normalized Gradients

Suppose we replaced the gradient magnitude $M$ by the normalized gradient magnitude $\widetilde{M}$ defined as $\widetilde{M}(i,j) = M(i,j) / (\overline{M}(i,j) + 0.005)$, where $\overline{M}$ is the average gradient magnitude in each 11x11 image patch (computed by convolving $M$ with an L1 normalized 11x11 triangle filter). Using the normalized gradient $\widetilde{M}$ gives improved results in the context of object detection (see Section 6). Observe that we have now introduced an additional nonlinearity to the gradient computation; do the previous results for gradient histograms still hold if we use $\widetilde{M}$ instead of $M$?

In Fig. 1c we plot the distribution of $r_q = h'_q / h_q$ for histograms of normalized gradients given a downsampling factor of $k = 2$. As with the original gradient histograms, the distributions of $r_q$ are normally distributed and have similar means for pedestrian and natural images ($\mu \approx 0.26$ and $\mu \approx 0.27$, respectively). Observe, however, that the expected value of $r_q$ for normalized gradient histograms is quite different than for the original histograms (Fig. 1b).

Deriving analytical expressions governing the scaling properties of progressively more complex feature types would be difficult or even impossible. Instead, in Section 4 we describe a general law governing feature scaling.
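The normalization used above is easy to state in code. Here is a minimal sketch, assuming a separable 11x11 triangle kernel and 'nearest' boundary handling; neither detail is prescribed by the text.

```python
import numpy as np
from scipy.ndimage import convolve

def normalized_gradient_magnitude(M, radius=5, eps=0.005):
    """M~(i,j) = M(i,j) / (Mbar(i,j) + eps), where Mbar is M smoothed
    with an L1-normalized (2*radius+1) x (2*radius+1) triangle filter
    (11x11 for radius=5), as described in Section 3.3."""
    t = np.concatenate([np.arange(1.0, radius + 2), np.arange(radius, 0.0, -1)])
    tri = np.outer(t, t)
    tri /= tri.sum()                          # L1 normalization
    Mbar = convolve(M, tri, mode='nearest')   # local triangle-weighted average
    return M / (Mbar + eps)
```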
4 STATISTICS OF MULTISCALE FEATURES

To understand more generally how features behave in resampled images, we turn to the study of natural image statistics [7], [33]. The analysis below provides a deep understanding of the behavior of multiscale features. The practical result is a simple yet powerful approach for predicting the behavior of gradients and other low-level features in resampled images without resorting to analytical derivations that may be difficult except under the simplest conditions.

We begin by defining a broad family of features. Let $\Omega$ be any low-level shift invariant function that takes an image $I$ and creates a new channel image $C = \Omega(I)$, where a channel $C$ is a per-pixel feature map such that output pixels in $C$ are computed from corresponding patches of input pixels in $I$ (thus preserving overall image layout). $C$ may be downsampled relative to $I$ and may contain multiple layers $k$. We define a feature $f_\Omega(I)$ as a weighted sum of the channel $C = \Omega(I)$: $f_\Omega(I) = \sum_{ijk} w_{ijk} C(i,j,k)$. Numerous local and global features can be written in this form including gradient histograms, linear filters, color statistics, and others [29]. Any such low-level shift invariant $\Omega$ can be used, making this representation quite general.

Let $I_s$ denote $I$ at scale $s$, where the dimensions $h_s \times w_s$ of $I_s$ are $s$ times the dimensions of $I$. For $s > 1$, $I_s$ (which denotes a higher resolution version of $I$) typically differs from $I$ upsampled by $s$, while for $s < 1$ an excellent approximation of $I_s$ can be obtained by downsampling $I$. Next, for simplicity we redefine $f_\Omega(I_s)$ as the global mean of the channel $\Omega(I_s)$ (Eq. (2)). For an ensemble of natural images, the expected value of $f_\Omega$ follows a power law in scale (Eq. (3)): $E[f_\Omega(I_{s_1})] / E[f_\Omega(I_{s_2})] = (s_1/s_2)^{-\lambda_\Omega}$, where $\lambda_\Omega$ is a channel-specific constant.

We observe that a single image can itself be considered an ensemble of image patches (smaller images). Since $\Omega$ is shift invariant, we can interpret $f_\Omega(I)$ as computing the average of $f_\Omega(I^k)$ over every patch $I^k$ of $I$, and therefore Eq. (3) can be applied directly for a single image. We formalize this below.

We can decompose an image $I$ into $K$ smaller images $I^1 \ldots I^K$ such that $I = [I^1 \cdots I^K]$. Given that $\Omega$ must be shift invariant, and ignoring boundary effects, $\Omega(I) = \Omega([I^1 \cdots I^K]) \approx [\Omega(I^1) \cdots \Omega(I^K)]$, and substituting into Eq. (2) yields $f_\Omega(I) \approx \sum_k f_\Omega(I^k) / K$. However, we can consider $I^1 \cdots I^K$ as a (small) image ensemble, with $f_\Omega(I) \approx E[f_\Omega(I^k)]$ an expectation over that ensemble. Therefore, substituting $f_\Omega(I_{s_1}) \approx E[f_\Omega(I^k_{s_1})]$ and $f_\Omega(I_{s_2}) \approx E[f_\Omega(I^k_{s_2})]$ into Eq. (3) yields:

$$\frac{f_\Omega(I_{s_1})}{f_\Omega(I_{s_2})} = \left(\frac{s_1}{s_2}\right)^{-\lambda_\Omega} + \mathcal{E}, \quad (4)$$

where $\mathcal{E}$ denotes the deviation from the power law for a given image, with $E[\mathcal{E}] \approx 0$.
Fig. 3. Power law feature scaling. For each of six channel types we plot $\mu_s = \frac{1}{N}\sum_i f_\Omega(I_s^i)/f_\Omega(I_1^i)$ for $s \in \{2^{-1/8}, \ldots, 2^{-24/8}\}$ on a log-log plot for both pedestrian and natural image ensembles. Plots of $f_\Omega(I_{s_1})/f_\Omega(I_{s_2})$ for 20 randomly selected pedestrian images are shown as faint gray lines. Additionally, the best-fit line to $\mu_s$ for the natural images is shown. The resulting $\lambda_\Omega$ and expected error $|E[\mathcal{E}]|$ are given in the plot legends. In all cases the $\mu_s$ follow a power law as predicted by Eq. (4) and are nearly identical for both pedestrian and natural images, showing that the estimate of $\lambda_\Omega$ is robust and generally applicable. The tested channels are: (a) histograms of gradients described in Section 3; (b) histograms of normalized gradients described in Section 3.3; (c) a difference of Gaussian (DoG) filter (with inner and outer $\sigma$ of 0.71 and 1.14, respectively); (d) grayscale images (with $\lambda_\Omega = 0$ as expected); (e) pixel standard deviation computed over local 5x5 neighborhoods, $C(i,j) = \sqrt{E[I(i,j)^2] - E[I(i,j)]^2}$; (f) HOG [21] with 4x4 spatial bins (results were averaged over HOG's 36 channels). Code for generating such plots is available (see chnsScaling.m in Piotr's Toolbox).
From Eq. (4) we expect the measurements $\mu_s$ to have the form $\mu_s = a_\Omega s^{-\lambda_\Omega}$, with $a_\Omega \neq 1$ as an artifact of the interpolation. Note that $a_\Omega$ is only necessary for estimating $\lambda_\Omega$ from downsampled images and is not used subsequently. To estimate $a_\Omega$ and $\lambda_\Omega$, we use a least squares fit of $\log_2(\mu_{s'}) = a'_\Omega - \lambda_\Omega \log_2(s')$ to the 24 measurements computed over natural images (and set $a_\Omega = 2^{a'_\Omega}$). Resulting estimates of $\lambda_\Omega$ are given in plot legends in Fig. 3.

There is strong agreement between the resulting best-fit lines and the observations. In legend brackets in Fig. 3 we report the expected error $|E[\mathcal{E}]| = |\mu_s - a_\Omega s^{-\lambda_\Omega}|$ for both natural and pedestrian images averaged over $s$ (using $a_\Omega$ and $\lambda_\Omega$ estimated using natural images). For basic gradient histograms $|E[\mathcal{E}]| = 0.018$ for natural images and $|E[\mathcal{E}]| = 0.037$ for pedestrian images. Indeed, for every channel type Eq. (4) is an excellent fit to the observations $\mu_s$ for both image ensembles.

The derivation of Eq. (4) depends on the distribution of image statistics being stationary with respect to scale; that this holds for all channel types tested, and with nearly an identical constant for both pedestrian and natural images, shows that the estimate of $\lambda_\Omega$ is robust and generally applicable.

To quantify the deviation from the power law for individual images, define

$$\sigma_s = \mathrm{stdev}\left[f_\Omega(I_s^i)/f_\Omega(I_1^i)\right] = \mathrm{stdev}[\mathcal{E}], \quad (6)$$

where 'stdev' denotes the sample standard deviation (computed over $N$ images) and $\mathcal{E}$ is the error associated with each image and scaling factor as defined in Eq. (4). In Section 4.2 we confirmed that $E[\mathcal{E}] \approx 0$; our goal now is to understand how $\sigma_s = \mathrm{stdev}[\mathcal{E}] \approx \sqrt{E[\mathcal{E}^2]}$ behaves.

In Fig. 4 we plot $\sigma_s$ as a function of $s$ for the same channels as in Fig. 3. In legend brackets we report $\sigma_s$ for $s = \frac{1}{2}$ for both natural and pedestrian images; for all channels studied $\sigma_{1/2} < 0.2$. In all cases $\sigma_s$ increases gradually with increasing $s$ and the deviation is low for small $s$. The expected magnitude of $\mathcal{E}$ varies across channels; for example, histograms of normalized gradients (Fig. 4b) have lower $\sigma_s$ than their unnormalized counterparts (Fig. 4a). The trivial grayscale channel (Fig. 4d) has $\sigma_s = 0$ as the approximation is exact.

Observe that often $\sigma_s$ is greater for natural images than for pedestrian images. Many of the natural images contain relatively little structure (e.g., a patch of sky); for such images $f_\Omega(I)$ is small for certain $\Omega$ (e.g., simple gradient histograms), resulting in more variance in the ratio in Eq. (4). For HOG channels (Fig. 4f), which have additional normalization, this effect is minimized.
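The least squares estimation of $a_\Omega$ and $\lambda_\Omega$ described above is compact in code. The sketch below assumes f_omega returns the scalar channel mean $f_\Omega$ for an image and uses scipy.ndimage.zoom with order=1 as a stand-in for whatever resampling routine one actually uses.

```python
import numpy as np
from scipy.ndimage import zoom

def estimate_power_law(f_omega, images, n_octaves=3, per_octave=8):
    """Estimate lambda and a from mu_s = mean_i[f(I_s^i)/f(I^i)] at
    s = 2^(-1/8), ..., 2^(-24/8), via a least squares fit of
    log2(mu_s) = a' - lambda * log2(s), as in Section 4."""
    exps = np.arange(1, n_octaves * per_octave + 1) / per_octave
    scales = 2.0 ** -exps
    mu = [np.mean([f_omega(zoom(I, s, order=1)) / f_omega(I) for I in images])
          for s in scales]
    slope, intercept = np.polyfit(np.log2(scales), np.log2(mu), 1)
    lam, a = -slope, 2.0 ** intercept    # mu_s = a * s^(-lambda)
    return lam, a
```

A one-shot estimate in the sense described below amounts to calling this with a single image ($N = 1$).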
Fig. 4. Power law deviation for individual images. For each of the six channel types described in Fig. 3 we plot $\sigma_s$ versus $s$, where $\sigma_s = \sqrt{E[\mathcal{E}^2]}$ and $\mathcal{E}$ is the deviation from the power law for a single image as defined in Eq. (4). In brackets we report $\sigma_{1/2}$ for both natural and pedestrian images. $\sigma_s$ increases gradually as a function of $s$, meaning that not only does Eq. (4) hold for an ensemble of images, but also the deviation from the power law for individual images is low for small $s$.
Note that $\sigma_{1/2}$ increases with decreasing window size (see also the derivation of Eq. (4)).

Upsampling. The power law can predict features in higher resolution images but not upsampled images. In practice, though, we want to predict features in higher resolution as opposed to (smooth) upsampled images.

Robust estimation. In the preceding derivations, when computing $f_\Omega(I_{s_1})/f_\Omega(I_{s_2})$ we assumed that $f_\Omega(I_{s_2}) \neq 0$. For the $\Omega$'s considered this was the case after windows of near uniform intensity were excluded (see Section 3.1). Alternatively, we have found that excluding $I$ with $f_\Omega(I) \approx 0$ when estimating $\lambda_\Omega$ results in more robust estimates.

Sparse channels. For sparse channels where frequently $f_\Omega(I) \approx 0$, e.g., the output of a sliding-window object detector, $\sigma_s$ will be large. Such channels may not be good candidates for the power law approximation.

One-shot estimates. We can estimate $\lambda_\Omega$ as described in Section 4.2 using a single image in place of an ensemble ($N = 1$). Such estimates are noisy but not entirely unreasonable; e.g., on normalized gradient histograms (with $\lambda_\Omega \approx 0.101$) the mean of 4,280 single image estimates of $\lambda_\Omega$ is 0.096 and the standard deviation of the estimates is 0.073.

Scale range. We expect the power law to break down at extreme scales not typically encountered under natural viewing conditions (e.g., under high magnification).

5 FAST FEATURE PYRAMIDS

We introduce a novel, efficient scheme for computing feature pyramids. First, in Section 5.1 we outline an approach for scaling feature channels. Next, in Section 5.2 we show its application to constructing feature pyramids efficiently, and we analyze computational complexity in Section 5.3.

5.1 Feature Channel Scaling

We propose an extension of the power law governing feature scaling introduced in Section 4 that applies directly to channel images. As before, let $I_s$ denote $I$ captured at scale $s$ and $R(I, s)$ denote $I$ resampled by $s$. Suppose we have computed $C = \Omega(I)$; can we predict the channel image $C_s = \Omega(I_s)$ at a new scale $s$ using only $C$?

The standard approach is to compute $C_s = \Omega(R(I, s))$, ignoring the information contained in $C = \Omega(I)$. Instead, we propose the following approximation:

$$C_s \approx R(C, s) \cdot s^{-\lambda_\Omega}. \quad (7)$$

A visual demonstration of Eq. (7) is shown in Fig. 6.
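Eq. (7) is a one-liner in code. Below is a minimal sketch; bilinear resampling via scipy.ndimage.zoom is an implementation assumption, and lam is the channel's exponent $\lambda_\Omega$ estimated as in Section 4.

```python
from scipy.ndimage import zoom

def approximate_channel(C, s, lam):
    """Eq. (7): approximate C_s = Omega(I_s) from C = Omega(I) as
    R(C, s) * s^(-lambda), where R resamples by factor s."""
    return zoom(C, s, order=1) * s ** (-lam)

# e.g., a half-resolution normalized-gradient channel, using the
# lambda ~ 0.101 estimate quoted in Section 4:
# C_half = approximate_channel(C, 0.5, 0.101)
```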
Fig. 8. Overview of the ACF detector. Given an input image $I$, we compute several channels $C = \Omega(I)$, sum every block of pixels in $C$, and smooth the resulting lower resolution channels. Features are single pixel lookups in the aggregated channels. Boosting is used to learn decision trees over these features (pixels) to distinguish object from background. With the appropriate choice of channels and careful attention to design, ACF achieves state-of-the-art performance in pedestrian detection.
channels $C = \Omega(I)$, sum every block of pixels in $C$, and smooth the resulting lower resolution channels. Features are single pixel lookups in the aggregated channels. Boosting is used to train and combine decision trees over these features (pixels) to distinguish object from background, and a multiscale sliding-window approach is employed. With the appropriate choice of channels and careful attention to design, ACF achieves state-of-the-art performance in pedestrian detection.

Channels. ACF uses the same channels as [39]: normalized gradient magnitude, histogram of oriented gradients (six channels), and LUV color channels. Prior to computing the 10 channels, $I$ is smoothed with a [1 2 1]/4 filter. The channels are divided into 4x4 blocks and pixels in each block are summed. Finally the channels are smoothed, again with a [1 2 1]/4 filter. For 640x480 images, computing the channels runs at over 100 fps on a modern PC. The code is optimized but runs on a single CPU; further gains could be obtained using multiple cores or a GPU as in [30].

Pyramid. Computation of feature pyramids at octave-spaced scale intervals runs at 75 fps on 640x480 images. Meanwhile, computing exact feature pyramids with eight scales per octave slows to 15 fps, precluding real-time detection. In contrast, our fast pyramid construction (see Section 5) with seven of eight scales per octave approximated runs at nearly 50 fps.

Detector. For pedestrian detection, AdaBoost [69] is used to train and combine 2,048 depth-two trees over the 128x64x10/16 = 5,120 candidate features (channel pixel lookups) in each 128x64 window. Training with multiple rounds of bootstrapping takes about 10 minutes (a parallel implementation reduces this to about 3 minutes). The detector has a step size of four pixels and eight scales per octave. For 640x480 images, the complete system, including fast pyramid construction and sliding-window detection, runs at over 30 fps, allowing for real-time uses (with exact feature pyramids the detector slows to 12 fps).

Code. Code for the ACF framework is available online (see footnote 4). For more details on the channels and detector used in ACF, including exact parameter settings and training framework, we refer users to the source code.

Accuracy. We report accuracy of ACF with exact and fast feature pyramids in Table 1. Following the methodology of [31], we summarize performance using the log-average miss rate (MR) between $10^{-2}$ and $10^{0}$ false positives per image. Results are reported on four pedestrian data sets: INRIA [21], Caltech [31], TUD-Brussels [36] and ETH [37]. MRs for 16 competing methods are shown. ACF outperforms competing approaches on nearly all datasets. When averaged over the four data sets, the MR of ACF is 40 percent with exact feature pyramids and 41 percent with fast feature pyramids, a negligible difference, demonstrating the effectiveness of our approach.

TABLE 1
MRs of Leading Approaches for Pedestrian Detection on Four Data Sets
For ICF and ACF, exact and approximate detection results are shown with only small differences between them. For the latest pedestrian detection results please see [32].

Speed. MR versus speed for numerous detectors is shown in Fig. 10. ACF with fast feature pyramids runs at 32 fps. The only two faster approaches are Crosstalk cascades [40] and the VeryFast detector from Benenson et al. [30]. Their additional speedups are based on improved cascade strategies and on combining multi-resolution models with a GPU implementation, respectively, and are orthogonal to the gains achieved by using approximate multiscale features. Indeed, all the detectors that run at 5 fps and higher exploit the power law governing feature scaling.

Pyramid parameters. Detection performance on INRIA [21] with fast feature pyramids under varying settings is shown in Fig. 11. The key result is given in Fig. 11a: when approximating seven of eight scales per octave, the MR for ACF is 0.169, which is virtually identical to the MR of 0.166 obtained using exact feature pyramids.

4. Code: http://vision.ucsd.edu/~pdollar/toolbox/doc/index.html.
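To make the pyramid construction concrete, here is a minimal sketch of the scheme described above: channels are computed exactly at octave-spaced scales, and the remaining scales are approximated from the nearest exactly-computed scale via Eq. (7). The compute_channels callable, the per-channel lambdas, and the nearest-real-scale policy are assumptions standing in for the actual ACF implementation (see the released source code for the real parameters).

```python
import numpy as np
from scipy.ndimage import zoom

def fast_feature_pyramid(I, compute_channels, lams, per_octave=8, n_octaves=4):
    """Compute channels exactly once per octave; approximate the other
    scales from the nearest real scale via Eq. (7). `compute_channels`
    maps an image to a list of channel images; `lams` holds one lambda
    per channel. Aggregation and smoothing steps are omitted here."""
    scales = 2.0 ** (-np.arange(per_octave * n_octaves) / per_octave)
    real = {s: compute_channels(zoom(I, s, order=1))
            for s in scales[::per_octave]}                 # octave-spaced, exact
    pyramid = {}
    for s in scales:
        s0 = min(real, key=lambda r: abs(np.log2(r / s)))  # nearest real scale
        k = s / s0                                         # relative resampling
        pyramid[s] = [zoom(C, k, order=1) * k ** (-lam)    # Eq. (7)
                      for C, lam in zip(real[s0], lams)]
    return pyramid
```

With per_octave = 8 this approximates seven of every eight scales, mirroring the default setting reported above.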
Fig. 10. Log-average miss rate on the INRIA pedestrian data set [21] versus frame rate on 640x480 images for multiple detectors. Method runtimes were obtained from [31]; see also [31] for citations for detectors A-L. Numbers in brackets indicate MR/fps for select approaches, sorted by speed. All detectors that run at 5 fps and higher are based on our fast feature pyramids; these methods are also the most accurate. They include: (M) FPDW [39], which is our original implementation of ICF, (N) ICF [Section 6.2], (O) ACF [Section 6.1], (P) crosstalk cascades [40], and (Q) the VeryFast detector from Benenson et al. [30]. Both (P) and (Q) use the power law governing feature scaling described in this work; the additional speedups in (P) and (Q) are based on improved cascade strategies, multi-resolution models and a GPU implementation, and are orthogonal to the gains achieved by using approximate multiscale features.
Fig. 11. Effect of parameter settings of fast feature pyramids on the ACF detector [Section 6.1]. We report log-average miss rate averaged over 25 trials on the INRIA pedestrian data set [21]. Orange diamonds denote default parameter settings: 7/8 scales approximated per octave, $\lambda = 0.17$ for the normalized gradient channels, and eight scales per octave in the pyramid. (a) The MR stays relatively constant as the fraction of approximated scales increases up to 7/8, demonstrating the efficacy of the proposed approach. (b) Sub-optimal values of $\lambda$ when approximating the normalized gradient channels cause a marked decrease in performance. (c) At least eight scales per octave are necessary for good performance, making the proposed scheme crucial for achieving detection results that are both fast and accurate.
models, respectively. Our work is complementary as we focus on improving the speed of pyramid construction. The current bottleneck of DPMs is in the classification stage; therefore, pyramid construction accounts for only a fraction of total runtime. However, if fast feature pyramids are coupled with optimized classification schemes [44], [45], DPMs have the potential to have more competitive runtimes. We focus on demonstrating that DPMs can achieve good accuracy with fast feature pyramids and leave the coupling of fast feature pyramids and optimized classification schemes to practitioners.

DPM code is available online [35]. We tested pre-trained DPM models on the 20 PASCAL 2007 categories using exact HOG pyramids and HOG pyramids with nine of 10 scales per octave approximated using our proposed approach. Average precision (AP) scores for the two approaches, denoted DPM and DPM*, respectively, are shown in Table 2. The mean AP across the 20 categories is 26.6 percent for DPMs and 24.5 percent for DPM*s. Using fast HOG feature pyramids only decreased mean AP about 2 percent, demonstrating the validity of the proposed approach.

7 CONCLUSION

Improvements in the performance of visual recognition systems in the past decade have in part come from the realization that finely sampled pyramids of image features provide a good front-end for image analysis. It is widely believed that the price to be paid for improved performance is sharply increased computational costs. We have shown that this is not necessarily so. Finely sampled pyramids may be obtained inexpensively by extrapolation from coarsely sampled ones. This insight decreases computational costs substantially.

Our insight ultimately relies on the fractal structure of much of the visual world. By investigating the statistics of natural images we have demonstrated that the behavior of image features can be predicted reliably across scales. Our calculations and experiments show that this makes it possible to estimate features at a given scale inexpensively by extrapolating computations carried out at a coarsely sampled set of scales. While our results do not hold under all circumstances, for instance, on images of textures or white noise, they do hold for images typically encountered in the natural world.

In order to validate our findings we studied the performance of three end-to-end object detection systems. We found that detection rates are relatively unaffected while computational costs decrease considerably. This has led to the first detectors that operate at frame rate while using rich feature representations.

Our results are not restricted to object detection nor to visual recognition. The foundations we have developed should readily apply to other computer vision tasks where a fine-grained scale sampling of features is necessary as the image processing front end.
Fig. 12. Effect of parameter settings of fast feature pyramids on the ICF detector [Section 6.2]. The plots mirror the results shown in Fig. 11 for the ACF detector, although overall performance for ICF is slightly lower. (a) When approximating seven of every eight scales in the pyramid, the MR for ICF is 0.195, which is only slightly worse than the MR of 0.176 obtained using exact feature pyramids. (b) Computing approximate channels with an incorrect value of $\lambda$ results in decreased performance (although using a slightly larger $\lambda$ than predicted appears to improve results marginally). (c) Similarly to the ACF framework, at least eight scales per octave are necessary to achieve good results.
[41] T. Lindeberg, "Scale-Space for Discrete Signals," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, no. 3, pp. 234-254, Mar. 1990.
[42] J.L. Crowley, O. Riff, and J.H. Piater, "Fast Computation of Characteristic Scale Using a Half-Octave Pyramid," Proc. Fourth Int'l Conf. Scale-Space Theories in Computer Vision, 2002.
[43] R.S. Eaton, M.R. Stevens, J.C. McBride, G.T. Foil, and M.S. Snorrason, "A Systems View of Scale Space," Proc. IEEE Int'l Conf. Computer Vision Systems (ICVS), 2006.
[44] P. Felzenszwalb, R. Girshick, and D. McAllester, "Cascade Object Detection with Deformable Part Models," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[45] M. Pedersoli, A. Vedaldi, and J. Gonzalez, "A Coarse-to-Fine Approach for Fast Deformable Object Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011.
[46] C.H. Lampert, M.B. Blaschko, and T. Hofmann, "Efficient Subwindow Search: A Branch and Bound Framework for Object Localization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2129-2142, Dec. 2009.
[47] L. Bourdev and J. Brandt, "Robust Object Detection via Soft Cascade," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[48] C. Zhang and P. Viola, "Multiple-Instance Pruning for Learning Efficient Cascade Detectors," Proc. Advances in Neural Information Processing Systems (NIPS), 2007.
[49] J. Sochman and J. Matas, "Waldboost—Learning for Time Constrained Sequential Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[50] H. Masnadi-Shirazi and N. Vasconcelos, "High Detection-Rate Cascades for Real-Time Object Detection," Proc. IEEE 11th Int'l Conf. Computer Vision (ICCV), 2007.
[51] F. Fleuret and D. Geman, "Coarse-To-Fine Face Detection," Int'l J. Computer Vision, vol. 41, no. 1/2, pp. 85-107, 2001.
[52] P. Felzenszwalb and D. Huttenlocher, "Efficient Matching of Pictorial Structures," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2000.
[53] C. Papageorgiou and T. Poggio, "A Trainable System for Object Detection," Int'l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[54] M. Weber, M. Welling, and P. Perona, "Unsupervised Learning of Models for Recognition," Proc. European Conf. Computer Vision (ECCV), 2000.
[55] S. Agarwal and D. Roth, "Learning a Sparse Representation for Object Detection," Proc. European Conf. Computer Vision (ECCV), 2002.
[56] R. Fergus, P. Perona, and A. Zisserman, "Object Class Recognition by Unsupervised Scale-Invariant Learning," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2003.
[57] B. Leibe, A. Leonardis, and B. Schiele, "Robust Object Detection with Interleaved Categorization and Segmentation," Int'l J. Computer Vision, vol. 77, no. 1-3, pp. 259-289, May 2008.
[58] C. Gu, J.J. Lim, P. Arbelaez, and J. Malik, "Recognition Using Regions," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2009.
[59] B. Alexe, T. Deselaers, and V. Ferrari, "What Is an Object?" Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[60] C. Wojek, G. Dorkó, A. Schulz, and B. Schiele, "Sliding-Windows for Rapid Object Class Localization: A Parallel Technique," Proc. 30th DAGM Symp. Pattern Recognition, 2008.
[61] L. Zhang and R. Nevatia, "Efficient Scan-Window Based Object Detection Using GPGPU," Proc. Workshop Visual Computer Vision on GPU's (CVGPU), 2008.
[62] B. Bilgic, "Fast Human Detection with Cascaded Ensembles," master's thesis, MIT, Feb. 2010.
[63] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng, "Fast Human Detection Using a Cascade of Histograms of Oriented Gradients," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2006.
[64] F.M. Porikli, "Integral Histogram: A Fast Way to Extract Histograms in Cartesian Spaces," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2005.
[65] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, "Robust Multi-Person Tracking from a Mobile Platform," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1831-1846, Oct. 2009.
[66] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L.H. Matthies, "A Fast Stereo-Based System for Detecting and Tracking Pedestrians from a Moving Vehicle," Int'l J. Robotics Research, vol. 28, pp. 1466-1485, 2009.
[67] D.L. Ruderman, "The Statistics of Natural Images," Network: Computation in Neural Systems, vol. 5, no. 4, pp. 517-548, 1994.
[68] S.G. Ghurye, "A Characterization of the Exponential Function," The Am. Math. Monthly, vol. 64, no. 4, pp. 255-257, 1957.
[69] J. Friedman, T. Hastie, and R. Tibshirani, "Additive Logistic Regression: A Statistical View of Boosting," Annals of Statistics, vol. 38, no. 2, pp. 337-374, 2000.
[70] P. Sabzmeydani and G. Mori, "Detecting Pedestrians by Learning Shapelet Features," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2007.
[71] Z. Lin and L.S. Davis, "A Pose-Invariant Descriptor for Human Detection and Segmentation," Proc. 10th European Conf. Computer Vision (ECCV), 2008.
[72] S. Maji, A. Berg, and J. Malik, "Classification Using Intersection Kernel SVMs Is Efficient," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
[73] X. Wang, T.X. Han, and S. Yan, "An HOG-LBP Human Detector with Partial Occlusion Handling," Proc. IEEE Int'l Conf. Computer Vision (ICCV), 2009.
[74] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. 30th DAGM Symp. Pattern Recognition (DAGM), 2008.
[75] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, "Human Detection Using Partial Least Squares Analysis," Proc. IEEE 12th Int'l Conf. Computer Vision (ICCV), 2009.
[76] S. Walk, N. Majer, K. Schindler, and B. Schiele, "New Features and Insights for Pedestrian Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010.
[77] D. Park, D. Ramanan, and C. Fowlkes, "Multiresolution Models for Object Detection," Proc. 11th European Conf. Computer Vision (ECCV), 2010.

Piotr Dollár received the master's degree in computer science from Harvard University in 2002 and the PhD degree from the University of California, San Diego in 2007. He joined the Computational Vision Lab at the California Institute of Technology, Caltech as a postdoctoral fellow in 2007. Upon being promoted to a senior postdoctoral fellow he realized it was time to move on, and in 2011, he joined the Interactive Visual Media Group at Microsoft Research, Redmond, Washington, where he currently resides. He has worked on object detection, pose estimation, boundary learning, and behavior recognition. His general interests include machine learning and pattern recognition, and their application to computer vision.

Ron Appel received the bachelor's and master's degrees in electrical and computer engineering from the University of Toronto in 2006 and 2008, respectively. He is working toward the PhD degree in the Computational Vision Lab at the California Institute of Technology, Caltech, where he currently holds an NSERC Graduate Award. He cofounded ViewGenie Inc., a company specializing in intelligent image processing and search. His research interests include machine learning, visual object detection, and algorithmic optimization.
Serge Belongie received the BS (with honor) in electrical engineering from the California Institute of Technology, Caltech in 1995 and the PhD degree in electrical engineering and computer science from the University of California at Berkeley in 2000. While at the University of California at Berkeley, his research was supported by the US National Science Foundation (NSF) Graduate Research Fellowship. From 2001 to 2013, he was a professor in the Department of Computer Science and Engineering at the University of California, San Diego (UCSD). He is currently a professor at Cornell NYC Tech and the Cornell Computer Science Department, Ithaca, New York. His research interests include computer vision, machine learning, crowdsourcing, and human-in-the-loop computing. He is also a cofounder of several companies including Digital Persona, Anchovi Labs (acquired by Dropbox) and Orpix. He is a recipient of the US National Science Foundation (NSF) CAREER Award, the Alfred P. Sloan Research Fellowship, and the MIT Technology Review "Innovators Under 35" Award.

Pietro Perona received the graduate degree in electrical engineering from the Università di Padova in 1985 and the PhD degree in electrical engineering and computer science from the University of California at Berkeley in 1990. After a postdoctoral fellowship at MIT in 1990-1991 he joined the faculty of the California Institute of Technology, Caltech in 1991, where he is now an Allen E. Puckett professor of electrical engineering and computation and neural systems. His current interests include visual recognition, modeling vision in biological systems, modeling and measuring behavior, and Visipedia. He has worked on anisotropic diffusion, multiresolution-multiorientation filtering, human texture perception and segmentation, dynamic vision, grouping, analysis of human motion, recognition of object categories, and modeling visual search.