0% found this document useful (0 votes)

10 views

Fast Feature Pyramids For Object Detection

论文

Uploaded by

1070038825

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

10 views

Fast Feature Pyramids For Object Detection

论文

Uploaded by

1070038825

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

1532 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO.

8, AUGUST 2014

Fast Feature Pyramids for Object Detection

r, Ron Appel, Serge Belongie, and Pietro Perona
Piotr Dolla

Abstract—Multi-resolution image features may be approximated via extrapolation from nearby scales, rather than being computed
explicitly. This fundamental insight allows us to design object detection algorithms that are as accurate, and considerably faster, than
the state-of-the-art. The computational bottleneck of many modern detectors is the computation of features at every scale of a finely-
sampled image pyramid. Our key insight is that one may compute finely sampled feature pyramids at a fraction of the cost, without
sacrificing performance: for a broad family of features we find that features computed at octave-spaced scale intervals are sufficient to
approximate features on a finely-sampled pyramid. Extrapolation is inexpensive as compared to direct feature computation. As a result,
our approximation yields considerable speedups with negligible loss in detection accuracy. We modify three diverse visual recognition
systems to use fast feature pyramids and show results on both pedestrian detection (measured on the Caltech, INRIA, TUD-Brussels
and ETH data sets) and general object detection (measured on the PASCAL VOC). The approach is general and is widely applicable to
vision algorithms requiring fine-grained multi-scale analysis. Our approximation is valid for images with broad spectra (most natural
images) and fails for images with narrow band-pass spectra (e.g., periodic textures).

Index Terms—Visual features, object detection, image pyramids, pedestrian detection, natural image statistics, real-time systems

1 INTRODUCTION

M ULTI-RESOLUTION multi-orientation decompositions

are one of the foundational techniques of image anal-
ysis. The idea of analyzing image structure separately at
with respect to changes in viewpoint, lighting, and image
deformations is a contributing factor to their superior
performance.
every scale and orientation originated from a number of To understand the value of richer representations, it is
sources: measurements of the physiology of mammalian instructive to examine the reasons behind the breathtaking
visual systems [1], [2], [3], principled reasoning about the progress in visual category detection during the past ten
statistics and coding of visual information [4], [5], [6], [7] years. Take, for instance, pedestrian detection. Since the
(Gabors, DOGs, and jets), harmonic analysis [8], [9] (wave- groundbreaking work of Viola and Jones (VJ) [27], [28], false
lets), and signal processing [9], [10] (multirate filtering). positive rates have decreased two orders of magnitude. At
Such representations have proven effective for visual proc- 80 percent detection rate on the INRIA pedestrian data set
essing tasks such as denoising [11], image enhancement [21], VJ outputs over 10 false positives per image (FPPI),
[12], texture analysis [13], stereoscopic correspondence [14], HOG [21] outputs 1 FPPI, and more recent methods [29],
motion flow [15], [16], attention [17], boundary detection [30] output well under 0.1 FPPI (data from [31], [32]). In
[18] and recognition [19], [20], [21]. comparing the different detection schemes one notices the
It has become clear that such representations are best at representations at the front end are progressively enriched
extracting visual information when they are overcomplete, (e.g., more channels, finer scale sampling, enhanced normal-
i.e., when one oversamples scale, orientation and other ker- ization schemes); this has helped fuel the dramatic improve-
nel properties. This was suggested by the architecture of ments in detection accuracy witnessed over the course of
the primate visual system [22], where striate cortical cells the last decade.
(roughly equivalent to a wavelet expansion of an image) Unfortunately, improved detection accuracy has been
outnumber retinal ganglion cells (a representation close to accompanied by increased computational costs. The VJ
image pixels) by a factor ranging from 102 to 103 . Empirical detector ran at 15 frames per second (fps) over a decade
studies in computer vision provide increasing evidence in ago, on the other hand, most recent detectors require multi-
favor of overcomplete representations [21], [23], [24], [25], ple seconds to process a single image as they compute richer
[26]. Most likely the robustness of these representations image representations [31]. This has practical importance:
in many applications of visual recognition, such as robotics,
human computer interaction, automotive safety, and mobile
P. Dollar is with the Interactive Visual Media Group at Microsoft devices, fast detection rates and low computational require-
Research, One Microsoft Way, Redmond, WA 98052.
E-mail: [email protected]. ments are of the essence.
R. Appel and P. Perona are with the Department of Electrical Engineering, Thus, while increasing the redundancy of the representa-
California Institute of Technology, Pasadena, CA. tion offers improved detection and false-alarm rates, it is
E-mail: {appel, perona}@caltech.edu.
S. Belongie is with Cornell NYC Tech and the Cornell Computer
paid for by increased computational costs. Is this a neces-
Science Department. sary tradeoff? In this work we offer the hoped-for but sur-
Manuscript received 14 Feb. 2013; revised 15 Dec. 2013; accepted 8 Jan. 2014. prising answer: no.
Date of publication 15 Jan. 2014; date of current version 10 July 2014. We demonstrate how to compute richer representations
Recommended for acceptance by D. Forsyth. without paying a large computational price. How is this
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference the Digital Object Identifier below.
possible? The key insight is that natural images have fractal
Digital Object Identifier no. 10.1109/TPAMI.2014.2300479 statistics [7], [33], [34] that we can exploit to reliably predict
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
0162-8828 ß 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1533

image structure across scales. Our analysis and experiments [48], [49], [50], coarse-to-fine search [51], distance trans-
show that this makes it possible to inexpensively estimate forms [52], etc., all focus on optimizing classification speed
features at a dense set of scales by extrapolating computa- given precomputed image features. Our work focuses on
tions carried out expensively, but infrequently, at a coarsely fast feature pyramid construction and is thus complemen-
sampled set of scales. tary to such approaches.
Our insight leads to considerably decreased run-times An effective framework for object detection is the sliding
for state-of-the-art object detectors that rely on rich repre- window paradigm [27], [53]. Top performing methods on
sentations, including histograms of gradients [21], with neg- pedestrian detection [31] and the PASCAL VOC [38] are
ligible impact on their detection rates. We demonstrate the based on sliding windows over multiscale feature pyramids
effectiveness of our proposed fast feature pyramids with [21], [29], [35]; fast feature pyramids are well suited for such
three distinct detection frameworks including integral chan- sliding window detectors. Alternative detection paradigms
nel features (ICF) [29], aggregate channel features (a novel have been proposed [54], [55], [56], [57], [58], [59]. Although
variant of integral channel features), and deformable part a full review is outside the scope of this work, the approxi-
models (DPM) [35]. We show results for both pedestrian mations we propose could potentially be applicable to such
detection (measured on the Caltech [31], INRIA [21], TUD- schemes as well.
Brussels [36] and ETH [37] data sets) and general object As mentioned, a number of state-of-the-art detectors
detection (measured on the PASCAL VOC [38]). Demon- have recently been introduced that exploit our fast feature
strated speedups are significant and impact on accuracy is pyramid construction to operate at frame rate including
relatively minor. [40] and [30]. Alternatively, parallel implementation using
Building on our work on fast feature pyramids (first pre- GPUs [60], [61], [62] can achieve fast detection while using
sented in [39]), a number of systems show state-of-the-art rich representations but at the cost of added complexity
accuracy while running at frame rate on 640 480 images. and hardware requirements. Zhu et al. [63] proposed fast
Aggregate channel features, described in this paper, operate computation of gradient histograms using integral histo-
at over 30 fps while achieving top results on pedestrian grams [64]; the proposed system was real time for single-
detection. Crosstalk cascades [40] use fast feature pyramids scale detection only. In scenarios such as automotive appli-
and couple detector evaluations of nearby windows to cations, real time systems have also been demonstrated
achieve speeds of 35-65 fps. Benenson et al. [30] imple- [65], [66]. The insights outlined in this paper allow for real
mented fast feature pyramids on a GPU, and with addi- time multiscale detection in general, unconstrained settings.
tional innovations achieved detection rates of over 100 fps.
In this work we examine and analyze feature scaling and its 3 MULTISCALE GRADIENT HISTOGRAMS
effect on object detection in far more detail than in our pre-
vious work [39]. We begin by exploring a simple question: given image gra-
The remainder of this paper is organized as follows. dients computed at one scale, is it possible to approximate gradient
We review related work in Section 2. In Section 3 we histograms at a nearby scale solely from the computed gradients?
show that it is possible to create high fidelity approxima- If so, then we can avoid computing gradients over a finely
tions of multiscale gradient histograms using gradients sampled image pyramid. Intuitively, one would expect this
computed at a single scale. In Section 4 we generalize this to be possible, as significant image structure is preserved
finding to a broad family of feature types. We describe when an image is resampled. We begin with an in-depth
our efficient scheme for computing finely sampled feature look at a simple form of gradient histograms and develop a
pyramids in Section 5. In Section 6 we show applications more general theory in Section 4.
of fast feature pyramids to object detection, resulting in A gradient histogram measures the distribution of the
considerable speedups with minor loss in accuracy. We gradient angles within an image. Let Iðx; yÞ denote an
conclude in Section 7. m n discrete signal, and @I=@x and @I=@y denote the dis-
crete derivatives of I (typically 1D centered first differen-
ces are used). Gradient magnitude and orientation are
@I
2 RELATED WORK defined @I by: Mði; jÞ2 ¼ @x ði; jÞ2 þ @I
@y ði; jÞ
2
and Oði; jÞ ¼
@I
Significant research has been devoted to scale space theory arctan @y ði; jÞ= @x ði; jÞ . To compute the gradient histo-
[41], including real time implementations of octave and gram of an image, each pixel casts a vote, weighted by its
half-octave image pyramids [42], [43]. Sparse image pyra- gradient magnitude, for the bin corresponding to its gra-
mids often suffice for certain approximations, e.g., [42] dient orientation. After the orientation O is quantized into
shows how to recover a disk’s characteristic scale using Q bins so that Oði; jÞ P2 f1; Qg, the qth bin of the histogram
half-octave pyramids. Although only loosely related, these is defined by: hq ¼ i;j Mði; jÞ1½Oði; jÞ ¼ q, where 1 is the
ideas provide the intuition that finely sampled feature pyra- indicator function. In the following everything that holds
mids can perhaps be approximated. for global histograms also applies to local histograms
Fast object detection has been of considerable interest in (defined identically except for the range of the indices i
the community. Notable recent efforts for increasing detec- and j).
tion speed include work by Felzenszwalb et al. [44] and
Pedersoli et al. [45] on cascaded and coarse-to-fine deform- 3.1 Gradient Histograms in Upsampled Images
able part models, respectively, Lampert et al.’s [46] applica- Intuitively the information content of an upsampled image
tion of branch and bound search for detection, and Doll ar is similar to that of the original, lower-resolution image
et al.’s work on crosstalk cascades [40]. Cascades [27], [47], (upsampling does not create new structure). Assume I is a
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1534 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

Fig. 1. Behavior of gradient histograms in images resampled by a factor of two. (a) Upsampling gradients. Given images I and I 0 where I 0
denotes I upsampled by two, and corresponding gradient magnitude images M and M 0 , the ratio SM=SM 0 should be approximately 2. The
middle/bottom panels show the distribution of this ratio for gradients at fixed orientation over pedestrian/natural images. In both cases the
mean m 2, as expected, and the variance is relatively small. (b) Downsampling gradients. Given images I and I 0 where I 0 denotes I down-
sampled by two, the ratio SM=SM 0 0:34, not 0:5 as might be expected from (a) as downsampling results in loss of high frequency content.
(c) Downsampling normalized gradients. Given normalized gradient magnitude images M e and M e M
e 0 , the ratio SM=S e 0 0:27. Instead of trying
to derive analytical expressions governing the scaling properties of various feature types under different resampling factors, in Section 4 we
describe a general law governing feature scaling.

continuous signal, and let I 0 denote I upsampled by a factor pedestrian approximately 96 pixels tall. The second image
of k: I 0 ðx; yÞ Iðx=k; y=kÞ. Using the definition of a deriva- set contains 128 64 windows cropped at random positions
0
tive, one can show that @I 1 @I
@x ði; jÞ ¼ k @x ði=k; j=kÞ, and likewise from the 1;218 images in the INRIA negative training set.
@I 0
for @y , which simply states the intuitive fact that the rate of We sample 5,000 windows but exclude nearly uniform win-
change in the upsampled image is k times slower the rate of dows, i.e., those with average gradient magnitude under
change in the original image. While not exact, the above 0:01, resulting in 4,280 images. We refer to the two sets as
also holds approximately for interpolated discrete signals. ‘pedestrian images’ and ‘natural images,’ although the latter
Let M 0 ði; jÞ 1k Mðdi=ke; dj=keÞ denote the gradient magni- is biased toward scenes that may (but do not) contain
tude in an upsampled discrete image. Then: pedestrians.
In order to measure the fidelity of this approximation, we
kn X
X km kn X
X km
1 define the ratio rq ¼ h0q =hq and quantize orientation into
M 0 ði; jÞ Mðdi=ke; dj=keÞ Q ¼ 6 bins. Fig. 1a shows the distribution of rq for one bin
i¼1 j¼1 i¼1 j¼1
k
(1) on the 1;237 pedestrian and 4; 280 natural images given an
X
n X
m
1 n X
X m
upsampling of k ¼ 2 (results for other bins were similar). In
¼k 2
Mði; jÞ ¼ k Mði; jÞ:
i¼1 j¼1
k i¼1 j¼1
both cases the mean is m 2, as expected, and the variance
is relatively small, meaning the approximation is unbiased
and reasonable.
Thus, the sum of gradient magnitudes in the original and
Thus, although individual gradients may change, gradi-
upsampled image should be related by about a factor of k.
0 ent histograms in an upsampled and original image will be
Angles should also be mostly preserved since @I @x ði; jÞ
@I 0 @I @I related by a multiplicative constant roughly equal to the
@y ði; jÞ @x ði=k; j=kÞ @y ði=k; j=kÞ. Therefore, according to scale change between them. We examine gradient histo-
the definition of gradient histograms, we expect the rela-
grams in downsampled images next.
tionship between hq (computed over I) and h0q (computed
over I 0 ) to be: h0q khq . This allows us to approximate gra-
dient histograms in an upsampled image using gradients 3.2 Gradient Histograms in Downsampled Images
computed at the original scale. While the information content of an upsampled image is
Experiments. One may verify experimentally that in roughly the same as that of the original image, information
images of natural scenes, upsampled using bilinear interpo- is typically lost during downsampling. However, we find
lation, the approximation h0q khq is reasonable. We use that the information loss is consistent and the resulting
two sets of images for these experiments, one class specific approximation takes on a similarly simple form.
and one class independent. First, we use the 1;237 cropped If I contains little high frequency energy, then the
pedestrian images from the INRIA pedestrians training approximation h0q khq derived in Section 3.1 should apply.
data set [21]. Each image is 128 64 and contains a In general, however, downsampling results in loss of high
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1535

Fig. 2. Approximating gradient histograms in images resampled by a factor of two. For each image set, we take the original image (green border) and
generate an upsampled (blue) and downsampled (orange) version. At each scale we compute a gradient histogram with eight bins, multiplying each
bin by 0:5 and 1=0:34 in the upsampled and downsampled histogram, respectively. Assuming the approximations from Section 3 hold, the three nor-
malized gradient histograms should be roughly equal (the blue, green, and orange bars should have the same height at each orientation). For the first
four cases, the approximations are fairly accurate. In the last two cases, showing highly structured Brodatz textures with significant high frequency
content, the downsampling approximation fails. The first four images are representative, the last two are carefully selected to demonstrate images
with atypical statistics.

frequency content which can lead to measured gradients the context of object detection (see Section 6). Observe that
undershooting the extrapolated gradients. Let I 0 now we have now introduced an additional nonlinearity to the
denote I downsampled by a factor of k. We expect that hq gradient computation; do the previous results for gradient
(computed over I) and h0q (computed over I 0 ) will satisfy histograms still hold if we use M e instead of M?
h0q hq =k. The question we seek to answer here is whether In Fig. 1c we plot the distribution of rq ¼ h0q =hq for histo-
the information loss is consistent. grams of normalized gradients given a downsampling fac-
Experiments. As before, define rq ¼ h0q =hq . In Fig. 1b we tor of k ¼ 2. As with the original gradient histograms, the
show the distribution of rq for a single bin on the pedes- distributions of rq are normally distributed and have similar
trian and natural images given a downsampling factor of means for pedestrian and natural images (m 0:26 and
k ¼ 2. Observe that the information loss is consistent: rq is m 0:27, respectively). Observe, however, that the expected
normally distributed around m 0:34 < 0:5 for natural value of rq for normalized gradient histograms is quite dif-
images (and similarly m 0:33 for pedestrians). This ferent than for the original histograms (Fig. 1b).
implies that h0q mhq could serve as a reasonable approxi- Deriving analytical expressions governing the scaling
mation for gradient histograms in images downsampled properties of progressively more complex feature types
by k ¼ 2. would be difficult or even impossible. Instead, in Section 4
In other words, similarly to upsampling, gradient histo- we describe a general law governing feature scaling.
grams computed over original and half resolution images
tend to differ by a multiplicative constant (although the con-
4 STATISTICS OF MULTISCALE FEATURES
stant is not the inverse of the sampling factor). In Fig. 2 we
show the quality of the above approximations on example To understand more generally how features behave in
images. The agreement between predictions and observa- resampled images, we turn to the study of natural image
tions is accurate for typical images (but fails for images with statistics [7], [33]. The analysis below provides a deep
atypical statistics). understanding of the behavior of multiscale features. The
practical result is a simple yet powerful approach for pre-
dicting the behavior of gradients and other low-level fea-
3.3 Histograms of Normalized Gradients tures in resampled images without resorting to analytical
Suppose we replaced the gradient magnitude M by the nor- derivations that may be difficult except under the simplest
malized gradient magnitude M e defined as Mði;e jÞ ¼ conditions.
Mði; jÞ=ðMði; jÞ þ 0:005Þ, where M is the average gradient We begin by defining a broad family of features. Let V be
magnitude in each 11 11 image patch (computed by any low-level shift invariant function that takes an image I
convolving M with an L1 normalized 11 11 triangle filter). and creates a new channel image C ¼ VðIÞ where a channel
Using the normalized gradient M e gives improved results in C is a per-pixel feature map such that output pixels in C are
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1536 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

computed from corresponding patches of input pixels in I We observe that a single image can itself be considered
(thus preserving overall image layout). C may be down- an ensemble of image patches (smaller images). Since V is
sampled relative to I and may contain multiple layers k. We shift invariant, we can interpret fV ðIÞ as computing the
define a feature fVP ðIÞ as a weighted sum of the channel average of fV ðI k Þ over every patch I k of I and therefore
C ¼ VðIÞ: fV ðIÞ ¼ ijk wijk Cði; j; kÞ. Numerous local and Eq. (3) can be applied directly for a single image. We formal-
global features can be written in this form including gradi- ize this below.
ent histograms, linear filters, color statistics, and others [29]. We can decompose an image I into K smaller images
Any such low-level shift invariant V can be used, making I 1 . . . I K such that I ¼ ½I 1 I K . Given that V must be shift
this representation quite general. invariant and ignoring boundary effects gives VðIÞ ¼
Let Is denote I at scale s, where the dimensions hs ws Vð½I 1 I K Þ ½VðI 1 Þ VðI K Þ, and substituting into
of Is are s times the dimensions of I. For s < 1, Is (which Eq. (2) yields fV ðIÞ SfV ðI k Þ=K. However, we can con-
denotes a higher resolution version of I) typically differs sider I 1 I K as a (small) image ensemble, and fV ðIÞ
from I upsampled by s, while for s < 1 an excellent E½fV ðI k Þ an expectation over that ensemble. Therefore,
approximation of Is can be obtained by downsampling I. substituting fV ðIs1 Þ E½fV ðIsk1 Þ and fV ðIs2 Þ E½fV ðIsk2 Þ
Next, for simplicity we redefine fV ðIs Þ as1 into Eq. (3) yields:

1 X fV ðIs1 Þ=fV ðIs2 Þ ¼ ðs1 =s2 Þ V

þE;
fV ðIs Þ Cs ði; j; kÞ where Cs ¼ VðIs Þ: (2) (4)
hs ws k ijk
where we use E to denote the deviation from the power law
In other words fV ðIs Þ denotes the global mean of Cs com- for a given image. Each channel type V has its own corre-
puted over locations ij and layers k. Everything in the fol- sponding V , which we can determine empirically.
lowing derivations based on global means also holds for In Section 4.2 we show that on average Eq. (4) provides a
local means (e.g., local histograms). remarkably good fit for multiple channel types and image
Our goal is to understand how fV ðIs Þ behaves as a func- sets (i.e., we can fit V such that E½E 0). Additionally,
tion of s for any choice of shift invariant V. experiments in Section 4.3 indicate that the magnitude of
deviation for individual images, E½E 2 , is reasonable and
4.1 Power Law Governs Feature Scaling increases only gradually as a function of s1 =s2 .
Ruderman and Bialek [33], [67] explored how the statistics
of natural images behave as a function of the scale at which
4.2 Estimating
an image ensemble was captured, i.e., the visual angle cor- We perform a series of experiments to verify Eq. (4) and
responding to a single pixel. Let fðIÞ denote an arbitrary estimate V for numerous channel types V.
(scalar) image statistic and E½ denote expectation over an To estimate V for a given V, we first compute:
ensemble of natural images. Ruderman and Bialek made
the fundamental discovery that the ratio of E½fðIs1 Þ to 1X N
ms ¼ fV Isi =fV I1i ; (5)
E½fðIs2 Þ, computed over two ensembles of natural images N i¼1
captured at scales s1 and s2 , respectively, depends only on
the ratio of s1 =s2 and is independent of the absolute scales for N images I i and multiple values of s < 1, where Isi is
s1 and s2 of the ensembles. obtained by downsampling I1i ¼ I i . We use two image
Ruderman and Bialek’s findings imply that E½fðIs Þ fol- ensembles, one of N ¼ 1; 237 pedestrian images and one of
lows a power law2: N ¼ 4; 280 natural images (for details see Section 3.1).
According to Eq. (4), ms ¼ s V þ E½E. Our goal is to fit V
f accordingly and verify the fidelity of Eq. (4) for various
E½fðIs1 Þ=E½fðIs2 Þ ¼ ðs1 =s2 Þ : (3)
channel types V (i.e., verify that E½E 0).
Every statistic f will have its own corresponding f . In the For each V, we measure ms according to Eq. (5) across
context of our work, for any channel type V we can use the three octaves with eight scales per octave for a total of 24
scalar fV ðIÞ in place of fðIÞ and V in place of f . While
1 24
measurements at s ¼ 2 8 ; . . . ; 2 8 . Since image dimensions
Eq. (3) gives the behavior of fV w.r.t. to scale over an are rounded to the nearest integer, we compute and use
ensemble of images, we are interested in the behavior of fV pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
s0 ¼ hs ws =hw, where h w and hs ws are the dimensions
for a single image.
of the original and downsampled images, respectively.
In Fig. 3 we plot ms versus s0 using a log-log plot for six
1. The definition of fV ðIs Þ in Eq. (2) differs from our previous defini-
tion in [39], where fðI; sÞ denoted the channel sum after resampling by channel types for both the pedestrian and natural images.3
2s . The new definition and notation allow for a cleaner derivation, and In all cases ms follows a power law with all measurements
the exponential scaling law becomes a more intuitive power law. falling along a line on the log-log plots, as predicted. How-
2. Let F ðsÞ ¼ E½fðIs Þ. We can rewrite the observation by saying
there exists a function R such that F ðs1 Þ=F ðs2 Þ ¼ Rðs1 =s2 Þ. Applying ever, close inspection shows ms does not start exactly at 1 as
repeatedly gives F ðs1 Þ=F ð1Þ ¼ Rðs1 Þ, F ð1Þ=F ðs2 Þ ¼ Rð1=s2 Þ, and expected: downsampling introduces a minor amount of
F ðs1 Þ=F ðs2 Þ ¼ Rðs1 =s2 Þ. Therefore Rðs1 =s2 Þ ¼ Rðs1 ÞRð1=s2 Þ. Next, let blur even for small downsampling factors. We thus expect
R0 ðsÞ ¼ Rðes Þ and observe that R0 ðs1 þ s2 Þ ¼ R0 ðs1 ÞR0 ðs2 Þ since
Rðs1 s2 Þ ¼ Rðs1 ÞRðs2 Þ. If R0 is also continuous and non-zero, then it
must take the form R0 ðsÞ ¼ e s for some constant [68]. This implies 3. Fig. 3 generalizes the results shown in Fig. 1. However, by switch-
RðsÞ ¼ R0 ðlnðsÞÞ ¼ e lnðsÞ ¼ s . Therefore, E½fðIs Þ must follow a ing from channel sums to channel means, m1=2 in Figs. 3a and 3b is 4
power law (see also Eq. (9) in [67]). larger than m in Figs. 1b and 1c, respectively.
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1537

P 1 24
Fig. 3. Power law feature scaling. For each of six channel types we plot ms ¼ N1 fV ðIsi Þ=fV ðI1i Þ for s 2 8 ; . . . ; 2 8 on a log-log plot for both pedes-
trian and natural image ensembles. Plots of fV ðIs1 Þ=fV ðIs2 Þ for 20 randomly selected pedestrian images are shown as faint gray lines. Additionally
the best-fit line to ms for the natural images is shown. The resulting and expected error jE½Ej are given in the plot legends. In all cases the ms follow
a power law as predicted by Eq. (4) and are nearly identical for both pedestrian and natural images, showing the estimate of is robust and generally
applicable. The tested channels are: (a) histograms of gradients described in Section 3; (b) histograms of normalized gradients described in Section
3.3; (c) a difference of gaussian (DoG) filter (with inner and outer s of 0:71 and 1:14, respectively); (d) grayscale images (with ¼ 0 as expected); (e)
pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
pixel standard deviation computed over local 5 5 neighborhoods Cði;jÞ¼ E½Iði;jÞ2 E½Iði;jÞ; (f) HOG [21] with 4 4 spatial bins (results were averaged
over HOG’s 36 channels). Code for generating such plots is available (see chnsScaling.m in Piotr’s Toolbox).

ms to have the form ms ¼ aV s V , with aV 6¼ 1 as an artifact s s ¼ stdev½fV ðIsi Þ=fV ðI1i Þ ¼ stdev½E; (6)
of the interpolation. Note that aV is only necessary for esti-
mating V from downsampled images and is not used sub- where ‘stdev’ denotes the sample standard deviation (com-
sequently. To estimate aV and V , we use a least squares fit puted over N images) and E is the error associated with
of log2 ðms0 Þ ¼ a0V V log2 ðs0 Þ to the 24 measurements com- each image and scaling factor as defined in Eq. (4). In
0
puted over natural images (and set aV ¼ 2aV ). Resulting Section 4.2 we confirmed that E½E ffi 0, our goal now is to
pffiffiffiffiffiffiffiffi
estimates of V are given in plot legends in Fig. 3. understand how s s ¼ stdev½E E½E 2 behaves.
There is strong agreement between the resulting best-fit In Fig. 4 we plot s s as a function of s for the same chan-
lines and the observations. In legend brackets in Fig. 3 we nels as in Fig. 3. In legend brackets we report s s for s ¼ 12 for
report expected error jE½Ej ¼ jms aV s V j for both natu- both natural and pedestrian images; for all channels studied
ral and pedestrian images averaged over s (using aV and s 1=2 < :2. In all cases s s increases gradually with increasing
V estimated using natural images). For basic gradient his- s and the deviation is low for small s. The expected magni-
tograms jE½Ej ¼ 0:018 for natural images and jE½Ej ¼ tude of E varies across channels, for example histograms of
0:037 for pedestrian images. Indeed, for every channel type normalized gradients (Fig. 4b) have lower s s than their
Eq. (4) is an excellent fit to the observations ms for both unnormalized counterparts (Fig. 4a). The trivial grayscale
image ensembles. channel (Fig. 4d) has s s ¼ 0 as the approximation is exact.
The derivation of Eq. (4) depends on the distribution Observe that often s s is greater for natural images than
of image statistics being stationary with respect to scale; for pedestrian images. Many of the natural images contain
that this holds for all channel types tested, and with relatively little structure (e.g., a patch of sky), for such
nearly an identical constant for both pedestrian and images fV ðIÞ is small for certain V (e.g., simple gradient his-
natural images, shows the estimate of V is robust and tograms) resulting in more variance in the ratio in Eq. (4).
generally applicable. For HOG channels (Fig. 4f), which have additional normali-
zation, this effect is minimized.

4.3 Deviation for Individual Images 4.4 Miscellanea

In Section 4.2 we verified that Eq. (4) holds for an ensem- We conclude this section with additional observations.
ble of images; we now examine the magnitude of devia- Interpolation method. Varying the interpolation algorithm
tion from the power law for individual images. We study for image resampling does not have a major effect. In
the effect this has in the context of object detection in Fig. 5a, we plot m1=2 and s 1=2 for normalized gradient histo-
Section 6. grams computed using nearest neighbor, bilinear, and bicu-
Plots of fV ðIs1 Þ=fV ðIs2 Þ for randomly selected images are bic interpolation. In all three cases both m1=2 and s 1=2 remain
shown as faint gray lines in Fig. 3. The individual curves are essentially unchanged.
relatively smooth and diverge only somewhat from the Window size. All preceding experiments were performed
best-fit line. We quantify their deviation by defining s s anal- on 128 64 windows. In Fig. 5b we plot the effect of varying
ogously to ms in Eq. (5): the window size. While m1=2 remains relatively constant,
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1538 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

qffiffiffiffiffiffiffiffiffiffiffiffi
Fig. 4. Power law deviation for individual images. For each of the six channel types described in Fig. 3 we plot s s versus s where s s ¼ E½E 2 and E is
the deviation from the power law for a single image as defined in Eq. (4). In brackets we report s 1=2 for both natural and pedestrian images. s s
increases gradually as a function of s, meaning that not only does Eq. (4) hold for an ensemble of images but also the deviation from the power law
for individual images is low for small s.

s 1=2 increases with decreasing window size (see also the 5 FAST FEATURE PYRAMIDS
derivation of Eq. (4)).
We introduce a novel, efficient scheme for computing fea-
Upsampling. The power law can predict features in higher
ture pyramids. First, in Section 5.1 we outline an approach
resolution images but not upsampled images. In practice,
for scaling feature channels. Next, in Section 5.2 we show its
though, we want to predict features in higher resolution as
application to constructing feature pyramids efficiently and
opposed to (smooth) upsampled images.
we analyze computational complexity in Section 5.3.
Robust estimation. In preceding derivations, when com-
puting fV ðIs1 Þ=fV ðIs2 Þ we assumed that fV ðIs2 Þ 6¼ 0. For the
V’s considered this was the case after windows of near uni- 5.1 Feature Channel Scaling
form intensity were excluded (see Section 3.1). Alterna- We propose an extension of the power law governing fea-
tively, we have found that excluding I with fV ðIÞ 0 when ture scaling introduced in Section 4 that applies directly to
estimating results in more robust estimates. channel images. As before, let Is denote I captured at scale
Sparse channels. For sparse channels where frequently s and RðI; sÞ denote I resampled by s. Suppose we have
fV ðIÞ 0, e.g., the output of a sliding-window object detec- computed C ¼ VðIÞ; can we predict the channel image
tor, s will be large. Such channels may not be good candi- Cs ¼ VðIs Þ at a new scale s using only C?
dates for the power law approximation. The standard approach is to compute Cs ¼ VðRðI; sÞÞ,
One-shot estimates. We can estimate as described in Sec- ignoring the information contained in C ¼ VðIÞ. Instead,
tion 4.2 using a single image in place of an ensemble (N ¼ 1). we propose the following approximation:
Such estimates are noisy but not entirely unreasonable; e.g.,
on normalized gradient histograms (with 0:101) the V
Cs RðC; sÞ s : (7)
mean of 4,280 single image estimates of if 0:096 and the
standard deviation of the estimates is 0:073. A visual demonstration of Eq. (7) is shown in Fig. 6.
Scale range. We expect the power law to break down at
extreme scales not typically encountered under natural
viewing conditions (e.g., under high magnification).

Fig. 6. Feature channel scaling. Suppose we have computed C ¼ VðIÞ;

Fig. 5. Effect of the interpolation algorithm and window size on channel can we predict Cs ¼ VðIs Þ at a new scale s? Top. the standard approach
scaling. We plot m1=2 (bar height) and s 1=2 (error bars) for normalized is to compute Cs ¼ VðRðI; sÞÞ, ignoring the information contained in
gradient histograms (see Section 3.3) . (a) Varying the interpolation algo- C ¼ VðIÞ. Bottom. instead, based on the power law introduced in
rithm for resampling does not have a major effect on either m1=2 or s 1=2 . Section 4, we propose to approximate Cs by RðC; sÞ s V . This
(b) Decreasing window size leaves m1=2 relatively unchanged but results approach is simple, general, and accurate, and allows for fast feature
in increasing s 1=2 . pyramid construction.
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1539

Eq. (7) follows from Eq. (4). Setting s1 ¼ s, s2 ¼ 1, and

rearranging Eq. (4) gives fV ðIs Þ fV ðIÞs V . This relation
must hold not only for the original images but also for any
pair of corresponding windows ws and w in Is and I, respec-
tively. Expanding yields:

fV Isws fV ðI w Þs V
1 X 1 X V
Cs ði; jÞ Cði; jÞs
jws j i;j2ws jwj i;j2w
V
Cs RðC; sÞs :

The final P line follows Pbecause if for all corresponding

Fig. 7. Fast feature pyramids. Color and grayscale icons represent
windows ws C 0 =jws j w C=jwj, then C 0 RðC; sÞ. images and channels; horizontal and vertical arrows denote computation
On a per-pixel basis, the approximation of Cs in Eq. (7) of R and V. Top. The standard pipeline for constructing a feature pyra-
may be quite noisy. The standard deviation s s of the ratio mid requires computing Is ¼ RðI; sÞ followed by Cs ¼ VðIs Þ for every s.
fV ðIsws Þ=fV ðI w Þ depends on the size of the window w: s s This is costly. Bottom. We propose computing Is ¼ RðI; sÞ and
Cs ¼ VðIs Þ for only a sparse set of s (once per octave). Then, at interme-
increases as w decreases (see Fig. 5b). Therefore, the accu- diate scales Cs is computed using the approximation in Eq. (7):
racy of the approximation for Cs will improve if information Cs RðCs0 ; s=s0 Þðs=s0 Þ V where s0 is the nearest scale for which we
is aggregated over multiple pixels of Cs . have Cs0 ¼ VðIs0 Þ. In the proposed scheme, the number of computations
of R is constant while (more expensive) computations of V are reduced
A simple strategy for aggregating over multiple pixels considerably.
and thus improving robustness is to downsample and/or
smooth Cs relative to Is (each pixel in the resulting channel 5.3 Complexity Analysis
will be a weighted sum of pixels in the original full resolu- The computational savings of computing approximate fea-
tion channel). Downsampling Cs also allows for faster pyra- ture pyramids is significant. Assume the cost of computing
mid construction (we return to this in Section 5.2). For V is linear in the number of pixels in an n n image (as is
object detection, we typically downsample channels by 4 often the case). The cost of constructing a feature pyramid
to 8 (e.g., HOG [21] uses 8 8 bins). with m scales per octave is
X
1 X
1
n2 mn2
5.2 Fast Feature Pyramids 1=m k
n2 2 2k=m
¼ n2 ð4 Þ ¼ 1=m
: (8)
A feature pyramid is a multi-scale representation of an k¼0 k¼0
1 4 ln4
image I where channels Cs ¼ VðIs Þ are computed at every
The second equality follows from the formula for a sum of a
scale s. Scales are sampled evenly in log-space, starting at
geometric series; the last approximation is valid for large m
s ¼ 1, with typically four to 12 scales per octave (an octave
(and follows from l’H^ opital’s rule). In the proposed
is the interval between one scale and another with half or
approach we compute V once per octave (m ¼ 1). The total
double its value). The standard approach to constructing a
cost is 43 n2 , which is only 33 percent more than the cost of
feature pyramid is to compute Cs ¼ VðRðI; sÞÞ for every s,
computing single scale features. Typical detectors are evalu-
see Fig. 7 (top).
ated on eight to 12 scales per octave [31], thus according to
The approximation in Eq. (7) suggests a straightfor-
(8) we achieve an order of magnitude savings over comput-
ward method for efficient feature pyramid construction.
ing V densely (and intermediate Cs are computed efficiently
We begin by computing Cs ¼ VðRðI; sÞÞ at just one scale
through resampling afterward).
per octave (s 2 f1; 12 ; 14 ; . . .g). At intermediate scales, Cs is
computed using Cs RðCs0 ; s=s0 Þðs=s0 Þ V where s0 2
f1; 12 ; 14 ; . . .g is the nearest scale for which we have 6 APPLICATIONS TO OBJECT DETECTION
Cs0 ¼ VðIs0 Þ, see Fig. 7 (bottom).
We demonstrate the effectiveness of fast feature pyramids
Computing Cs ¼ VðRðI; sÞÞ at one scale per octave
in the context of object detection with three distinct detec-
provides a good tradeoff between speed and accuracy.
tion frameworks. First, in Section 6.1 we show the efficacy
The cost of evaluating V is within 33 percent of comput-
of our approach with a simple yet state-of-the-art pedestrian
ing VðIÞ at the original scale (see Section 5.3) and chan-
detector we introduce in this work called Aggregated Channel
nels do not need to be approximated beyond half an
Features (ACF). In Section 6.2 we describe an alternate
octave (keeping error low, see Section 4.3). While the
approach for exploiting approximate multiscale features
number of evaluations of R is constant (evaluations of
using integral images computed over the same channels
RðI; sÞ are replaced by RðC; sÞ), if each Cs is down-
(integral channel features or ICF), much as in our previous
sampled relative to Is as described in Section 5.1, evalu-
work [29], [39]. Finally, in Section 6.3 we approximate HOG
ating RðC; sÞ is faster than RðI; sÞ.
feature pyramids for use with deformable part models [35].
Alternate schemes, such as interpolating between two
nearby scales s0 for each intermediate scale s or evaluating
V more densely, could result in even higher pyramid accu- 6.1 Aggregated Channel Features
racy (at increased cost). However, the proposed approach The ACF detection framework is conceptually straightfor-
proves sufficient for object detection (see Section 6). ward (Fig. 8). Given an input image I, we compute several
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1540 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

Fig. 8. Overview of the ACF detector. Given an input image I, we compute several channels C ¼ VðIÞ, sum every block of pixels in C, and smooth
the resulting lower resolution channels. Features are single pixel lookups in the aggregated channels. Boosting is used to learn decision trees over
these features (pixels) to distinguish object from background. With the appropriate choice of channels and careful attention to design, ACF achieves
state-of-the-art performance in pedestrian detection.

channels C ¼ VðIÞ, sum every block of pixels in C, and MRs for 16 competing methods are shown. ACF outper-
smooth the resulting lower resolution channels. Features forms competing approaches on nearly all datasets. When
are single pixel lookups in the aggregated channels. Boost- averaged over the four data sets, the MR of ACF is 40 per-
ing is used to train and combine decision trees over these cent with exact feature pyramids and 41 percent with fast
features (pixels) to distinguish object from background and feature pyramids, a negligible difference, demonstrating the
a multiscale sliding-window approach is employed. With effectiveness of our approach.
the appropriate choice of channels and careful attention to Speed. MR versus speed for numerous detectors is shown
design, ACF achieves state-of-the-art performance in pedes- in Fig. 10. ACF with fast feature pyramids runs at 32 fps.
trian detection. The only two faster approaches are Crosstalk cascades [40]
Channels. ACF uses the same channels as [39]: normal- and the VeryFast detector from Benenson et al. [30]. Their
ized gradient magnitude, histogram of oriented gradients additional speedups are based on improved cascade strate-
(six channels), and LUV color channels. Prior to computing gies and combining multi-resolution models with a GPU
the 10 channels, I is smoothed with a [1 2 1]/4 filter. The implementation, respectively, and are orthogonal to the
channels are divided into 4 4 blocks and pixels in each gains achieved by using approximate multiscale features.
block are summed. Finally the channels are smoothed, again Indeed, all the detectors that run at 5 fps and higher exploit
with a [1 2 1]/4 filter. For 640 480 images, computing the the power law governing feature scaling.
channels runs at over 100 fps on a modern PC. The code is Pyramid parameters. Detection performance on INRIA [21]
optimized but runs on a single CPU; further gains could be with fast feature pyramids under varying settings is shown
obtained using multiple cores or a GPU as in [30]. in Fig. 11. The key result is given in Fig. 11a: when approxi-
Pyramid. Computation of feature pyramids at octave- mating seven of eight scales per octave, the MR for ACF is
spaced scale intervals runs at 75 fps on 640 480 images. 0:169 which is virtually identical to the MR of 0:166 obtained
Meanwhile, computing exact feature pyramids with eight
scales per octave slows to 15 fps, precluding real-time
TABLE 1
detection. In contrast, our fast pyramid construction (see
MRs of Leading Approaches for Pedestrian Detection
Section 5) with seven of eight scales per octave approxi- on Four Data Sets
mated runs at nearly 50 fps.
Detector. For pedestrian detection, AdaBoost [69] is used
to train and combine 2,048 depth-two trees over the
128 64 10=16 ¼ 5;120 candidate features (channel pixel
lookups) in each 128 64 window. Training with multiple
rounds of bootstrapping takes 10 minutes (a parallel
implementation reduces this to 3 minutes). The detector
has a step size of four pixels and eight scales per octave. For
640 480 images, the complete system, including fast pyra-
mid construction and sliding-window detection, runs at
over 30 fps allowing for real-time uses (with exact feature
pyramids the detector slows to 12 fps).
Code. Code for the ACF framework is available online.4
For more details on the channels and detector used in ACF,
including exact parameter settings and training framework,
we refer users to the source code.
Accuracy. We report accuracy of ACF with exact and fast
feature pyramids in Table 1. Following the methodology of
[31], we summarize performance using the log-average
miss rate (MR) between 10 2 and 100 false positives per
image. Results are reported on four pedestrian data sets:
INRIA [21], Caltech [31], TUD-Brussels [36] and ETH [37].
For ICF and ACF exact and approximate detection results are shown
with only small differences between them. For the latest pedestrian
4. Code: http://vision.ucsd.edu/ pdollar/toolbox/doc/index.html. detection results please see [32].
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1541

than ACF due to construction of integral images and more

expensive features (rectangular sums computed via integral
images versus single pixel lookups). For more details on
ICF, see [29], [39]. The variant tested here uses identical
channels to ACF.
Detection performance with fast feature pyramids under
varying settings is shown in Fig. 12. The plots mirror the
results shown in Fig. 11 for ACF. The key result is given in
Fig. 9. (a) A standard pipeline for performing multiscale detection is to
Fig. 12a: when approximating seven of eight scales per
create a densely sampled feature pyramid. (b) Viola and Jones [27]
used simple shift and scale invariant features, allowing a detector to be octave, the MR for ICF is 2 percent worse than the MR
placed at any location and scale without relying on a feature pyramid. (c) obtained with exact feature pyramids.
ICF can use a hybrid approach of constructing an octave-spaced feature The ICF framework allows for an alternate application
pyramid followed by approximating detector responses within half an
octave of each pyramid level. of the power law governing feature scaling: instead of
rescaling channels as discussed in Section 5, one can
using the exact feature pyramid. Even approximating 15 of instead rescale the detector. Using the notation from
every 16 scales increases MR only somewhat. Constructing Section 4, rectangular channel sums (features used in ICF)
the channels without correcting for power law scaling, or can be written as AfV ðIÞ, where A denotes rectangle area.
using an incorrect value of , results in markedly decreased As such, Eq. (4) can be applied to approximate features at
performance, see Fig. 11b. Finally, we observe that at least nearby scales and given integral channel images computed
eight scales per octave must be used for good performance at one scale, detector responses can be approximated at
(Fig. 11c), making the proposed scheme crucial for achiev- nearby scales. This operation can be implemented by
ing detection results that are both fast and accurate. rescaling the detector itself, see [39]. As the approximation
degrades with increasing scale offsets, a hybrid approach
6.2 Integral Channel Features is to construct an octave-spaced feature pyramid followed
Integral channel features [29] are a precursor to the ACF by approximating detector responses at nearby scales, see
framework described in Section 6.1. Both ACF and ICF Fig. 9. This approach was extended in [30].
use the same channel features and boosted classifiers;
the key difference between the two frameworks is that 6.3 Deformable Part Models
ACF uses pixel lookups in aggregated channels as fea- Deformable part models from Felzenszwalb et al. [35] are an
tures while ICF uses sums over rectangular channel elegant approach for general object detection that have con-
regions (computed efficiently with integral images). sistently achieved top results on the PASCAL VOC chal-
Accuracy of ICF with exact and fast feature pyramids is lenge [38]. DPMs use a variant of HOG features [21] as their
shown in Table 1. ICF achieves state-of-the-art results: infe- image representation, followed by classification with linear
rior to ACF but otherwise outperforming most competing SVMs. An object model is composed of multiple parts, a
approaches. The MR of ICF averaged over the four data sets root model, and optionally multiple mixture components.
is 42 percent with exact feature pyramids and 45 percent For details see [35].
with fast feature pyramids. The gap of 3 percent is larger Recent approaches for increasing the speed of DPMs
than the 1 percent gap for ACF but still small. With fast fea- include work by Felzenszwalb et al. [44] and Pedersoli
ture pyramids ICF runs at 16 fps, see Fig. 10. ICF is slower et al. [45] on cascaded and coarse-to-fine deformable part

Fig. 10. Log-average miss rate on the INRIA pedestrian data set [21] versus frame rate on 640 480 images for multiple detectors. Method runtimes
were obtained from [31], see also [31] for citations for detectors A-L. Numbers in brackets indicate MR/fps for select approaches, sorted by speed.
All detectors that run at 5 fps and higher are based on our fast feature pyramids; these methods are also the most accurate. They include: (M)
FPDW [39] which is our original implementation of ICF, (N) ICF [Section 6.2], (O) ACF [Section 6.1], (P) crosstalk cascades [40], and (Q) the Very-
Fast detector from Benenson et al. [30]. Both (P) and (Q) use the power law governing feature scaling described in this work; the additional speedups
in (P) and (Q) are based on improved cascade strategies, multi-resolution models and a GPU implementation, and are orthogonal to the gains
achieved by using approximate multiscale features.
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1542 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

Fig. 11. Effect of parameter setting of fast feature pyramids on the ACF detector [Section 6.1]. We report log-average miss rate averaged over 25 tri-
als on the INRIA pedestrian data set [21]. Orange diamonds denote default parameter settings: 7/8 scales approximated per octave, 0:17 for the
normalized gradient channels, and eight scales per octave in the pyramid. (a) The MR stays relatively constant as the fraction of approximated scales
increases up to 7/8 demonstrating the efficacy of the proposed approach. (b) Sub-optimal values of when approximating the normalized gradient
channels cause a marked decrease in performance. (c) At least eight scales per octave are necessary for good performance, making the proposed
scheme crucial for achieving detection results that are both fast and accurate.

models, respectively. Our work is complementary as we widely believed that the price to be paid for improved
focus on improving the speed of pyramid construction. performance is sharply increased computational costs.
The current bottleneck of DPMs is in the classification We have shown that this is not necessarily so. Finely
stage, therefore pyramid construction accounts for only a sampled pyramids may be obtained inexpensively by
fraction of total runtime. However, if fast feature pyra- extrapolation from coarsely sampled ones. This insight
mids are coupled with optimized classification schemes decreases computational costs substantially.
[44], [45], DPMs have the potential to have more competi- Our insight ultimately relies on the fractal structure of
tive runtimes. We focus on demonstrating DPMs can much of the visual world. By investigating the statistics of
achieve good accuracy with fast feature pyramids and natural images we have demonstrated that the behavior of
leave the coupling of fast feature pyramids and optimized image features can be predicted reliably across scales. Our
classification schemes to practitioners. calculations and experiments show that this makes it possi-
DPM code is available online [35]. We tested pre-trained ble to estimate features at a given scale inexpensively by
DPM models on the 20 PASCAL 2007 categories using exact extrapolating computations carried out at a coarsely sam-
HOG pyramids and HOG pyramids with nine of 10 scales pled set of scales. While our results do not hold under all
per octave approximated using our proposed approach. circumstances, for instance, on images of textures or white
Average precision (AP) scores for the two approaches, noise, they do hold for images typically encountered in the
denoted DPM and DPM, respectively, are shown in natural world.
Table 2. The mean AP across the 20 categories is 26.6 percent In order to validate our findings we studied the perfor-
for DPMs and 24.5 percent for DPMs. Using fast HOG fea- mance of three end-to-end object detection systems. We
ture pyramids only decreased mean AP 2 percent, demon- found that detection rates are relatively unaffected while
strating the validity of the proposed approach. computational costs decrease considerably. This has led to
the first detectors that operate at frame rate while using rich
feature representations.
7 CONCLUSION Our results are not restricted to object detection nor to
Improvements in the performance of visual recognition visual recognition. The foundations we have developed
systems in the past decade have in part come from the should readily apply to other computer vision tasks where
realization that finely sampled pyramids of image fea- a fine-grained scale sampling of features is necessary as the
tures provide a good front-end for image analysis. It is image processing front end.

Fig. 12. Effect of parameter setting of fast feature pyramids on the ICF detector [Section 6.2]. The plots mirror the results shown in Fig. 11 for the ACF
detector, although overall performance for ICF is slightly lower. (a) When approximating seven of every eight scales in the pyramid, the MR for ICF is
0:195 which is only slightly worse than the MR of 0:176 obtained using exact feature pyramids. (b) Computing approximate channels with an incorrect
value of results in decreased performance (although using a slightly larger than predicted appears to improve results marginally). (c) Similarly to
the ACF framework, at least eight scales per octave are necessary to achieve good results.
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1543

TABLE 2 [16] Y. Weiss and E. Adelson, “A Unified Mixture Framework for

Average Precision Scores for Deformable Part Models Motion Segmentation: Incorporating Spatial Coherence and Esti-
with Exact (DPM) and Approximate (DPM) mating the Number of Models,” Proc. IEEE Conf. Computer Vision
and Pattern Recognition (CVPR), 1996.
Feature Pyramids on PASCAL
[17] L. Itti, C. Koch, and E. Niebur, “A Model of Saliency-Based Visual
Attention for Rapid Scene Analysis,” IEEE Trans. Pattern Analysis
and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, Nov. 1998.
[18] P. Perona and J. Malik, “Detecting and Localizing Edges Com-
posed of Steps, Peaks and Roofs,” Proc. Third IEEE Int’l Conf. Com-
puter Vision (ICCV), 1990.
[19] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. von der
Malsburg, R. Wurtz, and W. Konen, “Distortion Invariant Object
Recognition in the Dynamic Link Architecture,” IEEE Trans. Com-
puters, vol. 42, no. 3, pp. 300-311, Mar. 1993.
[20] D.G. Lowe, “Object Recognition from Local Scale-Invariant
Features,” Proc. Seventh IEEE Int’l Conf. Computer Vision (ICCV),
ACKNOWLEDGMENTS 1999.
The authors thank Peter Welinder and Rodrigo Benenson [21] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for
Human Detection,” Proc. IEEE Conf. Computer Vision and Pattern
for helpful comments and suggestions. Piotr Doll
ar, Ron Recognition (CVPR), 2005.
Appel, and Pietro Perona were supported by MURI- [22] R. De Valois, D. Albrecht, and L. Thorell, “Spatial Frequency
ONR N00014-10-1-0933 and ARO/JPL-NASA Stennis Selectivity of Cells in Macaque Visual Cortex,” Vision Research,
vol. 22, no. 5, pp. 545-559, 1982.
NAS7.03001. Ron Appel was also supported by NSERC [23] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, “Gradient-Based
420456-2012 and The Moore Foundation. Serge Belongie Learning Applied to Document Recognition,” Proc. IEEE, vol. 86,
was supported by the US National Science Foundation no. 11, pp. 2278-2324, Nov. 1998.
(NSF) CAREER Grant 0448615, MURI-ONR N00014-08-1- [24] M. Riesenhuber and T. Poggio, “Hierarchical Models of Object
Recognition in Cortex,” Nature Neuroscience, vol. 2, pp. 1019-1025,
0638 and a Google Research Award. 1999.
[25] D.G. Lowe, “Distinctive Image Features from Scale-Invariant Key-
REFERENCES points,” Int’l J. Computer Vision, vol. 60, no. 2, pp. 91-110, 2004.
[26] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet Classifica-
[1] D. Hubel and T. Wiesel, “Receptive Fields and Functional Archi- tion with Deep Convolutional Neural Networks,” Proc. Advances
tecture of Monkey Striate Cortex,” J. Physiology, vol. 195, pp. 215- in Neural Information Processing Systems (NIPS), 2012.
243, 1968. [27] P. Viola and M. Jones, “Rapid Object Detection Using a Boosted
[2] C. Malsburg, “Self-Organization of Orientation Sensitive Cells in Cascade of Simple Features,” Proc. IEEE Conf. Computer Vision and
the Striate Cortex,” Biological Cybernetics, vol. 14, no. 2, pp. 85-100, Pattern Recognition (CVPR), 2001.
1973. [28] P. Viola, M. Jones, and D. Snow, “Detecting Pedestrians Using
[3] L. Maffei and A. Fiorentini, “The Visual Cortex as a Spatial Fre- Patterns of Motion and Appearance,” Int’l J. Computer Vision,
quency Analyser,” Vision Research, vol. 13, no. 7, pp. 1255-1267, vol. 63, no. 2, pp. 153-161, 2005.
1973. [29] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral Channel
[4] P. Burt and E. Adelson, “The Laplacian Pyramid as a Compact Features,” Proc. British Machine Vision Conf. (BMVC), 2009.
Image Code,” IEEE Trans. Comm., vol. 31, no. 4, pp. 532-540, Apr. [30] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool,
1983. “Pedestrian Detection at 100 Frames per Second,” Proc. IEEE Conf.
[5] J. Daugman, “Uncertainty Relation for Resolution in Space, Spa- Computer Vision and Pattern Recognition (CVPR), 2012.
tial Frequency, and Orientation Optimized by Two-Dimensional [31] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Detec-
Visual Cortical Filters,” J. Optical Soc. Am. A: Optics, Image Science, tion: An Evaluation of the State of the Art,”IEEE Trans. Pattern
and Vision, vol. 2, pp. 1160-1169, 1985. Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2012.
[6] J. Koenderink and A. Van Doorn, “Representation of Local Geom- [32] www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/,
etry in the Visual System,” Biological Cybernetics, vol. 55, no. 6, 2014.
pp. 367-375, 1987. [33] D.L. Ruderman and W. Bialek, “Statistics of Natural Images: Scal-
[7] D.J. Field, “Relations between the Statistics of Natural Images ing in the Woods,” Physical Rev. Letters, vol. 73, no. 6, pp. 814-817,
and the Response Properties of Cortical Cells,” J. Optical Soc. Am. Aug. 1994.
A: Optics, Image Science, and Vision, vol. 4, pp. 2379-2394, 1987. [34] E. Switkes, M. Mayer, and J. Sloan, “Spatial Frequency Analysis of
[8] S. Mallat, “A Theory for Multiresolution Signal Decomposition: the Visual Environment: Anisotropy and the Carpentered Envi-
The Wavelet Representation,” IEEE Trans. Pattern Analysis and ronment Hypothesis,” Vision Research, vol. 18, no. 10, pp. 1393-
Machine Intelligence, vol. 11, no. 7, pp. 647-693, July 1989. 1399, 1978.
[9] P. Vaidyanathan, “Multirate Digital Filters, Filter Banks, Poly- [35] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan,
phase Networks, and Applications: A Tutorial,” Proc. IEEE, “Object Detection with Discriminatively Trained Part Based Mod-
vol. 78, no. 1, pp. 56-93, Jan. 1990. els,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 32,
[10] M. Vetterli, “A Theory of Multirate Filter Banks,” IEEE Trans. no. 9, pp. 1627-1645, Sept. 2010.
Acoustics, Speech and Signal Processing, vol. 35, no. 3, pp. 356-372, [36] C. Wojek, S. Walk, and B. Schiele, “Multi-Cue Onboard Pedestrian
Mar. 1987. Detection,” Proc. IEEE Conf. Computer Vision and Pattern Recogni-
[11] E. Simoncelli and E. Adelson, “Noise Removal via Bayesian tion (CVPR), 2009.
Wavelet Coring,” Proc. Int’l Conf. Image Processing (ICIP), vol. 1, [37] A. Ess, B. Leibe, and L. Van Gool, “Depth and Appearance for
1996. Mobile Scene Analysis,” Proc. IEEE 11th Int’l Conf. Computer Vision
[12] W.T. Freeman and E.H. Adelson, “The Design and Use of Steer- (ICCV), 2007.
able Filters,” IEEE Trans. Pattern Analysis and Machine Intelligence, [38] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, and A.
vol. 13, no. 9, pp. 891-906, Sept. 1991. Zisserman, “The PASCAL Visual Object Classes (VOC)
[13] J. Malik and P. Perona, “Preattentive Texture Discrimination with Challenge,” Int’l J. Computer Vision, vol. 88, no. 2, pp. 303-338,
Early Vision Mechanisms,” J. Optical Soc. Am. A, vol. 7, pp. 923- June 2010.
932, May 1990. [39] P. Dollar, S. Belongie, and P. Perona, “The Fastest Pedestrian
[14] D. Jones and J. Malik, “Computational Framework for Determin- Detector in the West,” Proc. British Machine Vision Conf. (BMVC),
ing Stereo Correspondence from a Set of Linear Spatial Filters,” 2010.
Image and Vision Computing, vol. 10, no. 10, pp. 699-708, 1992. [40] P. Dollar, R. Appel, and W. Kienzle, “Crosstalk Cascades for
[15] E. Adelson and J. Bergen, “Spatiotemporal Energy Models for the Frame-Rate Pedestrian Detection,” Proc. 12th European Conf. Com-
Perception of Motion,” J. Optical Soc. Am. A, vol. 2, no. 2, pp. 284- puter Vision (ECCV), 2012.
299, 1985.
Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
1544 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 8, AUGUST 2014

[41] T. Lindeberg, “Scale-Space for Discrete Signals,” IEEE Trans. Pat- [67] D.L. Ruderman, “The Statistics of Natural Images,” Network: Com-
tern Analysis and Machine Intelligence, vol. 12, no. 3, pp. 234-254, putation in Neural Systems, vol. 5, no. 4, pp. 517-548, 1994.
Mar. 1990. [68] S.G. Ghurye, “A Characterization of the Exponential Function,”
[42] J.L. Crowley, O. Riff, and J.H. Piater, “Fast Computation of Char- The Am. Math. Monthly, vol. 64, no. 4, pp. 255-257, 1957.
acteristic Scale Using a Half-Octave Pyramid,” Proc. Fourth Int’l [69] J. Friedman, T. Hastie, and R. Tibshirani, “Additive Logistic
Conf. Scale-Space Theories in Computer Vision, 2002. Regression: A Statistical View of Boosting,” Annals of Statistics,
[43] R.S. Eaton, M.R. Stevens, J.C. McBride, G.T. Foil, and M.S. Snorra- vol. 38, no. 2, pp. 337-374, 2000.
son, “A Systems View of Scale Space,” Proc. IEEE Int’l Conf. Com- [70] P. Sabzmeydani and G. Mori, “Detecting Pedestrians by Learning
puter Vision Systems (ICVS), 2006. Shapelet Features,” Proc. IEEE Conf. Computer Vision and Pattern
[44] P. Felzenszwalb, R. Girshick, and D. McAllester, “Cascade Object Recognition (CVPR), 2007.
Detection with Deformable Part Models,” Proc. IEEE Conf. Com- [71] Z. Lin and L.S. Davis, “A Pose-Invariant Descriptor for Human
puter Vision and Pattern Recognition (CVPR), 2010. Detection and Segmentation,” Proc. 10th European Conf. Computer
[45] M. Pedersoli, A. Vedaldi, and J. Gonzalez, “A Coarse-to-Fine Vision (ECCV), 2008.
Approach for Fast Deformable Object Detection,” Proc. IEEE Conf. [72] S. Maji, A. Berg, and J. Malik, “Classification Using Intersection
Computer Vision and Pattern Recognition (CVPR), 2011. Kernel SVMs Is Efficient,” Proc. IEEE Conf. Computer Vision and
[46] C.H. Lampert, M.B. Blaschko, and T. Hofmann, “Efficient Sub- Pattern Recognition (CVPR), 2008.
window Search: A Branch and Bound Framework for Object [73] X. Wang, T.X. Han, and S. Yan, “An HOG-LBP Human Detector
Localization,” IEEE Trans. Pattern Analysis and Machine Intelligence, with Partial Occlusion Handling,” Proc. IEEE Int’l Conf. Computer
vol. 31, no. 12, pp. 2129-2142, Dec. 2009. Vision (ICCV), 2009.
[47] L. Bourdev and J. Brandt, “Robust Object Detection via Soft [74] C. Wojek and B. Schiele, “A Performance Evaluation of Single and
Cascade,” Proc. IEEE Conf. Computer Vision and Pattern Recognition Multi-Feature People Detection,” Proc. 30th DAGM Symp. Pattern
(CVPR), 2005. Recognition (DAGM), 2008.
[48] C. Zhang and P. Viola, “Multiple-Instance Pruning for Learning [75] W. Schwartz, A. Kembhavi, D. Harwood, and L. Davis, “Human
Efficient Cascade Detectors,” Proc. Advances in Neural Information Detection Using Partial Least Squares Analysis,” Proc. IEEE 12th
Processing Systems (NIPS), 2007. Int’l Conf. Computer Vision (ICCV), 2009.

[49] J. Sochman and J. Matas, “Waldboost—Learning for Time Con- [76] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New Features and
strained Sequential Detection,” Proc. IEEE Conf. Computer Vision Insights for Pedestrian Detection,” Proc. IEEE Conf. Computer
and Pattern Recognition (CVPR), 2005. Vision and Pattern Recognition (CVPR), 2010.
[50] H. Masnadi-Shirazi and N. Vasconcelos, “High Detection-Rate [77] D. Park, D. Ramanan, and C. Fowlkes, “Multiresolution Models
Cascades for Real-Time Object Detection,” Proc. IEEE 11th Int’l for Object Detection,” Proc. 11th European Conf. Computer Vision
Conf. Computer Vision (ICCV), 2007. (ECCV), 2010.
[51] F. Fleuret and D. Geman, “Coarse-To-Fine Face Detection,” Int’l J.
Computer Vision, vol. 41, no. 1/2, pp. 85-107, 2001. Piotr Dolla r received the master’s degree in
[52] P. Felzenszwalb and D. Huttenlocher, “Efficient Matching of Pic- computer science from Harvard University in
torial Structures,” Proc. IEEE Conf. Computer Vision and Pattern Rec- 2002 and the PhD degree from the University of
ognition (CVPR), 2000. California, San Diego in 2007. He joined the
[53] C. Papageorgiou and T. Poggio, “A Trainable System for Object Computational Vision Lab at the California Insti-
Detection,” Int’l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000. tute of Technology Caltech as a postdoctoral fel-
[54] M. Weber, M. Welling, and P. Perona, “Unsupervised Learning of low in 2007. Upon being promoted to a senior
Models for Recognition,” Proc. European Conf. Computer Vision postdoctoral fellow he realized it was time to
(ECCV), 2000. move on, and in 2011, he joined the Interactive
[55] S. Agarwal and D. Roth, “Learning a Sparse Representation for Visual Media Group at Microsoft Research,
Object Detection,” Proc. European Conf. Computer Vision (ECCV), Redmond, Washington, where he currently
2002. resides. He has worked on object detection, pose estimation, boundary
[56] R. Fergus, P. Perona, and A. Zisserman, “Object Class Recognition learning, and behavior recognition. His general interests include
by Unsupervised Scale-Invariant Learning,” Proc. IEEE Conf. Com- machine learning and pattern recognition, and their application to com-
puter Vision and Pattern Recognition (CVPR), 2003. puter vision.
[57] B. Leibe, A. Leonardis, and B. Schiele, “Robust Object Detection
with Interleaved Categorization and Segmentation,” Int’l J. Com-
puter Vision, vol. 77, no. 1-3, pp. 259-289, May 2008. Ron Appel received the bachelor’s and master’s
[58] C. Gu, J.J. Lim, P. Arbelaez, and J. Malik, “Recognition Using degrees in electrical and computer engineering
Regions,” Proc. IEEE Conf. Computer Vision and Pattern Recognition from the University of Toronto in 2006 and
(CVPR), 2009. 2008, respectively. He is working toward the
[59] B. Alexe, T. Deselaers, and V. Ferrari, “What Is an Object?” Proc. PhD degree in the Computational VisionLab at
IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2010. the California Institute of Technology, Caltech,
[60] C. Wojek, G. Dork o, A. Schulz, and B. Schiele, “Sliding-Windows where he currently holds an NSERC Graduate
for Rapid Object Class Localization: A Parallel Technique,” Proc. Award. He cofounded ViewGenie Inc., a com-
30th DAGM Symp. Pattern Recognition, 2008. pany specializing in intelligent image processing
[61] L. Zhang and R. Nevatia, “Efficient Scan-Window Based Object and search. His research interests include
Detection Using GPGPU,” Proc. Workshop Visual Computer Vision machine learning, visual object detection, and
on GPU’s (CVGPU), 2008. algorithmic optimization.
[62] B. Bilgic, “Fast Human Detection with Cascaded Ensembles,”
master’s thesis, MIT, Feb. 2010.
[63] Q. Zhu, S. Avidan, M. Yeh, and K. Cheng, “Fast Human Detection
Using a Cascade of Histograms of Oriented Gradients,” Proc.
IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2006.
[64] F.M. Porikli, “Integral Histogram: A Fast Way to Extract Histo-
grams in Cartesian Spaces,” Proc. IEEE Conf. Computer Vision and
Pattern Recognition (CVPR), 2005.
[65] A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “Robust Multi-Per-
son Tracking from a Mobile Platform,” IEEE Trans. Pattern Analy-
sis and Machine Intelligence, vol. 31, no. 10, pp. 1831-1846, Oct. 2009.
[66] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, and L.H.
Matthies, “A Fast Stereo-Based System for Detecting and Tracking
Pedestrians from a Moving Vehicle,” Int’l J. Robotics Research,
vol. 28, pp. 1466-1485, 2009.

Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.
ET AL.: FAST FEATURE PYRAMIDS FOR OBJECT DETECTION
DOLLAR 1545

Serge Belongie received the BS (with honor) in Pietro Perona received the graduate degree
electrical engineering from the California Institute di
in electrical engineering from the Universita
of Technology, Caltech in 1995 and the PhD Padova in 1985 and the PhD degree in electri-
degree in electrical engineering and computer cal engineering and computer science from
science from the University of California at Ber- the University of California at Berkeley in
keley in 2000. While at the University of California 1990. After a postdoctoral fellowship at MIT in
at Berkeley, his research was supported by the 1990-1991 he joined the faculty of the Califor-
US National Science Foundation (NSF) Graduate nia Institute of Technology, Caltech in 1991,
Research Fellowship. From 2001 to 2013, he where he is now an Allen E. Puckett professor
was a professor in the Department of Computer of electrical engineering and computation and
Science and Engineering at the University of Cal- neural systems. His current interests include
ifornia, San Diego (UCSD). He is currently a professor at Cornell NYC visual recognition, modeling vision in biological systems, modeling
Tech and the Cornell Computer Science Department, Ithaca, New York. and measuring behavior, and Visipedia. He has worked on aniso-
His research interests include computer vision, machine learning, crowd- tropic diffusion, multiresolution-multiorientation filtering, human tex-
sourcing, and human-in-the-loop computing. He is also a cofounder of ture perception and segmentation, dynamic vision, grouping,
several companies including Digital Persona, Anchovi Labs (acquired by analysis of human motion, recognition of object categories, and
Dropbox) and Orpix. He is a recipient of the US National Science Foun- modeling visual search.
dation (NSF) CAREER Award, the Alfred P. Sloan Research Fellowship,
and the MIT Technology Review “Innovators Under 35” Award.
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.

Authorized licensed use limited to: Fuzhou University. Downloaded on March 04,2024 at 03:27:10 UTC from IEEE Xplore. Restrictions apply.

Using Grayscale Images For Object Recognition With Convolutional-Recursive Neural Network
No ratings yet
Using Grayscale Images For Object Recognition With Convolutional-Recursive Neural Network
5 pages
E-Housing Helping System: Ms. Muskaan Shaikh
No ratings yet
E-Housing Helping System: Ms. Muskaan Shaikh
26 pages
Voilajones Paper PDF
No ratings yet
Voilajones Paper PDF
8 pages
Computer vision
No ratings yet
Computer vision
13 pages
Pedestrian Detection at 100 Frames Per Second
No ratings yet
Pedestrian Detection at 100 Frames Per Second
8 pages
Feature Pyramid Networks For Object Detection
No ratings yet
Feature Pyramid Networks For Object Detection
10 pages
_PhD Visual Object Category Recognition
No ratings yet
_PhD Visual Object Category Recognition
193 pages
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 download
100% (4)
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 download
45 pages
Feature Pyramid Networks For Object Detection
No ratings yet
Feature Pyramid Networks For Object Detection
9 pages
8 Rapid Object Detection Using A Boosted Cascade of Simple Features
No ratings yet
8 Rapid Object Detection Using A Boosted Cascade of Simple Features
8 pages
Empirical Analysis of Detection Cascades of Boosted Classifiers For Rapid Object Detection
No ratings yet
Empirical Analysis of Detection Cascades of Boosted Classifiers For Rapid Object Detection
8 pages
Rapid Object Detection Using A Boosted Cascade of Simple Features
No ratings yet
Rapid Object Detection Using A Boosted Cascade of Simple Features
9 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
2017 05 12 Image Segmentation
No ratings yet
2017 05 12 Image Segmentation
2 pages
2016 - Alex Pappachen James - Edge Detection For Pattern Recognition A Survey
No ratings yet
2016 - Alex Pappachen James - Edge Detection For Pattern Recognition A Survey
33 pages
CH 8
No ratings yet
CH 8
21 pages
Keypoint Recognition Using Randomized Trees
No ratings yet
Keypoint Recognition Using Randomized Trees
29 pages
Project Detecto!: A Real-Time Object Detection Model
No ratings yet
Project Detecto!: A Real-Time Object Detection Model
3 pages
06 Features
No ratings yet
06 Features
94 pages
Beyond Bags of Features: Spatial Pyramid Matching For Recognizing Natural Scene Categories
No ratings yet
Beyond Bags of Features: Spatial Pyramid Matching For Recognizing Natural Scene Categories
8 pages
Ballard D. and Brown C. M. 1982 Computer Vision
100% (2)
Ballard D. and Brown C. M. 1982 Computer Vision
539 pages
18 TallapallyHarini 162-170
No ratings yet
18 TallapallyHarini 162-170
9 pages
Thesis PDF
No ratings yet
Thesis PDF
208 pages
Unit 3-Non CNN approaches to object recognition
No ratings yet
Unit 3-Non CNN approaches to object recognition
26 pages
Object Detection: Advances, Applications, and Algorithms
From Everand
Object Detection: Advances, Applications, and Algorithms
Fouad Sabry
No ratings yet
Computer Vision Application
No ratings yet
Computer Vision Application
2 pages
Image Features and Categorization: Computer Vision Jia-Bin Huang, Virginia Tech
No ratings yet
Image Features and Categorization: Computer Vision Jia-Bin Huang, Virginia Tech
70 pages
Tripartite Feature Enhanced Pyramid Network For Dense Prediction
No ratings yet
Tripartite Feature Enhanced Pyramid Network For Dense Prediction
15 pages
Po-Jui Huang and Duan-Yu Chen Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan Dychen@saturn - Yzu.edu - TW, S970561@mail - Yzu.edu - TW
No ratings yet
Po-Jui Huang and Duan-Yu Chen Department of Electrical Engineering, Yuan Ze University, Chung-Li, Taiwan Dychen@saturn - Yzu.edu - TW, S970561@mail - Yzu.edu - TW
5 pages
Unit 4
No ratings yet
Unit 4
21 pages
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 - The full ebook with all chapters is available for download
100% (6)
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 - The full ebook with all chapters is available for download
73 pages
DeepPrimitive Image Decomposition by Layered Primi
No ratings yet
DeepPrimitive Image Decomposition by Layered Primi
13 pages
Sift
No ratings yet
Sift
28 pages
Two Types of Image Segmentation Exist:: Semantic Segmentation. Objects Shown in An Image Are Grouped Based On
No ratings yet
Two Types of Image Segmentation Exist:: Semantic Segmentation. Objects Shown in An Image Are Grouped Based On
25 pages
The Techniques For Face Recognition With Support Vector Machines
No ratings yet
The Techniques For Face Recognition With Support Vector Machines
6 pages
Overview and Implementation of Fast Corner Detection Method
No ratings yet
Overview and Implementation of Fast Corner Detection Method
11 pages
Tensor Flow
No ratings yet
Tensor Flow
5 pages
Understanding Regions and Region Segmentation: by Nayan Khinvasara
No ratings yet
Understanding Regions and Region Segmentation: by Nayan Khinvasara
59 pages
Machine Vision: Chapter Index
No ratings yet
Machine Vision: Chapter Index
1 page
Features 1 B
No ratings yet
Features 1 B
94 pages
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 - Download the ebook today and experience the full content
No ratings yet
Natural Image Stattistics A Probabilistic Approach to Early Computational vision 1st Edition by Aapo Hyvarinen, Jarmo Hurri, Patrick O Hoyer ISBN 1848824904 9781848824904 - Download the ebook today and experience the full content
44 pages
9-2e. SIFT-21-08-2024
No ratings yet
9-2e. SIFT-21-08-2024
66 pages
SR22804211151
No ratings yet
SR22804211151
8 pages
Chapter 1
No ratings yet
Chapter 1
8 pages
A Review On Multiscale-Deep-Learning Applications
No ratings yet
A Review On Multiscale-Deep-Learning Applications
28 pages
Computer Vision Methods For Fast Image Classification and Retrieval 2020
100% (4)
Computer Vision Methods For Fast Image Classification and Retrieval 2020
144 pages
Improved Edge Detection Based Fast Face Detection Method Using Enhanced Fourier Transform (IED-FFD) Towards Facial Expression Recongnition
No ratings yet
Improved Edge Detection Based Fast Face Detection Method Using Enhanced Fourier Transform (IED-FFD) Towards Facial Expression Recongnition
7 pages
Havi Batch 10
No ratings yet
Havi Batch 10
15 pages
Lec 27
No ratings yet
Lec 27
25 pages
CV Unit 3
No ratings yet
CV Unit 3
41 pages
Obstacle Detection For Visually Impaired
No ratings yet
Obstacle Detection For Visually Impaired
4 pages
Felisberto Et-Al 2003
No ratings yet
Felisberto Et-Al 2003
10 pages
CIL13-503 Rev. 2
No ratings yet
CIL13-503 Rev. 2
11 pages
Object Recognition System Design in Computer Vision: A Universal Approach
No ratings yet
Object Recognition System Design in Computer Vision: A Universal Approach
18 pages
JCTN Avinash Rohini 417 425
No ratings yet
JCTN Avinash Rohini 417 425
10 pages
Digital Image Processing
No ratings yet
Digital Image Processing
10 pages
Computer Vision: Exploring the Depths of Computer Vision
From Everand
Computer Vision: Exploring the Depths of Computer Vision
Fouad Sabry
No ratings yet
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
From Everand
Underwater Computer Vision: Exploring the Depths of Computer Vision Beneath the Waves
Fouad Sabry
No ratings yet
Computer Vision: Fundamentals and Applications
From Everand
Computer Vision: Fundamentals and Applications
Fouad Sabry
No ratings yet
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
From Everand
Visual Sensor Network: Exploring the Power of Visual Sensor Networks in Computer Vision
Fouad Sabry
No ratings yet
Optical Braille Recognition: Empowering Accessibility Through Visual Intelligence
From Everand
Optical Braille Recognition: Empowering Accessibility Through Visual Intelligence
Fouad Sabry
No ratings yet
DSI-NRF Master's and Doctoral Application and Funding Guide For 2025
No ratings yet
DSI-NRF Master's and Doctoral Application and Funding Guide For 2025
31 pages
Stage: Business Analysis Business Requirements Document: University of Edinburgh
No ratings yet
Stage: Business Analysis Business Requirements Document: University of Edinburgh
13 pages
New Pass4itsure HP HPE0-J74 PDF - Foundations of HPE Storage Solutions Exam
No ratings yet
New Pass4itsure HP HPE0-J74 PDF - Foundations of HPE Storage Solutions Exam
10 pages
High-Precision Advanced Tuning Fork Balance: Series Manual
No ratings yet
High-Precision Advanced Tuning Fork Balance: Series Manual
121 pages
Different Types of Camera Lenses
No ratings yet
Different Types of Camera Lenses
9 pages
HV350 Catalog
No ratings yet
HV350 Catalog
5 pages
Hotel Dashboard
No ratings yet
Hotel Dashboard
41 pages
AMMA 2025 Conference Brochure
No ratings yet
AMMA 2025 Conference Brochure
2 pages
Models and Guidelines For The Design and Development of A Joint Micro-Credential Programme in Higher Education
No ratings yet
Models and Guidelines For The Design and Development of A Joint Micro-Credential Programme in Higher Education
91 pages
Metalux LHB Led High Bay Specsheet
No ratings yet
Metalux LHB Led High Bay Specsheet
3 pages
Auction Manager For Bidders
No ratings yet
Auction Manager For Bidders
4 pages
Prelim Exam 22-23 DSBDA
No ratings yet
Prelim Exam 22-23 DSBDA
2 pages
ICT - JAVA1-Grade11 - IntroToProgrammingLanguage
No ratings yet
ICT - JAVA1-Grade11 - IntroToProgrammingLanguage
17 pages
Practice Problems: Concurrency: Lectures On Operating Systems (Mythili Vutukuru, IIT Bombay)
No ratings yet
Practice Problems: Concurrency: Lectures On Operating Systems (Mythili Vutukuru, IIT Bombay)
38 pages
Sensor Mic
No ratings yet
Sensor Mic
4 pages
Attendance System With Bluetoothand Android App
No ratings yet
Attendance System With Bluetoothand Android App
5 pages
Overlay For Waterproofing Membrane On Roof Deck Floor
No ratings yet
Overlay For Waterproofing Membrane On Roof Deck Floor
5 pages
Dump State
No ratings yet
Dump State
6 pages
DFo 4 1 Project
No ratings yet
DFo 4 1 Project
3 pages
Presentation On Youtube Streamers Analysis
No ratings yet
Presentation On Youtube Streamers Analysis
9 pages
Devoir a Comicile Partie 2 Security Comprehensive
No ratings yet
Devoir a Comicile Partie 2 Security Comprehensive
18 pages
Diesel Key Start Board With Low Coolant SMD - DKSTLS: Features
No ratings yet
Diesel Key Start Board With Low Coolant SMD - DKSTLS: Features
1 page
ZF Check Xopvw
No ratings yet
ZF Check Xopvw
7 pages
MSF 390
No ratings yet
MSF 390
23 pages
AWS Certified Solutions Architect - Professional (SAP-C01) Exam Guide
No ratings yet
AWS Certified Solutions Architect - Professional (SAP-C01) Exam Guide
9 pages
Dahua Bullet Network Camera - Quick Start Guide - V1.0.0
No ratings yet
Dahua Bullet Network Camera - Quick Start Guide - V1.0.0
19 pages
Var. I - 3.AXC-EX - 400-6 - 36°-2-P - (160) - (2.2 - KW) - S - IE3
No ratings yet
Var. I - 3.AXC-EX - 400-6 - 36°-2-P - (160) - (2.2 - KW) - S - IE3
7 pages
Test Case Writing
No ratings yet
Test Case Writing
10 pages
Apuntes Inglés SMR2 23-24
No ratings yet
Apuntes Inglés SMR2 23-24
67 pages

Fast Feature Pyramids For Object Detection

Uploaded by

Fast Feature Pyramids For Object Detection

Uploaded by

1532 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO.

Fast Feature Pyramids for Object Detection

M ULTI-RESOLUTION multi-orientation decompositions

1 X fV ðIs1 Þ=fV ðIs2 Þ ¼ ðs1 =s2 Þ V

4.3 Deviation for Individual Images 4.4 Miscellanea

Fig. 6. Feature channel scaling. Suppose we have computed C ¼ VðIÞ;

Eq. (7) follows from Eq. (4). Setting s1 ¼ s, s2 ¼ 1, and

The final P line follows Pbecause if for all corresponding

than ACF due to construction of integral images and more

TABLE 2 [16] Y. Weiss and E. Adelson, “A Unified Mixture Framework for

You might also like