Real-time GPU-based Face Detection in HD Video Sequences

David Oro†   Carles Fernández†   Javier Rodríguez Saeta†   Xavier Martorell‡   Javier Hernando‡

† Herta Security, Barcelona, Spain
‡ Universitat Politècnica de Catalunya, Barcelona, Spain

Abstract

Modern GPUs have evolved into fully programmable parallel stream multiprocessors. Due to the nature of the graphic workloads, computer vision algorithms are in a good position to leverage the computing power of these devices. An interesting problem that greatly benefits from parallelism is face detection. This paper presents a highly optimized Haar-based face detector that works in real time over high definition videos. The proposed kernel operations exploit both coarse and fine grain parallelism for performing integral image computations and filter evaluations, thus being beneficial not only for face detection but also for other computer vision techniques. Compared to previous implementations, the experiments show that our proposal achieves a sustained throughput of 35 fps under 1080p resolutions using a sliding window with a step of one pixel.

1. Introduction

In recent years, face detection and identification technology has experienced a huge leap in terms of improved accuracy and throughput. Part of this success is due to the fact that the semiconductor industry has been able to deliver faster microprocessors clocked at higher frequencies. This increase in raw performance for single-threaded applications also has its roots in the growing complexity of the microarchitectures of general purpose processors. Techniques such as deep pipelining, out-of-order and superscalar execution, and simultaneous or speculative multithreading have enabled sustained increases in both the instruction-level parallelism and the issue width of these highly complex cores.

At the same time, computer vision algorithms have successfully exploited the increase of serial performance, carrying out hundreds of billions of operations to match specific patterns within an image. Applications ranging from classic object recognition to advanced video analytics were made possible. In particular, smart video surveillance algorithms aim at analyzing 24/7 continuous video broadcast from multiple IP CCTV cameras, thus requiring a high amount of computing power, especially when specific latency and accuracy constraints must be met. From the perspective of latency, a video surveillance system has to analyze real-time image sequences without missing any video frame. In order to guarantee SLAs and trigger the appropriate alarms, computer vision algorithms should meet a tight deadline ensuring that all required computations can sustain a frame rate of at least 25 fps.

On the other hand, recent advances in CCD and active-pixel CMOS sensors have substantially reduced the cost of deploying 1080p HDTV cameras. By using high resolution images, it is now possible to detect features of distant faces in highly crowded environments (e.g. stadiums, airports or train stations) and use such increased resolutions to effectively improve face identification algorithms.

Unfortunately, it is no longer expected that the throughput of single-threaded object detectors will dramatically increase just by executing them on the latest available CPU. The so-called power wall and the impossibility of clocking out-of-order microarchitectures at ever-increasing frequencies have been the main reasons for the major multicore shift experienced by the CPU industry.

Unlike multicore CPUs, GPU microarchitectures do not rely on big L2 and L3 caches for hiding latencies when accessing DRAM memories. They instead spend a large extension of the die area on ALUs rather than on caches, and hide memory access latencies simply by overlapping them with arithmetic computations from multiple threads. In order to exploit these throughput-oriented architectures, the programmer must explicitly expose data-level parallelism by mapping kernel functions to a collection of data records or streams.

In this work we present a massively parallel stream-based implementation of a Haar-based face detector that targets NVIDIA GPUs. The proposed parallel processing pipeline is designed from scratch for detecting faces in real time in H.264 high definition 1080p video sequences, and returning their precise location within the image for further identification. Our implementation processes HDTV video sequences at a sustained rate of 35 fps under hard runtime constraints (a sliding window with a step of one pixel and 32 different scales).
This paper is structured as follows: Section 2 discusses recent advances and state-of-the-art implementations of low-latency face detection algorithms based on the framework originally described by Viola and Jones [1]. Section 3 describes the proposed GPU-based parallel pipeline for performing face detection. The optimized implementation of this pipeline is described in two parts: Section 4 proposes an implementation of the integral image computation based on multiscan operations, and Section 5 discusses the parallel implementation strategy followed for the Haar cascade evaluation process. An intensive evaluation of these optimizations is carried out in Section 6. To conclude, Section 7 draws some final conclusions and describes future work.

2. Related work

There has been little work in the literature in recent years on real-time face detection at HDTV resolutions. Cho et al. [2] presented an FPGA-based face detection system based on Haar classifiers capable of detecting faces at 640×480 resolution at 7.5 fps. Hefenbrock et al. [3] proposed a stream-based multi-GPU implementation on 4 cards that achieved 15.2 fps. However, the integral image computation was not parallelized, and Haar features were accessed from the shared memory of each streaming multiprocessor (SM) and not from the constant memory, which is specifically designed for broadcasting values to a thread warp. Kong et al. [4] described another GPU-based implementation that offered a latency of 197 ms (0.5 fps) when detecting 48 faces at 1280×1024 resolution. Herout et al. [5] proposed a GPU-based face detector based on local rank patterns as an alternative to the commonly used Haar wavelets [6].

Finally, Sharma et al. [7] presented a working CUDA implementation that achieved a peak throughput of 19 fps at a resolution of 1280×960 pixels. They proposed a naive parallel integral image kernel that performs both row-wise and column-wise prefix sums, fetching input data from the off-chip texture memory cached in each SM. Unlike other implementations, the Haar cascade evaluation was parallelized using a fixed-size sliding window, and the input images were sequentially resized to deal with multiple scales.

In the present work we overcome the bottlenecks and limitations of the parallel integral image computation used by Sharma et al. [7]. We also propose a parallel algorithm for evaluating Haar filters that fully exploits the underlying microarchitecture of the NVIDIA GF100 core. The experimental results show that our parallel face detection framework outperforms all implementations found to date in the literature, achieving a sustained frame rate of 35 fps at a resolution of 1920×1080 while decoding H.264 video in real time.

3. Proposed face detection pipeline

In order to implement face detection, we follow a parallelization strategy that tries to simultaneously maximize GPU occupancy and minimize the amount of memory transfers between the CPU and the GPU. In addition, since the targeted GPU microarchitecture features a heterogeneous cache memory hierarchy [8], the Haar feature set and the image to be computed should be manually transferred to the appropriate cache type to achieve the highest throughput.

As shown in Figure 1, the pipeline starts from a video input. If the input is encoded using standard codecs, e.g. MPEG2 or MPEG4-AVC/H.264, it is then possible to use the NVCUVID API for decoding the video on the GPU [9]. This API exposes the on-die video decoder processor (VP) to the programmer, and allows interoperability between the hardware-decoded video frames and CUDA kernel computations.

Figure 1. Proposed pipeline for parallel face detection: the input video is decoded in hardware (H.264 decoding), and for each of the N scales the GPU performs image scaling (Is), bilinear filtering (BF), integral image generation (Iint), Haar cascade evaluation (H) and face bounding (FB) as CUDA kernels, before the results are rendered through an OpenGL texture.

The step following the decoding of a video frame is computing its integral image Iint in parallel. At this point, there are two possible implementations for the process H that evaluates the cascade of Haar filters. The first alternative scales the sliding window and requires resizing each filter accordingly, whereas the second one scales the image. As will be discussed in Section 5, the best option is rescaling images instead of filters. Each newly scaled image Is is filtered to avoid aliasing, e.g. by using the bilinear filtering fixed functions BF available in the texturing units of the GPU, and its corresponding integral image is recomputed for each scale.
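For illustration, the host-side dispatch of these per-scale stages could be sketched as follows. This is a minimal sketch only: the kernel names (scaleKernel, integralKernel, cascadeKernel), the per-scale device buffers and the stream setup are assumptions made for the example, and the bilinear filtering through the texture units and the NVCUVID decode are omitted.

#include <cuda_runtime.h>

#define NUM_SCALES 32

// Hypothetical kernels standing in for the pipeline stages described above.
__global__ void scaleKernel(const unsigned char *src, unsigned char *dst,
                            int srcW, int srcH, float factor);
__global__ void integralKernel(const unsigned char *img, unsigned int *iint,
                               int w, int h);
__global__ void cascadeKernel(const unsigned int *iint, int w, int h,
                              int *faces, int *numFaces);

// Launch the whole per-scale chain for one decoded frame. Each scale runs in
// its own CUDA stream so that kernels working on small scales can execute
// concurrently and keep the GPU occupied (see Figure 2).
void processFrame(const unsigned char *d_frame, int width, int height,
                  unsigned char **d_scaled, unsigned int **d_integral,
                  int **d_faces, int **d_numFaces, cudaStream_t *streams)
{
    float factor = 1.0f;
    for (int s = 0; s < NUM_SCALES; ++s) {
        int w = (int)(width  / factor);
        int h = (int)(height / factor);
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);

        scaleKernel   <<<grid, block, 0, streams[s]>>>(d_frame, d_scaled[s],
                                                       width, height, factor);
        integralKernel<<<grid, block, 0, streams[s]>>>(d_scaled[s], d_integral[s], w, h);
        cascadeKernel <<<grid, block, 0, streams[s]>>>(d_integral[s], w, h,
                                                       d_faces[s], d_numFaces[s]);
        factor *= 1.1f;   // scale factor of 1.1 per level, as used in Section 6
    }
    cudaDeviceSynchronize();   // wait for all scales before collecting detections
}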
Figure 2. Serial (left) vs. concurrent (right) kernel execution of the per-scale kernels (s1–s4) and the resulting GPU occupancy over time.

The last step of the pipeline returns the coordinates of detected faces to the CPU for further processing. The detected faces are bounded using a CUDA kernel FB, and these results are bypassed to the conventional OpenGL pipeline using a pixel buffer object to display the results.

All kernel computations in the GPU exploit both coarse-grain and fine-grain parallelism. Coarse-grain parallelism is achieved by simultaneously launching and executing the same kernel operation for each scaled image from the same CUDA context. Additionally, fine-grain parallelism is also exploited at thread level within each CUDA kernel. Since the Fermi microarchitecture supports concurrent kernel execution [8], the occupancy of GPU resources is maximized even for those kernels dealing with small scales, by executing them in parallel as seen in Figure 2.

4. Parallel integral image computation

Computing integral images is an expensive task within the face detection process. There exists literature on how to optimize this task via parallelization under both stream-based and multi-core processors [10, 11, 12]. However, most existing work relies on the observation made by Messom et al. [11] that integral images can be computed using standard exclusive prefix sum operations followed by matrix transpositions. Since this algorithm is meant to be executed on GPUs, it is possible to preserve data locality in on-die caches by performing row-wise prefix sums and two matrix transpositions, as opposed to the row-wise and column-wise prefix sum operations initially proposed by Sharma et al. [7].

The prefix sum or scan is a data-parallel primitive applied to a given stream, in which each element is generated by summing the elements up to its index. A naive parallel implementation computes the prefix sum in log2 n steps, by first summing in parallel all pairs of contiguous elements to produce n/2 additions, and recursively summing the produced pairs until a final single sum remains [13]. This naive implementation performs O(n log2 n) total additions, whereas a single-threaded CPU implementation would only require O(n) operations.

To better exploit the underlying GPU microarchitecture, we take an additional effort and follow a divide-and-conquer hierarchical approach based on the Hillis-Steele algorithm [14, 15], which implements the scan at different granularities. In this implementation, input data is fetched in parallel from the GPU DRAM and then stored in the on-die shared memory of each SM before starting any computation.

Figure 3. Divide-and-conquer approach for the parallel prefix sum (scan) operation and the underlying GPU microarchitecture: a grid scan is composed of block scans, and each block scan is composed of warp scans executed by the SM cores over shared memory, backed by the L2 cache and the GPU DRAM.

As Figure 3 shows, the high-level block scan operation is composed of several warp scan primitives. The purpose of this block-wise operation is to compute the scan across a block of threads. A grid of intra-block primitives (grid scan) is then used to finish the prefix sum computation of a stream of arbitrary length. In order to exploit coarse-grain parallelism at row level, the integral image computation relies on the multiscan kernel operation. This operation carries out the scan in parallel for each row of the input image.

At the lowest level, a warp scan primitive computes the prefix sum only for the threads referenced within the warp. As depicted in Figure 4, the warp scan primitive performs the prefix sum operation in parallel for an input vector of 32 elements in log2 32 steps. For each step, a subset of threads in the warp performs a partial sum. At the end of the 5th step, the output of the algorithm is the prefix sum of the 32-element vector.

In the Fermi core, SIMT instructions perform the same computation across a warp or group of 32 threads and simultaneously store the results in registers without needing additional thread barrier instructions [8]. For this reason, each step of the warp scan primitive can be implemented using a sequence of ld.shared.u32, add.u32 and st.shared.u32 SIMT instructions. The parallel access pattern depicted in Figure 4 ensures memory coalescing, thus maximizing both bandwidth and throughput.
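In CUDA C, this warp scan primitive can be sketched as a generic Hillis-Steele scan over shared memory, in the spirit of [13, 15]. The sketch below mirrors the load/add/store pattern and the Fermi-era warp-synchronous execution described above; it is illustrative rather than the authors' exact kernel (on current GPUs one would use warp shuffle intrinsics and explicit __syncwarp() instead).

// Inclusive Hillis-Steele warp scan over shared memory.
// "volatile" prevents the compiler from caching shared-memory values in
// registers between the implicitly lockstep SIMT steps (Fermi-era pattern).
__device__ unsigned int warpScanInclusive(unsigned int value,
                                          volatile unsigned int *sdata)
{
    const unsigned int lane = threadIdx.x & 31;   // thread index inside its warp
    const unsigned int idx  = threadIdx.x;

    sdata[idx] = value;
    // log2(32) = 5 steps; at step "offset" every lane >= offset adds the
    // element located "offset" positions to its left.
    for (unsigned int offset = 1; offset < 32; offset <<= 1) {
        if (lane >= offset)
            sdata[idx] += sdata[idx - offset];
    }
    return sdata[idx];
}

// One block scans a chunk of one image row: each warp produces the inclusive
// scan of its own 32 elements. Combining the warp results into a block scan
// and a grid scan (Figure 3) is omitted here. Launch with a grid of
// (ceil(width/blockDim.x), height) blocks and blockDim.x*sizeof(unsigned int)
// bytes of dynamic shared memory.
__global__ void multiscanRowKernel(const unsigned int *in, unsigned int *out,
                                   int width)
{
    extern __shared__ unsigned int sdata[];
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int v = (col < width) ? in[row * width + col] : 0u;
    unsigned int scanned = warpScanInclusive(v, sdata);
    if (col < width)
        out[row * width + col] = scanned;   // per-warp partial prefix sums
}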
Figure 4. Parallel warp scan primitive for a 32-element input vector x: after log2 32 = 5 steps, position i holds the prefix sum of elements x0 through xi.

The integral image Iint is obtained by subsequently performing a multiscan operation, a parallel matrix transposition, a second multiscan and a second transposition:

Iint = ( multiscan( multiscan(I)^T ) )^T

Transposing the matrices is necessary to avoid column-wise scanning, so that data locality is better exploited when accessing the shared memory. Memory coalescing and bulk transfers are required to achieve the maximum bandwidth for the GPU, so a natural way of transposing matrices is to divide them into equally-sized blocks and to copy data from the off-chip DRAM to the on-die shared memories. This involves using a barrier instruction after data is transferred to the shared memory of each SM. Synchronization is required to guarantee that all threads constituting the CUDA block start with the transposition immediately after all memory transfers have concluded.

Generally, the input matrix is not a multiple of the warp width. To solve this issue, before transferring memory from the CPU address space to the GPU, the matrix is zero-padded until becoming square. Even though this padding process causes additional memory transfers, the final throughput of the transposition will be much better, since all accesses will be properly coalesced and aligned.
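As an illustration of the block-wise transposition just described, a standard tiled shared-memory transpose looks roughly as follows. This is a generic sketch of the well-known pattern with an assumed 16×16 tile (the block size that Section 6 finds to perform best), not the authors' exact kernel; the extra padding column is a common trick to avoid shared-memory bank conflicts.

#define TILE_DIM 16

// Tiled matrix transposition: each block stages a TILE_DIM x TILE_DIM tile in
// shared memory so that both the global loads and the global stores are
// coalesced. Launch with grid (ceil(width/TILE_DIM), ceil(height/TILE_DIM))
// and block (TILE_DIM, TILE_DIM).
__global__ void transposeTiled(const unsigned int *in, unsigned int *out,
                               int width, int height)
{
    __shared__ unsigned int tile[TILE_DIM][TILE_DIM + 1];  // +1 avoids bank conflicts

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    // Barrier: the whole tile must be resident in shared memory before any
    // thread of the block starts writing the transposed block.
    __syncthreads();

    int tx = blockIdx.y * TILE_DIM + threadIdx.x;   // transposed coordinates
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    if (tx < height && ty < width)
        out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];
}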
5. Parallel Haar cascade evaluation

Once the integral image of a given frame is computed, the cascade of Haar filters has to be evaluated. This step is the most resource-intensive part of the face detection algorithm, since it involves hundreds of billions of operations per image frame in HD video sequences.

There are two alternative parallelization strategies to tackle scaling during filter evaluation. The first one is to parallelize the serial algorithm proposed by Viola and Jones [1], which consists of resizing the filters to the desired scanning resolution. Unfortunately, this technique is inefficient when implemented in CUDA. Since all filters must be scaled up to the size of the sliding window, the occupancy of the CUDA cores will be extremely low if each thread evaluates the Haar filter cascade for a given sliding window position and resolution. As shown in the following equation, the number of potential simultaneous threads Nthreads quadratically decreases as the size of the sliding window W×W increases by a scale factor:

Nthreads = (Iwidth · Iheight) / (scale^2 · W^2)

Therefore, when detecting faces of 96×96 pixels in a 1920×1080 image, there are only 225 threads competing for the hardware resources. This number of threads is clearly insufficient for keeping the 512 cores of the GF100 microarchitecture busy, especially when threads stall due to data dependencies or pending load instructions. The situation is even worse for higher face resolutions.

Hence, the main objective is to create as many threads evaluating the cascade as possible. In order to meet this goal, we propose using small fixed-size sliding windows and scaling the input image instead. As a consequence, integral images need to be computed for each newly scaled image, whereas the traditional algorithm computes the integral image only once.

In order to obtain the highest performance during the evaluation of the cascade, the integral image data should be moved to the on-die shared memory before any compute-intensive operation starts. The best way to do this is using the divide-and-conquer approach, where each thread of the cooperative array transfers a portion of the image from global memory to the shared memory of the corresponding SM. The size of the memory chunks has to be at least four times larger than that of the sliding window. This is because the threads of a block (i, j) in the grid have to bring pixels from contiguous blocks to the shared memory in order to evaluate the sliding window (see Figure 5).
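The cooperative staging of integral image data into shared memory could look roughly like the sketch below, assuming a 24×24 sliding window and a 48×48 tile (four times the window area) so that windows crossing block borders can still be evaluated. Buffer names and the final window test are illustrative placeholders, not the paper's actual cascade code.

#define WIN  24                        // fixed sliding-window size (training size)
#define TILE (2 * WIN)                 // shared tile: 4x the window area (48x48)

// Each block cooperatively stages a TILE x TILE region of the integral image,
// including pixels owned by the neighboring blocks to the right and below,
// so that every sliding-window position inside the block can be evaluated.
__global__ void haarEvaluationSketch(const unsigned int *integral,
                                     int width, int height,
                                     unsigned char *hits)
{
    __shared__ unsigned int tile[TILE][TILE];

    int baseX = blockIdx.x * WIN;      // top-left corner of this block's windows
    int baseY = blockIdx.y * WIN;

    // Cooperative load: the block's threads sweep the 48x48 region in strides.
    for (int y = threadIdx.y; y < TILE; y += blockDim.y) {
        for (int x = threadIdx.x; x < TILE; x += blockDim.x) {
            int gx = min(baseX + x, width  - 1);   // clamp at the image border
            int gy = min(baseY + y, height - 1);
            tile[y][x] = integral[gy * width + gx];
        }
    }
    __syncthreads();   // the whole tile must be resident before evaluation starts

    // One thread per sliding-window position inside this block. The real Haar
    // cascade evaluation over the filters in constant memory is omitted; the
    // rectangle sum below is only a placeholder showing the integral-image lookups.
    int wx = threadIdx.x;
    int wy = threadIdx.y;
    if (wx < WIN && wy < WIN && baseX + wx + WIN < width && baseY + wy + WIN < height) {
        unsigned int sum = tile[wy + WIN][wx + WIN] - tile[wy][wx + WIN]
                         - tile[wy + WIN][wx]       + tile[wy][wx];
        hits[(baseY + wy) * width + (baseX + wx)] = (sum > 0);
    }
}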
Figure 5. Evaluation area considered by a thread in a CUDA block: the sliding window evaluated by thread (x, y) of block (i, j) may span pixels belonging to the neighboring blocks (i, j+1), (i+1, j) and (i+1, j+1).

After these memory transfers have been completed, the filters of the Haar cascade need to be evaluated. This evaluation is both a memory-intensive and an arithmetic-intensive operation. Therefore, the right approach would seem to be moving the Haar features to the on-die shared memory as well. Unfortunately, this is unfeasible for two reasons. First, the selected feature set [16] has a size of 106 KB, whereas the shared memory available in an SM is 48 KB. Second, all warp threads would stall, since each thread tries to access the same memory address (e.g. the first filter of the first stage), thus producing bank conflicts. To deal with these situations, the GF100 microarchitecture provides the constant memory, which is specifically designed for broadcasting values to all warp threads that simultaneously access the same address.

Unlike the shared memory, the constant memory is read-only and must be allocated and initialized before launching a kernel. Although the aggregated size of the constant memory is 128 KB (8 KB per SM), the CUDA programming model restricts the available size to 64 KB, so the feature set must be stored compressed. Since thresholds are encoded using double-precision floating-point numbers, they can be re-encoded in 32 bits at the cost of slightly losing precision. Similarly, not all the bits required for the coordinates, dimensions and weight values are significant. Given that the training images used for the cascade had a size of 24×24 pixels, two 16-bit unsigned integers suffice for encoding a feature.
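A compact encoding along these lines might look like the following sketch. The exact field layout, struct names and bit assignments are illustrative assumptions rather than the paper's actual format; the key points are that 6 bits suffice for any coordinate or size inside a 24×24 window, and that the feature array lives in __constant__ memory so that a warp reading the same filter benefits from the broadcast path.

#include <cuda_runtime.h>
#include <stdint.h>

#define MAX_FEATURES 2913            // size of the Lienhart frontal cascade [16]

// Illustrative compressed Haar feature: x, y, w, h each fit in 6 bits because
// the training window is only 24x24 pixels (24 < 2^6). The threshold is stored
// as a 32-bit float instead of the original double.
struct PackedHaarFeature {
    uint16_t rect0;      // x | y << 6 | spare flag bits << 12
    uint16_t rect1;      // w | h << 6 | spare flag bits << 12
    float    threshold;  // re-encoded in single precision
};

// Read-only table broadcast to all warp threads through the constant cache.
__constant__ PackedHaarFeature c_features[MAX_FEATURES];

// Host-side upload, done once before any evaluation kernel is launched.
cudaError_t uploadCascade(const PackedHaarFeature *hostFeatures, int count)
{
    return cudaMemcpyToSymbol(c_features, hostFeatures,
                              count * sizeof(PackedHaarFeature));
}

// Device-side helpers to unpack the 6-bit fields.
__device__ __forceinline__ int featX(const PackedHaarFeature &f) { return f.rect0 & 0x3F; }
__device__ __forceinline__ int featY(const PackedHaarFeature &f) { return (f.rect0 >> 6) & 0x3F; }
__device__ __forceinline__ int featW(const PackedHaarFeature &f) { return f.rect1 & 0x3F; }
__device__ __forceinline__ int featH(const PackedHaarFeature &f) { return (f.rect1 >> 6) & 0x3F; }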
6. Experimental results

In order to evaluate the performance of the previously described parallel algorithms, multiple execution tests were performed on the same computer. The selected platform featured an Intel Core i5-760 2.8 GHz quad-core CPU and an NVIDIA GTX 470 graphics card.

All GPU applications were compiled with the CUDA Toolkit 4.0 using the -O3 flag and targeted the sm_20 architecture. The underlying OS was powered by the Linux kernel 2.6.35 and compiled for the x86-64 architecture. On the other hand, GCC v4.4.5 was used for linking the final application and for building the CPU tests. It should be noted that all GPU benchmarks were performed without considering the time spent on memory transfers between the CPU and the GPU. This assumption is valid since the final GPU-based face detector starts performing computations only once the image frame is available in the off-chip GPU DRAM memory.

The cascade used for the Haar evaluation process was the frontal feature set developed by Lienhart et al. [16] and distributed in the OpenCV framework. It has 2913 filters and is organized into 25 stages. Finally, the size of the training images was 24×24 pixels.

6.1. Integral image

In order to evaluate our parallel integral image implementation, we executed the multiscan algorithm for different image sizes on both the CPU and the GPU. The CPU implementation was a single-threaded application that carried out a recursive scan in O(n) steps for an n×n matrix, whereas the GPU implementation was the same as described in Section 4. Since each row involves O(log2 n) steps, the GPU implementation is asymptotically fitted by O(n log2 n). On the other hand, input images were matrices randomly populated with unsigned 32-bit integers whose sizes ranged from 30×30 to 10000×10000 pixels.

As depicted in Figure 6, the GPU multiscan scales well with the image size. For a 100-megapixel image it executes in only 43 ms, 66 times faster than the CPU version.

Although the GPU algorithm scales well with the size of the input image, it is slower than the CPU version for small images. This effect can be clearly seen in Figure 7, where the execution time for the GPU is constant for images smaller than or equal to 100×100 pixels. The reason for this behavior is related to the overhead produced by the CUDA kernel launching process. If the image to be processed has few elements, it is not possible to hide the latencies of kernel launches with arithmetic computations. As a result, the CPU will beat the GPU for images smaller than 60×60 pixels, where the time spent by the CUDA runtime for allocating all required resources and internal data structures is higher than the time spent by the CUDA cores for computing the data stream.

The performance of the multiscan for non-square n×m matrices was also analyzed. Since kernel launches are expensive operations, in theory a multiscan of a 10000×100 matrix would yield a higher execution time than that of a 100×10000 one.
Figure 6. Execution time of the multiscan operation (GPU vs. CPU, with O(n log2 n) and O(n^2) fits; time in ms vs. image size in pixels).

Figure 7. Execution time of the multiscan (log scale for the x and y axes; time in ms vs. image size in pixels).

Also, the first approach requires more kernel launches than the second and computes less work in each kernel. As shown in Table 1, this does not behave as expected. The execution time grows slightly faster if the column size (m) is kept constant and the number of rows (n) is increased, rather than the opposite, but only while m ≤ 1000 and n ≤ 10000. This effect is amplified as n (kernel launches) grows and may be related to the implementation of the CUDA runtime.

Table 1. Multiscan execution time (ms) for the GPU

Rows (n) \ Columns (m)        1          10         100        1000       10000
1                          0.006860   0.006820   0.006780   0.006740    0.021770
10                         0.006970   0.006880   0.006860   0.007010    0.024120
100                        0.006900   0.007400   0.007290   0.014300    0.192690
1000                       0.045060   0.049580   0.051780   0.248030    3.874300
10000                      0.421470   0.470580   0.507210   2.604200   40.690800

In addition to the multiscan, the matrix transposition was also evaluated. Due to the nature of our GPU parallel implementation, the optimal width of the CUDA cooperative thread array must be experimentally determined in order to achieve the maximum performance. Since the proposed face detection application should work at HDTV resolutions, we determined the block size yielding the best results for resolutions ranging from 1280×720 up to 2048×2048 pixels. Furthermore, the transposition algorithm was deliberately modified with the purpose of analyzing the effect of both coalesced and uncoalesced memory accesses on the GF100 microarchitecture.

As Figure 8 shows, the lowest execution time is obtained for blocks of 16×16 threads when performing coalesced memory accesses. The reason for this is related to the number of bank ports available in the on-die shared memory.

Figure 8. Matrix transposition for different CUDA block sizes (1×1 up to 32×32, coalesced and uncoalesced accesses, for 1280×720, 1920×1080 and 2048×2048 images; time in ms).

Finally, the parallel algorithms for multiscan and transposition are sequentially combined to obtain the n×m integral image. The performance of the whole process has been compared against a sequential O(n·m) CPU implementation, which does not perform any scan or transposition for computing the integral image. It is worth noting that sequentially adding each element of the matrix while keeping the accumulated sum in a variable avoids unnecessary loop iterations. Both implementations were tested with padded images ranging from 256×256 up to 8192×8192 pixels.

The obtained speedup for the GPU is not as high as expected if we take into account the results of each intermediate step of the parallel implementation (see Figure 9). On average, the GPU implementation is 2.5 times faster than the CPU. However, this does not happen with images smaller than 256×256 pixels; at these low resolutions, the CPU is on average 15% faster than the GPU. Again, the reason behind this behavior is that when the random values of the small input matrices are created from CPU registers, they may never leave the on-die L2 and L3 caches, and may even fit in the L1 data cache. Nevertheless, the approach of caching the complete working set is not sustainable for high resolutions, and even though the CPU still benefits from aggressive memory prefetching, the GPU speedup grows with the size of the input image.
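For reference, the sequential baseline described above (a single pass that keeps a running row sum and reuses the previously completed row, with no scans or transpositions) is essentially the textbook implementation sketched below; this is a generic illustration, not the authors' benchmarking code.

#include <stdint.h>

// Sequential O(n*m) integral image: one pass over the matrix, accumulating the
// current row in a scalar and adding the integral value of the row above.
void integralImageCPU(const uint32_t *in, uint32_t *out, int rows, int cols)
{
    for (int y = 0; y < rows; ++y) {
        uint32_t rowSum = 0;                       // accumulated sum of row y so far
        for (int x = 0; x < cols; ++x) {
            rowSum += in[y * cols + x];
            uint32_t above = (y > 0) ? out[(y - 1) * cols + x] : 0u;
            out[y * cols + x] = rowSum + above;    // integral value at (x, y)
        }
    }
}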
Figure 9. Integral image execution time (GPU vs. CPU, for image dimensions from 256×256 up to 8192×8192; time in ms).

To conclude, the average latency obtained for computing the integral image at 1920×1080 pixels on the GPU was 2.3 ms, where 0.61 ms were spent on each multiscan operation and 0.54 ms on each matrix transposition.

6.2. Optimal block size for filter evaluation

Prior to the performance evaluation of the parallelized Haar cascade filtering, the optimal block size has to be estimated. This size is also used for computing the dimensions of the memory chunks that each thread has to bring to the shared memory.

In this way, the occupancy of the GPU is computed by an analytical model that takes into account register and shared memory usage and the number of threads that constitute a block for a given kernel. By following this approach, we experimentally determined the GPU occupancy for the Haar evaluation kernel. As Figure 10 depicts, there are multiple M×N thread-block combinations that achieve the maximum occupancy for this code. However, since the sliding window should be square and greater than or equal to 24×24 elements, the only alternative is to use 28 · 28 = 784 threads per block.

Figure 10. GPU occupancy for blocks of threads of different sizes M×N (e.g. 16×32, 24×32, 23×23, 28×28 and 32×16 reach 100% occupancy, whereas 16×16 reaches 83.33% and 32×32 only 66.67%).

6.3. Face detection

Once the optimal block size was selected, several benchmarks were conducted in order to characterize the latency of the face detection algorithm. These tests were executed on the same computer as that used in Section 6.1 and combine both the integral image computation and the Haar cascade process. In addition, the final algorithm carries out face detection by progressively downscaling the input image frame 32 times, and simultaneously launching the CUDA kernels for each one of the 32 considered scales in parallel.

We were not able to find any publicly available databases of high definition videos for evaluating the performance of the final face detection system. The considered benchmarks are instead a collection of high definition H.264 movie trailers and music clips carefully selected for stressing the system, i.e. containing a large number of faces. All these videos were downloaded from the YouTube website and feature a resolution of 1920×1080 pixels.

Figure 11 shows the face detection execution time at each frame of two selected videos presenting different characteristics within the chosen collection. The first one (Shakira – Waka Waka) features panoramic views of highly crowded soccer stadiums, reaching hundreds of faces in some frames. On the other hand, the second video (Andreea Balan – Trippin') peaks at 6 simultaneous faces at most. Both videos present very frequent transitions and parts with sudden motion, and the scales of faces roughly range from 30 up to 1000 pixels vertically. Even though the face detection slows down the video playback and violates the 40 ms deadline (25 fps) on a few occasions, most of the time the obtained throughput is close to 40 fps.

The detection performance of the algorithm is exactly equivalent to that shown by the detector in [16] when specifying the same tuning parameters. That includes a step size of one pixel for the sliding window, and a total of 32 scale reductions using a scale factor of 1.1.

7. Conclusions and future work

We have presented a highly optimized parallel implementation of a Haar-based face detector, which analyzes high-definition 1080p video at a sustained rate of 35 fps for generic sequences containing multiple faces. Our face detection implementation is, to the best of our knowledge, the first one processing real-time HDTV video with a sliding window of dense step size (1 pixel).
Figure 11. Execution time for two different HD music videos. Top: Shakira – Waka Waka (1920×1080 @ 24 fps, MPEG4 AVC/H.264, 3714 kbps); bottom: Andreea Balan – Trippin' (1920×1080 @ 25 fps, MPEG4 AVC/H.264, 3520 kbps). Both panels plot the face detection elapsed time (ms) per video frame.

The efficient implementation of the integral image paradigm via row multiscan is not restricted to Viola-Jones face detection. It directly benefits a wide variety of computer vision techniques, including Randomized Decision Trees and Forests and methods based on more complex descriptors like HOG and SURF. Moreover, we showed that the combination of a smart usage of the on-die caches, fixed-size sliding windows and image scaling based on texture fetches maximizes the GPU occupancy, thus increasing the evaluation throughput of Haar filter cascades.

Further steps include an efficient implementation of the non-maxima suppression process. In addition, a previous stage for image enhancement is required in order to overcome challenging lighting conditions.

Acknowledgements

This work has been partially supported by the European Commission in the context of the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the Spanish Ministry of Science (TIN2007-60625, TEC2010-21040-C02-01), the Generalitat de Catalunya (2009-SGR-980), and Herta Security.

References

[1] P. Viola and M. Jones. Robust Real-time Object Detection. International Journal of Computer Vision, 57(2):137–154, 2002.
[2] J. Cho, S. Mirzaei, J. Oberg, and R. Kastner. FPGA-based Face Detection System Using Haar Classifiers. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 103–112. ACM, 2009.
[3] D. Hefenbrock, J. Oberg, N. Thanh, R. Kastner, and S.B. Baden. Accelerating Viola-Jones Face Detection to FPGA-level Using GPUs. In 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pages 11–18. IEEE, 2010.
[4] J. Kong and Y. Deng. GPU Accelerated Face Detection. In International Conference on Intelligent Control and Information Processing, pages 584–588. IEEE, 2010.
[5] A. Herout, R. Josth, R. Juranek, J. Havel, M. Hradis, and P. Zemcik. Real-time Object Detection on CUDA. Journal of Real-Time Image Processing, pages 1–12, 2010.
[6] M. Hradis, A. Herout, and P. Zemcik. Local Rank Patterns: Novel Features for Rapid Object Detection. Computer Vision and Graphics, pages 239–248, 2009.
[7] B. Sharma, R. Thota, N. Vydyanathan, and A. Kale. Towards a Robust, Real-time Face Processing System Using CUDA-enabled GPUs. In International Conference on High Performance Computing, pages 368–377. IEEE, 2009.
[8] C.M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU Architecture. IEEE Micro, 31(2):50–59, 2011.
[9] F. Jargstorff and E. Young. CUDA Video Decoder API. NVIDIA Corporation, 2008.
[10] B. Bilgic, B.K.P. Horn, and I. Masaki. Efficient Integral Image Computation on the GPU. In Intelligent Vehicles Symposium, pages 528–533. IEEE, 2010.
[11] C.H. Messom and A.L. Barczack. High Precision GPU based Integral Images for Moment Invariant Image Processing Systems. Electronics and New Zealand Conference, 2008.
[12] N. Zhang. Working Towards Efficient Parallel Computing of Integral Images on Multi-core Processors. In International Conference on Computer Engineering and Technology, volume 2, pages V2-30. IEEE, 2010.
[13] M. Harris, S. Sengupta, and J.D. Owens. Parallel Prefix Sum (Scan) with CUDA. In GPU Gems 3, pages 851–876. Addison-Wesley, 2007.
[14] W.D. Hillis and G.L. Steele. Data Parallel Algorithms. Communications of the ACM, 29(12):1170–1183, 1986.
[15] S. Sengupta, M. Harris, and M. Garland. Efficient Parallel Scan Algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, 2008.
[16] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. In Proceedings of the International Conference on Image Processing, volume 1, pages 900–903. IEEE, 2002.