Real-time GPU-based Face Detection in HD Video Sequences
David Oro † Carles Fernández † Javier Rodríguez Saeta † Xavier Martorell ‡ Javier Hernando ‡
† Herta Security, Barcelona, Spain
‡ Universitat Politècnica de Catalunya, Barcelona, Spain
[email protected] [email protected] [email protected]
[Figure residue: pipeline stage labels: CUDA kernels; Haar cascade evaluation (H1, H2, ..., HN); Face bounding process (FB1, FB2, ..., FBN); OpenGL texture]

detecting 48 faces at 1280×1024 resolution. Herout et al. [5] proposed a GPU-based face detector based on local rank patterns as an alternative to the commonly used Haar wavelets [6].

Finally, Sharma et al. [7] presented a working CUDA implementation that achieved a peak throughput of 19 fps under a resolution of 1280×960 pixels. They proposed a naive parallel integral image kernel to perform both row-wise and
[Figure residue: GPU occupancy versus time; kernel labels (s1), (s3), (s4)]

GPU, and its corresponding integral image is recomputed for each scale.

The last step of the pipeline returns the coordinates of detected faces to the CPU for further processing. The detected faces are bounded using a CUDA kernel FB, and these results are forwarded to the conventional OpenGL pipeline using a pixel buffer object to display the results.

All kernel computations in the GPU exploit both coarse-grain and fine-grain parallelism. Coarse-grain parallelism is achieved by simultaneously launching and executing the same kernel operation for each scaled image from the same CUDA context. Additionally, fine-grain parallelism is exploited at thread level within each CUDA kernel. Since the Fermi microarchitecture supports concurrent kernel execution [8], the occupancy of GPU resources is maximized even for those kernels dealing with small scales, by executing them in parallel as seen in Figure 2.

4. Parallel integral image computation

Computing integral images is an expensive task within the face detection process. There exists literature on how to optimize this task via parallelization under both stream-based and multi-core processors [10, 11, 12]. However, most existing work relies on the observation made by Messom et al. [11] that integral images can be computed using standard exclusive prefix sum operations followed by matrix transpositions. Since this algorithm is meant to be executed on GPUs, it is possible to preserve data locality in on-die caches by performing row-wise prefix sums and two matrix transpositions, as opposed to the row-wise and column-wise prefix sum operations initially proposed by Sharma et al. [7].

The prefix sum or scan is a data-parallel primitive applied to a given stream, in which each element is generated by summing the elements up to its index. A naive parallel implementation computes the prefix sum in log2 n steps, by first summing in parallel all pairs of contiguous elements to produce n/2 additions, and recursively summing the produced pairs until a final single sum remains [13]. This naive implementation performs O(n log2 n) total additions, whereas a single-threaded CPU implementation would only require O(n) operations.

To better exploit the underlying GPU microarchitecture, we make an additional effort and follow a divide-and-conquer hierarchical approach based on the Hillis-Steele algorithm [14, 15], which implements the scan at different granularities. In this implementation, input data is fetched in parallel from the GPU DRAM and then stored in the on-die shared memory of each SM before starting any computation.

As Figure 3 shows, the high-level block scan operation is composed of several warp scan primitives. The purpose of this block-wise operation is to compute the scan across a block of threads. A grid of intra-block primitives (grid scan) is then used to finish the prefix sum computation of a stream of arbitrary length. In order to exploit coarse-grain parallelism at row level, the integral image computation relies on the multiscan kernel operation. This operation carries out the scan in parallel for each row of the input image.

Figure 3. Divide-and-conquer approach for the parallel prefix sum (scan) operation and the underlying GPU microarchitecture [recovered labels: grid scan, block scan, warp; shared memory, core, L2 Cache, GPU DRAM]

At the lowest level, a warp scan primitive computes the prefix sum only for the threads referenced within the warp. As depicted in Figure 4, the warp scan primitive performs the prefix sum operation in parallel for an input vector of 32 elements in log2 32 = 5 steps. For each step, a subset of threads in the warp performs a partial sum. At the end of the 5th step, the output of the algorithm will be the prefix sum for the 32-element vector.

In the Fermi core, SIMT instructions perform the same computation across a warp or group of 32 threads and simultaneously store the results in registers without needing additional thread barrier instructions [8]. For this reason, each step of the warp scan primitive can be implemented using a sequence of ld.shared.u32, add.u32 and st.shared.u32 SIMT instructions. The parallel access pattern depicted in Figure 4 ensures memory coalescing, thus maximizing both bandwidth and throughput.

Figure 4. Parallel warp scan primitive for a 32-element input vector x
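As a rough illustration of the warp scan primitive in Figure 4, the following Python sketch emulates the 32 lanes of a warp sequentially. It is neither CUDA code nor the authors' kernel: the function name `warp_scan` and the snapshot list used to mimic lock-step SIMT execution are assumptions made for this example. It computes an inclusive scan, as drawn in Figure 4, and also counts the steps and additions performed.

```python
WARP_SIZE = 32  # threads per warp on Fermi, as stated in the text


def warp_scan(values):
    """Inclusive Hillis-Steele scan over one warp-sized vector.

    Returns (prefix_sums, steps, additions). Hypothetical helper that
    emulates lock-step SIMT execution on the CPU.
    """
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one warp of input values")
    data = list(values)
    steps = 0
    additions = 0
    offset = 1
    while offset < WARP_SIZE:
        # Snapshot the previous step so that every lane reads old values,
        # mimicking all lanes loading before any lane stores.
        previous = list(data)
        for lane in range(offset, WARP_SIZE):
            # Only lanes with index >= offset take part in this step.
            data[lane] = previous[lane] + previous[lane - offset]
            additions += 1
        offset *= 2
        steps += 1
    return data, steps, additions
```

Calling `warp_scan([1] * 32)` yields the values 1 to 32 after 5 steps and 129 additions. A serial scan of the same vector needs 31 additions, which reflects the O(n log2 n) versus O(n) work discussed in this section; the parallel version trades extra additions for a logarithmic number of steps.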
The integral image I_int is obtained by successively performing a multiscan operation, a parallel matrix transposition, a second multiscan and a second transposition:

$$I_{int} = \left(\mathrm{multiscan}\left(\left(\mathrm{multiscan}(I)\right)^{T}\right)\right)^{T}$$

Transposing the matrices is necessary to avoid column-wise scanning, so that data locality is better exploited when accessing the shared memory. Memory coalescing and bulk transfers are required to achieve the maximum bandwidth for the GPU, so a natural way of transposing matrices is to divide them into equally-sized blocks and to copy data from the off-chip DRAM to the on-die shared memories. This involves using a barrier instruction after data is transferred to the shared memory of each SM. Synchronization is required to guarantee that all threads constituting the CUDA block start with the transposition immediately after all memory transfers have concluded.

Generally, the dimensions of the input matrix are not multiples of the warp width. To solve this issue, before transferring memory from the CPU address space to the GPU, the matrix is zero-padded until becoming square. Even though this padding process causes additional memory transfers, the final throughput of the transposition will be much better, since all accesses will be properly coalesced and aligned.

5. Parallel Haar cascade evaluation

Once the integral image of a given frame is computed, the cascade of Haar filters has to be evaluated. This step is the most resource-intensive part of the face detection algorithm, since it involves hundreds of billions of operations per image frame in HD video sequences.

There are two alternative parallelization strategies to tackle scaling during filter evaluation. The first one is to parallelize the serial algorithm proposed by Viola and Jones [1], which consists of resizing the filters to the desired scanning resolution. Unfortunately, this technique is inefficient when implemented in CUDA. Since all filters must be scaled up to the size of the sliding window, the occupancy of the CUDA cores will be extremely low if each thread evaluates the Haar filter cascade for a given sliding window position and resolution. As shown in the following equation, the number of potential simultaneous threads N_threads decreases quadratically as the size of the sliding window W×W increases by a scale factor:

$$N_{threads} = \frac{I_{width} \cdot I_{height}}{scale^{2} \cdot W^{2}}$$

Therefore, when detecting faces of 96×96 pixels in a 1920×1080 image, there are only 225 threads (1920 · 1080 / 96^2 = 225) competing for the hardware resources. This number of threads is clearly insufficient for keeping the 512 cores of the GF100 microarchitecture busy, especially when threads stall due to data dependencies or pending load instructions. The situation is even worse for faces at higher resolutions.

Hence, the main objective is to create as many threads evaluating the cascade as possible. In order to meet this goal, we propose using small fixed-size sliding windows and scaling the input image instead. As a consequence, integral images need to be computed for each newly scaled image, whereas the traditional algorithm computes it only once.

In order to obtain the highest performance during the evaluation of the cascade, the integral image data should be moved to the on-die shared memory before any computing-intensive operation starts. The best way to do this is the divide-and-conquer approach, where each thread from the cooperative array transfers a portion of the image from the global memory to the shared memory of the corresponding SM. The size of memory chunks has to be at least four times
larger than that of the sliding window. This is because the threads of a block (i, j) in the grid have to bring pixels from contiguous blocks to the shared memory in order to evaluate the sliding window, see Figure 5.

Figure 5. Evaluation area considered by a thread in a CUDA block [recovered labels: block (i, j), block (i, j+1), block (i+1, j), block (i+1, j+1), thread x,y]

After these memory transfers have been completed, the filters of the Haar cascade need to be evaluated. This evaluation is both a memory-intensive and an arithmetic-intensive operation. Therefore, the right approach should involve moving the Haar features to the on-die shared memory. Unfortunately, this is infeasible for two reasons. First, the selected feature set [16] has a size of 106 KB and the available shared memory in an SM is 48 KB. Second, all warp threads will stall, since each thread will try to access the same memory address (e.g. the first filter of the first stage), thus producing bank conflicts. To deal with these situations, the GF100 microarchitecture provides the constant memory, which is specifically designed for broadcasting values to all warp threads that simultaneously access the same address.

Unlike the shared memory, the constant memory is read-only and must be allocated and initialized before launching a kernel. Although the aggregated size of the constant memory is 128 KB (8 KB per SM), the CUDA programming model restricts the available size to 64 KB, so the feature set must be stored compressed. Since thresholds are encoded using double precision floating-point numbers, they can be re-encoded in 32 bits at the cost of slightly losing precision. Similarly, not all the bits required for the coordinates, dimensions and weight values are significant. Given that the training images used for the cascade had a size of 24×24 pixels, two 16-bit unsigned integers suffice for encoding a feature.

6. Experimental results

In order to evaluate the performance of the previously described parallel algorithms, multiple execution tests were performed on the same computer. The selected platform featured an Intel Core i5-760 2.8 GHz quad-core CPU and an NVIDIA GTX 470 graphics card.

All GPU applications were compiled for the CUDA Toolkit 4.0 with the -O3 flag and targeted the sm_20 architecture. The underlying OS was powered by the Linux Kernel 2.6.35 and compiled for the x86-64 architecture. GCC 4.4.5 was used for linking the final application and for building the CPU tests. It should be noted that all GPU benchmarks were performed without considering the time spent on memory transfers between the CPU and the GPU. This assumption is valid since the final GPU-based face detector will start performing computations only once the image frame is available in the off-chip GPU DRAM memory.

The cascade used for the Haar evaluation process was the frontal feature set developed by Lienhart et al. [16] and distributed in the OpenCV framework. It has 2913 filters and is organized into 25 stages. Finally, the size of the training images was 24×24 pixels.

6.1. Integral image

In order to evaluate our parallel integral image implementation, we executed the multiscan algorithm for different image sizes on both CPU and GPU. The CPU implementation was a single-threaded application that carried out a recursive scan in O(n) steps for an n×n matrix, whereas the GPU implementation was the same as described in Section 4. Since each row involves O(log2 n) steps, the GPU implementation is asymptotically fitted by O(n log2 n). The input images were matrices randomly populated with unsigned 32-bit integers, ranging from 30×30 to 10^4×10^4 pixels.

As depicted in Figure 6, the GPU multiscan scales well with the image size. For a 100 megapixel image it is executed in only 43 ms, 66 times faster than the CPU version.

Figure 6. Execution time of the multiscan operation [axes: Time (ms) versus Image Size (pixels); legend: O(n log2 n) fit, GPU, O(n^2) fit, CPU]

Although the GPU algorithm scales well with the size of the input image, it is slower than the CPU version for small images. This effect can be clearly seen in Figure 7, where the execution time for the GPU is constant for images smaller than or equal to 100×100 pixels. The reason for this behavior is the overhead of the CUDA kernel launching process. If the image to be processed has few elements, it is not possible to hide the latencies of kernel launches with arithmetic computations. As a result, the CPU will beat the GPU for images smaller than 60×60 pixels, where the time spent by the CUDA runtime allocating all required resources and internal data structures is higher than the time spent by the CUDA cores computing the data stream.

Figure 7. Execution time of multiscan (log scale for x and y axis) [axes: Time (ms) versus Image Size (pixels); legend: GPU, CPU]

The performance of the multiscan for non-square n×m matrices was also analyzed. Since kernel launches are expensive operations, in theory a multiscan of a 10000×100 matrix would yield a higher execution time than that of a 100×10000 matrix. Also, the first approach requires more kernel launches than the second and computes less work in each kernel. As shown in Table 1, this does not behave as expected. The execution time grows slightly faster if the column size (m) is kept constant and the number of rows (n) increased, rather than the opposite, but only while

[Figure residue: Time (ms) for 1280x720, 1920x1080 and 2048x2048 inputs, each with and without coalescing]
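For reference, the operation timed in this section is the composition described in Section 4: a row-wise multiscan, a transposition, a second multiscan and a second transposition. The sketch below is a sequential Python illustration under stated assumptions, not the benchmarked CPU or GPU code: the function names are hypothetical, the scans are inclusive, and no zero-padding is applied. A brute-force reference is included only to check the result.

```python
from itertools import accumulate


def multiscan(matrix):
    """Inclusive prefix sum of every row (rows are independent)."""
    return [list(accumulate(row)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def integral_image(image):
    """Row scan, transpose, row scan, transpose."""
    return transpose(multiscan(transpose(multiscan(image))))


def integral_image_direct(image):
    """Brute-force reference: sum of all pixels above and to the left."""
    rows, cols = len(image), len(image[0])
    return [[sum(image[r][c] for r in range(y + 1) for c in range(x + 1))
             for x in range(cols)]
            for y in range(rows)]
```

For the 2×3 input `[[1, 2, 3], [4, 5, 6]]`, `integral_image` returns `[[1, 3, 6], [5, 12, 21]]`, which matches the brute-force reference. Because every row is scanned independently, the two `multiscan` calls are the part that maps onto one scan per row on the GPU, and the transpositions turn the column pass into a second row pass.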
even fit in the L1 data cache. Nevertheless, the approach of caching the complete working set is not sustainable for high resolutions, and even though the CPU still benefits from aggressive memory prefetching, the GPU speedup grows with the size of the input image.

Figure 9. Integral image execution time [axis: Time (ms); legend: GPU, CPU]

To conclude, the average latency obtained for computing the integral image at 1920×1080 pixels on the GPU was 2.3 ms, where 0.61 ms were spent for each multiscan operation and 0.54 ms for each matrix transposition (2 · 0.61 + 2 · 0.54 = 2.3 ms).

6.2. Optimal block size for filter evaluation

Prior to the performance evaluation of the parallelized Haar cascade filtering, the optimal block size has to be estimated. This size is also used for computing the dimensions of the memory chunks that each thread has to bring to the shared memory.

The occupancy of the GPU is computed by an analytical model that takes into account register and shared memory usage and the number of threads that constitute a block for a given kernel. By following this approach, we experimentally determined the GPU occupancy for the Haar evaluation kernel. As Figure 10 depicts, there are multiple combinations that achieve the maximum occupancy for this code. However, since the sliding window should be square and greater than or equal to 24×24 elements, the only alternative is to use 28 · 28 = 784 threads per block.

[Figure 10: GPU occupancy (%) versus block dimensions M and N. Labelled points (M, N, occupancy): (16, 32, 100), (24, 32, 100), (23, 23, 100), (28, 28, 100), (16, 16, 83.33), (32, 32, 66.67), (32, 16, 100)]

6.3. Face detection

Once the optimal block size was selected, several benchmarks were conducted in order to characterize the latency of the face detection algorithm. These tests were executed on the same computer as that used in Section 6.1 and combine both the integral image computation and the Haar cascade process. In addition, the final algorithm carries out face detection by progressively downscaling the input image frame 32 times, and simultaneously launching the CUDA kernels for each one of the 32 considered scales in parallel.

We were not able to find any publicly available databases of high definition videos for evaluating the performance of the final face detection system. The considered benchmarks are instead a collection of high definition H.264 movie trailers and music clips carefully selected for stressing the system, i.e. containing a large number of faces. All these videos were downloaded from the YouTube website and feature a resolution of 1920×1080 pixels.

Figure 11 shows the face detection execution time at each frame of two selected videos presenting different characteristics within the chosen collection. The first one (Shakira – Waka Waka) features panoramic views in highly crowded soccer stadiums, reaching hundreds of faces in some frames. On the other hand, the second video (Andreea Balan – Trippin') peaks at 6 simultaneous faces at most. Both videos present very frequent transitions and parts with sudden motion, and the scales of faces roughly range from 30 up to 1000 pixels vertically. Even though the face detection slows down the video playback and violates the 40 ms deadline (25 fps) on a few occasions, most of the time the obtained throughput is close to 40 fps.

The detection performance of the algorithm is exactly equivalent to that shown by the detector in [16] when specifying the same tuning parameters. That includes a step size of one pixel for the sliding window, and a total of 32 scale reductions using a scale factor of 1.1.

7. Conclusions and future work

We have presented a highly optimized parallel implementation of a Haar-based face detector, which analyzes high-definition 1080p video at a sustained rate of 35 fps for generic sequences containing multiple faces. Our face detection implementation is, to the best of our knowledge, the first one processing real-time HDTV video with a sliding
window of dense step size (1 pixel).

The efficient implementation of the integral image paradigm via row multiscan is not restricted to Viola-Jones face detection. It directly benefits a wide variety of computer vision techniques, including Randomized Decision Trees and Forests and methods based on more complex descriptors like HOG and SURF. Moreover, we proved that the combination of a smart usage of the on-die caches, fixed-size sliding windows and image scaling based on texture fetches maximizes the GPU occupancy, thus increasing the evaluation throughput of Haar filter cascades.

Further steps include an efficient implementation of the non-maxima suppression process. In addition, a preliminary stage for image enhancement is required in order to overcome challenging lighting conditions.

Figure 11. Execution time for two different HD music videos [axes: Face Detection Elapsed Time (ms) versus Video Frame; recovered panel title: [Shakira - Waka Waka] 1920x1080@24 fps MPEG4 AVC/H.264 3714 Kbps]

Acknowledgements

This work has been partially supported by the European Commission in the context of the HiPEAC-2 Network of Excellence (FP7/ICT 217068), the Spanish Ministry of Science (TIN2007-60625, TEC2010-21040-C02-01), the Generalitat de Catalunya (2009-SGR-980), and Herta Security.

References

[5] A. Herout, R. Josth, R. Juranek, J. Havel, M. Hradis, and P. Zemcik. Real-time Object Detection on CUDA. Journal of Real-Time Image Processing, pages 1–12, 2010.
[6] M. Hradis, A. Herout, and P. Zemcik. Local Rank Patterns: Novel Features for Rapid Object Detection. Computer Vision and Graphics, pages 239–248, 2009.
[7] B. Sharma, R. Thota, N. Vydyanathan, and A. Kale. Towards a Robust, Real-time Face Processing System Using CUDA-enabled GPUs. In International Conference on High Performance Computing, pages 368–377. IEEE, 2009.
[8] C.M. Wittenbrink, E. Kilgariff, and A. Prabhu. Fermi GF100 GPU Architecture. IEEE Micro, 31(2):50–59, 2011.
[9] F. Jargstorff and E. Young. CUDA Video Decoder API. NVIDIA Corporation, 2008.
[10] B. Bilgic, B.K.P. Horn, and I. Masaki. Efficient Integral Image Computation on the GPU. In Intelligent Vehicles Symposium, pages 528–533. IEEE, 2010.
[11] C.H. Messom and A.L. Barczak. High Precision GPU based Integral Images for Moment Invariant Image Processing Systems. Electronics New Zealand Conference, 2008.
[12] N. Zhang. Working Towards Efficient Parallel Computing of Integral Images on Multi-core Processors. In International Conference on Computer Engineering and Technology, volume 2, pages V2–30. IEEE, 2010.
[13] M. Harris, S. Sengupta, and J.D. Owens. Parallel Prefix Sum (scan) with CUDA. In GPU Gems 3, pages 851–876. Addison Wesley, 2007.
[14] W.D. Hillis and G.L. Steele. Data Parallel Algorithms. Communications of the ACM, 29(12):1170–1183, 1986.
[15] S. Sengupta, M. Harris, and M. Garland. Efficient Parallel Scan Algorithms for GPUs. NVIDIA Technical Report NVR-2008-003, 2008.
[16] R. Lienhart and J. Maydt. An Extended Set of Haar-like Features For Rapid Object Detection. In Proceedings of the International Conference on Image Processing, volume 1, pages 900–903. IEEE, 2002.