Real-time obstacle detection using range images: processing dynamically-sized sliding windows on a GPU

Caio César Teodoro Mendes*, Fernando Santos Osório, Denis Fernando Wolf
Mobile Robotics Laboratory, University of São Paulo (USP), Av. Trabalhador São-Carlense, 400, P.O. Box 668, 13.560-970 São Carlos, Brazil
Abstract
An efficient obstacle detection technique is required so that navigating robots can avoid obstacles and potential hazards. This task is usually simplified by relying on structural patterns. However, obstacle detection constitutes a challenging problem in unstructured unknown environments, where such patterns may not exist. Talukder et al. (2002) successfully derived a method to deal with such environments. Nevertheless, the method has a high computational cost, and researchers who employ it usually rely on approximations to achieve real-time performance. We hypothesize that the computing time of the method can be significantly reduced by using a Graphics Processing Unit (GPU). Throughout the implementation process, we developed a general framework for processing dynamically-sized sliding windows on a GPU. The framework can be applied to other problems that require similar computation. Experiments were performed with a stereo camera and an RGB-D sensor, where the GPU implementations were compared to multi-core and single-core CPU implementations. The results show a significant gain in computational performance, e.g. in a particular instance, a GPU implementation is almost 90 times faster than a single-core one.
Keywords: Obstacle detection, Autonomous navigation, Stereo vision, Graphics processing unit (GPU)
difference in the height and slope between them. The method has several advantages over the ones previously mentioned: it provides and applies a clear point-wise definition of obstacle; it enables a number of post-processing steps, such as clustering and temporal integration; and, in practice, it can handle rough terrain. It was successfully employed in a long autonomous navigation experiment in which few geometric assumptions could be made [8]. In practice, its only drawback is its high computational cost; it may take the method seconds to detect obstacles in a practical-sized range image. Researchers usually rely on approximations [8, 9, 10] so that it can perform within a reasonable time setting.

Our main goal is to significantly reduce the processing time of the method; more specifically, to achieve real-time obstacle detection in practical-sized range images without using approximations. We hypothesize that the method of Talukder et al. could benefit from a parallel implementation using a Graphics Processing Unit (GPU). Unlike a CPU, which is mainly a serial processor, a GPU has a parallel structure that makes it especially suitable for graphics-related processing. Despite starting as a fixed-function unit, recent changes in its architectural design have increased the flexibility of its previously rigid pipeline. Such changes have enabled the exploration of its computational power for massively data-parallel (i.e. same task on different pieces of data) applications. Scientists have recognized this potential and have been applying it to speed up scientific computations.

Performance and accessibility were some of the factors that contributed towards the choice of a GPU. Modern GPUs are widely available in the market and their processing capacity can exceed that of high-end CPUs by an order of magnitude. However, to benefit from this potential, the programmer must take into account the ar-

2. Obstacle Detection

Obstacle detection methods that operate on range images may work in two different domains: the range image itself or the corresponding point cloud. The latter can be generated by transforming the range image based on the intrinsic parameters of the sensor. A point cloud that can be indexed by its projection (i.e. range image) index space is called an organized point cloud. A range image indexed by u and v returns a single value d corresponding to the depth, i.e. I_{u,v} = (d_{u,v}), where I is the range image. An organized point cloud can be indexed by the same u and v and returns three values x, y and z, usually in real-world units, i.e. P_{u,v} = (x_{u,v}, y_{u,v}, z_{u,v}), where P is the point cloud. Here we will assume that both domains are available and that the point cloud is indexed by the same index space as the range image.

By extending the intuition shown in Fig. 1 to three-dimensional space, [7] proposed a point-wise obstacle definition for organized point clouds, where p1 = (x1, y1, z1) and p2 = (x2, y2, z2) are called compatible and considered obstacles if they satisfy the following conditions:

1. H_T < |y2 − y1| < H_max;

2. |y2 − y1| / ||p2 − p1|| > sin(θ_T);

where H_T, H_max and θ_T are empirically determined constants.

The first condition refers to the difference in height between the two points, i.e. it checks the significance of the height difference. The second regards the angle between the line segment containing the two points and the ground plane, i.e. it checks the significance of the slope. Parameters H_T and θ_T correspond to the thresholds used for determining whether the height difference and slope are respectively significant. Parameter H_max plays a more subtle role, as it checks whether the two points are connected. Imagine
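As an illustration, the organized point cloud construction and the point-wise definition above can be sketched in a few lines of Python (the paper's implementations use GPU kernels; the pinhole intrinsics fx, fy, cx, cy and the threshold values below are hypothetical placeholders, not the authors' parameters):

```python
import math

def back_project(I, fx, fy, cx, cy):
    """Turn a range image I (list of rows of depths) into an organized
    point cloud P sharing the same (u, v) index space, assuming a simple
    pinhole camera model with illustrative intrinsics fx, fy, cx, cy."""
    P = []
    for v, row in enumerate(I):
        P.append([])
        for u, d in enumerate(row):
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            z = d
            P[v].append((x, y, z))
    return P

def compatible(p1, p2, H_T=0.10, H_max=0.50, theta_T=math.radians(45)):
    """Point-wise obstacle test of Talukder et al. [7]:
    1. H_T < |y2 - y1| < H_max (significant but connected height gap);
    2. |y2 - y1| / ||p2 - p1|| > sin(theta_T) (significant slope)."""
    dy = abs(p2[1] - p1[1])
    dist = math.dist(p1, p2)
    if dist == 0.0:
        return False
    return H_T < dy < H_max and dy / dist > math.sin(theta_T)

# A vertical 0.3 m step is flagged; two points on flat ground are not.
# compatible((0.0, 0.0, 2.0), (0.0, 0.3, 2.0)) -> True
# compatible((0.0, 0.0, 2.0), (0.5, 0.0, 2.0)) -> False
```

In a full detector, `compatible` would be evaluated for each pivot point against its neighbors inside a search window whose size depends on the pivot's depth, which is what makes the workload dynamically sized.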
[Figure: geometric illustration of the obstacle definition — image plane, axes x, y and z, and parameters H_T, H_max and θ_T]

[Figure: a range image with dynamically-sized windows assigned to cores 1 to N]

[Figure: work-items and work-groups 1 to N accessing memory]
Figure 12: Mean processing times for each device and implementation obtained using the stereo log. Only the projected CPU implementations are presented.

Figure 13: Squared scaled mean processing times for each device and implementation obtained using the Kinect log. Only the projected CPU implementations are presented.
Table IV and Fig. 13 show that the processing times were higher for the Kinect sensor. There are two main reasons for this result: the point cloud is four times larger, and the indoor environment tends to have closer points (smaller z) and thus generates larger search areas (window sizes). The CPU single-core implementation was almost 27 times slower than the equivalent one on stereo data, while the 7950 Kernel 3 GPU implementation scaled much better, i.e. it was only 3.6 times slower, and provided a speedup of 87 relative to single-core. A larger gap was observed between the two tested GPUs and, again, no significant difference was found between kernels.

We finally provide a setup-independent measure of computational cost, showing, for each device, the number of points and processing times in Fig. 14, where each dot of the graph represents a frame and the number of points of a frame was calculated according to:

n_p = Σ_{p∈V} w(p) h(p)    (7)

where V is the set containing all valid pivot points. Table II shows the number of points processed per millisecond (ms), calculated using the mean points per ms of each frame. Both table and chart show only the implementations that yielded the same results as the "full" one, i.e. single-core, multi-core and Kernel 3. Such data provide setup-independent results, which can serve as a reference for other researchers in the field to check whether our implementation and devices can meet their demands.

Table II: Mean number of points per ms computed by each device

Device            Points per ms   Speedup
CPU Single-core   2.72 × 10^5     1
CPU Multi-core    1.42 × 10^6     5.2
Radeon 5850       2.70 × 10^6     9.9
Radeon 7950       1.95 × 10^7     71.9

The GPU devices refer to the Kernel 3 implementation.

5.3. Kernel Analysis

While processing times alone enable the assessment of relative speedups, they provide few clues on how efficient the implemented kernels are in absolute terms. Although we started attending to some of the GPU performance recommendations in Kernel 1, this was not the case in subsequent kernels. Coalesced memory operations were compromised, as Kernel 2 and Kernel 3 do not necessarily access adjacent memory addresses within a wavefront, which may result in idle Arithmetic Logic Units (ALUs) while work-items wait for fetch operations.

The AMD APP Profiler enabled the gathering of information on the performance of each kernel. It organizes this information in 'performance counters', each of which reveals a different aspect of the kernel. Our focus is on two of them:

VALUBusy: percentage of GPU time spent in vector ALU operations;

MemUnitBusy: percentage of GPU time the memory unit is active.

These counters can reveal whether a kernel is compute- or memory-bound. We tested only the AMD Radeon 7950 in the Kinect log, since the results generally apply to our kernels and this GPU is the only one that provides such counters.

Table III shows the performance counters for each kernel in the Kinect log with the AMD Radeon 7950. There is little room for memory access improvements as the ALUs are
kept significantly busy, i.e. over 80% in all cases, and no significant variation in the counters was found across kernels, probably because, although memory accesses are less efficient in Kernel 2 and Kernel 3, their ALU/fetch instruction ratio is higher, i.e. they have more ALU instructions. These instructions hide the latency of the inefficient fetches.

Table III: Performance counters for each kernel in the Kinect log with the AMD Radeon 7950

Kernel     Counter       Value in %
Kernel 1   VALUBusy      82.06
           MemUnitBusy   95.10
Kernel 2   VALUBusy      82.98
           MemUnitBusy   94.18
Kernel 3   VALUBusy      80.14
           MemUnitBusy   91.20

6. Discussion

The GPU implementation has successfully provided a speedup for the method and enabled real-time obstacle detection for both sensors and scenarios. We were especially interested in cases where a single-core implementation provides a prohibitive computational time, as in our Kinect setup. Fortunately, our implementation could accelerate the processing frequency from ∼0.26 Hz (single-core) to ∼23 Hz (Kernel 3 and Radeon 7950). Concerning the stereo setup, the small granularity (window sizes) and amount of data penalized the GPU implementations.

A crucial aspect of our approach is that no approximations were employed, differently from other works. In [10], the authors used a conditional reduction of the point cloud resolution, in which sub-sampling is performed until an obstacle is found. A similar approach was used in [8], where the image is divided into horizontal segments and sub-sampled according to the distance represented by such segments. In [9], a set of discrete parameters was employed. Although these approaches are justifiable for noisy and low-resolution range images (e.g. the ones from our stereo setup), they assume that obstacles are formed by a reasonable set of points, which is not necessarily true, and they can conceal obstacles and reduce the detection accuracy. This is especially the case for relatively high-resolution range images, where the sub-sampling rate would have to be high, and for precise sensors, which provide pixel-wise reliable measurements.

Although computation times are not directly comparable due to different hardware, parameters and test scenarios, we could achieve the second lowest computational time stated in the literature without using approximations (Table V). It is also worth mentioning that the approximation techniques employed by other authors can be used in conjunction with our approach and yield even higher speedups.

Table V: Comparison of computation times found in the literature

Approach                  Resolution  Proc. time
Sub-sampling [10]         640×480     200 ms
Sub-sampling [8]          500×320     25 ms
Discrete parameters [9]   640×480     62 ms
GPU (Ours)                640×480     43 ms

The processing times are not directly comparable due to variations in hardware, environment and parameters; therefore they should serve only as a reference.

7. Conclusions and Future Work

This paper has presented a GPU-based parallel version of an obstacle detection method. Proposed in [7], the method has been widely used; however, researchers have to deal with its main limitation, its computational cost.

One of the main challenges in adapting code for parallel execution on a GPU is the need for a proper distribution of the workload among multiple processors/compute units. Memory access is also a critical point and must be performed carefully to avoid potential bottlenecks. Our target algorithm requires the use of dynamically-sized windows, which makes parallelization even more complex.

We formalized the GPU constraints as an optimization problem and derived an approximation for it. Satisfactory results were achieved, i.e. the Kernel 3 implementation was almost 90 times faster than the single-core one with the use of the Radeon 7950 in the Kinect log. An important aspect of the GPU option is that the CPU becomes available to perform other tasks, being responsible only for reading and writing buffers.

The proposed solution for processing dynamically-sized sliding windows on a GPU can be seen as a general framework and applied to other tasks that require similar computation, benefiting a wide range of applications.

As future work, we aim at using our implementation along with a clustering method and a filter based on machine learning techniques to minimize noise and improve the classification results.

References

[1] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, Y. LeCun, Learning long-range vision for autonomous off-road driving, Journal of Field Robotics 26 (2) (2009) 120–144.
[2] K. Konolige, M. Agrawal, M. R. Blas, R. C. Bolles, B. Gerkey, J. Solà, A. Sundaresan, Mapping, navigation, and learning for off-road traversal, Journal of Field Robotics 26 (1).
[3] A. Broggi, C. Caraffi, R. Fedriga, P. Grisleri, Obstacle detection with stereo vision for off-road vehicle navigation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2005, p. 65.
[Figure 14: Number of points versus processing time (ms, log scale) per frame for each implementation: CPU Single-core, CPU Multi-core, Radeon 5850 and Radeon 7950]
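For reference, the per-frame point count of Eq. (7) and the points-per-ms figures of Table II amount to the following computation. This is a minimal Python sketch; the window-size functions w and h are stand-ins for the depth-dependent window sizes used by the method, and the fixed 9×9 window is purely illustrative:

```python
def points_per_frame(valid_pivots, w, h):
    """Eq. (7): n_p = sum over all valid pivot points p of w(p) * h(p),
    i.e. the total number of points examined in one frame."""
    return sum(w(p) * h(p) for p in valid_pivots)

def points_per_ms(n_p, frame_time_ms):
    """Setup-independent throughput: points examined per millisecond."""
    return n_p / frame_time_ms

# Illustrative only: a fixed 9x9 window at every pivot of a 640x480 image.
pivots = [(u, v) for v in range(480) for u in range(640)]
n_p = points_per_frame(pivots, w=lambda p: 9, h=lambda p: 9)
print(n_p)  # 640 * 480 * 81 = 24883200
```

Dividing n_p by a frame's processing time in ms gives the device-comparable throughput reported in Table II.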
Table IV: Processing times for each different combination of log, device and implementation
Log Device Implementation Mean time Max. time Min. time Speedup
Stereo i7 3770 Single-core 139 ms 240 ms 105 ms 1
Multi-core 31 ms 54 ms 21 ms 4.5
Radeon 5850 Kernel 1 32 ms 41 ms 23 ms 4.5
Kernel 2 36 ms 46 ms 27 ms 3.9
Kernel 3 29 ms 42 ms 24 ms 4.8
Radeon 7950 Kernel 1 13 ms 19 ms 9 ms 10.7
Kernel 2 12 ms 18 ms 8 ms 11.6
Kernel 3 12 ms 17 ms 8 ms 11.6
Kinect i7 3770 Single-core 3764 ms 9487 ms 1479 ms 1
Multi-core 776 ms 2230 ms 287 ms 4.8
Radeon 5850 Kernel 1 160 ms 179 ms 98 ms 23.5
Kernel 2 176 ms 199 ms 82 ms 21.4
Kernel 3 371 ms 798 ms 176 ms 10.1
Radeon 7950 Kernel 1 30 ms 53 ms 20 ms 125.5
Kernel 2 27 ms 42 ms 16 ms 139.4
Kernel 3 43 ms 93 ms 22 ms 87.5
Speedup relative to the single-core implementation of each log
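The speedup column of Table IV is simply the ratio of mean processing times against the single-core baseline of the same log; a trivial check in Python, using the Kinect-log means from the table:

```python
def speedup(baseline_ms, time_ms):
    """Speedup of an implementation relative to the single-core
    baseline of the same log (Table IV convention)."""
    return baseline_ms / time_ms

# Kinect log: single-core baseline is 3764 ms.
print(f"{speedup(3764, 43):.1f}")   # Radeon 7950, Kernel 3 -> 87.5
print(f"{speedup(3764, 160):.1f}")  # Radeon 5850, Kernel 1 -> 23.5
```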
[4] C. Caraffi, S. Cattani, P. Grisleri, Off-road path and obstacle detection using decision networks and stereo vision, IEEE Transactions on Intelligent Transportation Systems 8 (4) (2007) 607–618.
[5] J. Kolter, M. Rodgers, A. Ng, A control architecture for quadruped locomotion over rough terrain, in: IEEE International Conference on Robotics and Automation, 2008, pp. 811–818.
[6] S. Lacroix, A. Mallet, D. Bonnafous, G. Bauzil, S. Fleury, M. Herrb, R. Chatila, Autonomous rover navigation on unknown terrains functions and integration, in: Experimental Robotics VII, Vol. 271 of Lecture Notes in Control and Information Sciences, Springer Berlin Heidelberg, 2001, pp. 501–510.
[7] A. Talukder, R. Manduchi, A. Rankin, L. Matthies, Fast and reliable obstacle detection and segmentation for cross-country navigation, in: IEEE Intelligent Vehicles Symposium, 2002, pp. 610–618.
[8] A. Broggi, M. Buzzoni, M. Felisa, P. Zani, Stereo obstacle detection in challenging environments: The VIAC experience, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 1599–1604.
[9] W. van der Mark, J. van den Heuvel, F. Groen, Stereo based obstacle detection with uncertainty in rough terrain, in: IEEE Intelligent Vehicles Symposium, 2007, pp. 1005–1012.
[10] P. Santana, P. Santos, L. Correia, J. Barata, Cross-country obstacle detection: Space-variant resolution and outliers removal, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2008, pp. 1836–1841.
[11] C. C. T. Mendes, F. S. Osorio, D. F. Wolf, An efficient obstacle detection approach for organized point clouds, in: IEEE Intelligent Vehicles Symposium (IV), 2013, pp. 1203–1208. doi:10.1109/IVS.2013.6629630.
[12] R. Manduchi, A. Castano, A. Talukder, L. Matthies, Obstacle detection and terrain classification for autonomous off-road navigation, Autonomous Robots 18 (1) (2005) 81–102.
[13] NVIDIA, OpenCL Programming Guide for the CUDA Architecture (May 2010). URL http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf
[14] S. Xiao, W.-c. Feng, Inter-block GPU communication via fast barrier synchronization, in: IEEE International Symposium on Parallel Distributed Processing, 2010, pp. 1–12.
[15] AMD, AMD Accelerated Parallel Processing OpenCL Programming Guide (July 2013). URL http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[16] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, 2005, pp. 807–814.