
Article in Robotica · March 2015
DOI: 10.1017/S0263574714002914
Real-time obstacle detection using range images: processing dynamically-sized
sliding windows on a GPU

Caio César Teodoro Mendes∗, Fernando Santos Osório, Denis Fernando Wolf
Mobile Robotics Laboratory, University of São Paulo (USP), Av. Trabalhador São-Carlense, 400, P.O. Box 668, 13.560-970 São Carlos, Brazil

Abstract

An efficient obstacle detection technique is required so that navigating robots can avoid obstacles and potential hazards. This task is usually simplified by relying on structural patterns. However, obstacle detection constitutes a challenging problem in unstructured unknown environments, where such patterns may not exist. Talukder et al. (2002) successfully derived a method to deal with such environments. Nevertheless, the method has a high computational cost, and researchers who employ it usually rely on approximations to achieve real-time performance. We hypothesize that by using a Graphics Processing Unit (GPU) the computing time of the method can be significantly reduced. Throughout the implementation process, we developed a general framework for processing dynamically-sized sliding windows on a GPU. The framework can be applied to other problems that require similar computation. Experiments were performed with a stereo camera and an RGB-D sensor, where the GPU implementations were compared to multi-core and single-core CPU implementations. The results show a significant gain in computational performance, e.g. in a particular instance, the GPU implementation is almost 90 times faster than the single-core one.

Keywords: Obstacle detection, Autonomous navigation, Stereo vision, Graphics processing unit (GPU)

1. Introduction

A mobile robot with autonomous navigation capabilities can serve many practical purposes. It can be a vehicle that drives itself carrying passengers and goods or an agricultural machine that frees individuals from tedious and dangerous tasks. To accomplish such a useful feat, robots must sense the environment through sensors, detect traversable regions and move through them precisely and efficiently. However, autonomous navigation involves several technical challenges, which range from dealing with intrinsically noisy sensors to correctly identifying road signs.

A crucial component of an autonomous navigation system is the obstacle detection module. Regarding range images, a popular approach is to approximate the ground surface using a planar model [1, 2]. Once the model parameters have been estimated, the distance from any point to the model can be calculated and, based on a threshold, points can be distinguished as either parts of the ground or obstacles. Although reasonable for indoor environments, this approach is not suitable for outdoor ones, where one may expect to find curved and hilly regions.

Another popular approach [3, 4] relies on a transformation of the disparity/range image called the V-disparity map, which is essentially a lateral projection of the range image where the longitudinal road profile is expected to appear as a line segment. With the road longitudinal profile extracted, obstacles are pixels that do not belong to the road profile. This approach can handle curved road longitudinal profiles, but the transversal profile is expected to be flat; otherwise, the road profile will appear blurred in the V-disparity map. This significantly compromises the profile extraction and hence the obstacle detection. A camera roll angle relative to the ground surface will exert the same effect even on roads with a flat transversal profile.

To circumvent such problems, some authors [5, 6] create a digital elevation map (DEM), which is a grid where each cell corresponds to a part of the terrain. These approaches tend to be computationally costly and do not clearly answer the question: "what is an obstacle?". Rather, they tend to apply a number of heuristics (e.g. standard deviation of heights, average slope in x) to create a cost map and usually end up overfitting the robot setup (robot size and shape, sensor characteristics, etc.).

What is an obstacle? It is something that prevents the progress or passage of some movable entity. Regarding wheeled robots and range images, we have the intuitive notion that an obstacle is a rapid and significant change in the surface height, as illustrated in Fig. 1.

Figure 1: 2D illustration of the intuitive notion of an obstacle. Scenario a): a rapid, but not significant change in the terrain height; Scenario b): a significant, but not rapid change in height; Scenario c): a rapid and significant change, thus an obstacle.

Talukder et al. [7] proposed an obstacle definition and detection method for range images using precisely this notion. According to the authors, for two 3D points to be considered an obstacle, there should be a significant difference in the height and slope between them. The method has several advantages over the ones previously mentioned: it provides and applies a clear point-wise definition of obstacle, enables a number of post-processing steps, such as clustering and temporal integration, and, in practice, can handle rough terrain. It was successfully employed in a long autonomous navigation experiment, in which few geometric assumptions could be made [8]. In practice, its only drawback is its high computational cost; the method may take seconds to detect obstacles in a practical-sized range image. Researchers usually rely on approximations [8, 9, 10], so it can perform within a reasonable time setting.

Our main goal is to significantly reduce the processing time of the method, more specifically, to achieve real-time obstacle detection in practical-sized range images without using approximations. We hypothesize that the method of Talukder et al. could benefit from a parallel implementation using a Graphics Processing Unit (GPU). Unlike a CPU, which is mainly a serial processor, a GPU has a parallel structure that makes it especially suitable for graphics-related processing. Despite starting as a fixed-function unit, recent changes in its architectural design have increased the flexibility of its previously rigid pipeline. Such changes have enabled the exploration of its computational power for massive data-parallel (i.e. same task on different pieces of data) applications. Scientists have recognized this potential and have been applying it to speed up scientific computations.

Performance and accessibility were some of the factors that contributed towards the choice of a GPU. Modern GPUs are widely available in the market and their processing capacity can exceed that of high-end CPUs by an order of magnitude. However, to benefit from this potential, the programmer must take into account the architecture and workflow of the GPU to implement the target application.

This article describes and evaluates single-core, multi-core and GPU implementations of the Talukder et al. method, sharing general and specific optimization techniques. It also addresses some issues related to the implementation suggested by the original authors of the method. The study is not concerned with the method's detection quality, but with its proper CPU implementation and the possible speedup due to the use of a GPU. The article is an extension of a previous conference paper [11].

The paper is organized as follows: Section 2 presents the Talukder et al. obstacle detection method; Section 3 provides an overview of general-purpose computing using GPUs; the implementations are detailed in Section 4; processing times are presented in Section 5; Section 6 discusses the results and related work; finally, Section 7 draws the conclusions and suggests future work.

∗ Corresponding author. Tel.: +55 1936514080
Email addresses: [email protected] (Caio César Teodoro Mendes), [email protected] (Fernando Santos Osório), [email protected] (Denis Fernando Wolf)

Preprint submitted to Robotica, November 27, 2014

2. Obstacle Detection

Obstacle detection methods that apply to range images may work in two different domains: the range image itself or the corresponding point cloud. The latter can be generated by transforming the range image based on the intrinsic parameters of the sensor. A point cloud that can be indexed by its projection (i.e. range image) index space is called an organized point cloud. A range image indexed by u and v returns a single value d corresponding to the depth, i.e. I(u,v) = (d(u,v)), where I is the range image. An organized point cloud can be indexed by the same u and v and returns three values x, y and z, usually in real-world units, i.e. P(u,v) = (x(u,v), y(u,v), z(u,v)), where P is the point cloud. Here we will assume that both domains are available and that the point cloud is indexed by the same index space as the range image.

By extending the intuition shown in Fig. 1 to a three-dimensional space, [7] proposed a point-wise obstacle definition for organized point clouds, where p1 = (x1, y1, z1) and p2 = (x2, y2, z2) are called compatible and considered obstacles if they satisfy the following conditions:

1. HT < |y2 − y1| < Hmax;
2. |y2 − y1| / ||p2 − p1|| > sin(θT);

where HT, Hmax and θT are empirically determined constants.

The first condition refers to the difference in height between the two points, i.e. it checks the significance of the height difference. The second regards the angle between the line segment containing the two points and the ground plane, i.e. it checks the significance of the slope. Parameters HT and θT correspond to the thresholds used for determining whether the height difference and slope are respectively significant. Parameter Hmax plays a more subtle role, as it checks whether the two points are connected. Imagine
a point cloud referent to a city intersection, where there are some points referent to a traffic light and, right below it, some points referent to the road. Between the points of the traffic light and the road there is a large height difference and a high slope, but they do not represent obstacles because nothing connects them, and parameter Hmax prevents this mistake.

Given a pivot point p, its effective search area for compatible points resembles an upside-down cone, as shown in Fig. 2. A straightforward way to implement the method would be to compare every point with every other point, which would result in a high computational cost and an O(N²) complexity, where N is the number of points. Assuming that the coordinate system of the camera and the point cloud are aligned (projection assumption), the computational cost can be reduced by projecting the search area to a region of the reference image plane, taking into account parameters Hmax, θT, and the focal length f (in pixels) of the camera. The search area for compatible points can be projected to a triangle on the image plane, as shown in Fig. 3. The height of the triangle is equal to f·Hmax/p_z and the opening angle of the triangle is equal to 180 − 2θT degrees. Now the complexity is reduced to O(N·Kmean), where N is the number of points and Kmean is the mean area of the triangle projections onto the image plane.

Figure 2: Illustration of a search for compatible points. Orange points are compatible and blue ones are not, regarding the pivot point (red point), adapted from [7].

Figure 3: Projection of the search area for a triangle in the image plane, adapted from [7].

This method assumes that the coordinate system of the point cloud is aligned with the ground plane (method assumption), i.e. the y-axis of the point cloud is aligned with the ground plane height. For the most part, this assumption may be false. While navigating, the vehicle undergoes pitch and roll motions. Such motions may affect the camera coordinate system, which may affect the point cloud coordinate system. A way to tackle this issue is to connect the camera to a pan-and-tilt system that compensates for the vehicle motions based on the feedback of some inertial sensor. Another approach is to use the same feedback to virtually tilt the point cloud. While avoiding the need for a physical pan-and-tilt system, this approach violates the projection assumption (that the coordinate system of the camera and the point cloud are aligned).

In practice, there are four manners to handle the method and projection assumptions:

1. Ignore the method assumption while maintaining the projection one;
2. Use a physical pan-and-tilt system to compensate for the vehicle motions, maintaining both assumptions;
3. Virtually align the point cloud and ignore the projection assumption;
4. Virtually align the point cloud and derive a projection that accounts for the misalignment between the camera and the point cloud coordinate system.

The first option is the most popular and has been employed by the original authors and [8, 10]. It is also worth mentioning that, by reading the original authors' papers [7, 12], one may infer that by increasing the projection size the method assumption would be somehow alleviated. Enlarging the projected size only makes sense if the point cloud is virtually aligned with the ground and the projection accounts for it (option 4), which apparently was not the case in [7, 12]. However, we acknowledge it is a natural assumption since we made it ourselves in [11].

We have also chosen option 1. In practice, the method is robust and yields great results even when the "method assumption" is violated, as the results from [8, 10, 7, 12] suggest. We also have the benefit of the projection assumption but, as the original authors, rather than projecting a triangle or trapezoid we employ a rectangular region (see Fig. 3) according to:

h(p) = ⌈f·(Hmax − HT) / z⌉,   (1)

w(p) = ⌈2f·tan(90 − θT)·Hmax / z⌉,   (2)

where h is the function that returns the height of the projected rectangle for a point p and w is the function that returns its width. We believe that a rectangular region provides the best trade-off between computational complexity and the number of unnecessarily processed points.

As previously discussed, there is no need to create projections larger than the ones calculated by these functions. We provide evidence that this is the case in Section 5.
3. General-purpose Computing on Graphics Processing Units

This section presents a brief overview of the architecture of modern GPUs and concepts related to General-Purpose computing on GPUs (GPGPU). The OpenCL terminology and abstraction layer were used to provide a device-independent overview, and the CUDA counterparts of the main concepts are provided in parentheses.

While CPUs rely on a few sophisticated processors, GPUs employ numerous (hundreds of) simplified processors that operate at relatively low frequencies. This architecture is motivated by its main application, i.e. graphics-related processing, which consists in applying the same operation over and over to different chunks of data.

As shown in Fig. 4, a GPU is divided into compute units (streaming multiprocessors). Each compute unit consists of several processing elements (scalar cores) that can each perform a work-item (thread). Work-items within a compute unit can communicate via local memory (shared memory) and synchronize. Work-items from different compute units can communicate only through the global memory (or device memory) and there is no explicit way of synchronizing them on the GPU. Global synchronization may be achieved either via the CPU or implicitly on the GPU by handcrafted strategies [14].

Figure 4: Simplified GPU architecture consisting of N compute units with M processors each, adapted from [13].

Work-item is a fundamental concept in GPU programming. It represents an instance of a kernel, which is a function (as in programming) specified by the developer and running on a processing element. Work-items are mapped onto an n-dimensional index space called NDRange (grid), used by the GPU to schedule the execution of work-items. The processing is finished when all elements of the NDRange have been processed. Work-items can be grouped into work-groups; all work-items belonging to the same group will be performed by a single compute unit, therefore they can share local memory and synchronize.

The concept of wavefront (warp) must also be introduced. Instead of scheduling single work-items for execution, the GPU schedules wavefronts. A wavefront refers to a part of a work-group that shares the same control unit, therefore its instructions are executed strictly in parallel (lock-step).

3.1. Considerations on Performance

Some performance issues must be addressed when developing for a GPU. This section provides an overview of the most significant and general performance considerations.

Hiding memory access latency is one of the major challenges in GPGPU. When work-items within a wavefront access global memory, some patterns produce a coalesced memory operation. Such an operation uses the full bandwidth of the memory subsystem. These patterns and the way they use the bandwidth are device-dependent. However, the general idea is that multiple fetch operations can be translated to a single 'wide' fetch by the GPU, i.e. at most 16 fetch operations (half a wavefront) can be replaced with a single fetch on NVidia GPUs. A simple and widely used pattern that produces a coalesced operation in most cases is the use of adjacent work-items to access adjacent memory addresses within a wavefront.

Another technique to hide memory access latency is the use of the local memory (shared memory). Since local memory has a much lower latency than the global one and does not depend on coalesced operations to be efficient, its use can yield a significant performance boost. This is especially the case when there is data reuse within a work-group.

Each device has its wavefront size, which is usually 32 for NVidia GPUs and 64 for AMD ones, i.e. if we do not want part of our wavefront idle, we need a work-group size that is a multiple of the wavefront size. Since wavefronts are processed in lock-step, no flow deviations should occur within one. If a flow deviation occurs within a wavefront, the processing time will be equivalent to the whole wavefront processing all possible paths.

Here is a summary of what we believe to be the most significant optimization recommendations:

a) Select a work-group size that is a multiple of the wavefront size to avoid idle work-items;
b) Avoid flow deviations within the wavefront;
c) When accessing global memory, use a pattern that yields a coalesced memory operation, e.g. adjacent work-items accessing adjacent memory addresses;
d) Use local memory to hide global memory access latency.
4. Implementations and Optimization

The main limitation of the obstacle detection method proposed by Talukder et al. is its computational cost. Even with the search area projection, the processing time for a point cloud can be too long for several applications. Therefore, the original method has been parallelized to efficiently utilize GPUs and CPUs.

The computation of the target method employing projection is similar to the computation of an image convolution. Both cases involve a sliding window guided by a pivot point. In an image convolution the window is displaced around the pivot, i.e. the pivot is the center of the window, whereas in the Talukder et al. method the window is displaced above the pivot, as shown in Fig. 5. The range image is used only for indexing purposes and the computation itself is performed in its referent point cloud domain (i.e. 3D Euclidean space). A major difference is that image convolutions are usually performed with a fixed window size, whereas the target method uses a variable window size that depends on the z coordinate of the pivot. This difference is what makes a GPU implementation of the Talukder et al. method challenging.

Figure 5: Computation of the Talukder et al. method: the pivot point (p1) is checked against every element of its respective search area (window) and the result is written as a boolean on an output matrix, i.e. 1 for obstacle and 0 for non-obstacle.

4.1. CPU Implementations

We started by implementing the "full" version of the method, that is, without considering the projection. This implementation consists in checking the two conditions of the method for every pair of points, i.e. N² times, where N is the number of points. This implementation will serve as a reference to validate the others.

A natural scheme for implementing the "projected" version of the target method is the use of four nested loops, which can also be employed for image convolution. The two outer loops correspond to the pivot line and column and the inner ones are related to the window. The ranges of the outer loops correspond to the data (point cloud) size and the ranges of the inner ones refer to the window size. We implemented the CPU single-core version of the method based on this idea and the window sizes were calculated by Eqs. (1) and (2).

The multi-core parallelization was performed with the Open Multi-Processing (OpenMP)¹ library, which provides a directive that automatically parallelizes and distributes a loop between multiple cores using threads. The multi-core implementation was developed by inserting such a directive before the outermost loop. This loop refers to the pivot lines. As a result, the lines will be processed in parallel using threads (Fig. 6). By default, OpenMP uses static scheduling to divide the workload (in this case, the pivot lines) across threads, i.e. the workload of each thread is determined before the loop execution. However, as each pivot line takes a different processing time to finish due to the dynamic nature of the window sizes, some threads will finish their workload sooner than others, which results in idle threads and the under-use of the available processing power. That being the case, we changed the schedule type to dynamic, where the workload of each thread is determined on-the-fly: each new chunk of pivot lines is assigned to a thread as it finishes its workload. The size of each chunk, i.e. the number of pivot lines assigned each time, can also be defined, but we chose to leave that option to be automatically determined by the library. Unless otherwise specified, the number of threads created by the OpenMP library is determined automatically based on the number of threads the CPU can execute in parallel (for the Intel Core i7-3770 this number is 8).

Figure 6: Multi-core implementation. The pivot lines are processed in parallel by N CPU cores and the search areas of the first and last pivot are colored in light blue.

¹ http://www.openmp.org/
² http://gcc.gnu.org/

The CPU implementations are based on the C++ language and compiled with g++² using general and CPU-specific (instruction sets and cache size) optimization flags. Unlike a
convolution, the projected version does not require padded data; instead, the inner loop ranges can be calculated and limited (if they are larger than the image) as soon as the pivot has been fetched.

4.2. GPU Implementations

The OpenCL specification was employed for the GPU-based implementation due to its interoperability across GPU manufacturers. The OpenCL specification comprises an Application Programming Interface (API) and a programming language similar to the C language. The API is used to communicate with the GPU whereas the language is used to develop the kernel that effectively runs on the GPU.

In a GPU implementation, constraints and trade-offs must be dealt with, as GPUs rely on systematic branch/memory schemes to be efficient. We started with a simple kernel meeting most of these constraints and increased its complexity as necessary. Three different kernels were implemented in the process.

Initially we reorganized the data, as shown in Fig. 7. This change enabled a wavefront to access memory addresses sequentially. Unlike the CPU implementation, we padded the data with large negative values. To avoid copying the padding to the GPU every time, we write the padded buffer once and in later write calls we employ the function clEnqueueWriteBufferRect, writing only useful data.

Figure 7: Redistribution of coordinates in memory so that a wavefront accesses contiguous memory addresses (from array of structures to structure of arrays); the memory accesses are represented by arrows and colored according to the time of the event.

A simple approach to distribute the workload is to process a window (search area) per work-group (Fig. 8). Within a work-group each work-item could process a point; however, in practice, the maximum size of a work-group is easily reached: a search area of 20×20 results in a work-group with 400 work-items, above the 256 limit of some GPUs (e.g. AMD Radeon 5850).

Figure 8: Each search area is computed by a work-group.

This limit can be circumvented by increasing the workload of the work-items: instead of checking the compatibility of a single point, each work-item checks the compatibility of a column of points. For example, in a window of 20×20 points, each work-item must check the compatibility of 20 points and the size of a work-group is reduced from 400 to 20.

A limitation of this approach is that the size of a work-group is fixed, but the window size is not. Moreover, the size of a wavefront in GPUs of the HD79xx series is 64, which means that a work-group smaller than 64 would result in a critical under-utilization of the GPU. A naive solution would be to use only windows of size 64×64 and ignore the proper window size, as given by Eqs. (1) and (2). The advantage of this approach is its simple kernel with few things to compute and only one fixed-size for loop, which can be unrolled. The whole wavefront also accesses memory addresses sequentially. However, since the proper window sizes are ignored, in some cases the window will be larger than necessary, processing unnecessary points, and in other cases smaller, missing compatible points. We will call this implementation Kernel 1 for further reference.

A partial solution to avoid this unnecessary workload and maintain a simple kernel has been the use of three fixed window sizes: 16, 32 and 64. The lateral size of each window is computed according to Eq. (3), where s is the function that returns the side size of a window, h is from Eq. (1) and w is from Eq. (2).

s(p) = 16 if max{w(p), h(p)} ≤ 16
       32 if 16 < max{w(p), h(p)} ≤ 32   (3)
       64 if 32 < max{w(p), h(p)}

This function is computed once for each work-group. Although the size of the window is variable, the work-group size of 64 was kept to match the wavefront size. In a 64×64 window there are 64² points to check; in this case, the initial scheme is maintained and each work-item computes the compatibility of a column of 64 points.

The points of windows sized 32×32 (32² points) and 16×16 (16² points) must be adequately distributed among a group of 64 work-items (Fig. 9). Each work-item processes 16 points within a 32×32 window and 4 points within a 16×16 one. A fast way to determine the points that should be computed by a work-item is to use the "?" operator, since it does not cause flow divergence [15]. A
Figure 9: Distribution of 16×16 and 32×32 search windows within a work-group of size 64. Each color represents the workload of a work-item.

noteworthy fact is that 64 points are processed in parallel, which is consistent with a wavefront of size 64. However, the whole wavefront is no longer accessing contiguous memory addresses. We will call this implementation Kernel 2 for further reference.

Although this solution minimizes part of the unnecessary processing relative to the Kernel 1 solution, it does not address cases in which max{w(p), h(p)} > 64 and misses compatible points. For smaller window sizes, there are only three possible window sizes; hence it is highly probable that it will process unnecessary points.

To address such cases, we consider Eqs. (1) and (2), but directly using them to calculate window sizes is not feasible and we must obey at least the following constraints: it should be possible to distribute the workload of a work-group equally among its work-items, i.e. xy mod 64 = 0, where x is the width of the window and y is its height. Another desirable constraint is that the workload of a work-item (xy/64) should be divisible by its height, or its height should be divisible by its workload, i.e. the workload can be processed by at most two for loops, which minimizes the kernel complexity. We formalized such constraints as the following optimization problem:

arg min_{x,y} g(x, y) = (x − w(p))² + (y − h(p))²
subject to x ≥ w(p), y ≥ h(p), (x, y) ∈ N,   (4)

The problem can be approximately solved by rounding up w(p) to the nearest multiple of 4 and y to the nearest number from the set M4 = {16, 32, 64, 128, 192, 256, ...}. We can calculate x and y using:

x = 4⌊(w(p) + 3) / 4⌋,   (5)

arg min_y r(y) = (y − h(p))²
subject to y ≥ h(p), y ∈ M4.   (6)

Although this is not the optimal solution, it is a good approximation and has a simple computational implementation. Since there is a natural limit to y (the range image height), we implemented Eq. (6) using a look-up table, which can be stored in the constant memory of the GPU. We will call this implementation Kernel 3 for further reference.

An optimization step applied to all implementations is related to the pivot access; every work-item within a work-group must load the same pivot. According to [15], the best way to carry out this access pattern is to use only one work-item to load the pivot and share it with the others through the local memory. In common, all kernels must perform the following steps:

a) Obtain their work-group number and position within the work-group;
b) Calculate the pivot index;
c) If it is the first work-item: load the pivot into the local memory;
d) Synchronize work-items so that they can access the pivot;
e) Calculate the window size and the ranges for processing;
f) Use a "for" loop to process the designated points and save the results in the global memory.

A number of window, work-item and work-group combinations were not explored in this study and there is probably one that would yield lower processing times than the ones proposed. Nevertheless, Section 5.3 provides evidence that this is a reasonable implementation in absolute terms.

5. Results

Experiments were conducted using range images from
xy xy
mod y = 0 OR y mod = 0, a stereo camera and an RGB-D sensor (Microsoft Kinect).
64 64
xy mod 64 = 0.
where g is the function that calculates the distance be- Table I: Talukder et al. method parameter selection for both sensors
tween the calculated window sizes and the ones we are Parameter Stereo Kinect
trying to approximate. The constraints are GPU-related
HT 27 cm 7 cm
and we believe they represent the minimal set so as to
Hmax 40 cm 20 cm
enable a reasonable kernel implementation.
θT 45 degrees 45 degrees
To the best of our knowledge, this problem has no sim-
f 630 pixels 420 pixels
ple analytic solution but we can approximate it by round-
7
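To make the above concrete, the sketch below emulates in host-side Python the window-size approximation of Eqs. (5) and (6) and steps a)-f) for one work-group. This is not the authors' OpenCL code: the strided point mapping and all names are our own illustrative assumptions.

```python
# Host-side Python sketch (NOT the authors' OpenCL kernels).
# round_window() follows Eqs. (5) and (6); simulate_work_group()
# emulates steps a)-f); the strided mapping is an assumption.

WG_SIZE = 64                          # work-group / wavefront size
M4 = [16, 32, 64, 128, 192, 256]      # admissible heights (the set M4)


def round_window(w, h):
    """Approximate Eq. (4): round the width up to a multiple of 4
    (Eq. 5) and the height up to the nearest value in M4 (Eq. 6)."""
    x = 4 * ((w + 3) // 4)
    y = min(v for v in M4 if v >= h)  # the paper uses a look-up table
    return x, y


def simulate_work_group(pivot_index, w, h):
    """Emulate steps a)-f): the first work-item loads the pivot into
    local memory, every work-item then reads it and processes a
    strided subset of the window's points."""
    x, y = round_window(w, h)                  # step e): window size
    n_points = x * y
    local_mem = {}
    for local_id in range(WG_SIZE):            # steps a)-c)
        if local_id == 0:                      # only one work-item loads
            local_mem["pivot"] = pivot_index
    # step d): on the GPU this is a barrier; afterwards every
    # work-item reads the same pivot from local memory.
    pivot = local_mem["pivot"]
    # step f): work-item i processes points i, i + 64, i + 128, ...
    assignment = {i: list(range(i, n_points, WG_SIZE))
                  for i in range(WG_SIZE)}
    return pivot, assignment
```

For a 32×32 window each work-item receives 32²/64 = 16 points and for a 16×16 window it receives 4 points, matching the distribution of Fig. 9.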
5. Results

Experiments were conducted using range images from a stereo camera and an RGB-D sensor (Microsoft Kinect). Two logs were captured for this purpose: one from an agricultural environment with the stereo camera and an indoor one with the Kinect sensor. We chose a resolution of 320×240 for the stereo camera and a semi-global stereo method [16] for stereo matching. The Kinect sensor tests were performed using its full resolution, i.e. 640×480. Table I shows the parameter selection for the method in each case. We employed the following hardware for the experiments: an Intel Core i7-3770 CPU, an AMD Radeon 5850 GPU and an AMD Radeon 7950 GPU. All tests were conducted in a 32-bit Linux environment.

Table I: Talukder et al. method parameter selection for both sensors
Parameter | Stereo | Kinect
HT | 27 cm | 7 cm
Hmax | 40 cm | 20 cm
θT | 45 degrees | 45 degrees
f | 630 pixels | 420 pixels

Figure 10: Comparison between the results of the "full" CPU implementation and a projected one with a frame of the stereo log. Blue pixels represent obstacles detected by both implementations and red pixels are the ones detected only by the "full" CPU implementation.

Figure 11: Comparison between the results of the "full" CPU implementation and the GPU Kernel 1 one with a frame of the Kinect log. Blue pixels represent obstacles detected by both implementations and red pixels are the ones detected only by the "full" CPU implementation. This is not the case for the proper projected implementations (i.e. single-core, multi-core and Kernel 3).

5.1. Validation

We validated our implementations by comparing each one with the "full" CPU implementation. The objective was to assure that the single-core, multi-core and Kernel 3 implementations were correct, i.e. that they yield the same results as the "full" one, and also to highlight the possible effects of having a limited window size, as in the Kernel 1 and Kernel 2 implementations.

Figure 10 shows the comparison between the results of the "full" CPU implementation and a projected one in the stereo log. No obstacle was missed. The same holds for all (projected) GPU implementations, which revealed that, for this log, max{w(p), h(p)} ≤ 64 and that, in cases in which max{w(p), h(p)} > 64, no obstacles were found. The results in Fig. 10 were the same throughout the stereo log.

The same validation was performed using the Kinect log. There are two relevant differences between this log and the stereo one: I) the indoor environment tends to yield larger window sizes, since the points are closer to the sensor; and II) the sensor is clearly tilted towards the ground plane, therefore the "method assumption" is violated. The GPU implementations Kernel 1 and Kernel 2 have a limited window size, and the first difference causes them to miss near obstacles (Fig. 11). The proper projected implementations (i.e. single-core, multi-core and Kernel 3) produced the same results as the "full" CPU implementation throughout the log, i.e. the same obstacles were found in both cases. Therefore, we can conclude that the proper projected implementations are correctly implemented, and that using projection sizes larger than the ones from Eqs. (1) and (2) does not alleviate the "method assumption". The latter can be inferred because the "full" implementation uses the maximum possible window size, i.e. the image size. These results were expected from the discussion in Section 2.

5.2. Processing Times

This section compares the processing times of each implementation. The results were calculated by running each log with each implementation. Since the processing times are valid only for our setup, we also provide processing times relative to the number of points processed. This is important because, even with the same sensor and resolution, the density of the point cloud may change. This change is especially significant in stereo cameras, where both the stereo method and the environment affect the number of valid depth measurements.

A bar chart displaying the mean processing times for each implementation on the stereo log is shown in Fig. 12 and Table IV provides further details. The multi-core implementation is 4.5 times faster than the single-core one, which reveals the benefit of Simultaneous Multi-Threading (SMT) in the tested CPU. Concerning the GPU implementations, there was a significant difference between the two tested GPUs, but not between kernels.
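The per-pixel comparison behind Figs. 10 and 11 reduces to two set operations. The sketch below is our own illustration, not the authors' code; boolean lists stand in for per-pixel obstacle masks.

```python
# Sketch of the validation overlay of Figs. 10 and 11: "blue" marks
# obstacles found by both implementations, "red" marks obstacles found
# only by the "full" CPU implementation. Names are illustrative.

def compare_masks(full_mask, projected_mask):
    """Return (blue, red) pixel-index sets for two obstacle masks."""
    pairs = list(zip(full_mask, projected_mask))
    blue = {i for i, (f, p) in enumerate(pairs) if f and p}
    red = {i for i, (f, p) in enumerate(pairs) if f and not p}
    return blue, red

full = [True, True, False, True]
proj = [True, False, False, True]
print(compare_masks(full, proj))  # blue = {0, 3}, red = {1}
```

A projected implementation that matches the "full" one yields an empty red set, which is the criterion used above to declare the single-core, multi-core and Kernel 3 implementations correct.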
Figure 12: Mean processing times for each device and implementation obtained using the stereo log. Only the projected CPU implementations are presented.

Figure 13: Square-scaled mean processing times for each device and implementation obtained using the Kinect log. Only the projected CPU implementations are presented.

Table IV and Fig. 13 show that the processing times were higher for the Kinect sensor. There are two main reasons for this result: the point cloud is four times larger, and the indoor environment tends to have closer points (smaller z), which generate larger search areas (window sizes). The CPU single-core implementation was almost 27 times slower than the equivalent one on stereo data, while the 7950 Kernel 3 GPU implementation scaled much better, i.e. it was only 3.6 times slower and provided a speedup of 87 relative to the single-core implementation. A larger gap was observed between the two tested GPUs and, again, no significant difference was found between kernels.

We finally provide a setup-independent measure of computational cost, showing, for each device, the number of points and the processing times in Fig. 14, where each dot of the graph represents a frame and the number of points of a frame was calculated according to:

n_p = Σ_{p∈V} w(p) h(p)    (7)

where V is the set containing all valid pivot points.

Table II shows the number of points processed per millisecond (ms), calculated using the mean points per ms of each frame. Both the table and the chart show only the implementations that yielded the same results as the "full" one, i.e. single-core, multi-core and Kernel 3. Such data provide setup-independent results, which can serve as a reference for other researchers in the field to check whether our implementation and devices can meet their demands.

Table II: Mean number of points per ms computed by each device
Device | Points per ms | Speedup
CPU Single-core | 2.72 × 10⁵ | 1
CPU Multi-core | 1.42 × 10⁶ | 5.2
Radeon 5850 | 2.70 × 10⁶ | 9.9
Radeon 7950 | 1.95 × 10⁷ | 71.9
The GPU devices refer to the Kernel 3 implementation.

5.3. Kernel Analysis

While processing times alone enable the assessment of relative speedups, they provide few clues on how efficient the implemented kernels are in absolute terms. Although we started attending to some of the GPU performance recommendations in Kernel 1, this was not the case in subsequent kernels. Coalesced memory operations were compromised, as Kernel 2 and Kernel 3 do not necessarily access adjacent memory addresses within a wavefront, which may result in idle Arithmetic Logic Units (ALUs) while work-items wait for fetch operations.

The AMD APP Profiler enabled the gathering of information on the performance of each kernel. It organizes this information in "performance counters", each of which reveals a different aspect of the kernel. Our focus is on two of them:

VALUBusy: percentage of GPU time spent in vector ALU operations;
MemUnitBusy: percentage of GPU time the memory unit is active.

These counters can reveal whether a kernel is compute or memory bound. We tested only the AMD Radeon 7950 on the Kinect log, since the conclusions generally apply to all our kernels and this GPU is the only one that provides such counters.
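For reference, the setup-independent cost measure of Eq. (7) and the points-per-ms throughput of Table II amount to a few lines. The sketch below is illustrative Python; the window list is made-up data, not measurements.

```python
# Sketch of Eq. (7), n_p = sum of w(p)*h(p) over valid pivots p, and of
# the points-per-ms throughput of Table II. Illustrative data only.

def frame_points(windows):
    """Eq. (7): total points processed in one frame, where `windows`
    holds the (w(p), h(p)) pair of every valid pivot p."""
    return sum(w * h for w, h in windows)

def points_per_ms(windows, time_ms):
    """Throughput of one frame, as reported (averaged) in Table II."""
    return frame_points(windows) / time_ms

# A toy frame with three valid pivots:
frame = [(16, 16), (32, 32), (16, 32)]
print(frame_points(frame))         # 256 + 1024 + 512 = 1792 points
print(points_per_ms(frame, 10.0))  # 179.2 points per ms
```

Dividing the per-frame point count by the frame's processing time gives the per-frame throughput; Table II reports the mean of this figure over all frames of a log.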
Table III: Performance counters for each kernel in the Kinect log with the AMD Radeon 7950
Kernel | Counter | Value in %
Kernel 1 | VALUBusy | 82.06
Kernel 1 | MemUnitBusy | 95.10
Kernel 2 | VALUBusy | 82.98
Kernel 2 | MemUnitBusy | 94.18
Kernel 3 | VALUBusy | 80.14
Kernel 3 | MemUnitBusy | 91.20

Table V: Comparison of computation times found in the literature
Approach | Resolution | Proc. time
Sub-sampling [10] | 640×480 | 200 ms
Sub-sampling [8] | 500×320 | 25 ms
Discrete parameters [9] | 640×480 | 62 ms
GPU (Ours) | 640×480 | 43 ms
The processing times are not directly comparable due to variations in hardware, environment and parameters; they should therefore serve only as a reference.

Table III shows the performance counters for each kernel in the Kinect log with the AMD Radeon 7950. There is little room for memory access improvements, as the ALUs are kept significantly busy, i.e. over 80% in all cases, and there is no significant variation in the counters across kernels. This probably happens because, although memory accesses are less efficient in Kernel 2 and Kernel 3, their ALU/fetch instruction ratio is higher, i.e. they have more ALU instructions, and these instructions hide the latency of the inefficient fetches.

6. Discussion

The GPU implementation has successfully provided a speedup for the method and enabled real-time obstacle detection for both sensors and scenarios. We were especially interested in cases where a single-core implementation yields a prohibitive computational time, as in our Kinect setup. Fortunately, our implementation could accelerate the processing frequency from ∼0.26 Hz (single-core) to ∼23 Hz (Kernel 3 on the Radeon 7950). Concerning the stereo setup, the small granularity (window sizes) and amount of data penalized the GPU implementations. A crucial aspect of the GPU option is that the CPU becomes available to perform other tasks, being responsible only for reading and writing buffers.

A crucial aspect of our approach is that no approximations were employed, differently from other works. In [10], the authors used a conditional reduction of the point cloud resolution, in which sub-sampling is performed until an obstacle is found. A similar approach was used in [8], where the image is divided into horizontal segments and sub-sampled according to the distance represented by such segments. In [9], a set of discrete parameters was employed. Although these approaches are justifiable for noisy and low-resolution range images (e.g. the ones from our stereo setup), they assume that obstacles are formed by a reasonable set of points, which is not necessarily true, and they can conceal obstacles and reduce the detection accuracy. This is especially the case for relatively high-resolution range images, where the sub-sampling rate would have to be high, and for precise sensors, which provide pixel-wise reliable measurements.

Although computation times are not directly comparable due to different hardware, parameters and test scenarios, we could achieve the second lowest computational time stated in the literature without using approximations (Table V). It is also worth mentioning that the approximation techniques employed by other authors can be used in conjunction with our approach and yield even higher speedups.

7. Conclusions and Future Work

This paper has presented a GPU-based parallel version of an obstacle detection method. Proposed in [7], the method has been widely used; however, researchers have to deal with its main limitation: its computational cost.

One of the main challenges in adapting code for parallel execution on a GPU is the need for a proper distribution of the workload among multiple processors/compute units. The memory access is also a critical point and must be performed carefully to avoid potential bottlenecks. Our target algorithm requires the use of dynamically-sized windows, which makes parallelization even more complex.

We formalized the GPU constraints as an optimization problem and derived an approximation for it. Satisfactory results were achieved, i.e. the Kernel 3 implementation was almost 90 times faster than the single-core one with the use of the Radeon 7950 in the Kinect log.

The proposed solution for processing dynamically-sized sliding windows on a GPU can be seen as a general framework and applied to other tasks that require similar computation, benefiting a wide range of applications.

As future work, we aim at using our implementation along with a clustering method and a filter based on machine learning techniques to minimize noise and improve the classification results.

References

[1] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, Y. LeCun, Learning long-range vision for autonomous off-road driving, Journal of Field Robotics 26 (2) (2009) 120–144.
[2] K. Konolige, M. Agrawal, M. R. Blas, R. C. Bolles, B. Gerkey, J. Solà, A. Sundaresan, Mapping, navigation, and learning for off-road traversal, Journal of Field Robotics 26 (1).
[3] A. Broggi, C. Caraffi, R. Fedriga, P. Grisleri, Obstacle detection with stereo vision for off-road vehicle navigation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, 2005, p. 65.
Figure 14: Logarithmically scaled processing times relative to the number of points for each device. The GPU devices refer to the Kernel 3 implementation.

Table IV: Processing times for each combination of log, device and implementation
Log | Device | Implementation | Mean time | Max. time | Min. time | Speedup
Stereo | i7 3770 | Single-core | 139 ms | 240 ms | 105 ms | 1
Stereo | i7 3770 | Multi-core | 31 ms | 54 ms | 21 ms | 4.5
Stereo | Radeon 5850 | Kernel 1 | 32 ms | 41 ms | 23 ms | 4.5
Stereo | Radeon 5850 | Kernel 2 | 36 ms | 46 ms | 27 ms | 3.9
Stereo | Radeon 5850 | Kernel 3 | 29 ms | 42 ms | 24 ms | 4.8
Stereo | Radeon 7950 | Kernel 1 | 13 ms | 19 ms | 9 ms | 10.7
Stereo | Radeon 7950 | Kernel 2 | 12 ms | 18 ms | 8 ms | 11.6
Stereo | Radeon 7950 | Kernel 3 | 12 ms | 17 ms | 8 ms | 11.6
Kinect | i7 3770 | Single-core | 3764 ms | 9487 ms | 1479 ms | 1
Kinect | i7 3770 | Multi-core | 776 ms | 2230 ms | 287 ms | 4.8
Kinect | Radeon 5850 | Kernel 1 | 160 ms | 179 ms | 98 ms | 23.5
Kinect | Radeon 5850 | Kernel 2 | 176 ms | 199 ms | 82 ms | 21.4
Kinect | Radeon 5850 | Kernel 3 | 371 ms | 798 ms | 176 ms | 10.1
Kinect | Radeon 7950 | Kernel 1 | 30 ms | 53 ms | 20 ms | 125.5
Kinect | Radeon 7950 | Kernel 2 | 27 ms | 42 ms | 16 ms | 139.4
Kinect | Radeon 7950 | Kernel 3 | 43 ms | 93 ms | 22 ms | 87.5
Speedup relative to the single-core implementation of each log.
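Each Speedup entry of Table IV is the single-core mean time of the same log divided by the implementation's mean time. A one-line Python check (a hypothetical helper; small last-digit differences can appear because the table lists rounded means):

```python
# Derivation of the Speedup column of Table IV: single-core mean time
# of the log divided by the implementation's mean time (values in ms).

def speedup(single_core_ms, impl_ms):
    return round(single_core_ms / impl_ms, 1)

print(speedup(139, 31))   # stereo multi-core
print(speedup(3764, 43))  # Kinect Kernel 3 on the Radeon 7950
```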

[4] C. Caraffi, S. Cattani, P. Grisleri, Off-road path and obstacle detection using decision networks and stereo vision, IEEE Transactions on Intelligent Transportation Systems 8 (4) (2007) 607–618.
[5] J. Kolter, M. Rodgers, A. Ng, A control architecture for quadruped locomotion over rough terrain, in: IEEE International Conference on Robotics and Automation, 2008, pp. 811–818.
[6] S. Lacroix, A. Mallet, D. Bonnafous, G. Bauzil, S. Fleury, M. Herrb, R. Chatila, Autonomous rover navigation on unknown terrains: functions and integration, in: Experimental Robotics VII, Vol. 271 of Lecture Notes in Control and Information Sciences, Springer Berlin Heidelberg, 2001, pp. 501–510.
[7] A. Talukder, R. Manduchi, A. Rankin, L. Matthies, Fast and reliable obstacle detection and segmentation for cross-country navigation, in: IEEE Intelligent Vehicles Symposium, 2002, pp. 610–618.
[8] A. Broggi, M. Buzzoni, M. Felisa, P. Zani, Stereo obstacle detection in challenging environments: The VIAC experience, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2011, pp. 1599–1604.
[9] W. van der Mark, J. van den Heuvel, F. Groen, Stereo based obstacle detection with uncertainty in rough terrain, in: IEEE Intelligent Vehicles Symposium, 2007, pp. 1005–1012.
[10] P. Santana, P. Santos, L. Correia, J. Barata, Cross-country obstacle detection: Space-variant resolution and outliers removal, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2008, pp. 1836–1841.
[11] C. C. T. Mendes, F. S. Osorio, D. F. Wolf, An efficient obstacle detection approach for organized point clouds, in: IEEE Intelligent Vehicles Symposium (IV), 2013, pp. 1203–1208. doi:10.1109/IVS.2013.6629630.
[12] R. Manduchi, A. Castano, A. Talukder, L. Matthies, Obstacle detection and terrain classification for autonomous off-road navigation, Autonomous Robots 18 (1) (2005) 81–102.
[13] NVIDIA, OpenCL Programming Guide for the CUDA Architecture (May 2010). URL http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_OpenCL_ProgrammingGuide.pdf
[14] S. Xiao, W.-c. Feng, Inter-block GPU communication via fast barrier synchronization, in: IEEE International Symposium on Parallel & Distributed Processing, 2010, pp. 1–12.
[15] AMD, AMD Accelerated Parallel Processing OpenCL Programming Guide (July 2013). URL http://developer.amd.com/tools/hc/AMDAPPSDK/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf
[16] H. Hirschmuller, Accurate and efficient stereo processing by semi-global matching and mutual information, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, 2005, pp. 807–814.
