Real Time SLAM
Real Time SLAM
Abstract
Simultaneous localization and mapping (SLAM) is widely used by autonomous robots operating in unknown
environments. Research community has developed numerous SLAM algorithms in the last 10 years. Several works
have presented many algorithms’ optimizations. However, they have not explored a system optimization from the
system hardware architecture to the algorithmic development level. New computing technologies (SIMD
coprocessors, DSP, multi-cores) can greatly accelerate the system processing but require rethinking the algorithm
implementation. This article presents an efficient implementation of the EKF-SLAM algorithm on a multi-processor
architecture. The algorithm-architecture adequacy aims to optimize the implementation of the SLAM algorithm on a
low-cost and heterogeneous architecture (implementing an ARM processor with SIMD coprocessor and a DSP core).
Experiments were conducted with an instrumented platform. Results aim to demonstrate that an optimized
implementation of the algorithm, resulting from an optimization methodology, can help to design embedded
systems implementing low-cost multiprocessor architecture operating under real-time constraints.
© 2012 Vincke et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction
in any medium, provided the original work is properly cited.
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 2 of 14
http://jes.eurasipjournals.com/content/2012/1/5
using 25 particles, it takes more than 10 s per update. Section “Evaluation methodology and algorithm imple-
Authors have underlined the difficulty to find the right mentation” details the evaluation methodology, provides
SLAM parameters to fit within the available computing a first algorithm implementation and analyzes this imple-
power and the real-time processing. Magnenat et al. [12] mentation in terms of processing time. A Hardware–
present a system based on the co-design of a low-cost software optimization is proposed and analyzed in Section
sensor (a slim rotating scanner), a SLAM algorithm, a “Hardware–software optimization and improvements”. It
computing unit, and an optimization methodology. The presents SIMD optimizations and DSP parallelization. A
computing unit is based on an ARM processor (533 Mhz) performance comparison is then performed between the
running a FASTSLAM 2.0 algorithm [13]. Magnenat et al. optimized and non-optimized instances. Finally, Section
[12] use an evolution strategy to find the best configura- “Conclusion” concludes this article.
tion of the algorithm and setting of the parameters.
As pointed out by [11,12], the first improvement of EKF-SLAM algorithm
a SLAM algorithm is an efficient setting of the various Overview
parameters of the algorithm. Other modifications were Extended Kalman filter for SLAM estimates a state vector
investigated to reach real-time constraints. These modifi- containing both the robot pose and the landmark loca-
cations are necessary due to the low computing power and tions. We consider that the robot is moving on a plane.
limited memory resources available on embedded sys- The algorithm uses 3D points as landmarks. It uses pro-
tems. Features restriction for EKF-SLAM algorithm has prioceptive sensors to compute a predicted vector and
been implemented to decrease the processing time [14]. then corrects this state using exteroceptive sensors. In
Schroter et al. [15] focused on reducing the memory foot- this article, we consider a wheeled robot embedding two
print of particle-based gridmap SLAM by sharing the map odometers (attached to each rear wheel) and a camera.
between several particles.
Robust laser-based SLAM navigation has long existed in State vector and covariance matrix
robot applications, but systems implement sensors that, With N landmarks, the state vector is defined as:
in some cases, are more expensive than the final prod-
uct. Neato Robotics has developed a vacuum cleaner that x = (x, z, θ, xa1 , ya1 , za1 , . . . , xaN , yaN , zaN )T (1)
implements a navigation system using a SLAM algorithm.
The approach is based on a low-cost system implementing where:
a designed laser rangefinder [16]. • x,z are the ground coordinates (x -axis, z -axis) of the
This article presents an efficient implementation of the robot rear axle center. We suppose that the robot is
EKF-SLAM algorithm on a multi-processor architecture. always moving on the ground, so y = 0 (no elevation)
The approach is based on an algorithm implementation and y does not appear in Equation (1).
adequate to a defined architecture. The aim is to optimize • θ is the orientation of a local frame attached to the
the implementation of the SLAM algorithm on a low-cost robot with respect to the global frame.
and heterogeneous architecture implementing an SIMD • xa1 , ya1 , za1 , . . . , xaN , yaN , zaN are the 3D coordinates
coprocessor (NEON) and a DSP core. The hardware of the N landmarks in the global frame.
includes several low-cost sensors. As [17], we chose to
use a low-cost camera (exteroceptive sensor) and odome- The state covariance matrix is defined as:
ters (proprioceptive sensors). Following [12], we efficiently ⎡ ⎤
tune the parameters of the SLAM algorithm. We improve Pxx Pxz Pxθ Pxxa1 .. Pxza
N
⎢ Pzx Pzz Pzθ Pzx .. Pzza ⎥
on previous works by proposing an adequate implemen- ⎢ a1 ⎥
⎢ P N
⎥
tation of the EKF-SLAM algorithm on a multiprocessing ⎢ θx Pθz Pθθ Pθxa1 .. PθzaN ⎥
P=⎢ ⎥ (2)
architecture (ARM processor, SIMD NEON coprocessor, ⎢ Pxa x Pxa z Pxa θ Pxa xa .. Pxa za ⎥
⎢ 1 1 1 1 1 1 N ⎥
DSP core). The specifications related to the NEON copro- ⎣ .. .. .. .. .. .. ⎦
cessor and the DSP core improve the processing time and PzaN x PzaN z PzaN θ PzaN xa1 .. PzaN zaN
the system performance. Results aim to demonstrate that
an optimized implementation of the algorithm, result- Prediction
ing from an evaluation methodology, can help to design The prediction step relies on the measurements of the
embedded systems implementing low-cost multiproces- proprioceptive sensors, the odometers, embedded on our
sor architecture operating under real-time constraints. experimental platform. A non linear discrete-time state-
Section “EKF-SLAM algorithm” introduces the EKF- space model is considered to describe the evolution of the
SLAM algorithm. Section “Multiprocessor architecture robot configuration x:
and system configuration” presents the embedded mul-
tiprocessor architecture and the system configuration. xk|k−1 = f(xk−1|k−1 , uk ) + vk (3)
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 3 of 14
http://jes.eurasipjournals.com/content/2012/1/5
where uk is a known two-dimensional control vector, The innovation and its covariance matrix: The pinhole
assumed constant between the times indexed by k − 1 and model is used to project a known landmark position into
k, and vk is an unknown state perturbation vector that the image:
accounts for the model uncertainties. xk−1|k−1 represents ⎛ ⎞
the state vector at time k-1, xk|k−1 represented the state ui
⎝ vi ⎠ = pinhole(xcam
ai , yai , zai )
cam cam
vector after the prediction step, xk|k represents the state
1
vector after the estimation step. The classical evolution ⎛ cam ⎞
model, described in [18], is considered: ⎡ ⎤ xai (7)
fku suv cu ⎜ zacam i ⎟
= ⎣ 0 fkv cv ⎦ ⎜ yai ⎟
cam
⎛ ⎞ ⎝ zcam ⎠
xk−1 + δs cos θk−1 + δθ2 0 0 1 ai
⎜ zk−1 + δs sin θk−1 + δθ ⎟ 1
⎜ 2 ⎟
⎜ θk−1 + δθ ⎟
⎜ ⎟ where:
⎜ xa1 ,k−1 ⎟
⎜ ⎟
⎜ ya1 ,k−1 ⎟ • (ui , vi ) is the position of the i -th landmark in the
⎜ ⎟
f(xk−1|k−1 , δs, δθ) = ⎜
⎜ za1 ,k−1 ⎟
⎟ image.
⎜ ⎟ • (xcam
ai , yai , zai ) is the position of the i -th landmark
⎜ .. ⎟ cam cam
⎜ .. ⎟
⎜ ⎟ in the camera frame.
⎜ xaN ,k−1 ⎟ • f is the focal length.
⎜ ⎟
⎝ yaN ,k−1 ⎠ • (ku , kv ) is the number of pixels per unit length.
zaN ,k−1 • suv is a factor accounting for the skew due to
(4) non-rectangular pixels. In our case, we take suv =0.
where uk = (δs, δθ); δs is the longitudinal motion and δθ Equation (7) can be written as the predicted observation
is the rotational motion [19]: equation for a single landmark:
⎛ xcam ⎞
wr δϕr +wl δϕl cu + fku zcam
ai
δs u
δθ
= g(ϕl , ϕr ) = wr δϕr −w
2
l δϕl
(5) hi (xk|k−1 ) = i
=⎝ ai
ycam
⎠ (8)
e vi c + fk i
a
v v zcam
ai
The state covariance matrix is defined as: Where D is the length between the camera and the robot
rear axle center.
∂f ∂f T During the observation step, the algorithm matches M
Pk|k−1 = Pk−1|k−1 + Qk (6)
∂x ∂x
⎛ 0(M⎞<= N) whose observations are added in
landmarks
where h
⎡ ⎤ hk = ⎝ .. ⎠.
1 0 −δs sin θk−1|k−1 + δθ2 0 .. 0 hM−1
⎢ 0 1 δs cos θk−1|k−1 + δθ 0 .. 0 ⎥ Thus, the innovation is:
⎢ 2 ⎥
⎢0 0 1 0 .. 0 ⎥
∂f
• ∂x =⎢ ⎢0 0
⎥
Yk = ẑk − hk (xk|k−1 )
⎢ 0 1 .. 0 ⎥ ⎥
(10)
⎣ .. .. .. .. .. .. ⎦ where ẑk is the measurement for all the M predicted
0 0 0 0 .. 1 observations.
• Q k is the covariance matrix of the process noise.
The innovation covariance Sk is:
Estimation Sk = Hk Pk|k−1 HTk + Rk (11)
The estimation of the state is made using the camera
which returns the position in the image (ui , vi ) of the i-th where Hk is the Jacobian of hk and Rk is the observation
landmark. noise covariance.
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 4 of 14
http://jes.eurasipjournals.com/content/2012/1/5
State estimation: The state is updated using the classical point inside τ . For each candidate point p : (px , py ), the Np
EKF equations: value of the weighted ZMSSD is:
Kk = Pk|k−1 Hk S−1
k Np = w(px , py ) × ZMSSD (13)
xk|k = xk|k−1 + Kk Yk (12)
and
Pk|k = (I − Kk Hk )Pk|k−1
Visual landmarks ZMSSD = (d(i, j) − md )
The landmarks used in the observation equation are i,j
extracted from images. Landmark initialization defines 2
des des
the initial coordinates and the initial covariance of land- − im px + i − , py + j − − mi
2 2
marks localization (also called interest points or features). (14)
In [20], we have evaluated the processing time of cor-
ner detectors like Harris, Shi-Tomasi or FAST. Harris and where:
Shi-Tomasi detectors were more time consuming than
the FAST detector and do not provide significantly better • w(px , py ) is the Gaussian weights defined by the
localization results than FAST. Consequently, there is no landmark covariance.
need to implement more sophisticated algorithms such as • i ∈[ 0; des − 1] and j ∈[ 0; des − 1] and des is the
Harris or Shi and Tomasi. FAST [21] (Features from Accel- descriptor size.
erated Segment Test) corner detector relies on a simple • d is the feature descriptor.
test performed for a pixel p by examining a circle of 16 • md and mi are respectively the means of the pixel
pixels (a Bresenham circle of radius 3) centered on p. A values in the descriptor and in the image window.
feature is detected at p if the intensities of at least 12 con- • im is the image.
tiguous pixels are all above or all below the intensity of p
with a threshold t. Even if this detector is not highly robust The observation pobs will be selected using p = (p ∈
to noises and depends on a threshold it produces stable τ |Np = min(Npj ), ∀pj ∈ τ ).
landmarks and is computationally very efficient [21]. The descriptor, used to identify the landmark dur-
The FAST detector [21] is related to the wedge-model ing the matching, is classically a small image window
style of detector evaluated using a circle surrounding a of 9×9 pixels to 16×16 pixels around the interest point.
candidate pixel. To optimize the detector processing-time, Davison [22] claims that this sort of descriptor is able to
this model is used to made a decision classifier which is serve as long-term landmark feature.
applied to the image (Figure 1).
Landmark initialization based on davison method
Matching based on zero-mean sum of squared differences Landmark initialization consists of defining the initial
The EKF-SLAM matches the previously detected feature coordinates and the initial covariance of landmarks (inter-
with a new one using zero-mean sum of squared differ- est points). Various methods exist and can be classified
ences (ZMSSD). as an undelayed or delayed method. Undelayed method
The covariance of the projected feature localization adds landmarks with only one measurement whereas the
defines a searching area τ . This area includes the robot delayed method needs two or more frames. We chose
localization uncertainty and the landmarks localization to use the widely spread delayed method proposed by
uncertainty. We use the ZMSSD to find the best candidate [2] which is both efficient and adequate to implement.
Figure 1 Image (320 × 240 Pixels) of the embedded camera and result of the FAST detector.
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 5 of 14
http://jes.eurasipjournals.com/content/2012/1/5
and thus the WLAN connection will be used to achieve 17: hk ← (ui , vi )
only the system monitoring. A coprocessor (ATMega168) 18: Yk ← ẑk − hk
k
takes care of data acquisition. It controls the robot 19: Hk ← ∂h∂x
speed and its direction using two pulse-width modula- xk|k−1
tion (PWM) signals. It decodes signals coming from the 20: end if
odometers embedded in the rear wheels. It communi- 21: end for
cates with the main board using an I2C interface. This 22: ESTIMATION:
interface allows the main processor to retrieve odome- 23: Sk ← Hk Pk|k−1 HTk + Rk (see Eq (11))
ters data and to send instructions corresponding to speed 24: Kk ← Pk|k−1 Hk S−1k (see Eq (11))
and direction. 25: xk|k ← xk|k−1 + Kk Yk (see Eq (12))
To evaluate the designed system, an experiment was 26: Pk|k ← (I − Kk Hk )Pk|k−1 (see Eq (12))
achieved in a corridor of our lab. Frames have been 27: INITIALIZATION:
grabbed at 30 fps with 320×240 resolution. Odometer 28: for Each L ∈ χ do L: Aspiring new Landmark
data were sampled at 30 Hz. During the experiment, 29: Lobs ← ZMSSD(L) (see Eq (13))
references are periodically drawn on the ground by an 30: Update the particles weight according Lobs
embedded marker. (see [2])
31: Compute σdepth , depth
σdepth
32: if depth < then
Evaluation methodology and algorithm 33: Compute L, PL
implementation 34: append(xk|k−1 , L); append(Pk|k−1 , PL )
Our evaluation methodology is based on the identification 35: remove(χ, L)
of the processing tasks requiring a significant computing 36: end if
time. It is based on several steps: we analyze first the exe- 37: end for
cution time of tasks and their dependencies on the algo- 38: if Lack of Landmark then see [8]
rithm’s parameters. A threshold is fixed for each param- 39: append(χ, New Landmarks)
eter. The algorithm is then partitioned in order to have 40: end if
functional blocks (FBs) performing defined calculations. 41: end if
Each block is then evaluated to determine its process- 42: end while
ing time. Function blocks that require the most important
execution time are then optimized to reduce the global Prediction process
processing time. This phase updates the mobile robot position (xk|k−1 )
Algorithm 1 summarizes the main tasks of EKF-SLAM. according to its proprioceptive data acquired from
The algorithm is composed of two process: Prediction odometers (ϕl , ϕr ). The processing time of the predic-
and Correction. The correction process implements three tion process is constant. It just updates the 3D vector
tasks: matching, estimation and initialization. containing the robot pose and its 3×3 covariance matrix.
During the prediction step, the landmarks localization
and uncertainties do not change: landmarks are defined in
Algorithm 1 EKF-SLAM the global frame.
1: χ ← Ø List of Landmarks for initialization
2: Robot pose initialization
3: while localization is required do Correction process
4: DATA ← Sensors Data acquisition The processing time of the correction process is not con-
5: if DATA = (ϕl , ϕr ) then Odometer’s data stant. The following of this section studies the processing
6: PREDICTION time of each task of the process and their dependencies.
7: (δs, δθ) ← g(ϕl , ϕr ) (see Eq (5))
8: xk|k−1 ← f(xk−1|k−1 , δs, δθ) (see Eq (4))
∂f ∂f T
9: Pk|k−1 ← ∂x Pk−1|k−1 ∂x + Qk (see Eq (6)) Matching task Each landmark in the state vector must
10: else if DATA = Camera then be projected in the camera frame using the pinhole
11: FAST detector applied on the image model (see L. 2). The computing time of these projections
12: MATCHING: depends only on the number of landmarks in the state
13: for Each Landmark Ni ∈ xk|k−1 do vector (L. 2). For each projected landmark on the focal
14: ui , vi , τi ← pinhole(xk|k−1 , Ni ) (see Eq (8)) plane, ZMSSD matches an observation. Both the size of
15: if (ui , vi ) ∈ Camera Frame then the descriptor and the size of the searching area τ will
16: ẑk ← ZMSSD(τi , Ni ) (see Eq (13)) affect the computing time (see Equation (13)).
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 7 of 14
http://jes.eurasipjournals.com/content/2012/1/5
The processing time of the matching task depends on variables. For real-time implementation, it is important to
several parameters: get a constant, or at least a bounded computation time. To
solve this constraint we have to:
• The number of landmarks in the state vector.
• The number of visible landmarks on the focal plane. • set the maximum number of landmarks in the state
• The size of the descriptor. vector. The size of the state vector will be fixed.
• Both the localization uncertainty of the mobile robot Therefore, no dynamic memory allocation will be
and the landmarks. needed.
• set the maximum number of landmarks observed.
In practice, all the previously defined parameters should This keeps the computation time of the estimation
be set in order to bound the computing time. The first task constant using a fixed size matrix multiplication.
three parameters can be set by the users. The uncertainty • set the maximum number of landmarks being
depends on the followed path and cannot be bounded. initialized in order to bound the computation time of
the initialization task. Unfortunately, it will not be
Estimation task The estimation task uses the classical sufficient to keep the computation time of the
Kalman equations to update both the robot and land- initialization task constant due to its internal
marks uncertainties. The processing time of the estima- matching step.
tion task is time-consuming and depends on: • bound the computing time induced by the
uncertainties. The only solution to get a bounded
• The number of landmarks in the state vector.
global-processing-time is to set a maximum
• The number of landmarks observed.
execution time for the matching task. Due to the
constant processing time of the prediction and the
The size of the matrix and thus the computing cost
estimation task, the execution time of both the
of the matrix multiplication in the Equations (11) and
matching and initialization task can be bounded
(12) depend on the number of landmarks in the state
(33 ms - (tprediction + testimation )). We chose to use all
vector. Moreover, Equation (11) depends on the num-
possible images (30 fps).We set a maximum
ber of landmarks observed. As for the matching process,
execution time for the matching task. The algorithm
these parameters (size of the state vector and number of
proceeds in a way to match a maximum of landmarks
observations) should be bounded in order to achieve this
in a bounded time. The initialization task has a
estimation task in a constant computing time.
dynamic execution time depending on the real
processing time of the matching task and the number
Initialization task For each landmark under initializa- of landmarks being initialized. The lower bound of
tion, each particle (a feature depth hypothesis) is projected this dynamic execution allows at least a minimum
into the image, matched and its probability is re-weighted. number of landmarks to be initialized.
If there is a lack of landmarks under initialization, we
add aspiring new landmarks. The processing time of the
initialization task depends on: Map management
To keep the size of the state vector constant, we need
• The number of landmarks being initialized. to delete some landmarks when inserting new ones. The
• The size of the descriptor. new state vector includes new landmarks (whose initial-
• Both the localization uncertainty of the mobile robot ization has just been performed) and previously used
and the landmark. landmarks. Auat Cheein and Carelli [26] proposes an effi-
cient method to select landmarks for the estimation task.
The number of landmarks being initialized and the size It is based on the evaluation of the influence of a given
of the descriptors can be bounded. For each landmark feature on the convergence of the state covariance matrix.
being initialized, we have to update the probability of The method matches all possible landmarks and com-
each localization hypothesis using a matching process. As putes (I − Kk Hk ) from Equation (12). Unfortunately, we
for the matching task, the computing time depends on cannot implement it exactly as proposed by [26] due to
the localization uncertainty of the mobile robot and the the high computing time. We chose to add the landmarks,
landmarks. based on the previous estimation step, by selecting the
previous landmarks which have the best previous influ-
Thresholds definition ence on the convergence of the state covariance matrix.
Previous section shows that the computation time of At time k, we select the landmark which had the smallest
each task of the EKF-SLAM algorithm depends on many (I − Kk−1 Hk−1 ).
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 8 of 14
http://jes.eurasipjournals.com/content/2012/1/5
Functional block partitioning vector and the matched observations. The accuracy of
All the previously defined tasks do not have a fixed com- the localization depends monotonically on the number of
puting time, their computing time depends on the experi- processed landmarks.
ment. We have defined FBs which have a fixed computing The given EKF-SLAM (Algorithm 1) is processed
times to optimize the implementation. The computing sequentially on the embedded ARM processor operating
time of the FB do not depend on the experiment. Exper- at 500 MHz (no coprocessor is implemented). In the fol-
iments will only affect the number of iterations of some lowing, all times given correspond to times evaluated on
FBs(3,4,5,6,7,8 and 9). From the previous algorithm, we the embedded system using the ARM processor. The data
have defined 9 FBs and their runtimes are studied in below acquisition time is constant:
Table 1.
• The odometer data acquisition is achieved in 0.7 ms
Each FB has a fixed computing time and some FB can
occur more than one time (Landmark projection, ZMSSD, (this processing time is due to the I2C
Hi , Weight updating, Addition of a new landmark). communication with the Atmega168 processor).
• Each image acquisition takes 1.8 ms (due to USB data
transfer).
Processing time evaluation
As an application scenario, the robot moves over a square The prediction step does not require significant process-
of 6 m side. At the end of the trajectory, it joined ing time, it takes only 0.093 ms per iteration. As for the
the initial starting position. Using only odometers, the matching task, the estimation task cannot be achieved in
final localization has an error of 1.6 m. With the EKF- a constant processing time. Estimation task processing
SLAM algorithm, the localization has been significantly time depends on the total number of landmarks and
improved. The final error is approximately 0.4 m. EKF- the number of matched landmarks. Figure 4 shows the
SLAM includes all viewed landmarks in the state vector. processing time of the estimation task according to the
Indeed, the localization result depends on the number of number of landmarks in the state vector. The estimation
landmarks but the size of the state vector and the number task is entirely processed on the ARM processor (no
of observations must be bounded to achieve a bounded use of coprocessor). Obviously it will be impossible to
computing time. The overall accuracy of the EKF-SLAM take into account all the landmarks detected when the
depends on the number of the landmarks in the state algorithm is processed: the computation time will be
Table 2 Processing time of the correction process FBs on the main processor (ARM)
Functional block (FB) Processing time per Mean of the number of Mean of the processing
iteration (μs) iterations per correction time per correction
process process (μs)
2. FAST 3400 1 3400
3. Landmark projection 9 19 180
4. ZMSSD-M 11.29 233 2630
5. Hi 14.5 4.5 66
6. Estimation 88845 0.8 70568
7. ZMSSD-I 11.29 123 1388
8. Weight updating 638 4.0 2586
9. Addition of a new landmark 103 0.18 18
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 10 of 14
http://jes.eurasipjournals.com/content/2012/1/5
8 bits data. The computation of the ZMSSD can be opti- to avoid the use of two loops. Formally the ZMSSD is
mized using the SIMD NEON architecture. The estima- written as:
tion task is based on floating point matrix multiplication,
it could efficiently be optimized using the SIMD NEON
coprocessor (the ARM Cortex A8 does not include any ZMSSD = ((d − md ) − (im − mi ))2 (15)
i,j
floating point unit (FPU)). The initialization FBs will be
studied at Section “Parallel implementation on a DSP
where d = d(i, j) and im = im(px + i − des des
2 , py + j − 2 )
processor”.
By expanding the ZMSSD, we obtain:
ZMSSD (FB4) ZMSSD = ((d−md )2 −2(d−md )(im−mi )+(im−mi )2 )
The EKF-SLAM matches features using ZMSSD. ZMSSD i,j
is computed for each landmark using Equation 2. We (16)
chose to use a descriptor with 16×16 8-bits pixels size due
to the efficiency of SIMD NEON architecture to deal with = (d2 −2d.md +m2d −2d.im+2d.mi + 2md .im
128/64 bits vectors. i,j
Figure 6 Processing time of the estimation task on the ARM and NEON coprocessor.
EKF-SLAM algorithm on the multiprocessor architecture ARM processor writes the image (320×240 pixels), the
(ARM, NEON and DSP processors). This allows enhanc- robot position and its uncertainty on the shared mem-
ing the global processing time especially when we consider ory. Data transfer between the ARM processor and DSP
to operate in real-time constraints. The landmarks match- processor for a 320×240 gray image is done in one mil-
ing (FB3 to FB5) and the robot position estimation (FB6) lisecond. When the initialization of a landmark is com-
tasks must be processed sequentially. Fortunately, the pleted, the DSP processor returns the position and the
initialization tasks (FB7, FB8 and FB9) can run simultane- uncertainty of possible new landmarks.
ously with the matching and estimation tasks.
Rethinking the implementation to obtain a parallel Global results
implementation, the instance of Algorithm 1 with block We have improved the EKF SLAM implementation using
partitioning leads to the Algorithm 5. the SIMD NEON coprocessor and the DSP processor.
We have implemented the matching and estimation tasks
on a NEON coprocessor and the initialization tasks on a
Algorithm 5 Multiprocessed EKF-SLAM DSP processor. FAST corner detector is already an opti-
1: Robot pose initialization mized algorithm using machine learning [21]. For the
2: while localization is required do latest experiment, we set the same thresholds as Section
3: if DATA = Odometers then “Experimental results”.
4: PREDICTION Table 4 summarizes the processing time per iteration
5: else if DATA = Camera then and the mean processing time per Frame of each FB. The
6: FAST detector computing time of the initialization task (blocks 7, 8 and
7: ARM Processor MATCHING and ESTIMATION 9) implemented on the DSP processor is approximately 4.0
(FB 3, 4, 5 and 6) ms. The DSP processor computes the initialization task
8: DSP Processor INITIALIZATION (FB 7, 8 and 9) while the ARM-NEON processors compute the predic-
9: end if tion, FAST detection, matching and estimation tasks.
10: end while With this implementation and since the processing-time
of the initialization task (4.0 ms) is smaller compared
The architecture of the OMAP3530 can interface the to the sum of the processing times of the matching and
ARM and DSP processors using a shared memory. estimation tasks (13.0 ms for blocks 3, 4, 5 and 6), the
Figure 7 shows the data transfer mechanism using a overall computing time is reduced to the sum of the
shared DDR memory area. For each acquired image, the processing-times of the prediction process (0.093 ms),
the FAST detector (3.4 ms), the matching and estima- Other future developments will be centered around a
tion tasks (13.0 ms). The mean processing time per frame Hardware–software co-design to improve the system per-
with the optimized implementation is 17.6 ms (we add 1 formances implementing a system-on-chip with a field
ms for the ARM/DSP data transfer) whereas the nonop- programmable gate array (FPGA). The use of a con-
timized implementation has a processing time of 80.85 figurable architecture accelerates greatly the design and
ms. The optimized processing time represents 22% of the validation of a proof of real-time and system-on-chip
nonoptimized one. The processing time has been reduced concept.
by 78%.
Competing interests
The authors declare that they have no competing interests.
Conclusion
Author details
This article proposed an efficient implementation of the 1 Univ Paris-Sud, CNRS, Institut d’Electronique Fondamentale, F-91405 Orsay,
EKF-SLAM algorithm on a multiprocessor architecture. France. 2 IFSTTAR, IM, LIVIC, F-78000 Versailles, France.
The overall accuracy of the EKF-SLAM depends on
Received: 24 November 2011 Accepted: 16 June 2012
the number of the landmarks in the state vector and Published: 18 July 2012
the matched observations. Both are linked to the time
allowed to the embedded architecture to compute the References
robot pose. Based on the application constraints (real- 1. M Dissanayake, P Newman, S Clark, H Durrant-Whyte, M Csorba, A
solution to the simultaneous localization and map building (SLAM)
time localization) and an evaluation methodology, we problem. IEEE Trans. Robot. Autom. 17, pp. 229–241 (2001)
have implemented the algorithm in consideration of the 2. A Davison, I Reid, N Molton, O Stasse, MonoSLAM: real-time single camera
underlying hardware architecture. A runtime analyses SLAM. IEEE Trans. Pattern Anal. Mach. Intell. 29, pp. 1052–1067 (2007)
3. M Montemerlo, S Thrun, D Koller, B Wegbreit, in National Conference on
shows that the FBs and the initialization task represents Artificial Intelligence, FastSLAM: a factored solution to the simultaneous
99.6% of the global processing time. We have used an localization and mapping problem. Orlando, Florida, USA, 2002, pp.
optimized instance of the FAST detector. Two FBs (in 593–598
4. J Folkesson, HI Christensen, in IEEE International Conference on Robotics
matching and estimation tasks) have been optimized on and Automation, Graphical SLAM-a self-correcting map. LA, New Orleans,
an SIMD NEON architecture. The initialization task has USA, 2004, pp. 383–390
been parallelized on a DSP processor. This optimization 5. A Eliazar, R Parr, in International Joint Conference on Artificial Intelligence.
DP-SLAM: fast, robust simultaneous localization and mapping without
required a modification of the algorithm implementation. predetermined landmarks. vol. 18. Acapulco, Mexico, 2003, pp. 1135–1142
Using the optimized implementation, the global process- 6. S Thrun, Probabilistic robotics. Assoc. Comput. Mach. 45(3), pp. 52–57
ing time was reduced by a factor equal to 4.7. The results (2002)
7. C Brenneke, O Wulf, B Wagner, in IEEE/RSJ International Conference on
demonstrate that an embedded systems (with a low-cost Intelligent Robots and Systems, Using 3d laser range data for slam in
multiprocessor architecture) can operate under real-time outdoor environments. Las Vegas, Nevada, USA, 2003, pp. 188–193
constraints, if the software implementation is designed 8. A Prusak, O Melnychuk, H Roth, I Schiller, Pose estimation and map
carefully. To scale with larger environment, we are going building with a time-of-flight-camera for robot navigation. Int. J. Intell.
Syst. Technol. Appl. 5(3), pp. 355–364 (2008)
to include an approach of local/global mapping as pro- 9. F Abrate, B Bona, M Indri, in European Conference on Mobile Robots,
posed by [30]. Using this approach, we will be able to Experimental EKF-based SLAM for mini-rovers with IR sensors only.
map larger environment. The map joining system will be Freiburg, Germany, 2007
10. T Yap, C Shelton, in IEEE International Conference on Robotics and
implemented on the GPU coprocessor integrated on the Automation, SLAM in large indoor environments with low-cost, noisy, and
OMAP3530. sparse sonars. Kobe, Japan, 2009, pp. 1395–1401
Vincke et al. EURASIP Journal on Embedded Systems 2012, 2012:5 Page 14 of 14
http://jes.eurasipjournals.com/content/2012/1/5