Unsupervised Obstacle Detection in Driving Environments Using Deep-Learning-Based Stereovision
Highlights
• A stereovision-based hybrid deep autoencoder (HAE) approach to urban scene monitoring is developed.
• This system combines the advantages of deep Boltzmann Machines (DBM) and autoencoders.
• An unsupervised HAE-based one-class SVM is developed for obstacle detection in driving environments.
• A fast obstacle tracking approach based on density maps is developed.
• Two publicly available datasets, Malaga and Daimler, are used for validation.
• The detection results show the superior performance of the new combined HAE-OCSVM strategy.
Article history: Received 11 July 2017; Received in revised form 13 October 2017; Accepted 26 November 2017; Available online 6 December 2017.

Keywords: Deep learning; DBM; Autoencoder; OCSVM; Monitoring; Stereovision

Abstract

A vision-based obstacle detection system is a key enabler for the development of autonomous robots and vehicles and intelligent transportation systems. This paper addresses the problem of urban scene monitoring and tracking of obstacles based on unsupervised, deep-learning approaches. Here, we design an innovative hybrid encoder that integrates deep Boltzmann machines (DBM) and auto-encoders (AE). This hybrid auto-encoder (HAE) model combines the greedy learning features of DBM with the dimensionality reduction capacity of AE to accurately and reliably detect the presence of obstacles. We combine the proposed hybrid model with one-class support vector machines (OCSVM) to visually monitor an urban scene. We also propose an efficient approach to estimating the location of obstacles and tracking their positions via scene densities. Specifically, we address obstacle detection as an anomaly detection problem. If an obstacle is detected by the OCSVM algorithm, then a localization and tracking algorithm is executed. We validated the effectiveness of our approach using experimental data from two publicly available datasets, the Malaga stereovision urban dataset (MSVUD) and the Daimler urban segmentation dataset (DUSD). Results show the capacity of the proposed approach to reliably detect obstacles.

© 2017 Elsevier B.V. All rights reserved. https://doi.org/10.1016/j.robot.2017.11.014
Such systems are mainly based on multiple collections of views using visual sensors that can estimate depth and perceive three-dimensional (3D) components in a scene. For example, binocular stereovision is based on two rectified images (left and right) that are used to compute a disparity map (i.e., the displacement of an object between the two rectified images) such that the epipolar geometry constraints are fulfilled [1,2,12,5].

In the literature, there has been much discussion on obstacle detection techniques. For instance, some approaches are based on image descriptors such as the scale-invariant feature transform (SIFT), local binary patterns (LBP), regions of interest (ROI) based on sliding windows, and histograms of oriented gradients (HOG) [13]. Indeed, these techniques usually utilize manually designated features, such as vehicle motion, color and texture. Nadav and Katz [14], Broggi et al. [15], and Yamaguchi et al. [16] proposed obstacle detection using a monocular camera in the off-road environment. Häne et al. [17] proposed an obstacle detection approach in the on-road environment using monocular cameras. Labayrade et al. [1], Fakhfakh et al. [2], and Hu and Uchimura [12] proposed binocular stereo vision systems based on depth estimation via disparity maps for highways. Sun et al. [3] proposed a system for detection and tracking of moving obstacles in urban driving scenarios. Appiah and Bandaru [4] proposed an approach using stacked stereo 360° vertical cameras to perceive obstacles around an autonomous vehicle. Nalpantidis et al. [5] introduced a new representation of 3D scene structure named theta-disparity. The key idea of theta-disparity is to get a radial representation of the significant objects in a scene with respect to a point of interest based on a disparity map [4]. Woo and Kim [7] proposed vision-based obstacle detection and collision risk estimation for an unmanned surface vehicle. Based on the work of Labayrade et al. [1], Fakhfakh et al. [2], and Nalpantidis et al. [5], Burlacu et al. [18] presented an obstacle detection approach in stereo sequences using multiple representations of the disparity map. However, this approach is based on heavy scanning of images to look for obstacles without any certainty about the existence and kind of obstacles. This method requires intensive computation and is difficult to adapt to real-time applications. In addition, this method cannot distinguish obstacles from other objects.

In obstacle detection and localization, machine learning turns out to play an important role [19–21]. Many methods have been developed for improving obstacle detection and for handling new applications [22,13,23,24,21]. In learning-based obstacle detection methods, two classes can be distinguished: approaches based on shallow learning and those based on deep learning. Various shallow learning-based approaches have been investigated, such as training different classifiers by support vector machines (SVM), AdaBoost, and neural networks in supervised learning with one or two layers [22]. Robust approaches have been proposed by merging HOG with SVM for human detection based on single views [13]. However, shallow learning approaches are not suitable for representing dependencies between multiple variables, and they are inefficient in dealing with high-dimensional data, leading to unsuitable generalized models [23,21].

On the other hand, deep learning-based approaches have been developed to overcome these limitations. Indeed, deep convolutional neural networks are powerful tools in image classification. They have proved to be efficient on Google's ImageNet, which contains more than 1.3 million high-resolution images. Deep convolutional neural networks (CNNs) were first proposed by Nguyen et al. [25] for obstacle detection and recognition, but their efficiency was limited to 2D images. Ramos et al. [26] proposed an approach based on deep CNNs to detect unexpected obstacles. Despite the promising results obtained using the deep CNN approach for obstacle detection and recognition based on 2D images, some tasks, such as learning more about the data distribution, encoding data, reducing dimensionality, generating new data with a given joint distribution, and unsupervised learning, are not possible [24]. Restricted Boltzmann machines (RBM) and autoencoders are powerful deep architectures that overcome most of these limitations [23]. These deep-learning based approaches are usually implemented in three main steps: first, a heavy scanning of images; next, locating the surrounding ROI; and last, starting a recognition process. This complex process is automatically executed in both the presence and absence of obstacles, which is the main drawback of such an approach.

1.2. Motivation and contribution

To improve obstacle detection and classification, we start by checking the presence of obstacles before starting any heavy scanning of input images. In other words, our objective is to optimize the obstacle detection process by answering the question: are there any obstacles? Then, the localization, estimation and recognition processes are executed only if a potential obstacle exists. Here, we treat the problem of obstacle detection as an anomaly detection problem based on the V-disparity data distribution. In urban settings or on highways, the V-disparity data distribution, where V refers to the vertical coordinate in the (u, v) disparity map coordinate system [27,12], is mostly stable, with small variations due to measurement noise. The V-disparity can change significantly in the presence of obstacles. Our proposed system has four main stages, as shown in Fig. 1.

• First, the system employs an innovative hybrid framework for feature extraction and encoding. This is based on a hybrid encoder model that combines multiple layers of deep Boltzmann machines (DBM) as the feature extractor and an autoencoder (AE) for dimensionality reduction (V-disparity ⇒ code). In fact, we start with unsupervised greedy layer-wise training of the hybrid encoder using the V-disparity dataset. Two tasks are accomplished at the end of each layer: (1) discover and extract new features; (2) generate a new encoded output that will be used as input for the next layer. The proposed hybrid encoder architecture is built on four layers of DBM and AE.
• Second, we address obstacle detection as an anomaly detection problem based on the one-class support vector machine (OCSVM) classifier, which requires only obstacle-free data in training. The OCSVM is trained in an unsupervised way on data encoded by the hybrid encoder model. The central role of the OCSVM classifier is to separate inliers from outliers in the testing data by building a hyperplane [28]. Third, the presence of obstacles can be predicted. Towards this end, for a given V-disparity, a code is generated using the hybrid encoder model and the OCSVM classifier predicts whether it is an inlier or an outlier. Here, two models are built: the first model identifies free scenes and the second identifies busy scenes. The main reason to use two models is to improve decision making and reduce false alarms.
• Finally, the location of obstacles can be estimated based on density maps computed for both the V-disparity and U-disparity by checking changes in residuals, which represent the difference between the current values of the density maps and the previous values. Here, the three-sigma rule is used to detect changes in residuals.

The effectiveness of the developed hybrid approach is validated using experimental data from two publicly available datasets, the Malaga stereovision urban dataset and the Daimler urban segmentation dataset. Results show that the proposed approach is able to reliably detect obstacles.
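To make the disparity-map computation described at the beginning of this section concrete, the following sketch shows one common way to obtain a dense disparity map from a rectified stereo pair. It is an illustration only, not the authors' implementation: the file names are placeholders and the matcher parameters (search range, window size) are assumptions chosen for demonstration.

import cv2
import numpy as np

# Load a rectified stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities bounds the disparity search
# range and blockSize is the correlation window.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)

# compute() returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0
disparity[disparity < 0] = 0  # negative values mark invalid matches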
Fig. 1. Flowchart of the proposed vision-based obstacle detection and localization system.
2. Preliminary materials
Fig. 2. Autoencoders.
In this section, we briefly present an overview of the machine learning generative models used to build deep learning architectures, such as deep autoencoders, the Boltzmann machine and the restricted Boltzmann machine. More details about these generative models can be found in [24,29].

2.1. Autoencoders

An autoencoder is an artificial neural network [23] used for unsupervised learning that is trained to reconstruct its own inputs (i.e., predicting the value of the output x̂ given the input x via a hidden layer h; see Fig. 2). Autoencoders are widely used in dimensionality reduction and feature learning. Autoencoders comprise two parts: the encoder and the decoder. The encoder is defined by an encoder function h = Encoder(x), which can be a linear or nonlinear function. If the encoder function is nonlinear, the autoencoder has the capacity to learn more features than linear principal component analysis [23]. The purpose of the decoder part is to reconstruct its own inputs via the decoder function, x̂ = Decoder(h). The learning process of an autoencoder is achieved by minimizing the negative log-likelihood (loss function) of the reconstruction, given the encoding Encoder(x) [23]:

Reconstruction error = −log(P(x|Encoder(x))),   (1)

where P is the probability assigned to the input vector x by the model. Indeed, incorporating latent variable models has caused autoencoders to behave like generative models. Stacked autoencoder models have been widely applied in image denoising [30,31] and content-based image retrieval [32].

2.2. Restricted Boltzmann machine

Restricted Boltzmann machines (RBMs) can be viewed as stochastic neural networks [33] (see Fig. 3). RBMs consist of m visible units, v ∈ {0, 1}^m, and n hidden units, h ∈ {0, 1}^n. There are no visible-to-visible or hidden-to-hidden connections, although v and h are fully connected (see Fig. 3). The learning procedure comprises many steps of Gibbs sampling (propagate: sample hidden given visible; reconstruct: sample visible given hidden; repeat) and selecting the weights with minimum reconstruction error. Different learning algorithms for RBMs have been proposed, mostly based on Markov chain Monte Carlo (MCMC) sampling using Gibbs sampling to obtain an estimator of the log-likelihood gradient [23,34]. Moreover, RBMs are used to construct deeper models, such as deep belief networks (DBN) and the hierarchical probabilistic model known as the deep Boltzmann machine (DBM) [35].

RBMs are energy-based models and have been used as generative models for several types of data [23], such as text, speech and images. The energy function of the RBM configuration (v, h) is defined as
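As a concrete illustration of Eq. (1), the following NumPy sketch trains a one-hidden-layer autoencoder by gradient descent. The squared reconstruction error used here corresponds to the negative log-likelihood under a Gaussian output model; the layer sizes, learning rate and toy data are assumptions for demonstration only.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 64, 16, 0.1
W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((500, n_in))                  # toy training data

for epoch in range(100):
    h = sigmoid(X @ W1 + b1)                 # h = Encoder(x)
    x_hat = sigmoid(h @ W2 + b2)             # x_hat = Decoder(h)
    err = x_hat - X                          # gradient of 0.5*||x_hat - x||^2
    d2 = err * x_hat * (1.0 - x_hat)         # backprop through the decoder
    d1 = (d2 @ W2.T) * h * (1.0 - h)         # backprop through the encoder
    W2 -= lr * h.T @ d2 / len(X)
    b2 -= lr * d2.mean(axis=0)
    W1 -= lr * X.T @ d1 / len(X)
    b1 -= lr * d1.mean(axis=0)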
Energy(v, h) = −bᵀv − cᵀh − hᵀWv,   (2)

where W is the weight matrix and b and c are the visible and hidden bias vectors. The joint probability of a configuration (v, h) is

P(v, h) = e^{−Energy(v,h)} / Z,   (3)

Z = Σ_{v,h} e^{−Energy(v,h)},   (4)

where Z is the partition function. Since only v is observed, the hidden variables h are marginalized:

P(v) = Σ_h e^{−Energy(v,h)} / Z,   (5)

where P(v) is the probability assigned by the model to a given visible vector v. In terms of probability, since the hidden nodes are conditionally independent given the visible units (and vice versa), we can derive from Eq. (3):

P(v|h) = Π_i p(v_i|h),   (6)

P(h|v) = Π_j p(h_j|v),   (7)

with the conditional probabilities

p(h_j = 1|v) = σ(c_j + Σ_i W_ij v_i),   (8)

p(v_i = 1|h) = σ(b_i + Σ_j W_ij h_j),   (9)

where σ(·) is the logistic function and σ(x) = (1 + exp(−x))^{−1}. Hinton et al. [34] developed an extension of RBMs, the Gaussian–Bernoulli RBM, to deal with different data types such as real-valued vectors (e.g., pixel intensities of an image), in which v ∈ R^m and the hidden units h ∈ {0, 1}^n. For the Gaussian–Bernoulli RBM, the joint energy is:

Energy(v, h) = Σ_{i=1}^{I} (v_i − c_i)² / (2σ_i²) − Σ_{i=1}^{I} Σ_{j=1}^{J} (v_i/σ_i) W_ij h_j − Σ_{j=1}^{J} b_j h_j.   (10)

The aim of training RBMs is to adjust the model's parameters (the weight matrix w) (see Eq. (11)). This task is achieved by maximizing the probability of the training data under the model.

2.3. Deep belief networks

Deep belief networks (DBNs) are probabilistic generative models that are based on stacked RBMs (see Fig. 4). DBNs have been used in many challenging learning problems, such as real-time classification [37], audio classification [38], speech synthesis [39], and facial expression recognition [40]. They exhibit high efficiency in discovering layer-by-layer complex nonlinearity. Furthermore, DBNs have been used successfully in dimensionality reduction [34,41]. Hinton et al. [34] introduced a fast unsupervised learning algorithm for DBNs in which the joint distribution between the observed vector x and the ℓ hidden layers h^k is expressed as follows:

P(x, h^1, ..., h^ℓ) = (Π_{k=0}^{ℓ−2} P(h^k | h^{k+1})) P(h^{ℓ−1}, h^ℓ).   (12)

2.4. Deep Boltzmann machines

Salakhutdinov and Hinton [35] proposed a new learning algorithm for a hierarchical probabilistic model called the deep Boltzmann machine (DBM). A DBM is a generative model with many layers of hidden variables in which the connections between layers are undirected (see Fig. 5). Whereas RBMs are a kind of Markov random field, DBMs learn increasingly complex representations of the given data and incorporate uncertainty about ambiguous, missing or noisy inputs. DBMs are able to extract complex statistical structures and are applicable to various applications, such as object recognition [42] and computer vision [43]. Salakhutdinov and Larochelle [44] optimized all layers of DBM parameters jointly.
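The Gibbs-sampling-based learning procedure described above can be illustrated with a single contrastive-divergence (CD-1) update for a binary RBM. This is a schematic NumPy sketch, not the authors' code: bias updates are omitted for brevity and all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, n, lr = 64, 32, 0.01              # visible units, hidden units, learning rate
W = rng.normal(0, 0.01, (m, n))
b = np.zeros(m)                      # visible biases
c = np.zeros(n)                      # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_gradient(v0):
    # Positive phase: p(h = 1 | v) = sigmoid(c + W^T v), Eq. (8).
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct v given h (Eq. (9)), then resample h.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # One-step approximation of the log-likelihood gradient w.r.t. W.
    return v0[:, None] * ph0[None, :] - pv1[:, None] * ph1[None, :]

v = (rng.random(m) < 0.5).astype(float)   # toy binary training vector
W += lr * cd1_gradient(v)                 # gradient-ascent step on log P(v)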
2.5. One-class support vector machine

The one-class support vector machine (OCSVM) [45] is an efficient, unsupervised learning algorithm that learns decision functions for anomaly detection. OCSVM returns a function f(x) with +1 or −1 to indicate whether the data point is an ''inlier'' or an ''outlier'', respectively. Its decision function f(x) is defined as:

f(x) = +1, if x lies in the region capturing most of the data points; −1, otherwise.   (15)

OCSVM, which is based on kernels (see Eq. (16)) such as the radial basis function (RBF) (see Eq. (17)), maps the input data into a high-dimensional feature space F and finds the hyperplane that maximizes the margin that best separates the training data from the origin:

K(x, y) = (Ψ(x) · Ψ(y)),   (16)

where x and y are input vectors, Ψ is a feature map X → F, and X is the set of observed x. The RBF kernel is also known as a Gaussian kernel:

K_RBF(x, y) = exp(−‖x − y‖² / (2σ²)).   (17)

The selection of the hyperplane separating the training dataset from the origin is achieved by solving the following quadratic optimization problem:

min_{w∈F, ξ∈R^l, ρ∈R}  (1/2)‖w‖² + (1/(νl)) Σ_i ξ_i − ρ,   (18)

subject to (w · Ψ(x_i)) ≥ ρ − ξ_i, ξ_i ≥ 0,

where ν ∈ (0, 1] is a parameter that characterizes the solution, w is a weight vector and ρ is an offset.

In the disparity map computation (e.g., using the sum of absolute differences, SAD), I_left and I_right respectively denote the left and right image pixel intensities, d is the disparity in the range [d_min, d_max], d_min and d_max are respectively the minimum and maximum disparity values, ω is the window size, and i, j are the coordinates (rows and columns, respectively) of the center pixel of the SAD or any other correlation measure.

The V-disparity map, which gives a good estimation of the road's profile based on the Hough transform and depth estimation, provides information about the height of obstacles and their positions with respect to the ground [1,2]. The main steps used to compute the V-disparity are given in Algorithm 1.

Algorithm 1: V-disparity computation steps.
Input: Disparity map DispMap(rows, cols)
Input: Dmax: maximum disparity value.
Output: V-disparity DispMapv(rows, Dmax)
1 for each row r in DispMap do
2   for each column c in DispMap do
3     currentDisparity ← DispMap(r, c)
4     if currentDisparity > 0 then
5       DispMapv(r, currentDisparity) ← DispMapv(r, currentDisparity) + 1

On the other hand, a U-disparity map provides information about the width of obstacles and depth estimation [1,2,12]. Algorithm 2 describes the main steps to compute the U-disparity.

A density map is a compact representation of the V-disparity that preserves the essential information. To compute the density map, the V-disparity is segmented into many small cells (see Fig. 6), and the density of each cell is derived as follows:

Density_Cell = (Σ_{(i,j)∈Cell} I(i, j)) / (w · h),

where I(i, j) is the V-disparity intensity at pixel (i, j) of the cell, and w and h are the width and height of the cell.
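For illustration, the following NumPy sketch implements the row-wise disparity histogram of Algorithm 1, a column-wise U-disparity accumulation (Algorithm 2 is referenced but not listed here, so this routine is our assumption of its standard form), and the cell-density computation above with an assumed 5 × 5 cell size.

import numpy as np

def v_disparity(disp_map, d_max):
    # Algorithm 1: for each image row, histogram the disparities in that row.
    rows = disp_map.shape[0]
    v_disp = np.zeros((rows, d_max), dtype=np.int32)
    for r in range(rows):
        for d in disp_map[r]:
            if 0 < d < d_max:
                v_disp[r, int(d)] += 1
    return v_disp

def u_disparity(disp_map, d_max):
    # Column-wise counterpart: obstacles appear as horizontal segments.
    cols = disp_map.shape[1]
    u_disp = np.zeros((d_max, cols), dtype=np.int32)
    for c in range(cols):
        for d in disp_map[:, c]:
            if 0 < d < d_max:
                u_disp[int(d), c] += 1
    return u_disp

def density_map(v_disp, M=5, N=5):
    # Split the V-disparity into M x N cells and average each cell's
    # intensity: Density_Cell = (sum of I(i, j) over the cell) / (w * h).
    rows, cols = v_disp.shape
    dens = np.zeros((rows // M, cols // N))
    for m in range(rows // M):
        for n in range(cols // N):
            cell = v_disp[m * M:(m + 1) * M, n * N:(n + 1) * N]
            dens[m, n] = cell.sum() / (M * N)
    return dens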
Fig. 9. Block diagram of the deep encoders architecture with two OCSVM classifiers.
Obstacles are presented by vertical lines with high intensities (see Fig. 8(a)). If the obstacle is closer to the right side of the V-disparity map, the distance between the obstacle and the vehicle is smaller. The thickness of the detected obstacle decreases as the obstacle moves further away from the mobile robot. The vertical length of the vertical line represents the height, h, of the actual obstacle in the image. The greater the thickness of the obstacle in the V-disparity map, the bigger the obstacle in the image (e.g., buses, cars, and pedestrians). Fig. 8(b) shows pedestrians walking on the road. From the V-disparity, it can be seen that vertical lines on the road profile indicate the presence of these obstacles (i.e., pedestrians). In the U-disparity, obstacles appear as fragments of horizontal lines (see Fig. 8(b)). The length of a fragment is the width of the detected obstacle, and the starting x-coordinate of each fragment represents the x-coordinate of the obstacle. By using V-disparity and U-disparity, the width, height, and x and y coordinates of the detected obstacle can be extracted. Algorithm 3 describes the steps to surround an obstacle with an ROI.

Algorithm 3: Obstacle localization steps.
Input: Disparity map: DMap
Output: Vector of regions of interest: RROI
1 V ← BuildVDisparity(DMap);
2 U ← BuildUDisparity(DMap);
3 D: the disparity range of the obstacle;
4 (x, y): coordinates of the obstacle in the original image;
5 (h, w): height and width of the obstacle;
6 Extract road profile RP from V;
7 OBS ← FindStandingObstacle(RP);
8 for each obstacle O in OBS do
9   ➥ Determine D and y from V;
10  ➥ Determine the obstacle height h located in V;
11  ➥ Determine w and x using D from U;
12  ➥ Append (x, y, h, w) to RROI;
13 return RROI

4. Proposed hybrid deep autoencoder-based obstacle detection approach

The proposed hybrid deep autoencoder (HAE) consists of four layers. Each layer is the combination of a DBM and an autoencoder. In each layer, useful features are extracted and encoded in an output code. Then, the generated code is used for the next layer. The output of the last layer is used as the input to the one-class classifier. Specifically, the one-class classifier builds boundaries to separate normal (without obstacles) and abnormal (presence of obstacles) cases. In this approach, two models are constructed to enhance accuracy and reduce false alarms. The first is built with unsupervised learning of images with obstacles and the second is built with unsupervised learning based on images without obstacles. False alarms can be reduced by comparing the outputs of the two models. Fig. 9 schematically summarizes the proposed system, which is based on a deep learning architecture trained entirely in an unsupervised way. The main steps of the proposed approach are summarized in Algorithm 4.

Algorithm 4: Hybrid deep encoder approach.
Input: Dataset of image pairs (Left, Right): TrainingDataset
Output: Dataset of encoded V-disparities: EncodedDataset
1 for each tuple (Left, Right) in TrainingDataset do
2   DisparityMap ← ComputeDisparityMap(Left, Right)
3   V-Disparity ← ComputeVDisparity(DisparityMap)
4   X ← V-Disparity
5   for each layer λ in HAE layers do
6     outputDBM ← LearnFeaturesDBM(X)
7     outputλ ← EncodeAE(outputDBM)
8     X ← outputλ
9   EncodedDataset ← add(X)  /* Add X to EncodedDataset */
10 OCSVMModel ← train(EncodedDataset)

Definition 1 (Operating Area). Let us define an operating area as the region in front of a vehicle (see Fig. 10). The dimensions of this region are expressed as a range of disparities, where δ is the disparity range, and δmin and δmax are the minimum and maximum disparity values, respectively.

The proposed procedure is implemented in several steps, as summarized in Table 1.

4.1. Hybrid deep architecture training

In this section, we describe the approach used to train the proposed deep architecture, starting with building the hybrid deep encoder based on unsupervised training. Then, the one-class classifier is trained to learn how to classify the encoded data obtained from the hybrid deep encoder.

Deep hybrid encoder training. The proposed system is based on two models, which are implemented in parallel (see Fig. 9). Each model merges a deep DBM with an autoencoder to enhance the quality of the generated encoded datasets (see Fig. 11). These models are trained with an input dataset that contains rectified left and right images. Specifically, we train the first model with image sequences that contain mostly free scenes with a few obstacles. At the same time, we train the second model with data containing mostly scenes with obstacles. This hybrid deep encoder allows the system to learn a complex data distribution and encode the input images. It is also able to reconstruct the input with reduced errors.
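The layer-wise data flow of the hybrid encoder (Algorithm 4, lines 5–8) can be summarized in a few lines. The sketch below is schematic: the DBM and AE layer objects with transform/encode methods are hypothetical stand-ins for the trained components, shown only to fix the data flow.

def encode_v_disparity(v_disparity_vec, hae_layers):
    # hae_layers: sequence of (dbm, ae) pairs; the paper uses four layers.
    x = v_disparity_vec
    for dbm, ae in hae_layers:
        features = dbm.transform(x)  # greedy feature extraction (DBM)
        x = ae.encode(features)      # dimensionality reduction (AE)
    return x                         # final code passed to the OCSVM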
Table 1
Main steps of the proposed system.
Step Action
❶ Stereoimage acquisition from the stereovision device.
➥ Input: Left and right images.
➥ Output: Rectified left and right images.
❷ Compute disparity map:
➥ Input: Rectified left and right images.
➥ Output: Disparity map.
❸ Compute V-disparity map (see Algorithm 1)
➥ Input: Disparity map.
➥ Output: V-disparity map.
❹ Check existence of obstacles (Detection): Apply the hybrid deep encoder-based
OCSVM for obstacle detection.
➥ Input: Encoded V-disparity map.
➥ Output: Prediction, P ∈ {Yes, No}.
❺ Compute scene density: Compute Density map using V-disparity density
➥ Input: Encoded V-disparity map.
➥ Output: Density estimation.
❻ Track obstacle locations (tracking): Based on the previous density map,
predict the new obstacle locations by tracking density changes.
➥ Input: Density map.
➥ Output: Estimation of the obstacles localization.
❼ Compute U-disparity map: Compute U-disparity map based on the boundaries of
the vehicle operating area (see Algorithm 2).
➥ Input: Disparity map.
➥ Output: Obstacles region of interest (ROI) (see Fig. 12).
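To fix ideas, the detection loop of Table 1 can be written as the schematic sketch below. Every stage is injected as a callable, because the concrete implementations (disparity computation, HAE encoding, density tracking) are only assumed here, not taken from the paper; the earlier sketches give possible forms for some of them.

def process_stereo_pair(left, right, stages, ocsvm):
    # stages: dict of callables standing in for steps 2-7 of Table 1.
    disp = stages["disparity"](left, right)     # step 2: disparity map
    v_disp = stages["v_disparity"](disp)        # step 3: V-disparity map
    code = stages["encode"](v_disp.ravel())     # step 4: HAE encoding
    if ocsvm.predict([code])[0] == -1:          # step 4: outlier => obstacle
        dens = stages["density"](v_disp)        # step 5: scene density
        return stages["localize"](dens, disp)   # steps 6-7: tracking + ROI
    return None                                 # no obstacle detected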
Training the one-class classifier. In the proposed approach, the OCSVM classifier, which is an unsupervised classifier, is trained with the encoded V-disparity maps generated from the two constructed models of the hybrid deep encoder. As described above, we implement two OCSVMs: the first aims to detect outliers from the encoded V-disparity map of the model trained with obstacles; the second is used to detect outliers from the encoded V-disparity map of the model trained without obstacles (see Fig. 9).
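A minimal sketch of this training stage with scikit-learn's OneClassSVM is shown below, using the RBF kernel and the ν and γ values of Table 2. The encoded features are random placeholders standing in for the HAE codes of obstacle-free scenes.

import numpy as np
from sklearn.svm import OneClassSVM

encoded_free = np.random.rand(1000, 32)   # placeholder for HAE codes

ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
ocsvm.fit(encoded_free)                   # unsupervised: trained on one class only

new_codes = np.random.rand(5, 32)
labels = ocsvm.predict(new_codes)         # +1 = inlier, -1 = outlier (Eq. (15))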
Obstacle localization and tracking. After detecting an obstacle using our HAE-OCSVM approach, it is important to locate its position. The proposed approach for obstacle localization and tracking is schematically presented in Fig. 12. This approach is based on the V-disparity and U-disparity maps, which are useful for obstacle localization. In fact, each row in the density map of the V-disparity represents an area that potentially contains obstacles (following the Y-coordinate axis). The density map of the V-disparity is therefore useful for detecting and tracking obstacles moving vertically. On the other hand, the columns of the density map obtained from the U-disparity represent areas that potentially contain obstacles (following the X-coordinate axis). Thus, the density map of the U-disparity can be used to detect and track obstacles moving horizontally. In this approach, the V-disparity and U-disparity maps are computed based on the disparity map of the two input images. Of course, the density map can be used as an indicator to determine the position of the detected obstacle. Towards this end, we analyze the trend of the previous density maps to track changes. Specifically, we apply the three-sigma rule (i.e., the Shewhart monitoring chart) [48] on the density map column-wise to detect changes.

The proposed approach is implemented in several steps: firstly, the V-disparity and U-disparity maps are computed based on the disparity map of the two input images. Then, the road profile is extracted using the Hough transform, which helps to determine a line representing the road. Obstacles on the road are represented by vertical lines on the V-disparity map. Their height and depth can be estimated via distances in the V-disparity. Their width can be determined by processing the U-disparity map. Thus, we can surround the obstacles in the ROI. Specifically, by crossing the U-disparity and V-disparity maps, we can surround the obstacles and estimate their positions and distances.

Fig. 11. Deep Boltzmann machines with autoencoders.

Fig. 12. Block diagram of the obstacle localization process.

5. Experimental results and discussion

5.1. Data description

This section reports on the effectiveness of the proposed hybrid encoder approach. Towards this end, we performed experiments on two practical datasets: the Malaga stereovision urban dataset (MSVUD) [49] and the Daimler urban segmentation dataset (DUSD) [50,51]. The MSVUD comprises 15 sub-datasets (extracts) of rich urban scenarios of more than 20 km in length with a resolution of 800 × 600 pixels, recorded in different situations (with and without traffic), such as a straight path, turns, roundabouts, avenue traffic and highway. The DUSD contains image sequences recorded in urban traffic. It consists of rectified stereo image pairs with a resolution of 1024 × 440 pixels [51].

Two sub-datasets of MSVUD are used in the training phase. The first dataset, which is extract number 5 (avenue loop closure, 1.7 km), consists of 5000 pairs of images, and the second dataset is extract number 8 (long loop closure, 4.5 km), which consists of 10,000 pairs of images. These two extracts (5, 8) are composed mainly of free scenes. In the testing phase, we used two sub-datasets of MSVUD: extract number 10 (multiple loop closures), which consists of 9000 pairs of images, and extract number 12 (a long avenue of 3.7 km with traffic), which consists of 11,000 pairs of images. In addition, the DUSD dataset is used for obstacle detection with 500 pairs of images.

To do so, we used two MSVUD datasets for testing purposes [49]. The first dataset, termed FREE-DST, contains 20% fuzzy situations and 80% free roads. The second dataset, called BUSY-DST, contains 90% fuzzy situations and 10% true obstacles (vehicles, motorbikes and pedestrians). This distribution is motivated by the fact that in normal urban driving scenarios, the car is moving most of the time, unless the vehicle is stuck in traffic. Both datasets, FREE-DST (3563 pairs of images) and BUSY-DST (1437 pairs of images), were generated randomly from extracts 10 and 12 of MSVUD.

In this study, the effectiveness of three obstacle detection approaches, each consisting of two layers (a deep encoder and a one-class classifier), is assessed and compared. Indeed, we used three different deep encoders: (i) the proposed hybrid autoencoder (HAE), (ii) the deep belief network (DBN), and (iii) stacked autoencoders (SDA). Also, we used two one-class classifiers, OCSVM and SVDD. The experimental parameters of the machine learning approaches studied in this paper are presented in Table 2.

Table 2
Parameter settings of the studied approaches.

Models           Parameter             Value
DBM              Learning rate         0.01
                 Gibbs sampling (k)    15
                 Training epochs       100
Autoencoder      Learning rate         0.01
                 Training epochs       100
OCSVM            Kernel                RBF
                 γ (RBF)               0.1
                 ν                     0.1
Operating area   δmin                  32 (pixels)
                 δmax                  64 (pixels)

5.2. Model trained with free scenes (FSM)

To build an efficient and accurate model able to predict free scenes and reject scenes with obstacles, we trained the one-class classifier with the V-disparities of free scenes. Sometimes there were confusing (fuzzy) situations in which obstacles were in the field of view of the vehicle but not in the operating area. This classifier is constructed to fit the free scenes and fuzzy situations and to reject busy scenes. Examples of free scenes and their corresponding V-disparity maps are shown in Fig. 13. From the V-disparity maps shown in Fig. 13(a)–(d), it can be seen that the road profile is clearly apparent as a visible inclined line of cloud points without an accumulation of high-intensity pixels. So, from Fig. 13(a)–(d), it seems that there is no obstacle in the road. It can also be seen that the static environment (vertical line) is in the low V-disparity area, which means it is far away from the vehicle.

We evaluated the effect of the number of samples in the training dataset on the accuracy of the proposed hybrid model. To do so, we varied the number of samples in the training dataset from 500, 1000 and 2000 to 5000 and evaluated the accuracy of the proposed HAE-OCSVM algorithm compared to both the SDA- and DBN-based OCSVM algorithms (see Table 3). In each experiment, we measured the inliers, called true positives (TP), accepted by the OCSVM, and the outliers, called false positives (FP), rejected by the OCSVM.
Fig. 13. Examples of free scenes. (Right) Original input image and (Left) its corresponding V-disparity map.
Table 3
Performance comparison between HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM based on FREE-DST.

Dataset (Samples)   Approach     Inliers (TP)   Outliers (FP)
500                 DBN-OCSVM    89.78          10.22
                    HAE-OCSVM    99.51          0.49
                    SDA-OCSVM    89.39          10.61
1000                DBN-OCSVM    89.45          10.55
                    HAE-OCSVM    99.95          0.05
                    SDA-OCSVM    90.30          9.70
2000                DBN-OCSVM    90.25          9.75
                    HAE-OCSVM    99.92          0.08
                    SDA-OCSVM    90.08          9.92
5000                DBN-OCSVM    89.89          10.11
                    HAE-OCSVM    99.73          0.27
                    SDA-OCSVM    90.57          9.43

Table 4
Performance comparison between HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM methods applied to the BUSY-DST dataset.

Dataset (Samples)   Approach     Inliers (TP)   Outliers (FN)
500                 DBN-OCSVM    63.89          36.11
                    HAE-OCSVM    81.98          18.02
                    SDA-OCSVM    52.96          47.04
1000                DBN-OCSVM    63.96          36.04
                    HAE-OCSVM    94.79          5.21
                    SDA-OCSVM    53.38          46.62
2000                DBN-OCSVM    64.51          35.49
                    HAE-OCSVM    91.24          8.76
                    SDA-OCSVM    52.96          47.04
5000                DBN-OCSVM    41.13          58.87
                    HAE-OCSVM    86.44          13.56
                    SDA-OCSVM    52.55          47.45
Table 3 shows that when 500 samples were used for training, the accuracy in percentage (TP, FP) of the proposed HAE-OCSVM method was 99.51 and 0.49, respectively, while that of DBN-OCSVM was 89.78 and 10.22 and that of SDA-OCSVM was 89.39 and 10.61. It can be seen that the accuracy of the proposed method increases with the number of samples in the training data.

With 5000 training samples, the HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM methods respectively yielded 99.73 and 0.27, 89.89 and 10.11, and 90.57 and 9.43 percent accuracy. The results show that the proposed method outperformed DBN-OCSVM and SDA-OCSVM and exhibited the highest accuracy. This is mainly due to its strong ability to learn complex structures from training data.

We also assessed the performance of the previously constructed models trained with free scenes using BUSY-DST. Fig. 14 shows examples of busy situations. From Fig. 14(a, c and d), it can be seen that an area with visible pixel intensities is present in the road profile, and the static environment (vertical line) is located in the middle of the V-disparity. Thus, the scene contains an obstacle, and its static environment is close to the vehicle. The static environment in Fig. 14(b) is unusually thick due to the sky fragment with low texture. Table 4 shows the high prediction accuracy of the proposed method compared to the DBN-OCSVM and SDA-OCSVM methods. This is due to integrating the DBM, which is able to learn and extract complex data, with encoder-based dimensionality reduction, thus improving the feature extraction. These results indicate that the proposed method learns complex structures of the input data.

Fig. 14. Examples of busy scenes. (Right) Original input image and (Left) its corresponding V-disparity map.

Fig. 15 presents the area-under-curve (AUC) values corresponding to the proposed HAE-OCSVM method and the DBN-OCSVM and SDA-OCSVM methods for different training data sizes. We note that the HAE-OCSVM method performed better than the other models due to the combination of two powerful deep learning architectures (DBMs as feature extractors and the autoencoder for dimensionality reduction) and the extended capacity of the OCSVM algorithm to detect outliers.

Fig. 15. AUC of the proposed HAE-OCSVM method compared to the DBN-OCSVM and SDA-OCSVM methods for different training-sample sizes.

5.3. Model trained with busy scenes (BSM)

To build a model that rejects free scenes and describes busy scenes, we trained the three deep encoders (HAE, SDA, DBN) with a dataset containing sequences of busy roads (with traffic), as described above. To validate the proposed model, we generated a new dataset from BUSY-DST composed of 400 true obstacles, named OBS-DST. Table 5 presents the testing results of the HAE-, SDA- and DBN-based OCSVM methods applied to the OBS-DST dataset. The proposed method achieved a high prediction accuracy of 99.79%, compared to 91.12% and 95.20% accuracy using DBN-OCSVM and SDA-OCSVM, respectively (see Table 5). Again, the overall performance of the proposed HAE-OCSVM is better than that of DBN-OCSVM and SDA-OCSVM due to the fact that DBMs are robust feature detectors that capture data correlations. In addition, complex data-dependent statistics can be discovered for learning through multiple layers.

Table 5
Performance of the HAE-OCSVM, DBN-OCSVM and SDA-OCSVM methods trained with busy scenes and tested on OBS-DST.

Encoders     Inliers (TN)   Outliers (FN)
DBN-OCSVM    91.12          8.88
HAE-OCSVM    99.79          0.21
SDA-OCSVM    95.20          4.80

5.4. Identification of confusing (fuzzy) situations

Now, we focus on the identification of confusing situations. In such situations, the output response could be a free scene, a busy scene, or a fuzzy (confusing) situation. These confusing situations can increase the number of false alarms. For this reason, we have to deal with fuzzy situations. Fig. 16 shows a few examples of fuzzy situations in which it is not easy to determine whether or not the scene is free. From Fig. 16(a)–(d), it can be seen that the static environment is close to the vehicle, which is not the case in a free scene. Also, here the vehicle is coming close to a bend. These are confusing situations.

Here, we propose an approach to identify fuzzy situations as distinct from busy and free scenes. Towards this end, we compare the responses of the FSM and BSM models to identify fuzzy situations. If both models are flagged, the tested case is considered a fuzzy situation. By this, we can identify and filter fuzzy situations from busy and free situations.

After identifying fuzzy scenes, two cases can be distinguished: true alarms and warning alarms. A true alarm occurs if there is an obstacle in the operating area of the vehicle (see Fig. 10). Fig. 17 shows two examples of true alarms (i.e., the presence of obstacles in the operating area). On the other hand, a warning alarm is declared if there is an obstacle in the field of view but outside the operating area of the vehicle (see Fig. 18). To distinguish between true alarms and warning alarms, we use the U–V disparity on the operating area of the vehicle to estimate the obstacle locations. If the obstacle is inside the operating area, then it is considered a true alarm; otherwise, it is considered a warning alarm.

Fig. 17. Obstacle detection: true alarm examples. In each image, the colored boxes represent the predicted ROI area of the detected obstacles.

Fig. 18. False alarm examples. In each image, the colored boxes represent the predicted ROI area of the detected obstacles.

Here, we investigate the capability of this approach to distinguish between warning and true alarms. To do so, we test both the BSM and FSM models with the BUSY-DST dataset, which comprises 1437 examples: confirmed obstacles (417 scenes) and 1020 fuzzy situations.
After applying the identification approach, we find 59% warning alarms and 41% true alarms. This distribution (see Fig. 19) is obtained according to the chosen dimensions of the operating area. We can make this area stricter or more flexible by extending or reducing the disparity range.

5.5. Obstacle detection based on one-class classifiers

After constructing the two hybrid models trained with BUSY-DST and FREE-DST, respectively, we assess the performance of the proposed HAE-based OCSVM obstacle detection approach and compare our results with those of five algorithms: DBN-OCSVM, SDA-OCSVM, HAE-SVDD, DBN-SVDD and SDA-SVDD. A benefit of SVMs is their ability to map problems into higher spatial dimensions using kernels, allowing a non-linear relationship to appear fairly linear. Here, we aim to exploit the advantages of the HAE model and those of the OCSVM with RBF kernel functions to improve the detection of obstacles. Table 6 presents a comparison between the HAE-OCSVM method and the other studied classifiers. The results show that the combined HAE-OCSVM detection scheme outperforms the other algorithms used in this study. OCSVM-based detection also surpassed the SVDD-based detection algorithms. This is related to the phenomenon of empty spaces inside the hypersphere suffered by the SVDD.

The implementation of these methods consists of two phases: off-line training (learning), in which the models are constructed and then used to detect obstacles in future data (i.e., testing), and on-line detection, in which the online measurement data are processed and the constructed models are used to detect obstacles. For each obstacle detection method, a processing time is computed
Each element d_mn of the density map is computed over a cell of the V-disparity as

d_mn = ( Σ_{r=R}^{R+4} Σ_{c=C}^{C+4} V-disparity(r, c) ) / (M · N),

where R = (m − 1) · M and C = (n − 1) · N.

We check whether there is any change in the columns of the density map by using density map information from previous scenes. In other words, we use the residuals, E = [e_1, e_2, ..., e_n], which represent the difference between the columns of the current density map and those of the previous density map, as change indicators. Without obstacles, the residuals are close to zero, apart from measurement noise, and they deviate significantly from zero in the presence of obstacles. First, we remove the mean from the density map data, and then we apply the three-sigma rule to the residuals to detect potential changes. The upper and lower control limits, denoted respectively UCL and LCL, for the residuals are defined as

UCL = μ_e + 3σ_e,   LCL = μ_e − 3σ_e,

where μ_e and σ_e are the mean and standard deviation of the residuals.
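A small sketch of this column-wise three-sigma test is given below, under the assumption that consecutive density maps are compared column by column; it is illustrative, not the authors' implementation.

import numpy as np

def detect_changes(prev_dens, curr_dens):
    # Column-wise residuals between consecutive density maps: E = [e1, ..., en].
    residuals = (curr_dens - prev_dens).sum(axis=0)
    residuals = residuals - residuals.mean()   # remove the mean, as in the text
    sigma = residuals.std()
    ucl, lcl = 3.0 * sigma, -3.0 * sigma       # three-sigma control limits
    # Columns whose residual leaves the control band flag a potential obstacle.
    return np.where((residuals > ucl) | (residuals < lcl))[0]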
Fig. 20. Tracking obstacle locations based on a density map. In each plot, the solid colored lines represent the residuals of the density map. The dashed horizontal lines labeled UCL and LCL denote the upper and lower control limits of the Shewhart chart.
References

[18] A. Burlacu, S. Bostaca, I. Hector, P. Herghelegiu, G. Ivanica, A. Moldoveanul, S. Caraiman, Obstacle detection in stereo sequences using multiple representations of the disparity map, in: 2016 20th International Conference on System Theory, Control and Computing (ICSTCC), IEEE, 2016, pp. 854–859.
[19] D. Petković, A.S. Danesh, M. Dadkhah, N. Misaghian, S. Shamshirband, E. Zalnezhad, N.D. Pavlović, Adaptive control algorithm of flexible robotic gripper by extreme learning machine, Robot. Comput.-Integr. Manuf. 37 (2016) 170–178.
[20] M. Duguleana, F.G. Barbuceanu, A. Teirelbar, G. Mogan, Obstacle avoidance of redundant manipulators using neural networks based reinforcement learning, Robot. Comput.-Integr. Manuf. 28 (2) (2012) 132–146.
[21] Y. Bengio, Y. LeCun, et al., Scaling learning algorithms towards AI, Large Scale Kernel Mach. 34 (5) (2007) 1–41.
[22] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 743–761.
[23] Y. Bengio, et al., Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
[24] G.E. Hinton, Learning multiple layers of representation, Trends Cogn. Sci. 11 (10) (2007) 428–434.
[25] V.D. Nguyen, H. Van Nguyen, D.T. Tran, S.J. Lee, J.W. Jeon, Learning framework for robust obstacle detection, recognition, and tracking, IEEE Trans. Intell. Transp. Syst. (2016).
[26] S. Ramos, S. Gehrig, P. Pinggera, U. Franke, C. Rother, Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling, 2016. arXiv preprint arXiv:1612.06573.
[27] R. Labayrade, D. Aubert, In-vehicle obstacles detection and characterization by stereovision, in: Proceedings of the 1st International Workshop on In-Vehicle Cognitive Computer Vision Systems, Graz, Austria, 2003.
[28] S.M. Erfani, S. Rajasegarar, S. Karunasekera, C. Leckie, High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning, Pattern Recognit. 58 (2016) 121–134.
[29] J. Xu, H. Li, S. Zhou, An overview of deep generative models, IETE Tech. Rev. 32 (2) (2015) 131–139.
[30] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1096–1103.
[31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (Dec) (2010) 3371–3408.
[32] A. Krizhevsky, G.E. Hinton, Using very deep autoencoders for content-based image retrieval, in: ESANN, 2011.
[33] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, Technical Report CU-CS-321-86, 1986.
[34] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[35] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Artificial Intelligence and Statistics, 2009, pp. 448–455.
[36] A.-r. Mohamed, G.E. Dahl, G. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio Speech Lang. Process. 20 (1) (2012) 14–22.
[37] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, M. Pfeiffer, Real-time classification and sensor fusion with a spiking deep belief network, Front. Neurosci. 7 (2013).
[38] H. Lee, P. Pham, Y. Largman, A.Y. Ng, Unsupervised feature learning for audio classification using convolutional deep belief networks, in: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, Curran Associates, 2009, pp. 1096–1104. http://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks.pdf.
[39] S. Kang, X. Qian, H. Meng, Multi-distribution deep belief network for speech synthesis, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 8012–8016.
[40] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[41] R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, in: AISTATS, Vol. 11, 2007.
[42] B. Leng, X. Zhang, M. Yao, Z. Xiong, A 3D model recognition mechanism based on deep Boltzmann machines, Neurocomputing 151 (2015) 593–602.
[43] Q. Gan, C. Wu, S. Wang, Q. Ji, Posed and spontaneous facial expression differentiation using deep Boltzmann machines, in: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 643–648.
[44] R. Salakhutdinov, H. Larochelle, Efficient learning of deep Boltzmann machines, in: AISTATS, Vol. 9, 2010, pp. 693–700.
[45] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[46] D.M. Tax, R.P. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[47] C. Georgoulas, L. Kotoulas, G.C. Sirakoulis, I. Andreadis, A. Gasteratos, Real-time disparity map computation module, Microprocess. Microsyst. 32 (3) (2008) 159–170.
[48] D.C. Montgomery, Introduction to Statistical Quality Control, John Wiley & Sons, New York, 2009.
[49] J.-L. Blanco, F.-A. Moreno, J. González-Jiménez, The Málaga Urban Dataset: High-rate stereo and lidars in a realistic urban scenario, Int. J. Robot. Res. 33 (2) (2014) 207–214. http://www.mrpt.org/MalagaUrbanDataset.
[50] T. Scharwächter, M. Enzweiler, U. Franke, S. Roth, Efficient multi-cue scene segmentation, in: German Conference on Pattern Recognition, Springer, 2013, pp. 435–445.
[51] T. Scharwächter, M. Enzweiler, U. Franke, S. Roth, Stixmantics: A medium-level model for real-time semantic scene understanding, in: European Conference on Computer Vision, Springer, 2014, pp. 533–548.

Abdelkader Dairi received an Engineer degree in computer science from the University of Oran 1 Ahmed Ben Bella, Algeria, in 2003, and a Magister degree from the National Polytechnic School of Oran, Algeria, in 2006. He is currently preparing his Ph.D. degree in computer science at the University of Oran 1 Ahmed Ben Bella under the supervision of Prof. Mohamed Senouci. His current research interests include machine learning, computer vision, image processing and mobile robotics.

Fouzi Harrou received the Dipl.-Ing. in Telecommunications from Abou Bekr Belkaid University, Algeria, in 2004 and the M.Sc. degree in Telecommunications and Networking from the University of Paris VI, France, in 2006. In 2010, he received the Ph.D. degree in Systems Optimization and Security from the University of Technology of Troyes (UTT), France, and was an Assistant Professor at the UTT from 2009 to 2010. In 2010, he was an Assistant Professor at the Institute of Automotive and Transport Engineering at Nevers, France. From 2011 to 2012, he was a Postdoctoral Research Associate at the Systems Modelling and Dependability Laboratory, UTT. From 2012 to 2014, he was an Assistant Research Scientist in the Chemical Engineering Department at Texas A&M University at Qatar, Doha, Qatar. Since 2015, he has been a Postdoctoral Fellow in the Division of Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) at King Abdullah University of Science and Technology (KAUST). His current research interests include statistical decision theory and its applications, fault detection and signal processing, and spatio-temporal statistics with environmental applications. He is a Member of the IEEE Computational Intelligence Society.

Mohamed Senouci received the Engineer and Magister degrees in computer science from the University of Oran 1 Ahmed Ben Bella, Algeria, in 1979 and 1994, respectively, and the Ph.D. degree in computer science from the same university in 2007, where he is currently a Professor. His research interests include embedded systems, machine learning, and artificial intelligence.

Ying Sun is an Assistant Professor of Statistics in the Division of Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. She joined KAUST in June 2014 after one year of service as an assistant professor in the Department of Statistics at the Ohio State University, USA. At KAUST, she leads a multidisciplinary research group on environmental statistics, dedicated to developing statistical models and methods for space–time data to solve important environmental problems. Prof. Sun received her Ph.D. degree in Statistics from Texas A&M University in 2011, and was a postdoctoral researcher in the research network of Statistics in the Atmospheric and Oceanic Sciences (STATMOS), affiliated with the University of Chicago and the Statistical and Applied Mathematical Sciences Institute (SAMSI). She has demonstrated excellence in research and teaching, published research papers in top statistical journals as well as subject-matter journals, and won multiple best paper awards from the American Statistical Association and the Transportation Research Board of the National Academies. Her research interests include spatio-temporal statistics with environmental applications, computational methods for large datasets, uncertainty quantification and visualization, functional data analysis, robust statistics, and statistics of extremes.
data analysis, robust statistics, statistics of extremes.