Unsupervised Obstacle Detection in Driving Environments Using Deep-Learning-Based Stereovision
Highlights
• A stereovision-based hybrid deep autoencoder (HAE) approach to urban scene monitoring is developed.
• This system combines the advantages of deep Boltzmann Machines (DBM) and autoencoders.
• An unsupervised HAE-based one-class SVM is developed for obstacle detection in driving environments.
• A fast obstacle tracking approach based on density maps is developed.
• Two publicly available datasets, Malaga and Daimler, are used for validation.
• The detection results show the superior performance of the new combined HAE-OCSVM strategy.
Article history: Received 11 July 2017; Received in revised form 13 October 2017; Accepted 26 November 2017; Available online 6 December 2017.

Keywords: Deep learning; DBM; Autoencoder; OCSVM; Monitoring; Stereovision

Abstract

A vision-based obstacle detection system is a key enabler for the development of autonomous robots and vehicles and intelligent transportation systems. This paper addresses the problem of urban scene monitoring and tracking of obstacles based on unsupervised, deep-learning approaches. Here, we design an innovative hybrid encoder that integrates deep Boltzmann machines (DBM) and auto-encoders (AE). This hybrid auto-encoder (HAE) model combines the greedy learning features of DBM with the dimensionality reduction capacity of AE to accurately and reliably detect the presence of obstacles. We combine the proposed hybrid model with one-class support vector machines (OCSVM) to visually monitor an urban scene. We also propose an efficient approach to estimating the location of obstacles and tracking their positions via scene densities. Specifically, we address obstacle detection as an anomaly detection problem. If an obstacle is detected by the OCSVM algorithm, then a localization and tracking algorithm is executed. We validated the effectiveness of our approach using experimental data from two publicly available datasets, the Malaga stereovision urban dataset (MSVUD) and the Daimler urban segmentation dataset (DUSD). Results show the capacity of the proposed approach to reliably detect obstacles.

© 2017 Elsevier B.V. All rights reserved. https://doi.org/10.1016/j.robot.2017.11.014
Such systems are mainly based on multiple collections of views using visual sensors that can estimate depth and perceive three-dimensional (3D) components in a scene. For example, binocular stereovision is based on two rectified images (left and right) that are used to compute a disparity map (i.e., the displacement of an object between the two rectified images) such that the epipolar geometry constraints are fulfilled [1,2,12,5].

In the literature, there has been much discussion on obstacle detection techniques. For instance, some approaches are based on image descriptors such as the scale-invariant feature transform (SIFT), local binary patterns (LBP), regions of interest (ROI) based on sliding windows, and histograms of oriented gradients (HOG) [13]. Indeed, these techniques usually utilize manually designated features, such as vehicle motion, color and texture. Nadav and Katz [14], Broggi et al. [15], and Yamaguchi et al. [16] proposed obstacle detection using a monocular camera in the off-road environment. Häne et al. [17] proposed an obstacle detection approach in the on-road environment using monocular cameras. Labayrade et al. [1], Fakhfakh et al. [2], and Hu and Uchimura [12] proposed binocular stereo vision systems based on depth estimation via disparity maps for highways. Sun et al. [3] proposed a system for detection and tracking of moving obstacles in urban driving scenarios. Appiah and Bandaru [4] proposed an approach using stacked stereo 360° vertical cameras to perceive obstacles around an autonomous vehicle. Nalpantidis et al. [5] introduced a new representation of 3D scene structure named theta-disparity. The key idea of theta-disparity is to get a radial representation of the significant objects in a scene with respect to a point of interest based on a disparity map [4]. Woo and Kim [7] proposed vision-based obstacle detection and collision risk estimation for an unmanned surface vehicle. Based on the work of Labayrade et al. [1], Fakhfakh et al. [2], and Nalpantidis et al. [5], Burlacu et al. [18] presented an obstacle detection approach in stereo sequences using multiple representations of the disparity map. However, this approach is based on heavy scanning of images to look for obstacles without any certainty about the existence and kind of obstacles. This method requires intensive computation and is difficult to adapt to real-time applications. In addition, this method cannot distinguish obstacles from other objects.

In obstacle detection and localization, machine learning turns out to play an important role [19–21]. Many methods have been developed for improving obstacle detection and for handling new applications [22,13,23,24,21]. In learning-based obstacle detection methods, two classes can be distinguished: approaches based on shallow learning and those based on deep learning. Various shallow learning-based approaches have been investigated, such as training different classifiers by support vector machines (SVM), AdaBoost, and neural networks in supervised learning with one or two layers [22]. Robust approaches have been proposed by merging HOG with SVM for human detection based on single views [13]. However, shallow learning approaches are not suitable for representing dependencies between multiple variables, and they are inefficient in dealing with high-dimensional data, leading to unsuitable generalized models [23,21].

On the other hand, deep learning-based approaches have been developed to overcome these limitations. Indeed, deep convolutional neural networks are powerful tools in image classification. They have proved to be efficient on Google's ImageNet, which contains more than 1.3 million high-resolution images. Deep convolutional neural networks (CNNs) were first proposed by Nguyen et al. [25] for obstacle detection and recognition, but their efficiency was limited to 2D images. Ramos et al. [26] proposed an approach based on deep CNNs to detect unexpected obstacles. Despite the promising results obtained using the deep CNN approach for obstacle detection and recognition based on 2D images, some tasks, such as learning more about the data distribution, encoding data, reducing dimensionality, generating new data with a given joint distribution, and unsupervised learning, are not possible [24]. Restricted Boltzmann machines (RBM) and autoencoders are powerful deep architectures that overcome most of these limitations [23]. These deep-learning based approaches are usually implemented in three main steps: first, a heavy scanning of images; next, locating the surrounding ROI; and last, starting a recognition process. This complex process is automatically executed in both the presence and absence of obstacles, which is the main drawback of such an approach.

1.2. Motivation and contribution

To improve obstacle detection and classification, we start by checking the presence of obstacles before starting any heavy scanning of input images. In other words, our objective is to optimize the obstacle detection process by answering the question: are there any obstacles? Then, the localization, estimation and recognition processes are executed only if a potential obstacle exists. Here, we treat the problem of obstacle detection as an anomaly detection problem based on the V-disparity data distribution. In urban settings or on highways, the V-disparity data distribution, where V refers to the vertical coordinate in the (u, v) disparity map coordinate system [27,12], is mostly stable, with small variations due to measurement noise. The V-disparity can change significantly in the presence of obstacles. Our proposed system has four main stages, as shown in Fig. 1.

• First, the system employs an innovative hybrid framework for feature extraction and encoding. This is based on a hybrid encoder model that combines multiple layers of deep Boltzmann machines (DBM) as the feature extractor and an autoencoder (AE) for dimensionality reduction (V-disparity ⇒ code). In fact, we start with unsupervised greedy layer-wise training of the hybrid encoder using the V-disparity dataset. Two tasks are accomplished at the end of each layer: (1) discover and extract new features; (2) generate a new encoded output that will be used as input for the next layer. The proposed hybrid encoder architecture is built on four layers of DBM and AE.
• Second, we address obstacle detection as an anomaly detection problem based on the one-class support vector machine (OCSVM) classifier, which requires only obstacle-free data in training. The OCSVM is trained in an unsupervised way on data encoded by the hybrid encoder model. The central role of the OCSVM classifier is to separate inliers from outliers in the testing data by building a hyperplane [28]. Third, the presence of obstacles can be predicted. Towards this end, for a given V-disparity, a code is generated using the hybrid encoder model and the OCSVM classifier predicts whether it is an inlier or an outlier. Here, two models are built: the first model identifies free scenes and the second identifies busy scenes. The main reason to use two models is to improve decision making and reduce false alarms.
• Finally, the location of obstacles can be estimated based on density maps computed for both the V-disparity and U-disparity by checking changes in residuals, which represent the difference between the current values of the density maps and the previous values. Here, the three-sigma rule is used to detect changes in residuals.

The effectiveness of the developed hybrid approach is validated using experimental data from two publicly available datasets, the Malaga stereovision urban dataset and the Daimler urban segmentation dataset. Results show that the proposed approach is able to reliably detect obstacles.
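To make the disparity-map computation described at the beginning of this section concrete, the following sketch shows one common way to obtain a dense disparity map from a rectified stereo pair. It is an illustration only, not the authors' implementation: the file names are placeholders and the matcher parameters (search range, window size) are assumptions chosen for demonstration.

import cv2
import numpy as np

# Load a rectified stereo pair (placeholder file names).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities bounds the disparity search
# range and blockSize is the correlation window.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)

# compute() returns fixed-point disparities scaled by 16.
disparity = matcher.compute(left, right).astype(np.float32) / 16.0
disparity[disparity < 0] = 0  # negative values mark invalid matches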
Fig. 1. Flowchart of the proposed vision-based obstacle detection and localization system.
2. Preliminary materials
Fig. 2. Autoencoders.
In this section, we briefly present an overview of the machine learning generative models used to build deep learning architectures, such as deep autoencoders, the Boltzmann machine and the restricted Boltzmann machine. More details about these generative models can be found in [24,29].

2.1. Autoencoders

An autoencoder is an artificial neural network [23] used for unsupervised learning that is trained to reconstruct its own inputs (i.e., predicting the value of the output x̂ given the input x via a hidden layer h; see Fig. 2). Autoencoders are widely used in dimensionality reduction and feature learning. Autoencoders comprise two parts: the encoder and the decoder. The encoder is defined by an encoder function h = Encoder(x), which can be a linear or nonlinear function. If the encoder function is nonlinear, the autoencoder has the capacity to learn more features than linear principal component analysis [23]. The purpose of the decoder part is to reconstruct its own inputs via the decoder function, x̂ = Decoder(h). The learning process of an autoencoder is achieved by minimizing the negative log-likelihood (loss function) of the reconstruction, given the encoding Encoder(x) [23]:

Reconstruction error = −log(P(x|Encoder(x))),   (1)

where P is the probability assigned to the input vector x by the model. Indeed, incorporating latent variable models has caused autoencoders to behave like generative models. Stacked autoencoder models have been widely applied in image denoising [30,31] and content-based image retrieval [32].

2.2. Restricted Boltzmann machine

Restricted Boltzmann machines (RBMs) can be viewed as stochastic neural networks [33] (see Fig. 3). RBMs consist of m visible units, v ∈ {0, 1}^m, and n hidden units, h ∈ {0, 1}^n. There are no visible-to-visible or hidden-to-hidden connections, although v and h are fully connected (see Fig. 3). The learning procedure comprises many steps of Gibbs sampling (propagate: sample hidden given visible; reconstruct: sample visible given hidden; repeat) and selecting the weights with minimum reconstruction error. Different learning algorithms for RBMs have been proposed, mostly based on Markov chain Monte Carlo (MCMC) sampling using Gibbs sampling to obtain an estimator of the log-likelihood gradient [23,34]. Moreover, RBMs are used to construct deeper models, such as deep belief networks (DBN) and the hierarchical probabilistic model known as the deep Boltzmann machine (DBM) [35].

RBMs are energy-based models and have been used as generative models for several types of data [23], such as text, speech and images. The energy function of the RBM configuration (v, h) is defined as
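As a concrete illustration of Eq. (1), the following NumPy sketch trains a one-hidden-layer autoencoder by gradient descent. The squared reconstruction error used here corresponds to the negative log-likelihood under a Gaussian output model; the layer sizes, learning rate and toy data are assumptions for demonstration only.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, lr = 64, 16, 0.1
W1 = rng.normal(0, 0.1, (n_in, n_hidden))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in))   # decoder weights
b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((500, n_in))                  # toy training data

for epoch in range(100):
    h = sigmoid(X @ W1 + b1)                 # h = Encoder(x)
    x_hat = sigmoid(h @ W2 + b2)             # x_hat = Decoder(h)
    err = x_hat - X                          # gradient of 0.5*||x_hat - x||^2
    d2 = err * x_hat * (1.0 - x_hat)         # backprop through the decoder
    d1 = (d2 @ W2.T) * h * (1.0 - h)         # backprop through the encoder
    W2 -= lr * h.T @ d2 / len(X)
    b2 -= lr * d2.mean(axis=0)
    W1 -= lr * X.T @ d1 / len(X)
    b1 -= lr * d1.mean(axis=0)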
Energy(v, h) = −bᵀv − cᵀh − hᵀWv,   (2)

where W is the weight matrix and b and c are the visible and hidden bias vectors. The joint probability of a configuration (v, h) is

P(v, h) = e^{−Energy(v,h)} / Z,   (3)

Z = Σ_{v,h} e^{−Energy(v,h)},   (4)

where Z is the partition function. Since only v is observed, the hidden variables h are marginalized:

P(v) = Σ_h e^{−Energy(v,h)} / Z,   (5)

where P(v) is the probability assigned by the model to a given visible vector v. In terms of probability, since the hidden nodes are conditionally independent given the visible units (and vice versa), we can derive from Eq. (3):

P(v|h) = Π_i p(v_i|h),   (6)

P(h|v) = Π_j p(h_j|v),   (7)

with the conditional probabilities

p(h_j = 1|v) = σ(c_j + Σ_i W_ij v_i),   (8)

p(v_i = 1|h) = σ(b_i + Σ_j W_ij h_j),   (9)

where σ(·) is the logistic function and σ(x) = (1 + exp(−x))^{−1}. Hinton et al. [34] developed an extension of RBMs, the Gaussian–Bernoulli RBM, to deal with different data types such as real-valued vectors (e.g., pixel intensities of an image), in which v ∈ R^m and the hidden units h ∈ {0, 1}^n. For the Gaussian–Bernoulli RBM, the joint energy is:

Energy(v, h) = Σ_{i=1}^{I} (v_i − c_i)² / (2σ_i²) − Σ_{i=1}^{I} Σ_{j=1}^{J} (v_i/σ_i) W_ij h_j − Σ_{j=1}^{J} b_j h_j.   (10)

The aim of training RBMs is to adjust the model's parameters (the weight matrix w) (see Eq. (11)). This task is achieved by maximizing the probability of the training data under the model.

2.3. Deep belief networks

Deep belief networks (DBNs) are probabilistic generative models that are based on stacked RBMs (see Fig. 4). DBNs have been used in many challenging learning problems, such as real-time classification [37], audio classification [38], speech synthesis [39], and facial expression recognition [40]. They exhibit high efficiency in discovering layer-by-layer complex nonlinearity. Furthermore, DBNs have been used successfully in dimensionality reduction [34,41]. Hinton et al. [34] introduced a fast unsupervised learning algorithm for DBNs in which the joint distribution between the observed vector x and the ℓ hidden layers h^k is expressed as follows:

P(x, h^1, ..., h^ℓ) = (Π_{k=0}^{ℓ−2} P(h^k | h^{k+1})) P(h^{ℓ−1}, h^ℓ).   (12)

2.4. Deep Boltzmann machines

Salakhutdinov and Hinton [35] proposed a new learning algorithm for a hierarchical probabilistic model called the deep Boltzmann machine (DBM). A DBM is a generative model with many layers of hidden variables in which the connections between layers are undirected (see Fig. 5). Whereas RBMs are a kind of Markov random field, DBMs learn increasingly complex representations of the given data and incorporate uncertainty about ambiguous, missing or noisy inputs. DBMs are able to extract complex statistical structures and are applicable to various applications, such as object recognition [42] and computer vision [43]. Salakhutdinov and Larochelle [44] optimized all layers of DBM parameters jointly.
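The Gibbs-sampling-based learning procedure described above can be illustrated with a single contrastive-divergence (CD-1) update for a binary RBM. This is a schematic NumPy sketch, not the authors' code: bias updates are omitted for brevity and all sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
m, n, lr = 64, 32, 0.01              # visible units, hidden units, learning rate
W = rng.normal(0, 0.01, (m, n))
b = np.zeros(m)                      # visible biases
c = np.zeros(n)                      # hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_gradient(v0):
    # Positive phase: p(h = 1 | v) = sigmoid(c + W^T v), Eq. (8).
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: reconstruct v given h (Eq. (9)), then resample h.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # One-step approximation of the log-likelihood gradient w.r.t. W.
    return v0[:, None] * ph0[None, :] - pv1[:, None] * ph1[None, :]

v = (rng.random(m) < 0.5).astype(float)   # toy binary training vector
W += lr * cd1_gradient(v)                 # gradient-ascent step on log P(v)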
2.5. One-class support vector machine

The one-class support vector machine (OCSVM) [45] is an efficient, unsupervised learning algorithm that learns decision functions for anomaly detection. OCSVM returns a function f(x) with +1 or −1 to indicate whether the data point is an ''inlier'' or an ''outlier'', respectively. Its decision function f(x) is defined as:

f(x) = +1, if x lies in the region capturing most of the data points; −1, otherwise.   (15)

OCSVM, which is based on kernels (see Eq. (16)) such as the radial basis function (RBF) (see Eq. (17)), maps the input data into a high-dimensional feature space F and finds the hyperplane that maximizes the margin that best separates the training data from the origin:

K(x, y) = (Ψ(x) · Ψ(y)),   (16)

where x and y are input vectors, Ψ is a feature map X → F, and X is the set of observed x. The RBF kernel is also known as a Gaussian kernel:

K_RBF(x, y) = exp(−‖x − y‖² / (2σ²)).   (17)

The selection of the hyperplane separating the training dataset from the origin is achieved by solving the following quadratic optimization problem:

min_{w∈F, ξ∈R^l, ρ∈R}  (1/2)‖w‖² + (1/(νl)) Σ_i ξ_i − ρ,   (18)

subject to (w · Ψ(x_i)) ≥ ρ − ξ_i, ξ_i ≥ 0,

where ν ∈ (0, 1] is a parameter that characterizes the solution, w is a weight vector and ρ is an offset.

In the disparity map computation (e.g., using the sum of absolute differences, SAD), I_left and I_right respectively denote the left and right image pixel intensities, d is the disparity in the range [d_min, d_max], d_min and d_max are respectively the minimum and maximum disparity values, ω is the window size, and i, j are the coordinates (rows and columns, respectively) of the center pixel of the SAD or any other correlation measure.

The V-disparity map, which gives a good estimation of the road's profile based on the Hough transform and depth estimation, provides information about the height of obstacles and their positions with respect to the ground [1,2]. The main steps used to compute the V-disparity are given in Algorithm 1.

Algorithm 1: V-disparity computation steps.
Input: Disparity map DispMap(rows, cols)
Input: Dmax: maximum disparity value.
Output: V-disparity DispMapv(rows, Dmax)
1 for each row r in DispMap do
2   for each column c in DispMap do
3     currentDisparity ← DispMap(r, c)
4     if currentDisparity > 0 then
5       DispMapv(r, currentDisparity) ← DispMapv(r, currentDisparity) + 1

On the other hand, a U-disparity map provides information about the width of obstacles and depth estimation [1,2,12]. Algorithm 2 describes the main steps to compute the U-disparity.

A density map is a compact representation of the V-disparity that preserves the essential information. To compute the density map, the V-disparity is segmented into many small cells (see Fig. 6), and the density of each cell is derived as follows:

Density_Cell = (Σ_{(i,j)∈Cell} I(i, j)) / (w · h),

where I(i, j) is the V-disparity intensity at pixel (i, j) of the cell, and w and h are the width and height of the cell.
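For illustration, the following NumPy sketch implements the row-wise disparity histogram of Algorithm 1, a column-wise U-disparity accumulation (Algorithm 2 is referenced but not listed here, so this routine is our assumption of its standard form), and the cell-density computation above with an assumed 5 × 5 cell size.

import numpy as np

def v_disparity(disp_map, d_max):
    # Algorithm 1: for each image row, histogram the disparities in that row.
    rows = disp_map.shape[0]
    v_disp = np.zeros((rows, d_max), dtype=np.int32)
    for r in range(rows):
        for d in disp_map[r]:
            if 0 < d < d_max:
                v_disp[r, int(d)] += 1
    return v_disp

def u_disparity(disp_map, d_max):
    # Column-wise counterpart: obstacles appear as horizontal segments.
    cols = disp_map.shape[1]
    u_disp = np.zeros((d_max, cols), dtype=np.int32)
    for c in range(cols):
        for d in disp_map[:, c]:
            if 0 < d < d_max:
                u_disp[int(d), c] += 1
    return u_disp

def density_map(v_disp, M=5, N=5):
    # Split the V-disparity into M x N cells and average each cell's
    # intensity: Density_Cell = (sum of I(i, j) over the cell) / (w * h).
    rows, cols = v_disp.shape
    dens = np.zeros((rows // M, cols // N))
    for m in range(rows // M):
        for n in range(cols // N):
            cell = v_disp[m * M:(m + 1) * M, n * N:(n + 1) * N]
            dens[m, n] = cell.sum() / (M * N)
    return dens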
Fig. 9. Block diagram of the deep encoders architecture with two OCSVM classifiers.
Obstacles are presented by vertical lines with high intensities (see Fig. 8(a)). If the obstacle is closer to the right side of the V-disparity map, the distance between the obstacle and the vehicle is smaller. The thickness of the detected obstacle decreases as the obstacle moves further away from the mobile robot. The vertical length of the vertical line represents the height, h, of the actual obstacle in the image. The greater the thickness of the obstacle in the V-disparity map, the bigger the obstacle in the image (e.g., buses, cars, and pedestrians). Fig. 8(b) shows pedestrians walking on the road. From the V-disparity, it can be seen that vertical lines on the road profile indicate the presence of these obstacles (i.e., pedestrians). In the U-disparity, obstacles appear as fragments of horizontal lines (see Fig. 8(b)). The length of a fragment is the width of the detected obstacle, and the starting x-coordinate of each fragment represents the x-coordinate of the obstacle. By using V-disparity and U-disparity, the width, height, and x and y coordinates of the detected obstacle can be extracted. Algorithm 3 describes the steps to surround an obstacle with an ROI.

Algorithm 3: Obstacle localization steps.
Input: Disparity map: DMap
Output: Vector of regions of interest: RROI
1 V ← BuildVDisparity(DMap);
2 U ← BuildUDisparity(DMap);
3 D: the disparity range of the obstacle;
4 (x, y): coordinates of the obstacle in the original image;
5 (h, w): height and width of the obstacle;
6 Extract road profile RP from V;
7 OBS ← FindStandingObstacle(RP);
8 for each obstacle O in OBS do
9   ➥ Determine D and y from V;
10  ➥ Determine the obstacle height h located in V;
11  ➥ Determine w and x using D from U;
12  ➥ Append (x, y, h, w) to RROI;
13 return RROI

4. Proposed hybrid deep autoencoder-based obstacle detection approach

The proposed hybrid deep autoencoder (HAE) consists of four layers. Each layer is the combination of a DBM and an autoencoder. In each layer, useful features are extracted and encoded in an output code. Then, the generated code is used for the next layer. The output of the last layer is used as the input to the one-class classifier. Specifically, the one-class classifier builds boundaries to separate normal (without obstacles) and abnormal (presence of obstacles) cases. In this approach, two models are constructed to enhance accuracy and reduce false alarms. The first is built with unsupervised learning of images with obstacles and the second is built with unsupervised learning based on images without obstacles. False alarms can be reduced by comparing the outputs of the two models. Fig. 9 schematically summarizes the proposed system, which is based on a deep learning architecture trained entirely in an unsupervised way. The main steps of the proposed approach are summarized in Algorithm 4.

Algorithm 4: Hybrid deep encoder approach.
Input: Dataset of image pairs (Left, Right): TrainingDataset
Output: Dataset of encoded V-disparities: EncodedDataset
1 for each tuple (Left, Right) in TrainingDataset do
2   DisparityMap ← ComputeDisparityMap(Left, Right)
3   V-Disparity ← ComputeVDisparity(DisparityMap)
4   X ← V-Disparity
5   for each layer λ in HAE layers do
6     outputDBM ← LearnFeaturesDBM(X)
7     outputλ ← EncodeAE(outputDBM)
8     X ← outputλ
9   EncodedDataset ← add(X)  /* Add X to EncodedDataset */
10 OCSVMModel ← train(EncodedDataset)

Definition 1 (Operating Area). Let us define an operating area as the region in front of a vehicle (see Fig. 10). The dimensions of this region are expressed as a range of disparities, where δ is the disparity range, and δmin and δmax are the minimum and maximum disparity values, respectively.

The proposed procedure is implemented in several steps, as summarized in Table 1.

4.1. Hybrid deep architecture training

In this section, we describe the approach used to train the proposed deep architecture, starting with building the hybrid deep encoder based on unsupervised training. Then, the one-class classifier is trained to learn how to classify the encoded data obtained from the hybrid deep encoder.

Deep hybrid encoder training. The proposed system is based on two models, which are implemented in parallel (see Fig. 9). Each model merges a deep DBM with an autoencoder to enhance the quality of the generated encoded datasets (see Fig. 11). These models are trained with an input dataset that contains rectified left and right images. Specifically, we train the first model with image sequences that contain mostly free scenes with a few obstacles. At the same time, we train the second model with data containing mostly scenes with obstacles. This hybrid deep encoder allows the system to learn a complex data distribution and encode the input images. It is also able to reconstruct the input with reduced errors.
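The layer-wise data flow of the hybrid encoder (Algorithm 4, lines 5–8) can be summarized in a few lines. The sketch below is schematic: the DBM and AE layer objects with transform/encode methods are hypothetical stand-ins for the trained components, shown only to fix the data flow.

def encode_v_disparity(v_disparity_vec, hae_layers):
    # hae_layers: sequence of (dbm, ae) pairs; the paper uses four layers.
    x = v_disparity_vec
    for dbm, ae in hae_layers:
        features = dbm.transform(x)  # greedy feature extraction (DBM)
        x = ae.encode(features)      # dimensionality reduction (AE)
    return x                         # final code passed to the OCSVM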
Table 1
Main steps of the proposed system.
Step Action
❶ Stereoimage acquisition from the stereovision device.
➥ Input: Left and right images.
➥ Output: Rectified left and right images.
❷ Compute disparity map:
➥ Input: Rectified left and right images.
➥ Output: Disparity map.
❸ Compute V-disparity map (see Algorithm 1)
➥ Input: Disparity map.
➥ Output: V-disparity map.
❹ Check existence of obstacles (Detection): Apply the hybrid deep encoder-based
OCSVM for obstacle detection.
➥ Input: Encoded V-disparity map.
➥ Output: Prediction, P ∈ {Yes, No}.
❺ Compute scene density: Compute Density map using V-disparity density
➥ Input: Encoded V-disparity map.
➥ Output: Density estimation.
❻ Track obstacle locations (tracking): Based on the previous density map,
predict the new obstacle locations by tracking density changes.
➥ Input: Density map.
➥ Output: Estimation of the obstacles localization.
❼ Compute U-disparity map: Compute U-disparity map based on the boundaries of
the vehicle operating area (see Algorithm 2).
➥ Input: Disparity map.
➥ Output: Obstacles region of interest (ROI) (see Fig. 12).
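To fix ideas, the detection loop of Table 1 can be written as the schematic sketch below. Every stage is injected as a callable, because the concrete implementations (disparity computation, HAE encoding, density tracking) are only assumed here, not taken from the paper; the earlier sketches give possible forms for some of them.

def process_stereo_pair(left, right, stages, ocsvm):
    # stages: dict of callables standing in for steps 2-7 of Table 1.
    disp = stages["disparity"](left, right)     # step 2: disparity map
    v_disp = stages["v_disparity"](disp)        # step 3: V-disparity map
    code = stages["encode"](v_disp.ravel())     # step 4: HAE encoding
    if ocsvm.predict([code])[0] == -1:          # step 4: outlier => obstacle
        dens = stages["density"](v_disp)        # step 5: scene density
        return stages["localize"](dens, disp)   # steps 6-7: tracking + ROI
    return None                                 # no obstacle detected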
Training the one-class classifier. In the proposed approach, the OCSVM classifier, which is an unsupervised classifier, is trained with the encoded V-disparity maps generated from the two constructed models of the hybrid deep encoder. As described above, we implement two OCSVMs: the first aims to detect outliers from the encoded V-disparity map of the model trained with obstacles; the second is used to detect outliers from the encoded V-disparity map of the model trained without obstacles (see Fig. 9).
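A minimal sketch of this training stage with scikit-learn's OneClassSVM is shown below, using the RBF kernel and the ν and γ values of Table 2. The encoded features are random placeholders standing in for the HAE codes of obstacle-free scenes.

import numpy as np
from sklearn.svm import OneClassSVM

encoded_free = np.random.rand(1000, 32)   # placeholder for HAE codes

ocsvm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
ocsvm.fit(encoded_free)                   # unsupervised: trained on one class only

new_codes = np.random.rand(5, 32)
labels = ocsvm.predict(new_codes)         # +1 = inlier, -1 = outlier (Eq. (15))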
Obstacle localization and tracking. After detecting an obstacle using our HAE-OCSVM approach, it is important to locate its position. The proposed approach for obstacle localization and tracking is schematically presented in Fig. 12. This approach is based on the V-disparity and U-disparity maps, which are useful for obstacle localization. In fact, each row in the density map of the V-disparity represents an area that potentially contains obstacles (following the Y-coordinate axis). The density map of the V-disparity is therefore useful for detecting and tracking obstacles moving vertically. On the other hand, the columns of the density map obtained from the U-disparity represent areas that potentially contain obstacles (following the X-coordinate axis). Thus, the density map of the U-disparity can be used to detect and track obstacles moving horizontally. In this approach, the V-disparity and U-disparity maps are computed based on the disparity map of the two input images. Of course, the density map can be used as an indicator to determine the position of the detected obstacle. Towards this end, we analyze the trend of the previous density maps to track changes. Specifically, we apply the three-sigma rule (i.e., the Shewhart monitoring chart) [48] on the density map column-wise to detect changes.

The proposed approach is implemented in several steps: firstly, the V-disparity and U-disparity maps are computed based on the disparity map of the two input images. Then, the road profile is extracted using the Hough transform, which helps to determine a line representing the road. Obstacles on the road are represented by vertical lines on the V-disparity map. Their height and depth can be estimated via distances in the V-disparity. Their width can be determined by processing the U-disparity map. Thus, we can surround the obstacles in the ROI. Specifically, by crossing the U-disparity and V-disparity maps, we can surround the obstacles and estimate their positions and distances.

Fig. 11. Deep Boltzmann machines with autoencoders.

Fig. 12. Block diagram of the obstacle localization process.

5. Experimental results and discussion

5.1. Data description

This section reports on the effectiveness of the proposed hybrid encoder approach. Towards this end, we performed experiments on two practical datasets: the Malaga stereovision urban dataset (MSVUD) [49] and the Daimler urban segmentation dataset (DUSD) [50,51]. The MSVUD comprises 15 sub-datasets (extracts) of rich urban scenarios of more than 20 km in length with a resolution of 800 × 600 pixels, recorded in different situations (with and without traffic), such as a straight path, turns, roundabouts, avenue traffic and highway. The DUSD contains image sequences recorded in urban traffic. It consists of rectified stereo image pairs with a resolution of 1024 × 440 pixels [51].

Two sub-datasets of MSVUD are used in the training phase. The first dataset, which is extract number 5 (avenue loop closure, 1.7 km), consists of 5000 pairs of images, and the second dataset is extract number 8 (long loop closure, 4.5 km), which consists of 10,000 pairs of images. These two extracts (5, 8) are composed mainly of free scenes. In the testing phase, we used two sub-datasets of MSVUD: extract number 10 (multiple loop closures), which consists of 9000 pairs of images, and extract number 12 (a long avenue of 3.7 km with traffic), which consists of 11,000 pairs of images. In addition, the DUSD dataset is used for obstacle detection with 500 pairs of images.

To do so, we used two MSVUD datasets for testing purposes [49]. The first dataset, termed FREE-DST, contains 20% fuzzy situations and 80% free roads. The second dataset, called BUSY-DST, contains 90% fuzzy situations and 10% true obstacles (vehicles, motorbikes and pedestrians). This distribution is motivated by the fact that in normal urban driving scenarios, the car is moving most of the time, unless the vehicle is stuck in traffic. Both datasets, FREE-DST (3563 pairs of images) and BUSY-DST (1437 pairs of images), were generated randomly from extracts 10 and 12 of MSVUD.

In this study, the effectiveness of three obstacle detection approaches, each consisting of two layers (a deep encoder and a one-class classifier), is assessed and compared. Indeed, we used three different deep encoders: (i) the proposed hybrid autoencoder (HAE), (ii) the deep belief network (DBN), and (iii) stacked autoencoders (SDA). Also, we used two one-class classifiers, OCSVM and SVDD. The experimental parameters of the machine learning approaches studied in this paper are presented in Table 2.

Table 2
Parameter settings of the studied approaches.

Models           Parameter             Value
DBM              Learning rate         0.01
                 Gibbs sampling (k)    15
                 Training epochs       100
Autoencoder      Learning rate         0.01
                 Training epochs       100
OCSVM            Kernel                RBF
                 γ (RBF)               0.1
                 ν                     0.1
Operating area   δmin                  32 (pixels)
                 δmax                  64 (pixels)

5.2. Model trained with free scenes (FSM)

To build an efficient and accurate model able to predict free scenes and reject scenes with obstacles, we trained the one-class classifier with the V-disparities of free scenes. Sometimes there were confusing (fuzzy) situations in which obstacles were in the field of view of the vehicle but not in the operating area. This classifier is constructed to fit the free scenes and fuzzy situations and to reject busy scenes. Examples of free scenes and their corresponding V-disparity maps are shown in Fig. 13. From the V-disparity maps shown in Fig. 13(a)–(d), it can be seen that the road profile is clearly apparent as a visible inclined line of cloud points without an accumulation of high-intensity pixels. So, from Fig. 13(a)–(d), it seems that there is no obstacle in the road. It can also be seen that the static environment (vertical line) is in the low V-disparity area, which means it is far away from the vehicle.

We evaluated the effect of the number of samples in the training dataset on the accuracy of the proposed hybrid model. To do so, we varied the number of samples in the training dataset from 500, 1000 and 2000 to 5000 and evaluated the accuracy of the proposed HAE-OCSVM algorithm compared to both the SDA- and DBN-based OCSVM algorithms (see Table 3). In each experiment, we measured the inliers, called true positives (TP), accepted by the OCSVM, and the outliers, called false positives (FP), rejected by the OCSVM.
Fig. 13. Examples of free scenes. (Right) Original input image and (Left) its corresponding V-disparity map.
Table 3
Performance comparison between HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM based on FREE-DST.

Dataset (Samples)   Approach     Inliers (TP)   Outliers (FP)
500                 DBN-OCSVM    89.78          10.22
                    HAE-OCSVM    99.51          0.49
                    SDA-OCSVM    89.39          10.61
1000                DBN-OCSVM    89.45          10.55
                    HAE-OCSVM    99.95          0.05
                    SDA-OCSVM    90.30          9.70
2000                DBN-OCSVM    90.25          9.75
                    HAE-OCSVM    99.92          0.08
                    SDA-OCSVM    90.08          9.92
5000                DBN-OCSVM    89.89          10.11
                    HAE-OCSVM    99.73          0.27
                    SDA-OCSVM    90.57          9.43

Table 4
Performance comparison between HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM methods applied to the BUSY-DST dataset.

Dataset (Samples)   Approach     Inliers (TP)   Outliers (FN)
500                 DBN-OCSVM    63.89          36.11
                    HAE-OCSVM    81.98          18.02
                    SDA-OCSVM    52.96          47.04
1000                DBN-OCSVM    63.96          36.04
                    HAE-OCSVM    94.79          5.21
                    SDA-OCSVM    53.38          46.62
2000                DBN-OCSVM    64.51          35.49
                    HAE-OCSVM    91.24          8.76
                    SDA-OCSVM    52.96          47.04
5000                DBN-OCSVM    41.13          58.87
                    HAE-OCSVM    86.44          13.56
                    SDA-OCSVM    52.55          47.45
Table 3 shows that when 500 samples were used for training, the accuracy in percentage (TP, FP) of the proposed HAE-OCSVM method was 99.51 and 0.49, respectively, while that of DBN-OCSVM was 89.78 and 10.22 and that of SDA-OCSVM was 89.39 and 10.61. It can be seen that the accuracy of the proposed method increases with the number of samples in the training data.

With 5000 training samples, the HAE-OCSVM, DBN-OCSVM, and SDA-OCSVM methods respectively yielded 99.73 and 0.27, 89.89 and 10.11, and 90.57 and 9.43 percent accuracy. The results show that the proposed method outperformed DBN-OCSVM and SDA-OCSVM and exhibited the highest accuracy. This is mainly due to its strong ability to learn complex structures from training data.

We also assessed the performance of the previously constructed models trained with free scenes using BUSY-DST. Fig. 14 shows examples of busy situations. From Fig. 14(a, c and d), it can be seen that an area with visible pixel intensities is present in the road profile, and the static environment (vertical line) is located in the middle of the V-disparity. Thus, the scene contains an obstacle, and its static environment is close to the vehicle. The static environment in Fig. 14(b) is unusually thick due to the sky fragment with low texture. Table 4 shows the high prediction accuracy of the proposed method compared to the DBN-OCSVM and SDA-OCSVM methods. This is due to integrating the DBM, which is able to learn and extract complex data, with encoder-based dimensionality reduction, thus improving the feature extraction. These results indicate that the proposed method learns complex structures of the input data.

Fig. 14. Examples of busy scenes. (Right) Original input image and (Left) its corresponding V-disparity map.

Fig. 15 presents the area-under-curve (AUC) values corresponding to the proposed HAE-OCSVM method and the DBN-OCSVM and SDA-OCSVM methods for different training data sizes. We note that the HAE-OCSVM method performed better than the other models due to the combination of two powerful deep learning architectures (DBMs as feature extractors and the autoencoder for dimensionality reduction) and the extended capacity of the OCSVM algorithm to detect outliers.

Fig. 15. AUC of the proposed HAE-OCSVM method compared to the DBN-OCSVM and SDA-OCSVM methods for different training-sample sizes.

5.3. Model trained with busy scenes (BSM)

To build a model that rejects free scenes and describes busy scenes, we trained the three deep encoders (HAE, SDA, DBN) with a dataset containing sequences of busy roads (with traffic), as described above. To validate the proposed model, we generated a new dataset from BUSY-DST composed of 400 true obstacles, named OBS-DST. Table 5 presents the testing results of the HAE-, SDA- and DBN-based OCSVM methods applied to the OBS-DST dataset. The proposed method achieved a high prediction accuracy of 99.79%, compared to 91.12% and 95.20% accuracy using DBN-OCSVM and SDA-OCSVM, respectively (see Table 5). Again, the overall performance of the proposed HAE-OCSVM is better than that of DBN-OCSVM and SDA-OCSVM due to the fact that DBMs are robust feature detectors that capture data correlations. In addition, complex data-dependent statistics can be discovered for learning through multiple layers.

Table 5
Performance of the HAE-OCSVM, DBN-OCSVM and SDA-OCSVM methods trained with busy scenes and tested on OBS-DST.

Encoders     Inliers (TN)   Outliers (FN)
DBN-OCSVM    91.12          8.88
HAE-OCSVM    99.79          0.21
SDA-OCSVM    95.20          4.80

5.4. Identification of confusing (fuzzy) situations

Now, we focus on the identification of confusing situations. In such situations, the output response could be a free scene, a busy scene, or a fuzzy (confusing) situation. These confusing situations can increase the number of false alarms. For this reason, we have to deal with fuzzy situations. Fig. 16 shows a few examples of fuzzy situations in which it is not easy to determine whether or not the scene is free. From Fig. 16(a)–(d), it can be seen that the static environment is close to the vehicle, which is not the case in a free scene. Also, here the vehicle is coming close to a bend. These are confusing situations.

Here, we propose an approach to identify fuzzy situations as distinct from busy and free scenes. Towards this end, we compare the responses of the FSM and BSM models to identify fuzzy situations. If both models are flagged, the tested case is considered a fuzzy situation. By this, we can identify and filter fuzzy situations from busy and free situations.

After identifying fuzzy scenes, two cases can be distinguished: true alarms and warning alarms. A true alarm occurs if there is an obstacle in the operating area of the vehicle (see Fig. 10). Fig. 17 shows two examples of true alarms (i.e., the presence of obstacles in the operating area). On the other hand, a warning alarm is declared if there is an obstacle in the field of view but outside the operating area of the vehicle (see Fig. 18). To distinguish between true alarms and warning alarms, we use the U–V disparity on the operating area of the vehicle to estimate the obstacle locations. If the obstacle is inside the operating area, then it is considered a true alarm; otherwise, it is considered a warning alarm.

Fig. 17. Obstacle detection: true alarm examples. In each image, the colored boxes represent the predicted ROI area of the detected obstacles.

Fig. 18. False alarm examples. In each image, the colored boxes represent the predicted ROI area of the detected obstacles.

Here, we investigate the capability of this approach to distinguish between warning and true alarms. To do so, we test both the BSM and FSM models with the BUSY-DST dataset, which comprises 1437 examples: confirmed obstacles (417 scenes) and 1020 fuzzy situations.
After applying the identification approach, we find 59% warning alarms and 41% true alarms. This distribution (see Fig. 19) is obtained according to the chosen dimensions of the operating area. We can make this area stricter or more flexible by extending or reducing the disparity range.

5.5. Obstacle detection based on one-class classifiers

After constructing the two hybrid models trained with BUSY-DST and FREE-DST, respectively, we assess the performance of the proposed HAE-based OCSVM obstacle detection approach and compare our results with those of five algorithms: DBN-OCSVM, SDA-OCSVM, HAE-SVDD, DBN-SVDD and SDA-SVDD. A benefit of SVMs is their ability to map problems into higher spatial dimensions using kernels, allowing a non-linear relationship to appear fairly linear. Here, we aim to exploit the advantages of the HAE model and those of the OCSVM with RBF kernel functions to improve the detection of obstacles. Table 6 presents a comparison between the HAE-OCSVM method and the other studied classifiers. The results show that the combined HAE-OCSVM detection scheme outperforms the other algorithms used in this study. OCSVM-based detection also surpassed the SVDD-based detection algorithms. This is related to the phenomenon of empty spaces inside the hypersphere suffered by the SVDD.

The implementation of these methods consists of two phases: off-line training (learning), in which the models are constructed and then used to detect obstacles in future data (i.e., testing), and on-line detection, in which the online measurement data are processed and the constructed models are used to detect obstacles. For each obstacle detection method, a processing time is computed
Each element d_mn of the density map is computed over a cell of the V-disparity as

d_mn = ( Σ_{r=R}^{R+4} Σ_{c=C}^{C+4} V-disparity(r, c) ) / (M · N),

where R = (m − 1) · M and C = (n − 1) · N.

We check whether there is any change in the columns of the density map by using density map information from previous scenes. In other words, we use the residuals, E = [e_1, e_2, ..., e_n], which represent the difference between the columns of the current density map and those of the previous density map, as change indicators. Without obstacles, the residuals are close to zero, apart from measurement noise, and they deviate significantly from zero in the presence of obstacles. First, we remove the mean from the density map data, and then we apply the three-sigma rule to the residuals to detect potential changes. The upper and lower control limits, denoted respectively UCL and LCL, for the residuals are defined as

UCL = μ_e + 3σ_e,   LCL = μ_e − 3σ_e,

where μ_e and σ_e are the mean and standard deviation of the residuals.
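A small sketch of this column-wise three-sigma test is given below, under the assumption that consecutive density maps are compared column by column; it is illustrative, not the authors' implementation.

import numpy as np

def detect_changes(prev_dens, curr_dens):
    # Column-wise residuals between consecutive density maps: E = [e1, ..., en].
    residuals = (curr_dens - prev_dens).sum(axis=0)
    residuals = residuals - residuals.mean()   # remove the mean, as in the text
    sigma = residuals.std()
    ucl, lcl = 3.0 * sigma, -3.0 * sigma       # three-sigma control limits
    # Columns whose residual leaves the control band flag a potential obstacle.
    return np.where((residuals > ucl) | (residuals < lcl))[0]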
Fig. 20. Tracking obstacle locations based on a density map. In each plot, the solid colored lines represent the residuals of the density map. The dashed horizontal lines labeled UCL and LCL denote the upper and lower control limits of the Shewhart chart.
References

[18] A. Burlacu, S. Bostaca, I. Hector, P. Herghelegiu, G. Ivanica, A. Moldoveanul, S. Caraiman, Obstacle detection in stereo sequences using multiple representations of the disparity map, in: 2016 20th International Conference on System Theory, Control and Computing (ICSTCC), IEEE, 2016, pp. 854–859.
[19] D. Petković, A.S. Danesh, M. Dadkhah, N. Misaghian, S. Shamshirband, E. Zalnezhad, N.D. Pavlović, Adaptive control algorithm of flexible robotic gripper by extreme learning machine, Robot. Comput.-Integr. Manuf. 37 (2016) 170–178.
[20] M. Duguleana, F.G. Barbuceanu, A. Teirelbar, G. Mogan, Obstacle avoidance of redundant manipulators using neural networks based reinforcement learning, Robot. Comput.-Integr. Manuf. 28 (2) (2012) 132–146.
[21] Y. Bengio, Y. LeCun, et al., Scaling learning algorithms towards AI, Large Scale Kernel Mach. 34 (5) (2007) 1–41.
[22] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell. 34 (4) (2012) 743–761.
[23] Y. Bengio, et al., Learning deep architectures for AI, Found. Trends Mach. Learn. 2 (1) (2009) 1–127.
[24] G.E. Hinton, Learning multiple layers of representation, Trends Cogn. Sci. 11 (10) (2007) 428–434.
[25] V.D. Nguyen, H. Van Nguyen, D.T. Tran, S.J. Lee, J.W. Jeon, Learning framework for robust obstacle detection, recognition, and tracking, IEEE Trans. Intell. Transp. Syst. (2016).
[26] S. Ramos, S. Gehrig, P. Pinggera, U. Franke, C. Rother, Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling, 2016. arXiv preprint arXiv:1612.06573.
[27] R. Labayrade, D. Aubert, In-vehicle obstacles detection and characterization by stereovision, in: Proceedings of the 1st International Workshop on In-Vehicle Cognitive Computer Vision Systems, Graz, Austria, 2003.
[28] S.M. Erfani, S. Rajasegarar, S. Karunasekera, C. Leckie, High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning, Pattern Recognit. 58 (2016) 121–134.
[29] J. Xu, H. Li, S. Zhou, An overview of deep generative models, IETE Tech. Rev. 32 (2) (2015) 131–139.
[30] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, ACM, 2008, pp. 1096–1103.
[31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion, J. Mach. Learn. Res. 11 (Dec) (2010) 3371–3408.
[32] A. Krizhevsky, G.E. Hinton, Using very deep autoencoders for content-based image retrieval, in: ESANN, 2011.
[33] P. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, Technical Report CU-CS-321-86, 1986.
[34] G.E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep belief nets, Neural Comput. 18 (7) (2006) 1527–1554.
[35] R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in: Artificial Intelligence and Statistics, 2009, pp. 448–455.
[36] A.-r. Mohamed, G.E. Dahl, G. Hinton, Acoustic modeling using deep belief networks, IEEE Trans. Audio Speech Lang. Process. 20 (1) (2012) 14–22.
[37] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, M. Pfeiffer, Real-time classification and sensor fusion with a spiking deep belief network, Front. Neurosci. 7 (2013).
[38] H. Lee, P. Pham, Y. Largman, A.Y. Ng, Unsupervised feature learning for audio classification using convolutional deep belief networks, in: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, Curran Associates, 2009, pp. 1096–1104. http://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks.pdf.
[39] S. Kang, X. Qian, H. Meng, Multi-distribution deep belief network for speech synthesis, in: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2013, pp. 8012–8016.
[40] P. Liu, S. Han, Z. Meng, Y. Tong, Facial expression recognition via a boosted deep belief network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1805–1812.
[41] R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, in: AISTATS, Vol. 11, 2007.
[42] B. Leng, X. Zhang, M. Yao, Z. Xiong, A 3D model recognition mechanism based on deep Boltzmann machines, Neurocomputing 151 (2015) 593–602.
[43] Q. Gan, C. Wu, S. Wang, Q. Ji, Posed and spontaneous facial expression differentiation using deep Boltzmann machines, in: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 643–648.
[44] R. Salakhutdinov, H. Larochelle, Efficient learning of deep Boltzmann machines, in: AISTATS, Vol. 9, 2010, pp. 693–700.
[45] B. Schölkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the support of a high-dimensional distribution, Neural Comput. 13 (7) (2001) 1443–1471.
[46] D.M. Tax, R.P. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.
[47] C. Georgoulas, L. Kotoulas, G.C. Sirakoulis, I. Andreadis, A. Gasteratos, Real-time disparity map computation module, Microprocess. Microsyst. 32 (3) (2008) 159–170.
[48] D.C. Montgomery, Introduction to Statistical Quality Control, John Wiley & Sons, New York, 2009.
[49] J.-L. Blanco, F.-A. Moreno, J. González-Jiménez, The Málaga Urban Dataset: High-rate stereo and lidars in a realistic urban scenario, Int. J. Robot. Res. 33 (2) (2014) 207–214. http://www.mrpt.org/MalagaUrbanDataset.
[50] T. Scharwächter, M. Enzweiler, U. Franke, S. Roth, Efficient multi-cue scene segmentation, in: German Conference on Pattern Recognition, Springer, 2013, pp. 435–445.
[51] T. Scharwächter, M. Enzweiler, U. Franke, S. Roth, Stixmantics: A medium-level model for real-time semantic scene understanding, in: European Conference on Computer Vision, Springer, 2014, pp. 533–548.

Abdelkader Dairi received an Engineer degree in computer science from the University of Oran 1 Ahmed Ben Bella, Algeria, in 2003, and a Magister degree from the National Polytechnic School of Oran, Algeria, in 2006. He is currently preparing his Ph.D. degree in computer science at the University of Oran 1 Ahmed Ben Bella under the supervision of Prof. Mohamed Senouci. His current research interests include machine learning, computer vision, image processing and mobile robotics.

Fouzi Harrou received the Dipl.-Ing. in Telecommunications from Abou Bekr Belkaid University, Algeria, in 2004 and the M.Sc. degree in Telecommunications and Networking from the University of Paris VI, France, in 2006. In 2010, he received the Ph.D. degree in Systems Optimization and Security from the University of Technology of Troyes (UTT), France, and was an Assistant Professor at the UTT from 2009 to 2010. In 2010, he was an Assistant Professor at the Institute of Automotive and Transport Engineering at Nevers, France. From 2011 to 2012, he was a Postdoctoral Research Associate at the Systems Modelling and Dependability Laboratory, UTT. From 2012 to 2014, he was an Assistant Research Scientist in the Chemical Engineering Department at Texas A&M University at Qatar, Doha, Qatar. Since 2015, he has been a Postdoctoral Fellow in the Division of Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) at King Abdullah University of Science and Technology (KAUST). His current research interests include statistical decision theory and its applications, fault detection and signal processing, and spatio-temporal statistics with environmental applications. He is a Member of the IEEE Computational Intelligence Society.

Mohamed Senouci received the Engineer and Magister degrees in computer science from the University of Oran 1 Ahmed Ben Bella, Algeria, in 1979 and 1994, respectively, and the Ph.D. degree in computer science from the same university in 2007, where he is currently a Professor. His research interests include embedded systems, machine learning, and artificial intelligence.

Ying Sun is an Assistant Professor of Statistics in the Division of Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) at King Abdullah University of Science and Technology (KAUST) in Saudi Arabia. She joined KAUST in June 2014 after one year of service as an assistant professor in the Department of Statistics at the Ohio State University, USA. At KAUST, she leads a multidisciplinary research group on environmental statistics, dedicated to developing statistical models and methods for space–time data to solve important environmental problems. Prof. Sun received her Ph.D. degree in Statistics from Texas A&M University in 2011, and was a postdoctoral researcher in the research network of Statistics in the Atmospheric and Oceanic Sciences (STATMOS), affiliated with the University of Chicago and the Statistical and Applied Mathematical Sciences Institute (SAMSI). She has demonstrated excellence in research and teaching, published research papers in top statistical journals as well as subject-matter journals, and won multiple best paper awards from the American Statistical Association and the Transportation Research Board of the National Academies. Her research interests include spatio-temporal statistics with environmental applications, computational methods for large datasets, uncertainty quantification and visualization, functional data analysis, robust statistics, and statistics of extremes.
data analysis, robust statistics, statistics of extremes.