A Vision Based Traffic Accident Detection Method Using Extreme Learning Machine
Abstract— Over the past years, automatic traffic accident detection (ATAD) based on video has become one of the most promising applications in intelligent transportation and plays an increasingly important role in ensuring travel safety. This paper proposes a classifier-based supervised method that treats the last seconds before a motor vehicle collision as the detection target. In our method, we devise a novel algorithm called OF-SIFT as the low-level feature. Derived from optical flow and the Scale Invariant Feature Transform (SIFT), it is designed to extract local motion information from the temporal domain rather than gradient-based local appearance from the spatial domain. The purpose of OF-SIFT is to generate a feature that captures sufficient and distinctive dynamic motion information for motion detection without using the static state information of moving objects. Further, in order to develop a more compact image representation without considering the explicit vehicle geometry, we use the Bag of Features (BOF) model with spatial information to encode features. Finally, an extreme learning machine (ELM) classifier is introduced as the basic classifier owing to its excellent and fast generalization. Experiments using real-world data show that the proposed method achieves good performance in handling ordinary video scenes.

Index Terms— Traffic accident detection, OF-SIFT, bag of features with spatial information, extreme learning machine.

I. INTRODUCTION

Traffic accidents, e.g., vehicle collisions, are among the most serious problems for transportation systems. Delays in traffic accident detection can cause traffic jams as well as much more serious injuries and damage. It is therefore necessary to develop a reliable and efficient automatic traffic accident detection (ATAD) system.

Since vision can provide a large amount of traffic information, it is an economical and promising basis for developing an ATAD system. Although some methods have been proposed for ATAD, several challenging issues remain.
The first issue is how to construct behavior representations of traffic accidents in a cluttered environment. Earlier work relied on geometric features, e.g., active contours [1], to build the behavior model. Such a method can be used for head-on collision detection, but this type of representation is clearly not robust to viewing changes and occlusions. Recently, many studies have focused on using trajectory dynamics to build traffic behavior representations, which serve as a bridge between low-level features and high-level representations [2]. This type of method mainly includes four consecutive modules: object detection, tracking, trajectory learning, and path modeling. Object detection and tracking aim to extract trajectory elements of each candidate object of interest. Clustering techniques for time-series data [3], e.g., spectral clustering [4] and fuzzy self-organizing neural networks [5], [6], are then used to train clusters. Based on these clusters, probabilistic models, e.g., the hidden Markov model (HMM) [7]–[11], or tree-structure-based models [8], [12], are widely used to constitute a path model. However, object detection and tracking, as a prerequisite stage, still face many challenges in real cluttered environments, which negatively affects the modeling of behavior representations. In contrast, spatio-temporal local features have also been proposed for behavior representation. One such feature is MoSIFT [13], which is constructed by first extracting key points and then estimating local descriptors based on static gradients and optical flow around the key points. However, there is no consensus about whether static gradients can improve motion behavior representation.

The second issue is how to localize the traffic accident region based on traffic behavior representations. So far, two techniques have been used for such localization. One is abnormality detection [4]–[6], [14], in which a new observation (e.g., a trajectory) is identified as abnormal if it does not fit a path well. Unfortunately, this technique depends on a threshold, and defining the threshold for various environments is very challenging. The other technique is classification, in which a classifier is trained on normal traffic samples and accident samples. The support vector machine (SVM) [15], [16] and decision trees [12] have been used for traffic accident detection. However, SVMs and decision trees can fall into local optima and suffer from slow training.

This paper proposes a novel classifier-based method for traffic accident detection that addresses the aforementioned issues.

To solve the first issue and make it easier to detect crashed vehicles, we adopt the idea of collision prediction and regard the last seconds before a collision as the detection target. Consequently, the diversity of collision scenes to detect is greatly reduced, because vehicles not involved in the collision are still running in their normal state. Moreover, we do not require trackers to extract and analyze vehicle features by measuring vehicle paths over long periods. Then, inspired by MoSIFT, we propose a novel algorithm called OF-SIFT, which can be perceived as a temporal feature descriptor containing sufficient and distinctive motion shape information derived from optical flow, without the uninformative static spatial constraints of oriented gradients. Optical flow has attributes parallel to oriented gradients in appearance, since it also captures the magnitude and orientation of moving objects. Furthermore, much research has suggested that dense feature extraction ultimately gives better performance than interest points [17]. So, inspired by the Dense Scale Invariant Feature Transform (DSIFT), which can acquire more features in less time, we extract OF-SIFT descriptors by dense sampling and set the sampled keypoints to the same scale and orientation. In addition, in order to construct a more compact image representation that detects vehicle collisions without requiring an explicit shape structure and tolerates partial occlusion, we employ the Bag of Features (BOF) model to encode the low-level features. We define a collision type as a set of collision patterns involving similar driving directions. Experiments are conducted for different collision types.

To solve the second issue, we adopt the extreme learning machine (ELM) as the classifier for traffic accident detection. Compared with the Bayesian probability framework, the ELM requires no prior knowledge of vehicle activities. Compared with the SVM and HMM, theoretical studies have shown that the ELM tends to find a globally optimal solution, and it trains remarkably fast without enumerations or complex equations, because only regularized least squares are involved. Research has also shown that the ELM outperforms other classifiers on classification tasks [18]–[20].

The rest of this paper is organized as follows. Section II describes the architecture and other details of the proposed method. Experimental results are given in Section III.

*Corresponding author: Yuanlong Yu. This work is supported by the National Natural Science Foundation of China (NSFC) under grant 61473089.
Y. Chen and Y. Yu are with the College of Mathematics and Computer Science, Fuzhou University, Fuzhou, 350116 China. [email protected], [email protected].
Ting Li is with Fujian Provincial Power Co. Ltd., State Grid, China.
Fig. 1. Framework of our traffic accident system.
II. METHODOLOGY

A. Architecture

The architecture of our method is shown in Fig. 1. It consists of two stages, namely a training stage and a detection stage, each comprising three consecutive modules: (1) a low-level feature extraction module; (2) a feature encoding module; (3) a detection module. In both stages, we first extract OF-SIFT features from samples, then quantize them into visual words and add their spatial information. Finally, the encoded feature vectors are fed into the ELM classifier, for training in the first stage and for detection in the second. Each sample is defined as a fixed window. Training samples are picked manually from several video frames and labeled as positive or negative depending on whether their content represents the target region. Testing samples are obtained by applying a sliding-window strategy to all video frames, without labels.

In the low-level feature extraction module, some preparatory steps are performed before computing OF-SIFT descriptors: denoising the image, calculating optical flow, truncating tiny flow magnitudes, and normalizing the flow magnitude within each window.

B. Extracting low-level features

1) Preprocessing: Noise can reduce the accuracy of the motion estimation represented by optical flow, which is sensitive to pixel-wise brightness. So we first adopt a bilateral filter [21] to denoise the image; it effectively reduces unwanted noise while keeping edges fairly sharp. Then, we compute dense optical flow using the duality-based total variation algorithm with an L1 norm (TV-L1) [22], [23] to capture the magnitude and orientation at each pixel of the flow images. Let u and v be the x and y components of the optical flow vector at a pixel; then the flow magnitude is √(u² + v²) and the flow orientation is atan2(v, u) + π (transformed into 0°–360°).
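To make this preprocessing chain concrete, a minimal Python sketch is given below. It assumes OpenCV with the contrib optflow module for the TV-L1 algorithm; the bilateral-filter parameters and function names are illustrative placeholders, not settings taken from the paper.

```python
import cv2
import numpy as np

def preprocess_pair(prev_bgr, curr_bgr):
    """Denoise two consecutive frames, estimate TV-L1 optical flow, and
    return per-pixel flow magnitude and orientation in [0, 360) degrees."""
    # Bilateral filtering: suppresses noise while keeping edges fairly sharp.
    prev = cv2.bilateralFilter(prev_bgr, d=9, sigmaColor=75, sigmaSpace=75)
    curr = cv2.bilateralFilter(curr_bgr, d=9, sigmaColor=75, sigmaSpace=75)

    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr, cv2.COLOR_BGR2GRAY)

    # Duality-based TV-L1 dense optical flow (needs opencv-contrib-python).
    tvl1 = cv2.optflow.createOptFlow_DualTVL1()
    flow = tvl1.calc(prev_gray, curr_gray, None)   # shape (H, W, 2) = (u, v)

    u, v = flow[..., 0], flow[..., 1]
    magnitude = np.sqrt(u ** 2 + v ** 2)
    # atan2(v, u) + pi, expressed in degrees within [0, 360).
    orientation = (np.degrees(np.arctan2(v, u)) + 180.0) % 360.0
    return magnitude, orientation
```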
The magnitudes in non-motion regions of the original flow fields are extremely small positive values rather than exactly zero, as they would be in the ideal case. If these values were preserved, we would be unable to measure the area of motion regions by counting the number of non-zero magnitudes, which we later use to filter out meaningless sliding windows in the detection stage. Hence, we set a small threshold on the magnitudes of all windows: any magnitude smaller than this threshold is reset to zero, along with its corresponding orientation.
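The truncation step might look as follows, reusing the magnitude and orientation arrays from the sketch above; the threshold value is an assumed placeholder, not a figure from the paper.

```python
import numpy as np

def truncate_tiny_flow(magnitude, orientation, threshold=0.5):
    """Zero out near-zero flow so that motion area can later be measured
    by counting non-zero magnitudes. The threshold is a tunable guess."""
    mask = magnitude < threshold
    magnitude, orientation = magnitude.copy(), orientation.copy()
    magnitude[mask] = 0.0
    orientation[mask] = 0.0   # reset the corresponding orientation as well
    return magnitude, orientation
```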
In practice, the magnitude range may vary among windows. This is natural, because different vehicles at different times are driven at different speeds, but the magnitude variation may affect the classifier's detection performance. So we apply min-max normalization to each window, scaling all magnitudes into the range between 0 and 1 using the following simplified formula, which performs a linear transformation of the original data:

    new_x = (x − min_v) / (max_v − min_v)    (1)

where [min_v, max_v] are the minimum and maximum of feature v. Eq. (1) maps a value x from the original range into a new value in [0, 1].

Min-max normalization is the most suitable technique for this step. On the one hand, it precisely matches our need to normalize the data into a specified range without changing the original relationships among values; on the other hand, those relationships should not be transformed until the subsequent round of nonlinear normalization in our OF-SIFT algorithm. So even after the linear normalization of the optical flow, the properties of the OF-SIFT descriptor remain unchanged.
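Applied per window, Eq. (1) reduces to a few lines; the guard against a constant window below is our own defensive addition.

```python
import numpy as np

def minmax_normalize(window_mag):
    """Linearly rescale one window's flow magnitudes into [0, 1] (Eq. (1))."""
    min_v, max_v = window_mag.min(), window_mag.max()
    if max_v == min_v:                 # constant window: nothing to rescale
        return np.zeros_like(window_mag)
    return (window_mag - min_v) / (max_v - min_v)
```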
2) Construction of the OF-SIFT descriptor: This section describes the construction of the feature descriptor in our OF-SIFT algorithm. First, the flow magnitude and orientation of M × M cells are extracted at each keypoint on a regular grid. Second, the flow information in each cell is quantized using bilinear interpolation into an orientation histogram with T bins covering 0°–360° evenly. The histogram is constructed by accumulating all flow orientations within the M × M cells, weighted by the flow magnitude. The resulting M × M × T = D dimensional vector serves as the feature descriptor. To make the descriptor more robust for feature matching, nonlinear normalization is applied to the OF-SIFT descriptor; in our work, we use Z-score normalization. A global optical flow image (field) can thus be mapped to a matrix of feature descriptors F = [f1, f2, ..., fN]ᵀ of size N × D, where N denotes the number of keypoints (grid points) in the image and fi (i ∈ [1, N]) is the OF-SIFT descriptor of the i-th keypoint. The typical steps of the proposed OF-SIFT algorithm are shown in Algorithm 1.

Algorithm 1 The OF-SIFT algorithm
Input: The local flow field, given as a regular grid in a global flow field.
Output: The vector serving as the OF-SIFT feature descriptor.
Step 1: Calculate the flow magnitude and its orientation at each pixel in the local flow field.
Step 2: Build histograms of oriented flow magnitude for each cell.
Step 3: Concatenate all histograms, with the flow magnitude as weights, into a vector.
Step 4: Normalize the vector to obtain the final feature descriptor.
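The sketch below illustrates Algorithm 1 under assumptions we make explicit: an M × M cell layout over the local flow field, hard assignment into the T orientation bins (the paper uses bilinear interpolation), and Z-score normalization at the end.

```python
import numpy as np

def of_sift_descriptor(mag, ori, M=4, T=8):
    """Build one OF-SIFT descriptor from a local flow field.

    mag, ori: 2-D arrays of flow magnitude and orientation (degrees).
    Returns a Z-score-normalized vector of length D = M * M * T.
    """
    H, W = mag.shape
    desc = np.zeros((M, M, T))
    cell_h, cell_w = H // M, W // M
    bins = (ori / (360.0 / T)).astype(int) % T   # orientation bin per pixel
    for i in range(M):
        for j in range(M):
            cm = mag[i * cell_h:(i + 1) * cell_h, j * cell_w:(j + 1) * cell_w]
            cb = bins[i * cell_h:(i + 1) * cell_h, j * cell_w:(j + 1) * cell_w]
            # Magnitude-weighted orientation histogram for this cell.
            desc[i, j] = np.bincount(cb.ravel(), weights=cm.ravel(),
                                     minlength=T)[:T]
    vec = desc.ravel()
    return (vec - vec.mean()) / (vec.std() + 1e-8)  # Z-score normalization
```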
C. Encoding low-level features

In this part, we quantize descriptors using the popular BOF model. All OF-SIFT descriptors extracted from the training set are used to generate a visual codebook through k-means clustering, which partitions the feature space into K regions. Each region is represented by its center, called a codeword. The descriptors from each training flow image and testing flow image are then assigned to the closest codeword in the codebook under the Euclidean distance.
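Codebook generation could be sketched with scikit-learn's k-means; the library choice and the value of K are ours, as the paper does not name an implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, K=200, seed=0):
    """Cluster training OF-SIFT descriptors (N x D) into K codewords."""
    km = KMeans(n_clusters=K, random_state=seed, n_init=10)
    km.fit(descriptors)
    return km.cluster_centers_          # codebook C of shape (K, D)
```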
As for feature quantization, we introduce X as a group of N D-dimensional descriptors extracted from the training set, X = [x1, x2, ..., xN]ᵀ of size N × D. Given a codebook including K codewords, C = [c1, c2, ..., cK]ᵀ of size K × D, the goal of coding is to transform each input descriptor x into a new code y, where y(i) is the value of the i-th element of y. Now we can regard the voting value for codeword ci as a function of x, namely g(x) = y(i). We use two coding schemes: hard vector quantization (hard-VQ) and soft vector quantization (soft-VQ).

    hard-VQ: g(x) = 1 if i = argmin_j ‖x − c_j‖², and 0 otherwise    (2)

where only one element of y receives the voting value 1 for the descriptor x.

    soft-VQ: g(x) = exp(−α‖x − c_i‖²)    (3)

where α is a smoothing factor that adjusts the assigned weight.
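Eqs. (2) and (3) translate directly into code; alpha below is the smoothing factor of Eq. (3).

```python
import numpy as np

def hard_vq(x, codebook):
    """Eq. (2): vote 1 for the nearest codeword, 0 elsewhere."""
    d2 = np.sum((codebook - x) ** 2, axis=1)  # squared distance to each c_j
    y = np.zeros(len(codebook))
    y[np.argmin(d2)] = 1.0
    return y

def soft_vq(x, codebook, alpha=1.0):
    """Eq. (3): Gaussian-weighted votes for every codeword."""
    d2 = np.sum((codebook - x) ** 2, axis=1)
    return np.exp(-alpha * d2)
```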
However, as BOF takes no account of the spatial distribution of features within an image, its representational ability is limited. So we introduce the spatial pyramid matching (SPM) method [24] into our system to integrate spatial information into the BOF model. The basic idea of SPM is that each image is represented at L levels of resolution, where the l-th level (l ∈ [0, 1, ..., L − 1]) divides the image evenly into 2^(2l) discrete cells, and a K-sized histogram of the features falling into each cell is computed. SPM degrades into BOF when only level l = 0 is used. These histograms are then concatenated into a K·Σ_{l=0}^{L−1} 2^(2l)-dimensional vector as the final image representation. Additionally, all histograms are weighted and normalized to make the representation more expressive and invariant to images with different numbers of features.

The L = 3 pyramid is recommended in [25]; it generates a total of 21 (= 1 + 4 + 16) histograms, and consequently a 21K-dimensional feature vector is formed for each image representation.
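A sketch of the SPM pooling follows; it assumes each keypoint carries its (row, col) position and a hard-VQ codeword index, and it omits the per-level weighting for brevity.

```python
import numpy as np

def spm_histogram(positions, codes, img_h, img_w, K, L=3):
    """Concatenate K-bin histograms over 2^l x 2^l grids for l = 0..L-1.
    With L = 3 this yields 1 + 4 + 16 = 21 cells, i.e. a 21K-dim vector."""
    feats = []
    for l in range(L):
        n = 2 ** l                               # n x n cells at level l
        hist = np.zeros((n, n, K))
        for (r, c), k in zip(positions, codes):
            i = min(int(r * n / img_h), n - 1)
            j = min(int(c * n / img_w), n - 1)
            hist[i, j, k] += 1
        feats.append(hist.ravel())
    vec = np.concatenate(feats)
    return vec / (vec.sum() + 1e-8)              # normalize the final vector
```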
D. KELM based traffic accident detection

This paper uses a two-class kernel-based ELM (KELM) to detect whether a traffic accident occurs in a given video. KELM has better generalization performance than basic ELM, is simple to implement, and runs much faster than SVM [20], [26]. In the training stage, after encoding the low-level features, all training feature vectors with their labels are grouped as the training data for the KELM classifier. During the detection stage, testing data is organized in the same manner as the training data, but the testing feature vectors are obtained by the sliding-window technique over the video frames, and their classes are predicted by the classifier rather than labeled in advance.

1) Overview of kernel based extreme learning machine: Basic ELM is derived from a single-hidden-layer feedforward neural network (SLFN), whose structure is illustrated in Fig. 2. It seeks to achieve fast network training with little human supervision. The essence of ELM is that the hidden layer of a SLFN does not require tuning, since the hidden-layer parameters are assigned randomly. Recently, KELM has been proposed by generalizing basic ELM to a kernel version via the kernel trick.

Fig. 2. Structure of ELM.

Given a standard SLFN with L hidden nodes and M output nodes (M classes), for the input vector x of a training sample, the output of ELM can be written as

    f(x) = Σ_{i=1}^{L} β_i h_i(x) = h(x)β = t(x)    (4)

where β = [β1, β2, ..., βL]ᵀ of size L × M denotes the output weight matrix connecting the hidden layer to the output layer, h_i(x) is the activation function of the i-th hidden node, h(x) = [h1(x), h2(x), ..., hL(x)] denotes the output vector of the hidden layer, and t(x) = [t1(x), t2(x), ..., tM(x)] denotes the output vector of the true class label.

Given N training samples, (4) can be written in linear matrix-vector form:

    Hβ = T    (5)

where H = [h(x1), h(x2), ..., h(xN)]ᵀ of size N × L denotes the hidden-layer output matrix for all training samples, and T = [t(x1), t(x2), ..., t(xN)]ᵀ of size N × M denotes the matrix of true class labels for all training samples.

Unlike in a traditional SLFN, randomness here is applied to all parameters of h(x), so H does not need to be re-tuned. The training of such a SLFN is thus converted into the problem of finding the smallest-norm least-squares solution β̂ using the Moore-Penrose generalized inverse:

    β̂ = H†T    (6)

H† can be calculated as H† = Hᵀ(I/C + HHᵀ)⁻¹ [27], where I is an identity matrix of dimension N and C is a regularization factor that improves the stability and generalization performance of the learning system. Thus β̂ is obtained as:

    β̂ = Hᵀ(I/C + HHᵀ)⁻¹T    (7)
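Eqs. (4)–(7) admit a compact implementation. The sketch below uses a random sigmoid hidden layer (our choice of activation, which the text does not fix) and the regularized solution of Eq. (7).

```python
import numpy as np

def train_basic_elm(X, T, L=500, C=1.0, seed=0):
    """Solve Eq. (7): beta = H^T (I/C + H H^T)^(-1) T.
    X: (N, d) inputs; T: (N, M) one-hot targets; L: hidden nodes."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], L))       # random input weights
    b = rng.normal(size=L)                     # random biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))     # hidden-layer output, (N, L)
    N = X.shape[0]
    beta = H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Eq. (4): f(x) = h(x) beta; the class is the argmax of the M outputs."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return np.argmax(H @ beta, axis=1)
```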
So the output function of ELM in (4) can be transformed into:

    f(x) = h(x)β = h(x)Hᵀ(I/C + HHᵀ)⁻¹T    (8)

The basic ELM in (8) can be extended to KELM by using the kernel trick to transform the explicit activation function into an implicit mapping function. To be specific, both h(x)Hᵀ and HHᵀ in (8) can be replaced by a kernel function: h(x1) · h(x2) = k(x1, x2). As a result, the output function of KELM is obtained as follows:

    f(x) = h(x)Hᵀ(I/C + HHᵀ)⁻¹T = K_x(I/C + K)⁻¹T    (9)

where the kernel vector is K_x = [k(x, x1), k(x, x2), ..., k(x, xN)] and the kernel matrix is K = [k_{x1}, k_{x2}, ..., k_{xN}]ᵀ of size N × N. In our experiments, we employ ELM with the Gaussian kernel k(x1, x2) = exp(−γ‖x1 − x2‖²), so the performance of KELM depends solely on the combination of the regularization factor C and the kernel parameter γ.
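With the Gaussian kernel, Eq. (9) reduces training to a single linear solve. A minimal sketch follows; the values of C and gamma are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    """k(x1, x2) = exp(-gamma * ||x1 - x2||^2) for all row pairs of A, B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def train_kelm(X, T, C=10.0, gamma=0.1):
    """Precompute alpha = (I/C + K)^(-1) T from Eq. (9)."""
    K = gaussian_kernel(X, X, gamma)
    alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)
    return alpha

def kelm_predict(X_test, X_train, alpha, gamma=0.1):
    """Eq. (9): f(x) = K_x (I/C + K)^(-1) T; pick the argmax class."""
    Kx = gaussian_kernel(X_test, X_train, gamma)
    return np.argmax(Kx @ alpha, axis=1)
```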
2) Filtering out meaningless sliding windows: In the detection stage, we use the sliding-window technique to handle the unknown location and size of the target. But a new problem arises: the sliding window produces a large number of meaningless windows (samples) containing only a few small non-zero flow fields, or nothing at all (background). These windows cannot reflect complete and distinctive motion patterns, and they would significantly increase the classifier's workload in detection.

To solve this problem, we use an area ratio to filter out unwanted sliding windows. The area ratio is denoted by A/B, where A is the total area of all non-zero flow fields in the window, computed by counting the number of non-zero flow vectors over the pixels of the window, and B is the area of the corresponding window, computed as the product of its width and height. Given a preset ratio C that specifies the minimum requirement for a meaningful window, a window is retained if A/B > C; otherwise it is reset to background. Consequently, the area ratio lets the selected training windows focus only on regions containing rich motion information or pure background, so the number of training samples to collect is greatly reduced.
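The area-ratio test is a one-liner in practice; in the sketch below, min_ratio plays the role of the preset ratio C.

```python
import numpy as np

def is_meaningful_window(window_mag, min_ratio=0.2):
    """Keep a sliding window only if A/B > C, where A counts the non-zero
    flow vectors in the window and B is the window's area in pixels."""
    A = np.count_nonzero(window_mag)
    B = window_mag.size                  # width * height
    return A / B > min_ratio
```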
III. EXPERIMENTS

A. Dataset

Our system is evaluated on a dataset we collected from the web resource http://www.youtube.com. It is composed of 324 training samples covering six different collision types, as defined in Section I. Each type contains an average of 53 training samples, and the proportion of positive to negative training samples is around one to six. Some hand-picked training samples of one collision type in our dataset are shown in Fig. 3.

Fig. 3. Columns 1, 2 and 3 each constitute a pair of consecutive frames. Green bounding boxes (positive windows) represent the target regions in the last seconds before the vehicle collision. Red and cyan bounding boxes (negative windows) represent normally driving vehicles and background regions, respectively.

TABLE I
DETECTION RESULTS FOR COLLISION TYPES I AND II

Collision type      Coding     Precision   Recall   Accuracy
Collision type i    Hard-VQ    100%        50%      82.6%
Collision type i    Soft-VQ    85.7%       75%      86.9%
Collision type ii   Hard-VQ    100%        60%      93.3%
Collision type ii   Soft-VQ    83.3%       100%     96.7%

Fig. 5. (a) and (b) respectively show sequential testing video frames of collision type ii detected by hard-VQ and soft-VQ.

IV. CONCLUSION

We devise an advanced local descriptor that captures rich and explicit motion information in the temporal domain while using no static spatial constraints. In order to construct a more compact representation of video scenes, we build the bag of features model with spatial information. An extreme learning machine with a Gaussian kernel is employed as the base classifier. Additional strategies, such as normalization of the flow magnitude and filtering of sliding windows, are used to reduce the classifier's workload in computation and detection. Experiments using real-world data have shown that the proposed method achieves high recognition accuracy. In the future, we will seek to describe the collision process by establishing a 3D motion descriptor based on the current work.

REFERENCES

[1] C. Lin, J. C. Tai, and K. Song, "Traffic monitoring based on real-time image tracking," in Proceedings of the 2003 IEEE International Conference on Robotics and Automation, vol. 2, 2003, pp. 2091–2096.
[2] B. T. Morris and M. M. Trivedi, "A survey of vision-based trajectory learning and analysis for surveillance," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1114–1127, 2008.
[3] T. Liao, "Clustering of time series data," Pattern Recognition, vol. 38, pp. 1857–1874, 2005.
[4] Z. Fu, W. Hu, and T. Tan, "Similarity based vehicle trajectory clustering and anomaly detection," in Proceedings of IEEE International Conference on Image Processing, 2005, p. 1.
[5] W. Hu, X. Xiao, D. Xie, and T. Tan, "Traffic accident prediction using 3-D model-based vehicle tracking," IEEE Transactions on Vehicular Technology, vol. 53, no. 3, pp. 677–695, 2004.
[6] W. Hu, X. Xiao, Z. Fu, D. Xie, T. Tan, and S. Maybank, "A system for learning statistical motion patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1450–1464, 2006.
[7] S. Kamijo, Y. Matsushita, K. Ikeuchi, and M. Sakauchi, "Traffic monitoring and accident detection at intersections," IEEE Transactions on Intelligent Transportation Systems, vol. 1, no. 2, 2000.
[8] C. Piciarelli and G. L. Foresti, "On-line trajectory clustering for anomalous events detection," Pattern Recognition Letters, vol. 27, no. 15, pp. 1835–1842, 2006.
[9] X. Li and F. M. Porikli, "A hidden Markov model framework for traffic event detection using video features," in Proceedings of IEEE International Conference on Image Processing, 2004, pp. 2901–2904.
[10] E. A. Swears, A. Hoogs, and A. G. A. Perera, "Learning motion patterns in surveillance video using HMM clustering," in Proceedings of IEEE Workshop on Motion & Video Computing, 2008, pp. 1–8.
[11] Ö. Aköz and M. Karsligil, "Traffic event classification at intersections based on the severity of abnormality," Machine Vision and Applications, vol. 25, no. 3, pp. 613–632, 2014.
[12] Y. Ki and D. Lee, "A traffic accident recording and reporting model at intersections," IEEE Transactions on Intelligent Transportation Systems, vol. 8, no. 2, 2007.
[13] M. Y. Chen and A. Hauptmann, "MoSIFT: Recognizing human actions in surveillance videos," 2009.
[14] V. Saligrama, J. Konrad, and P. Jodoin, "Video anomaly identification," IEEE Signal Processing Magazine, vol. 27, no. 5, 2010.
[15] Y. Zou, G. Shi, H. Shi, and Y. Wang, "Image sequences based traffic incident detection for signaled intersections using HMM," in Proceedings of the 2009 IEEE International Conference on Hybrid Intelligent Systems, vol. 1, 2009, pp. 257–261.
[16] S. Chen, W. Wang, and H. V. Zuylen, "Construct support vector machine ensemble to detect traffic incident," Expert Systems with Applications, vol. 36, no. 8, pp. 10976–10986, 2009.
[17] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in BMVC 2009 - British Machine Vision Conference, 2009, pp. 124.1–124.11.
[18] M. Pal, "Extreme-learning-machine-based land cover classification," International Journal of Remote Sensing, vol. 30, no. 14, pp. 3835–3841, 2009.
[19] S. Xie, Y. Wu, Y. Zhang, J. Zhang, and C. Liu, "Single channel single trial P300 detection using extreme learning machine: Compared with BPNN and SVM," in Proceedings of the 2014 IEEE International Joint Conference on Neural Networks (IJCNN), 2014, pp. 544–549.
[20] G. B. Huang, D. H. Wang, and Y. Lan, "Extreme learning machines: a survey," International Journal of Machine Learning and Cybernetics, vol. 2, no. 2, pp. 107–122, 2011.
[21] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Proceedings of the 1998 IEEE International Conference on Computer Vision, 1998, pp. 839–846.
[22] C. Zach, T. Pock, and H. Bischof, "A duality based approach for realtime TV-L1 optical flow," in Pattern Recognition. Springer, 2007, pp. 214–223.
[23] J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo, "TV-L1 optical flow estimation," Image Processing On Line, vol. 2013, pp. 137–150, 2013.
[24] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.
[25] S. Lazebnik, C. Schmid, and J. Ponce, "Spatial pyramid matching," Object Categorization: Computer and Human Vision Perspectives, vol. 3, no. 4, 2009.
[26] G. B. Huang and C. K. Siew, "Extreme learning machine with randomly assigned RBF kernels," International Journal of Information Technology, vol. 11, no. 1, pp. 16–24, 2005.
[27] G. B. Huang, H. Zhou, X. Ding, and R. Zhang, "Extreme learning machine for regression and multiclass classification," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, 2012.