
An Intelligent Video Analysis Method for Abnormal Event Detection in Intelligent Transportation Systems

Shaohua Wan, Senior Member, IEEE, Xiaolong Xu, Member, IEEE, Tian Wang, and Zonghua Gu, Senior Member, IEEE

Manuscript received March 28, 2020; revised July 29, 2020; accepted August 13, 2020. This work was supported in part by the National Natural Science Foundation of China (No. 61672454, No. 61762055), in part by the Fundamental Research Funds for the Central Universities of China under Grant 2722019PY052, and in part by the open project of the State Key Laboratory for Novel Software Technology, Nanjing University, under Grant No. KFKT2019B17. The Associate Editor for this article was A. Jolfaei. (Corresponding author: Shaohua Wan.)

Shaohua Wan is with the Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, China, also with the School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China, and also with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: [email protected]).

Xiaolong Xu is with the School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China.

Tian Wang is with the College of Computer Science, Huaqiao University, Xiamen 361021, China.

Zonghua Gu is with the Department of Applied Physics and Electronics, Umeå Universitet, 90187 Umeå, Sweden.

Digital Object Identifier 10.1109/TITS.2020.3017505

Abstract— Intelligent transportation systems pervasively deploy thousands of video cameras, and analyzing the live video streams from these cameras is of significant importance to public safety. As the volume of streaming video grows, it becomes infeasible to have human operators sitting in front of hundreds of screens to catch suspicious activities or detect objects of interest in real time. With millions of traffic surveillance cameras installed, video retrieval is more vital than ever. To that end, this article proposes a long video event retrieval algorithm based on superframe segmentation. By detecting the motion amplitude of the long video, a large number of redundant frames can be removed, thereby reducing the number of frames that need to be processed subsequently. Then, a superframe segmentation algorithm based on feature fusion divides the remaining video into several Segments of Interest (SOIs) which contain the video events. Finally, a trained semantic model matches the answer generated from the text question, and the result with the highest matching value is taken as the video segment corresponding to the question. Experimental results demonstrate that the proposed long video event retrieval and description method significantly improves the efficiency and accuracy of semantic description and significantly reduces the retrieval time.

Index Terms— Intelligent transportation systems, long video event retrieval, segment of interest, superframe segmentation, question-answering.

I. INTRODUCTION

INTELLIGENT transportation systems (ITS) can improve traffic efficiency and effectively guarantee the safety of vehicles and pedestrians on the supervised road section, and they have therefore attracted wide attention from researchers. Road traffic safety faces increasingly severe challenges, and traffic accidents still happen frequently. Detecting traffic accidents quickly and accurately, and avoiding the safety problems they cause, is a major challenge. As one of the most important sources of video data, capture cameras can be found at virtually every road intersection. Moreover, the number of cameras has been expanding at an annual growth rate of 20%, accompanied by video analysis built on this video big data. With the rapid growth in the number of applications, video analysis for public safety in intelligent transportation has attracted the attention of both academia and industry. In this context, extracting useful information from videos has become a key goal in the development of ITS, in order to reduce traffic accidents and establish liability when they occur. An intelligent video analysis method for abnormal event detection is an effective means to achieve this goal, and it will largely determine the degree of intelligence of the entire ITS.

It is easy for humans to watch a long video and describe in text what happened at each moment. However, it is very challenging to make a machine capture and extract specific events from long videos and then produce descriptive text. The technology that accomplishes this task has received extensive attention in computer vision because of its promising prospects in video surveillance and in assisting the blind. Traffic departments analyze video streams from cameras at intersections for traffic flow control, vehicle recognition, vehicle property extraction, traffic rule violations, and accident detection. Unlike the comparatively simple task of semantically describing static images, describing video content is more challenging, because it requires understanding a series of consecutive scenes to generate multiple description segments. At present, most existing research focuses on the description of short videos or video segments. However, the videos that record actual scenarios are very long and may run to hundreds of minutes, so video retrieval and information selection become time-consuming and costly.

Event retrieval and description of long videos are generally driven by advances in segment of interest (SOI) recognition, key frame selection, and image semantic description and generation.

Sah et al. [1] extracted the SOI based on the quality of video frames and then used deep learning algorithms to encode and decode the video segments, thereby converting the key frames of valid video segments into text annotations; finally, humans were asked to perform information selection and semantic evaluation on the text annotations. Lu and Grauman [2] proposed a video summary generation algorithm that uses image quality factors to select representative sub-videos from a given long video to describe basic events. Wolf [3] used the key frame sequence in a video segment to represent the change in video content and to replace the corresponding video, which not only reduced the amount of data to be processed but also greatly improved the efficiency of video retrieval. All of the above methods select key frames in long videos and use these frames, instead of the long videos themselves, to describe the video content. However, they all rely on a single modality (video) as the reference for retrieving video content. In practice, videos are often associated with other modalities such as audio or text, for example the subtitles of a movie or TV show or the audience comments accompanying a live video. These related modalities may be an equally important source for retrieving user-relevant moments.

Fig. 1. Long video segmentation and specific event retrieval.

As shown in Fig. 1, in a continuous video of a street scene, several volunteers prepare some food for distribution to the homeless. The video contains introductions, food preparation, and the distribution of food to the homeless on the street. If we want to refer to a specific scene or a certain moment in the video, such as an old man sitting on the street, simply referencing the moment by keywords such as an action, object, or attribute may not uniquely identify it. For example, important objects in the scene, such as the elderly, appear in many frames. Based on this example, we consider using natural language to locate moments in a video. Specifically, given a video and a text description, we identify the start and end points in the video that correspond to the description, which is a challenging task that requires understanding both language and video. It has important applications in video retrieval, such as finding specific moments in a long video or finding a desired B-roll stock video segment in a large video library (such as Adobe Stock, Getty, or Shutterstock). Aiming at the problems of large-scale computation and long processing times in the content analysis and topic retrieval of long videos, this article proposes a novel long video event retrieval and description method which significantly improves the efficiency and accuracy of semantic description and significantly reduces the retrieval time.

The main contributions of this article can be summarized as follows:

• An intelligent video analysis method for abnormal event detection in intelligent transportation systems is proposed based on VQA. By detecting the motion amplitude of the long video, a large number of redundant frames can be effectively removed, thereby reducing the number of frames that need to be processed subsequently.
• By using a superframe segmentation algorithm based on feature fusion, the remaining long video is divided into several SOIs which contain the video events.
• A trained semantic model is presented to match the answer generated from the text question, and the result with the highest matching value is taken as the video segment corresponding to the question.
• An extensive experimental validation study has been conducted on benchmark datasets such as the SumMe dataset and the Hollywood2 dataset, on which the method achieves excellent performance.

The rest of the paper is organized as follows. Section II provides background on the closely related work. Section III introduces the proposed long video event retrieval algorithm based on superframe segmentation. In Section IV, we discuss the experimental setup and the results obtained. Section V wraps up this article with conclusions and a discussion of our ongoing efforts.

II. RELATED WORK

A. Long Video Event Retrieval

With the rapid development of Internet technology and the popularization of multimedia equipment, video resources have boomed. For example, about 100 hours of video are uploaded to YouTube every minute. These videos often lack professional annotations and content descriptions, which hinders the rapid retrieval of required video resources and prevents real-time surveillance of traffic video. Therefore, the use of natural language descriptions has been proposed to describe events in videos; people then pose text questions according to their needs, and the video events are finally retrieved and located through answer matching. At present, the widely used approach [4] to event retrieval is the deep video-language embedding method proposed in references [5]–[8]. In addition, such methods also rely on the joint embedding of video features and natural language. For example, reference [9] used home video surveillance to retrieve daily events with a fixed set of spatial prepositions ("across" and "through"). Similarly, reference [10] considered aligning text instructions with video events. However, aligning instructions with video is only applicable to structured videos, because the alignment is constrained by the order of the instructions. In contrast, actual surveillance video generally contains unconstrained open scenes.


B. Video Semantic Description

The essence of video semantic description is to separate the important events in a video according to time labels and give corresponding description sentences. Earlier research on video summarization did not include natural language input [11]–[14], although some algorithms used video-related text [15] or category tags for event query and content selection [16]. Reference [17] collected text descriptions of video blocks as a summary of the entire video. The datasets used in the above methods do not contain relational expressions and have a limited scope of application, so they are not suitable for event retrieval in actual monitoring scenarios.

C. Video Captioning With Question Answering

A question answering system takes an image and a free-form, open-ended natural language question about the image as input, and generates a natural language answer as output. Since question answering involves both machine vision and natural language processing, combining a machine vision algorithm with a natural language processing algorithm into one combined model has become the most common way to tackle the problem. This combined structure first uses a deep learning architecture to extract visual features, and then uses a recurrent neural network capable of processing sequence information to generate the text description of an image. Ma et al. [18] used three convolutional neural networks (CNNs) to complete the image question-answering task. Gao et al. [19] used a more complex model structure. Malinowski and Fritz [20] combined the latest techniques in natural language processing and computer vision to propose a method for automatically answering image questions. Ren et al. [21], [22] suggested combining neural networks and visual semantics, instead of preprocessing steps such as object detection and image segmentation, to perform answer prediction, and obtained good results on public benchmark datasets. Tu et al. [23] jointly parsed video and the corresponding text content and tested the approach on two datasets containing 15 video samples. A successful VQA system therefore usually requires a more detailed understanding of the image and more complex reasoning than a system that generates generic image captions. Agrawal et al. [24] proposed a free-form, open-ended VQA model that provides accurate natural language answers when given an image and a relevant natural language question.

III. PROPOSED METHOD

A. Detection of Redundant Frames in a Long Video

Traffic surveillance cameras generally collect video data in the surveillance area at a sampling rate of 25 frames per second to ensure that the video maintains good smoothness. Because these cameras record traffic scenes 24 hours a day without interruption, the total number of generated frames can reach hundreds of thousands or even millions. Processing such a large number of frames consumes a lot of computation time, making it difficult to meet the requirements of real-time traffic monitoring. By observing the behavior events in surveillance videos, it is found that long videos often contain a large number of useless static frames (redundant frames), and processing these redundant frames wastes much time.

In order to improve the speed of processing large videos, it is necessary to detect and remove the large number of redundant and meaningless frames contained in long videos. In this research, motion amplitude detection based on local spatiotemporal interest points is used to detect redundant frames effectively. Firstly, an improved spatiotemporal interest point detection algorithm is used to calculate the spatiotemporal interest points of each frame in the video. Then, surround inhibition is combined with local and temporal constraints to detect static interest points in the frame. According to the characteristics of spatiotemporal interest points, and based on experimental observations, when the number and position of interest points in a video do not change, the content of the video is considered unchanged. This property can therefore be employed to remove the large number of unchanged redundant frames in a long video. When the number of valid spatiotemporal interest points detected falls below a threshold, the current video has a low amplitude of motion or no motion at all, so its content can be regarded as unchanged and the redundant frames can be removed. In addition, because of the repetitive nature of these frames, deleting them does not affect the expression of the video content.
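As a concrete illustration, the sketch below filters redundant frames by thresholding a per-frame motion-amplitude proxy. It is only a minimal approximation of the approach described above: plain frame differencing plus Shi-Tomasi corners restricted to the moving regions stands in for the improved spatiotemporal interest point detector with surround inhibition, and the function name and threshold values are illustrative choices rather than values taken from the paper.

```python
# Hypothetical sketch: drop low-motion (redundant) frames by thresholding a
# per-frame motion-amplitude proxy. Frame differencing plus Shi-Tomasi corners
# on the moving regions approximates the spatiotemporal interest point count.
import cv2

def keep_informative_frames(video_path, min_points=20, diff_thresh=25):
    """Return indices of frames whose motion amplitude exceeds the threshold."""
    cap = cv2.VideoCapture(video_path)
    kept, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            # Moving regions: absolute difference against the previous frame.
            moving = cv2.absdiff(gray, prev_gray)
            _, mask = cv2.threshold(moving, diff_thresh, 255, cv2.THRESH_BINARY)
            # Corners restricted to moving regions stand in for the count of
            # valid spatiotemporal interest points in the current frame.
            pts = cv2.goodFeaturesToTrack(gray, maxCorners=200,
                                          qualityLevel=0.01, minDistance=7,
                                          mask=mask)
            if pts is not None and len(pts) >= min_points:
                kept.append(idx)  # enough motion: keep this frame
        prev_gray = gray
        idx += 1
    cap.release()
    return kept
```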
B. Extraction of SOI Based on Superframe Segmentation

In the previous section, a large number of redundant frames in a long video were removed by comparing the changes in the number of motion detection boxes. Since feature extraction and feature matching of the frames in a long video must be performed later, reducing the number of extra frames greatly improves the processing speed. This section performs video segmentation on the long video after the redundant frames have been removed, and then extracts SOIs for video event retrieval. Video superframe segmentation divides a video sequence into specific, unique parts or subsets according to certain rules and extracts the SOIs. Reference [25] proposed a method for image quality assessment and applied it to the fast classification of high-quality professional images and low-quality snapshots. Inspired by this, this section combines low-level features such as contrast, sharpness, and color with high-level semantic features such as attention and face information. A combination of these features is used to calculate the interestingness measure of a video segment, and the long video is then segmented based on this measure.

This article follows the method in [25] to calculate the contrast score C. Each frame in the video is converted to a grayscale image, and the converted image is processed with a low-pass filter. The image is then resampled: its height is adjusted to 64 pixels and its width is adjusted according to the aspect ratio. Since sharpness is an important indicator of frame quality that corresponds well to human subjective perception, the sharpness score E is obtained by converting a frame into a grayscale image and then computing the squared difference of the grayscale values of adjacent pixels.


In addition to contrast and sharpness, color is also an important feature for video segmentation. Research on biological saliency holds that color is objectively a stimulus and a signal to humans, and subjectively a reaction and a behavior; the human visual system is very sensitive to external color changes. The spatial relationship also affects visual saliency: for example, high contrast between adjacent areas is more likely to attract visual attention. Similarly to the computation of the contrast score C, each frame in the video is first converted to the HSV color space and then processed with a low-pass filter. The image is resampled, with the height adjusted to 64 pixels and the width adjusted according to the aspect ratio. The average color saturation score S of the frame is then calculated.

In video segmentation, high-level semantic information needs to be considered in addition to the low-level feature information. Here, the method of reference [26] is used to calculate the attention score A: using a time-gradient-based dynamic visual saliency detection model, frames that may attract visual attention are collected and the corresponding attention score A is calculated. Face information can also serve as an important cue for video event retrieval. Similarly to the method of reference [27], faces are detected in the frame, a score is assigned to each detected face, and the scores of all detected faces are summed to give the face score F. Finally, according to the contribution weights of the different features, a combination of these multi-modal features is used to calculate the score of an SOI in the video:

I_score = η·A + C·E·S + γ·F   (1)

where I_score is the final interestingness score, γ = 0.5, and η = 0.25. The interestingness score is thus a non-linear combination of the feature scores Attention (A), Contrast (C), Sharpness (E), Colorfulness (S), and Facial impact (F), and the boundaries of the long video are determined by this score. In Fig. 2, the different colors represent the contributions of the individual features — Attention, Contrast, Sharpness, Colorfulness, and Facial score. Empirical testing has shown that Attention, Contrast, Colorfulness, and Sharpness are essential feature elements for video segmentation. Facial information is also of great importance; however, not every video contains a human face, so an influence factor is applied to the facial score. The final superframe-cut interestingness score is computed as in Equation (1).

The long video is segmented into several segments, and these segments contain the key frames from the original video that are used to generate video captions. As shown in Fig. 2, the video is segmented according to the scores of the different feature elements, and the long video is divided into multiple SOIs.

Fig. 2. Segmenting a long video into segments using a superframe segmentation algorithm.
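The sketch below illustrates how the per-frame scores and the interestingness measure of Equation (1) could be computed. It is a simplified reading of the description above: OpenCV primitives stand in for the methods of references [25]–[27], the attention score A is approximated by a temporal-difference proxy rather than the saliency model of [26], the face score uses a stock Haar cascade, and the normalizations and helper names are assumptions made for illustration.

```python
# Hypothetical sketch of the interestingness measure of Eq. (1),
# I = eta*A + C*E*S + gamma*F with eta = 0.25 and gamma = 0.5.
import cv2
import numpy as np

ETA, GAMMA = 0.25, 0.5
FACE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def resize_h64(img):
    # Resample so that the height is 64 and the width keeps the aspect ratio.
    h, w = img.shape[:2]
    return cv2.resize(img, (max(1, int(w * 64 / h)), 64))

def low_pass_gray(frame):
    return resize_h64(cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (5, 5), 0))

def frame_scores(frame, prev_frame=None):
    gray = low_pass_gray(frame)
    contrast = gray.std() / 255.0                                          # C
    sharp = np.mean(np.diff(gray.astype(np.float32), axis=1) ** 2) / 255.0 ** 2  # E
    hsv = resize_h64(cv2.cvtColor(frame, cv2.COLOR_BGR2HSV))
    sat = hsv[:, :, 1].mean() / 255.0                                      # S
    faces = FACE.detectMultiScale(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    face = min(1.0, len(faces) * 0.5)                                      # F (capped sum)
    if prev_frame is None:
        attn = 0.0                                                         # A (proxy)
    else:
        attn = cv2.absdiff(gray, low_pass_gray(prev_frame)).mean() / 255.0
    return attn, contrast, sharp, sat, face

def interestingness(frame, prev_frame=None):
    a, c, e, s, f = frame_scores(frame, prev_frame)
    return ETA * a + c * e * s + GAMMA * f                                 # Eq. (1)
```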
C. Extraction of Visual Features

Through the above processing, the long video is converted into several SOIs whose redundant frames have been removed. The video events are contained in these segments, so only these segments need further processing. The video event retrieval model is required to use natural-language text questions effectively for query localization. For any given SOI v = (v_t), t ∈ {0, . . . , T − 1}, where T is the length of the SOI, let τ̂ = (τ_start, τ_end) denote the start and end of the SOI corresponding to the event, relative to the entire video. By combining local features and the global video context, the temporal context features of the video are extracted to encode the video moments:

τ̂ = argmin_τ J_θ(q, v, τ)   (2)

where J_θ is the joint embedding model that combines the text question q, the SOI features v, and the model parameters θ.

In order to further extract the key frame features and the feature information of an SOI, a deep convolutional network is used to extract features for each video. Local features are obtained from the frame at a specific time, global features are obtained from the whole SOI, and temporal endpoint features are obtained from the moment of the SOI. To construct the local and global video features, the deep convolutional network extracts the high-level features of each frame, and average pooling is then performed on the video features within the SOI, that is, the features of all frames in the SOI are averaged. When scene events in the video may involve the text questions considered in this article, we can query and confirm the events by matching. For example, if the text question is "someone is riding a bicycle," the proposed algorithm can locate the scene in the video where the event occurs, lock onto the initial moment of the event, and then encode the characteristics of this time period. This is similar to the global image features and context features often used in natural language object retrieval. Locating video events is usually achieved by locating specific actions, such as "cycling" and "running". This article uses the VGG model pre-trained on ImageNet to extract the local features, global features, and temporal endpoint features from frames, which are denoted by F_v^θ.
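A minimal sketch of this SOI feature construction is given below, assuming a VGG16 backbone from torchvision as the "VGG model pre-trained on ImageNet" (the exact variant is not specified in the paper); the choice of the centre frame for the local feature and the normalized start/end times for the temporal endpoint feature are likewise illustrative assumptions.

```python
# Hypothetical sketch: VGG encodes individual frames, mean pooling over the SOI
# gives the global feature, and the normalised start/end times give the
# temporal-endpoint feature. Frames are assumed to be HxWx3 uint8 RGB arrays.
import torch
import torchvision.models as models
import torchvision.transforms as T

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
encoder = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())
encoder.eval()

preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def soi_features(frames, t_start, t_end, video_len):
    """frames: list of RGB frames belonging to one SOI."""
    batch = torch.stack([preprocess(f) for f in frames])
    per_frame = encoder(batch)                      # one feature row per frame
    global_feat = per_frame.mean(dim=0)             # average pooling over the SOI
    local_feat = per_frame[len(frames) // 2]        # e.g. the centre frame of the SOI
    temporal = torch.tensor([t_start / video_len, t_end / video_len])  # endpoints
    return torch.cat([local_feat, global_feat, temporal])
```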


D. Word Vector Transformation of Question Text

The question text consists of natural language and must be pre-processed before further use. Firstly, each question is split into words by spaces and punctuation, and words consisting of numbers are also treated as separate words. Teney et al. [28] analyzed the length of the questions in the VQA dataset and found that only about 0.25% of the questions were longer than 15 words. Therefore, in order to improve computational efficiency, this article retains only 15 words when segmenting a sentence. The words are then transformed with word2vec into 300-dimensional word vectors. Finally, the word vectors are fed to an LSTM to extract language features, where the embedding sequence of the question sentence has a size of 15 × 300.
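The question-side processing just described can be sketched as follows; a randomly initialized nn.Embedding stands in for the pre-trained word2vec table, and the vocabulary, hidden size, and class name are illustrative assumptions.

```python
# Hypothetical sketch of the question pipeline: tokenise on spaces and
# punctuation, truncate/pad to 15 tokens, map each token to a 300-dimensional
# vector and run an LSTM over the 15x300 sequence.
import re
import torch
import torch.nn as nn

MAX_WORDS, EMB_DIM, HIDDEN = 15, 300, 512

class QuestionEncoder(nn.Module):
    def __init__(self, vocab):
        super().__init__()
        self.vocab = {w: i + 1 for i, w in enumerate(vocab)}   # 0 is the padding id
        self.embed = nn.Embedding(len(vocab) + 1, EMB_DIM, padding_idx=0)
        self.lstm = nn.LSTM(EMB_DIM, HIDDEN, batch_first=True)

    def tokenize(self, question):
        words = re.findall(r"\w+", question.lower())[:MAX_WORDS]
        ids = [self.vocab.get(w, 0) for w in words]             # unknown words -> padding id
        return ids + [0] * (MAX_WORDS - len(ids))               # pad to 15 tokens

    def forward(self, question):
        ids = torch.tensor([self.tokenize(question)])           # shape (1, 15)
        emb = self.embed(ids)                                    # shape (1, 15, 300)
        _, (h, _) = self.lstm(emb)
        return h[-1].squeeze(0)                                  # language feature vector

# Example: encoder = QuestionEncoder(["someone", "is", "riding", "a", "bicycle"])
#          q_vec = encoder("Someone is riding a bicycle")
```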
E. Combination of Visual Features and Text Vectors

After the video feature vector P_θ^V is obtained, it is mapped, together with the question text feature P_θ^L, into the same vector space through a non-linear transformation. The two vectors are then combined as:

J_θ(q, v, τ) = |P_θ^V(v, τ) − P_θ^L(q)|   (3)

After the model is constructed, it is trained with a loss function whose purpose is to obtain event moment information close to the description in the question text. In order to enhance the robustness of the model, negative samples from different SOIs of the same video and from different videos are added during training, so that the model can distinguish subtle behavioral differences. Herein, we follow the method proposed by Hendricks et al. [29], where the loss function is defined as:

Loss(θ) = Σ_{n∈N} Loss_R(J_θ(q_i, v_i, τ_i), J_θ(q_i, v_i, n_i))   (4)

where Loss_R(x, y) = max(0, x − y + b) is the ranking loss and N is the set of negative moments. In this way, the current video segment is closer to the query result of the question text than all other possible video segments from the same video.
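A minimal sketch of the joint embedding of Equation (3) and the ranking loss of Equation (4) is shown below; the layer sizes, the L1 summary of the element-wise difference, and the margin value are assumptions made for illustration rather than the paper's exact configuration.

```python
# Hypothetical sketch: video and language features are mapped into a shared
# space, the distance J is their absolute difference, and a margin loss pushes
# the annotated moment closer to the question than sampled negative moments.
import torch
import torch.nn as nn

class JointEmbedding(nn.Module):
    def __init__(self, video_dim, lang_dim, joint_dim=256, margin=0.1):
        super().__init__()
        self.p_v = nn.Sequential(nn.Linear(video_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))
        self.p_l = nn.Sequential(nn.Linear(lang_dim, joint_dim), nn.ReLU(),
                                 nn.Linear(joint_dim, joint_dim))
        self.margin = margin

    def distance(self, video_feat, lang_feat):
        # J(q, v, tau) = |P_V(v, tau) - P_L(q)|, summarised as an L1 distance.
        return (self.p_v(video_feat) - self.p_l(lang_feat)).abs().sum(dim=-1)

    def ranking_loss(self, q, positive, negatives):
        # Loss_R(x, y) = max(0, x - y + b), summed over negative moments drawn
        # from the same video and from other videos.
        pos = self.distance(positive, q)
        losses = [torch.clamp(pos - self.distance(neg, q) + self.margin, min=0)
                  for neg in negatives]
        return torch.stack(losses).mean()
```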
The pipeline of the VQA model based on multi-objective visual relationship detection proposed in this article (shown in Fig. 3) is inspired by research on target relationships in images. Firstly, the target relationship detection model is pre-trained, and the appearance relationship features are then used to replace the image features extracted from the original targets. At the same time, the appearance model is extended by the word-vector similarity principle of the relation predicate, and the appearance features and relationship predicates are projected into the word vector space and represented by fixed-size vectors. Finally, the integrated vector is sent to the classifier to generate an answer output through the concatenation of the picture feature vector and the question vector. The basic structure of the VQA model is to extract image information directly with a CNN and then feed the image features into an LSTM to produce prediction results. In this article, the target combination feature vector and the target relationship predicate generated by the image appearance relation model are used to provide the image information. The image appearance model consists of two parts: a target detection model and a target relationship judgment model.

Fig. 3. The pipeline of the visual question answering model.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

This section verifies the performance of the proposed algorithm through quantitative and qualitative analysis. The experiments are divided into the following subsections according to the algorithm steps. The first subsection verifies the long video segmentation and SOI extraction algorithms based on motion amplitude detection. The second subsection verifies the long video event retrieval algorithm based on text questions. Finally, the accuracy and reliability of the proposed algorithm are analyzed and verified using actual traffic scenarios.

A. Evaluation of the Long Video Superframe Segmentation Algorithm

This section uses the SumMe dataset [30] and the Hollywood2 dataset [31] to evaluate the effectiveness of the proposed superframe segmentation algorithm. The SumMe dataset contains 25 videos. The Hollywood2 dataset contains 3,669 samples, covering 12 categories of actions and 10 categories of scenes, all taken from 69 Hollywood movies. As shown in Fig. 4, the video is selected from the Hollywood2 dataset and shows the male protagonist driving home through the street in a movie clip. By detecting the number of interest points, we can decide whether the video content changes at different times, and the redundant video frames can then be removed according to the changes in the number of interest points over a certain time horizon.

Fig. 4. Identification of interesting segments from the long video.

Fig. 5. Identification of SOI from the long video on street scene.

As shown in Fig. 5, the video is also selected from the Hollywood2 dataset and shows an outdoor street scene. Unlike the previous two videos, outdoor scenes are often more complex and changeable. The characters and events contained in the video are no longer unique, and different events may overlap fully or partially on the time axis.

Therefore, it is a very challenging task to screen useful video segments from such complex outdoor environments and give appropriate descriptions. The number of frames in this video is 7,373. After motion amplitude detection, the number of frames is reduced to 1,700, and the entire video is divided into 29 SOIs.

Fig. 6. Identification of SOI from the long video on school scene.

As shown in Fig. 6, the video shows a road scene on a campus. The road conditions on campus are relatively simple compared with public roads, and the main task is the detection and description of pedestrians. The video content varies with the movement of the vehicle and mainly records the state of the vehicle and the pedestrians in front. This kind of video contains a large volume of redundant frames, and the optimization algorithm should be used to remove as many of them as possible. Table I shows the influence of the different features on the video segmentation results. The average value of each feature in superframe segmentation, together with the average benchmark correlation score, is used for the feature influence analysis, and the mean squared error of a linear regression model is used as the fitness criterion for each score. The evaluation results show that all features play a significant role in superframe segmentation. Although the contrast and facial features have the lowest scores, the overall performance of the features is well balanced. Reference [32] considered facial features to be the most important cue for key frame detection, but they are greatly affected by the sharpness and angle of the video in the actual detection process, which makes them an unstable factor.

TABLE I. Feature Evaluation of Complex Streetscape on SumMe Dataset.

B. Question Text Matching and Event Retrieval

After the long video has been optimized and segmented, several video segments are obtained, each of which contains potential video events. As shown in Fig. 7, a traffic accident video comprises several stages, including the pre-accident scene, the post-accident scene, and the crowd reaction to the accident. These stages can be divided into different video segments through preprocessing, and the semantic description of each video segment is then extracted using the semantic model. Next, when querying a question or searching for a specific event, we only need to match the question text against the extracted description sentences of the video segments to locate the time of the event and obtain the corresponding event description.

As shown in Fig. 8, after the long video has had its redundant frames removed and been divided into several video segments, the semantic VQA model can be used to obtain a natural language description sentence that represents each video segment. When we want to query the video content or a specific event, we only need to convert the question into a text vector and compare it against the different video segments. The part marked by the red box is the video segment closest to the text question. For example, if our question is "two men are talking", the model automatically retrieves the moment when the two men start talking in the video in chronological order and records the video segment and moment where this content is located. Similarly, if our question is "the moment when a white car appears", the model locates the first video segment containing the white car based on the existing retrieval results.


Fig. 7. Steps of extracting event description from several video segments.

Fig. 8. Process of retrieving and extracting the corresponding video segments from several video segments.

With the above method, the video event retrieval task can be completed successfully as long as two conditions are met: first, the frame-level semantic description is accurate, and second, the answer to the question matches the semantic description.
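The matching step can be illustrated with the following sketch, in which a simple bag-of-words cosine similarity stands in for the trained semantic matching model; the function names and the (start, end, description) segment format are illustrative assumptions.

```python
# Hypothetical sketch of the retrieval step: every SOI already has a generated
# description sentence; the question is matched against each description and
# the best-scoring segment (and hence its time span) is returned.
import math
import re
from collections import Counter

def bag(text):
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve_segment(question, segments):
    """segments: list of (start_time, end_time, description) for each SOI."""
    q = bag(question)
    scored = [(cosine(q, bag(desc)), (start, end, desc)) for start, end, desc in segments]
    return max(scored, key=lambda x: x[0])[1]

# Example:
# retrieve_segment("two men are talking",
#     [(0, 12, "a white car appears on the street"),
#      (12, 30, "two men are talking near the crossing")])
# -> (12, 30, "two men are talking near the crossing")
```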
In addition, a quantitative verification is performed to examine the effectiveness of the VQA model. The proposed algorithm is compared with the most widely used question-answering algorithms on multiple datasets. The performance of an image question-answering model is mainly evaluated using Acc and WUPS [33]. Table II compares the experimental results of the proposed algorithm on the standard DAQUAR-ALL dataset. The Acc metric is a comparison method borrowed from image classification; as most of the answers consist of single or multiple words, the effectiveness of the proposed algorithm can easily be evaluated by examining the accuracy of the words.
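A minimal sketch of such a word-level Acc evaluation is given below (WUPS, which additionally softens the match with WordNet-based Wu-Palmer similarity, is not shown); the comma-separated answer format is an assumption based on how short multi-word answers are commonly represented.

```python
# Hypothetical sketch of the Acc measure: answers are one word or a short
# comma-separated list, so accuracy is scored by exact word-set agreement
# between the predicted answer and the ground truth.
def answer_words(answer):
    return {w.strip().lower() for w in answer.split(",") if w.strip()}

def accuracy(predictions, ground_truths):
    hits = sum(answer_words(p) == answer_words(g)
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: accuracy(["chair", "red, blue"], ["chair", "blue, red"]) -> 1.0
```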


TABLE II. Reported Results on the DAQUAR-ALL Dataset.

V. CONCLUSION

Semantic retrieval of long videos is of paramount importance in traffic video surveillance applications. This article proposes a long video event retrieval algorithm based on superframe segmentation. By detecting the motion amplitude of the long video, a large number of redundant frames can be removed, thereby reducing the number of frames that need to be processed subsequently. Then, a superframe segmentation algorithm based on feature fusion divides the remaining video into several SOIs which contain the video events. Finally, the trained semantic model matches the answer generated from the text question, and the result with the highest matching value is taken as the video segment corresponding to the question.

When preventing and handling traffic safety accidents, people have ever higher requirements for real-time performance and accuracy. Processing videos and images at the edge can obviously reduce network bandwidth consumption and lower delay. Therefore, a video preprocessing architecture based on edge computing is presented to remove redundant information from video images, so that part or all of the video analysis is migrated to the edge or to an edge server. This reduces the dependency on cloud centers and decreases the computation, storage, and network bandwidth requirements, while improving the efficiency of video image analysis. Real-time data analysis and processing play an extremely important role in the prevention of many traffic accidents, and the high accuracy and low latency of video analysis tasks require strong computing performance. To address this, an architecture of collaborative edge and cloud is proposed, which offloads heavy computing tasks to the edge server or even to the cloud, while the small remaining amount of computation is kept locally at the edge. However, some video analysis tasks are long-term and continuous; for example, traffic volume statistics are used as a reference for the duration of traffic lights, and their demand for low delay is not critical. Therefore, for edge-computing-driven intelligent transportation video analysis, how to design an efficient integrated cloud-edge-end architecture, perform computation migration at different levels, and reasonably configure edge computing resources is a critical research topic that needs to be addressed in the future.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful insights and suggestions, which have substantially improved the content and presentation of this article.

REFERENCES

[1] S. Sah, S. Kulhare, A. Gray, S. Venugopalan, E. Prud'Hommeaux, and R. Ptucha, "Semantic text summarization of long videos," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 989–997.
[2] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2714–2721.
[3] W. Wolf, "Key frame selection by motion analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 1996, pp. 1228–1231.
[4] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Learning joint representations of videos and sentences with web image search," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 651–667.
[5] S. Ding, S. Qu, Y. Xi, and S. Wan, "Stimulus-driven and concept-driven analysis for image caption generation," Neurocomputing, vol. 398, pp. 520–530, Jul. 2020.
[6] S. Ding, S. Qu, Y. Xi, and S. Wan, "A long video caption generation algorithm for big video data retrieval," Future Gener. Comput. Syst., vol. 93, pp. 583–595, Apr. 2019.
[7] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," Trans. Assoc. Comput. Linguistics, vol. 2, pp. 207–218, Dec. 2014.
[8] Z. Gao, Y. Li, and S. Wan, "Exploring deep learning for view-based 3D model retrieval," ACM Trans. Multimedia Comput., Commun., Appl., vol. 16, no. 1, pp. 1–21, Apr. 2020.
[9] S. Tellex and D. Roy, "Towards surveillance video search by natural language query," in Proc. ACM Int. Conf. Image Video Retr. (CIVR), 2009, pp. 1–8.
[10] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien, "Unsupervised learning from narrated instruction videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4575–4583.
[11] O. Boiman and M. Irani, "Detecting irregularities in images and in video," Int. J. Comput. Vis., vol. 74, no. 1, pp. 17–31, Apr. 2007.
[12] S. Wan, Y. Xia, L. Qi, Y.-H. Yang, and M. Atiquzzaman, "Automated colorization of a grayscale image with seed points propagation," IEEE Trans. Multimedia, vol. 22, no. 7, pp. 1756–1768, Jul. 2020.
[13] M. Gygli, Y. Song, and L. Cao, "Video2GIF: Automatic generation of animated GIFs from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1001–1009.
[14] C. Chen, X. Liu, T. Qiu, and A. K. Sangaiah, "A short-term traffic prediction model in the vehicular cyber-physical systems," Future Gener. Comput. Syst., vol. 105, pp. 894–903, Apr. 2020.
[15] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3707–3715.
[16] A. Sharghi, B. Gong, and M. Shah, "Query-focused extractive video summarization," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 3–19.
[17] S. Yeung, A. Fathi, and L. Fei-Fei, "VideoSET: Video summary evaluation through text," 2014, arXiv:1406.5824. [Online]. Available: http://arxiv.org/abs/1406.5824
[18] L. Ma, Z. Lu, and H. Li, "Learning to answer questions from image using convolutional neural network," in Proc. 31st AAAI Conf. Artif. Intell., 2016, pp. 3567–3573.
[19] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, "Are you talking to a machine? Dataset and methods for multilingual image question," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2296–2304.
[20] M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1682–1690.
[21] M. Ren, R. Kiros, and R. Zemel, "Exploring models and data for image question answering," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2953–2961.
[22] Y. Xi, Y. Zhang, S. Ding, and S. Wan, "Visual question answering model based on visual relationship detection," Signal Process., Image Commun., vol. 80, Feb. 2020, Art. no. 115648.
[23] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu, "Joint video and text parsing for understanding events and answering queries," IEEE Multimedia, vol. 21, no. 2, pp. 42–70, Apr./Jun. 2014.


[24] S. Antol et al., "VQA: Visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2425–2433.
[25] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 419–426.
[26] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan, "Large-scale video summarization using Web-image priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2698–2705.
[27] N. Ejaz, I. Mehmood, and S. W. Baik, "Efficient visual attention based framework for extracting key frames from videos," Signal Process., Image Commun., vol. 28, no. 1, pp. 34–44, Jan. 2013.
[28] D. Teney, P. Anderson, X. He, and A. V. D. Hengel, "Tips and tricks for visual question answering: Learnings from the 2017 challenge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4223–4232.
[29] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, "Localizing moments in video with temporal language," 2018, arXiv:1809.01337. [Online]. Available: http://arxiv.org/abs/1809.01337
[30] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 505–520.
[31] G. Guan, Z. Wang, S. Lu, J. D. Deng, and D. D. Feng, "Keypoint-based keyframe selection," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 4, pp. 729–734, Apr. 2013.
[32] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel, "Image captioning and visual question answering based on attributes and external knowledge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1367–1381, Jun. 2018.
[33] C. L. Zitnick, D. Parikh, and L. Vanderwende, "Learning the visual interpretation of sentences," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1681–1688.
[34] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[35] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 203–212.
[36] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, "ABC-CNN: An attention based convolutional neural network for visual question answering," 2015, arXiv:1511.05960. [Online]. Available: http://arxiv.org/abs/1511.05960
[37] K. Kafle and C. Kanan, "Answer-type prediction for visual question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4976–4984.
[38] H. Noh, P. H. Seo, and B. Han, "Image question answering using convolutional neural network with dynamic parameter prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 30–38.
[39] Q. Wu, P. Wang, C. Shen, A. Dick, and A. Van Den Hengel, "Ask me anything: Free-form visual question answering based on knowledge from external sources," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4622–4630.
[40] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 21–29.

Shaohua Wan (Senior Member, IEEE) received the joint Ph.D. degree from the School of Computer, Wuhan University, and the Department of Electrical Engineering and Computer Science, Northwestern University, USA, in 2010. Since 2015, he has been holding a post-doctoral position with the State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology. From 2016 to 2017, he was a Visiting Professor with the Department of Electrical and Computer Engineering, Technical University of Munich, Germany. He is currently an Associate Professor with the School of Information and Safety Engineering, Zhongnan University of Economics and Law. He is the author of over 100 peer-reviewed research articles and books. His main research interests include deep learning for the Internet of Things and edge computing.

Xiaolong Xu (Member, IEEE) received the Ph.D. degree in computer science and technology from Nanjing University, China, in 2016. He was a Research Scholar with Michigan State University, USA, from April 2017 to May 2018. He is currently an Associate Professor with the School of Computer and Software, Nanjing University of Information Science and Technology. He has published more than 60 peer-reviewed articles in international journals and conferences, including the IEEE Transactions on Intelligent Transportation Systems (TITS), the IEEE Transactions on Industrial Informatics (TII), the ACM Transactions on Internet Technology (TOIT), the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), the IEEE Transactions on Cloud Computing (TCC), the IEEE Transactions on Big Data (TBD), the IEEE Transactions on Computational Social Systems (TCSS), the IEEE Internet of Things Journal (IoT), the IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI), the IEEE International Conference on Web Services (ICWS), and ICSOC. His research interests include edge computing, the Internet of Things (IoT), cloud computing, and big data. He received the Best Paper Award from the IEEE CBD 2016, the TOP Citation Award from the Computational Intelligence journal in 2019, and the Distinguished Paper Award and the Best Student Paper Award of EAI Cloudcomp 2019.

Tian Wang received the B.Sc. and M.Sc. degrees in computer science from Central South University in 2004 and 2007, respectively, and the Ph.D. degree from the City University of Hong Kong in 2011. He is currently a Professor with the College of Computer Science and Technology, Huaqiao University, China. His research interests include the Internet of Things, edge computing, and mobile computing.

Zonghua Gu (Senior Member, IEEE) received the Ph.D. degree in computer science and engineering from the University of Michigan at Ann Arbor under the supervision of Prof. Kang G. Shin in 2004. He worked as a Post-Doctoral Researcher with the University of Virginia from 2004 to 2005, and then as an Assistant Professor with The Hong Kong University of Science and Technology from 2005 to 2009 before joining Zhejiang University as an Associate Professor in 2009. His research interests include real-time and embedded systems. He serves on the editorial board of the Journal of Systems Architecture.
