An Intelligent Video Analysis Method For Abnormal Event Detection in Intelligent Transportation Systems
Abstract— Intelligent transportation systems pervasively deploy thousands of video cameras. Analyzing live video streams from these cameras is of significant importance to public safety. As the volume of streaming video grows, it becomes infeasible to have human operators sitting in front of hundreds of screens to catch suspicious activities or detect objects of interest in real time. Indeed, with millions of traffic surveillance cameras installed, video retrieval is more vital than ever. To that end, this article proposes a long video event retrieval algorithm based on superframe segmentation. By detecting the motion amplitude of the long video, a large number of redundant frames can be effectively removed, thereby reducing the number of frames that need to be processed subsequently. Then, a superframe segmentation algorithm based on feature fusion divides the remaining long video into several segments of interest (SOIs) that contain the video events. Finally, the trained semantic model is used to match the answer generated from the text question, and the result with the highest matching value is taken as the video segment corresponding to the question. Experimental results demonstrate that the proposed long video event retrieval and description method significantly improves the efficiency and accuracy of semantic description and significantly reduces the retrieval time.

Index Terms— Intelligent transportation systems, long video event retrieval, segment of interest, superframe segmentation, question-answering.

Manuscript received March 28, 2020; revised July 29, 2020; accepted August 13, 2020. This work was supported in part by the National Natural Science Foundation of China (No. 61672454 and No. 61762055), in part by the Fundamental Research Funds for the Central Universities of China under Grant 2722019PY052, and in part by the open project of the State Key Laboratory for Novel Software Technology, Nanjing University, under Grant No. KFKT2019B17. The Associate Editor for this article was A. Jolfaei. (Corresponding author: Shaohua Wan.)

Shaohua Wan is with the Department of Computer Science and Engineering, Shaoxing University, Shaoxing 312000, China, also with the School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China, and also with the State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China (e-mail: [email protected]).

Xiaolong Xu is with the School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing 210044, China.

Tian Wang is with the College of Computer Science, Huaqiao University, Xiamen 361021, China.

Zonghua Gu is with the Department of Applied Physics and Electronics, Umeå Universitet, 90187 Umeå, Sweden.

Digital Object Identifier 10.1109/TITS.2020.3017505

I. INTRODUCTION

Intelligent transportation systems (ITS) can improve traffic efficiency and effectively guarantee the safety of vehicles and pedestrians on supervised road sections, and they have therefore attracted wide attention from researchers. Road traffic safety faces increasingly severe challenges, and traffic accidents still happen frequently. Detecting traffic accidents quickly and accurately, and avoiding the safety problems they cause, remains a major challenge. As one of the important sources of video data, capture cameras can be found at virtually every road intersection, and their number has been expanding at an annual growth rate of 20%, accompanied by video analysis applications derived from video big data. With the rapid growth in the number of such applications, video analysis for intelligent transportation and public safety has attracted the attention of both academia and industry. In this context, obtaining useful data from videos has become a key goal in the development of ITS, in order to reduce traffic accidents and determine liability when they occur. An intelligent video analysis method for abnormal event detection is an effective means to achieve this goal, and it will determine the degree of intelligence of the entire ITS.

It is easy for humans to watch a long video and describe in text what happened at each moment. However, it is very challenging for a machine to capture and extract specific events from long videos and then generate descriptive text. The technology that accomplishes such a task has received extensive attention in the field of computer vision because of its promising prospects in video surveillance and in assisting the blind. Traffic departments analyze video streams from cameras at intersections for traffic flow control, vehicle recognition, vehicle property extraction, traffic rule violation detection, and accident detection. Different from the simpler task of semantically describing static images, describing video content is more challenging, because it requires understanding a series of consecutive scenes to generate multiple description segments. At present, most existing research focuses on the description of short videos or video segments. However, the videos that record actual scenarios are very long, possibly hundreds of minutes in length, so video retrieval and information selection take considerable time and cost.

Event retrieval and description of long videos are generally driven by advances in segment of interest (SOI) recognition, key frame selection, and image semantic description and generation. Sah et al. [1] extracted the SOI based on the quality
summary did not include natural language input [11]–[14], but some algorithms used video-like text [15] or category tags for event query and content selection [16]. Reference [17] collected the text descriptions of video blocks as a summary of the entire video. The datasets used in the above methods do not contain relational expressions and have a limited scope of application, so they are not suitable for event retrieval in actual monitoring scenarios.

C. Video Captioning With Question Answering

A question answering system takes an image and a free-form, open natural language question about the image as input, and generates a natural language answer as output. Since question answering involves both machine vision and natural language processing, combining a machine vision algorithm with a natural language processing algorithm into a joint model has become the most common way to approach the problem. This combined structure first uses a deep learning architecture to extract visual features, and then uses a recurrent neural network capable of processing sequential information to generate the text description of an image. Ma et al. [18] used three convolutional neural networks (CNNs) to complete the image question-answering task. Gao et al. [19] used a more complex model structure. Malinowski and Fritz [20] combined the latest technologies in natural language processing and computer vision to propose a method for automatically answering questions about images. Ren et al. [21], [22] suggested combining neural networks with visual semantics, instead of preprocessing steps such as object detection and image segmentation, to perform answer prediction, and obtained good results on public benchmark datasets. Tu et al. [23] jointly parsed videos and the corresponding text content and tested their method on two datasets containing 15 video samples. A successful VQA system therefore usually requires a more detailed understanding of the image and more complex reasoning than a system that generates generic image captions. Agrawal et al. [24] proposed a free-form, open-ended VQA model that can provide accurate natural language answers when given an image and a relevant natural language question.

III. PROPOSED METHOD

A. Detection of Redundant Frames in a Long Video

Traffic surveillance cameras generally collect video data in the surveillance area at a sampling rate of 25 frames per second, which ensures that the video remains smooth. Because these cameras record traffic scenes 24 hours a day without interruption, the total number of generated frames can reach hundreds of thousands or even millions. Processing such a large number of frames consumes a lot of computation time, making it difficult to meet the requirements of real-time traffic monitoring. Observation of the behavior events in surveillance videos shows that long videos often contain a large number of useless static frames (redundant frames), and processing these redundant frames wastes much time.

In order to improve the speed of processing large videos, it is necessary to detect and remove the large number of redundant and meaningless frames contained in long videos. In this research, a motion amplitude detection method based on local spatiotemporal interest points is used to detect redundant frames effectively. Firstly, an improved spatiotemporal interest point detection algorithm is used to compute the spatiotemporal interest points of each frame in the video. Then, surround inhibition is combined with local and temporal constraints to detect static interest points in the frame. According to experimental observations of the characteristics of spatiotemporal interest points, when the number and positions of the interest points in a video do not change, the content of the video can be considered unchanged. This characteristic can therefore be employed to remove the large number of unchanged, redundant frames in a long video. When the number of valid spatiotemporal interest points detected falls below a threshold, the current video has a low amplitude of motion or no motion at all, so it can be determined that the content has not changed and the redundant frames can be removed. In addition, because of the repetitive nature of these frames, deleting them does not affect the expression of the video content.
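A minimal Python sketch of this redundant-frame filtering step is given below. It is an illustration under stated assumptions, not the article's implementation: the improved spatiotemporal interest point detector with surround inhibition is stood in for by a generic detect_interest_points callable, and the threshold min_points is a hypothetical parameter.

    # Sketch: keep only frames whose number of valid spatiotemporal interest
    # points reaches a threshold; frames below it are treated as redundant
    # (low motion amplitude). `detect_interest_points` and `min_points` are
    # assumptions standing in for the paper's detector and threshold.
    from typing import Callable, List
    import numpy as np

    def filter_redundant_frames(frames: List[np.ndarray],
                                detect_interest_points: Callable[[np.ndarray], np.ndarray],
                                min_points: int = 20) -> List[int]:
        """Return the indices of frames that are kept (non-redundant)."""
        kept = []
        for idx, frame in enumerate(frames):
            points = detect_interest_points(frame)  # e.g. an (N, 2) array of point coordinates
            if len(points) >= min_points:           # enough valid interest points -> motion present
                kept.append(idx)
        return kept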
B. Extraction of SOI Based on Superframe Segmentation

As described in the previous subsection, a large number of redundant frames in a long video can be removed by monitoring changes in the detected motion amplitude. Since feature extraction and feature matching of the remaining frames must be performed later, reducing the number of extra frames greatly improves the processing speed. This subsection performs video segmentation on the long video with the redundant frames removed, and then extracts the SOIs for video event retrieval. Video superframe segmentation divides a video sequence into specific, distinct parts or subsets according to certain rules and extracts the SOIs. Reference [25] proposed a method for image quality assessment and applied it to the fast classification of high-quality professional images and low-quality snapshots. Inspired by this, this subsection combines low-level features such as contrast, sharpness, and color with high-level semantic features such as attention and face information. A linear combination of these features is used to calculate the interestingness measure of a video segment, and the long video is then segmented based on this measure.

This article follows the method in [25] to calculate the contrast score C. Each frame in the video is converted to a grayscale image, and the converted image is processed with low-pass filtering. The filtered image is then resampled: the height is adjusted to 64 pixels, and the width is adjusted according to the aspect ratio. Sharpness is another important indicator of frame quality and corresponds well to human subjective perception. The sharpness score E is obtained by converting a frame to a grayscale image and then computing the squared differences of the grayscale values of adjacent pixels.
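The sketch below illustrates one plausible reading of these two scores; it is not the exact formulation of [25]. In particular, the contrast score here is approximated by the spread of gray levels after low-pass filtering and resizing to a height of 64, and the filter kernel size is an assumption.

    # Sketch of the contrast and sharpness scores described above (approximate).
    import cv2
    import numpy as np

    def contrast_score(frame_bgr: np.ndarray) -> float:
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
        blurred = cv2.GaussianBlur(gray, (5, 5), 0)           # low-pass filtering
        h, w = blurred.shape
        new_w = max(1, int(round(w * 64.0 / h)))              # height -> 64, width follows aspect ratio
        small = cv2.resize(blurred, (new_w, 64))
        return float(small.std())                             # gray-level spread as a contrast proxy

    def sharpness_score(frame_bgr: np.ndarray) -> float:
        gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        dx = np.diff(gray, axis=1)                            # horizontally adjacent pixel differences
        dy = np.diff(gray, axis=0)                            # vertically adjacent pixel differences
        return float((dx ** 2).mean() + (dy ** 2).mean())     # mean squared difference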
In addition to contrast and sharpness, color is also an important feature
article only retains 15 words when segmenting a sentence. After that, the words are transformed with word2vec into 300-dimensional word vectors. Finally, the word vectors are sent to the LSTM to extract language features, so the embedding sequence of a question sentence has a size of 15 × 300.
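A brief sketch of this question-encoding step follows. It assumes a pretrained word2vec model exposed as w2v (e.g., gensim KeyedVectors) and a single-layer LSTM whose hidden size is a hypothetical choice; the article's exact vocabulary handling and network configuration are not reproduced.

    # Sketch: truncate/pad the question to 15 words, embed each word with a
    # 300-dimensional word2vec vector, and run an LSTM over the 15 x 300 sequence.
    import numpy as np
    import torch
    import torch.nn as nn

    MAX_WORDS, EMB_DIM, HIDDEN = 15, 300, 512   # HIDDEN is a hypothetical size

    def embed_question(question: str, w2v) -> torch.Tensor:
        words = question.lower().split()[:MAX_WORDS]
        vecs = [w2v[w] if w in w2v else np.zeros(EMB_DIM, dtype=np.float32) for w in words]
        while len(vecs) < MAX_WORDS:                           # pad short questions with zero vectors
            vecs.append(np.zeros(EMB_DIM, dtype=np.float32))
        return torch.as_tensor(np.stack(vecs)).unsqueeze(0)    # shape (1, 15, 300)

    lstm = nn.LSTM(input_size=EMB_DIM, hidden_size=HIDDEN, batch_first=True)

    def question_feature(question: str, w2v) -> torch.Tensor:
        _, (h_n, _) = lstm(embed_question(question, w2v))
        return h_n[-1]                                         # final hidden state as the language feature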
E. Combination of Visual Features and Text Vectors

After the video feature vector P_θ^V is obtained, it is mapped, together with the question text feature P_θ^L, into the same word vector space through a non-linear transformation function. The two vectors are then combined and the matching score is represented by

J_θ(q, v, τ) = |P_θ^V(v, τ) − P_θ^L(q)|.   (3)

After the model is constructed, it is trained with a loss function whose purpose is to obtain the event moment information that is closest to the description given by the question text. To enhance the robustness of the model, negative samples from different SOIs of the same video and from different videos are added during training, so that the model can distinguish subtle behavioral differences. Here we follow the method proposed by Hendricks et al. [29], where the loss function is defined as

Loss_in^i(θ) = Σ_{n∈τ} Loss_R(J_θ(q^i, v^i, τ^i), J_θ(q^i, v^i, n^i)),   (4)

where Loss_R(x, y) = max(0, x − y + b) is the ranking loss with margin b. In this way, the current video segment is ranked closer to the query given by the question text than all other possible video segments from the same video.
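The following sketch spells out Eqs. (3) and (4) in code. Tensor shapes, the use of an L1 distance for the absolute difference, and the margin value are assumptions; p_v and p_q denote the outputs of the video and question branches after the non-linear mapping into the shared space.

    # Sketch of the matching score (3) and the intra-video ranking loss (4).
    import torch
    import torch.nn.functional as F

    def match_score(p_v: torch.Tensor, p_q: torch.Tensor) -> torch.Tensor:
        """J_theta: distance between segment and question embeddings (lower = better match)."""
        return torch.norm(p_v - p_q, p=1, dim=-1)

    def intra_video_ranking_loss(p_q: torch.Tensor,      # question embedding, shape (D,)
                                 p_pos: torch.Tensor,    # embedding of the matching SOI, shape (D,)
                                 p_negs: torch.Tensor,   # embeddings of other SOIs, shape (N, D)
                                 margin: float = 0.1) -> torch.Tensor:
        """Eq. (4): the matching SOI must score closer to the question than every
        negative SOI from the same video, by at least `margin` (the b in Loss_R)."""
        pos = match_score(p_pos, p_q)                    # scalar
        neg = match_score(p_negs, p_q.unsqueeze(0))      # one score per negative segment
        return F.relu(pos - neg + margin).sum()          # sum over n in tau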
This article proposes a VQA model pipeline based on multi-objective visual relationship detection (shown in Fig. 3), inspired by research on target relationships in images. Firstly, the target relationship detection model is pre-trained, and the appearance relationship features are then used to replace the image features extracted from the original targets. At the same time, the appearance model is extended with the word-vector similarity principle of the relation predicate, and the appearance features and relation predicates are mapped into the word vector space and represented by fixed-size vectors. Finally, the integrated vector, obtained by cascading the elements of the image feature vector and the question vector, is sent to the classifier to generate the answer output. The basic structure of the VQA model is to extract image information directly with a CNN and then send the image features into an LSTM to produce the prediction results. In this article, the target combination feature vector and the target relationship predicate generated by the image appearance relation model are used to provide the image information. The image appearance model consists of two parts: a target detection model and a target relationship judgment model.
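An illustrative sketch of this final fusion-and-classification step is given below. The feature dimensions, the two-layer classifier head, and the answer vocabulary size are assumptions, not the article's exact architecture.

    # Sketch: concatenate ("cascade") the image-side relation feature and the
    # question feature, then classify over a fixed set of candidate answers.
    import torch
    import torch.nn as nn

    class AnswerClassifier(nn.Module):
        def __init__(self, img_dim: int = 1024, q_dim: int = 512, num_answers: int = 1000):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(img_dim + q_dim, 1024),
                nn.ReLU(),
                nn.Linear(1024, num_answers),
            )

        def forward(self, relation_feat: torch.Tensor, question_feat: torch.Tensor) -> torch.Tensor:
            fused = torch.cat([relation_feat, question_feat], dim=-1)  # cascade the two vectors
            return self.head(fused)                                    # logits over candidate answers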
IV. EXPERIMENTAL RESULTS AND ANALYSIS

This section verifies the performance of the proposed algorithm through quantitative and qualitative analysis. The experiments are divided into the following subsections according to the algorithm steps. The first subsection verifies the long video segmentation and SOI extraction algorithms based on the detection of motion amplitude. The second subsection verifies the long video event retrieval algorithm based on text questions. Finally, the accuracy and reliability of the proposed algorithm are analyzed and verified using actual traffic scenarios.

A. Evaluation of the Long Video Superframe Segmentation Algorithm

This subsection uses the SumMe dataset [30] and the Hollywood2 dataset [31] to evaluate the effectiveness of the proposed superframe segmentation algorithm. The SumMe dataset contains 25 videos. The Hollywood2 dataset contains 3,669 samples, covering 12 categories of actions and 10 categories of scenes, all taken from 69 Hollywood movies. The video shown in Fig. 4 is selected from the Hollywood2 dataset and describes the male protagonist driving home through the street in a movie clip. By detecting the number of interest points, we can decide whether the video content changes at different times, and the redundant video frames can then be removed according to the changes in the number of interest points over a certain time horizon. The video shown in Fig. 5 is also selected from the Hollywood2 dataset and describes an outdoor street scene. Unlike the previous two videos, outdoor scenes are often more complex and changeable. The characters and events contained in the video are no longer unique, and different events may overlap fully or partially on the time axis. Therefore, it is a very challenging task to screen useful
Fig. 8. Process of retrieving and extracting the corresponding video segments from several video segments.
semantic description, and the second is that the question's answer matches the semantic description. In addition, a quantitative verification is performed to examine the effectiveness of the VQA model. A comparison is made between the proposed algorithm and the most widely used question-answering algorithms on multiple datasets. The performance of an image question-answering model is mainly evaluated in terms of Acc and WUPS [33]. Table II compares the experimental results of the proposed algorithm on the standard DAQUAR-ALL dataset. The Acc metric is a comparison method borrowed from image classification problems. As most of the answers are composed of single or multiple words, the effectiveness of the proposed algorithm can be easily evaluated by examining the accuracy of the words.
TABLE II
REPORTED RESULTS ON THE DAQUAR-ALL DATASET
V. CONCLUSION

Semantic retrieval of long videos is of paramount importance in traffic video surveillance. This article proposes a long video event retrieval algorithm based on superframe segmentation. By detecting the motion amplitude of the long video, a large number of redundant frames can be effectively removed, thereby reducing the number of frames that need to be processed subsequently. Then, a superframe segmentation algorithm based on feature fusion divides the remaining long video into several SOIs that contain the video events. Finally, the trained semantic model is used to match the answer generated from the text question, and the result with the highest matching value is taken as the video segment corresponding to the question.

In preventing and handling traffic safety accidents, people place increasingly high demands on real-time performance and accuracy. Processing videos and images at the edge can markedly reduce network bandwidth usage and lower delay. Therefore, a video preprocessing architecture based on edge computing is presented to remove redundant information from video images, so that part or all of the video analysis is migrated to the edge or an edge server. This reduces the dependency on cloud centers and decreases the computation, storage, and network bandwidth requirements while improving the efficiency of video image analysis. Real-time data analysis and processing play an extremely important role in preventing many traffic accidents, and the high accuracy and low latency required by video analysis tasks demand strong computing performance. To address this, an architecture of collaborative edge and cloud is proposed, which offloads heavy computing tasks to the edge server or even the cloud, while lightweight computation tasks are kept locally at the edge. However, some video analysis tasks are long-term and continuous; for example, traffic volume statistics used as a reference for setting traffic light durations have loose latency requirements. Therefore, for edge computing-driven intelligent transportation video analysis, how to design an efficient integrated cloud-edge-end architecture, perform computing migration at different levels, and reasonably configure edge computing resources is a critical research topic that needs to be solved in the future.

ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their helpful insights and suggestions, which have substantially improved the content and presentation of this article.

REFERENCES

[1] S. Sah, S. Kulhare, A. Gray, S. Venugopalan, E. Prud'Hommeaux, and R. Ptucha, "Semantic text summarization of long videos," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2017, pp. 989–997.
[2] Z. Lu and K. Grauman, "Story-driven summarization for egocentric video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2714–2721.
[3] W. Wolf, "Key frame selection by motion analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 1996, pp. 1228–1231.
[4] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya, "Learning joint representations of videos and sentences with web image search," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 651–667.
[5] S. Ding, S. Qu, Y. Xi, and S. Wan, "Stimulus-driven and concept-driven analysis for image caption generation," Neurocomputing, vol. 398, pp. 520–530, Jul. 2020.
[6] S. Ding, S. Qu, Y. Xi, and S. Wan, "A long video caption generation algorithm for big video data retrieval," Future Gener. Comput. Syst., vol. 93, pp. 583–595, Apr. 2019.
[7] R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, "Grounded compositional semantics for finding and describing images with sentences," Trans. Assoc. Comput. Linguistics, vol. 2, pp. 207–218, Dec. 2014.
[8] Z. Gao, Y. Li, and S. Wan, "Exploring deep learning for view-based 3D model retrieval," ACM Trans. Multimedia Comput., Commun., Appl., vol. 16, no. 1, pp. 1–21, Apr. 2020.
[9] S. Tellex and D. Roy, "Towards surveillance video search by natural language query," in Proc. ACM Int. Conf. Image Video Retr. (CIVR), 2009, pp. 1–8.
[10] J.-B. Alayrac, P. Bojanowski, N. Agrawal, J. Sivic, I. Laptev, and S. Lacoste-Julien, "Unsupervised learning from narrated instruction videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4575–4583.
[11] O. Boiman and M. Irani, "Detecting irregularities in images and in video," Int. J. Comput. Vis., vol. 74, no. 1, pp. 17–31, Apr. 2007.
[12] S. Wan, Y. Xia, L. Qi, Y.-H. Yang, and M. Atiquzzaman, "Automated colorization of a grayscale image with seed points propagation," IEEE Trans. Multimedia, vol. 22, no. 7, pp. 1756–1768, Jul. 2020.
[13] M. Gygli, Y. Song, and L. Cao, "Video2GIF: Automatic generation of animated GIFs from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1001–1009.
[14] C. Chen, X. Liu, T. Qiu, and A. K. Sangaiah, "A short-term traffic prediction model in the vehicular cyber-physical systems," Future Gener. Comput. Syst., vol. 105, pp. 894–903, Apr. 2020.
[15] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3707–3715.
[16] A. Sharghi, B. Gong, and M. Shah, "Query-focused extractive video summarization," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 3–19.
[17] S. Yeung, A. Fathi, and L. Fei-Fei, "VideoSET: Video summary evaluation through text," 2014, arXiv:1406.5824. [Online]. Available: http://arxiv.org/abs/1406.5824
[18] L. Ma, Z. Lu, and H. Li, "Learning to answer questions from image using convolutional neural network," in Proc. AAAI Conf. Artif. Intell., 2016, pp. 3567–3573.
[19] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, "Are you talking to a machine? Dataset and methods for multilingual image question," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2296–2304.
[20] M. Malinowski and M. Fritz, "A multi-world approach to question answering about real-world scenes based on uncertain input," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 1682–1690.
[21] M. Ren, R. Kiros, and R. Zemel, "Exploring models and data for image question answering," in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 2953–2961.
[22] Y. Xi, Y. Zhang, S. Ding, and S. Wan, "Visual question answering model based on visual relationship detection," Signal Process., Image Commun., vol. 80, Feb. 2020, Art. no. 115648.
[23] K. Tu, M. Meng, M. W. Lee, T. E. Choe, and S.-C. Zhu, "Joint video and text parsing for understanding events and answering queries," IEEE Multimedia, vol. 21, no. 2, pp. 42–70, Apr./Jun. 2014.
[24] S. Antol et al., "VQA: Visual question answering," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2425–2433.
[25] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2006, pp. 419–426.
[26] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan, "Large-scale video summarization using Web-image priors," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2698–2705.
[27] N. Ejaz, I. Mehmood, and S. W. Baik, "Efficient visual attention based framework for extracting key frames from videos," Signal Process., Image Commun., vol. 28, no. 1, pp. 34–44, Jan. 2013.
[28] D. Teney, P. Anderson, X. He, and A. V. D. Hengel, "Tips and tricks for visual question answering: Learnings from the 2017 challenge," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4223–4232.
[29] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell, "Localizing moments in video with temporal language," 2018, arXiv:1809.01337. [Online]. Available: http://arxiv.org/abs/1809.01337
[30] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, "Creating summaries from user videos," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 505–520.
[31] G. Guan, Z. Wang, S. Lu, J. D. Deng, and D. D. Feng, "Keypoint-based keyframe selection," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 4, pp. 729–734, Apr. 2013.
[32] Q. Wu, C. Shen, P. Wang, A. Dick, and A. van den Hengel, "Image captioning and visual question answering based on attributes and external knowledge," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1367–1381, Jun. 2018.
[33] C. L. Zitnick, D. Parikh, and L. Vanderwende, "Learning the visual interpretation of sentences," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1681–1688.
[34] M. Malinowski, M. Rohrbach, and M. Fritz, "Ask your neurons: A neural-based approach to answering questions about images," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 1–9.
[35] Q. Wu, C. Shen, L. Liu, A. Dick, and A. Van Den Hengel, "What value do explicit high level concepts have in vision to language problems?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 203–212.
[36] K. Chen, J. Wang, L.-C. Chen, H. Gao, W. Xu, and R. Nevatia, "ABC-CNN: An attention based convolutional neural network for visual question answering," 2015, arXiv:1511.05960. [Online]. Available: http://arxiv.org/abs/1511.05960
[37] K. Kafle and C. Kanan, "Answer-type prediction for visual question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4976–4984.
[38] H. Noh, P. H. Seo, and B. Han, "Image question answering using convolutional neural network with dynamic parameter prediction," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 30–38.
[39] Q. Wu, P. Wang, C. Shen, A. Dick, and A. Van Den Hengel, "Ask me anything: Free-form visual question answering based on knowledge from external sources," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4622–4630.
[40] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, "Stacked attention networks for image question answering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 21–29.

Xiaolong Xu (Member, IEEE) received the Ph.D. degree in computer science and technology from Nanjing University, China, in 2016. He was a Research Scholar with Michigan State University, USA, from April 2017 to May 2018. He is currently an Associate Professor with the School of Computer and Software, Nanjing University of Information Science and Technology. He has published more than 60 peer-reviewed articles in international journals and conferences, including the IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS (TITS), the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS (TII), the ACM Transactions on Internet Technology (TOIT), the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), the IEEE TRANSACTIONS ON CLOUD COMPUTING (TCC), the IEEE TRANSACTIONS ON BIG DATA (TBD), the IEEE TRANSACTIONS ON COMPUTATIONAL SOCIAL SYSTEMS (TCSS), the IEEE INTERNET OF THINGS JOURNAL (IoT), the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTATIONAL INTELLIGENCE (TETCI), the IEEE International Conference on Web Services (ICWS), and ICSOC. His research interests include edge computing, the Internet of Things (IoT), cloud computing, and big data. He received the Best Paper Award from the IEEE CBD 2016, the TOP Citation Award from the Computational Intelligence journal in 2019, the Distinguished Paper Award, and the Best Student Paper Award of EAI Cloudcomp 2019.

Tian Wang received the B.Sc. and M.Sc. degrees in computer science from Central South University in 2004 and 2007, respectively, and the Ph.D. degree from the City University of Hong Kong in 2011. He is currently a Professor with the College of Computer Science and Technology, Huaqiao University, China. His research interests include the Internet of Things, edge computing, and mobile computing.