DeepRide Dashcam Video Description Dataset For Autonomous Vehicle Location-Aware Trip Description
ABSTRACT Video description is one of the most challenging tasks in the combined domain of computer vision and natural language processing. Captions for various open- and constrained-domain videos have been generated in the recent past, but descriptions for driving dashcam videos have, to the best of our knowledge, never been explored. With the aim of exploring dashcam video description generation for autonomous driving, this study presents DeepRide: a large-scale dashcam driving video description dataset for location-aware dense video description generation. The human-described dataset comprises visual scenes and actions with diverse weather, people, objects, and geographical paradigms. It bridges the autonomous driving domain with video description by generating textual descriptions of the visual information as seen by a dashcam. We describe 16,000 videos (40 seconds each) in English, employing 2,700 man-hours by two highly qualified teams with domain knowledge. The descriptions consist of eight to ten sentences covering each dashcam video's global features and event features in 60 to 90 words. The dataset consists of more than 130K sentences, totaling approximately one million words. We evaluate the dataset by employing a location-aware vision-language recurrent transformer framework to elaborate on the efficacy and significance of visio-linguistic research for autonomous vehicles. We provide baseline results by employing three existing state-of-the-art recurrent models. The memory-augmented transformer performed best due to its highly summarized memory state for the visual information and the sentence history while generating the trip description. Our proposed dataset opens a new dimension of diverse and exciting applications, such as self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports.
INDEX TERMS Dashcam video description, video description dataset, video captioning, autonomous trip
description.
generation is directly associated with the amount and quality of the training and validation data provided to the model for understanding.
Comprehending the localized events of a video appropriately and then transforming the attained visual understanding accurately into a textual format is called dense video captioning or, simply, video description [8]. Capturing the scenes, objects, and activities in a video, as well as the spatial-temporal relationships and the temporal order, is crucial for precise and grammatically correct multi-line text narration. Such a mechanism is required to generate fine-grained captions that are expressive and subtle: it must capture the temporal dynamics of the visuals in the specific order presented in the video and then join them with syntactically and semantically correct representations in natural language.
Considering the training and test data for video description systems, various datasets have been proposed in the recent past for better visual comprehension and diverse description generation. These datasets belong to a variety of domains, covering aspects associated with our daily lives such as human actions [9], [10], cooking [11], [12], [13], [14], [15], [16], movies [17], [18], social media [19], [20], TV shows [21], e-commerce [22], and generalized categories [8], [23], [24], [25].

B. DEEPRIDE APPLICATIONS
The importance of video description is evident from its practical and real-time applications, e.g., efficient searching and indexing of videos on the internet, human-robot interaction in industrial zones, and facilitation of autonomous vehicle driving; video descriptions can also outline procedures in instructional/tutorial videos for industry, education, and the household (e.g., recipes). The visually impaired can gain useful information from a video that incorporates audio descriptions. Long surveillance videos can be transformed into short texts for quick previews. Sign language videos can be converted to natural language descriptions. Automatic, accurate, and precise video/movie subtitling is another important and practical application of the video description task.
Particular to the DeepRide dataset, the basic purpose of its creation is to automatically generate summaries (trip descriptions) for autonomous vehicles using dashcam videos. The desired generated summary contains the vehicle's location information from the GPS data stored by default in the dashcam metadata, the time of day (day/night), weather, scene, roadside information (trees, buildings, parking), and dynamic events taking place on and around the road (vehicle position and speed on the road, traffic signals, turns, entering/exiting underpasses/overpasses, traffic flow, pedestrians moving or waiting, accidents occurring). Other significant and noteworthy applications can be
FIGURE 1. 12 sample frames from a dashcam video in the training split of the DeepRide dataset. Static and dynamic scenes are described temporally. The 40-second dashcam video is described in 14 English sentences.
can take vision-to-language research in a distinct direction and provide organized means for training and evaluation. The publicly available datasets with deep and diverse descriptions, novel tasks and challenges, and meticulous benchmarks have contributed intensely to the recent rapid developments in the visio-linguistic field. The intersection of computer vision for autonomous driving with natural language processing by [26], [27], [28], and [29] is pushing the frontiers of the research domain in a new direction altogether.

A. VIDEO DESCRIPTION DATASETS
Various datasets have been launched from time to time to demonstrate enhanced accomplishment of the video description task, exploring a wide range of constrained and open domains such as cooking by [11], [12], [13], [14], [15], and [16], human activities by [8], [9], [23], [24], and [25], social media by [19] and [20], movies by [17] and [18], TV shows by [21], and e-commerce by [22], presented in detail by [30]. Table 1 lists a brief overview of the key attributes and major statistics of existing multi-caption (dense/paragraph-like) video description datasets. The existing renowned datasets gradually heightened their visual complexity and language diversity to drive the development of more dynamic and powerful algorithms.

B. VIDEO DESCRIPTION APPROACHES
Video description generation approaches can be broadly classified into four groups based on their technological advancement over time.
1) Encoder-Decoder (ED) based Approaches: The ED framework has been the most popular paradigm for video description generation [31], [32], [33], [34], [35] in recent years; it pioneered the video description task by addressing the limitations associated with conventional and statistical approaches. Conventional ED pipelines typically consist of a CNN used as a visual model to extract visual features from video frames and an RNN used as a language model to generate captions word by word (a minimal sketch of this pipeline follows this list). Other compositions of CNNs, RNNs, and their variants LSTMs and GRUs have also been explored in this field following the ED architecture.
2) Attention Mechanism based Approaches: The standard encoder-decoder architecture, further fused with attention mechanisms that focus on specific distinctive features, showed high-quality performance. The captioning systems developed by [36], [37], [38], [39], [40], [41], and [42] demonstrated the employment of visual, local, global, adaptive, spatial, temporal, and channel attention for coherent and diverse caption generation.
3) Transformer based Approaches: Recently, with the advent of the efficient and modern transductive transformer architecture, free from recurrence and based solely on self-attention, video description systems enhanced their performance, allowing parallelization along with training on massive amounts of data. With the emergence of several versions of transformers and models employing transformers [2], [3], [4], [5], [43], [44], [45], [46], [47], long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes.
4) Deep Reinforcement Learning based Approaches: Reinforcement learning employed within the encoder-decoder structure [48], [49], [50], [51], [52] can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. Recently, the notion of deep reinforcement learning in the video description domain with the capacity of repeated polishing [53] simulates human cognitive behaviors.
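The following is a minimal PyTorch-style sketch of the conventional CNN + RNN encoder-decoder pipeline summarized in item 1. It is not taken from any cited system; the module structure, layer sizes, and names are illustrative assumptions only.

import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Tiny stand-in for a pretrained CNN that maps sampled frames to one feature vector."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # per-frame features: (B*T, feat_dim)
        return feats.view(b, t, -1).mean(dim=1)      # mean-pool over time: (B, feat_dim)

class RNNDecoder(nn.Module):
    """LSTM language model conditioned on the pooled visual feature."""
    def __init__(self, vocab_size, feat_dim=512, emb_dim=300, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.init_h = nn.Linear(feat_dim, hid_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, video_feat, captions):         # captions: (B, L) token ids
        h0 = torch.tanh(self.init_h(video_feat)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(hidden)                      # word-by-word logits: (B, L, vocab_size)

# Example usage with dummy data
enc, dec = CNNEncoder(), RNNDecoder(vocab_size=5000)
frames = torch.randn(2, 8, 3, 112, 112)
tokens = torch.randint(0, 5000, (2, 12))
logits = dec(enc(frames), tokens)                    # (2, 12, 5000)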
FIGURE 3. Description entry screen for qualified operators. Instructions and help tips are provided for dashcam video description.
FIGURE 4. Revision screen with an Accept option if the reviewer agrees with the operator, and a Reject option to route the description back to the operator for re-description.
FIGURE 6. Four sample dashboard screens of the description collection framework, displaying user and batch statistics.
FIGURE 8. Overview of the location-aware dense video description framework: video features (RGB and flow) and 300-dimensional word embeddings are employed. GPS/IMU information from the dashcam video is utilized to fetch the location of the trip. Mean latitude and longitude values are calculated and then searched in the geographical database for the corresponding high-level location containing the city and road name. The trip description generated by the memory-augmented recurrent transformer is concatenated with the template-based location sentence to form the location-aware trip description. Evaluation is performed by comparing the ground-truth (reference) description with the generated location-aware trip description.
An overview of the proposed framework is shown in Figure 8. Transformers have proven to be more efficient and powerful for sequential modeling. We investigate recurrent transformers, namely the Masked Transformer by [46], Transformer-XL by [47], and the Memory Augmented Recurrent Transformer (MART) by [5], as candidate models. We choose MART, a transformer-based [44] model with an additional memory module, as the fundamental building block of our proposed framework. As part of an encoder-decoder shared environment, the augmented memory block leverages the video segments and their previous caption history to assist with next-sentence generation. We generate our dataset corpus compliant with the ActivityNet-Captions dataset in JavaScript Object Notation (JSON) file format. We evaluate our dataset with the metrics BLEU (1 to 4), CIDEr, ROUGE-L, METEOR, and Repetition (1 to 4). We investigate the results of the following models while evaluating our proposed dataset, DeepRide.
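Since the corpus is stated to be compliant with the ActivityNet-Captions JSON layout, the snippet below sketches what a single training entry could look like in that style: per-video duration, segment timestamps, and one sentence per segment. The video id, file name, timings, and sentences are invented for illustration, and the exact key names should be checked against the released files.

import json

annotation = {
    "dashcam_v_000123": {                      # hypothetical video identifier
        "duration": 40.0,                      # clip length in seconds
        "timestamps": [[0.0, 4.0], [4.0, 8.0], [8.0, 12.0]],
        "sentences": [
            "The car is driving on a city street on a sunny day.",
            "The vehicle slows down as pedestrians wait at a zebra crossing.",
            "The car turns right at the traffic signal.",
        ],
    }
}

with open("deepride_train_sample.json", "w", encoding="utf-8") as f:
    json.dump(annotation, f, indent=2)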
A. MASKED TRANSFORMER
Considering neural machine translation (NMT), [44] introduced the basic transformer architecture, implementing a self-attention mechanism with the objectives of parallelization, reduced computational complexity, and long-range dependency handling; this architecture was employed for video-to-text paragraph description generation by [46]. They proposed a masking network comprising a video encoder, a proposal decoder, and a captioning decoder, aiming to decode the proposal-specific representations into differentiable masks and thereby enabling consistent training of the proposal generation and captioning decoders. Learning representations that capture long-range dependencies is addressed by employing self-attention, facilitating more effective learning.
FIGURE 9. Sample frames from the train set representing the time-of-day attribute: daytime, nighttime, and dawn/dusk.
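As a companion to the self-attention discussion in the subsection above, the following is a minimal sketch of scaled dot-product self-attention in isolation. It omits everything the masked transformer adds on top (multi-head projections, masking, residual connections, feed-forward layers, and the proposal/captioning decoders); the projection matrices here are plain assumptions.

import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a feature sequence x of shape (B, T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (B, T, T) pairwise affinities
    weights = torch.softmax(scores, dim=-1)                     # each position attends to all others
    return weights @ v                                          # (B, T, D) context-mixed features

# Usage on dummy segment features
x = torch.randn(2, 10, 64)
w_q, w_k, w_v = (torch.randn(64, 64) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)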
C. MART
MART, proposed by [5] for the video-to-text paragraph description generation task, is based on the vanilla transformer model [44]. Unlike the vanilla model with separate encoder and decoder networks, MART introduced a shared encoder-decoder environment with an auxiliary memory module to enable recurrence in transformers. The augmented external memory block, similar in spirit to LSTM [56] and GRU [57], facilitates the processing of caption-history information corresponding to the video segments. The shared encoder-decoder environment and the memory module allow MART to utilize previous contextual information, so it produces paragraphs that are more coherent and less repetitive.
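To make the recurrence idea concrete, here is a highly simplified sketch of segment-level recurrence: a summarized memory state is carried across video segments while one sentence is generated per segment. This is only a conceptual stand-in under stated assumptions; MART's actual memory update lives inside each transformer layer and uses multi-head attention, not the GRU cell and linear sentence head used below.

import torch
import torch.nn as nn

class RecurrentParagraphGenerator(nn.Module):
    # Conceptual stand-in, not MART: compress each segment, fold it into a running
    # memory state, and emit sentence logits from the memory at every step.
    def __init__(self, feat_dim=2048, mem_dim=768, vocab_size=10000):
        super().__init__()
        self.summarize = nn.Linear(feat_dim, mem_dim)        # compress segment features
        self.memory_update = nn.GRUCell(mem_dim, mem_dim)    # fold segment into memory
        self.sentence_head = nn.Linear(mem_dim, vocab_size)  # placeholder sentence decoder

    def forward(self, segment_feats):                        # (num_segments, B, feat_dim)
        b = segment_feats.size(1)
        memory = segment_feats.new_zeros(b, self.memory_update.hidden_size)
        sentence_logits = []
        for seg in segment_feats:                            # one sentence per segment
            summary = torch.tanh(self.summarize(seg))
            memory = self.memory_update(summary, memory)     # memory sees all past segments
            sentence_logits.append(self.sentence_head(memory))
        return torch.stack(sentence_logits)                  # (num_segments, B, vocab_size)

# Example: ten 64-frame segments for a batch of two videos
feats = torch.randn(10, 2, 2048)
logits = RecurrentParagraphGenerator()(feats)                # (10, 2, 10000)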
D. LOCATION AWARE DESCRIPTION GENERATION
The proposed DeepRide dataset utilizes the GPS/IMU recording of the preserved trajectory information while processing the corresponding dashcam video to generate a location-aware road trip description. The latitude and longitude associated with the dashcam videos are cached from the Google Geocoding API with their corresponding position/location, containing the road and city name, and stored in a geographic repository. This database is used to obtain the location associated with the latitude and longitude of a dashcam video while generating the trip summary. Further, the location-containing sentence is concatenated with the generated paragraph summary to form the location-aware trip description, as demonstrated in Figure 8.
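The sketch below makes the location step concrete under the assumption that geocoding results have already been cached in a local repository keyed by rounded coordinates. The repository layout, rounding precision, template wording, and the example road/city are illustrative assumptions, not the paper's implementation, and no live geocoding call is made.

from statistics import mean

geo_repository = {
    # (rounded latitude, rounded longitude) -> cached geocoding result
    (35.84, 128.75): {"road": "Dalgubeol-daero", "city": "Daegu"},
}

def location_sentence(gps_track, precision=2):
    """gps_track: list of (lat, lon) samples taken from the dashcam GPS/IMU metadata."""
    lat = round(mean(p[0] for p in gps_track), precision)
    lon = round(mean(p[1] for p in gps_track), precision)
    place = geo_repository.get((lat, lon))
    if place is None:
        return "The trip location is unknown."
    return f"The trip is recorded on {place['road']} in {place['city']}."

def location_aware_description(gps_track, generated_paragraph):
    # Prepend the template-based location sentence to the generated trip paragraph.
    return f"{location_sentence(gps_track)} {generated_paragraph}"

# Example with dummy GPS samples and a dummy generated paragraph
track = [(35.8401, 128.7512), (35.8399, 128.7497)]
print(location_aware_description(track, "The car is driving on a wide city road."))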
V. EXPERIMENTATION
A. FEATURE EXTRACTION
To keep the scenes standardized/uniform and to extract features, we sample 15 frames per second and extract I3D features [58] from the sampled frames. The sampling mechanism is based on time, not on frame rate: whether a dashcam video is encoded at 30 or 60 fps, the trip description system samples and processes 15 frames per second. If the frame rate is lower than the required 15 fps, the system reaches the desired number of frames per second by adding zero padding.
We feed 64 frames with a spatial size of 224 × 224. For better feature representations, we use the I3D model pre-trained on the Kinetics training dataset [59] and calculate the video RGB and optical flow features prior to training. We extract the temporal features using PWC-Net by [60]. The I3D spatial/RGB 1024-D feature vectors and temporal/optical-flow 1024-D feature vectors are concatenated to form the input to the transformer layers, yielding a single 2048-D representation for every stack of 64 frames. The dashcam videos are 40 seconds long; hence each video yields ten segments of 224 × 224 × 64, which is sufficient to generate ten sentences. We employ GloVe-6B 300-dimensional word embeddings and a generated vocabulary index for the language model.
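A minimal NumPy sketch of the sampling and feature-assembly steps described above follows. It assumes the per-stack 1024-D I3D RGB and optical-flow features are produced by separate extractors that are only stubbed out here with random placeholders; constants and function names are illustrative.

import numpy as np

TARGET_FPS = 15   # frames kept per second of video, regardless of source frame rate
STACK = 64        # frames per feature stack fed to the I3D extractor

def sample_one_second(frames_in_second):
    """frames_in_second: (n, 224, 224, 3) frames decoded for one second of video."""
    n = len(frames_in_second)
    if n >= TARGET_FPS:                       # e.g. 30 or 60 fps source: keep 15 evenly spaced frames
        idx = np.linspace(0, n - 1, TARGET_FPS).astype(int)
        return frames_in_second[idx]
    pad = np.zeros((TARGET_FPS - n, 224, 224, 3), dtype=frames_in_second.dtype)
    return np.concatenate([frames_in_second, pad], axis=0)   # zero-pad up to 15 frames

def segment_features(rgb_feat_1024, flow_feat_1024):
    """Concatenate per-stack RGB and flow features into the 2048-D transformer input."""
    return np.concatenate([rgb_feat_1024, flow_feat_1024], axis=-1)

# A 40 s clip sampled at 15 fps gives 600 frames, i.e. roughly ten 64-frame stacks,
# so ten 2048-D vectors, one per generated sentence.
rgb = np.random.randn(10, 1024).astype(np.float32)    # placeholder I3D RGB features
flow = np.random.randn(10, 1024).astype(np.float32)   # placeholder I3D flow features
features = segment_features(rgb, flow)                 # (10, 2048)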
TABLE 5. Description results on the test set of the DeepRide dataset. ↑ indicates that higher is better, whereas ↓ indicates that lower is better.
FIGURE 11. Sample frames from the train set representing the scene
attribute: city street, highway, residential, tunnel, and parking lot.
FIGURE 12. Description analysis - English ground-truth descriptions of sample dashcam videos from the train set of the DeepRide dataset. The brown-colored sentence is the trip location information. Blue-colored text indicates static scenes, whereas green-colored text indicates dynamic events taking place on and around the road.
FIGURE 13. Qualitative analysis - English ground-truth and generated descriptions of sample dashcam videos from the test set of the DeepRide dataset. Blue-colored text indicates static scenes, whereas green-colored text indicates dynamic events. Only qualitative results produced by the location-aware MART are included.
Driving video datasets are challenging because every dashcam video shares many similar features. Therefore, a significant amount of text description could repeat due to feature similarities, as shown in Figures 9, 10, and 11. MART took advantage of its memory and generated far better sequences of sentences while describing the video features. These attributes show various scenes from the dataset, where every scene contains typical visual elements, i.e., roads, lane markings, cars, zebra crossings, signals, buildings, trees, parked vehicles, pedestrians, etc. Therefore, once a data entry operator describes that the vehicle is moving on the road, the model can predict this sentence from every frame because the corresponding features are present throughout the video; this sometimes causes the model to predict the sentence at some other time slot, since it is a global scenario. In contrast, event features local to the timeline are predicted at the time of occurrence, i.e., the vehicle stops, turns right, slows down, crosses an underpass, pedestrians cross, etc. Although the scene predictions are global and can be listed at any specific time, we have obtained encouraging results, setting a baseline for further improvements. We show the description analysis in Figure 12 and qualitative results in Figure 13.
VII. CONCLUSION & FUTURE WORK
In this research work, we present DeepRide, a new diverse location-aware dashcam video description dataset intended to explore emerging autonomous vehicle driving from the perspective of the fast-growing video description domain. It features 16k dashcam videos linked with around 130k sentences of description in English. This dataset may help automate the creation of driving commentary. Moreover, the embedded GPS/IMU information recording capability of dashcam video empowers the description system to associate the concerned locations and positions with natural language descriptions. We provided guidelines for integrating location information extraction with recurrent transformers.
Further, our proposed dataset opens a new dimension of diverse and exciting applications: self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports. Our future efforts will include creating descriptions for all dashcam videos publicly available in BDD100K, focusing on videos recorded by the rear camera, extending the language domain from single to multilingual, and pursuing object detection and relational feature research.
We anticipate that the DeepRide dataset's release will help advance visio-linguistic research.
REFERENCES
[1] S. Bhatt, F. Patwa, and R. Sandhu, "Natural language processing (almost) from scratch," in Proc. IEEE 3rd Int. Conf. Collaboration Internet Comput. (CIC), Jan. 2017, pp. 328–338.
[2] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, "HERO: Hierarchical encoder for video+language omni-representation pre-training," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 2046–2065.
[3] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "COOT: Cooperative hierarchical transformer for video-text representation learning," in Proc. NeurIPS, 2020, pp. 1–27.
[4] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang, "SBAT: Video captioning with sparse boundary-aware transformer," in Proc. 29th Int. Joint Conf. Artif. Intell., Jul. 2020, pp. 630–636.
[5] J. Lei, L. Wang, Y. Shen, D. Yu, T. Berg, and M. Bansal, "MART: Memory-augmented recurrent transformer for coherent video paragraph captioning," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 2603–2614.
[6] Z. Yu and N. Han, "Accelerated masked transformer for dense video captioning," Neurocomputing, vol. 445, pp. 72–80, Jul. 2021, doi: 10.1016/j.neucom.2021.03.026.
[7] M. Hosseinzadeh and Y. Wang, "Video captioning of future frames," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 980–989.
[8] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 706–715.
[9] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics: Hum. Lang. Technol. (ACL-HLT), vol. 1, 2011, pp. 190–200.
[10] G. A. Sigurdsson, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," 2016, arXiv:1604.01753.
[11] M. Rohrbach and M. Planck, "A database for fine grained activity detection of cooking activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1194–1201.
[12] P. Das, C. Xu, R. F. Doell, and J. J. Corso, "A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2634–2641.
[13] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal, "Grounding action descriptions in videos," Trans. Assoc. Comput. Linguistics, vol. 1, pp. 25–36, 2013.
[14] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, "Coherent multi-sentence video description with variable level of detail," in Lecture Notes in Computer Science, vol. 8753, 2014, pp. 184–195. [Online]. Available: https://arxiv.org/abs/1403.6173
[15] L. Zhou, C. Xu, and J. J. Corso, "Towards automatic learning of procedures from web instructional videos," in Proc. 32nd AAAI Conf. Artif. Intell., 2018.
[16] G. Huang, B. Pang, Z. Zhu, C. Rivera, and R. Soricut, "Multimodal pretraining for dense video captioning," 2020, arXiv:2011.11760.
[17] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3202–3212.
[18] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," 2015, arXiv:1503.01070.
[19] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, "Grounded video description," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6571–6580.
[20] S. Gella, M. Lewis, and M. Rohrbach, "A dataset for telling the stories of social media videos," in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 968–974.
[21] J. Lei, L. Yu, T. L. Berg, and M. Bansal, "TVR: A large-scale dataset for video-subtitle moment retrieval," in Lecture Notes in Computer Science, vol. 12366, 2020, pp. 447–463. [Online]. Available: https://arxiv.org/abs/2001.09099v2
[22] S. Zhang, Z. Tan, J. Yu, Z. Zhao, K. Kuang, J. Liu, J. Zhou, H. Yang, and F. Wu, "Poet: Product-oriented video captioner for E-commerce," 2020, arXiv:2008.06880.
[23] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5288–5296.
[24] K. H. Zeng, T. H. Chen, J. C. Niebles, and M. Sun, "Title generation for user generated videos," in Lecture Notes in Computer Science, vol. 9906, 2016, pp. 609–625. [Online]. Available: https://arxiv.org/abs/1608.07068
[25] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang, "VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 4580–4590.
[26] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving dataset for heterogeneous multitask learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2633–2642.
[27] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "NuScenes: A multimodal dataset for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11618–11628.
[28] Y. Choi, N. Kim, S. Hwang, K. Park, J. S. Yoon, K. An, and I. S. Kweon, "KAIST multi-spectral day/night data set for autonomous and assisted driving," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 3, pp. 934–948, Mar. 2018.
[29] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[30] M. Rafiq, G. Rafiq, and G. S. Choi, "Video description: Datasets & evaluation metrics," IEEE Access, vol. 9, pp. 121665–121685, 2021.
[31] Q. Zheng, C. Wang, and D. Tao, "Syntax-aware action targeting for video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13093–13102.
[32] H. Chen, K. Lin, A. Maye, J. Li, and X. Hu, "A semantics-assisted video captioning model trained with scheduled sampling," 2019, arXiv:1909.00121.
[33] J. Hou, X. Wu, W. Zhao, J. Luo, and Y. Jia, "Joint syntax representation learning and visual cue translation for video captioning," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8917–8926.
[34] N. Aafaq, N. Akhtar, W. Liu, and A. Mian, "Empirical autopsy of deep video captioning frameworks," 2019, arXiv:1911.09345.
[35] J. Chen, Y. Pan, Y. Li, T. Yao, H. Chao, and T. Mei, "Temporal deformable convolutional encoder–decoder networks for video captioning," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8167–8174.
[36] J. Perez-Martin, B. Bustos, and J. Perez, "Attentive visual semantic specialized network for video captioning," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 5767–5774.
[37] J. Xu, H. Wei, L. Li, Q. Fu, and J. Guo, "Video description model based on temporal-spatial and channel multi-attention mechanisms," Appl. Sci., vol. 10, no. 12, p. 4312, Jun. 2020.
[38] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, "Object relational graph with teacher-recommended learning for video captioning," 2020, arXiv:2002.11566.
[39] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, "STAT: Spatial-temporal attention mechanism for video captioning," IEEE Trans. Multimedia, vol. 22, no. 1, pp. 229–241, Feb. 2020.
[40] L. Gao, X. Wang, J. Song, and Y. Liu, "Fused GRU with semantic-temporal attention for video captioning," Neurocomputing, vol. 395, pp. 222–228, Jun. 2020, doi: 10.1016/j.neucom.2018.06.096.
[41] S. Liu, Z. Ren, and J. Yuan, "SibNet: Sibling convolutional encoder for video captioning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 9, pp. 3259–3272, Sep. 2021.
[42] S. Pramanik, P. Agrawal, and A. Hussain, "OmniNet: A unified architecture for multi-modal multi-task learning," 2019, arXiv:1907.07804.
[43] V. Lashin and E. Rahtu, "Multi-modal dense video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 4117–4126.
[44] A. Vaswani, G. Brain, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[45] A. Hussain, T. Hussain, W. Ullah, and S. W. Baik, "Vision transformer and deep sequence learning for human activity recognition in surveillance videos," Comput. Intell. Neurosci., vol. 2022, pp. 1–10, Apr. 2022.
[46] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8739–8748.
[47] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2020, pp. 2978–2988.
[48] N. Li, Z. Chen, and S. Liu, "Meta learning for image captioning," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8626–8633.
[49] D. He, X. Zhao, J. Huang, F. Li, X. Liu, and S. Wen, "Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8393–8400.
[50] W. Zhang, B. Wang, L. Ma, and W. Liu, "Reconstruct and represent video contents for captioning via reinforcement learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 12, pp. 3088–3101, Dec. 2020.
[51] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, "Video captioning via hierarchical reinforcement learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4213–4222.
[52] Y. Chen, S. Wang, W. Zhang, and Q. Huang, "Less is more: Picking informative frames for video captioning," in Proc. ECCV (Lecture Notes in Computer Science, vol. 11217), 2018, pp. 367–384. [Online]. Available: https://openaccess.thecvf.com/content_ECCV_2018/html/Yangyu_Chen_Less_is_More_ECCV_2018_paper.html
[53] W. Xu, J. Yu, Z. Miao, L. Wan, Y. Tian, and Q. Ji, "Deep reinforcement polishing network for video captioning," IEEE Trans. Multimedia, vol. 23, pp. 1772–1784, 2021.
[54] G. Singh, S. Akrigg, M. Di Maio, V. Fontana, R. Javanmard Alitappeh, S. Saha, K. Jeddisaravi, F. Yousefi, J. Culley, T. Nicholson, J. Omokeowa, S. Khan, S. Grazioso, A. Bradley, G. Di Gironimo, and F. Cuzzolin, "ROAD: The ROad event awareness dataset for autonomous driving," 2021, arXiv:2102.11585.
[55] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. Hoang Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, M. Oelker, S. Garreis, and P. Schuberth, "A2D2: Audi autonomous driving dataset," 2020, arXiv:2004.06320.
[56] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[57] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555.
[58] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. CVPR, Jul. 2017, pp. 6299–6308.
[59] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, "The kinetics human action video dataset," 2017, arXiv:1705.06950.
[60] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8934–8943.
[61] G. Wentzel, "Funkenlinien im Röntgenspektrum," Annalen der Physik, vol. 371, no. 23, pp. 437–461, 1922.
[62] V. Tech, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4566–4575.
[63] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
[64] A. Lavie and A. Agarwal, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. 2nd Workshop Stat. Mach. Transl., 2007, pp. 223–228. [Online]. Available: http://acl.ldc.upenn.edu/W/W05/W05-09.pdf#page=75

GHAZALA RAFIQ received the B.Sc. degree in mathematics from Punjab University, Lahore, Pakistan, in 2000, and the master's degree in computer science, in 2002. She is currently pursuing the Ph.D. degree with the Data Sciences Laboratory, Department of Information and Communication Engineering, Yeungnam University, South Korea. She has over 15 years of industry experience. Her research interests include deep-learning applications, video description, reinforcement learning, natural language processing, computer vision, and pattern recognition.

MUHAMMAD RAFIQ (Member, IEEE) received the M.S. degree in electronics engineering from International Islamic University, Pakistan, in 2008, and the Ph.D. degree in information and communication engineering from Yeungnam University, South Korea, in 2022. He has extensive industry experience with a background in databases, business applications, and industrial technology solutions. His research interests include modern 3-D game development, computer vision, video description incorporating artificial intelligence, and deep learning.

MANKYU SUNG received the B.S. degree in computer science from Chungnam National University, Daejeon, in 1993, and the M.S. and Ph.D. degrees in computer science from the University of Wisconsin–Madison, WI, USA, in 2005. From January 1995 to July 2012, he worked for the Digital Contents Division, ETRI, Daejeon, South Korea. He has been an Assistant Professor with the Department of Game and Mobile, Keimyung University, Daegu, South Korea, since March 2012. His current research interests include computer graphics, deep-learning applications, computer animation, computer games, and human-computer interaction. He is a member of the ACM.