
Received 13 September 2022, accepted 28 September 2022, date of publication 6 October 2022, date of current version 13 October 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3212745

DeepRide: Dashcam Video Description Dataset for Autonomous Vehicle Location-Aware Trip Description

GHAZALA RAFIQ 1, MUHAMMAD RAFIQ 2 (Member, IEEE), BYUNG-WON ON 3, MANKYU SUNG 2, AND GYU SANG CHOI 1 (Member, IEEE)

1 Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si 38541, South Korea
2 Department of Game and Mobile Engineering, Keimyung University, Daegu 42601, South Korea
3 Department of Software Convergence Engineering, Kunsan National University, Gunsan-si 54150, South Korea
Corresponding authors: Gyu Sang Choi ([email protected]) and Muhammad Rafiq ([email protected])
This work was supported in part by the Basic Science Research Program through the National Research Foundation of Korea (NRF)
funded by the Ministry of Education under Grant NRF-2021R1A6A1A03039493, Grant NRF-2018R1D1A1B07048414, and Grant
2022R1A2C1011404; and in part by the 2022 Yeungnam University Research Grant.

ABSTRACT Video description is one of the most challenging tasks in the combined domain of computer
vision and natural language processing. Captions for various open- and constrained-domain videos have been
generated in the recent past, but descriptions for driving dashcam videos have, to the best of our knowledge,
never been explored. With the aim of exploring dashcam video description generation for autonomous driving,
this study presents DeepRide: a large-scale dashcam driving video description dataset for location-aware dense
video description generation. The human-described dataset comprises visual scenes and actions with diverse
weather, people, objects, and geographical paradigms. It bridges the autonomous driving domain with video
description by generating textual descriptions of the visual information as seen by a dashcam. We describe
16,000 videos (40 seconds each) in English, employing 2,700 man-hours by two highly qualified teams with
domain knowledge. The descriptions consist of eight to ten sentences covering each dashcam video's global
features and event features in 60 to 90 words. The dataset consists of more than 130K sentences, totaling
approximately one million words. We evaluate the dataset by employing a location-aware vision-language
recurrent transformer framework to elaborate on the efficacy and significance of visio-linguistic research
for autonomous vehicles. We provide baseline results by employing three existing state-of-the-art recurrent
models. The memory-augmented transformer performed best, owing to its highly summarized memory state
for the visual information and the sentence history while generating the trip description. Our proposed dataset
opens a new dimension of diverse and exciting applications, such as self-driving vehicle reporting, driver and
vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports.

INDEX TERMS Dashcam video description, video description dataset, video captioning, autonomous trip
description.

I. INTRODUCTION
Automatic description generation is an established and challenging task for short as well as relatively long videos. It has attracted intense attention in the realm of computer vision and natural language processing [1], exploring a variety of constrained and open domains. Developing robust video description systems [2], [3], [4], [5], [6], [7] demands not only the ability to understand sequential visual data but also to generate a syntactically concise and semantically accurate interpretation of that understanding in natural language. The accuracy and diversity of the generated descriptions are directly associated with the amount and quality of the training and validation data provided to the model.

Comprehending the localized events of a video appropriately and then transforming the attained visual understanding accurately into a textual format is called dense video captioning, or simply video description [8]. Capturing the scenes, objects, and activities in a video, as well as the spatial-temporal relationships and the temporal order, is crucial for precise and grammatically correct multi-line text narration. Such a mechanism must produce fine-grained captions that are expressive and subtle: its purpose is to capture the temporal dynamics of the visuals in the order presented in the video and then join them into syntactically and semantically correct natural language.

Considering the training and test data for video description systems, various datasets have been proposed in the recent past for better visual comprehension and diverse description generation. These datasets belong to a variety of domains, covering aspects of our daily lives such as human actions [9], [10], cooking [11], [12], [13], [14], [15], [16], movies [17], [18], social media [19], [20], TV shows [21], e-commerce [22], and generalized categories [8], [23], [24], [25].

A. MOTIVATION
Emerging autonomous vehicle technology has received increasing attention in the recent past. A great deal of research is being conducted in various sectors of autonomous driving, particularly in computer vision tasks; object detection, semantic segmentation, semantic instance segmentation, and depth estimation are some of them. However, video description for driving videos has, to the best of our knowledge, never been explored. Blending the challenging video description domain with promising autonomous driving research can push the frontiers of research in an ambitious direction. Motivated to understand the challenges of video description in the context of autonomous driving, we collect a novel, large-scale, location-aware dashcam video description dataset, DeepRide. The proposed dataset features 16k dashcam videos corresponding to more than 130k sentences in 16k paragraph descriptions. Each description has on average ten sentences describing the day/night time, weather information, and scene attributes, along with static features and dynamic events. Static features include parked cars, trees, signboards, and high-rise buildings on the roadside, whereas by dynamic events we mean the switching of traffic signals at an intersection, turning of the vehicle, passing under/over a bridge, wiping the windshield, and an accident happening. Exploring the challenging video description task in the potentially exciting domain of autonomous driving no doubt represents an expanded challenge in this research area.

B. DEEPRIDE APPLICATIONS
The importance of video description is evident from its practical and real-time applications, such as efficient searching and indexing of videos on the internet, human-robot interaction in industrial zones, and facilitation of autonomous vehicle driving; video descriptions can also outline procedures in instructional/tutorial videos for industry, education, and the household (e.g., recipes). The visually impaired can gain useful information from a video that incorporates audio descriptions. Long surveillance videos can be transformed into short texts for quick previews. Sign language videos can be converted to natural language descriptions. Automatic, accurate, and precise video/movie subtitling is another important and practical application of the video description task.

Particular to the DeepRide dataset, the basic purpose of its creation is to automatically generate summaries (trip descriptions) for autonomous vehicles using dashcam videos. The desired summary contains the vehicle's location information from the GPS data stored by default in the dashcam metadata, day/night, weather, scene, roadside information (trees, buildings, parking), and dynamic events taking place on and around the road (vehicle position and speed on the road, traffic signals, turns, entering/exiting underpasses/overpasses, traffic flow, pedestrians moving or waiting, accidents occurring). Other significant and noteworthy applications are:
• self-driving vehicle reporting
• driver and vehicle safety
• inter-vehicle road intelligence sharing
• travel occurrence reports.

C. LOCATION-AWARE DESCRIPTION
In the quest for human-like, precise, and accurate descriptions of a supplied video, various strategies have been investigated for quality enhancement and optimization. Since a video naturally comprises multiple modalities, i.e., visual, audio, sound, and sometimes subtitles, catering to the available modes within a video can result in accelerated training and boosted performance. When dealing with trip descriptions from dashcam videos, one of the important aspects is the location, i.e., the embedded GPS/IMU information recorded automatically with the visual data. A trip summary is considered incomplete without location information, so utilizing GPS data can help in detecting the vehicle's location. We propose a location-aware recurrent-transformer-based dashcam video description framework for the generation of rich and informative trip descriptions. Owing to the location-aware feature, our proposed dataset opens a new dimension of diverse and exciting applications, as stated above.

Our contributions towards this research work are as follows:
1) We explore a new direction for the task of video description by blending it with the fast-growing and emerging domain of autonomous vehicle driving.
2) We collect a novel, large-scale, location-aware video description dataset using dashcam videos for autonomous vehicle trip description.
3) We employ a state-of-the-art web-based platform for the systematic collection, revision, proofreading, and finalization of the data.
4) We perform an in-depth analysis of the collected data, further investigate the shared recurrent transformer with the location framework for the generation of natural language descriptions, and validate the system's efficiency and effectiveness.

FIGURE 1. Twelve sample frames from a dashcam video in the training split of the DeepRide dataset. Static and dynamic scenes are described temporally. The 40-second dashcam video is described in 14 English sentences.
D. PROBLEM STATEMENT
Assume that we have a dashcam video V containing multiple temporal event sections {E1, E2, E3, ..., ET} and GPS/IMU information. An automatic sentence is generated for every event in the video to describe its content in natural language. Our goal is to produce a location-aware, coherent, multi-sentence description {S0, S1, S2, S3, ..., ST}, where T represents the number of events, S1 through ST are the corresponding generated sentences, and S0 is the first sentence of the generated description, providing location information. Figure 1 demonstrates some sample dashcam video frames from the training set of the DeepRide dataset with a ground-truth description.
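Although the paper states this setup only in prose, one compact way to write the implied objective is the following; the notation f_loc for the template-based location step and p_theta for the captioning model is ours, not the paper's:

\[
S_0 = f_{\mathrm{loc}}\big(\mathrm{GPS/IMU}\big), \qquad
\hat{S}_t = \arg\max_{S_t} \; p_{\theta}\big(S_t \mid E_t,\, S_0, S_1, \ldots, S_{t-1}\big), \quad t = 1, \ldots, T,
\]

so that the final trip description is the concatenation of S_0 with the event sentences Ŝ_1 through Ŝ_T.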

The rest of the paper is organized as follows. Section II provides a brief overview of the related literature on the topic; Section III explores the collection, statistics, and analysis of the DeepRide dataset; Section IV presents the proposed multi-modal location-aware recurrent-transformer-based video description framework; Section V elaborates the experimentation and implementation details; the qualitative and quantitative results are presented in Section VI; and finally, the paper is concluded in Section VII with a few future directions.

FIGURE 2. DeepRide - description collection flow.

II. RELATED WORK
The creation of datasets for computer vision tasks has played a significant part in developing algorithms with robust performance. Creating broadly challenging and ambitious datasets can take vision-to-language research in a distinct direction and provide organized means for training and evaluation. The publicly available datasets with deep and diverse descriptions, novel tasks and challenges, and meticulous benchmarks have contributed intensely to the recent rapid developments in the visio-linguistic field. The intersection of computer vision for autonomous driving with natural language processing by [26], [27], [28], and [29] is pushing the frontiers of the research domain in a new direction altogether.

A. VIDEO DESCRIPTION DATASETS
Various datasets have been launched from time to time to exhibit an enhanced accomplishment of the video description task, exploring a wide range of constrained and open domains such as cooking by [11], [12], [13], [14], [15], and [16], human activities by [8], [9], [23], [24], and [25], social media by [19] and [20], movies by [17] and [18], TV shows by [21], and e-commerce by [22], presented in detail by [30]. Table 1 lists a brief overview of the key attributes and major statistics of existing multi-caption (dense/paragraph-like) video description datasets. The existing renowned datasets have gradually heightened their visual complexity and language diversity to support dynamic and powerful algorithms.

TABLE 1. Benchmark multi-caption video description datasets.

TABLE 2. DeepRide dataset split (train, validation, and test) statistics.

B. VIDEO DESCRIPTION APPROACHES
Video description generation approaches can be broadly classified into four groups based on their technological advancement over time.
1) Encoder-Decoder (ED) based Approaches: The ED framework is the most popular paradigm for video description generation [31], [32], [33], [34], [35] in recent years; it pioneered the video description task by addressing the limitations associated with conventional and statistical approaches. Conventional ED pipelines typically consist of a CNN used as a visual model to extract visual features from video frames and an RNN used as a language model to generate captions word by word. Other compositions of CNNs, RNNs, and their variants LSTMs and GRUs have also been explored in this field following the ED architecture.
2) Attention Mechanism based Approaches: The standard encoder-decoder architecture, further fused with attention mechanisms to focus on specific distinctive content, showed high-quality performance. The captioning systems developed by [36], [37], [38], [39], [40], [41], and [42] demonstrated the employment of visual, local, global, adaptive, spatial, temporal, and channel attention for coherent and diverse caption generation.
3) Transformer based Approaches: Recently, with the advent of the efficient and modern transductive transformer architecture, free from recurrence and based solely on self-attention, video description systems have enhanced performance by allowing parallelization along with training on massive amounts of data. With the emergence of several versions of transformers and models employing transformers [2], [3], [4], [5], [43], [44], [45], [46], [47], long-term dependency handling is no longer an issue for researchers engaged in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes.
4) Deep Reinforcement Learning based Approaches: Reinforcement learning employed within the encoder-decoder structure [48], [49], [50], [51], [52] can progressively deliver state-of-the-art captions by following exploration and exploitation strategies. Recently, the notion of deep reinforcement learning in the video description domain with the capacity for repeated polishing [53] simulates human cognitive behaviors.

The proposed model-agnostic algorithm introduced a polishing mechanism into video description via reinforcement learning and gradually improved the generated captions by revising ambiguous words and grammar errors.

FIGURE 3. Description entry screen for qualified operators. Instructions and help tips are provided for dashcam video description.

C. DRIVING DATASETS
Domain-specific, large-scale, and diverse datasets can fuel further advances in supervised learning. In the fast-growing field of autonomous driving, the datasets BDD-100K [26], NuScenes [27], the KAIST multi-spectral driving dataset [28], KITTI [29], ROAD [54], and A2D2 [55] have proven to be of great value for computer vision tasks like object classification, object detection, and scene segmentation.

The BDD-100K dataset [26] consists of 100K video clips embracing realistic driving scenarios with increasing complexity for heterogeneous multitask learning. The crowd-sourced dataset, collected solely from drivers, has been explored for ten computer vision tasks involving image and tracking tasks. Through this research work, associating the renowned area of autonomous driving with the video description domain can move the research forward with a distinct focus.

III. DATASET: DEEPRIDE
This section presents the video collection, description collection, description collection framework, data batches, and statistics of the DeepRide dataset.

A. VIDEOS COLLECTION
The DeepRide dataset is created with the objective of exploring the challenging video description task in conjunction with the emerging autonomous driving domain. The 16k dashcam videos encompassing diverse driving scenarios are taken from BDD100K [26], the large-scale driving video dataset exposing the challenges of street-scene understanding. The dashcam videos, each of 40 seconds duration, were obtained in a crowd-sourcing manner from more than 50K rides, primarily uploaded by thousands of drivers covering New York, Berkeley, San Francisco, the Bay Area, and other populous regions in the USA and around the world. The vehicle's dashcam records these videos at a 30-fps frame rate along with GPS/IMU information preserving the driving trajectories; the GPS/IMU information is employed to generate location-aware descriptions. These videos are recorded at different times of the day with diverse weather conditions and varied scene locations. The dashcam videos' three global features/characteristics given below are also considered for natural language description generation.
1) Time, such as dawn/dusk, day, and night (sample frames shown in Figure 9)
2) Weather conditions, including rainy, snowy, foggy, overcast, cloudy, and clear (sample frames shown in Figure 10)
3) Scene type, such as residential area, city street, and highway (sample frames shown in Figure 11)
Respecting the original train-test split of the BDD100k dataset, among the 16k dashcam videos of DeepRide, 11k videos are in the training set, taken from the training split of BDD100k, and 5k videos from the validation set of BDD100k constitute the validation and test sets of our collected dataset, as shown in Table 2.

FIGURE 4. Revision screen with an Accept option (if the reviewer agrees with the operator) and a Reject option to route the description back to the operator for re-description.

B. DESCRIPTION COLLECTION FRAMEWORK
We designed and developed a proprietary web-based portal for the description collection of the 16k dashcam videos. Data entry operators are examined for their driving knowledge and road experience through a basic test and interviews. The 75% operator-qualification threshold added value to the data quality. The qualified operators are assigned to describe each dashcam video in eight to ten concise yet descriptive sentences (not limited to ten sentences; if there is more to describe, they can go beyond) covering all the static scenes and dynamic events taking place on and around the road. Static scenes include parked cars, trees, signboards, and high-rise buildings on the roadside, whereas by dynamic events we mean the switching of traffic signals at an intersection, turning of the vehicle, passing under/over a bridge, an accident happening, and wiping the windshield. An overview of the DeepRide description collection procedure is demonstrated in Figure 2.

FIGURE 5. English word cloud for the 200 most frequent words used in the DeepRide dataset.

In order to ensure smooth operation at the operator's end, the dashcam videos have been adjusted to high-definition (HD) quality.
Multiple role-specific screens are available in the portal to deal with basic entry, revision, proofreading, and finalization, in parallel with administration dashboards and description-entry statistics screens. These screens include the basic description entry screen shown in Figure 3, the revision screen shown in Figure 4, and the dashboard screens for administrators shown in Figure 6.

We constitute two teams of highly qualified English-speaking operators with domain knowledge: the first for primary video description and the second for spelling, grammar, quality checks, and proofreading of the primary descriptions. We instruct the description operators to provide concise yet descriptive eight-to-ten-sentence descriptions for each dashcam video in English.

The web portal groups 100 dashcam videos selected from the dashcam video pool to constitute a batch. A batch is assigned (batch status: Assigned) to a specific operator by the administrator. On completion of the batch description, the operator submits (batch status: Submitted) the batch back to the administrator for revision (batch status: Revision). The batch is checked for spelling, grammar, and description quality and is assigned back to the same operator for corrections if it does not satisfy the description standards. Operators with more than 10% rejections are disqualified from further description tasks. Upon acceptance, the batch is further assigned to the proofreading operator (batch status: Proofread), who again checks for spelling, grammar, and description quality. After proofreading, the batch is pushed to the administrator section for batch finalization (batch status: Finalized). The batch statuses are described in Table 3.

FIGURE 6. Four sample dashboard screens of the description collection framework, displaying user and batch statistics.

FIGURE 7. Directed network graphs for keywords and terms occurring in close proximity to each other, elaborating the relationships between keywords.

C. DATASET STATISTICS
DeepRide is a dense dashcam video description dataset spanning 177 hours, with 16k paragraphs, a vocabulary density of 0.004, and a readability index of 5.209. Our dataset has a paragraph of eight to ten diverse, temporally ordered sentences for every dashcam video. It adds up to 976,941 total words with 3,722 unique words, resulting in roughly 130K sentences overall. The average description length of nine sentences with 68 words makes it superior to the other datasets shown in Table 1. We lay out detailed statistics in Table 2. Figure 5 presents the English word cloud for the DeepRide dataset (top 200 most frequent words).
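To make the layout of a single annotation concrete, the sketch below shows a hypothetical DeepRide entry in the ActivityNet-Captions-style JSON format used for the corpus (see Section IV). The video identifier, timestamps, sentences, and the extra location field are invented for illustration only; the exact key names in the distributed files may differ.

import json

# Hypothetical DeepRide annotation record (illustrative, not the released schema).
# The layout follows the ActivityNet-Captions convention: one record per video,
# per-event timestamps in seconds, and one sentence per event.
deepride_entry = {
    "bdd_000001": {
        "duration": 40.0,                            # every clip is 40 seconds long
        "location": "Market Street, San Francisco",  # assumed field derived from GPS/IMU metadata
        "timestamps": [[0.0, 5.0], [5.0, 18.0], [18.0, 40.0]],
        "sentences": [
            "It is a clear day on a busy city street.",
            "The car moves slowly behind a white SUV with high-rise buildings on both sides.",
            "The vehicle stops at a red traffic signal while pedestrians cross the road.",
        ],
    }
}

# The corpus files are stored as JSON, so a record can be serialized directly.
print(json.dumps(deepride_entry, indent=2))

Keeping the records in this shape should let ActivityNet-Captions-style data loaders read the corpus with little or no modification.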

FIGURE 8. Overview of the location-aware dense video description framework: video features (RGB and flow) and 300-dimensional word embeddings are employed. GPS/IMU information from the dashcam video is utilized to fetch the location of the trip. Mean latitude and longitude values are calculated and then searched in the geographical database for the corresponding high-level location containing the city and road name. The trip description generated by the memory-augmented recurrent transformer is concatenated with the template-based location sentence to form the location-aware trip description. Evaluation is performed by comparing the ground-truth (reference) description and the generated location-aware trip description.

TABLE 3. Batch statuses in the description collection framework.

IV. METHOD
We developed a location-aware video description evaluation framework that generates human-analogous descriptions for dashcam videos, and we employ various transformer-based dense video captioning models to evaluate our proposed dataset. An overview of the proposed framework is shown in Figure 8. Transformers have proven to be efficient and powerful for sequential modeling. We investigate recurrent transformers as candidate models: the Masked Transformer [46], Transformer-XL [47], and the Memory-Augmented Recurrent Transformer (MART) [5]. We choose MART as the fundamental building block of our proposed framework: a transformer-based model [44] with an additional memory module. As part of a shared encoder-decoder environment, the augmented memory block leverages the video segments and their previous caption history to assist with next-sentence generation. We generate our dataset corpus compliant with the ActivityNet-Captions dataset in JavaScript Object Notation (JSON) file format. We evaluate our dataset with the metrics BLEU (1 to 4), CIDEr, ROUGE-L, METEOR, and Repetition (1 to 4). We investigate the results of the following models while evaluating our proposed dataset, DeepRide.
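Before turning to the individual models, the location step summarized in the Figure 8 caption can be made concrete. The sketch below averages the per-frame GPS samples, looks the mean position up in a pre-built latitude/longitude repository, and prepends a template-based location sentence to the generated paragraph; the function names, rounding grid, and repository contents are illustrative assumptions rather than the released implementation.

# Pre-cached geographic repository: mean (lat, lon) keys rounded to a coarse grid,
# mapped to a human-readable "road, city" string. In the framework this table is
# populated offline from a geocoding service (Section IV-D); values here are made up.
GEO_REPOSITORY = {
    (37.774, -122.419): "Market Street, San Francisco",
}

def mean_position(gps_track, ndigits=3):
    """Average the latitude/longitude samples recorded with the clip."""
    lat = sum(p[0] for p in gps_track) / len(gps_track)
    lon = sum(p[1] for p in gps_track) / len(gps_track)
    return round(lat, ndigits), round(lon, ndigits)

def location_sentence(gps_track):
    """Build the template-based location sentence (S0 in the problem statement)."""
    place = GEO_REPOSITORY.get(mean_position(gps_track), "an unknown location")
    return f"The vehicle is driving on {place}."

def location_aware_description(gps_track, generated_paragraph):
    """Concatenate the location sentence with the model-generated trip description."""
    return f"{location_sentence(gps_track)} {generated_paragraph}"

# Example with a fabricated GPS track and a fabricated generated paragraph.
track = [(37.7741, -122.4195), (37.7742, -122.4193), (37.7738, -122.4187)]
print(location_aware_description(track, "It is a clear day on a busy city street."))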


A. MASKED TRANSFORMER
In the context of neural machine translation (NMT), [44] introduced the basic transformer architecture, implementing a self-attention mechanism with the objectives of parallelization, reduced computational complexity, and long-range dependency handling; this architecture was employed for video-to-text paragraph description generation by [46]. They proposed a masking network comprising a video encoder, a proposal decoder, and a captioning decoder, aiming to decode the proposal-specific representations into differentiable masks and resulting in consistent training of the proposal generator and captioning decoder. Learning representations capable of capturing long-range dependencies is addressed by employing self-attention, facilitating more effective learning.

FIGURE 9. Sample frames from the train set representing the time-of-day attribute: daytime, nighttime, and dawn/dusk.

B. TRANSFORMER-XL
Introducing the notion of recurrence into purely self-attention-based networks, Transformer-XL [47] is capable of paragraph-like description generation by learning beyond a fixed length without interrupting temporal coherence. They introduced a simple yet effective positional encoding to generalize attention weights beyond training, and the reuse of hidden states to build up the recurrent connection between segments.

TABLE 4. Simulation parameters.

C. MART
MART, proposed by [5] for the video-to-text paragraph description generation task, is based on the vanilla transformer model [44]. Unlike the vanilla model with separate encoder-decoder networks, MART introduced a shared encoder-decoder environment with an auxiliary memory module to enable recurrence in transformers. The augmented external memory block, similar to an LSTM [56] or GRU [57], facilitates the processing of caption-history information corresponding to the video segments. The shared encoder-decoder environment and the memory module implemented by MART allow it to utilize previous contextual information, so it is able to produce a better paragraph that is more coherent and less repetitive.

D. LOCATION-AWARE DESCRIPTION GENERATION
The proposed dataset, DeepRide, utilizes the GPS/IMU recording of the preserved trajectory information while processing the corresponding dashcam video to generate a location-aware road trip description. The latitude and longitude values associated with the dashcam videos are cached from the Google Geocoding API with their corresponding position/location, containing the road and city name, and stored in a geographic repository. This database is used to obtain the location associated with the latitude and longitude of the dashcam video while generating the trip summary. Further, the location-containing sentence is concatenated with the generated paragraph summary to form the location-aware trip description, as demonstrated in Figure 8.

V. EXPERIMENTATION
A. FEATURE EXTRACTION
In order to keep the scenes standardized/uniform and obtain features, we sample 15 frames per second and extract I3D features [58] from these sampled frames. The sampling mechanism is based on time, not on frame rate: if a dashcam video is encoded at either 30 or 60 fps, the trip description system will still sample and process 15 frames per second, and if the frame rate is less than the required 15 fps, the system achieves the desired rate by adding zero padding.

We feed 64 frames with a spatial size of 224 × 224. For better feature representations, we use the I3D model pre-trained on the Kinetics training dataset [59] and calculate the video RGB and optical-flow features prior to training. We extract the temporal features using PWC-Net [60]. The I3D spatial/RGB 1024-D feature vectors and temporal/optical-flow 1024-D feature vectors are concatenated to form the input to the transformer layers, yielding a single 2048-D representation for every stack of 64 frames. The dashcam videos are 40 seconds long and hence yield ten segments of 224 × 224 × 64, which is sufficient to generate ten sentences. We employ GloVe-6B 300-dimensional word embeddings and a generated vocabulary index for the language model.
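A rough sketch of the sampling, padding, and stacking logic described above, under the stated settings (15 sampled frames per second, 64-frame stacks, 1024-D RGB and 1024-D flow features concatenated into a 2048-D vector per stack, ten stacks for a 40-second clip). The I3D streams and PWC-Net are abstracted away as placeholder callables, and how the released pipeline handles the last partial stack is an assumption (zero-padded here).

import numpy as np

FRAMES_PER_SECOND = 15    # temporal sampling rate used for feature extraction
STACK_SIZE = 64           # frames per I3D input stack
CLIP_SECONDS = 40         # every DeepRide dashcam clip is 40 seconds long

def sample_15fps(frames, source_fps):
    """Sample 15 frames per second regardless of the encoded frame rate;
    missing frames (when the source rate is below 15 fps) are zero-padded."""
    target_len = CLIP_SECONDS * FRAMES_PER_SECOND                 # 600 frames
    idx = (np.arange(target_len) * source_fps / FRAMES_PER_SECOND).astype(int)
    blank = np.zeros_like(frames[0])
    return np.stack([frames[i] if i < len(frames) else blank for i in idx])

def clip_features(frames, i3d_rgb, i3d_flow):
    """Split the sampled frames into 64-frame stacks (zero-padding the tail) and
    concatenate the 1024-D RGB and 1024-D flow features into one 2048-D vector
    per stack; a 40-second clip yields a (10, 2048) matrix."""
    n_stacks = int(np.ceil(len(frames) / STACK_SIZE))
    padded = np.zeros((n_stacks * STACK_SIZE,) + frames.shape[1:], dtype=frames.dtype)
    padded[: len(frames)] = frames
    feats = []
    for s in range(n_stacks):
        stack = padded[s * STACK_SIZE : (s + 1) * STACK_SIZE]     # (64, H, W, 3)
        rgb = i3d_rgb(stack)    # placeholder for the Kinetics-pretrained I3D RGB stream -> (1024,)
        flow = i3d_flow(stack)  # placeholder for the I3D flow stream fed with PWC-Net flow -> (1024,)
        feats.append(np.concatenate([rgb, flow]))                 # (2048,)
    return np.stack(feats)

# Example with tiny dummy frames (real frames are 224 x 224) and dummy extractors.
dummy = np.zeros((CLIP_SECONDS * 30, 16, 16, 3), dtype=np.float32)   # a 30-fps clip
feats = clip_features(sample_15fps(dummy, source_fps=30),
                      i3d_rgb=lambda s: np.zeros(1024, dtype=np.float32),
                      i3d_flow=lambda s: np.zeros(1024, dtype=np.float32))
print(feats.shape)    # (10, 2048)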


TABLE 5. Description results on the test set of the DeepRide dataset. ↑ means higher is better, whereas ↓ means lower is better.

FIGURE 10. Sample frames from the train set representing the weather attribute: clear, overcast, foggy, cloudy, rainy, and snowy.

FIGURE 11. Sample frames from the train set representing the scene attribute: city street, highway, residential, tunnel, and parking lot.
B. IMPLEMENTATION DETAILS
We adopt the implementation details of MART [5] for coherent video description generation. MART uses two transformer layers with 12-head multi-head attention, a hidden size of 768, and positional encoding as described in [44]. A memory module with a recurrent memory state of one is included in the model.

A major challenge in training a machine learning model is determining how many epochs should be run. Too few epochs are likely to delay the model's convergence, while too many may lead to overfitting. In order to reduce overfitting without compromising model accuracy, early stopping is used to optimize the model. Early stopping, as a form of regularization, is primarily concerned with halting training before an over-fitted model occurs; it also avoids training the model for longer than necessary, saving computational power.

We employed early stopping with a patience of 10 epochs. With the greedy decoding approach, we measured CIDEr-D as the primary evaluation parameter and early-stopping condition. We show the simulation parameters in Table 4. We train the model for 50 epochs, using the Adam optimizer with a five-epoch warm-up, an initial learning rate of 0.0001, β1 = 0.9, β2 = 0.999, and a weight decay of 0.01.
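The recipe above can be summarized in a short training skeleton; this is a sketch under the stated settings, not the released training script, and the model, data loader, and validation scorer are assumed placeholders.

import torch

def train(model, train_loader, cider_d_scorer, device="cuda"):
    """Training skeleton following Table 4 / Section V-B.
    `cider_d_scorer(model)` is assumed to return the greedy-decoded CIDEr-D score
    on the validation split; `model(**batch)` is assumed to return the loss."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), weight_decay=0.01)
    # Linear warm-up over the first five epochs, constant learning rate afterwards.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / 5))

    best_cider, patience, bad_epochs = float("-inf"), 10, 0
    for epoch in range(50):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**{k: v.to(device) for k, v in batch.items()})
            loss.backward()
            optimizer.step()
        scheduler.step()

        score = cider_d_scorer(model)          # greedy decoding on the validation split
        if score > best_cider:                 # keep the best checkpoint
            best_cider, bad_epochs = score, 0
            torch.save(model.state_dict(), "best_deepride_model.pt")
        else:
            bad_epochs += 1
            if bad_epochs >= patience:         # early stopping with patience 10
                break
    return best_cider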

C. EVALUATION METRICS
We evaluate the model using popular automatic evaluation metrics for dense video captioning: Bilingual Evaluation Understudy (BLEU) [61], Consensus-based Image Description Evaluation (CIDEr) [62], Recall-Oriented Understudy for Gisting Evaluation (ROUGE) [63], Metric for Evaluation of Translation with Explicit ORdering (METEOR) [64], and Repetition. We employ the standard evaluation sources from the MSCOCO server.
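For the caption metrics, the standard MSCOCO evaluation toolkit can be driven as in the sketch below, assuming the pycocoevalcap package and pre-tokenized, lower-cased paragraphs keyed by video id; the Repetition metric is not part of that toolkit and is omitted here.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge

def caption_scores(ground_truth, generated):
    """ground_truth: {video_id: [reference paragraph, ...]},
    generated:    {video_id: [generated paragraph]}."""
    scorers = [
        (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
        (Meteor(), "METEOR"),
        (Rouge(), "ROUGE-L"),
        (Cider(), "CIDEr"),
    ]
    results = {}
    for scorer, name in scorers:
        score, _ = scorer.compute_score(ground_truth, generated)
        if isinstance(name, list):              # BLEU returns one value per n-gram order
            results.update(dict(zip(name, score)))
        else:
            results[name] = score
    return results

# Toy example with a single video id.
gts = {"bdd_000001": ["the car stops at a red traffic signal"]}
res = {"bdd_000001": ["the car stops at the traffic signal"]}
print(caption_scores(gts, res))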

VI. RESULTS & DISCUSSION
We compare three transformer-based models and record the results. In Table 5, we report all three results; however, only the MART-based result demonstrated superior performance on the DeepRide dataset for all of the evaluation metrics, while the other two transformers, Transformer-XL and the Masked Transformer, demonstrated average performance. The high performance of the MART transformer is fundamentally due to its memory module, and it has shown good performance owing to the nature of the dataset descriptions.
Driving video datasets are challenging and contain many similar features within every dashcam video. Therefore, there is a significant amount of text description that can repeat due to feature similarities, as shown in Figures 9, 10, and 11. MART took advantage of memory and generated far better sequences of sentences while describing the video features. These attributes show various scenes from the dataset, where every scene has typical graphics, i.e., road, marking lines, cars, zebra crossings, signals, buildings, trees, parked vehicles, pedestrians, etc. Therefore, it is quite possible that once a data entry operator describes that the vehicle is moving on the road, the model can predict this sentence from every frame because the features are present throughout the video; this sometimes causes the model to predict the sentence at some other time slot, since it is a global scenario. Similarly, event features local to the timeline are predicted at the time of occurrence, i.e., the vehicle stops, turns right, slows down, crosses an underpass, pedestrians crossing, etc. Although the scene predictions are global and can be listed at any specific time, we have obtained encouraging results, setting a baseline for further improvements. We show the description analysis in Figure 12 and qualitative results in Figure 13.

FIGURE 12. Description analysis - English ground-truth descriptions of sample dashcam videos from the train set of the DeepRide dataset. The brown-colored sentence is the trip location information. Blue-colored text indicates static scenes, whereas green-colored text indicates dynamic events taking place on and around the road.

FIGURE 13. Qualitative analysis - English ground-truth and generated descriptions of sample dashcam videos from the test set of the DeepRide dataset. Blue-colored text indicates static scenes, whereas green-colored text indicates dynamic events. Only the qualitative results produced by the location-aware MART are included.

VII. CONCLUSION & FUTURE WORK
In this research work, we presented DeepRide, a new, diverse, location-aware dashcam video description dataset intended to explore emerging autonomous vehicle driving from the perspective of the fast-growing video description domain, featuring 16k dashcam videos linked with around 130k sentences of description in English. This dataset may help automate the creation of driving commentary. Moreover, the embedded GPS/IMU information recording capability of dashcam video empowers the description system to associate the concerned locations and positions with natural language descriptions. We provided guidelines for integrating location information extraction with recurrent transformers. Further, our proposed dataset opens a new dimension of diverse and exciting applications: self-driving vehicle reporting, driver and vehicle safety, inter-vehicle road intelligence sharing, and travel occurrence reports. Our future efforts will include creating descriptions for all dashcam videos made publicly available by BDD100k, focusing on videos recorded by the rear camera, extending the language domain from single to multilingual, and pursuing object detection and relational feature research. We anticipate that the DeepRide dataset's release will help advance visio-linguistic research.

REFERENCES
[1] S. Bhatt, F. Patwa, and R. Sandhu, "Natural language processing (almost) from scratch," in Proc. IEEE 3rd Int. Conf. Collaboration Internet Comput. (CIC), Jan. 2017, pp. 328–338.
[2] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, "HERO: Hierarchical encoder for video+language omni-representation pre-training," in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), 2020, pp. 2046–2065.
[3] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, "COOT: Cooperative hierarchical transformer for video-text representation learning," in Proc. NeurIPS, 2020, pp. 1–27.
[4] T. Jin, S. Huang, M. Chen, Y. Li, and Z. Zhang, "SBAT: Video captioning with sparse boundary-aware transformer," in Proc. 29th Int. Joint Conf. Artif. Intell., Jul. 2020, pp. 630–636.
[5] J. Lei, L. Wang, Y. Shen, D. Yu, T. Berg, and M. Bansal, "MART: Memory-augmented recurrent transformer for coherent video paragraph captioning," in Proc. 58th Annu. Meeting Assoc. Comput. Linguistics, 2020, pp. 2603–2614.
[6] Z. Yu and N. Han, "Accelerated masked transformer for dense video captioning," Neurocomputing, vol. 445, pp. 72–80, Jul. 2021, doi: 10.1016/j.neucom.2021.03.026.
[7] M. Hosseinzadeh and Y. Wang, "Video captioning of future frames," in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Jan. 2021, pp. 980–989.
[8] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, "Dense-captioning events in videos," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 706–715.
[9] D. L. Chen and W. B. Dolan, "Collecting highly parallel data for paraphrase evaluation," in Proc. 49th Annu. Meeting Assoc. Comput. Linguistics: Hum. Lang. Technol. (ACL-HLT), vol. 1, 2011, pp. 190–200.
[10] G. A. Sigurdsson, X. Wang, A. Farhadi, I. Laptev, and A. Gupta, "Hollywood in homes: Crowdsourcing data collection for activity understanding," 2016, arXiv:1604.01753.
[11] M. Rohrbach, S. Amin, M. Andriluka, and B. Schiele, "A database for fine grained activity detection of cooking activities," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 1194–1201.
[12] P. Das, C. Xu, R. F. Doell, and J. J. Corso, "A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2634–2641.
[13] M. Regneri, M. Rohrbach, D. Wetzel, S. Thater, B. Schiele, and M. Pinkal, "Grounding action descriptions in videos," Trans. Assoc. Comput. Linguistics, vol. 1, pp. 25–36, 2013.
[14] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele, "Coherent multi-sentence video description with variable level of detail," in Lecture Notes in Computer Science, vol. 8753, 2014, pp. 184–195. [Online]. Available: https://arxiv.org/abs/1403.6173
[15] L. Zhou, C. Xu, and J. J. Corso, "Towards automatic learning of procedures from web instructional videos," in Proc. 32nd AAAI Conf. Artif. Intell., 2018.
[16] G. Huang, B. Pang, Z. Zhu, C. Rivera, and R. Soricut, "Multimodal pretraining for dense video captioning," 2020, arXiv:2011.11760.
[17] A. Rohrbach, M. Rohrbach, N. Tandon, and B. Schiele, "A dataset for movie description," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3202–3212.
[18] A. Torabi, C. Pal, H. Larochelle, and A. Courville, "Using descriptive video services to create a large data source for video annotation research," 2015, arXiv:1503.01070.
[19] L. Zhou, Y. Kalantidis, X. Chen, J. J. Corso, and M. Rohrbach, "Grounded video description," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 6571–6580.
[20] S. Gella, M. Lewis, and M. Rohrbach, "A dataset for telling the stories of social media videos," in Proc. Conf. Empirical Methods Natural Lang. Process., 2018, pp. 968–974.
[21] J. Lei, L. Yu, T. L. Berg, and M. Bansal, "TVR: A large-scale dataset for video-subtitle moment retrieval," in Lecture Notes in Computer Science, vol. 12366, 2020, pp. 447–463. [Online]. Available: https://arxiv.org/abs/2001.09099v2
[22] S. Zhang, Z. Tan, J. Yu, Z. Zhao, K. Kuang, J. Liu, J. Zhou, H. Yang, and F. Wu, "Poet: Product-oriented video captioner for E-commerce," 2020, arXiv:2008.06880.
[23] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5288–5296.
[24] K. H. Zeng, T. H. Chen, J. C. Niebles, and M. Sun, "Title generation for user generated videos," in Lecture Notes in Computer Science, vol. 9906, 2016, pp. 609–625. [Online]. Available: https://arxiv.org/abs/1608.07068

[25] X. Wang, J. Wu, J. Chen, L. Li, Y.-F. Wang, and W. Y. Wang, "VaTeX: A large-scale, high-quality multilingual dataset for video-and-language research," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 4580–4590.
[26] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell, "BDD100K: A diverse driving dataset for heterogeneous multitask learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 2633–2642.
[27] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, "NuScenes: A multimodal dataset for autonomous driving," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11618–11628.
[28] Y. Choi, N. Kim, S. Hwang, K. Park, J. S. Yoon, K. An, and I. S. Kweon, "KAIST multi-spectral day/night data set for autonomous and assisted driving," IEEE Trans. Intell. Transp. Syst., vol. 19, no. 3, pp. 934–948, Mar. 2018.
[29] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 3354–3361.
[30] M. Rafiq, G. Rafiq, and G. S. Choi, "Video description: Datasets & evaluation metrics," IEEE Access, vol. 9, pp. 121665–121685, 2021.
[31] Q. Zheng, C. Wang, and D. Tao, "Syntax-aware action targeting for video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 13093–13102.
[32] H. Chen, K. Lin, A. Maye, J. Li, and X. Hu, "A semantics-assisted video captioning model trained with scheduled sampling," 2019, arXiv:1909.00121.
[33] J. Hou, X. Wu, W. Zhao, J. Luo, and Y. Jia, "Joint syntax representation learning and visual cue translation for video captioning," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 8917–8926.
[34] N. Aafaq, N. Akhtar, W. Liu, and A. Mian, "Empirical autopsy of deep video captioning frameworks," 2019, arXiv:1911.09345.
[35] J. Chen, Y. Pan, Y. Li, T. Yao, H. Chao, and T. Mei, "Temporal deformable convolutional encoder-decoder networks for video captioning," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8167–8174.
[36] J. Perez-Martin, B. Bustos, and J. Perez, "Attentive visual semantic specialized network for video captioning," in Proc. 25th Int. Conf. Pattern Recognit. (ICPR), Jan. 2021, pp. 5767–5774.
[37] J. Xu, H. Wei, L. Li, Q. Fu, and J. Guo, "Video description model based on temporal-spatial and channel multi-attention mechanisms," Appl. Sci., vol. 10, no. 12, p. 4312, Jun. 2020.
[38] Z. Zhang, Y. Shi, C. Yuan, B. Li, P. Wang, W. Hu, and Z. Zha, "Object relational graph with teacher-recommended learning for video captioning," 2020, arXiv:2002.11566.
[39] C. Yan, Y. Tu, X. Wang, Y. Zhang, X. Hao, Y. Zhang, and Q. Dai, "STAT: Spatial-temporal attention mechanism for video captioning," IEEE Trans. Multimedia, vol. 22, no. 1, pp. 229–241, Feb. 2020.
[40] L. Gao, X. Wang, J. Song, and Y. Liu, "Fused GRU with semantic-temporal attention for video captioning," Neurocomputing, vol. 395, pp. 222–228, Jun. 2020, doi: 10.1016/j.neucom.2018.06.096.
[41] S. Liu, Z. Ren, and J. Yuan, "SibNet: Sibling convolutional encoder for video captioning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 9, pp. 3259–3272, Sep. 2021.
[42] S. Pramanik, P. Agrawal, and A. Hussain, "OmniNet: A unified architecture for multi-modal multi-task learning," 2019, arXiv:1907.07804.
[43] V. Lashin and E. Rahtu, "Multi-modal dense video captioning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2020, pp. 4117–4126.
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
[45] A. Hussain, T. Hussain, W. Ullah, and S. W. Baik, "Vision transformer and deep sequence learning for human activity recognition in surveillance videos," Comput. Intell. Neurosci., vol. 2022, pp. 1–10, Apr. 2022.
[46] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, "End-to-end dense video captioning with masked transformer," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8739–8748.
[47] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2020, pp. 2978–2988.
[48] N. Li, Z. Chen, and S. Liu, "Meta learning for image captioning," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8626–8633.
[49] D. He, X. Zhao, J. Huang, F. Li, X. Liu, and S. Wen, "Read, watch, and move: Reinforcement learning for temporally grounding natural language descriptions in videos," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 8393–8400.
[50] W. Zhang, B. Wang, L. Ma, and W. Liu, "Reconstruct and represent video contents for captioning via reinforcement learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 12, pp. 3088–3101, Dec. 2020.
[51] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang, "Video captioning via hierarchical reinforcement learning," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 4213–4222.
[52] Y. Chen, S. Wang, W. Zhang, and Q. Huang, "Less is more: Picking informative frames for video captioning," in Proc. ECCV, Lecture Notes in Computer Science, vol. 11217, 2018, pp. 367–384. [Online]. Available: https://openaccess.thecvf.com/content_ECCV_2018/html/Yangyu_Chen_Less_is_More_ECCV_2018_paper.html
[53] W. Xu, J. Yu, Z. Miao, L. Wan, Y. Tian, and Q. Ji, "Deep reinforcement polishing network for video captioning," IEEE Trans. Multimedia, vol. 23, pp. 1772–1784, 2021.
[54] G. Singh, S. Akrigg, M. Di Maio, V. Fontana, R. Javanmard Alitappeh, S. Saha, K. Jeddisaravi, F. Yousefi, J. Culley, T. Nicholson, J. Omokeowa, S. Khan, S. Grazioso, A. Bradley, G. Di Gironimo, and F. Cuzzolin, "ROAD: The ROad event awareness dataset for autonomous driving," 2021, arXiv:2102.11585.
[55] J. Geyer, Y. Kassahun, M. Mahmudi, X. Ricou, R. Durgesh, A. S. Chung, L. Hauswald, V. Hoang Pham, M. Mühlegg, S. Dorn, T. Fernandez, M. Jänicke, S. Mirashi, C. Savani, M. Sturm, O. Vorobiov, M. Oelker, S. Garreis, and P. Schuberth, "A2D2: Audi autonomous driving dataset," 2020, arXiv:2004.06320.
[56] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[57] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," 2014, arXiv:1412.3555.
[58] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the kinetics dataset," in Proc. CVPR, Jul. 2017, pp. 6299–6308.
[59] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, "The kinetics human action video dataset," 2017, arXiv:1705.06950.
[60] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8934–8943.
[61] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics (ACL), 2002, pp. 311–318.
[62] R. Vedantam, C. L. Zitnick, and D. Parikh, "CIDEr: Consensus-based image description evaluation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 4566–4575.
[63] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, Jul. 2004, pp. 74–81. [Online]. Available: https://aclanthology.org/W04-1013
[64] A. Lavie and A. Agarwal, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. 2nd Workshop Stat. Mach. Transl., 2007, pp. 223–228. [Online]. Available: http://acl.ldc.upenn.edu/W/W05/W05-09.pdf#page=75

GHAZALA RAFIQ received the B.Sc. degree in mathematics from Punjab University, Lahore, Pakistan, in 2000, and the master's degree in computer science, in 2002. She is currently pursuing the Ph.D. degree with the Data Sciences Laboratory, Department of Information and Communication Engineering, Yeungnam University, South Korea. She has over 15 years of industry experience. Her research interests include deep-learning applications, video description, reinforcement learning, natural language processing, computer vision, and pattern recognition.


MUHAMMAD RAFIQ (Member, IEEE) received the M.S. degree in electronics engineering from International Islamic University, Pakistan, in 2008, and the Ph.D. degree in information and communication engineering from Yeungnam University, South Korea, in 2022. He has extensive industry experience with a background in databases, business applications, and industrial technology solutions. His research interests include modern 3-D game development, computer vision, video description incorporating artificial intelligence, and deep learning.

BYUNG-WON ON received the M.S. degree from the Department of Computer Science and Engineering, Korea University, Seoul, South Korea, in 2000, and the Ph.D. degree from the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, in 2007. He is currently a Professor with the Department of Software Science and Engineering, Kunsan National University, Gunsan-si, South Korea, where he also leads the Data Intelligence Laboratory. His current research interests revolve around data mining, in particular probability theory and applications, machine learning, and artificial intelligence, mainly working on abstractive summarization, creative computing, and multiagent reinforcement learning. He is currently serving as a Committee Member for ISO/IEC JTC 1/SC 32, the Korean Association of Data Science, the SIG on Human Language Technology of the Korean Institute of Information Scientists and Engineers (KIISE), and the Informatization Committee and Jeonbuk Large Leap Policy Consultation Body of the Jeollabuk-do Provincial Government. He is an Editor of the Journal of the Korean Institute of Information Scientists and Engineers (KIISE), the Electronics and Telecommunications Research Institute (ETRI) Journal, and Quality and Quantity.

MANKYU SUNG received the B.S. degree in computer science from Chungnam National University, Daejeon, in 1993, and the M.S. and Ph.D. degrees in computer science from the University of Wisconsin–Madison, WI, USA, in 2005. From January 1995 to July 2012, he worked for the Digital Contents Division, ETRI, Daejeon, South Korea. He has been an Assistant Professor with the Department of Game and Mobile, Keimyung University, Daegu, South Korea, since March 2012. His current research interests include computer graphics, deep-learning applications, computer animation, computer games, and human–computer interaction. He is a member of the ACM.

GYU SANG CHOI (Member, IEEE) received the Ph.D. degree in computer science and engineering from Pennsylvania State University. He was a Research Staff Member at the Samsung Advanced Institute of Technology (SAIT), Samsung Electronics, from 2006 to 2009. Since 2009, he has been with Yeungnam University, where he is currently an Assistant Professor. He is currently working on embedded systems and storage systems, while his prior research mainly focused on improving the performance of clusters. His research interests include embedded systems, storage systems, parallel and distributed computing, supercomputing, cluster-based web servers, and data centers. He is a member of the ACM.