Transformer Network For Video To Text Translation
Abstract—Recently, the generation of natural language descriptions for videos has attracted a lot of attention in computer vision and natural language processing research. Video understanding involves detecting a scene's visual and temporal elements and reasoning over them to generate a description. Several real-world applications, such as video indexing and retrieval and video-to-sign-language translation, are based on it. Because of the complicated nature and diversified content of videos, the captioning problem is challenging. It is usually treated as a machine translation problem and handled with a GRU- or LSTM-based encoder-decoder architecture. In such models, however, decoding starts from the final hidden state of the encoder alone, which cannot be a good summary of the input sequence because all the intermediate encoder states are ignored. This paper proposes a transformer network with a deep attention-based encoder and decoder to generate natural language descriptions for video sequence data. The network processes the sequence as a whole and learns the relationship between its elements through attention.

Index Terms—Video captioning, CNN, Transformer network, Attention mechanism, LSTM, RNN

I. INTRODUCTION

Understanding the contents of visual data is a complex task. Nowadays, machine learning techniques allow us to train on the context of a dataset so that an algorithm can understand the content of a video. Problems that combine the description of visual content with computer vision and natural language processing are taking on increasingly complex challenges, and their accuracy is approaching that of human observation. Studying and analysing the content of a video is an important research area in multimedia. Generating contextual descriptions of visual content is a challenging and essential task in computer vision. It generally includes feature extraction and the generation of descriptions based on the extracted feature vectors. Semantic concepts such as scenes, objects, actions, interactions between objects and the temporal ordering of events should be considered when designing an efficient architecture for the captioning problem. Moreover, the extracted visual information must be translated into grammatically correct natural language while preserving the semantic concepts. Content-based recommendation and retrieval, human-robot interaction, autonomous driving, video subtitling, procedure generation for instructional videos, video surveillance, software for visually impaired people and sign language understanding are among the real-world applications.

Convolutional neural networks (CNNs) provide sophisticated feature representations by performing a series of convolution operations over images or videos (i.e., series of frames). These convolutions compare the visual data of the frames against specific patterns (filters) that the network is looking for. As the network performs more convolutions it can identify specific objects, which it learns from large amounts of labelled data. A CNN, however, can only capture the spatial or visual features of an image; it cannot handle temporal features, that is, how a frame relates to the ones before it. Temporally sensitive models can be vector-to-sequence, sequence-to-vector or sequence-to-sequence. Hence the importance of temporally sensitive models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, autoencoders and transformer networks, which take the output of the CNN and produce either a vector or a sequence, depending on the model. An attention mechanism helps decide which part of the input to focus on in order to yield a more accurate outcome. Video captioning is a kind of sequence-to-sequence modelling: it takes a series of frames as input and generates a textual description.

The need to capture interactions between objects, to identify fine-grained changes of the video content in the temporal dimension and to prioritize the activities captured in the video makes the captioning problem even more challenging.

The rest of this paper is organized as follows. Section II discusses related work on the different methodologies proposed for video captioning. The transformer network proposed for video captioning is described in Section III. The datasets used for training the models are presented in Section IV, and Section V explains the evaluation metrics.

II. RELATED WORKS

The existing methodologies can be categorized into two groups: template-based methods and sequence learning methods.

A. Template based method

Template-based methods [1] rely on a set of specific grammar rules. First, each sentence is divided into three
types of fragments, subject, verb and object, by following the grammar. Visual content detection assigns the detected words to object, action and attribute categories, and each fragment is then associated with the detected words. The generated fragments are finally composed into a sentence using a predefined language template. This method focuses mainly on the separate detection of predefined entities and events. Describing open-domain videos with this method turns out to be unrealistic or too expensive because of its computational complexity.

B. Sequence learning method

A deep RNN-based model [2] that translates videos to natural language is a naive approach. The frame features are extracted using a CNN and mean-pooled across the entire video to obtain a single video descriptor, which is then fed into an LSTM network to generate the textual description. This mean pooling, however, causes a loss of temporal information.

Another video-to-text work [3] discusses an end-to-end sequence-to-sequence video-to-text (S2VT) model for generating video captions. The method is similar to machine translation between natural languages. A two-stack LSTM first encodes the feature vectors generated by a CNN from RGB images or optical-flow images, and the decoder then generates sentences. Decoding starts only after all the features are encoded, so the decoder receives only the previous output and hidden states as input; this works well for short sequences but cannot memorize long-term dependencies.

An attention-based LSTM with semantic consistency is used in the work of [4] for video captioning. This framework integrates an attention mechanism with an LSTM to capture the salient structures of the video and also exploits the correlation between multimodal representations (i.e., text and visual data) to generate sentences with rich semantic content. The Inception-v3 CNN [5] is used to extract more meaningful spatial features; this architecture performs convolution along with pooling in a single CNN layer and stacks the resulting feature maps into a single output volume. The extracted features are fed into an attention-based long short-term memory encoder-decoder.

A dual-stream recurrent neural network architecture for video captioning [6] considers both a visual and a semantic stream. The architecture includes a visual descriptor and a semantic descriptor, which encode visual and semantic features respectively. The visual descriptor encodes the frame representations of the video, while the semantic descriptor encodes each video frame with a high-level representation of semantic concepts such as objects, actions and interactions. Because the visual and semantic descriptors are two different, asynchronous modalities, dual-stream RNNs are used to flexibly exploit the hidden states of each stream. Finally, the hidden-state representations of the visual and semantic descriptors are integrated, and a dual-stream decoder performs hidden-state fusion for sentence generation. An attentive multi-grained encoder module enhances local feature learning with global semantic features for each modality.

Another study, based on a reconstruction network for video captioning [7], builds on the dual learning approach [8], a learning framework that leverages the primal-dual structure of AI tasks to obtain effective feedback or regularization signals that enhance the learning or inference process. Regularization techniques are used to avoid overfitting, which happens when the model tries to capture the noise in the training data. The architecture consists of three modules: a CNN-based encoder that extracts the semantic representations of the video frames, an LSTM-based decoder that generates the natural language description of the visual content, and a reconstructor that exploits the backward flow from caption to visual content to reproduce the frame representations. The reconstructed representation provides a constraint that pushes the decoder to embed more information about the input video representations.

Fused GRU with semantic-temporal attention for video captioning [9] provides two types of attention, semantic and temporal. Temporal attention directs attention to specific instants of time while decoding the video representation into a textual description, whereas semantic attention provides the representations of semantically important objects exactly when they are needed. The modules of this architecture are a CNN-based video encoder for feature extraction, a semantic concept prediction network and a hierarchical semantic decoder network.

Video captioning via hierarchical reinforcement learning [10] incorporates reinforcement learning techniques into video captioning. A high-level Manager module learns to design sub-goals, and a low-level Worker module recognizes the primitive actions required to fulfill each sub-goal.

The transformer network is a major breakthrough towards more sophisticated sequence learning models. The transformer [11] is a deep learning model designed to solve sequence transduction problems such as machine translation and text summarization. It improves on the earlier architectures because it avoids recursion entirely, processing the sequence as a whole and learning the relationships between input elements with attention-based encoders and decoders. The attention mechanism gives the encoder access to the entire input and the decoder access to the target sequence generated so far, and a softmax induces the probability distribution over the output. Transformers not only provide attention but also parallelize the processing of sequences. All of these properties have greatly benefited studies of image and video captioning.

III. PROPOSED SYSTEM

The proposed system is based on the transformer network. A CNN module is designed to extract features from the video frames. A deep self-attention based encoder in the transformer converts the feature vectors from the CNN into a set of vectors that include context and order information. The decoder predicts the next word using the output of the encoder and the words predicted so far. Figure 1 shows the block diagram of the whole framework.
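As an illustrative sketch only (not part of the original paper), the per-frame feature extraction step could be implemented as follows, assuming a ResNet-50 backbone and a recent version of torchvision; the paper does not specify which CNN or framework is used.

```python
# Illustrative sketch (not the authors' code): per-frame feature extraction
# with a pretrained CNN. The ResNet-50 backbone is an assumption; the paper
# does not name the CNN used in the proposed system.
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the final classification layer to obtain 2048-d frame features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_frame_features(frames):
    """frames: list of PIL images sampled from the video.
    Returns a (num_frames, 2048) tensor of features x_1, ..., x_N
    that the transformer encoder operates on."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = feature_extractor(batch)   # (N, 2048, 1, 1)
    return feats.flatten(1)                # (N, 2048)
```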
to get the refined code z_i for x_i. Then Z = z_1, z_2, ..., z_N, where

z_i = \sum_{l=1}^{N} r_{i \to l} \, v_l \qquad (2)

The above process provides only the context and is unaffected by the frame ordering.
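For illustration, Eq. (2) can be read as standard scaled dot-product self-attention over the frame features. The excerpt does not spell out how the weights r_{i→l} are obtained, so the query-key softmax below is an assumption based on the usual transformer formulation [11]; the projection matrices Wq, Wk, Wv and the dimensions are illustrative.

```python
# Minimal sketch of Eq. (2): z_i = sum_l r_{i->l} * v_l, with the weights
# r_{i->l} assumed to be softmax-normalized query-key scores (standard
# scaled dot-product attention); this is an assumption, not the paper's spec.
import torch
import torch.nn.functional as F

def self_attention_context(X, Wq, Wk, Wv):
    """X: (N, d) frame feature vectors x_1..x_N.
    Returns Z: (N, d_v) refined codes z_1..z_N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / K.shape[-1] ** 0.5      # (N, N) similarity of frame i to frame l
    R = F.softmax(scores, dim=-1)              # r_{i->l}, each row sums to 1
    Z = R @ V                                  # z_i = sum_l r_{i->l} * v_l   (Eq. 2)
    return Z

# Example: N = 8 frames with 2048-d CNN features projected to d_model = 512.
X = torch.randn(8, 2048)
Wq, Wk, Wv = (torch.randn(2048, 512) for _ in range(3))
Z = self_attention_context(X, Wq, Wk, Wv)      # (8, 512)
```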
But the ordering does matter, so a positional embedding is used to include the ordering.

2) Positional Embedding (PE):
• Constitute a 'd'-dimensional positional embedding in which each embedding dimension carries a sinusoidal function.
• The frequency of the sine wave is directly proportional to the embedding dimension.
• Associate a number with each of the 'd' dimensions of the PE, given by the value of the sine wave at that dimension with respect to the position.
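The bullets above describe, informally, the sinusoidal positional embedding of the original transformer [11]. A minimal sketch of that standard construction is given below; the exact frequency schedule used by the authors is not stated, so the 10000-based schedule from [11] is assumed here.

```python
# Sketch of the standard sinusoidal positional embedding from [11].
# Assumes an even d_model; the frequency schedule is the one from [11],
# which is an assumption about the authors' choice.
import torch

def positional_embedding(num_positions, d_model):
    """Returns a (num_positions, d_model) matrix; row p is the PE of position p."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (P, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)               # even dimension indices
    freq = torch.pow(10000.0, -dim / d_model)                            # one frequency per dimension pair
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # cosine on odd dimensions
    return pe

# The PE is simply added to the frame feature (or word) vectors before attention:
# encoder_input = frame_features + positional_embedding(num_frames, d_model)
```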
The simple block diagram of the deep sequence encoder is shown in Figure 3. The process performed by the encoder is:
• The positional embedding is added to the feature vectors to provide the order information.
• The attention network then takes into account the context of the video frames. Every feature vector corresponding to a frame plays the role of the keys K, the values V and the queries Q.
• A skip connection is provided so that the original feature vectors are not lost: they are added to the output of the attention network.
• This added and normalized output is then fed into a feed-forward neural network, which imposes regularization or structure on the network; the hyperbolic tangent (tanh) function restricts the output of the network to lie between -1 and 1.
• The above process is repeated K times to form the deep sequence encoder.

C. Deep Self and cross-attention based decoder

The input of the decoder is the sequence of words generated so far, marked as the output sequence in the decoder network of Figure 4. The left-most word is the most recently predicted word; every time a new word is predicted, the input sequence at the bottom shifts to the right by one position. A positional embedding is added to provide the order. In cross-attention (encoder-decoder attention), the keys K and the values V are generated from the output of the encoder, while the query Q is generated from the word vectors predicted so far. Multi-head attention allows the attention mechanism to attend to different aspects of the characteristics of the sequence. A feed-forward neural network, the same as in the encoder, follows. These operations are performed J times to form the deep self- and cross-attention based decoder.
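As a rough sketch of the encoder and decoder layers just described (self-attention with a skip connection and normalization, a tanh feed-forward network, and cross-attention in the decoder), one possible PyTorch reading is given below. The layer sizes and head counts are illustrative assumptions, and nn.MultiheadAttention stands in for whatever attention implementation the authors actually used.

```python
# Illustrative reading of one encoder layer and one decoder layer as described
# in the text; not the authors' code. Dimensions and head counts are assumed.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Feed-forward network with tanh, restricting outputs to (-1, 1) as stated.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, num_frames, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # frame vectors act as Q, K and V
        x = self.norm1(x + attn_out)           # skip connection + normalization
        return self.norm2(x + self.ffn(x))

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, words, memory):          # words: predicted so far; memory: encoder output
        sa, _ = self.self_attn(words, words, words)
        words = self.norm1(words + sa)
        # Cross-attention: Q from the words predicted so far, K and V from the encoder.
        ca, _ = self.cross_attn(words, memory, memory)
        words = self.norm2(words + ca)
        return self.norm3(words + self.ffn(words))

# Stacking K EncoderLayer and J DecoderLayer instances gives the deep
# sequence encoder and the deep self- and cross-attention based decoder.
```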
IV. DATASETS