Transformer Network For Video To Text Translation
Abstract—Recently, the generation of natural language descriptions for videos has attracted a lot of attention in computer vision and natural language processing research. Video understanding involves detecting a scene's visual and temporal elements and reasoning over them to generate a description. Several real-world applications, such as video indexing and retrieval and video-to-sign-language translation, are based on it. Because of the complicated nature and diversified content of videos, the captioning problem is challenging. It is usually treated as a machine translation problem and handled with a GRU- or LSTM-based encoder-decoder architecture. In such models, however, decoding starts from the final hidden state of the encoder alone, which cannot be a good summary of the input sequence because all the intermediate encoder states are ignored. This paper proposes a transformer network with a deep attention-based encoder and decoder to generate natural language descriptions for video sequence data. The network processes the sequence as a whole and learns the relationship between its elements through attention.

Index Terms—Video captioning, CNN, Transformer network, Attention mechanism, LSTM, RNN

I. INTRODUCTION

Understanding the contents of visual data is a complex task. Nowadays, machine learning techniques allow us to train on the context of a dataset so that an algorithm can understand the content of a video. Problems that combine the description of visual content with computer vision and natural language processing are taking on increasingly complex challenges, and their accuracy is approaching that of human observation. Studying and analysing the content of a video is an important research area in multimedia. Generating contextual descriptions of visual content is a challenging and essential task in computer vision. It generally includes feature extraction and the generation of descriptions based on the extracted feature vectors. Semantic concepts such as scenes, objects, actions, interactions between objects and the temporal ordering of events should be considered when designing an efficient architecture for the captioning problem. Moreover, the extracted visual information must be translated into grammatically correct natural language while preserving the semantic concepts. Content-based recommendation and retrieval, human-robot interaction, autonomous driving, video subtitling, procedure generation for instructional videos, video surveillance, software for visually impaired people and sign language understanding are among the real-world applications.

Convolutional neural networks (CNNs) provide sophisticated feature representations by performing a series of convolution operations over images or videos (i.e., series of frames). These convolutions compare the visual data of the frames against specific patterns (filters) that the network is looking for. As the network performs more convolutions it can identify specific objects, which it learns from large amounts of labelled data. A CNN, however, can only capture the spatial or visual features of an image; it cannot handle temporal features, that is, how a frame relates to the ones before it. Temporally sensitive models can be vector-to-sequence, sequence-to-vector or sequence-to-sequence. Hence the importance of temporally sensitive models such as recurrent neural networks (RNNs), long short-term memory (LSTM) networks, autoencoders and transformer networks, which take the output of the CNN and produce either a vector or a sequence, depending on the model. An attention mechanism helps decide which part of the input to focus on in order to yield a more accurate outcome. Video captioning is a kind of sequence-to-sequence modelling: it takes a series of frames as input and generates a textual description.

The need to capture interactions between objects, to identify fine-grained changes of the video content in the temporal dimension and to prioritize the activities captured in the video makes the captioning problem even more challenging.

The rest of this paper is organized as follows. Section II discusses related work on the different methodologies proposed for video captioning. The transformer network proposed for video captioning is described in Section III. The datasets used for training the models are presented in Section IV, and Section V explains the evaluation metrics.

II. RELATED WORKS

The existing methodologies can be categorized into two groups: template-based methods and sequence learning methods.

A. Template based method

Template-based methods [1] rely on a set of specific grammar rules. First, each sentence is divided into three
types of fragments, subject, verb and object, by following the grammar. Visual content detection assigns the detected words to object, action and attribute categories, and each fragment is then associated with the detected words. The generated fragments are finally composed into a sentence using a predefined language template. This method focuses mainly on the separate detection of predefined entities and events. Describing open-domain videos with this method turns out to be unrealistic or too expensive because of its computational complexity.

B. Sequence learning method

A deep RNN-based model [2] that translates videos to natural language is a naive approach. The frame features are extracted using a CNN and mean-pooled across the entire video to obtain a single video descriptor, which is then fed into an LSTM network to generate the textual description. This mean pooling, however, causes a loss of temporal information.

Another video-to-text work [3] discusses an end-to-end sequence-to-sequence video-to-text (S2VT) model for generating video captions. The method is similar to machine translation between natural languages. A two-stack LSTM first encodes the feature vectors generated by a CNN from RGB images or optical-flow images, and the decoder then generates sentences. Decoding starts only after all the features are encoded, so the decoder receives only the previous output and hidden states as input; this works well for short sequences but cannot memorize long-term dependencies.

An attention-based LSTM with semantic consistency is used in the work of [4] for video captioning. This framework integrates an attention mechanism with an LSTM to capture the salient structures of the video and also exploits the correlation between multimodal representations (i.e., text and visual data) to generate sentences with rich semantic content. The Inception-v3 CNN [5] is used to extract more meaningful spatial features; this architecture performs convolution along with pooling in a single CNN layer and stacks the resulting feature maps into a single output volume. The extracted features are fed into an attention-based long short-term memory encoder-decoder.

A dual-stream recurrent neural network architecture for video captioning [6] considers both a visual and a semantic stream. The architecture includes a visual descriptor and a semantic descriptor, which encode visual and semantic features respectively. The visual descriptor encodes the frame representations of the video, while the semantic descriptor encodes each video frame with a high-level representation of semantic concepts such as objects, actions and interactions. Because the visual and semantic descriptors are two different, asynchronous modalities, dual-stream RNNs are used to flexibly exploit the hidden states of each stream. Finally, the hidden-state representations of the visual and semantic descriptors are integrated, and a dual-stream decoder performs hidden-state fusion for sentence generation. An attentive multi-grained encoder module enhances local feature learning with global semantic features for each modality.

Another study, based on a reconstruction network for video captioning [7], builds on the dual learning approach [8], a learning framework that leverages the primal-dual structure of AI tasks to obtain effective feedback or regularization signals that enhance the learning or inference process. Regularization techniques are used to avoid overfitting, which happens when the model tries to capture the noise in the training data. The architecture consists of three modules: a CNN-based encoder that extracts the semantic representations of the video frames, an LSTM-based decoder that generates the natural language description of the visual content, and a reconstructor that exploits the backward flow from caption to visual content to reproduce the frame representations. The reconstructed representation provides a constraint that pushes the decoder to embed more information about the input video representations.

Fused GRU with semantic-temporal attention for video captioning [9] provides two types of attention, semantic and temporal. Temporal attention directs attention to specific instants of time while decoding the video representation into a textual description, whereas semantic attention provides the representations of semantically important objects exactly when they are needed. The modules of this architecture are a CNN-based video encoder for feature extraction, a semantic concept prediction network and a hierarchical semantic decoder network.

Video captioning via hierarchical reinforcement learning [10] incorporates reinforcement learning techniques into video captioning. A high-level Manager module learns to design sub-goals, and a low-level Worker module recognizes the primitive actions required to fulfill each sub-goal.

The transformer network is a major breakthrough towards more sophisticated sequence learning models. The transformer [11] is a deep learning model designed to solve sequence transduction problems such as machine translation and text summarization. It improves on the earlier architectures because it avoids recursion entirely, processing the sequence as a whole and learning the relationships between input elements with attention-based encoders and decoders. The attention mechanism gives the encoder access to the entire input and the decoder access to the target sequence generated so far, and a softmax induces the probability distribution over the output. Transformers not only provide attention but also parallelize the processing of sequences. All of these properties have greatly benefited studies of image and video captioning.

III. PROPOSED SYSTEM

The proposed system is based on the transformer network. A CNN module is designed to extract features from the video frames. A deep self-attention based encoder in the transformer converts the feature vectors from the CNN into a set of vectors that include context and order information. The decoder predicts the next word using the output of the encoder and the words predicted so far. Figure 1 shows the block diagram of the whole framework.
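As an illustrative sketch only (not part of the original paper), the per-frame feature extraction step could be implemented as follows, assuming a ResNet-50 backbone and a recent version of torchvision; the paper does not specify which CNN or framework is used.

```python
# Illustrative sketch (not the authors' code): per-frame feature extraction
# with a pretrained CNN. The ResNet-50 backbone is an assumption; the paper
# does not name the CNN used in the proposed system.
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the final classification layer to obtain 2048-d frame features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_frame_features(frames):
    """frames: list of PIL images sampled from the video.
    Returns a (num_frames, 2048) tensor of features x_1, ..., x_N
    that the transformer encoder operates on."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        feats = feature_extractor(batch)   # (N, 2048, 1, 1)
    return feats.flatten(1)                # (N, 2048)
```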
to get the refined code z_i for x_i. Then Z = z_1, z_2, ..., z_N, where

z_i = \sum_{l=1}^{N} r_{i \to l} \, v_l \qquad (2)

The above process provides only the context and is unaffected by the frame ordering.
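For illustration, Eq. (2) can be read as standard scaled dot-product self-attention over the frame features. The excerpt does not spell out how the weights r_{i→l} are obtained, so the query-key softmax below is an assumption based on the usual transformer formulation [11]; the projection matrices Wq, Wk, Wv and the dimensions are illustrative.

```python
# Minimal sketch of Eq. (2): z_i = sum_l r_{i->l} * v_l, with the weights
# r_{i->l} assumed to be softmax-normalized query-key scores (standard
# scaled dot-product attention); this is an assumption, not the paper's spec.
import torch
import torch.nn.functional as F

def self_attention_context(X, Wq, Wk, Wv):
    """X: (N, d) frame feature vectors x_1..x_N.
    Returns Z: (N, d_v) refined codes z_1..z_N."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / K.shape[-1] ** 0.5      # (N, N) similarity of frame i to frame l
    R = F.softmax(scores, dim=-1)              # r_{i->l}, each row sums to 1
    Z = R @ V                                  # z_i = sum_l r_{i->l} * v_l   (Eq. 2)
    return Z

# Example: N = 8 frames with 2048-d CNN features projected to d_model = 512.
X = torch.randn(8, 2048)
Wq, Wk, Wv = (torch.randn(2048, 512) for _ in range(3))
Z = self_attention_context(X, Wq, Wk, Wv)      # (8, 512)
```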
But the ordering does matter, so a positional embedding is used to include the ordering.

2) Positional Embedding (PE):
• Constitute a 'd'-dimensional positional embedding in which each embedding dimension carries a sinusoidal function.
• The frequency of the sine wave is directly proportional to the embedding dimension.
• Associate a number with each of the 'd' dimensions of the PE, given by the value of the sine wave at that dimension with respect to the position.
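The bullets above describe, informally, the sinusoidal positional embedding of the original transformer [11]. A minimal sketch of that standard construction is given below; the exact frequency schedule used by the authors is not stated, so the 10000-based schedule from [11] is assumed here.

```python
# Sketch of the standard sinusoidal positional embedding from [11].
# Assumes an even d_model; the frequency schedule is the one from [11],
# which is an assumption about the authors' choice.
import torch

def positional_embedding(num_positions, d_model):
    """Returns a (num_positions, d_model) matrix; row p is the PE of position p."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # (P, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)               # even dimension indices
    freq = torch.pow(10000.0, -dim / d_model)                            # one frequency per dimension pair
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # cosine on odd dimensions
    return pe

# The PE is simply added to the frame feature (or word) vectors before attention:
# encoder_input = frame_features + positional_embedding(num_frames, d_model)
```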
The simple block diagram of the deep sequence encoder is shown in Figure 3. The process performed by the encoder is:
• The positional embedding is added to the feature vectors to provide the order information.
• The attention network then takes into account the context of the video frames. Every feature vector corresponding to a frame plays the role of the keys K, the values V and the queries Q.
• A skip connection is provided so that the original feature vectors are not lost: they are added to the output of the attention network.
• This added and normalized output is then fed into a feed-forward neural network, which imposes regularization or structure on the network; the hyperbolic tangent (tanh) function restricts the output of the network to lie between -1 and 1.
• The above process is repeated K times to form the deep sequence encoder.

C. Deep Self and cross-attention based decoder

The input of the decoder is the sequence of words generated so far, marked as the output sequence in the decoder network of Figure 4. The left-most word is the most recently predicted word; every time a new word is predicted, the input sequence at the bottom shifts to the right by one position. A positional embedding is added to provide the order. In cross-attention (encoder-decoder attention), the keys K and the values V are generated from the output of the encoder, while the query Q is generated from the word vectors predicted so far. Multi-head attention allows the attention mechanism to attend to different aspects of the characteristics of the sequence. A feed-forward neural network, the same as in the encoder, follows. These operations are performed J times to form the deep self- and cross-attention based decoder.
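As a rough sketch of the encoder and decoder layers just described (self-attention with a skip connection and normalization, a tanh feed-forward network, and cross-attention in the decoder), one possible PyTorch reading is given below. The layer sizes and head counts are illustrative assumptions, and nn.MultiheadAttention stands in for whatever attention implementation the authors actually used.

```python
# Illustrative reading of one encoder layer and one decoder layer as described
# in the text; not the authors' code. Dimensions and head counts are assumed.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        # Feed-forward network with tanh, restricting outputs to (-1, 1) as stated.
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                      # x: (batch, num_frames, d_model)
        attn_out, _ = self.self_attn(x, x, x)  # frame vectors act as Q, K and V
        x = self.norm1(x + attn_out)           # skip connection + normalization
        return self.norm2(x + self.ffn(x))

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.Tanh(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, words, memory):          # words: predicted so far; memory: encoder output
        sa, _ = self.self_attn(words, words, words)
        words = self.norm1(words + sa)
        # Cross-attention: Q from the words predicted so far, K and V from the encoder.
        ca, _ = self.cross_attn(words, memory, memory)
        words = self.norm2(words + ca)
        return self.norm3(words + self.ffn(words))

# Stacking K EncoderLayer and J DecoderLayer instances gives the deep
# sequence encoder and the deep self- and cross-attention based decoder.
```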
IV. DATASETS