SSICT-2023 Paper 5
Language
Nhu-Vinh Hoang1, 2, 3 , Nghia-Viet Hoang1, 2, 3 , Nhu-Binh Nguyen Truc1, 2, 3 , Kim-Phat Tran1, 2, 3
1 Faculty of Information Technology, University of Science, VNU-HCM
2 Vietnam National University, Ho Chi Minh City, Vietnam
3 {hnvinh21, hnviet21, ntnbinh21, tkphat21}@apcs.fitus.edu.vn
Abstract—Automating sign language interpretation is being seriously considered as a way to improve the conversation quality of people with hearing impairments. The more widely sign language is available, the more accessible the community becomes to deaf people. In this paper, the authors attempt to develop a Sign Language Production model that translates discrete text sentences into continuous 3D skeletal sign pose sequences, using a back translation evaluation method that transforms video back to text for comparison. The model is trained on the PHOENIX14T dataset, which includes parallel sign videos and German translation sequences: 8057 videos of 9 different signers, 2887 German words and 1066 different sign glosses. Although the output skeleton videos are still of low quality, the results suggest considerable potential for future improvement, as the authors obtain a validation score of 17.252 DTW. Overall, the results imply a promising foundation for continued investigation and optimization of the approach, with the aim of improving conversation quality between the Deaf community and their hearing counterparts.

I. INTRODUCTION

Equity is a critical component of modern society, shaping how individuals and communities are treated and given equal opportunity to succeed. The concept of equity has been central to political and social discourse for decades, with many countries and organizations embracing it as a core principle. To ensure that no one with a disability is excluded, effective communication methods are urgently being developed and implemented for the over 1.5 billion people worldwide facing hearing difficulties [4]. Sign language interpretation is highlighted here because this means of communication is vital to the deaf and hard of hearing community.

However, becoming a sign language interpreter is a challenging and rigorous process that requires fluency in multiple languages, specialized training, and a deep understanding of cultural differences. This raises the question of whether an automated sign language interpreter could be developed to address these challenges and provide more accessible communication for the deaf and hard of hearing community.

Improving communication quality for this community is crucial to promoting inclusion and equality. One potential solution is to develop an automated sign language interpreter that could provide accessible communication to individuals with hearing disabilities. Additionally, providing sign language interpreters in a variety of settings, particularly news broadcasts and conferences, can significantly improve accessibility and inclusion for the deaf and hard of hearing community. Considering the business aspect, the model could also be developed into a bot that teaches sign language, which would mark a big step forward in disability-inclusive efforts.

In this paper, we try to develop a Sign Language Production model that translates from text to skeleton video. Our approach includes two different parts: one is text to sign pose, and the other is text to sign pose via gloss. In addition, we attempt to evaluate the performance with a back translation evaluation mechanism using Sign Language Translation.

We evaluate on the RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset, which includes parallel sign videos and German translation sequences, with 2887 German words and 1066 different sign glosses from a combined 835,356 frames at 25 fps and 210 x 260 resolution; for each video, the OpenPose model is used to extract 2D joint positions, which are then lifted to 3D. The Sign Language Production (SLP) model initially uses an encoder-decoder architecture to convert input text sequences to glosses. We then re-train the Progressive Transformers SLP model, which transforms the text sequences and gloss representations into skeletal sign pose sequences. To evaluate the performance, we use a back translation evaluation method with a state-of-the-art Sign Language Translation model, which translates the skeleton videos back into text sequences; the BLEU-n score metric is then used to evaluate the model by comparison with the input text.

Our experiment was run three different times. The first time, because the dataset had only 15 sets of text, gloss and skeleton, our model gave bad results. The second time, the dataset was larger, with 8057 videos; however, the model did not show any improvement. We found some problems through these two runs and fixed some parameters, after which the result improved significantly.

This paper includes five sections. In Section II, we provide a brief review of current methodologies and methods used to transform text into hand gestures as videos, as well as the model which the authors use to provide ground truth. In Section III, we present our method for transforming text into sign language video. The experimental setup, results, and comparisons are presented in Section IV. Lastly, Section V provides conclusions and suggestions for future work to improve the performance of the proposed technique.
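Before detailing each component, the overall loop can be summarised as follows. The sketch below is a high-level outline only: the four callables are hypothetical placeholders for the text-to-gloss model, the Progressive Transformer, the SLT model and the BLEU metric, not their actual interfaces.

    def evaluate_slp(text_sentences, text_to_gloss, slp_model, slt_model, bleu):
        # Back-translation evaluation: generate poses from text, translate
        # them back to text with an SLT model, and score against the input.
        hypotheses, references = [], []
        for text in text_sentences:
            gloss = text_to_gloss(text)     # T2G2P path; the T2P path skips this
            poses = slp_model(text, gloss)  # continuous 3D skeleton sequence
            hypotheses.append(slt_model(poses))  # back translation to text
            references.append(text)         # compare against the original input
        return bleu(references, hypotheses)      # corpus-level BLEU-n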
II. RELATED WORK

In this section, we present an existing method, Sign Language Production, which builds 3D skeletal sign pose sequences from text sequences, and another model which supports recognizing German sign language (weather broadcast vocabulary) and translating it back to text in order to evaluate model performance.

1) Progressive Transformers for End-to-End Sign Language Production (Ben Saunders, Necati Cihan Camgoz, Richard Bowden - ECCV 2020) [3]

• Sign Language Production is still a challenging problem, including the challenge of mapping from the lingual domain into the visual domain [2]. A previous approach by Stoll et al. focused on creating sign pose sequences from text via glosses []. By contrast, this work aims to produce sign pose sequences directly, without glosses as an intermediate step. Moreover, the number of frames depends dynamically on the length of the input in order to produce a correct result.

• The Symbolic Transformer is still used for data processing, converting from text to gloss representation. We use it for the Text to Gloss to Pose (T2G2P) model, which will be compared with our end-to-end Text to Pose (T2P) model.

2) Sign Language Translation

Sign Language Translation (SLT) is the difficult and complex task of converting sign language into text. Due to the differences in grammar between sign language and spoken language, as well as the large number of meaningless words in a sentence, this task can be challenging. One of the hardest parts of SLT is directly converting sign video sequences into spoken language sentences. The Sign Language Transformers for joint end-to-end sign language recognition and translation (Necati Cihan Camgoz et al. - CVPR'20) [1] are currently the state-of-the-art models in SLT. In this paper, these models are used in the back translation mechanism to evaluate the performance of our Sign Language Production model.
3) Data Processing

A. OpenPose

OpenPose is a popular, widely used open-source library that applies deep learning techniques to detect human body joints in images and videos. Many scientists use OpenPose as a foundation for their work in computer vision and human pose estimation. The most famous paper is that of Zhe Cao et al. [], who propose a method for real-time multi-person 2D pose estimation using OpenPose that achieves state-of-the-art results on several benchmark datasets. OpenPose is used not only in the field of human pose estimation but also in the sport sciences. In the paper "Automated assessment of lower-limb kinematics in running using OpenPose" by Emily Hansen et al., the authors use OpenPose to automatically assess the lower-limb kinematics of runners by tracking their joint angles and positions during a treadmill run.

Besides serving as a foundation for much research work, OpenPose also shows great potential as a tool for data processing. In the paper "Head Pose Estimation for Longitudinal Behavioral Analysis using OpenPose" by Kevin Huang et al., the authors use OpenPose to track the head positions of children with autism spectrum disorder during interactions; the data is gathered and analyzed to identify patterns over time. In the research work "Analyzing the Passing Strategies of Professional Soccer Teams using OpenPose" by Ali K. Thabet et al., the authors use OpenPose to extract the body poses of soccer players during matches; the data is then analyzed to identify passing strategies and patterns among the players.

B. Symbolic Transformer

The Symbolic Transformer is a relatively new data processing tool that has gained attention in the machine learning and natural language processing fields. Among other uses, it transforms text into glosses, a process that maps natural language descriptions to a structured vocabulary of concepts and labels.

In the paper "Generating Textual Glosses for Equations with Symbolic Transformer" by Nafiseh Shabib et al., the authors suggest a method that leverages the Symbolic Transformer's textual gloss generation for mathematical equations; the results show that the Symbolic Transformer outperforms other existing methods. In the paper "Symbolic Transformer Networks for Knowledge Base Completion" by Jianyuan Shi et al., the Symbolic Transformer is used to represent the concepts and relations in a knowledge base and achieves state-of-the-art results on several benchmark datasets.
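Because the text-to-gloss step is central to the T2G2P pipeline, a minimal encoder-decoder sketch is given below. It is a generic stand-in written in PyTorch, not the actual Symbolic Transformer implementation; the vocabulary sizes (taken from the PHOENIX14T statistics above) and all dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TextToGlossSketch(nn.Module):
        # Hypothetical stand-in for the text-to-gloss model: a standard
        # encoder-decoder transformer over token ids.
        def __init__(self, text_vocab=2887, gloss_vocab=1066, d_model=256):
            super().__init__()
            self.src_emb = nn.Embedding(text_vocab, d_model)
            self.tgt_emb = nn.Embedding(gloss_vocab, d_model)
            self.transformer = nn.Transformer(
                d_model=d_model, nhead=8,
                num_encoder_layers=2, num_decoder_layers=2,
                batch_first=True)
            self.out = nn.Linear(d_model, gloss_vocab)

        def forward(self, src_ids, tgt_ids):
            # Causal mask: each gloss position attends only to earlier ones.
            mask = self.transformer.generate_square_subsequent_mask(
                tgt_ids.size(1)).to(src_ids.device)
            hidden = self.transformer(self.src_emb(src_ids),
                                      self.tgt_emb(tgt_ids), tgt_mask=mask)
            return self.out(hidden)  # logits over the gloss vocabulary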
III. METHOD

A. Method overview:

Our method contains two main parts:

a) Data preparation: The model is trained on the PHOENIX14T dataset, which contains 8057 sign language videos.

Fig. 1. A frame from a video of the dataset
We use the OpenPose model to produce skeleton sequences for each sign language video: 2D joint positions are extracted and then lifted to 3D, whilst maintaining consistent bone lengths and correcting misplaced joints. We then apply skeleton normalisation and represent the 3D joints as x, y and z coordinates for the Sign Language Production model. The texts are also pre-processed into glosses using the Symbolic Transformer. In total, our dataset contains 8057 sets of skeleton video, text and glosses: 7096 sets are used for training, 519 for development and 642 for testing.
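To make the pre-processing concrete, the sketch below parses OpenPose's per-frame JSON output and applies a simple skeleton normalisation. The JSON layout follows OpenPose's documented output format, but the normalisation shown (centring on the neck and scaling by shoulder width) is an illustrative assumption, not necessarily the exact procedure used here.

    import json
    import numpy as np

    def load_openpose_frame(path):
        # OpenPose writes one JSON file per frame; keypoints are stored as a
        # flat [x0, y0, c0, x1, y1, c1, ...] list for each detected person.
        with open(path) as f:
            data = json.load(f)
        if not data["people"]:
            return None, None  # no signer detected in this frame
        kp = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
        return kp[:, :2], kp[:, 2]  # (x, y) coordinates and confidences

    def normalise_skeleton(joints):
        # Illustrative normalisation: centre on the neck joint and scale by
        # shoulder width so that signers of different sizes are comparable.
        # BODY_25 indices: 1 = neck, 2 = right shoulder, 5 = left shoulder.
        neck, r_sh, l_sh = joints[1], joints[2], joints[5]
        scale = np.linalg.norm(r_sh - l_sh) + 1e-8
        return (joints - neck) / scale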
Fig. 4. Progressive Transformers model
B. Progressive Transformer:

In this work, Progressive Transformers (Fig. 4) translate from the symbolic domains of gloss or text to continuous sign pose sequences that represent the motion of a signer producing a sentence of sign language. The model must produce skeleton pose outputs that both express an accurate translation of the given input sequence and form a realistic sign pose sequence. In detail, the text input can be described as X = (x_1, x_2, ..., x_T), where T is the number of words, and the output of the model is a sign pose sequence with U frames, Y = (y_1, y_2, ..., y_U). To ensure the continuity and smoothness of the output, the model additionally predicts a counter value alongside each frame, tracking progress through the sequence and determining when production should end [3].
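A minimal sketch of this progressive, counter-driven decoding loop is shown below. The model.step interface and all dimensions are assumptions made for illustration, following the counter scheme of [3], not the actual API of the released implementation.

    import torch

    def progressive_decode(model, encoded_text, num_joints=50, max_frames=300):
        # The decoder regresses continuous joint coordinates frame by frame,
        # starting from an all-zero frame, rather than sampling tokens from
        # a discrete vocabulary.
        frame = torch.zeros(1, num_joints * 3)  # x, y, z per joint
        poses = []
        for _ in range(max_frames):
            # Assumed interface: one decoding step returns the next skeleton
            # frame and a counter in [0, 1] tracking sequence progress.
            frame, counter = model.step(encoded_text, frame)
            poses.append(frame)
            if counter.item() >= 1.0:  # counter saturates at sequence end
                break
        return torch.cat(poses, dim=0)  # (U, num_joints * 3) pose sequence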
Fig. 5. Symbolic Transformers model

D. Evaluate performance method:

This method uses a back translation evaluation mechanism to evaluate performance. The 3D sign pose sequence (skeleton) output is transformed into spoken language (text) by a Sign Language Translation model (the authors use a state-of-the-art model for the best comparison). To measure the translation performance of this method, we utilize the BLEU score (with n-grams ranging from 1 to 4), which is the most common metric for machine translation.
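As an illustration, BLEU-1 through BLEU-4 over the back-translated sentences can be computed with NLTK as sketched below; the tokenised sentences passed in would be the original input texts and the SLT outputs.

    from nltk.translate.bleu_score import corpus_bleu

    def bleu_n_scores(references, hypotheses):
        # references: tokenised input text sentences (one per sample);
        # hypotheses: tokenised sentences produced by back-translating the
        # generated skeleton videos with the SLT model.
        refs = [[r] for r in references]  # NLTK allows several refs per sample
        scores = {}
        for n in range(1, 5):
            # Uniform weights over the first n n-gram orders give BLEU-n.
            weights = tuple(1.0 / n for _ in range(n)) + (0.0,) * (4 - n)
            scores["BLEU-%d" % n] = corpus_bleu(refs, hypotheses, weights=weights)
        return scores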
V. CONCLUSION

Conversion of text to sign language plays an important role in enhancing communication between the Deaf and hearing. Our experiments are evaluated on the PHOENIX14T dataset. The results still leave room for improvement, partly owing to our lack of experience in training models. In both tries, our validation and training batch losses were still decreasing; in the second try, training stopped at epoch 500, and if we increased this to 20000, the validation and training batch losses could decrease further.
Challenges in the early days included the rarity of "proper" references; unknown code bugs, unfamiliar definitions and the advanced knowledge required; and being new to the notebook-based Google Colab environment. Having no prior knowledge of machine learning or deep learning led us to invest a huge amount of time in the literature review. We learned to adjust I/O, trace back through the important code blocks, and import data and train the model. Finding the important portions of code was a problem until we made better use of search engines, and we adjusted the console logs to get proper output for further research.

The coming goals are to learn more about data processing, optimize the code, consider alternative sign language interpreting methods, reduce training time and use deepfake techniques to visualize the 3D output. Besides that, we plan to add a "speech to text" phase to complete the initial objective of speech to sign language, create a case study on self-learning sign language, and develop an application that supports the Deaf in communication by applying our method.
ACKNOWLEDGMENT
The authors would like to thank Mr Minh-Triet Tran.
REFERENCES
[1] Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, and Richard
Bowden. Sign language transformers: Joint end-to-end sign language
recognition and translation. 2020.
[2] Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, Vassilis Athitsos, and Mohammad Sabokrou. All you need in sign language production. 2022.
[3] Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. Progressive
Transformers for End-to-End Sign Language Production. 2020.
[4] WHO. Deafness and hearing loss.