Indian Sign Language Generation: A Multi-Modal Approach
Abstract—As social beings, communication is an important skill for us humans. But some people who have hearing issues face problems while communicating in their day-to-day lives. They use Sign Language as a way to express themselves with the help of hand gestures and facial expressions. In today's times, when we have the whole world on our palms, online content in multiple forms is a large source of entertainment and infotainment for us. The accessibility of online content is quite high, but for people with hearing loss it is still not accessible enough. In this research, we propose a method of generating Indian Sign Language gestures from text, subtitles of video content, audio, and images containing text as inputs. For the video content we have taken YouTube videos as a case study. After further processing, and with the help of SiGML files, we get the animated gestures as output. We also propose a method for recognizing ISL gestures through spatial and temporal feature extraction using Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), which gives the word for the ISL gesture being played. We used accuracy and a confusion matrix as performance metrics for this methodology and recorded an accuracy of 96% for ISL recognition.

Keywords—Indian Sign Language (ISL), Natural Language Processing (NLP), Deep Learning (DL), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN)

I. INTRODUCTION

India has a total population of 1.41 billion people, the second highest among the world's countries [1]. Out of this population, approximately 63 million people have hearing issues, meaning that 6.3% of the Indian population is deaf or hard of hearing [2]. For these many people, there are fewer than 300 Indian Sign Language (ISL) interpreters in India. Communication is a difficult task for people with hearing issues, and the low availability of interpreters makes it even harder; moreover, because of the presence of a third person, communication loses its privacy.

If a person with hearing issues has to communicate with someone, the latter has to understand and know how to communicate in sign language. Since that is not the case most of the time, it is important to have a medium of communication, which is what the work of generating sign language provides: text and/or audio input can easily be converted into Indian Sign Language, which people with hearing issues can understand.

In recent times, the use of technology has been at its peak, and online media has become a huge source of entertainment, communication and infotainment for people. But even though it is available, it is not yet accessible enough for people with hearing issues. Even when most of these videos have captions, it is difficult to comprehend the text and watch the video at the same time.

A. Sign Language

In a world with a population of 8 billion people, according to the World Deaf Federation, approximately 1.5 billion people are deaf. Sign Language is a common medium of communication among people with hearing loss. This language primarily uses visual cues such as hand gestures, facial expressions and body movements. It is an independent language used by deaf communities throughout the world, with some variations across regions, similar to spoken languages.

B. Sign Language Generation and Recognition

Sign Language Generation and Sign Language Recognition are two tasks which help in bridging the communication gap faced by people with hearing loss. Sign Language Generation involves taking an input and providing an output which shows sign language gestures, either in the form of real-life videos or of an animated character playing the gestures. Sign Language Recognition, on the other hand, takes a video with sign language gestures as input and provides the respective text for those gestures.

In this paper we propose a methodology that uses Sign Language Generation to produce Indian Sign Language gestures for multiple types of input: text, audio, video with subtitles (YouTube videos as a test case) and images containing text.

II. RELATED WORKS

This section presents a literature survey on the research being done on Indian Sign Language Generation and Recognition.

A. Indian Sign Language Generation

Most Indian Sign Language Generation systems use similar processing for the input. The input is either audio, which is converted to text, or direct text. It is then processed using natural language methods like parsing, stop-word removal, stemming and lemmatization, after which the final text is used for generating the output. There are two methods of generating ISL: the first simply retrieves ISL gesture videos and concatenates them, and the second uses an animation tool to generate the ISL gestures with the help of HamNoSys and SiGML files.
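The text-processing front end shared by these systems (tokenization, stop-word removal, and reduction of words to base forms) can be sketched in Python. The tiny stop-word list and lemma table below are illustrative stand-ins for a full NLP toolkit such as NLTK, not the exact rules of any surveyed system:

```python
# Minimal sketch of the text pre-processing done before ISL generation:
# tokenize, drop stop words, and reduce the remaining words to base
# forms. STOP_WORDS and LEMMAS are illustrative stand-ins for a real
# stop-word corpus and lemmatizer (e.g. from NLTK).

STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and"}

# Tiny hand-made lemma table standing in for a real lemmatizer.
LEMMAS = {"going": "go", "keys": "key", "opened": "open"}

def preprocess(sentence: str) -> list[str]:
    """Return the content-word gloss sequence for an input sentence."""
    tokens = sentence.lower().replace(".", "").replace(",", "").split()
    content = [t for t in tokens if t not in STOP_WORDS]
    return [LEMMAS.get(t, t) for t in content]

print(preprocess("The keys are going to the office."))
# A run of this sketch prints: ['key', 'go', 'office']
```

The resulting gloss sequence is what gets mapped to gesture videos or SiGML notation in the systems surveyed below.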
The method used in [3] takes text from video subtitles along with audio and converts it into ISL sentences; the final text is then mapped against the dataset and an animated output video is shown. The authors in [4] use audio from a video as their input, which is converted to text; this text is then processed and, after applying ISL rules, converted to ISL sentences, and SiGML files of the text keywords are used to generate the corresponding animated ISL gestures. In [5], the authors take audio as input, convert it to text and then to an ISL sentence with the help of NLP; these ISL sentences are then converted to an animation video. [6] takes text input and applies ISL grammar rules to get ISL sentences, which are then converted to animation using HamNoSys and SiGML files. Similarly, [7] converts text input to HamNoSys notations, which are further converted to SiGML files and used to play the gestures in animation. [8] proposes a corpus-based method for translating English text to Indian Sign Language: the system accepts input text and translates it in sequence by displaying the sign for each word using an avatar. In [9], the input can be given in two ways, audio converted to text or text directly; this input is further processed, the keywords are mapped to ISL videos in the database, and the concatenated ISL videos are shown as output. Similarly, in [10] and [11] audio is taken as input and converted to text; then, if available, a video of that sentence is taken from the dataset, otherwise the sentence is broken into words and the videos for the individual words are merged. The authors of [12] and [13] use a similar methodology but apply ISL grammar to convert the text into ISL sentences, then concatenate the ISL videos for the individual words into a complete interpretation.

B. Indian Sign Language Recognition

Indian Sign Language Recognition refers to recognizing ISL sign gestures given in the form of video or images. Firstly, we can recognize static gestures, such as alphabets and other characters that don't require any movement, with methods like image classification and object detection. Secondly, for the dynamic gestures that require movement of the hands, the input is a short video, so recognizing them requires video-analytics methods. In this paper, we focus on recognizing the dynamic gestures of ISL.

In [14], the authors take video as input and divide it into frames. Features are extracted from these frames using a CNN, and an LSTM-RNN is trained on the feature sequences to classify or predict the ISL gesture. In [15], similar pre-processing steps of dividing the video into frames are used, but clustering with the Fuzzy C-Means algorithm classifies and predicts the gestures. In [16], with similar processing steps, signs are classified with encoded features, which are then used in a Hidden Markov Model (HMM) to recognize the sign gestures. The authors of [17] extract frames from the videos, then extract spatial and temporal features, and obtain a gesture-recognition model using a CNN and an RNN. Similarly, in [18] RGB image frames are extracted from the videos, local spatiotemporal features are extracted from these images using a 3D-CNN model, and gesture labels are recognized using feature fusion.

III. DATASET

A. Dataset for Generation

We found a public dataset in a GitHub repository which consists of SiGML files with notations of ISL gestures. These files are sent to an animation generation tool, which then produces the gesture using an animated character. There were a total of 849 SiGML files; some of the files and words were not correctly formed, so we removed them, leaving 793 files which represent 793 ISL gestures.

B. Dataset for Recognition

ISL Recognition is the task of recognizing ISL gestures from images or video. In this research, we focus on dynamic gesture recognition, for which we require an extensive dataset of videos showing ISL gestures.

There are two possibilities for such a dataset. The first is to use videos of people performing ISL gestures; the 'INCLUDE' dataset [21] is one such dataset for ISL, containing a total of 4287 videos of 263 English words. The other is to create our own dataset using animated videos of ISL gestures being displayed; for this purpose we can use the SiGML files data, which covers 849 words.

Comparing the words in both datasets, a total of 84 words were common to them. After looking further into these 84 words, only 25 had the same or similar ISL gestures.

Fig. 1. Gesture for word 'Hello'

In Figure 1, we can see that there are different signs or gestures for the same word in the animated (SiGML) data and the 'INCLUDE' dataset. The reason for this might be that ISL is still in a developing phase, along with regional differences. Hence, to keep uniformity and simplicity, the research focuses on these 25 words.

For the purpose of ISL recognition, we use gesture videos as input, so these input videos should be uniform and clear. Of the two datasets, INCLUDE and the animated videos, the animated videos are much more uniform and clean. Hence, we propose the use of animated ISL gesture videos for ISL Recognition, as they provide uniformity and consistency. We work with animated videos for the 25 words that we found were common to both datasets.
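The vocabulary comparison between the two datasets (Section III) reduces, in effect, to set operations followed by a manual gesture-similarity check. A minimal sketch, with small made-up word lists standing in for the real 849-word SiGML and 263-word INCLUDE vocabularies:

```python
# Sketch of the vocabulary comparison used to pick the common word set.
# The word lists below are made-up stand-ins for the real vocabularies
# (849 SiGML words, 263 INCLUDE words in the paper).

sigml_words = {"hello", "key", "office", "teacher", "boat", "slowly"}
include_words = {"hello", "key", "office", "teacher", "girl", "pen"}

# Words present in both datasets (84 in the paper's data).
common = sigml_words & include_words

# Each common word was then checked manually and kept only if both
# datasets showed the same or a similar gesture (25 kept in the paper).
manually_verified = {"key", "office", "teacher"}  # illustrative subset
final_vocab = sorted(common & manually_verified)

print(final_vocab)
```

The manual-verification step cannot be automated here; it corresponds to visually comparing the animated and human-performed gestures for each common word.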
IV. PROPOSED METHODOLOGY

A. Proposed Methodology for Indian Sign Language Generation

After the input has been processed into final text, we map the text to the SiGML files in the data and obtain the SiGML files for the specific words in the text; if a word has no SiGML file, we use the SiGML files for each character in the word. These SiGML files are sent to the animation player, which plays the hand gestures for the text one by one and generates an animated video for the same.

B. Proposed Methodology for Indian Sign Language Recognition

The methodology for ISL Recognition, taking video as input, includes the following steps. The video data is processed by creating frames for each of the videos. These frames are sent to a CNN to extract spatial features, and the predictions from this CNN model are further sent to an RNN. The RNN then extracts temporal features and, along with the predictions from the CNN, predicts the ISL gesture with the help of the sequential frames. Figure 3 shows the architecture of the ISL Recognition system used.

a) Input

The input for ISL Recognition is a video with an ISL gesture being shown. For the purpose of uniformity, we have used the videos generated with the help of animation as input. Figure 4 shows a still from the gesture for the word 'Key' being played by the animated character. More such videos have been recorded and used for ISL recognition.

Fig. 4. ISL Gesture for the word 'Key'

We recorded such videos containing the ISL gestures for twenty-five words, which are as follows: afternoon, black, boat, expensive, girl, he, key, mother, night, office, orange, pen, pink, quiet, sad, science, slow, teacher, tight, time, today, university, woman, yesterday and you. These videos are then converted to image frames, which are saved in the sequence in which they appear in the video. These image frames, in the given sequence, become the final input to the CNN model.

b) Architecture of Recognition Model

For ISL Recognition, we use a CNN to extract spatial features from the input frames of the given videos. Based on these spatial features, the CNN model is trained on each of the image frames from the input videos. After training, the predictions from this CNN model are saved and used as input to the RNN model. The RNN model then extracts temporal features from the image-frame sequences and, along with the predictions from the CNN, gives a gesture label as output for the complete sequence of image frames.

c) Training and Testing

The CNN model extracts spatial features from the sequential frames and gives predictions for the individual frames. These predictions are given as input to the RNN along with the frames; the RNN then extracts and learns the temporal features during training. After the RNN model is trained, image frames are extracted in sequential form from the test videos and sent to the RNN model, which predicts the gesture being shown in the test videos.

V. EXPERIMENT AND ANALYSIS

A. Indian Sign Language Generation

ISL Generation takes multiple forms of input: text, video, audio and image. Figure 5 shows the first page of the user interface, which lets the user select the mode of input from the available four.

Fig. 5. First Page for selecting input type

a) Text to ISL

After selecting the text form of input, the user is provided with a textbox which takes the text input for conversion to ISL in animated form.

b) Video to ISL

For the video form of input, the user is provided with a textbox for the YouTube video id which is to be interpreted in ISL. Figure 6 shows the given YouTube video alongside its ISL interpretation.

c) Audio to ISL

For audio-to-ISL conversion, the user can record audio by clicking on the record button and stop recording with the stop button. The submit button then submits the recorded audio, which is further processed and converted to an ISL interpretation.

d) Image to ISL

For the image form of input, the user selects an image file from the system and submits it for ISL generation.

Figure 7 shows the output for the inputs of type text, audio or image: the avatar generates the signs of the input one by one.

Fig. 8. Confusion Matrix on the test data
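Returning to the generation pipeline of Section IV, the per-word SiGML lookup with a character-by-character fallback can be sketched as follows. The tiny vocabulary and the "<name>.sigml" file-naming scheme are assumptions for illustration, not the actual layout of the SiGML repository:

```python
# Sketch of the SiGML lookup done before animation: each word of the
# processed text maps to its SiGML file, and a word with no file falls
# back to one SiGML file per character (fingerspelling).
# KNOWN_SIGNS and the "<name>.sigml" scheme are illustrative
# assumptions, not the actual repository layout.

KNOWN_SIGNS = {"key", "office", "teacher", "a", "b", "c", "x", "y", "z"}

def sigml_files(words: list[str]) -> list[str]:
    """Return the SiGML file names to send to the animation player."""
    files = []
    for word in words:
        if word in KNOWN_SIGNS:
            files.append(word + ".sigml")        # whole-word gesture
        else:
            # Fallback: spell the word character by character.
            files.extend(ch + ".sigml" for ch in word)
    return files

print(sigml_files(["key", "cab"]))
# "key" has a whole-word sign; "cab" is fingerspelled as c, a, b.
```

The resulting file list is played in order by the animation tool, giving one continuous gesture video for the input text.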
VI. CHALLENGES

While performing the research, we faced various challenges which can be further investigated and solved to improve research on ISL generation and ISL recognition.

- For sign language generation, limited corpus and data availability poses an issue, as there may not be enough words whose sign language interpretations are available.
- Because of the above limitation, it is difficult to obtain sign language representations of complex words and sentences.
- Similar to ISL generation, ISL recognition for dynamic gestures also has a limited corpus.
- There were only 25 signs common to both datasets; hence it was a challenge to find a common set for both ISL generation and recognition.
- For this project, we have used the generated animation videos for the selected 25 words as the data for ISL recognition.

VII. CONCLUSION

Communication is an important skill for humans as social beings, and people with hearing issues face problems while communicating. They also face issues in understanding content in various modes, such as text, video, audio and images. This research presents a method of generating ISL for text, YouTube videos, audio and images. The proposed methodology uses the Stanford parser for parsing the input text, after which ISL rules are applied on the parse tree. Then natural language processing methods like stop-word removal, tokenization and lemmatization are applied, and we get the final ISL sentence in text form. By mapping the text to the SiGML files available in the data, we generate an animation for the input text. For ISL recognition, this research puts forward a method that takes video as input and converts it to sequential frames. A CNN model is trained to extract spatial features, and its predictions are sent to an RNN model, which considers the predictions along with temporal features. The RNN model then provides the final prediction for a given video.

As future scope for this research, one direction is to increase the dataset of SiGML files by creating interpretations of more words in ISL with the help of professional ISL interpreters. It is important to keep these interpretations uniform throughout to make them easy to understand. To further improve ISL generation, we can add more robust ISL rules and handle complex sentences. For sign language recognition, one major task for further improvement would be to standardize the data for both recognition and generation.

REFERENCES

[1] https://statisticstimes.com/demographics/country/india-population.php
[2] https://nhm.gov.in/index1.php?lang=1&level=2&sublinkid=1051&lid=606
[3] Mehta, N., Pai, S. and Singh, S. "Automated 3D sign language caption generation for video." Universal Access in the Information Society 19, 725-738 (2020). https://doi.org/10.1007/s10209-019-00668-9
[4] D. S. Jayalakshmi, H. Salpekar, R. H. K. Kiran, R. Rahul and Shobha, "Augmenting Kannada Educational Video with Indian Sign Language Captions Using Synthetic Animation," 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4), 2020, pp. 324-329. doi: 10.1109/WorldS450073.2020.9210385
[5] Das Chakladar, D.; Kumar, P.; Mandal, S.; Roy, P.P.; Iwamura, M.; Kim, B.-G. "3D Avatar Approach for Continuous Sign Movement Using Speech/Text." Applied Sciences 2021, 11, 3439. https://doi.org/10.3390/app11083439
[6] Kumar, Parteek, and Sanmeet Kaur. "Sign language generation system based on Indian sign language grammar." ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) 19, no. 4 (2020): 1-26.
[7] Kaur, Sandeep, and Maninder Singh. "Indian Sign Language animation generation system." In 2015 1st International Conference on Next Generation Computing Technologies (NGCT), pp. 909-914. IEEE, 2015.
[8] Ali, Syed Faraz; Mishra, Gouri Sankar; and Sahoo, Ashok Kumar (2014). "Domain Bounded English to Indian Sign Language Translation Model." International Journal of Computer Science and Informatics: Vol. 4, Iss. 1, Article 6. doi: 10.47893/IJCSI.2014.1169
[9] Sharma, Purushottam, Devesh Tulsian, Chaman Verma, Pratibha Sharma, and Nancy Nancy. "Translating Speech to Indian Sign Language Using Natural Language Processing." Future Internet 14, no. 9 (2022): 253.
[10] Kulkarni, Alisha, Archith Vinod Kariyal, V. Dhanush, and Paras Nath Singh. "Speech to Indian Sign Language Translator." In 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021), pp. 278-285. Atlantis Press, 2021.
[11] Katariya, Ashmi, Vaibhav Rumale, Aishwarya Gholap, Anuprita Dhamale, and Ankita Gupta. "Voice to Indian Sign Language Conversion for Hearing Impaired People." SAMRIDDHI: A Journal of Physical Sciences, Engineering and Technology 12, no. SUP 2 (2020): 31-35.
[12] Aditya, C. R., C. Shraddha, and Ramakrishna Hegde. "English Text to Indian Sign Language Translation System." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 11, no. 3 (2020): 1418-1423.
[13] Dasgupta, Tirthankar, and Anupam Basu. "Prototype machine translation system from text-to-Indian sign language." In Proceedings of the 13th International Conference on Intelligent User Interfaces, pp. 313-316. 2008.
[14] Gomathi, V. "Indian Sign Language Recognition through Hybrid ConvNet-LSTM Networks." EMITTER International Journal of Engineering Technology 9, no. 1 (2021): 182-203.
[15] Mariappan, H. Muthu, and V. Gomathi. "Real-time recognition of Indian sign language." In 2019 International Conference on Computational Intelligence in Data Science (ICCIDS), pp. 1-6. IEEE, 2019.
[16] Shenoy, Kartik, Tejas Dastane, Varun Rao, and Devendra Vyavaharkar. "Real-time Indian sign language (ISL) recognition." In 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1-9. IEEE, 2018.
[17] Masood, Sarfaraz, Adhyan Srivastava, Harish Chandra Thuwal, and Musheer Ahmad. "Real-time sign language gesture (word) recognition from video sequences using CNN and RNN." In Intelligent Engineering Informatics, pp. 623-632. Springer, Singapore, 2018.
[18] Al-Hammadi, Muneer H., Ghulam Muhammad, Wadood Abdul, Mansour Alsulaiman, Mohamed Abdelkader Bencherif, and Mohamed Amine Mekhtiche. "Hand Gesture Recognition for Sign Language Using 3DCNN." IEEE Access 8 (2020): 79491-79509.
[19] Kaur, Sandeep, and Maninder Singh. "Indian Sign Language animation generation system." In 2015 1st International Conference on Next Generation Computing Technologies (NGCT), pp. 909-914. IEEE, 2015.
[20] https://vh.cmp.uea.ac.uk/index.php/JASigning
[21] Advaith Sridhar, Rohith Gandhi Ganesan, Pratyush Kumar, and Mitesh Khapra. 2020. "INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition." In Proceedings of the 28th ACM International Conference on Multimedia (MM '20). Association for Computing Machinery, New York, NY, USA, 1366-1375. https://doi.org/10.1145/3394171.3413528
[22] https://pypi.org/project/SpeechRecognition/
[23] https://pypi.org/project/pytesseract/
[24] https://niepid.nic.in/nep_2020.pdf
[25] https://social.desa.un.org/issues/disability/sustainable-development-goals-sdgs-and-disability/