Abstract Communication is an important part of our lives. Deaf and dumb people, being unable to speak and listen, experience many problems while communicating with normal people. There are several ways by which people with these disabilities try to communicate; one of the most prominent is the use of sign language, i.e. hand gestures. It is therefore necessary to develop an application for recognizing the gestures and actions of sign language so that deaf and dumb people can communicate easily even with those who do not understand sign language. The objective of this work is to take an elementary step towards breaking the communication barrier between normal people and deaf and dumb people with the help of sign language. The image dataset in this work consists of 2524 ASL gestures, which were used as input for the pre-trained VGG16 model. VGG16 is a vision model developed by the Visual Geometry Group at the University of Oxford. The accuracy of the model obtained using the Convolutional Neural Network was about 96%.
1 Introduction
Unlike communication through speech, which uses sound to express one's thoughts, a sign language uses facial expressions, movement of the lips, hand movements and gestures, and the alignment and positioning of the hands and body. Like spoken dialects, sign-based languages vary from one area to another, for example ISL (Indian Sign Language), BSL (British Sign Language) and ASL (American Sign Language). Being a vision-based language, it can be categorized into three components: fingerspelling, in which the fingers are used to spell each letter of a word; sign vocabulary for words, which is used for most of the communication; and the use of mouth and lip movements, facial expressions, and hand and body positions. American Sign Language is prominently used by the deaf population of the United States of America along with some parts of Canada. It is an advanced and completely standardized language that uses both the shape of the hand gesture and its position and movement in three-dimensional space. It is the primary mode of communication between people who are deaf and their relatives.
Fundamentally, two methodologies are used for recognizing hand gestures: one based on sight, i.e. vision, and another based on sensory data measured using gloves. The primary objective of this work is to create a vision-based system to identify finger-spelled letters of ASL. The vision-based approach was chosen primarily because it offers a cleaner and more natural means of interaction and communication between a human and a computer. In this work, 36 different categories have been considered: 26 categories for the English alphabets (a–z) and 10 categories for the numerals (0–9) (Fig. 1).
The rest of the paper is organized into five sections. Section 2 focuses on related previous research work and its contributions, Sect. 3 describes the image augmentation and resizing of images, Sect. 4 deals with the CNN, Sect. 5 presents the results, and Sect. 6 concludes the paper.
2 Literature Review
Trigueiros et al. [1] discussed a design in which a Kinect camera was used to obtain features corresponding to hand gestures of Portuguese Sign Language. They used a multiclass SVM to perform the training and testing of the gestures. Their approach required special hardware (the Kinect camera), which may not be available to everyone.
Tavari and Deorankar [2] in their work implemented an algorithm to extract HOG features. These features were then used to train an artificial neural network, which was later used for recognizing hand gestures and actions.
Hasan and Abdul-Kareem [3] proposed a technique for hand gesture recognition based on shape analysis. They used a neural network based approach to classify among six static hand actions used to perform various tasks such as maximizing and minimizing the current window and opening and closing objects. They used a backpropagation-based algorithm on a specially designed neural network consisting of several layers and were able to achieve an accuracy of 86.38%.
Pigou et al. [4] in their paper proposed a system to perform recognition of gestures of Italian Sign Language. Their system combines the above approaches and uses a Microsoft Kinect alongside GPU (graphics processing unit) accelerated convolutional neural networks. Their system gave high accuracy for 20 actions, achieving a cross-validation accuracy of nearly 92%.
Gupta et al. [5] used HOG and SIFT to extract features from each image. These features were then combined into a single matrix. They calculated the correlation for the resulting matrices and used the output as input for a KNN-based classifier. Out of 200 gestures, 179 were identified correctly.
Nagarajan and Subashini [6] in their approach performed feature extraction using an Edge-Oriented Histogram, in which every input image of a hand gesture is represented by the counts of its edge histogram. They obtained an overall accuracy of 93.75% using a multiclass SVM as the classifier.
3 Dataset and Image Preprocessing
The image dataset consists of ASL gestures from [7]. The dataset consists of 2524 images, with about 70 images per category, where each category represents a different character of ASL. This dataset was then augmented to create a dataset of 14,781 images. Out of this dataset, 75%, i.e. 11,085 images, were used for training and the remaining 25%, i.e. 3696 images, were used for testing (Figs. 2 and 3).
The images in the dataset [7] were of different dimensions. Therefore, each image was first resized to a common resolution of 224 by 224 pixels. Once every image corresponding to each gesture had been resized to this common standard resolution, the images could be used as input for training the convolutional neural network.
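A minimal sketch of this resizing step, assuming the raw images are stored in per-class folders on disk (the directory names and the use of OpenCV are illustrative assumptions, not part of the original work):

```python
import glob
import os

import cv2

SRC_DIR = "asl_raw"      # hypothetical folder of original, variously sized images
DST_DIR = "asl_resized"  # hypothetical output folder for 224x224 images

for path in glob.glob(os.path.join(SRC_DIR, "*", "*.png")):
    img = cv2.imread(path)               # read one gesture image
    img = cv2.resize(img, (224, 224))    # resize to the common 224x224 resolution
    out_path = os.path.join(DST_DIR, os.path.relpath(path, SRC_DIR))
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    cv2.imwrite(out_path, img)           # save the resized image, keeping the class folder
```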
To increase the size of the training dataset, new images were synthesized from the existing images, thereby increasing the number of images. This not only added more images to the dataset but also helped in dealing with the common obstacle of overfitting. One point to keep in mind while creating new images is to ensure that the original group or class to which an image belongs is preserved. To do this, random transformations such as scaling, translation in multiple directions, zoom and shear were applied to produce the new images. These random values must be limited by properly chosen upper and lower bounds to ensure that the generated images belong to the same class as the original image.
Image augmentation also makes the classifier more robust, so that poorly captured images are classified with greater accuracy.
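A minimal sketch of such bounded random augmentation using a Keras ImageDataGenerator (the specific transformation ranges below are illustrative assumptions; the paper does not report its exact bounds):

```python
from keras.preprocessing.image import ImageDataGenerator

# Bounded random transformations: kept small so the gesture class is preserved.
datagen = ImageDataGenerator(
    rotation_range=10,       # assumed bound: rotate by at most +/- 10 degrees
    width_shift_range=0.1,   # assumed bound: translate horizontally by at most 10%
    height_shift_range=0.1,  # assumed bound: translate vertically by at most 10%
    zoom_range=0.1,          # assumed bound: zoom in/out by at most 10%
    shear_range=0.1,         # assumed bound: small shear
)

# Generate augmented batches from the resized 224x224 training images.
augmented_batches = datagen.flow_from_directory(
    "asl_resized",           # hypothetical directory from the resizing step
    target_size=(224, 224),
    batch_size=128,
)
```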
4 Convolutional Neural Network
The Convolutional Neural Network model was proposed by LeCun et al. [8] and has proved to be a significant milestone in the area of image detection and classification. Deep CNNs progressively reduce the spatial dimensions of the input images as they pass through the several hidden layers of the network. This depth allows the CNN to obtain compact image features in a low-dimensional space. Our training model was inspired by the VGG16 [9] model.
From each pixel, the mean value of the corresponding RGB channel is subtracted. In the first pass, the model computes the mean pixel value of each channel over the entire set of pixels in that channel, and in the second pass it modifies the images by subtracting this mean from each pixel value. The mean is subtracted to ensure that the data is centered. This is done because the model multiplies weights with, and adds biases to, the initial inputs to produce activations, which are then backpropagated [10] with the gradients to train the model.
To keep the gradients from getting out of control, every feature must have a similar range. Moreover, since parameters in a CNN are shared, if the images are not scaled to a similar range, parameter sharing cannot be done easily, because different parts of the image would end up with unbalanced weight values.
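A minimal sketch of this two-pass, per-channel mean subtraction (the array layout, file name and variable names are assumptions for illustration):

```python
import numpy as np

# images: assumed array of shape (num_images, 224, 224, 3) holding the RGB training set.
images = np.load("train_images.npy").astype(np.float32)  # hypothetical file

# First pass: mean value of each channel over every pixel of every image.
channel_means = images.mean(axis=(0, 1, 2))  # shape (3,): means of R, G and B

# Second pass: subtract the per-channel mean from each pixel so the data is centered.
centered = images - channel_means
```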
The VGG16 [9] model shown in Fig. 1 is a deep convolutional neural network model proposed by Simonyan and Zisserman in their work [9]. The model achieved a top-5 test accuracy of 92.7% on ImageNet.
Figure 4 depicts its macro architecture. The convolutional neural network takes as input RGB images of a predefined size of 224 by 224, which is why we resized every image to 224 by 224. Each image is passed through multiple convolutional layers. A 3 × 3 receptive field is used at every layer, with a convolution stride of 1 pixel. The spatial padding of each convolutional layer is selected such that the resolution is preserved after convolution. Some of the convolutional layers are followed by max pooling layers, which perform spatial pooling.
The convolutional layers are followed by three FC (fully connected) layers. The first two FC layers have 4096 channels each, and the third FC layer performs the 1000-way ILSVRC classification. The final layer is a soft-max layer. All hidden layers use the rectification (ReLU) nonlinearity [11].
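For reference, a short sketch that instantiates this macro architecture from the Keras applications module so its layout can be inspected (an illustrative reconstruction, not the authors' code):

```python
from keras.applications import VGG16

# Instantiate the VGG16 architecture (randomly initialized here) and print its layout:
# five blocks of 3x3, stride-1 convolutions with 'same' padding, each block followed by
# 2x2 max pooling, then two 4096-channel FC layers and a 1000-way soft-max layer.
model = VGG16(weights=None, include_top=True, input_shape=(224, 224, 3), classes=1000)
model.summary()
```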
Pre-trained weights obtained from the ImageNet database were used to initialize the model. Since the model was originally designed to categorize images into 1000 groups, its soft-max layer consisted of 1000 channels. For our purpose, however, only 36 categories were needed (26 categories for alphabets and 10 for numerals). Therefore, we deleted the last layer of the model and inserted a layer able to categorize 36 different types of images. The rest of the model remained untouched.
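A minimal sketch of this modification, assuming the Keras pre-trained VGG16 weights; the exact mechanism the authors used to swap the last layer is not described, so the functional-API rewiring below is an assumption:

```python
from keras.applications import VGG16
from keras.layers import Dense
from keras.models import Model

# Load VGG16 initialized with ImageNet pre-trained weights, including the FC head.
base = VGG16(weights="imagenet", include_top=True, input_shape=(224, 224, 3))

# Drop the original 1000-way soft-max layer and attach a 36-way one
# (26 alphabets + 10 numerals); all other layers are kept unchanged.
penultimate = base.layers[-2].output                     # output of the second FC layer
predictions = Dense(36, activation="softmax", name="asl_softmax")(penultimate)
model = Model(inputs=base.input, outputs=predictions)
```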
Stochastic gradient descent [13] with momentum was used to train the model, with a batch size of 128 images and a momentum of 0.9. The learning rate and decay rate were initialized to 0.001 and 10^-6, respectively.
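A sketch of this training configuration; the loss function and the training arrays are not fully specified in the text, so categorical cross-entropy and the variable names below are assumptions:

```python
from keras.optimizers import SGD

# SGD with momentum (older Keras API; newer versions use learning_rate instead of lr).
opt = SGD(lr=0.001, momentum=0.9, decay=1e-6)
model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])

# x_train, y_train, x_val, y_val are assumed to hold the mean-centered 224x224 images
# and their one-hot labels from the earlier preprocessing steps (hypothetical names).
model.fit(x_train, y_train,
          batch_size=128, epochs=4,
          validation_data=(x_val, y_val))
```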
It was observed that the model required very few epochs to converge despite being a very deep model. This is because it was initialized using a pre-trained model. The ImageNet database, which contains around 1.2 million images, was used to train the pre-trained model. The greater the number of images, the more varied the initial features that are discovered, and the greater the probability that the features of an input image will be matched.
Table 1 depicts the validation loss and accuracy obtained during each epoch of training the model. During the first four epochs the loss decreased and the accuracy increased at each epoch. After the fourth epoch, the loss started increasing and the model started overfitting.
5 Result
The samples were tested on the VGG16 model. The testing accuracies are tabulated
below for 4 epochs.
Overall, out of the 3696 images used for testing, 3531 were classified into the correct categories and the remaining 165 were misclassified, resulting in an average accuracy of 95.54%.
Table 2 shows the number of samples that were correctly classified and misclassified for each symbol, along with the corresponding accuracy. Figure 5 graphically depicts the accuracy obtained for each symbol. These alone, however, do not provide a complete metric for analyzing the work. The table shows that while most of the symbols are classified correctly with high accuracy, the numeral zero and the letter 'W' are misclassified as the letter 'O' and the numeral six, respectively, in a significant number of cases (Fig. 6).
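The per-symbol figures in Table 2 can be reproduced from the raw predictions with a confusion matrix; a minimal sketch, assuming the test labels and model predictions are available as integer class arrays (the variable names are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# y_true, y_pred: assumed integer class indices (0-35) for the 3696 test images.
cm = confusion_matrix(y_true, y_pred)

# Per-class accuracy: correct predictions on the diagonal divided by per-class totals.
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)

# Off-diagonal entries reveal systematic confusions such as 'W' -> six and zero -> 'O'.
```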
Table 2 (continued)
Characters  Total samples  Correct predictions  Incorrect predictions  Percentage accuracy
O 111 107 4 96.40
P 100 100 0 100.00
Q 104 104 0 100.00
R 102 98 4 96.08
S 100 100 0 100.00
T 84 83 1 98.81
U 99 98 1 98.99
V 110 103 7 93.64
W 97 66 31 68.04
X 93 90 3 96.77
Y 114 114 0 100.00
Z 112 109 3 97.32
Fig. 6 a–d Predictions of the CNN overlaid on the test images [7]
6 Conclusion
This work dealt with the application of a Convolutional Neural Network for recognizing hand gestures. One of the vital applications of hand gesture recognition is identifying sign language. Sign language is one of the methods of communication for physically impaired, deaf and dumb people. This tool will help to bridge the gap between normal people and deaf and dumb people.
From the above results, we can conclude that a Convolutional Neural Network provides significant accuracy in identifying sign language characters. This work can be further extended to build a real-time application that identifies sign language, and to recognize words and sentences instead of just characters.
References
1. Trigueiros, P., Ribeiro, F. and Reis, L.P.: “Vision-based Portuguese sign language recognition
system”. In New Perspectives in Information Systems and Technologies, 2014 Volume 1
(pp. 605–617). Springer International Publishing.
2. Tavari, Neha V., and A.V. Deorankar.: “Indian Sign Language Recognition based on
Histograms of Oriented Gradient.” International Journal of Computer Science Information
Technologies 5 (2014).
3. Hasan, Haitham, and S. Abdul-Kareem.: “Static hand gesture recognition using neural
networks.” Artificial Intelligence Review 41, no. 2 (2014): 147–181.
4. Pigou, Lionel, Sander Dieleman, Pieter-Jan Kindermans, and Benjamin Schrauwen.: “Sign
language recognition using convolutional neural networks”, In Workshop at the European
Conference on Computer Vision 2014, pp. 572–578. Springer International Publishing.
5. Gupta, Bhumika, Pushkar Shukla, and Ankush Mittal.: “K-nearest correlated neighbor
classification for Indian sign language gesture recognition using feature fusion.” In 2016
International Conference on Computer Communication and Informatics (ICCCI), pp. 1–5.
IEEE, 2016.
6. Nagarajan, S., and T.S. Subashini.: “Static hand gesture recognition for sign language
alphabets using edge oriented histogram and multi class SVM.” International Journal of
Computer Applications 82, no. 4 (2013).
7. Barczak, A.L.C., N.H. Reyes, M. Abastillas, A. Piccio, and T. Susnjak.: “A new 2D static
hand gesture colour image dataset for asl gestures.” (2011). http://mro.massey.ac.nz/bitstream/
handle/10179/4514/GestureDatasetRLIMS2011.pdf.
8. LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner.: “Gradient-based learning
applied to document recognition.” Proceedings of the IEEE 86, no. 11 (1998): 2278–2324.
9. Simonyan, Karen, and Andrew Zisserman.: “Very deep convolutional networks for
large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).
10. Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams.: “Learning representations
by back-propagating errors.” Cognitive modeling 5, no. 3 (1988): 1.
11. Hahnloser, Richard HR, Rahul Sarpeshkar, Misha A. Mahowald, Rodney J. Douglas, and H.
Sebastian Seung.: “Digital selection and analogue amplification coexist in a cortex-inspired
silicon circuit.” Nature 405, no. 6789 (2000): 947–951.
12. Copyright © William Vicars, Sign Language resources at LifePrint.com, http://lifeprint.com/
asl101/topics/wallpaper1.htm.
13. Bottou, Léon.: “Large-scale machine learning with stochastic gradient descent.” In
Proceedings of COMPSTAT’2010, pp. 177–186. Physica-Verlag HD, 2010.