Sign Language Character Recognition Research Paper
Figure 3. Network Architecture
Figure 5. Training Accuracy on ASL Digits
Figure 6. Training Loss on ASL Alphabet
Figure 8. Validation Accuracy on ASL Letters
Figure 9. Validation Accuracy on ASL Digits

4.2 Data Augmentation
We saw performance improve differently across our two datasets with data augmentation. Applying small transformations to our images (rotating by 20 degrees, translating by 20% on both axes) increased accuracy by approximately 0.05. We also flipped the images horizontally, since signs can be produced with either hand. While augmentation alone wasn't extremely effective, we saw that with better and more representative initial training data it improved performance much more drastically: augmenting the premade dataset improved performance by nearly 20%.
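A minimal sketch of these transformations, assuming a Keras-style pipeline (the framework choice, the treatment of the stated values as random ranges, and the array shapes are illustrative assumptions, not our exact implementation):

```python
# Augmentation sketch using Keras' ImageDataGenerator. The rotation,
# shift, and flip settings mirror the values described above; the
# generator itself and the data shapes are assumptions.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,        # rotate by up to 20 degrees
    width_shift_range=0.2,    # translate up to 20% horizontally
    height_shift_range=0.2,   # translate up to 20% vertically
    horizontal_flip=True,     # signs can be produced with either hand
)

# Hypothetical data: (N, H, W, C) images and one-hot letter labels.
x_train = np.random.rand(32, 64, 64, 3)
y_train = np.eye(26)[np.random.randint(0, 26, 32)]

# The generator yields endlessly; each batch is a randomly
# transformed copy of the input images.
batches = augmenter.flow(x_train, y_train, batch_size=16)
images, labels = next(batches)
```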
5. Results
We observed 82.5% accuracy on the alphabet gestures and 97% validation-set accuracy on the digits when using the NZ ASL dataset. On our self-generated dataset, we observed much lower accuracy, as expected, since our data was less uniform than data collected under studio settings with better equipment: 67% accuracy on letters of the alphabet and 70% accuracy on the digits. In terms of training time, the letter model converged in approximately 25 minutes and the digit model in nearly 10 minutes.

Figure 10. Validation Loss on ASL Letters
Figure 11. Validation Loss on ASL Digits

5.1 Evaluation
We trained with a categorical cross-entropy loss function for both of our datasets; it is a loss function commonly used for image classification problems.
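For K classes with one-hot label y and predicted probabilities p, categorical cross-entropy is L = -sum_k y_k log p_k, averaged over samples. A minimal NumPy sketch (the example values below are invented for illustration):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over a batch.

    y_true: (N, K) one-hot labels; y_pred: (N, K) class probabilities.
    """
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Tiny illustration with 3 classes (values are made up):
y_true = np.array([[1, 0, 0], [0, 1, 0]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
print(categorical_cross_entropy(y_true, y_pred))  # (-ln 0.7 - ln 0.8) / 2 ~ 0.29
```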
Initially, we observed low accuracy when testing on the validation set of the self-generated data, which we attributed largely to the lighting and skin-tone variations in the images. The higher accuracy on the digits was expected, since the gestures for the digits are much more distinguishable and easier to classify. Compared to previous methods on this same task, our network performed quite well, considering that RF-JA used both a color glove and a depth-sensing Kinect camera. Our higher accuracy relative to Stanford's method was likely due to their lack of background subtraction for the images, since they used a large dataset from ILSVRC2012 as part of a competition.
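Our exact preprocessing pipeline is not reproduced here; as one common way to perform the background subtraction mentioned above, the following OpenCV sketch uses a MOG2 background model (the file name and frame source are hypothetical, and this is a sketch of the general technique rather than our implementation):

```python
# Background-subtraction sketch with OpenCV's MOG2 model.
import cv2

subtractor = cv2.createBackgroundSubtractorMOG2(history=100, varThreshold=25)

cap = cv2.VideoCapture("gestures.mp4")  # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)                    # foreground mask
    hand = cv2.bitwise_and(frame, frame, mask=mask)   # keep foreground pixels
    # `hand` can then be cropped/resized and fed to the classifier.
cap.release()
```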
Method                   Accuracy (%)
deepCNN (our method)     82.5
Stanford deepCNN [7]     72
RF-JA+C (h-h) [8]        90
RF-JA+C (l-o-o) [8]      70

Figure 12. Comparison of previous methods with ours; Stanford didn't use background subtraction, RF-JA (h-h) split the training and validation sets 50-50, and (l-o-o) omitted specific data.
6. Conclusions and Future Work
In this paper, we described a deep learning approach to a classification algorithm for American Sign Language. Our results and process were severely hindered by skin-color and lighting variations in our self-generated data, which led us to resort to a premade, professionally constructed dataset. With a camera like Microsoft's Kinect, which has a depth sensor, this problem is easy to solve [5]. However, such cameras and technology are not widely accessible and can be costly. Our method shows potential for solving this problem with a simple camera, provided enough substantial training data, which can be continuously collected and added via the aforementioned processing pipeline. Since more people have access to simple camera technologies, this could contribute to a scalable solution.

Recognizing that classification is a limited goal, we plan on incorporating structured PGMs in future implementations of this classification schema that would describe the probability distributions of the different letters' occurrences based on their sequential contexts. We think that by accounting for how individual letters interact with each other directly (e.g. the likelihood that the vowel 'O' follows the letter 'J'), the accuracy of the classification would increase. This HMM approach with sequential pattern boosting (SP-boosting) has been applied to the actual gesture units that occur in certain gestures' contexts, i.e. capturing the upper-arm movements that precede a certain letter to incorporate that probability weight into the next unit's class [6], and to processing sequential phonological information in tandem with gesture recognition [4], but not to part-of-word tagging with an application like the one we hope to achieve. A minimal sketch of this letter-context idea follows below.
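The sketch combines per-step classifier probabilities with a bigram transition table via Viterbi decoding. All probabilities and the toy three-letter alphabet are invented for illustration; the structured PGM we envision would be estimated from fingerspelling data and richer than this.

```python
# Sketch: fuse per-step letter posteriors with bigram transitions.
import numpy as np

letters = ["J", "O", "Y"]                      # toy alphabet (hypothetical)
# emissions[t, k]: classifier probability of letter k at step t (made up)
emissions = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3],
                      [0.1, 0.3, 0.6]])
# trans[i, j]: probability that letter j follows letter i (made up)
trans = np.array([[0.1, 0.7, 0.2],
                  [0.2, 0.1, 0.7],
                  [0.4, 0.3, 0.3]])

def viterbi(emissions, trans):
    """Most likely letter sequence under emission and transition scores."""
    T, K = emissions.shape
    score = np.log(emissions[0])               # uniform prior over first letter
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in letter j after coming from i
        cand = score[:, None] + np.log(trans) + np.log(emissions[t])[None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print([letters[k] for k in viterbi(emissions, trans)])  # ['J', 'O', 'Y']
```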
We also recognize that the representation itself makes a huge difference in the performance of algorithms like ours, so we hope to find the best representation of our data and, building off the results from this research, incorporate it into a zero-shot learning process. We see zero-shot learning as having the potential to facilitate the translation process from American Sign Language into English. Implementing one-shot learning for translating the alphabet and numbers from American Sign Language into written English, and comparing it with a pure deep learning heuristic, could be successful and could benefit from error correction via language models. Recent implementations of one-shot adaptation have also had success in solving real-world computer vision tasks, effectively training deep convolutional neural networks with very little domain-specific data, even as limited as single-image datasets. We ultimately aim to create a holistic and comprehensive representation learning system, for which we have designed a set of features recognizable from simple gesture images, that will optimize the translation process.
7. References
[1] X. Chen and A. Yuille. Articulated pose estimation by a graphical model with image dependent pairwise relations. In Advances in Neural Information Processing Systems (NIPS), 2014.
[2] T. Pfister, J. Charles, and A. Zisserman. Flowing ConvNets for human pose estimation in videos. In IEEE International Conference on Computer Vision, 2015.
[3] A. L. C. Barczak, N. H. Reyes, M. Abastillas, A. Piccio, and T. Susnjak. A new 2D static hand gesture colour image dataset for ASL gestures. Research Letters in the Information and Mathematical Sciences, 15:12-20, 2011.
[4] T. Kim, K. Livescu, and G. Shakhnarovich. American Sign Language fingerspelling recognition with phonological feature-based tandem models. In IEEE Spoken Language Technology Workshop (SLT), 119-124, 2012.
[5] A. Agarwal and M. Thakur. Sign language recognition using Microsoft Kinect. In IEEE International Conference on Contemporary Computing, 2013.
[6] H. Cooper, E. J. Ong, N. Pugeault, and R. Bowden. Sign language recognition using sub-units. The Journal of Machine Learning Research, 13(1):2205-2231, 2012.
[7] B. Garcia and S. Viesca. Real-time American Sign Language recognition with convolutional neural networks. In Convolutional Neural Networks for Visual Recognition at Stanford University, 2016.
[8] C. Dong, M. C. Leu, and Z. Yin. American Sign Language alphabet recognition using Microsoft Kinect. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015.