2021 IEEE International Symposium on Smart Electronic Systems (iSES)

FPGA based Implementation of Binarized Neural Network for Sign Language Application
2021 IEEE International Symposium on Smart Electronic Systems (iSES) | 978-1-7281-8753-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/iSES52644.2021.00077

1st Mohita Jaiswal
Dept. of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur, India
ORCID- 0000-0003-1111-898X

2nd Vaidehi Sharma
Dept. of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur, India
ORCID- 0000-0002-3687-0134

3rd Abhishek Sharma
Dept. of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur, India
ORCID- 0000-0002-8821-9837

4th Sandeep Saini
Dept. of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur, India
ORCID- 0000-0002-8906-8639

5th Raghuvir Tomar
Dept. of Electronics and Communication Engineering,
The LNM Institute of Information Technology, Jaipur, India
[email protected]

Abstract—In the last few years, there has been an increasing demand for efficient solutions to computer vision-related tasks on FPGA hardware, owing to its quick prototyping and computing capabilities. This work therefore implements a low-precision Binarized Neural Network (BNN) using a Python framework on the Xilinx PYNQ-Z2 embedded platform to tackle the challenging problem of Sign Language recognition. More specifically, the FINN framework is adopted and the BNN topology is modified to accept a larger input resolution (i.e., 64x64) and to classify the proposed Indian Sign Language (ISL) gestures into the corresponding numbers. In addition, data augmentation techniques are applied to improve the overall performance of the neural network. Furthermore, hardware/software co-verification of the BNN topology is performed to validate the accuracy after implementing it on hardware. Experimental results show that the design achieves a classification rate of 843.8 frames per second (FPS) on the PYNQ-Z2 FPGA, which is higher than that of previous works. The post-implementation results are also analyzed in terms of resource utilization and power consumption.

Index Terms—Binarized Neural Network (BNN), Computer Vision, Sign Language, FPGA

I. INTRODUCTION
Recognition of Sign Language is an interesting and challenging task in computer vision systems [1]–[4]. This vision-based language is a primary means of communication between the hearing- and speech-impaired community and other people. In recent years, researchers have shown tremendous interest in solving the complexity of sign language gestures [5]–[7]. Lighting conditions, varied hand movements and positions, constrained environments, and intensive CPU computations are the most significant challenges that degrade the performance of sign language recognition in the real world.

The recent progress of Deep Learning (DL) algorithms, especially Convolutional Neural Networks (CNNs), plays a significant role in solving complicated tasks such as image recognition [8] and object detection [9]. However, they consume massive power and are computationally expensive, which makes their deployment infeasible on embedded hardware platforms like FPGAs. This issue can be addressed by applying model compression techniques such as pruning [10], low-rank approximation of convolutional kernels [11], and binarization of neural networks [12] to reduce the overall computational load.

Several existing works implement DL algorithms to tackle the challenges of Sign Language applications. In [6], the authors analyze various Deep Neural Networks (DNNs) on ISL gestures, based on pre-trained VGG-16 with transfer learning and fine-tuning, and on hierarchical networks. A methodology based on an Artificial Neural Network (ANN) and handcrafted feature extraction is reported in [13] to recognize ISL gestures and translate them into English. An ANN-based approach is discussed in [14] for ISL number recognition using a leap motion controller. A sensor-based methodology is shown in [15], where static and dynamic signs of ISL are collected using flex sensors and the received data is classified in real time using Long Short-Term Memory (LSTM) networks. In [16], the authors propose a one-dimensional CNN array architecture for recognizing ISL signs using signals recorded from a wearable IMU device.

Although all of the above-mentioned approaches achieve good accuracy, they are implemented either with sensors or on CPU/GPU platforms, and none of them provides an efficient solution for deploying this application on low-power embedded devices. To date, [17] is the only work that shows an FPGA-based CNN implementation, using an 8-bit dynamic fixed-point scheme to recognize gestures of Swedish sign language. The authors reused and modified the ZynqNet design to measure performance on the FPGA platform. However, this approach relies on 8-bit quantization to optimize the parameters of the neural network.

Contribution: The major contributions of this paper are summarized below:
• First, the authors propose a dataset of static gestures of Maths numbers in ISL.
• Using the FINN framework, the authors modify the BNN topology and implement it for the Sign Language use case on the embedded PYNQ-Z2 platform.
• A detailed performance analysis of the proposed ISL dataset in terms of resource utilization, power consumption, and FPS on the targeted PYNQ-Z2 hardware, together with a comparative analysis against recent works.

The rest of the paper is structured as follows. Section II presents the proposed ISL gestures representing numbers. Section III discusses the dataset augmentation technique and the design flow of the modified BNN topology using the FINN framework. Section IV presents experimental results and analysis using the PYNQ-Z2 FPGA. Finally, Section V concludes this work and discusses future work.

II. DATASET PREPARATION
Some of the existing datasets of numbers in ISL offer only a small set of gestures at large image sizes. Such large, high-quality datasets are computationally expensive to process and are usually not preferred from a hardware perspective. In line with the hardware requirements, the authors have generated a dataset of 8,000 gestures of Maths numbers in ISL. It comprises 10 static gestures representing the numbers 0-9 at a resolution of 64x64, as illustrated in Figure 1.

Figure 1. Proposed ISL Dataset representing numbers

These sign gestures of Maths numbers were collected manually using a web camera against plain and cluttered backgrounds. To make the dataset diverse and large, the signs were performed by a group of 4 different signers who have little knowledge of sign language. Each signer performed 200 gestures per category with various hand orientations and positions, in normal and bright lighting conditions, using a single hand. Each category therefore consists of a total of 800 gestures.

III. PROPOSED WORK
In this section, an efficient way of implementing the sign language gesture classification task is presented, using the modified BNN topology and a combination of Python and FPGA design.

A. Dataset Augmentation
The main idea of dataset augmentation is to expand the data artificially so that the learning algorithms can learn new variations in the data throughout the training process. It also improves the performance and generalization capabilities of the model. The authors have selected the Imgaug library¹ for applying various augmentations to the gesture images in offline mode. The applied image transformations include Gaussian blur, contrast normalization, and additive Gaussian noise, as well as spatial transformations such as shearing, translation, and scaling, as shown in Figure 2. In addition, a heatmap representation is used to visualize the image and thus helps to augment the dataset with different skin tones using OpenCV operations. As a result, the final dataset contains 12,000 images of gestures. The dataset is split into an 80% training set, a 10% test set, and a 10% validation set. These images are then fed to the binarized network for feature extraction and classification.

¹ https://github.com/aleju/imgaug

Figure 2. Sample images from Image Augmentation

B. Modified Binarized Neural Network Architecture
This work explores the BNN-FINN framework designed by Umuroglu et al. [18], which provides a highly optimized design for mapping binarized neural networks onto FPGAs using benchmark datasets. The authors have extended the FINN design to target the sign language application by modifying the BNN topology. The main objective behind implementing sign language recognition or translation on an FPGA is to provide an efficient hardware solution for developing an educational bundle for hearing- and speech-impaired children.

The original BNN topology takes RGB and grayscale images of small resolution (i.e., 28x28, 32x32) and classifies them into their respective categories. Here, the authors have adopted the heterogeneous streaming design of BNN and modified its topology according to the requirements of the proposed dataset. The network topology is designed to accept an input resolution of 64x64 and has 7 2D Convolutional (CONV2D) layers with different numbers of filters (32, 64, 128) and a 3x3 filter size, followed by Max pooling layers. The first and last layers are not binarized, to maintain the network performance. Dropout layers are also added with 2 Fully Connected (FC) or Dense layers to reduce the network parameters. Moreover, a batch normalization layer is applied to the outputs of the 2D convolutional or FC layers, followed by a ReLU activation function. The modified BNN topology is shown in Table I.

This fully binarized network is trained using 1-bit binary weights and 1-bit activations on the sign language dataset. The key benefit of using a BNN is that it replaces the multiply and accumulate (MAC)
operation between weights and activations with XNOR and popcount operations [19]. The binary dot product is computed with bitwise XNOR operations, and the popcount operation counts the set bits in the XNOR results. Compared to signed accumulation, the combination of these operations significantly reduces LUT counts.

Table I
MODIFIED BINARIZED NEURAL NETWORK TOPOLOGY

Layers           Input Feature Map (IFM)   Output Feature Map (OFM)   No. of filters, Kernel Size
Binary CONV2D    (64, 64, 3)               (62, 62, 32)               32, (3, 3)
Binary CONV2D    (62, 62, 32)              (60, 60, 32)               32, (3, 3)
Max Pool 2D      (60, 60, 32)              (30, 30, 32)               (2, 2)
Binary CONV2D    (30, 30, 32)              (28, 28, 64)               64, (3, 3)
Max Pool 2D      (28, 28, 64)              (14, 14, 64)               (2, 2)
Binary CONV2D    (14, 14, 64)              (12, 12, 64)               64, (3, 3)
Binary CONV2D    (12, 12, 64)              (10, 10, 64)               64, (3, 3)
Max Pool 2D      (10, 10, 64)              (5, 5, 64)                 (2, 2)
Binary CONV2D    (5, 5, 64)                (3, 3, 128)                128, (3, 3)
Binary CONV2D    (3, 3, 128)               (1, 1, 128)                128, (3, 3)
Flatten          (1, 1, 128)               128                        -
Binary Dense     128                       512                        -
Binary Dense     512                       512                        -

C. Design Flow
The design flow of the modified BNN topology for deploying the Sign Language application on the PYNQ-Z2 FPGA board is shown in Figure 3. This FPGA board is selected because it has a small number of hardware resources (i.e., LUTs and DSPs) and is generally used for low-power embedded devices, as it consumes little power (approx. 2.5 W). First, the BNN model is trained using the Theano framework on the CPU platform, which creates a ‘1w-1a.npz’ file, where 1w and 1a denote 1-bit binary weights and activations. Theano² is a lightweight Python platform that allows users to evaluate and optimize mathematical operations involving multi-dimensional arrays efficiently. It offers fast computations and is used for building and training neural networks. Second, the weight generation script of the FINN synthesizer extracts the network parameters from the ‘1w-1a.npz’ file, converts them to weights.bin and thresh.bin, and generates a config file. These parameters and the config file describing the neural network model are further used for validation in software and for mapping to FPGAs. Third, the IP core generated from the C++ network description and parameters is used for the synthesis and generation of an executable bit file that contains the neural network topology. The bit file generated by Vivado is loaded into the programmable logic (PL) as an overlay from the Jupyter notebook environment of PYNQ. Here, PYNQ provides a Python library to call an IP core generated by HLS and HDL. Further, the PYNQ-Z2 board contains pre-installed PYNQ hardware libraries that are used to measure the performance of gesture classification.

² https://pypi.org/project/Theano/

Figure 3. Design Flow

IV. EXPERIMENTAL SET-UP AND RESULTS
A. Experimental Set-Up
In this paper, the PYNQ-Z2 FPGA board is selected as the testing platform and is used to accelerate the software process using the PYNQ open-source framework, as shown in Figure 4. This board combines an FPGA and an ARM Cortex-A9 processor along with peripherals that can be used for real-time signal processing, parallel hardware execution, and video processing applications. As discussed in the previous section, the IP core is generated using Vivado HLS 2018.2 and then used in Vivado to create a bit file for configuring the FPGA.

Figure 4. Experimental Set-Up
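The XNOR and popcount substitution described in Section III-B can be illustrated with a short software sketch. It assumes the common BNN encoding in which +1 is stored as bit 1 and -1 as bit 0; the helper names `pack_bits` and `binary_dot` are illustrative and are not part of the FINN framework.

```python
def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (+1 -> bit 1, -1 -> bit 0)."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return word


def binary_dot(w_bits, a_bits, n):
    """Dot product of two length-n {-1, +1} vectors given as packed bits.

    XNOR marks the positions where the signs agree and popcount counts them.
    Each agreement contributes +1 and each disagreement -1, so the result
    is 2 * popcount - n.
    """
    mask = (1 << n) - 1
    agree = ~(w_bits ^ a_bits) & mask   # bitwise XNOR, limited to n bits
    popcount = bin(agree).count("1")    # number of set bits
    return 2 * popcount - n


weights = [1, -1, -1, 1, 1, -1, 1, 1]
activations = [1, 1, -1, -1, 1, -1, -1, 1]

# Reference result using ordinary multiply-accumulate.
reference = sum(w * a for w, a in zip(weights, activations))
# Same result using only XNOR and popcount.
fast = binary_dot(pack_bits(weights), pack_bits(activations), len(weights))
print(reference, fast)
```

Both values agree, which is why a hardware implementation can replace multipliers and signed adders with XNOR gates and a bit counter.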

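The feature-map sizes listed in Table I follow from two rules: a 3x3 convolution without padding and with unit stride shrinks each spatial dimension by 2, and a 2x2 max pool halves it. The sketch below replays the layer sequence of Table I under those assumptions. The absence of padding and the unit stride are inferred from the table rather than stated in the text, and the names `LAYERS` and `trace_shapes` are illustrative.

```python
# Layer sequence taken from Table I: ("conv", number_of_filters) or ("pool",).
LAYERS = [
    ("conv", 32), ("conv", 32), ("pool",),
    ("conv", 64), ("pool",),
    ("conv", 64), ("conv", 64), ("pool",),
    ("conv", 128), ("conv", 128),
]


def trace_shapes(height, width, channels, layers, kernel=3, pool=2):
    """Return the (H, W, C) shape after the input and after every layer."""
    shapes = [(height, width, channels)]
    for layer in layers:
        if layer[0] == "conv":
            # Unpadded convolution with stride 1: size shrinks by kernel - 1.
            height, width = height - kernel + 1, width - kernel + 1
            channels = layer[1]
        else:
            # Max pooling with a pool x pool window and matching stride.
            height, width = height // pool, width // pool
        shapes.append((height, width, channels))
    return shapes


shapes = trace_shapes(64, 64, 3, LAYERS)
for shape in shapes:
    print(shape)

final_h, final_w, final_c = shapes[-1]
flattened = final_h * final_w * final_c
print("flattened size:", flattened)
```

Under these assumptions the last convolution leaves a 1x1x128 feature map, so the Flatten layer passes 128 values to the first Dense layer, as in Table I.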
B. Results and Analysis
The post-implementation results of the modified BNN topology for classifying static gestures are analyzed in terms of resource utilization and power consumption on the PYNQ-Z2 hardware, as shown in Table II. The percentage utilization indicates the LUT and BRAM resources required to store the weights, threshold parameters, and model. The overall power consumption of 1.83 W also makes the design suitable as a low-power hardware solution for the intended application.

Table II
RESOURCE UTILIZATION AND POWER CONSUMPTION SUMMARY

Device     Total Power   Resources   Utilization   Available   Usage
PYNQ-Z2    1.83W         LUTs        32164         53200       60.46%
                         LUTRAM      3444          17400       19.7%
                         FF          44243         106400      41.5%
                         BRAM        82            140         58.5%
                         DSP         28            220         12.7%

To evaluate the performance of the proposed work, a comparative analysis with prior works is carried out in Table III, based on the dataset used, FPGA platform, approach, DSP usage, and FPS. Using the 1-bit binarization approach, the model achieves a high classification rate of 843.8 FPS and performs significantly better with fewer DSP resources than existing designs.

Table III
PERFORMANCE COMPARISON TO PRIOR WORKS

                 Hwang et al. [20]       Prieto et al. [17]   Proposed Work
Dataset          Sebastien Marcel [21]   Hand Alphabet        Proposed ISL gestures
FPGA Platform    Stratix 4 EPFSGX        XCKU060              PYNQ-Z2 (XC7Z020)
Approach         Floating Point          Fixed Point          1-bit Binarization
DSP Usage        928                     646                  28
FPS              -                       23.5                 843.8

V. CONCLUSION AND FUTURE WORK
This paper demonstrates an efficient way of deploying the sign language application with high performance using modified binarized networks on a low-power PYNQ-Z2 FPGA platform. The 1-bit binarization of weights and activations provides a classification accuracy of 85% on static gestures and outperforms previous designs by achieving 843.8 frames per second (FPS). As a future direction, the authors plan to extend the application to real-time operation using deeper neural networks and different bit precisions.

ACKNOWLEDGEMENT
This research work is funded by the Department of Science and Technology (DST), India, under the project Sign Language to Regional Language Converter (SLRLC), project number SEED/TIDE/063/2016.

REFERENCES
[1] L. S. T. Mangamuri, L. Jain, and A. Sharma, “Two hand indian sign language dataset for benchmarking classification models of machine learning,” in 2019 International Conference on Issues and Challenges in Intelligent Computing Techniques (ICICT), vol. 1. IEEE, 2019, pp. 1–5.
[2] M. J. Cheok, Z. Omar, and M. Jaward, “A review of hand gesture and sign language recognition techniques,” International Journal of Machine Learning and Cybernetics, vol. 10, pp. 131–153, 2019.
[3] M. Jaiswal, V. Sharma, A. Sharma, and R. Tomar, “Transfer learning with l2 norm regularization for classifying static two hand hindi sign language gestures,” in 2020 IEEE 9th International Conference on Communication Systems and Network Technologies (CSNT). IEEE, 2020, pp. 44–48.
[4] A. S. Ghotkar and G. K. Kharate, “Study of vision based hand gesture recognition using indian sign language,” Computer, vol. 55, p. 56, 2014.
[5] M. Jaiswal, V. Sharma, A. Sharma, S. Saini, and R. Tomar, “An efficient binarized neural network for recognizing two hands indian sign language gestures in real-time environment,” in 2020 IEEE 17th India Council International Conference (INDICON), 2020, pp. 1–6.
[6] A. Sharma, N. Sharma, Y. Saxena, A. Singh, and D. Sadhya, “Benchmarking deep neural network approaches for indian sign language recognition,” Neural Computing and Applications, vol. 33, pp. 6685–6696, 2021.
[7] A. A. Barbhuiya, R. K. Karsh, and R. Jain, “Cnn based feature extraction and classification for sign language,” Multimedia Tools and Applications, vol. 80, pp. 3051–3069, 2021.
[8] W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification: A comprehensive review,” Neural Computation, vol. 29, no. 9, pp. 2352–2449, Sep. 2017. [Online]. Available: https://doi.org/10.1162/neco_a_00990
[9] Z.-Q. Zhao, P. Zheng, S.-T. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3212–3232, Nov. 2019. [Online]. Available: https://doi.org/10.1109/tnnls.2018.2876865
[10] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” ArXiv, vol. abs/1506.02626, 2015.
[11] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” ArXiv, vol. abs/1405.3866, 2014.
[12] T. Simons and D.-J. Lee, “A review of binarized neural networks,” Electronics, vol. 8, no. 6, p. 661, Jun. 2019. [Online]. Available: https://doi.org/10.3390/electronics8060661
[13] P. C. Badhe and V. Kulkarni, “Artificial neural network based indian sign language recognition using hand crafted features,” in 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT), 2020, pp. 1–6.
[14] D. Naglot and M. Kulkarni, “Ann based indian sign language numerals recognition using the leap motion controller,” in 2016 International Conference on Inventive Computation Technologies (ICICT), vol. 2, 2016, pp. 1–6.
[15] E. Abraham, A. Nayak, and A. Iqbal, “Real-time translation of indian sign language using lstm,” in 2019 Global Conference for Advancement in Technology (GCAT), 2019, pp. 1–5.
[16] K. Suri and R. Gupta, “Convolutional neural network array for sign language recognition using wearable imus,” in 2019 6th International Conference on Signal Processing and Integrated Networks (SPIN), 2019, pp. 483–488.
[17] R. Núñez Prieto, P. C. Gómez, and L. Liu, “A real-time gesture recognition system with fpga accelerated zynqnet classification,” in 2019 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC), 2019, pp. 1–6.
[18] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, “Finn,” Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb 2017. [Online]. Available: http://dx.doi.org/10.1145/3020078.3021744
[19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” ArXiv, vol. abs/1602.02505, 2016.
[20] W.-J. Hwang, Y.-J. Jhang, and T.-M. Tai, “An efficient fpga-based architecture for convolutional neural networks,” in 2017 40th International Conference on Telecommunications and Signal Processing (TSP), 2017, pp. 582–588.
[21] S. Marcel, “Hand posture recognition in a body-face centered space,” CHI ’99 Extended Abstracts on Human Factors in Computing Systems, 1999.
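A few quantities implied by Table II and Table III can be derived with simple arithmetic. The sketch below recomputes the utilization percentages from the raw counts in Table II and derives the per-frame latency, the throughput ratio relative to [17], and an energy-per-frame estimate. The energy estimate assumes that the reported 1.83 W is the average power drawn while classifying at 843.8 FPS, which the paper does not state explicitly. All variable names are illustrative.

```python
# (used, available) resource counts from Table II.
RESOURCES = {
    "LUTs": (32164, 53200),
    "LUTRAM": (3444, 17400),
    "FF": (44243, 106400),
    "BRAM": (82, 140),
    "DSP": (28, 220),
}
TOTAL_POWER_W = 1.83   # Table II
FPS_PROPOSED = 843.8   # Table III, proposed work
FPS_PRIOR = 23.5       # Table III, Prieto et al. [17]

# Percentage of each resource type that the design occupies.
utilization = {
    name: 100.0 * used / available
    for name, (used, available) in RESOURCES.items()
}
for name, percent in utilization.items():
    print(f"{name:7s} {percent:6.2f}%")

latency_ms = 1000.0 / FPS_PROPOSED                      # time per frame
speedup = FPS_PROPOSED / FPS_PRIOR                      # throughput ratio vs. [17]
energy_mj = 1000.0 * TOTAL_POWER_W / FPS_PROPOSED       # assumed energy per frame

print(f"latency per frame: {latency_ms:.3f} ms")
print(f"throughput ratio vs. [17]: {speedup:.1f}x")
print(f"estimated energy per frame: {energy_mj:.2f} mJ")
```

The recomputed LUTRAM, FF, and BRAM percentages differ slightly in the last digit from the values printed in Table II, which appear to be truncated rather than rounded to one decimal place.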
