
Received 13 April 2023, accepted 5 May 2023, date of publication 10 May 2023, date of current version 18 May 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3274851

SignExplainer: An Explainable AI-Enabled Framework for Sign Language Recognition With Ensemble Learning

DEEP R. KOTHADIYA 1,2, CHINTAN M. BHATT 3 (Member, IEEE), AMJAD REHMAN 2 (Senior Member, IEEE), FATEN S. ALAMRI 4, AND TANZILA SABA 2 (Senior Member, IEEE)

1 U & P U Patel Department of Computer Engineering, Faculty of Technology (FTE), Chandubhai S. Patel Institute of Technology (CSPIT), Charotar University of Science and Technology (CHARUSAT), Changa 388421, India
2 Artificial Intelligence and Data Analytics Laboratory (AIDA), College of Computer and Information Sciences (CCIS), Prince Sultan University, Riyadh 11586, Saudi Arabia
3 Department of Computer Science and Engineering, School of Engineering and Technology, Pandit Deendayal Energy University, Gandhinagar, Gujarat 382007, India
4 Department of Mathematical Sciences, College of Science, Princess Nourah bint Abdulrahman University, Riyadh 11671, Saudi Arabia

Corresponding authors: Deep R. Kothadiya ([email protected]) and Chintan M. Bhatt ([email protected])


This work was supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project, Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia, under Project PNURSP2023R346.

ABSTRACT Deep learning has significantly aided current advancements in artificial intelligence. Deep learning techniques have significantly outperformed typical machine learning approaches in fields like Computer Vision, Natural Language Processing (NLP), Robotics, and Human-Computer Interaction (HCI). However, deep learning models are ineffective at outlining their fundamental mechanism, which is why they are commonly regarded as black boxes. To establish confidence and accountability, deep learning applications need to explain the model's decision in addition to predicting results. Explainable AI (XAI) research has created methods that offer these interpretations for already trained neural networks, and such methods are highly recommended for computer vision tasks relevant to medical science, defense systems, and many more. The proposed study applies XAI to sign language recognition. The methodology uses an attention-based ensemble learning approach to make the prediction model more accurate, combining ResNet50 with a self-attention model in an ensemble learning architecture. The proposed ensemble learning approach has achieved a remarkable accuracy of 98.20%. To interpret the ensemble learning predictions, the authors propose SignExplainer, which explains the relevancy (in percentage) of predicted results. SignExplainer has illustrated excellent results compared to other conventional Explainable AI models reported in the state of the art.

INDEX TERMS Deep learning, computer vision, explainable AI, SignExplainer, classification, sign
language, technological development.

I. INTRODUCTION
The revolutionized era of Artificial Intelligence with Machine Learning and Deep Learning has demonstrated potential in different sectors. Over the last decade, machine learning and deep learning have found a vast range of applications in research and industry; computer vision with deep learning, especially, has proven incredible results. Computer vision in fields like medicine, autonomous vehicles, agriculture, and remote sensing leaves little room for failure [1]. Deep learning methods, computer vision, human-computer interaction, and other related sub-fields have also illustrated comparable performance in various domains. Computer vision with deep learning has proven hard to fail for many tasks [2].

(The associate editor coordinating the review of this manuscript and approving it for publication was Yongjie Li.)

With the availability of exclusive computing resources and huge learning datasets, deep learning can generate much more accurate results than before. With the good performance of machine learning and deep learning, artificial intelligence can achieve superhuman abilities. The world's social environment will transform dramatically due to artificial intelligence across different platforms. These changes come with various ethical issues, to which society will need to adjust quickly so that the advances lead to positive consequences. The complexity of deep learning models allows artificial intelligence to learn and react over complex data structures. Computer vision is one of the best approaches for image classification, segmentation, object detection, and many more applications [3].

Deep learning models prove excellent performance in sensitive areas like medical science, national defense, autonomous driving, finance, and many more, but these applications also need attention to trust-related problems. A system with promising results and good interpretation is easier to trust [4]. The significant performance of a computer vision task generates a huge number of parameters and links with the physical environment, which is extremely hard to explain. This complex learning structure is generally considered a ''Black-Box'' [5]. With the advancement of deep learning, especially computer vision, into sensitive and critical sectors, transparency and interpretability are highly recommended. It is necessary to involve explainability in Artificial Intelligence, generally referred to as Explainable Artificial Intelligence (XAI). A rapidly expanding field of study, XAI is quickly emerging as one of the more important components of Artificial Intelligence (AI) [6]. Research on XAI in the context of computer vision aims to extract or interpret the structure inside the black box. Additionally, it provides trust and interpretability to assist bias-free debugging over different computer vision applications like object detection, classification, and others. Interpretations from XAI models expose potential design flaws or structures [7]. Figure 1 represents a functional comparison of AI and XAI, especially for the reaction to predicted results by black-box learning.

FIGURE 1. Architectural summary and analysis of artificial intelligence and explainable artificial intelligence.

For medical-domain tasks like sign language recognition, it is necessary to explain and reveal the internal learning pattern. If the internal learning pattern is correct, then it will increase trust in sign language recognition models. Moreover, this explanation also reveals misclassification errors, leading to improvement of the model or the input scenario. Trust values are much more essential for sign language recognition, to predict how the model will learn a given gesture-based sign [8]. Interpretability improves the methodology's ability to predict the actual label: because the generation of sign gestures may vary from person to person, there is a high possibility of recognizing a different label. Sign language recognition with Explainable AI helps to improve the recognition model against various expectations, and also helps the end user to understand the learning methodology the deep learning model uses to recognize different sign gestures [9].

A sign language recognition system helps physically impaired people to communicate with the rest of the world. People with hearing impairment use gesture-based signs to express their emotions and thoughts. The majority of the contribution to generating a sign comes from the hand gesture, but expressing the proper meaning also involves non-manual body parts, like the orientation of the head and the movement of the eyes, eyebrows, and lips. XAI for sign language recognition helps to understand the predicted result, which may lead to improved accuracy of the model, while users also become familiar with the generated ideal sign gesture. Computer vision-based sign language recognition systems thereby improve not only in accuracy but also in user trust [10].

This study makes a threefold main contribution.
• First, attention-based ensemble learning for sign language recognition.
• Second, the authors have introduced a novel architecture using XAI for sign language recognition.
• Finally, concrete evidence for the interpretability and decision-driven approach of the proposed methodology with Explainable AI.

The rest of the article is organized as follows: Section II illustrates recently published methodologies for sign language recognition and XAI. Section III demonstrates the proposed methodology with deep learning and XAI. Section IV represents the simulation process and demonstrates the explainability and interpretability of the proposed architecture. Section V illustrates the evaluation and discusses the results.

II. RELATED WORK

Kim et al. [11] introduce Concept Activation Vectors (CAVs), which translate a neural network's internal state into human-understandable concepts. The important idea is to use a neural network's high-dimensional internal state as a tool rather than a hindrance. The authors demonstrate the application of CAVs as a component of a method called Testing with CAVs (TCAV), which uses directional derivatives to gauge how important a user-defined concept is to the categorization result.


For example, TCAV can quantify how much a zebra prediction is influenced by the presence of stripes. The authors explain how CAVs may be used to evaluate predictions and generate knowledge for a standard image classification network and a medical application, putting concepts to the test in image categorization.

In [12], the authors describe a unique technique that offers contrastive justifications for the classification of an input by a deep neural network or another black-box classifier. Given an input, the method finds what needs to be minimally and sufficiently present (viz., important object pixels in an image) to justify its classification and, analogously, what needs to be minimally and necessarily absent (viz., certain background pixels). The authors contend that such explanations are typical in fields like criminology and health care because they are natural to people. A key aspect of an explanation that, to their knowledge, had not yet been formally identified by explanation methods for neural network predictions is what is minimally represented but critically not present. The authors validated the proposed methodology over three datasets obtained from diverse domains: a brain activity strength dataset, a large procurement fraud dataset, and the handwritten digits dataset MNIST. In all three cases, the method produces precise explanations that are also simple for specialists to comprehend and evaluate [12].

Akula et al. [13] proposed the CoCoX model to explain the predictions generated by CNN classification. The authors proposed a fault-line model to identify minimal segmented-level features. Explanations from the CoCoX model were understandable to technical and non-technical communities. The authors evaluated qualitative metrics like Justification Trust (JT) and Explanation Satisfaction (ES) to make the performance understandable, and also compared the fault-line model to other state-of-the-art models like LIME and LRP [13]; they successfully achieved 69.1 JT with CNN learning and fault-line identification.

Contreras et al. [14] designed Deep Explainer and Rule Extraction (DEXiRE) to make binary neural networks explainable. The proposed methodology uses rule extraction, which improves knowledge extraction from DL model (CNN) output. A final (global) rule set describing the general behavior of DL predictors can be created by integrating intermediate rule sets explaining the behavior of each concealed layer. They used the BCWD, Banknote, and Pima diabetes datasets for the simulation of the proposed DEXiRE model. The number of terms in the intermediate and final rule sets can be regulated precisely with DEXiRE. The rule extraction model achieved remarkable accuracy and fidelity, 0.94 and 0.95 respectively, in a very small amount of time (around 232 ms).

Patel et al. [15] proposed water potability prediction with a synthetic oversampling technique and Explainable AI. The authors used the Synthetic Minority Oversampling Technique (SMOTE) to classify water quality on a Kaggle dataset, and compared the proposed architecture with other standard machine learning models like Decision Tree, Gradient Boost, Support Vector Machine, Random Forest, and AdaBoost. The proposed methodology achieved a remarkable 81% accuracy. The authors also considered the lack-of-transparency issue of machine learning models: to determine the significance of the characteristics behind a predicted result, Local Interpretable Model-agnostic Explanations (LIME) are used. The authors examined the different particles present in water, like chloramines, turbidity, sulfate, and many more, to justify results with Explainable AI; the LIME model is utilized to generate a result with the percentage contribution of each water particle.

Vermeire et al. [16] proposed a model-agnostic method, ''Search for EviDence Counterfactual'' (SEDC), for image classification. The ''EdC'' explanation is an irreducible collection of characteristics that, if absent, would change the classification of the document. SEDC additionally supports a single task for image explanation. The proposed methodology uses image segmentation as a core component of interpretation. The authors simulated the model to compare different counterfactual classes and also compared it with standard explainer models like SHAP and LIME. The simulation used pre-trained weights of MobileNet V2 to demonstrate the interpretation of the proposed SEDC model.

Goyal et al. [17] proposed a technique to design ''counterfactual explanations''. Generally, it is used to justify a prediction by the content area of the image, through the model that made the prediction. The methodology also addressed the minimum-edit counterfactual problem. The methodology works on an input image with a trained computer vision model to interpret the predicted class; on the MNIST dataset, the CNN model achieved 98.40% accuracy. The training model has 2 convolution and 2 FC (fully connected) layers, generating a feature size of 4 × 4 × 40. To generalize counterfactual explanations, the authors also experimented with the Omniglot and Caltech-UCSD Birds datasets. The proposed technique works over a greedy sequential exhaustive search model. The authors summarized the qualitative and quantitative results of the proposed technique.

Arras et al. [18] proposed a framework that provides a controlled, selective, and realistic testbed for the predictions of deep neural networks. The proposed methodology uses the CLEVR-XAI dataset for simulation; there were around 140k questions in the CLEVR-XAI evaluation set, with 28 alternative solutions. The prediction issue is presented as a classification challenge. The authors used ten pooling techniques to visualize the explanation evaluation over a ground-truth mask. The experiment section summarized the evaluation of different XAI methods like Guided Backprop, LRP, SmoothGrad, and 7 other methods [18]. The conclusive study finds that LRP performed much better than the other methods over the proposed (CLEVR-XAI) benchmark dataset. Table 1 represents a comparative analysis of different explainable models for predicting results from black-box learning; the analysis also provides a statistical comparison to justify trust and confidence.


TABLE 1. Comparative analysis of state-of-the-art Explainable AI models over confidence and justified trust value.
III. MATERIALS AND METHODS


The proposed architecture uses an Explainable AI-based methodology for sign language recognition with DeepExplainer, which is used to predict and validate the generated output with learning interpretability. The proposed methodology uses SHAP (Shapley Additive exPlanations) [18] to interpret framework predictions. The global interpreter SHAP is preferred over LIME [22] to interpret the effect of a single feature on the target variable, since the SHAP framework utilizes various explainability methods for better interpretation of model predictions. The proposed methodology is divided into three major stages: i) ensemble learning, ii) prediction of learning, and iii) SignExplainer, which interprets the results. Figure 2 shows the sequential flow of the proposed model.

FIGURE 2. Sequential process architecture of proposed methodology.
A. ENSEMBLE LEARNING
Every custom deep learning model is based on training-based learning and necessarily includes a stage to learn deep features; especially when the task is related to computer vision, proper model training is necessary. The proposed methodology uses ensemble learning with an attention model. Figure 3 represents the ensemble attention-based model for sign language recognition. The proposed methodology uses a bagging-based ensemble model to learn the associated features of sign images. Attention-based ensemble learning mainly divides into two categories: multi-head ensembles and attention-based ensembles [23]. Figure 3 demonstrates the different ways of attention-based ensemble learning, and Algorithm 1 represents the architectural structure of the proposed ensemble learning approach with the bagging concept.

FIGURE 3. Different ways to use attention for ensemble learning: (a) represents a multi-head ensemble with different feature embedding parameters, (b) represents the same feature embedding with different feature learning.

The proposed ensemble learning is mainly divided into two parts. The first is ResNet50, with 23.521M parameters, as the convolution learning module. ResNet50 is used to reduce the vanishing gradient problem: in a deep convolutional network, the gradient of the loss function generally shrinks toward zero after several iterations, while with the ResNet network, gradients can be routed through skip connections from previous layers to later filter layers. The linear learning of the residual network can be written as equation (1) [24], where G(x, {W_i}) stands for the residual mapping, and W_s is a square projection matrix matching the dimension of x.

η = G(x, {W_i}) + W_s·x (1)
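To make the skip-connection idea concrete, the following is a minimal TensorFlow-Keras sketch of a residual block in the spirit of equation (1); the filter sizes and the 1×1 shortcut projection are illustrative assumptions, not the exact ResNet50 configuration used by the authors.

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    # W_s: 1x1 convolution projecting the shortcut to the target dimension
    shortcut = layers.Conv2D(filters, 1, padding="same")(x)
    # G(x, {W_i}): the residual mapping learned by the stacked layers
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    # eq. (1): eta = G(x, {W_i}) + W_s * x
    return layers.Activation("relu")(layers.Add()([y, shortcut]))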

Another component of the ensemble learning is the attention module, which can be designed as two associated modules: a feature extraction module F(x) and an attention module A(x). The feature extraction module was designed with a multi-layer perceptron model and is generalized as equation (2) [23]. The attention weights are calculated as equations (3) and (4), where h_e and h_d stand for the encoder and decoder weights.

F(x) = h_i(h_{i−1}(. . . h_2(h_1(x)))) (2)
γ = tanh(W·h_e + W·h_d) (3)
A(x) = Softmax(γ) (4)
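A minimal sketch of equations (2)-(4), assuming dense projections for the encoder and decoder states h_e and h_d; the layer widths are illustrative.

import tensorflow as tf
from tensorflow.keras import layers

# F(x) of eq. (2): a stack of perceptron layers h_1, ..., h_i
feature_extractor = tf.keras.Sequential(
    [layers.Dense(256, activation="relu") for _ in range(3)]
)

W = layers.Dense(128, use_bias=False)  # shared projection; width assumed

def attention_weights(h_e, h_d):
    gamma = tf.tanh(W(h_e) + W(h_d))      # eq. (3)
    return tf.nn.softmax(gamma, axis=-1)  # eq. (4): A(x) = Softmax(gamma)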
The global feature embedding model G(x) is given as equation (5) for the embedding module. The authors have proposed a three-dimension blob channel to recognize input images over the RGB channels. The attention feature and convolution feature are associated for the final feature vector generation, which is forwarded to a fully connected DCNN network for classification. Figure 4 represents the conceptual architecture of the proposed ensemble learning with the attention model.

G(x) = Σ F(x) ⊗ A(x) (5)

Algorithm 1 Pseudo-Code for the Proposed Ensemble Learning Architecture (Bagging Based)
Input: Training image set I
Output: Interpretation_Index
1. K ← Conv_Layer(ResNet(I))
2. l ← Class_Labels {0, 1, 2, ..., A, B, ..., Z}
3. G ← Ensemble_feature(l)
4. C ← Num_Classes(l)
5. for k ∈ {1, ..., K} do
6.   for c ∈ {0, ..., C} do
7.     D_c = Conv(I_c0 * I_c1 * ... * I_cn)
8.     f_c = D_0 ∪ D_1 ∪ ... ∪ D_n
9.   end for
10.  G(k) = MLP(f_c)
11. end for
12. G(x) = softmax((G_1(x) + G_2(x) + ... + G_K(x)) / K)
13. # Feature Explainer:
14. procedure Sign_Ex(g(x), l)
15.   i ← max_val(int)
16.   create a collection for π values
17.   for i ∈ g(x) do
18.     for each π ∈ {π_0, ..., π_1} do
19.       calculate π_i
20.       π_O ← 1(π_i)
21.     end for
22.     Υ ← evaluate(π_O, l)
23.   end for
24.   return (index ← max_val(Υ))
25. end procedure

FIGURE 4. Proposed ensemble learning architecture with ResNet50 and attention model, learning embedded features with the global feature embedding method (FC stands for fully connected layer).
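As a sketch of the bagging step of Algorithm 1 (line 12) and the feature fusion of equation (5): reading F(x) ⊗ A(x) as an attention-weighted sum is one plausible interpretation, so this is an assumption rather than the authors' exact implementation.

import tensorflow as tf

def fuse_features(conv_feat, attn):
    # One reading of eq. (5): sum the attention-weighted features
    return tf.reduce_sum(conv_feat * attn, axis=1)

def ensemble_predict(models, x):
    # Algorithm 1, line 12: average the K learners' outputs, then softmax
    outputs = [m(x, training=False) for m in models]  # G_1(x), ..., G_K(x)
    return tf.nn.softmax(tf.add_n(outputs) / len(outputs))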

B. CLASSIFICATION AND PREDICTION
The output from the fully connected layer is further processed for classification and prediction. The authors have implemented a multi-layer perceptron (MLP) [25] to classify sign language. The proposed methodology uses a DFFN (Deep Feedforward Neural Network) to recognize gesture signs from input images. ReLU activation was implemented in the final layer of the deep network for sign recognition, and the network can be calculated as equation (6), where (W1, W2) are different weights and (b1, b2) are biases.

DFNN = ReLU(W1·x + b1)·W2 + b2 (6)
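A minimal Keras sketch of the DFFN head of equation (6); the hidden width is a hypothetical choice, while the 36 output classes and the 0.3 dropout follow the dataset and simulation details reported in Section IV.

import tensorflow as tf
from tensorflow.keras import layers

# eq. (6): DFNN = ReLU(W1*x + b1) * W2 + b2, with a softmax over the classes
classification_head = tf.keras.Sequential([
    layers.Dense(512, activation="relu"),   # ReLU(W1*x + b1); width assumed
    layers.Dropout(0.3),                    # dropout ratio from Section IV-C
    layers.Dense(36, activation="softmax"), # 36 ISL classes (digits + alphabet)
])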
The authors have utilized NumPy and Scikit-learn [26] for evaluation and visualization. The class-wise performance scores have been calculated, and accuracy, precision, recall, and F1-score were computed to analyze ensemble model performance. These performance standards are calculated as per equations (7) to (10) [27], [28].

Accuracy = (TP + TN)/(TP + FP + TN + FN) (7)
Precision = TP/(TP + FP) (8)
Recall = TP/(TP + FN) (9)
F1-Score = 2·(Precision·Recall)/(Precision + Recall) (10)
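Since Scikit-learn [26] is used for evaluation, equations (7)-(10) correspond to standard metric calls, as in this sketch with placeholder arrays.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 1]  # placeholder ground-truth labels
y_pred = [0, 1, 2, 2]  # placeholder model predictions

acc = accuracy_score(y_true, y_pred)                   # eq. (7)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)  # eqs. (8)-(10)
print(acc, prec, rec, f1)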
C. SIGNEXPLAINER
Interpretation and explainability techniques for black-box deep learning models fall under two categories: model-specific and model-agnostic. This section focuses on the design of SignExplainer, an agnostic interpretability technique that can be applied to any black-box deep learning model to interpret gesture-based signs. SHAP [29] is among the most utilized interpretability methods for deep learning-based approaches, and it can construct interpretations for multi-class classifier responses. SignExplainer uses a sign-specific Xconcept to generate a fault-line explanation. Let us assume that δpred and δalt are the Xconcepts for Epred and Ealt respectively, where E stands for the actual class. Based on the Xconcept, the fault-line prediction can be calculated as equation (11) [30].

Ψ(Epred, Ealt) ← min over (δpred, δalt) of α(δpred, δalt) + β·|δpred| + λ·‖δalt‖ (11)

The proposed methodology designs DeepExplainer as an additive feature attribution method with accuracy and missingness. DeepExplainer combines the SHAP values computed for smaller components of the ensemble network and aggregates them as equation (12) [31], where Δo = f(x) − f(r), Δxi = xi − xr, r is the reference input, and f(x) is the model output.

O = Σ (i=1 to n) C_Δxi · Δo (12)
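A minimal sketch of attaching a SHAP DeepExplainer to a trained Keras model to obtain the additive attributions of equation (12); the background sample (acting as the reference input r) and the variable names are assumptions.

import numpy as np
import shap

# model: trained Keras classifier; x_train, x_test: preprocessed sign images
background = x_train[np.random.choice(len(x_train), 100, replace=False)]
explainer = shap.DeepExplainer(model, background)  # background acts as r
shap_values = explainer.shap_values(x_test[:4])    # per-class attributions
shap.image_plot(shap_values, x_test[:4])           # red pixels raise the score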


IV. EXPERIMENTS AND RESULTS
A. DATASET
The authors have evaluated SignExplainer with ensemble learning on the Indian Sign Language dataset [32]. The dataset used for simulation consists of 36 Indian sign classes covering the digits (0-9) and the alphabet (A-Z), with approximately 1200 three-channel images per class. Along with the Indian Sign Language (ISL) dataset, the authors have also experimented with other static datasets, American Sign Language (ASL) [33] and Bangla Sign Language (BSL) [34]. The properties of the datasets are described in Table 2.

TABLE 2. Statistical representation of the different sign language datasets used in the simulation.

B. DATA AUGMENTATION
The proposed simulation uses data augmentation to make the model more generalized for feature learning. Data augmentation is also used to balance the training image samples and to improve robustness to variability across different images, making the model more generalized toward real-time scenarios. Direct image inference may yield biased findings due to particular transformations and noise associated with equipment and surroundings, so image augmentation must be used to achieve more reliable and robust predictions, improve accuracy, and prevent overfitting. The authors have implemented i) geometric transformations: random horizontal flip, random rotation with a factor of +0.2 to -0.2, and zooming by 1.5% to 2.5%; and ii) color space transformations: random RGB change and brightness change by 0.5%. Figure 5 represents a sample of the augmented training dataset.

FIGURE 5. Input sign image augmentation: (a) original image, (b) horizontal flip, (c) color transformation, (d) random rotation, (e) zooming.
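The listed transformations map onto TensorFlow-Keras preprocessing layers, as in this sketch; the exact factor encodings are assumptions based on the ranges stated above, and RandomBrightness requires a recent TensorFlow release.

import tensorflow as tf
from tensorflow.keras import layers

augmenter = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),    # random horizontal flip
    layers.RandomRotation(0.2),         # rotation factor in [-0.2, +0.2]
    layers.RandomZoom((0.015, 0.025)),  # zooming by 1.5% to 2.5%
    layers.RandomBrightness(0.005),     # brightness change by 0.5%
])
# images: a float32 batch of training sign images
augmented = augmenter(images, training=True)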
C. SIMULATION DETAILS
The authors have implemented training of the ensemble learning module on the ISL dataset [32]. TensorFlow-Keras has been used to implement the proposed methodology. The proposed ensemble methodology achieved 98.20% accuracy with the features extracted from the attention and ResNet50 models. Model training used a 0.2 train-test split ratio (80:20) for all experiments, with an image size of (72, 72, 3) and a batch size of 16. The model was simulated with a dropout ratio of 0.3 and a learning rate of 0.001 with the Adam optimizer. Table 3 demonstrates superior performance over other standard convolutional networks; the best performance was observed for the proposed attention-based ensemble model. The proposed methodology achieved significant accuracy over 50 learning epochs, as shown in Figure 6.

FIGURE 6. Accuracy and loss curves for Indian Sign Language recognition using attention-based ensemble learning.
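The reported configuration corresponds to a setup like the following sketch; dataset loading and the ensemble construction are omitted, and the loss choice is an assumption.

import tensorflow as tf
from sklearn.model_selection import train_test_split

# 80:20 split over (72, 72, 3) sign images, as reported above
x_train, x_test, y_train, y_test = train_test_split(
    images, labels, test_size=0.2, random_state=42)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",  # assumed loss for integer labels
    metrics=["accuracy"],
)
history = model.fit(x_train, y_train, batch_size=16, epochs=50,
                    validation_data=(x_test, y_test))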


TABLE 3. Performance analysis with state-of-the-art models for image classification.

D. INTERPRETATION WITH SIGNEXPLAINER
The proposed methodology simulates SignExplainer to generate a model prediction and explain the correctness of that prediction. The simulation uses OpenCV for masking the input images and passes them to the ''blur(128,128)'' method, which is responsible for masking the predicted image output, with inpaint-telea as an alternative value. The authors have created SignExplainer with adaptive feature abstraction, which compares predictions with and without X-features, where X-features are the associative contributions of the ensemble learning features. The prediction function of SignExplainer works over these masked features. The authors pass sign images to the Explainer object to generate SHAP values, and Figure 7 represents the plot.

The interpretation plot has been taken with 4 flips over 1,000 evaluations (max_evals=1000) for the Explainer object, as shown in Figure 7. The gradient bar represents the prediction's relevance interpretation: red stands for the maximum and blue stands for the minimum. Table 4 represents the performance of other basic XAI models in interpreting the ensemble model's prediction output over the Indian Sign Language dataset.
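The blur masker and evaluation budget described above match SHAP's image masking API; a minimal sketch under that assumption follows (model, x_test, and class_names are placeholders).

import shap

# Hide image regions with a (128, 128) blur; "inpaint_telea" is the
# OpenCV inpainting alternative mentioned above.
masker = shap.maskers.Image("blur(128,128)", x_test[0].shape)
explainer = shap.Explainer(model.predict, masker, output_names=class_names)
shap_values = explainer(
    x_test[:4], max_evals=1000, batch_size=50,
    outputs=shap.Explanation.argsort.flip[:4],  # top-4 predicted classes
)
shap.image_plot(shap_values)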

TABLE 4. Statistical performance comparison of different models for interpretation over the ISL dataset (where TPR is True Positive Rate, FNR is False Negative Rate, PPV is Positive Predictive Value, and FDR is False Discovery Rate).

We have demonstrated remarkable explanation results over sign language, especially for Indian signs. To ensure the robustness of the proposed SignExplainer with the ensemble learning model, the authors have evaluated the proposed methodology over other static, standard sign language datasets, American Sign Language (ASL) and Bangla Sign Language (BSL); the statistical comparison is described in Table 5.

TABLE 5. Performance analysis of SignExplainer over different static sign language datasets.

The prediction score of SignExplainer for a test sign image is demonstrated in Figure 8. SignExplainer helps the user understand why the model recognizes a data instance as it does. The first image is from the testing dataset, a significant gesture of ''4''; at the top, each prediction shows its matching value. Red dots represent high relevance while blue dots represent low relevance. Based on the high relevance of the feature attribution, it is easy to interpret how the model learned to predict sign language: the presence of red pixels over the corresponding area of the hand gesture increases the prediction probability.

FIGURE 7. Support features for SignExplainer over Indian Sign Language recognition (a few samples have been taken to maintain article readability).


FIGURE 8. Representation of SignExplainer interpreting a sign gesture with prediction values and classes (classes start from 0-9, left to right).

V. RESULTS ANALYSIS AND DISCUSSION
A computer vision-based model to learn and interpret predictions was proposed by this study. The authors have proposed a sequential (two-phase) methodology, from learning with the ensemble model to interpreting the predicted result with the SignExplainer model. The authors have implemented the proposed architecture for Indian Sign Language (ISL); the experiments also extend to other static sign languages like American Sign Language (ASL) and Bangla Sign Language (BSL). This study proposed and demonstrated attention-based ensemble learning with ResNet50 and a self-attention model. The proposed architecture was able to achieve a remarkable 98.20 percent accuracy for ISL, and was also compared with other computer vision state-of-the-art models. The second phase of the study demonstrated the interpretation of the learning model: the authors used the SignExplainer model to extract masked values from the black-box model.

The proposed SignExplainer uses a fault-line calculation to interpret the correctness of the predicted sign image. The results section also demonstrates the results achieved by SignExplainer and compares them with other conventional XAI models. The authors have also evaluated the TP-rate and FP-rate for the proposed model and found them remarkable against other black-box deep learning models, at 0.98 and 0.17 respectively. Figure 9 represents a comparative analysis of the proposed architecture (ensemble learning + SignExplainer) with other deep learning models like SVM [40], Random Forest [41], CNN [35], VGG16 [36], and EfficientNetV2 [37]. The evaluation matrix was calculated with the true-false positive rate, F-measures, and the RMSE (Root Mean Square Error) value. The statistical analysis shows that the proposed associative architecture is more accurate than other standard machine learning and deep learning models (shown in Figure 9). The authors have also analyzed other deep learning object detection models like R-CNN [42], Faster R-CNN [43], and Single Shot Detector (SSD) [44] with VGG16 [45] as the backbone against the proposed attention-based ensemble model. A comparative analysis of the deep learning detection models is illustrated in Figure 10. Figure 11 illustrates the confusion matrix of the proposed ensemble learning methodology for the static Indian Sign Language dataset.

FIGURE 9. Comparative analysis of the proposed methodology with other deep learning state-of-the-art methodologies.

FIGURE 10. Comparative accuracy analysis of the proposed methodology with other deep learning state-of-the-art object detection models.

FIGURE 11. Confusion matrix for static Indian Sign Language using ensemble learning with ResNet50.


VI. CONCLUSION
The era of Explainable AI is growing exponentially to overcome the trust and transparency issues of deep learning models. Tasks relevant to computer vision or NLP especially require interpretation of predicted results in critical sectors. The review explored different XAI methodologies, like LRP, LIME, SHAP, and SmoothGrad, over relevant computer vision applications. This study has proposed making sign language recognition explainable: an ensemble learning-based architecture was proposed to recognize sign gestures from sign images, and the ensemble weights were passed to the proposed SignExplainer to generate statistical values like the TP-rate and FP-rate to evaluate its correctness. This study also evaluated ensemble learning against other deep learning models for image classification. The proposed study also evaluates the performance of SignExplainer over other benchmark static sign language datasets like ASL and BSL, where it likewise achieves remarkable performance. The study additionally simulates machine learning and deep learning models like Decision Tree, Random Forest, VGG16, and EfficientNetV2 and evaluates the performance of SignExplainer; ensemble learning and the other deep learning models also performed well with SignExplainer in interpreting predicted signs with proper statistical values. The proposed work can be extended to other static sign languages as well as isolated sign languages, and the proposed methodology can be enhanced for real-time or portable sign language recognition with acceptable interpretations.

ACKNOWLEDGMENT
This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023R346), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. The authors would also like to acknowledge the support of Prince Sultan University for paying the Article Processing Charges (APC) of this publication.
REFERENCES
[1] P. P. Angelov, E. A. Soares, R. Jiang, N. I. Arnold, and P. M. Atkinson, ''Explainable artificial intelligence: An analytical review,'' WIREs Data Mining Knowl. Discovery, vol. 11, no. 5, p. e1424, 2021.
[2] Y. Yuan and Y. Lo, ''Improving dermoscopic image segmentation with enhanced convolutional-deconvolutional networks,'' IEEE J. Biomed. Health Informat., vol. 23, no. 2, pp. 519–526, Mar. 2019, doi: 10.1109/jbhi.2017.2787487.
[3] A. Gramegna and P. Giudici, ''SHAP and LIME: An evaluation of discriminative power in credit risk,'' Frontiers Artif. Intell., vol. 4, Sep. 2021, Art. no. 752558.
[4] F. Afza, M. A. Khan, M. Sharif, S. Kadry, G. Manogaran, T. Saba, I. Ashraf, and R. Damaševičius, ''A framework of human action recognition using length control features fusion and weighted entropy-variances based feature selection,'' Image Vis. Comput., vol. 106, Feb. 2021, Art. no. 104090.
[5] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis, ''Explainable AI: A review of machine learning interpretability methods,'' Entropy, vol. 23, no. 1, p. 18, Dec. 2020, doi: 10.3390/e23010018.
[6] K. V. Dudekula, H. Syed, M. I. M. Basha, S. I. Swamykan, P. P. Kasaraneni, Y. V. P. Kumar, A. Flah, and A. T. Azar, ''Convolutional neural network-based personalized program recommendation system for smart television users,'' Sustainability, vol. 15, no. 3, p. 2206, Jan. 2023.
[7] M. Baldeon Calisto and S. K. Lai-Yuen, ''AdaEn-net: An ensemble of adaptive 2D–3D fully convolutional networks for medical image segmentation,'' Neural Netw., vol. 126, pp. 76–94, Jun. 2020, doi: 10.1016/j.neunet.2020.03.007.
[8] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, ''DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018, doi: 10.1109/TPAMI.2017.2699184.
[9] J. Ganesan, A. T. Azar, S. Alsenan, N. A. Kamal, B. Qureshi, and A. E. Hassanien, ''Deep learning reader for visually impaired,'' Electronics, vol. 11, no. 20, p. 3335, Oct. 2022.
[10] D. Kothadiya, C. Bhatt, K. Sapariya, K. Patel, A.-B. Gil-González, and J. M. Corchado, ''Deepsign: Sign language detection and recognition using deep learning,'' Electronics, vol. 11, no. 11, p. 1780, Jun. 2022, doi: 10.3390/electronics11111780.
[11] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres, ''Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV),'' in Proc. Int. Conf. Mach. Learn., 2018, pp. 2668–2677. [Online]. Available: http://proceedings.mlr.press/v80/kim18d.html
[12] A. Dhurandhar, P.-Y. Chen, R. Luss, C.-C. Tu, P. Ting, K. Shanmugam, and P. Das, ''Explanations based on the missing: Towards contrastive explanations with pertinent negatives,'' 2018, arXiv:1802.07623.
[13] A. Akula, S. Wang, and S.-C. Zhu, ''CoCoX: Generating conceptual and counterfactual explanations via fault-lines,'' in Proc. AAAI Conf. Artif. Intell., Apr. 2020, vol. 34, no. 3, pp. 2594–2601, doi: 10.1609/aaai.v34i03.5643.
[14] V. Contreras, N. Marini, L. Fanda, G. Manzo, Y. Mualla, J.-P. Calbimonte, M. Schumacher, and D. Calvaresi, ''A DEXiRE for extracting propositional rules from neural networks via binarization,'' Electronics, vol. 11, no. 24, p. 4171, Dec. 2022, doi: 10.3390/electronics11244171.
[15] J. Patel, C. Amipara, T. A. Ahanger, K. Ladhva, R. K. Gupta, H. O. Alsaab, Y. S. Althobaiti, and R. Ratna, ''A machine learning-based water potability prediction model by using synthetic minority oversampling technique and explainable AI,'' Comput. Intell. Neurosci., vol. 2022, pp. 1–15, Sep. 2022, doi: 10.1155/2022/9283293.
[16] T. Vermeire, D. Brughmans, S. Goethals, R. M. B. de Oliveira, and D. Martens, ''Explainable image classification with evidence counterfactual,'' Pattern Anal. Appl., vol. 25, no. 2, pp. 315–335, Jan. 2022, doi: 10.1007/s10044-021-01055-y.
[17] Y. Goyal, Z. Wu, J. Ernst, D. Batra, D. Parikh, and S. Lee, ''Counterfactual visual explanations,'' in Proc. 36th Int. Conf. Mach. Learn., May 2019, pp. 2376–2384. [Online]. Available: https://proceedings.mlr.press/v97/goyal19a.html
[18] L. Arras, A. Osman, and W. Samek, ''CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations,'' Inf. Fusion, vol. 81, pp. 14–40, May 2022, doi: 10.1016/j.inffus.2021.11.008.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, ''Learning deep features for discriminative localization,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2921–2929. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2016/html/Zhou_Learning_Deep_Features_CVPR_2016_paper.html
[20] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, ''Grad-CAM: Visual explanations from deep networks via gradient-based localization,'' Int. J. Comput. Vis., vol. 128, no. 2, pp. 336–359, Feb. 2020, doi: 10.1007/s11263-019-01228-7.
[21] M. T. Ribeiro, S. Singh, and C. Guestrin, ''Why should I trust you?: Explaining the predictions of any classifier,'' in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2016, pp. 1135–1144.
[22] X. Shen, K. Lu, S. Mehta, J. Zhang, W. Liu, J. Fan, and Z. Zha, ''MKEL: Multiple kernel ensemble learning via unified ensemble loss for image classification,'' ACM Trans. Intell. Syst. Technol., vol. 12, no. 4, pp. 1–21, Aug. 2021.
[23] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon, ''Attention-based ensemble for deep metric learning,'' in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 736–751.


[24] B. Chen and W. Deng, ''Deep embedding learning with adaptive large margin N-pair loss for image retrieval and clustering,'' Pattern Recognit., vol. 93, pp. 353–364, Sep. 2019, doi: 10.1016/j.patcog.2019.05.011.
[25] D. R. Kothadiya, C. M. Bhatt, T. Saba, A. Rehman, and S. A. Bahaj, ''SIGNFORMER: DeepVision transformer for sign language recognition,'' IEEE Access, vol. 11, pp. 4730–4739, 2023, doi: 10.1109/access.2022.3231130.
[26] J. Mueller and L. Massaron, Python for Data Science. Hoboken, NJ, USA: Wiley, 2019.
[27] J. Huang, W. Zhou, H. Li, and W. Li, ''Sign language recognition using real-sense,'' in Proc. IEEE China Summit Int. Conf. Signal Inf. Process. (ChinaSIP), Jul. 2015, pp. 166–170.
[28] L. Pigou, S. Dieleman, P.-J. Kindermans, and B. Schrauwen, ''Sign language recognition using convolutional neural networks,'' in Proc. Eur. Conf. Comput. Vis., 2015, pp. 572–578.
[29] S. Knapič, A. Malhi, R. Saluja, and K. Främling, ''Explainable artificial intelligence for human decision support system in the medical domain,'' Mach. Learn. Knowl. Extraction, vol. 3, no. 3, pp. 740–770, Sep. 2021, doi: 10.3390/make3030037.
[30] J. van der Waa, E. Nieuwburg, A. Cremers, and M. Neerincx, ''Evaluating XAI: A comparison of rule-based and example-based explanations,'' Artif. Intell., vol. 291, Feb. 2021, Art. no. 103404.
[31] F. Gabbay, S. Bar-Lev, O. Montano, and N. Hadad, ''A LIME-based explainable machine learning model for predicting the severity level of COVID-19 diagnosed patients,'' Appl. Sci., vol. 11, no. 21, p. 10417, Nov. 2021.
[32] D. R. Kothadiya. (Oct. 2022). DeepKothadiya/Static_ISL: Static Indian Sign Language Dataset Having Sign of Digit and Alphabet. [Online]. Available: https://github.com/DeepKothadiya/Static_ISL
[33] Thakur. (May 2019). American Sign Language Dataset. [Online]. Available: https://www.kaggle.com/datasets/ayuraj/american-sign-language-dataset
[34] S. M. Rayeed. (Aug. 2021). Bangla Sign Language Dataset. [Online]. Available: https://www.kaggle.com/datasets/rayeed045/bangla-sign-language-dataset
[35] T. Saba, M. A. Khan, A. Rehman, and S. L. Marie-Sainte, ''Region extraction and classification of skin cancer: A heterogeneous framework of deep CNN features fusion and reduction,'' J. Med. Syst., vol. 43, no. 9, Jul. 2019, doi: 10.1007/s10916-019-1413-3.
[36] K. Simonyan and A. Zisserman, ''Very deep convolutional networks for large-scale image recognition,'' 2014, arXiv:1409.1556.
[37] B. Li, B. Liu, S. Li, and H. Liu, ''An improved EfficientNet for rice germ integrity classification and recognition,'' Agriculture, vol. 12, no. 6, p. 863, Jun. 2022, doi: 10.3390/agriculture12060863.
[38] Y. Heffetz, R. Vainshtein, G. Katz, and L. Rokach, ''DeepLine: AutoML tool for pipelines generation using deep reinforcement learning and hierarchical actions filtering,'' in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2020, pp. 2103–2113.
[39] H. Chen, S. Lundberg, and S.-I. Lee, ''Explaining models by propagating Shapley values of local components,'' in Explainable AI in Healthcare and Medicine. Cham, Switzerland: Springer, 2020, pp. 261–270, doi: 10.1007/978-3-030-53352-6.
[40] A. Razaque, M. Ben Haj Frej, M. Almi'ani, M. Alotaibi, and B. Alotaibi, ''Improved support vector machine enabled radial basis function and linear variants for remote sensing image classification,'' Sensors, vol. 21, no. 13, p. 4431, Jun. 2021, doi: 10.3390/s21134431.
[41] Z. Noshad, N. Javaid, T. Saba, Z. Wadud, M. Saleem, M. Alzahrani, and O. Sheta, ''Fault detection in wireless sensor networks through the random forest classifier,'' Sensors, vol. 19, no. 7, p. 1568, Apr. 2019.
[42] X. Xie, G. Cheng, J. Wang, X. Yao, and J. Han, ''Oriented R-CNN for object detection,'' 2021, arXiv:2108.05699.
[43] Y. Liu, ''An improved faster R-CNN for object detection,'' in Proc. 11th Int. Symp. Comput. Intell. Design (ISCID), vol. 2, Dec. 2018, pp. 119–123.
[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, ''SSD: Single shot multibox detector,'' in Proc. Eur. Conf. Comput. Vis. (ECCV), 2016, pp. 21–37.
[45] A. T. Azar, Z. I. Khan, S. U. Amin, and K. M. Fouad, ''Hybrid global optimization algorithm for feature selection,'' Comput., Mater. Continua, vol. 74, no. 1, pp. 2021–2037, 2023.

DEEP R. KOTHADIYA received the bachelor's and master's degrees in computer science and engineering from Gujarat Technological University. He is currently pursuing the Ph.D. degree with the Charotar University of Science and Technology (CHARUSAT). He is an Assistant Professor with the U & P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Technology, CHARUSAT, and a Research Scholar with CHARUSAT and Prince Sultan University, Riyadh, Saudi Arabia. He has published many research papers, including one SCI-indexed paper. He is a Technical Reviewer of the International Journal of Computing and Digital Systems.

CHINTAN M. BHATT (Member, IEEE) was an Assistant Professor with the CE Department, CSPIT, CHARUSAT, for 11 years. He is currently an Assistant Professor with the Department of Computer Science and Engineering (CSE), School of Technology, Pandit Deendayal Energy University (PDEU). He is the author or coauthor of more than 80 publications in the areas of computer vision, the Internet of Things, and fog computing. He was involved in the successful organization of several special issues in SCI/Scopus journals. He has won several awards, including the CSI Award and the Best Paper Award for his CSI articles and conference publications.

AMJAD REHMAN (Senior Member, IEEE) received the Ph.D. and postdoctoral degrees (Hons.) from the Faculty of Computing, Universiti Teknologi Malaysia, Johor Bahru, Malaysia, with a specialization in forensic document analysis and security, in 2010 and 2011, respectively. He is currently a Senior Researcher with the Artificial Intelligence and Data Analytics Laboratory, Prince Sultan University, Riyadh, Saudi Arabia. He is a PI on several funded projects and has also completed projects funded by MOHE Malaysia and Saudi Arabia. His research interests include data mining, health informatics, and pattern recognition.

FATEN S. ALAMRI received the Ph.D. degree in system modeling and analysis in statistics from Virginia Commonwealth University, USA, in 2020. Her Ph.D. research was in Bayesian dose response modeling, experimental design, and nonparametric modeling. She is currently an Assistant Professor with the Department of Mathematical Sciences, College of Science, Princess Nourah bint Abdulrahman University. Her research interests include spatial statistics, environmental statistics, and brain imaging.

TANZILA SABA (Senior Member, IEEE) received the Ph.D. degree in document information security and management from the Faculty of Computing, Universiti Teknologi Malaysia (UTM), Malaysia, in 2012. She is currently the Associate Chair of the Information Systems Department, College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia. Her primary research interests include medical imaging, pattern recognition, data mining, MRI analysis, and soft computing.
