A Comprehensive Study on Sign Language Recognition
Abstract—In this paper, a comparative experimental assessment of computer vision-based methods for sign language recognition is conducted. By implementing the most recent deep neural network methods in this field, a thorough evaluation on multiple publicly available datasets is performed. The aim of the present study is to provide insights on sign language recognition, focusing on mapping non-segmented video streams to glosses. For this task, two new sequence training criteria, known from the fields of speech and scene text recognition, are introduced. Furthermore, a plethora of pretraining schemes is thoroughly discussed. Finally, a new RGB+D dataset for the Greek sign language is created. To the best of our knowledge, this is the first sign language dataset where sentence and gloss level annotations are provided for a video capture.

Index Terms—Sign Language Recognition, Greek sign language, Deep neural networks, stimulated CTC, conditional entropy CTC.

* Authors contributed equally.

I. INTRODUCTION

Spoken languages make use of the "vocal - auditory" channel, as they are articulated with the mouth and perceived with the ear. All writing systems also derive from, or are representations of, spoken languages. Sign languages (SLs) are different, as they make use of the "corporal - visual" channel, produced with the body and perceived with the eyes. SLs are not international and they are widely used by the communities of deaf people. They are natural languages, since they are developed spontaneously wherever deaf people have the opportunity to congregate and communicate mutually. SLs are not derived from spoken languages; they have their own independent vocabularies and their own grammatical structures [1]. The signs used by deaf people actually have internal structure in the same way as spoken words. Just as hundreds of thousands of English words are produced using a small number of different sounds, the signs of SLs are produced using a finite number of gestural features. Thus, signs are not holistic gestures but are rather analyzable as a combination of linguistically significant features. Similarly to spoken languages, SLs are composed of the following indivisible features:

• Manual features, i.e. hand shape, position, movement, orientation of the palm or fingers, and
• Non-manual features, namely eye gaze, head-nods/shakes, shoulder orientations, and various kinds of facial expressions, such as mouthing and mouth gestures.

Combinations of the above-mentioned features represent a gloss, which is the fundamental building block of a SL and represents the closest meaning of a sign [2]. SLs, similar to the spoken ones, include an inventory of flexible grammatical rules that govern both manual and non-manual features [3]. Both of them are used by signers simultaneously (and often with loose temporal structure) in order to construct sentences in a SL. Depending on the context, a specific feature may be the most critical factor towards interpreting a gloss. It can modify the meaning of a verb, provide spatial/temporal reference and discriminate between objects and people.

Due to the intrinsic difficulty of the deaf community to interact with the rest of society (according to [4], around 500,000 people use the American SL to communicate in the USA), the development of robust tools for automatic SL recognition would greatly alleviate this communication gap. As stated in [5], there is an increased demand for interdisciplinary collaboration, including the deaf community, and for the creation of representative public video datasets. Sign Language Recognition (SLR) can be defined as the task of inferring glosses performed by a signer from video captures. Even though there is a significant amount of work in the field of SLR, the lack of a complete experimental study is evident. Moreover, most publications do not report results on all available datasets or share their code. Thus, experimental results in the field of SL are rarely reproducible and lack interpretation. Apart from the inherent difficulties related to human motion analysis (e.g. differences in the appearance of the subjects, the human silhouette features, the execution of the same actions, the presence of occlusions, etc.) [6], automatic SLR exhibits the following key additional challenges:

• Exact position in surrounding space and context have a large impact on the interpretation of SL. For example, personal pronouns (e.g. "he", "she", etc.) do not exist. Instead, the signer points directly to any involved referent or, when reproducing the contents of a conversation, pronouns are modeled by twisting his/her shoulders or gaze.
• Many glosses are only distinguishable by their constituent non-manual features and they are typically difficult to be
accurately detected, since even very slight human movements may impose different grammatical or semantic interpretations depending on the context [7].
• The execution speed of a given gloss may indicate a different meaning or the particular signer's attitude. For instance, signers would not use two glosses to express "run quickly", but would simply speed up the execution of the involved signs [7].
• Signers often discard a gloss sub-feature, depending on the previously performed and subsequent glosses. Hence, different instances of the exact same gloss, originating even from the same signer, can be observed.
• For most SLs so far, very few formal standardization activities have been implemented, to the extent that signers of the same country exhibit distinguishable differences during the execution of a given gloss [8].

Historically, before the advent of deep learning methods, the focus was on identifying isolated glosses and gesture spotting. The developed methods often made use of hand-crafted techniques [9], [10]. For the spatial representation of the different sub-gloss components, they usually used handcrafted features and/or fusion of multiple modalities. Temporal modeling was achieved by classical sequence learning models, such as the Hidden Markov Model (HMM) [11], [12], [13] and hidden conditional random fields [14]. The rise of deep networks was met with a significant boost in performance for many video-related tasks, like human action recognition [15], [16], gesture recognition [17], [18], motion capturing [19], [20], etc. SLR is a task closely related to computer vision, which is the reason that most approaches tackling SLR have moved in this direction.

In this paper, SLR using Deep Neural Network (DNN) methods is investigated. The main contributions of this work are summarized as follows:

• A comprehensive, holistic and in-depth analysis of multiple literature DNN-based SLR methods is performed, in order to provide meaningful and detailed insights to the task at hand.
• Two new sequence learning training criteria are proposed, known from the fields of speech and scene text recognition.
• A new pretraining scheme is discussed, where transfer learning is compared to initial pseudo-alignments.
• A new publicly available large-scale RGB+D Greek Sign Language (GSL) dataset is introduced, containing real-life conversations that may occur in different public services. This dataset is particularly suitable for DNN-based approaches that typically require large quantities of expert annotated data.

The remainder of this paper is organized as follows: in Section II, related work is described. In Section III, an overview of the publicly available datasets in SLR is provided, along with the introduction of a new GSL dataset. In Section IV, a description of the implemented architectures is given. In Section V, a description of the proposed sequence training criteria is detailed. In Section VI, the performed experimental results are reported. Then, in Section VII, interpretations and insights of the conducted experiments are discussed. Finally, conclusions are drawn and future research directions are highlighted in Section VIII.

II. RELATED WORK

The various automatic SLR tasks, depending on the modeling's level of detail and the subsequent recognition step, can be roughly divided into the following categories (Fig. 1):

• Isolated SLR: Methods of this category aim to address the task of video segment classification (where the segment boundaries are provided), based on the fundamental assumption that a single gloss is present [9], [21], [18].
• Sign detection in continuous streams: The aim of these approaches is to detect a set of predefined glosses in a continuous video stream [11], [22], [23].
• Continuous SLR (CSLR): These methods aim at recognizing the sequence of glosses that are present in a continuous/non-segmented video sequence [3], [24], [25]. This category of approaches exhibits characteristics that are most suitable for the needs of real-life SLR applications [5]; hence, it has gained increased research attention and will be further discussed in the remainder of this section.

A. Continuous sign language recognition

By definition, CSLR is a task very similar to continuous human action recognition, where a sequence of glosses (instead of actions) needs to be identified in a continuous stream of video data. However, glosses typically exhibit a significantly shorter duration than actions (i.e. they may only involve a very small number of frames), while transitions among them are often too subtle for their temporal boundaries to be efficiently recognized. Additionally, glosses may only involve very detailed and fine-grained human movements (e.g. finger signs or facial expressions), while human actions usually refer to more concrete and extensive human body actions. The latter facts highlight the particular challenges that are present in the CSLR field [3].

Due to the lack of gloss-level annotations, CSLR is regularly cast as a weakly supervised learning problem. The majority of CSLR architectures usually consists of a feature extractor, followed by a temporal modeling mechanism [26], [27]. The feature extractor is used to compute feature representations from individual input frames (using 2D CNNs) or sets of neighbouring frames (using 3D CNNs). The temporal modeling scheme, in turn, enables the modeling of the SL unit feature representations (i.e. gloss-level, sentence-level). With respect to temporal modeling, sequence learning can be achieved using HMMs, Connectionist Temporal Classification (CTC) [28] or Dynamic Time Warping (DTW) [29] techniques. From the aforementioned categories, CTC has, in general, shown superior performance and the majority of works in CSLR has established CTC as the main sequence training criterion (for instance, HMMs may fail to efficiently model complex dynamic variations, due to expressiveness limitations [25]). However, CTC has the tendency to produce overconfident peak distributions, that are
prone to overfitting [30]. Moreover, CTC contributes only to a limited extent towards optimizing the feature extractor [31]. For these reasons, some recent approaches have adopted an iterative training optimization methodology. The latter essentially comprises a two-step process. In particular, a set of temporally-aligned pseudo-labels is initially estimated and used to guide the training of the feature extraction module. In the beginning, the pseudo-labels can either be estimated by statistical approaches [3] or extracted from a shallower model [25]. After training the model in an isolated setup, the trained feature extractor is utilized for the continuous SLR setup. This process may be performed in an iterative way, similarly to the Expectation Maximization (EM) algorithm [32]. Finally, CTC imposes a conditional independence constraint, where output predictions are independent, given the entire input sequence.
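For illustration purposes, the iterative optimization described above can be condensed into the following training-loop sketch. It is only a schematic outline: the helper callables passed as arguments (initial alignment estimation, isolated training, continuous CTC training and alignment decoding) are hypothetical placeholders, and the loop does not reproduce the exact procedure of any cited work.

```python
def iterative_cslr_training(model, videos, sentences,
                            estimate_initial_alignments,
                            train_isolated, train_continuous_ctc,
                            decode_alignments, num_iterations=3):
    """EM-like iterative CSLR training with temporally-aligned pseudo-labels."""
    # Step 0: initial pseudo-alignments, e.g. from a statistical prior
    # (uniform split of each sentence over its frames) or a shallower model.
    alignments = estimate_initial_alignments(videos, sentences)

    for _ in range(num_iterations):
        # Step 1: train the feature extractor in an isolated setup,
        # using the current frame-to-gloss pseudo-labels as targets.
        train_isolated(model.feature_extractor, videos, alignments)

        # Step 2: train the full model on non-segmented videos with CTC.
        train_continuous_ctc(model, videos, sentences)

        # Step 3 (E-step analogue): re-estimate the pseudo-alignments
        # with the improved model and repeat.
        alignments = decode_alignments(model, videos, sentences)

    return model
```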
B. 2D CNN-based CSLR approaches

One of the first deployed architectures in CSLR is based on [33], where a CNN-HMM network is proposed. GoogLeNet serves as the backbone architecture, fed with cropped hand regions and trained in an iterative manner. The same network architecture is deployed in a CSLR prediction setup in [34], where the CNN is trained using glosses as targets instead of hand shapes. Later on, in [35], the same authors extend their previous work by incorporating a Long Short-Term Memory unit (LSTM) [36] on top of the aforementioned network. In a more recent work [24], the authors present a three-stream CNN-LSTM-HMM network, using full frame, cropped dominant hand and signer's mouth region modalities. These models, since they employ HMMs for sequence learning, have to make strong initial assumptions in order to overcome the HMM's expressiveness limitations.

In [26], the authors introduce an end-to-end system for CSLR without iterative training. Their model follows a 2D CNN-LSTM architecture, replacing the HMM with LSTM-CTC. It consists of two streams, one responsible for processing the full frame sequences and one for processing only the signer's cropped dominant hand. In [27], the authors employ a 2D CNN-LSTM architecture and, in parallel with the LSTMs, a weakly supervised gloss-detection regularization network, consisting of stacked temporal 1D convolutions. The same authors in [25] extend their previous work by proposing a module composed of a series of temporal 1D CNNs followed by max pooling, placed between the feature extractor and the LSTM, while fully embracing the iterative optimization procedure. That module is able to produce compact representations for a video segment, which approximate the average duration of a gloss. Thereby, the LSTM captures the context information between gloss segments, instead of individual frames as in previous works. In [2], a hybrid 2D-3D CNN architecture [37] is developed. Features are extracted in a structured manner, where temporal dependencies are modeled by two LSTMs, without pretraining or using an iterative procedure. This approach, however, yields the best results only in continuous SL datasets where a plethora of training data is available.
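To make the generic structure of these 2D CNN-based CSLR models concrete, the following PyTorch sketch assembles a frame-level 2D CNN, temporal 1D convolutions with max pooling, a bidirectional LSTM and a per-time-step gloss classifier trained with CTC. It is an illustrative composition of the components discussed above, not a re-implementation of any specific cited model, and all layer sizes are placeholders.

```python
import torch.nn as nn
import torchvision.models as models

class CSLRPipeline(nn.Module):
    """Illustrative 2D CNN -> temporal 1D convs -> BiLSTM -> gloss classifier."""

    def __init__(self, num_glosses, hidden=512):
        super().__init__()
        backbone = models.resnet18(weights=None)   # frame-level feature extractor
        backbone.fc = nn.Identity()                # keep the 512-d pooled features
        self.cnn = backbone
        # Temporal module: groups neighbouring frame features into
        # segment-level (roughly gloss-level) representations.
        self.tconvs = nn.Sequential(
            nn.Conv1d(512, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.rnn = nn.LSTM(hidden, hidden // 2, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(hidden, num_glosses + 1)   # +1 for the blank

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)        # (B, T, 512)
        feats = self.tconvs(feats.transpose(1, 2)).transpose(1, 2)   # (B, T/4, hidden)
        feats, _ = self.rnn(feats)                                   # (B, T/4, hidden)
        return self.classifier(feats).log_softmax(-1)                # per-step gloss scores

# CTC training (illustrative): nn.CTCLoss expects (T', B, C) log-probabilities.
# log_probs = model(frames).transpose(0, 1)
# loss = nn.CTCLoss(blank=num_glosses)(log_probs, targets, input_lengths, target_lengths)
```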
C. 3D CNN-based CSLR approaches

One of the first works that employs 3D CNNs in SLR is introduced in [38]. The authors present a multi-modal approach for the task of isolated SLR, using spatio-temporal Convolutional 3D networks (C3D) [39], known from the research field of action recognition. The multi-modal representations are later fused and fed to a Support Vector Machine (SVM) [40] classifier. The C3D architecture has also been utilized in CSLR by [41]. The developed two-stream 3D CNN processes both full frame and cropped hand images. The full network, named LS-HAN, consists of the proposed 3D CNN, along with a hierarchical attention network, capable of latent space-based recognition modeling. In a later work [42], the authors propose the I3D [43] architecture for SLR. The model is deployed in an isolated SLR setup, with weights pretrained on action recognition datasets. The signer's body bounding box serves as input. For the evaluated dataset, it yielded state-of-the-art results. In [31], the authors adopted and enhanced the original I3D model with a gated Recurrent Neural Network (RNN). The whole architecture is a 3D CNN-RNN-CTC architecture trained iteratively with a dynamic pseudo-label decoding method. Their aim is to accommodate features from different time scales. In another work [44], the authors introduce the 3D ResNet architecture to extract features. Furthermore, they substituted the LSTM with stacked dilated temporal convolutions and CTC for sequence alignment and decoding. With this approach, they manage to have very large receptive fields while reducing time and space complexity, compared to LSTMs. Finally, in [45] Pu et al. propose a framework that also consists of a 3D ResNet backbone. The features are provided to both an attentional encoder-decoder network [46] and a CTC decoder for sequence learning. Both decoded outputs
TABLE I
LARGE-SCALE PUBLICLY AVAILABLE SLR DATASETS
Datasets Language Signers Classes Video instances Duration (hours) Resolution fps Type Modalities Year
Signum SI [48] German 25 780 19,500 55.3 776x578 30 continuous RGB 2007
Signum isol. [48] German 25 455 11,375 8.43 776x578 30 both RGB 2007
Signum subset [48] German 1 780 2,340 4.92 776x578 30 both RGB 2007
Phoenix SD [49] German 9 1,231 6,841 10.71 210x260 25 continuous RGB 2014
Phoenix SI [49] German 9 1,117 4,667 7.28 210x260 25 continuous RGB 2014
CSL SD [41] Chinese 50 178 25,000 100+ 1920x1080 30 continuous RGB+D 2016
CSL SI [41] Chinese 50 178 25,000 100+ 1920x1080 30 continuous RGB+D 2016
CSL isol. [38] Chinese 50 500 125,000 67.75 1920x1080 30 isolated RGB+D 2016
Phoenix-T [50] German 9 1,231 8,257 10.53 210x260 25 continuous RGB 2018
ASL 100 [42] English 189 100 5,736 5.55 varying varying isolated RGB 2019
ASL 1000 [42] English 222 1,000 25,513 24.65 varying varying isolated RGB 2019
GSL isol. (new) Greek 7 310 40,785 6.44 848x480 30 isolated RGB+D 2019
GSL SD (new) Greek 7 310 10,290 9.59 848x480 30 continuous RGB+D 2019
GSL SI (new) Greek 7 310 10,290 9.59 848x480 30 continuous RGB+D 2019
V. SEQUENCE LEARNING TRAINING CRITERIA FOR CSLR

A summary of the notations used in this paper is provided in this section, so as to enhance its readability and understanding. Let us denote by $U$ the label (i.e. gloss) vocabulary and by $blank$ the new blank token, representing the silence or transition between two consecutive labels. The extended vocabulary can be defined as $V = U \cup \{blank\} \in \mathbb{R}^L$, where $L$ is the total number of labels. From now on, given a sequence $f$ of length $F$, we denote its first and last $p$ elements by $f_{1:p}$ and $f_{p:F}$, respectively. An input frame sequence of length $N$ can be defined as $X = (x_1, \dots, x_N)$. The corresponding target sequence of labels (i.e. glosses) of length $K$ is defined as $y = (y_1, \dots, y_K)$. In addition, let $G_v = (g_v^1, \dots, g_v^T) \in \mathbb{R}^{L \times T}$ be the predicted output sequence of a softmax classifier, where $T \leq N$ and $v \in V$. $g_v^t$ can be interpreted as the probability of observing label $v$ at time-step $t$. Hence, $G_v$ defines a distribution over the set $V^T \in \mathbb{R}^{L \times T}$:

$$p(\pi|X) = \prod_{t=1}^{T} g_{\pi_t}^{t}, \quad \forall \pi \in V^T \qquad (1)$$

The elements of $V^T$ are referred to as paths and denoted by $\pi$. In order to map $y$ to $\pi$, one can define a mapping function $B: V^T \mapsto U^{\leq T}$, with $U^{\leq T}$ being the set of possible labellings. $B$ removes repeated labels and blanks from a given path. Similarly, one can denote the inverse operation of $B$ as $B^{-1}$, which maps target labels to all the valid paths. From this perspective, the conditional probability of $y$ is computed as:

$$p(y|X) = \sum_{\pi \in B^{-1}(y)} p(\pi|X) \qquad (2)$$
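To make the notation concrete, the following minimal Python sketch implements the collapsing function $B$ (removal of repeated labels and blanks) and evaluates (2) by brute-force enumeration of every path in $V^T$. Such enumeration is only feasible for toy vocabularies and sequence lengths, which is precisely why the recursive formulation of the next subsection is needed.

```python
import itertools
import numpy as np

BLANK = 0  # index of the blank token in the extended vocabulary V

def collapse(path):
    """Mapping B: merge repeated labels, then remove blanks."""
    out, prev = [], None
    for v in path:
        if v != prev and v != BLANK:
            out.append(v)
        prev = v
    return tuple(out)

def p_y_given_x(probs, y):
    """Brute-force p(y|X): sum of p(pi|X) over all paths pi with B(pi) = y.

    probs: (T, L) array, probs[t, v] = g_v^t (softmax outputs per time-step).
    """
    T, L = probs.shape
    total = 0.0
    for path in itertools.product(range(L), repeat=T):   # all paths in V^T
        if collapse(path) == tuple(y):
            total += np.prod([probs[t, v] for t, v in enumerate(path)])
    return total

# Toy example: 4 time-steps, extended vocabulary {blank, gloss 1, gloss 2}
g = np.full((4, 3), 1.0 / 3.0)
print(p_y_given_x(g, [1, 2]))   # probability of the target gloss sequence [1, 2]
```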
A. Traditional CTC criterion

Connectionist Temporal Classification (CTC) [28] is widely utilized for labelling unsegmented sequences. The time complexity of (2) is $O(L^{NK})$, which means that the amount of valid paths grows exponentially with $N$. To efficiently calculate $p(y|X)$, a recursive formula is derived, which exploits the existence of common sub-paths. Furthermore, to allow for blanks in the paths, a modified gloss sequence $y'$ of length $K' = 2K + 1$ is used, obtained by adding blanks before and after each gloss in $y$. The forward and backward probabilities $\alpha_t(s)$ of $y'_{1:s}$ at $t$ and $\beta_t(s)$ of $y'_{s:K'}$ at $t$ are defined as:

$$\alpha_t(s) \triangleq \sum_{B(\pi_{1:t}) = y'_{1:s}} \prod_{t'=1}^{t} g_{\pi_{t'}}^{t'} \qquad (3)$$

$$\beta_t(s) \triangleq \sum_{B(\pi_{t:T}) = y'_{s:K'}} \prod_{t'=t}^{T} g_{\pi_{t'}}^{t'} \qquad (4)$$

Therefore, to calculate $p(y|X)$ for any $t$, we sum over all $s$ in $y'$ as:

$$p(y|X) = \sum_{s=1}^{K'} \frac{\alpha_t(s)\,\beta_t(s)}{g_{y'_s}^{t}} \qquad (5)$$

Finally, the CTC criterion is derived as:

$$L_{ctc} = -\log p(y|X) \qquad (6)$$

The error signal of $L_{ctc}$ with respect to $g_v^t$ is:

$$\frac{\partial L_{ctc}}{\partial g_v^t} = -\frac{1}{p(y|X)\, g_v^t} \sum_{\{\pi \in B^{-1}(y),\, \pi_t = v\}} p(\pi|X) \qquad (7)$$

From (7) it can be observed that the error signal is proportional to the fraction of all valid paths. As soon as a path dominates the rest, the error signal enforces all the probabilities to concentrate on a single path. Moreover, (1) and (7) indicate that the probabilities of a gloss occurring at following time-steps are independent, which is known as the conditional independence assumption. For these reasons, two learning criteria are introduced in CSLR: a) one that handles the ambiguous segmentation boundaries of adjacent glosses, and b) one that is able to model the intra-gloss dependencies, by incorporating a learnable language model during training (as opposed to other approaches that use it only during the CTC decoding stage).
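The forward recursion behind (3) and (5) can be written out directly. The sketch below is a minimal re-implementation of the standard CTC forward pass [28] over the blank-extended sequence $y'$, included only to make explicit the quantities that the criteria of the following subsections manipulate; it assumes non-empty target sequences.

```python
import numpy as np

BLANK = 0

def ctc_forward(probs, y):
    """Compute p(y|X) with the CTC forward recursion.

    probs: (T, L) softmax outputs g_v^t; y: target gloss indices (no blanks).
    """
    T = probs.shape[0]
    y_ext = [BLANK]                      # y': blank-extended target sequence
    for g in y:
        y_ext.extend([g, BLANK])
    S = len(y_ext)                       # S = 2K + 1

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, y_ext[0]]     # start with a blank ...
    alpha[0, 1] = probs[0, y_ext[1]]     # ... or with the first gloss
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # a label may also be reached by skipping the previous blank,
            # unless it repeats the label two positions back
            if s > 1 and y_ext[s] != BLANK and y_ext[s] != y_ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, y_ext[s]]
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]   # end in blank or last gloss

# Consistency check against the brute-force enumeration shown earlier:
g = np.full((4, 3), 1.0 / 3.0)
print(ctc_forward(g, [1, 2]))
```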
B. Entropy Regularization CTC

The CTC criterion can be extended [30] based on maximum conditional entropy [52], by adding an entropy regularization term $H$:

$$H(p(\pi|y, X)) = -\sum_{\pi \in B^{-1}(y)} p(\pi|X, y) \log p(\pi|X, y) = -\frac{Q(y)}{p(y|X)} + \log p(y|X), \qquad (8)$$

where $Q(y) = \sum_{\pi \in B^{-1}(y)} p(\pi|X) \log p(\pi|X)$.

$H$ aims to prevent the entropy of the non-dominant paths from decreasing rapidly. Consequently, the entropy regularization CTC criterion (EnCTC) is formulated as:

$$L_{enctc} = L_{ctc} - \phi H(p(\pi|y, X)), \qquad (9)$$

where $\phi$ is a hyperparameter. The introduction of the entropy term $H$ prevents the error signal from gathering into the dominant path, and rather encourages the exploration of nearby ones. By increasing the probabilities of the alternative paths, the peaky distribution problem is alleviated.

C. Stimulated CTC

Stimulated learning [53], [54], [55] augments the training process by regularizing the activations $h_t$ of the sequence learning RNN. Stimulated CTC (StimCTC) [56] addresses the conditional independence assumption of traditional CTC. To generate the appropriate stimuli, an auxiliary uni-directional Language Model RNN (RNN-LM) is utilized. The RNN-LM encoded hidden states $h_k$ encapsulate the sentence's history up to gloss $k$. $h_t$ is stimulated by utilizing the non-blank probabilities $\alpha'_t$ and $\beta'_t \in \mathbb{R}^K$. Then, the weighting factor $\gamma_t$ can be calculated as:

$$\gamma_t = \frac{\beta'_t\, \alpha'_t}{\beta'_t \cdot \alpha'_t} \qquad (10)$$
$$L_{stim} = L_{ctc} + \lambda L_{lm} + \theta L_{stimuli}, \qquad (12)$$

where $\lambda$ and $\theta$ are hyper-parameters. The described criteria can be combined, resulting in the Entropy Stimulated CTC (EnStimCTC) criterion:

$$L_{enstim} = L_{ctc} - \phi H(p(\pi|y)) + \lambda L_{lm} + \theta L_{stimuli} \qquad (13)$$
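In a training loop, the combination in (13) reduces to a weighted sum of loss terms. The sketch below uses torch.nn.CTCLoss for $L_{ctc}$ and takes the entropy, language-model and stimuli terms as precomputed scalar tensors, since their exact computation follows [30] and [56] and is omitted here; the default weight values are placeholders rather than tuned hyper-parameters.

```python
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def enstim_ctc_loss(log_probs, targets, input_lens, target_lens,
                    entropy_term, lm_loss, stimuli_loss,
                    phi=0.1, lam=1.0, theta=1.0):
    """Weighted combination of (13): L_ctc - phi*H + lambda*L_lm + theta*L_stimuli.

    log_probs: (T, B, C) log-softmax outputs, as expected by nn.CTCLoss.
    entropy_term, lm_loss, stimuli_loss: scalar tensors computed as in [30], [56].
    phi, lam, theta: EnStimCTC hyper-parameters (the defaults are placeholders).
    """
    l_ctc = ctc(log_probs, targets, input_lens, target_lens)
    return l_ctc - phi * entropy_term + lam * lm_loss + theta * stimuli_loss
```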
VI. EXPERIMENTAL EVALUATION

In order to provide a fair evaluation, we re-implemented the selected approaches and evaluated them on multiple large-scale datasets, in both isolated and continuous SLR. The re-implementations are based on the original authors' guidelines and any modifications are explicitly referenced. For the continuous setup, the criteria CTC, EnCTC, and EnStimCTC are evaluated in all architectures. For a fair comparison between different models, we opt to use the full frame modality, since it is the common modality between the selected datasets and it is more suitable for real-life applications. We omit the iterative optimization process; instead, we pretrain each model on the respective dataset's isolated version, if present. Otherwise, extracted pseudo-alignments from other models (i.e. Phoenix) are used for isolated pretraining (implementations and experimental results are publicly available to enforce reproducibility in SLR³).

³ https://zenodo.org/record/3941811#.XxrZXZZRU5k
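The pretraining step itself is plain transfer learning: a 2D backbone is first trained for isolated gloss classification and its weights are then copied into the feature extractor of the continuous model. A minimal PyTorch sketch of this hand-over is given below; the choice of backbone, the 310-class head (matching the GSL isolated vocabulary of Table I) and the checkpoint path are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# 1) Isolated pretraining: 2D backbone plus a gloss classification head.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(512, 310)      # e.g. the 310 isolated GSL gloss classes
# ... train `backbone` on the isolated subset, then store the weights
torch.save(backbone.state_dict(), "isolated_backbone.pth")   # hypothetical path

# 2) CSLR training: reuse the weights as initialization of the frame-level
#    feature extractor of the continuous model.
cslr_extractor = models.resnet18(weights=None)
cslr_extractor.fc = nn.Identity()      # features only, no classification head
state = torch.load("isolated_backbone.pth")
state.pop("fc.weight", None)           # drop the isolated classification head
state.pop("fc.bias", None)
cslr_extractor.load_state_dict(state, strict=False)
```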
A. Datasets and Evaluation metrics

The following datasets have been chosen for experimental evaluation: ASL 100, ASL 1000, CSL isol. and GSL isol. for the isolated setup, and Phoenix SD, Phoenix SI, CSL SD, CSL SI, GSL SD and GSL SI for the CSLR setup. To evaluate recognition performance on continuous datasets, the word error rate (WER) metric has been adopted, which quantifies the similarity between the predicted glosses and the ground truth gloss sequence. WER measures the least number of operations needed to transform the aligned predicted sequence into the ground truth and can be defined as:

$$WER = \frac{S + D + I}{N}, \qquad (14)$$

where $S$ is the total number of substitutions, $D$ is the total number of deletions, $I$ is the total number of insertions and $N$ is the total number of glosses in the ground truth.
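Equation (14) corresponds to a standard edit-distance dynamic program over gloss sequences; a minimal, self-contained implementation is given below, with the toy glosses taken from the example of Fig. 4.

```python
def wer(reference, hypothesis):
    """Word error rate between two gloss sequences (lists of gloss labels)."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum number of substitutions, deletions and insertions
    # needed to turn hypothesis[:j] into reference[:i]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)

print(wer("PAPER CHECK PROOF".split(), "PAPER APPROVAL PROOF".split()))  # 1 substitution / 3 glosses
```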
The same data preprocessing methods are used for all datasets. Each frame is normalised by the mean and standard deviation of the ImageNet dataset. To increase the variability of the training videos, the following data augmentation techniques are adopted. Frames are resized to 256x256 and cropped at a random position to 224x224. Random frame sampling is used for up to 80% of the video length. Moreover, random jittering of the brightness, contrast, saturation and hue values of each frame is applied. The models are trained with the Adam optimizer with initial learning rate λ0 = 10^-4, which is reduced to λi = 10^-5 when the validation loss starts to plateau. For isolated SLR experiments, the batch size is set to 2. Videos are rescaled to a fixed length that is equal to the average gloss length of each dataset. For CSLR experiments, videos are downsampled to a maximum length of 250 frames, if necessary. The batch size is set to 1, due to GPU memory constraints. The experiments are conducted on an NVIDIA GeForce GTX-1080 Ti GPU with 12 GB of memory and 32 GB of RAM. All models, depending on the dataset, require 10 to 25 epochs to converge.
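These preprocessing choices map directly onto standard torchvision transforms and a PyTorch optimizer/scheduler. The sketch below mirrors the stated settings (ImageNet normalization, 256 to 224 random cropping, color jitter, Adam with an initial learning rate of 1e-4 reduced towards 1e-5 on plateau); the jitter strengths and the scheduler patience are illustrative, as the exact values are not reported.

```python
import torch
from torchvision import transforms

# Frame-level preprocessing / augmentation (jitter strengths are illustrative).
frame_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Adam with initial learning rate 1e-4, reduced to 1e-5 when the validation
# loss plateaus (patience is an illustrative choice; factor 0.1 gives the step).
model = torch.nn.Linear(10, 10)        # placeholder module, stands in for a SLR model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3, min_lr=1e-5)
# after each epoch: scheduler.step(validation_loss)
```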
The referenced models, depending on the dataset, have been modified as follows. In SubUNets, AlexNet [58] is used as the feature extractor instead of CaffeNet [59], as they share a similar architecture. Additionally, for the CSL and GSL datasets, we reduce the bidirectional LSTM hidden size by half, due to computational space complexity. In the isolated setup, the LSTM layers of SubUNets are trained along with the feature extractor. In order to achieve the maximum performance of GoogLeNet+TConvs, a manual customization of the TConvs 1D CNN kernels and pooling sizes is necessary. The intuition behind this is that the receptive field should approximately cover the average gloss duration. Each 1D CNN layer includes 1024 filters. In CSL, the 1D CNNs are set with kernel size 7, stride 1 and the max-pooling layers with kernel sizes and strides equal to 3, to cover the average gloss duration of 58 frames. For the GSL dataset, the TConvs are tuned with kernel sizes equal to 5 and pooling sizes equal to 3. In order to deploy 3D-ResNet and I3D in a CSLR setup, a sliding window technique is adopted on the input sequence. The window size and stride are selected to cover the average gloss duration. Then, a 2-layer bidirectional LSTM is added to model the long-term temporal correlations in the feature sequence. In CSL, the window size is set to 50 and the stride to 36, whereas in GSL the window size is set to 25, with stride equal to 12. I3D and 3D-ResNet are initialized with weights pretrained on Kinetics. Also, for the 3D-ResNet method, we omit the attentional decoder from the original paper, keeping the 3D-
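The sliding-window deployment of the 3D CNNs described above can be sketched as follows: the video is unfolded into overlapping clips (e.g. window 25 and stride 12 for GSL), each clip is encoded by the 3D backbone, and a 2-layer bidirectional LSTM models the resulting clip-level feature sequence. The backbone module and its feature dimension are placeholders for the I3D / 3D-ResNet encoders.

```python
import torch.nn as nn

class SlidingWindow3DCSLR(nn.Module):
    """Encode overlapping video clips with a 3D CNN and model the clip
    sequence with a 2-layer bidirectional LSTM (illustrative wrapper)."""

    def __init__(self, backbone3d, feat_dim, num_glosses,
                 window=25, stride=12, hidden=256):
        super().__init__()
        self.backbone3d = backbone3d     # placeholder for an I3D / 3D-ResNet encoder
        self.window, self.stride = window, stride
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses + 1)   # +1 for the blank

    def forward(self, video):            # video: (B, 3, T, H, W)
        # unfold the temporal axis into overlapping windows
        clips = video.unfold(2, self.window, self.stride)    # (B, 3, Nw, H, W, window)
        clips = clips.permute(0, 2, 1, 5, 3, 4)               # (B, Nw, 3, window, H, W)
        b, nw = clips.shape[:2]
        feats = self.backbone3d(clips.flatten(0, 1))          # assumed (B*Nw, feat_dim)
        feats, _ = self.rnn(feats.view(b, nw, -1))            # (B, Nw, 2*hidden)
        return self.classifier(feats).log_softmax(-1)         # per-window gloss scores
```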
TABLE III
FINE-TUNING IN CSLR DATASETS. RESULTS ARE REPORTED IN WER
Method | Phoenix SD (Val. / Test) | Phoenix SI (Val. / Test) | CSL SI (Test) | CSL SD (Test) | GSL SI (Test) | GSL SD (Test)
I3D (Kinetics) | 53.81 / 51.27 | 65.53 / 62.38 | 23.19 | 72.39 | 34.52 | 75.42
I3D (Kinetics + ASL 1000) | 40.89 / 40.49 | 59.60 / 58.36 | 16.73 | 64.72 | 27.09 | 71.05
TABLE V
REPORTED RESULTS IN CONTINUOUS SD SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED DATASET.
TABLE VI
REPORTED RESULTS IN CONTINUOUS SI SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED DATASET.
Fig. 4. Visual comparison of ground truth alignments (ground truth: I(1) PAPER EXCUSE CHECK AFTER YOU PAPER PROOF I_GIVE_YOU) with the predictions of the proposed training criteria. GoogLeNet+TConvs is used for evaluation in the GSL SD dataset.
Fig. 5. Comparison of validation WER of CTC and EnStimCTC criteria with GoogLeNet+TConvs in CSL SI and Phoenix SD datasets.

semantic representations (more abstract output features). In the opposite direction, deploying skip connections maintains previous layers' feature maps that correspond to lower semantic content, which does not assist isolated SLR.

Modeling intermediate short temporal dependencies was experimentally shown (Tables V, VI) to enhance CSLR performance. The implemented 3D CNN architectures directly capture spatio-temporal correlations as intermediate representations. The design choice of providing the input video in a sliding window restricts the network's temporal receptive field. Based on a sequential structure, architectures such as GoogLeNet+TConvs achieve the same goal by grouping consecutive spatial features. Such a sequential approach can prove beneficial in many datasets, given that the spatial filters are well-trained. For this reason, such approaches require heavy pretraining of the backbone network. The superiority in performance of the implemented sequential approach is justified by the careful manual tuning of temporal kernels and strides. However, manual design significantly downgrades the advantages of transfer learning. The sliding window technique can be easily adapted based on the particularities of each dataset, making 3D CNNs more scalable and suitable for real-life applications. To summarize, both techniques aim to ... variations of the traditional CTC.

B. Comparison between CTC variations

The reported experimental results exhibit the negative influence of CTC's drawbacks (overconfident paths and conditional independence assumption) in CSLR. EnCTC's contribution to alleviating the overconfident paths is illustrated in Fig. 4. The ground truth gloss "PROOF" is recognized with the introduction of H, instead of "APPROVAL". The latter has a six times higher occurrence frequency. After a careful examination of the aforementioned signs, one can notice that these signs are close in terms of hand position and execution speed, which justifies the depicted predictions. Furthermore, it is observed that EnCTC boosts performance mostly in CSL SI and GSL SI, due to their limited diversity and vocabulary. It can be highlighted that EnCTC did not boost SubUNets' performance. The latter generates per-frame predictions (T = N), whereas the rest of the approaches generate grouped predictions (T ≈ N/4). This results in a significantly larger space of possible alignments that is harder for this criterion to explore. From Fig. 4, it can be visually validated that EnStimCTC remedies the conditional independence assumption. For instance, the gloss "CHECK" was only recognised with stimulated training. By bringing closer predictions that correspond to the same target gloss, the intra-gloss dependencies are effectively modeled. In parallel, the network was also able to correctly classify transitions between glosses as blank. It should also be noted that EnStimCTC does not increase time and space complexity during inference.
C. Evaluation of pretraining schemes

Due to the limited contribution of the CTC gradients to the feature extractor, an effective pretraining is mandatory. As shown in Fig. 3, pretraining significantly affects the starting WER of each model. Without pretraining, all models congregate around the most dominant glosses, which significantly slows down the CSLR training process and limits the learning capacity of the network. Fully supervised pretraining is interpreted as a domain shift towards the distribution of the SL dataset that speeds up the early training stage in CSLR. Regarding the pretraining scheme, in datasets with limited vocabulary and gloss sequences (i.e. CSL), inferring initial pseudo-alignments proved beneficial, as shown in Table IV. This is explained by the fact that the data distribution of the isolated subset had different particularities, such as sign execution speed. However, producing initial pseudo-alignments is time consuming. Hence, the small deterioration in performance is an acceptable trade-off between recognition rate and time to train. The proposed GSL dataset contains nearly double the vocabulary and roughly three times the number of unique gloss sentences, with fewer training instances. More importantly, the isolated subset draws instances from the same distribution as the continuous one. In such cases, it can be stated that proximal transfer learning significantly outperforms training with pseudo-alignments (56.90% relative improvement in the GSL dataset).

VIII. CONCLUSIONS AND FUTURE WORK

In this paper, an in-depth analysis of the most characteristic DNN-based SLR model architectures was conducted. Through extensive experiments on three publicly available datasets, a comparative evaluation of the most representative SLR architectures was presented. Alongside this evaluation, a new publicly available large-scale RGB+D dataset was introduced for the Greek SL, suitable for SLR benchmarking. Two CTC variations known from other application fields, EnCTC and StimCTC, were evaluated for CSLR and it was noticed that their combination tackled two important issues, the ambiguous boundaries of adjacent glosses and the intra-gloss dependencies. Moreover, a pretraining scheme was provided, in which transfer learning from a proximal isolated dataset can be a good initialization for CSLR training. The main finding of this work was that, while 3D CNN-based architectures were more effective in isolated SLR, 2D CNN-based models with an intermediate per-gloss representation achieved superior results in the majority of the CSLR datasets. In particular, our implementation of GoogLeNet+TConvs, with the proposed pretraining scheme and the EnStimCTC criterion, yielded state-of-the-art results in CSL SI.

Concerning future work, efficient ways of integrating depth information that will guide the feature extraction training phase can be devised. Moreover, another promising direction is to investigate the incorporation of more sequence learning modules, like attention-based approaches, in order to adequately model inter-gloss dependencies. Future SLR architectures may be enhanced by fusing highly semantic representations that correspond to the manual and non-manual features of SL, similar to humans. Finally, it would be of great importance for the communication between deaf and non-deaf people to bridge the gap between SLR and SL translation. Advancements in this domain will drive research to SL translation as well as SL-to-SL translation, which have not yet been thoroughly studied.

IX. ACKNOWLEDGEMENTS

This work was supported by the Greek General Secretariat of Research and Technology under contract Τ1ΕΔΚ-02469 EPIKOINONO.

The authors would like to express their gratitude to Vasileios Angelidis, Chrysoula Kyrlou and Georgios Gkintikas from the Greek sign language center⁴ for their valuable feedback and contribution to the Greek sign language recordings.

⁴ https://www.keng.gr/

REFERENCES

[1] W. Sandler and D. Lillo-Martin, Sign Language and Linguistic Universals. Cambridge University Press, 2006.
[2] Z. Yang, Z. Shi, X. Shen, and Y.-W. Tai, "Sf-net: Structured feature network for continuous sign language recognition," arXiv preprint arXiv:1908.01341, 2019.
[3] O. Koller, J. Forster, and H. Ney, "Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers," Computer Vision and Image Understanding, vol. 141, pp. 108–125, 2015.
[4] R. E. Mitchell, T. A. Young, B. Bachleda, and M. A. Karchmer, "How many people use ASL in the United States? Why estimates need updating," Sign Language Studies, vol. 6, no. 3, pp. 306–335, 2006.
[5] D. Bragg, O. Koller, M. Bellard, L. Berke, P. Boudrealt, A. Braffort, N. Caselli, M. Huenerfauth, H. Kacorri, T. Verhoef et al., "Sign language recognition, generation, and translation: An interdisciplinary perspective," arXiv preprint arXiv:1908.08597, 2019.
[6] G. T. Papadopoulos and P. Daras, "Human action recognition using 3d reconstruction data," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1807–1823, 2016.
[7] H. Cooper, B. Holt, and R. Bowden, "Sign language recognition," in Visual Analysis of Humans. Springer, 2011, pp. 539–562.
[8] F. Ronchetti, F. Quiroga, C. A. Estrebou, L. C. Lanzarini, and A. Rosete, "Lsa64: An Argentinian sign language dataset," in XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016.
[9] M. W. Kadous et al., "Machine recognition of Auslan signs using PowerGloves: Towards large-lexicon recognition of sign language," in Proceedings of the Workshop on the Integration of Gesture in Language and Speech, vol. 165, 1996.
[10] C. Wang, Z. Liu, and S.-C. Chan, "Superpixel-based hand gesture recognition with Kinect depth camera," IEEE Transactions on Multimedia, vol. 17, no. 1, pp. 29–39, 2014.
[11] G. D. Evangelidis, G. Singh, and R. Horaud, "Continuous gesture recognition from articulated poses," in European Conference on Computer Vision. Springer, 2014, pp. 595–607.
[12] J. Zhang, W. Zhou, C. Xie, J. Pu, and H. Li, "Chinese sign language recognition with adaptive HMM," in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[13] O. Koller, S. Zargaran, H. Ney, and R. Bowden, "Deep sign: Enabling robust statistical continuous sign language recognition via hybrid CNN-HMMs," International Journal of Computer Vision, vol. 126, no. 12, pp. 1311–1325, 2018.
[14] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell, "Hidden conditional random fields for gesture recognition," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 2. IEEE, 2006, pp. 1521–1527.
[15] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, "Long-term recurrent convolutional networks for visual recognition and description," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2625–2634.
[16] C. Feichtenhofer, A. Pinz, and A. Zisserman, "Convolutional two-stream network fusion for video action recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.
[17] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, "Hand gesture recognition with 3d convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 1–7.
[18] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, "Using convolutional 3d neural networks for user-independent continuous gesture recognition," in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 49–54.
[19] D. S. Alexiadis, A. Chatzitofis, N. Zioulis, O. Zoidi, G. Louizis, D. Zarpalas, and P. Daras, "An integrated platform for live 3d human reconstruction and motion capturing," IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 4, pp. 798–813, 2016.
[20] D. S. Alexiadis and P. Daras, "Quaternionic signal processing techniques for automatic evaluation of dance performances from mocap data," IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1391–1406, 2014.
[21] H. Cooper, E.-J. Ong, N. Pugeault, and R. Bowden, "Sign language recognition using sub-units," Journal of Machine Learning Research, vol. 13, no. Jul, pp. 2205–2231, 2012.
[22] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, "Moddrop: Adaptive multi-modal gesture recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2015.
[23] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and J.-M. Odobez, "Deep dynamic neural networks for multimodal gesture segmentation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1583–1597, 2016.
[24] O. Koller, C. Camgoz, H. Ney, and R. Bowden, "Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[25] R. Cui, H. Liu, and C. Zhang, "A deep neural framework for continuous sign language recognition by iterative training," IEEE Transactions on Multimedia, 2019.
[26] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, "Subunets: End-to-end hand shape and continuous sign language recognition," in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 3075–3084.
[27] R. Cui, H. Liu, and C. Zhang, "Recurrent convolutional neural networks for continuous sign language recognition by staged optimization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7361–7369.
[28] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[29] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[30] H. Liu, S. Jin, and C. Zhang, "Connectionist temporal classification with maximum entropy regularization," in Advances in Neural Information Processing Systems, 2018, pp. 831–841.
[31] H. Zhou, W. Zhou, and H. Li, "Dynamic pseudo label decoding for continuous sign language recognition," in 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019, pp. 1282–1287.
[32] T. K. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[33] O. Koller, H. Ney, and R. Bowden, "Deep hand: How to train a CNN on 1 million hand images when your data is continuous and weakly labelled," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3793–3802.
[34] O. Koller, O. Zargaran, H. Ney, and R. Bowden, "Deep sign: Hybrid CNN-HMM for continuous sign language recognition," in Proceedings of the British Machine Vision Conference 2016, 2016.
[35] O. Koller, S. Zargaran, and H. Ney, "Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4297–4305.
[36] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] J. Pu, W. Zhou, and H. Li, "Sign language recognition with multi-modal features," in Pacific Rim Conference on Multimedia. Springer, 2016, pp. 252–261.
[39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3d convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[40] J. Shawe-Taylor and N. Cristianini, "Support vector machines," in An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, pp. 93–112, 2000.
[41] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, "Video-based sign language recognition without temporal segmentation," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[42] H. R. V. Joze and O. Koller, "Ms-asl: A large-scale data set and benchmark for understanding American sign language," arXiv preprint arXiv:1812.01053, 2018.
[43] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," CoRR, vol. abs/1705.07750, 2017.
[44] J. Pu, W. Zhou, and H. Li, "Dilated convolutional network with iterative optimization for continuous sign language recognition," in IJCAI, vol. 3, 2018, p. 7.
[45] J. Pu, W. Zhou, and H. Li, "Iterative alignment network for continuous sign language recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4165–4174.
[46] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[47] M. Cuturi and M. Blondel, "Soft-dtw: A differentiable loss function for time-series," in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 894–903.
[48] U. Von Agris, M. Knorr, and K.-F. Kraiss, "The significance of facial features for automatic sign language recognition," in 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition. IEEE, 2008, pp. 1–6.
[49] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, "Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather," in LREC, 2014, pp. 1911–1916.
[50] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, "Neural sign language translation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7784–7793.
[51] A. Baker, B. van den Bogaerde, R. Pfau, and T. Schermer, The Linguistics of Sign Languages: An Introduction. John Benjamins Publishing Company, 2016.
[52] E. T. Jaynes, "Information theory and statistical mechanics," Physical Review, vol. 106, no. 4, p. 620, 1957.
[53] S. Tan, K. C. Sim, and M. Gales, "Improving the interpretability of deep neural networks with stimulated learning," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 617–623.
[54] C. Wu, P. Karanasou, M. J. Gales, and K. C. Sim, "Stimulated deep neural network for speech recognition," in Interspeech 2016, 2016, pp. 400–404.
[55] C. Wu, M. J. Gales, A. Ragni, P. Karanasou, and K. C. Sim, "Improving interpretability and regularization in deep learning," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 256–265, 2017.
[56] J. Heymann, K. C. Sim, and B. Li, "Improving CTC using stimulated learning for sequence modeling," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5701–5705.
[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.