



A Comprehensive Study on Sign Language Recognition Methods

Nikolas Adaloglou1*, Theocharis Chatzis1*, Ilias Papastratis1*, Andreas Stergioulas1*, Georgios Th. Papadopoulos1, Member, IEEE, Vassia Zacharopoulou2, George J. Xydopoulos2, Klimis Antzakas2, Dimitris Papazachariou2, and Petros Daras1, Senior Member, IEEE
1 Centre for Research and Technology Hellas
2 University of Patras
* Authors contributed equally.

arXiv:2007.12530v1 [cs.CV] 24 Jul 2020. Published in IEEE Transactions on Multimedia, April 2021, DOI: 10.1109/TMM.2021.3070438.

Abstract—In this paper, a comparative experimental assessment of computer vision-based methods for sign language recognition is conducted. By implementing the most recent deep neural network methods in this field, a thorough evaluation on multiple publicly available datasets is performed. The aim of the present study is to provide insights on sign language recognition, focusing on mapping non-segmented video streams to glosses. For this task, two new sequence training criteria, known from the fields of speech and scene text recognition, are introduced. Furthermore, a plethora of pretraining schemes is thoroughly discussed. Finally, a new RGB+D dataset for the Greek sign language is created. To the best of our knowledge, this is the first sign language dataset where sentence and gloss level annotations are provided for a video capture.

Index Terms—Sign Language Recognition, Greek sign language, Deep neural networks, stimulated CTC, conditional entropy CTC.

I. INTRODUCTION

Spoken languages make use of the "vocal - auditory" channel, as they are articulated with the mouth and perceived with the ear. All writing systems also derive from, or are representations of, spoken languages. Sign languages (SLs) are different as they make use of the "corporal - visual" channel, produced with the body and perceived with the eyes. SLs are not international and they are widely used by the communities of deaf people. They are natural languages, since they are developed spontaneously wherever deaf people have the opportunity to congregate and communicate mutually. SLs are not derived from spoken languages; they have their own independent vocabularies and their own grammatical structures [1]. The signs used by deaf people actually have internal structure in the same way as spoken words. Just as hundreds of thousands of English words are produced using a small number of different sounds, the signs of SLs are produced using a finite number of gestural features. Thus, signs are not holistic gestures but are rather analyzable, as a combination of linguistically significant features. Similarly to spoken languages, SLs are composed of the following indivisible features:

• Manual features, i.e. hand shape, position, movement, orientation of the palm or fingers.
• Non-manual features, namely eye gaze, head-nods/shakes, shoulder orientations, and various kinds of facial expression such as mouthing and mouth gestures.

Combinations of the above-mentioned features represent a gloss, which is the fundamental building block of a SL and represents the closest meaning of a sign [2]. SLs, similar to the spoken ones, include an inventory of flexible grammatical rules that govern both manual and non-manual features [3]. Both of them are simultaneously (and often with loose temporal structure) used by signers, in order to construct sentences in a SL. Depending on the context, a specific feature may be the most critical factor towards interpreting a gloss. It can modify the meaning of a verb, provide spatial/temporal reference and discriminate between objects and people.

Due to the intrinsic difficulty of the deaf community to interact with the rest of the society (according to [4], around 500,000 people use the American SL to communicate in the USA), the development of robust tools for automatic SL recognition would greatly alleviate this communication gap. As stated in [5], there is an increased demand for interdisciplinary collaboration including the deaf community and for the creation of representative public video datasets.

Sign Language Recognition (SLR) can be defined as the task of inferring glosses performed by a signer from video captures. Even though there is a significant amount of work in the field of SLR, the lack of a complete experimental study is profound. Moreover, most publications do not report results on all available datasets or share their code. Thus, experimental results in the field of SL are rarely reproducible and lacking interpretation. Apart from the inherent difficulties related to human motion analysis (e.g. differences in the appearance of the subjects, the human silhouette features, the execution of the same actions, the presence of occlusions, etc.) [6], automatic SLR exhibits the following key additional challenges:

• Exact position in surrounding space and context have a large impact on the interpretation of SL. For example, personal pronouns (e.g. "he", "she", etc.) do not exist. Instead, the signer points directly to any involved referent or, when reproducing the contents of a conversation, pronouns are modeled by twisting his/her shoulders or gaze.

• Many glosses are only distinguishable by their constituent non-manual features, and they are typically difficult to detect accurately, since even very slight human movements may impose different grammatical or semantic interpretations depending on the context [7].
• The execution speed of a given gloss may indicate a different meaning or the particular signer's attitude. For instance, signers would not use two glosses to express "run quickly", but they would simply speed up the execution of the involved signs [7].
• Signers often discard a gloss sub-feature, depending on previously performed and proceeding glosses. Hence, different instances of the exact same gloss, originating even from the same signer, can be observed.
• For most SLs so far, very few formal standardization activities have been implemented, to the extent that signers of the same country exhibit distinguishable differences during the execution of a given gloss [8].

Historically, before the advent of deep learning methods, the focus was on identifying isolated glosses and gesture spotting. Developed methods were often making use of hand-crafted techniques [9], [10]. For the spatial representation of the different sub-gloss components, they usually used handcrafted features and/or fusion of multiple modalities. Temporal modeling was achieved by classical sequence learning models, such as Hidden Markov Models (HMM) [11], [12], [13] and hidden conditional random fields [14]. The rise of deep networks was met with a significant boost in performance for many video-related tasks, like human action recognition [15], [16], gesture recognition [17], [18], motion capturing [19], [20], etc. SLR is a task closely related to computer vision. This is the reason that most approaches tackling SLR have adjusted to this direction.

In this paper, SLR using Deep Neural Network (DNN) methods is investigated. The main contributions of this work are summarized as follows:

• A comprehensive, holistic and in-depth analysis of multiple literature DNN-based SLR methods is performed, in order to provide meaningful and detailed insights to the task at hand.
• Two new sequence learning training criteria are proposed, known from the fields of speech and scene text recognition.
• A new pretraining scheme is discussed, where transfer learning is compared to initial pseudo-alignments.
• A new publicly available large-scale RGB+D Greek Sign Language (GSL) dataset is introduced, containing real-life conversations that may occur in different public services. This dataset is particularly suitable for DNN-based approaches that typically require large quantities of expert annotated data.

The remainder of this paper is organized as follows: in Section II, related work is described. In Section III, an overview of the publicly available datasets in SLR is provided, along with the introduction of a new GSL dataset. In Section IV, a description of the implemented architectures is given. In Section V, a description of the proposed sequence training criteria is detailed. In Section VI, the performed experimental results are reported. Then, in Section VII, interpretations and insights of the conducted experiments are discussed. Finally, conclusions are drawn and future research directions are highlighted in Section VIII.

II. RELATED WORK

The various automatic SLR tasks, depending on the modeling's level of detail and the subsequent recognition step, can be roughly divided into (Fig. 1):

• Isolated SLR: Methods of this category target the task of video segment classification (where the segment boundaries are provided), based on the fundamental assumption that a single gloss is present [9], [21], [18].
• Sign detection in continuous streams: The aim of these approaches is to detect a set of predefined glosses in a continuous video stream [11], [22], [23].
• Continuous SLR (CSLR): These methods aim at recognizing the sequence of glosses that are present in a continuous/non-segmented video sequence [3], [24], [25]. This category of approaches exhibits characteristics that are most suitable for the needs of real-life SLR applications [5]; hence, it has gained increased research attention and will be further discussed in the remainder of this section.

Fig. 1. An overview of SLR categories: an input video passes through a feature extraction phase (2D CNN-based frame-level or 3D CNN-based segment-level representations) and a temporal modeling stage, leading either to a single-gloss prediction (isolated SLR), to gloss spotting in a continuous stream, or to the prediction of the entire sequence of glosses (CSLR).

A. Continuous sign language recognition

By definition, CSLR is a task very similar to continuous human action recognition, where a sequence of glosses (instead of actions) needs to be identified in a continuous stream of video data. However, glosses typically exhibit a significantly shorter duration than actions (i.e. they may only involve a very small number of frames), while transitions among them are often too subtle for their temporal boundaries to be efficiently recognized. Additionally, glosses may only involve very detailed and fine-grained human movements (e.g. finger signs or facial expressions), while human actions usually refer to more concrete and extensive human body actions. The latter facts highlight the particular challenges that are present in the CSLR field [3].

Due to the lack of gloss-level annotations, CSLR is regularly cast as a weakly supervised learning problem. The majority of CSLR architectures usually consist of a feature extractor, followed by a temporal modeling mechanism [26], [27]. The feature extractor is used to compute feature representations from individual input frames (using 2D CNNs) or sets of neighbouring frames (using 3D CNNs). On the other hand, a critical aspect of the temporal modeling scheme enables the modeling of the SL unit feature representations (i.e. gloss-level, sentence-level). With respect to temporal modeling, sequence learning can be achieved using HMMs, Connectionist Temporal Classification (CTC) [28] or Dynamic Time Warping (DTW) [29] techniques. From the aforementioned categories, CTC has, in general, shown superior performance, and the majority of works in CSLR has established CTC as the main sequence training criterion (for instance, HMMs may fail to efficiently model complex dynamic variations, due to expressiveness limitations [25]).

However, CTC has the tendency to produce overconfident peak distributions that are prone to overfitting [30]. Moreover, CTC introduces limited contribution towards optimizing the feature extractor [31]. For these reasons, some recent approaches have adopted an iterative training optimization methodology. The latter essentially comprises a two-step process. In particular, a set of temporally-aligned pseudo-labels are initially estimated and used to guide the training of the feature extraction module. In the beginning, the pseudo-labels can be either estimated by statistical approaches [3] or extracted from a shallower model [25]. After training the model in an isolated setup, the trained feature extractor is utilized for the continuous SLR setup. This process may be performed in an iterative way, similarly to the Expectation Maximization (EM) algorithm [32]. Finally, CTC imposes a conditional independence constraint, where output predictions are independent, given the entire input sequence.
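To make the weakly supervised CSLR formulation above concrete, the following sketch wires a 2D CNN frame encoder to a bidirectional LSTM and a per-time-step gloss classifier whose outputs feed a CTC loss. It is a generic illustration of the typical pipeline, not code from any of the cited works; all class names, layer sizes and shapes are our own choices.

```python
import torch
import torch.nn as nn
from torchvision import models

class CSLRBaseline(nn.Module):
    """Generic CSLR skeleton: per-frame 2D CNN features -> BiLSTM -> per-step gloss logits."""
    def __init__(self, num_glosses, hidden=512):
        super().__init__()
        backbone = models.googlenet(weights=None, aux_logits=False)
        backbone.fc = nn.Identity()                    # keep the 1024-d pooled features
        self.backbone = backbone
        self.temporal = nn.LSTM(1024, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, num_glosses + 1)   # +1 for the CTC blank

    def forward(self, frames):                         # frames: (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.temporal(feats)
        logits = self.classifier(feats)                # (B, T, num_glosses + 1)
        return logits.permute(1, 0, 2).log_softmax(-1) # (T, B, V) log-probs for a CTC loss
```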
B. 2D CNN-based CSLR approaches

One of the first deployed architectures in CSLR is based on [33], where a CNN-HMM network is proposed. GoogLeNet serves as the backbone architecture, fed with cropped hand regions and trained in an iterative manner. The same network architecture is deployed in a CSLR prediction setup in [34], where the CNN is trained using glosses as targets instead of hand shapes. Later on, in [35], the same authors extend their previous work by incorporating a Long Short-Term Memory unit (LSTM) [36] on top of the aforementioned network. In a more recent work [24], the authors present a three-stream CNN-LSTM-HMM network, using full frame, cropped dominant hand and signer's mouth region modalities. These models, since they employ HMMs for sequence learning, have to make strong initial assumptions in order to overcome the HMM's expressive limitations.

In [26], the authors introduce an end-to-end system for CSLR without iterative training. Their model follows a 2D CNN-LSTM architecture, replacing HMM with LSTM-CTC. It consists of two streams, one responsible for processing the full frame sequences and one for processing only the signer's cropped dominant hand. In [27], the authors employ a 2D CNN-LSTM architecture and, in parallel with the LSTMs, a weakly supervised gloss-detection regularization network, consisting of stacked temporal 1D convolutions. The same authors in [25] extend their previous work by proposing a module composed of a series of temporal 1D CNNs followed by max pooling, between the feature extractor and the LSTM, while fully embracing the iterative optimization procedure. That module is able to produce compact representations for a video segment, which approximate the average duration of a gloss. Thereby, the LSTM captures the context information between gloss segments, instead of individual frames as in previous works. In [2], a hybrid 2D-3D CNN architecture [37] is developed. Features are extracted in a structured manner, where temporal dependencies are modeled by two LSTMs, without pretraining or using an iterative procedure. This approach, however, yields the best results only in continuous SL datasets where a plethora of training data is available.

C. 3D CNN-based CSLR approaches

One of the first works that employs 3D CNNs in SLR is introduced in [38]. The authors present a multi-modal approach for the task of isolated SLR, using spatio-temporal Convolutional 3D networks (C3D) [39], known from the research field of action recognition. Multi-modal representations are lately fused and fed to a Support Vector Machine (SVM) [40] classifier. The C3D architecture has also been utilized in CSLR by [41]. The developed two-stream 3D CNN processes both full frame and cropped hand images. The full network, named LS-HAN, consists of the proposed 3D CNN network, along with a hierarchical attention network, capable of latent space-based recognition modeling. In a later work [42], the authors propose the I3D [43] architecture in SLR. The model is deployed on an isolated SLR setup, with pretrained weights on action recognition datasets. The signer's body bounding box is served as input. For the evaluated dataset it yielded state-of-the-art results. In [31], the authors adopted and enhanced the original I3D model with a gated Recurrent Neural Network (RNN). The whole architecture is a 3D CNN-RNN-CTC architecture trained iteratively with a dynamic pseudo-label decoding method. Their aim is to accommodate features from different time scales. In another work [44], the authors introduce the 3D ResNet architecture to extract features. Furthermore, they substituted the LSTM with stacked dilated temporal convolutions and CTC for sequence alignment and decoding. With this approach, they manage to have very large receptive fields while reducing time and space complexity, compared to LSTM. Finally, in [45] Pu et al. propose a framework that also consists of a 3D ResNet backbone. The features are provided to both an attentional encoder-decoder network [46] and a CTC decoder for sequence learning. Both decoded outputs are jointly trained, while soft-DTW [47] is utilized to align them.

III. PUBLICLY AVAILABLE DATASETS

Existing SLR datasets can be characterized as isolated or continuous, taking into account whether annotations are provided at the gloss (fine-grained) or the sentence (coarse-grained) level. Additionally, they can be divided into Signer Dependent (SD) and Signer Independent (SI) ones, based on the defined evaluation scheme. In particular, in the SI datasets a signer cannot be present in both the training and the test set. In Table I, the most widely known public SLR datasets, along with their main characteristics, are illustrated:

• The Signum SI and the Signum subset [48] include laboratory capturings of the German Sign Language. They are both created under strict laboratory settings with the most frequent everyday glosses.
• The Chinese Sign Language (CSL) SD, the CSL SI and the CSL isol. datasets [41] are also recorded in a predefined laboratory environment, with Chinese SL words that are widely used in daily conversations.
• The Phoenix SD [49], the Phoenix SI [49] and the Phoenix-T [50] datasets comprise videos of German SL, originating from the weather forecast domain.
• The American Sign Language (ASL) [42] dataset contains videos of various real-life settings. The collected videos exhibit large variations in background, image quality, lighting and positioning of the signers.

A. The GSL dataset

1) Dataset description: In order to boost scientific research in the deep learning era, large-scale public datasets need to be created. In this respect, and with a particular focus on the case of GSL recognition, a corresponding public dataset has been created in this work. In particular, a set of seven native GSL signers are involved in the capturings. The considered application includes cases of deaf people interacting with different public services, namely police departments, hospitals and citizen service centers. For each application case, 5 individual and commonly met scenarios (of increasing duration and vocabulary complexity) are defined. The average length of each scenario is twenty sentences, with 4.23 glosses per sentence on average. Subsequently, each signer was asked to perform the pre-defined dialogues in GSL five consecutive times. In all cases, the simulation considers a deaf person communicating with a single public service employee, while all interactions are performed in GSL (the involved signer performed the sequence of glosses of both agents in the discussion). Overall, the resulting dataset includes 10,290 sentence instances, 40,785 gloss instances, 310 unique glosses (vocabulary size) and 331 unique sentences. For the definition of the dialogues in the identified application cases, the particularities of the GSL and the corresponding annotation guidelines, GSL linguistic experts were involved. The video annotation process is performed both at gloss and at sentence level. The provided annotated segments enable benchmarking in SLR (using the glosses) and SL translation (using the standard modern Greek).

The recordings are conducted using an Intel RealSense D435 RGB+D camera at a rate of 30 fps. Both the RGB and the depth streams are acquired at the same spatial resolution of 848x480 pixels. To increase variability in the videos, the camera position and orientation are slightly altered within subsequent recordings. Exemplary cropped frames of the captured videos are depicted in Fig. 2.

Fig. 2. Example keyframes of the introduced GSL dataset

2) GSL evaluation sets: Regarding the evaluation settings, the dataset includes the following setups: a) the continuous GSL SD, b) the continuous GSL SI, and c) the GSL isol. In GSL SD, roughly 80% of the videos are used for training, corresponding to 8,189 instances. The rest are kept for validation (1,063 instances, 10%) and testing (1,043 instances, 10%). The selected test gloss sequences are not used in the training set, while all the individual glosses exist in the training set. In GSL SI, the recordings of one signer are left out for validation and testing (588 and 881 instances, respectively). The remaining 8,821 instances are utilized for training. A similar strategy is followed in GSL isol., wherein the validation set consists of 2,231 gloss instances, the test set of 3,500, while the remaining 34,995 are used for training.

3) Linguistic analysis and annotation process: As already mentioned, the provided annotations are both at individual gloss and at sentence level. Native signers annotated and labelled individual glosses, as well as whole sentences. Sign linguists and SL professional interpreters consistently validated the annotation of the individual glosses. A great effort was devoted to determining individual glosses following the "one form one meaning" principle (i.e. a distinctive set of signs), taking into consideration the linguistic structure of the GSL and not its translation to the spoken standard modern Greek. We addressed and provided a solution for the following issues: a) compound words, b) synonyms, c) regional or stylistic variants of the same meaning, and d) agreement verbs.

In particular, compound words are composed of smaller meaningful units with distinctive form and meaning, i.e. the equivalent of the morphemes of spoken languages, which can also be simple individual words, for example: SON = MAN+BIRTH. Following the "one form one meaning" principle, we split a compound word into its indivisible parts. Based on the above design, a computer vision system does not confuse compound words with their constituents.

Synonyms (e.g. two different signs with similar meaning) were distinguished from each other with the use of consecutively numbered lemmas. For instance, the two different signs which have the meaning "DOWN" were annotated as DOWN(1) and DOWN(2). The same strategy was opted for in the annotation of regional and stylistic variants of the same meaning.

TABLE I
LARGE-SCALE PUBLICLY AVAILABLE SLR DATASETS

Dataset | Language | Signers | Classes | Video instances | Duration (hours) | Resolution | fps | Type | Modalities | Year
Signum SI [48] | German | 25 | 780 | 19,500 | 55.3 | 776x578 | 30 | continuous | RGB | 2007
Signum isol. [48] | German | 25 | 455 | 11,375 | 8.43 | 776x578 | 30 | both | RGB | 2007
Signum subset [48] | German | 1 | 780 | 2,340 | 4.92 | 776x578 | 30 | both | RGB | 2007
Phoenix SD [49] | German | 9 | 1,231 | 6,841 | 10.71 | 210x260 | 25 | continuous | RGB | 2014
Phoenix SI [49] | German | 9 | 1,117 | 4,667 | 7.28 | 210x260 | 25 | continuous | RGB | 2014
CSL SD [41] | Chinese | 50 | 178 | 25,000 | 100+ | 1920x1080 | 30 | continuous | RGB+D | 2016
CSL SI [41] | Chinese | 50 | 178 | 25,000 | 100+ | 1920x1080 | 30 | continuous | RGB+D | 2016
CSL isol. [38] | Chinese | 50 | 500 | 125,000 | 67.75 | 1920x1080 | 30 | isolated | RGB+D | 2016
Phoenix-T [50] | German | 9 | 1,231 | 8,257 | 10.53 | 210x260 | 25 | continuous | RGB | 2018
ASL 100 [42] | English | 189 | 100 | 5,736 | 5.55 | varying | varying | isolated | RGB | 2019
ASL 1000 [42] | English | 222 | 1,000 | 25,513 | 24.65 | varying | varying | isolated | RGB | 2019
GSL isol. (new) | Greek | 7 | 310 | 40,785 | 6.44 | 848x480 | 30 | isolated | RGB+D | 2019
GSL SD (new) | Greek | 7 | 310 | 10,290 | 9.59 | 848x480 | 30 | continuous | RGB+D | 2019
GSL SI (new) | Greek | 7 | 310 | 10,290 | 9.59 | 848x480 | 30 | continuous | RGB+D | 2019

For example, the two different regional variants of "DOCTOR" were annotated as DOCTOR(1) and DOCTOR(2).

Another interesting case is the agreement verbs of sign languages, which contain the subject and/or object within the sign of the agreement verb. Agreement verbs indicate subjects and/or objects by changing the direction of the movement and/or the orientation of the hand. Therefore, it was decided that they cannot be distinguished as autonomous signs and are annotated as a single gloss. A representative example is "I DISCUSS WITH YOU" versus "YOU DISCUSS WITH HIM". For the described annotation guideline, the internationally accepted notation for the sign verbs is followed [51], [1].

IV. SLR APPROACHES

In order to gain a better insight on the behavior of the various automatic SLR approaches, the best performing and the most widely adopted methods of the literature are discussed in this section. The selected approaches cover all different categories of methods that have been proposed so far. The quantitative comparative evaluation of the latter, using multiple publicly available datasets, will facilitate towards providing valuable feedback regarding the pros and cons of each automatic SLR methodology.

A. SubUNets

Camgoz et al. [26] introduce a DNN-based approach for solving the simultaneous alignment and recognition problems, typically referred to as "sequence-to-sequence" learning. In particular, the overall problem is decomposed into a series of specialized systems, termed SubUNets. The overall goal is to model the spatio-temporal relationships among these SubUNets to solve the task at hand. More specifically, SubUNets allow to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. Additionally, they also allow to implicitly perform transfer learning between different interrelated tasks.

B. GoogLeNet + TConvs

In contrast to other 2D CNN-based methods that employ HMMs, Cui et al. [25] propose a model that includes an extra temporal module (TConvs) after the feature extractor (GoogLeNet). The TConvs module consists of two 1D CNN layers and two max pooling layers. It is designed to capture the fine-grained dependencies that exist inside a gloss (intra-gloss dependencies) between consecutive frames, into compact per-window feature vectors. Finally, bidirectional RNNs are applied in order to capture the long-term temporal dependencies of the entire sentence. The total architecture is trained iteratively, in order to exploit the expressive capability of DNN models with limited data.
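As an illustration of the TConvs idea, the following sketch stacks two temporal 1D convolutions, each followed by max pooling, on top of frame-level features, so that each output step summarizes a window of frames comparable to a gloss duration. The kernel and pooling sizes follow the CSL configuration reported later in Section VI-B; the class name and everything else are our own choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class TConvs(nn.Module):
    """Temporal module sketch: two 1D convolutions, each followed by temporal max pooling."""
    def __init__(self, in_dim=1024, filters=1024, kernel=7, pool=3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(in_dim, filters, kernel_size=kernel, stride=1, padding=kernel // 2),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=pool, stride=pool),
            nn.Conv1d(filters, filters, kernel_size=kernel, stride=1, padding=kernel // 2),
            nn.ReLU(inplace=True),
            nn.MaxPool1d(kernel_size=pool, stride=pool),
        )

    def forward(self, x):                    # x: (B, T, C) frame-level features from the 2D CNN
        x = self.block(x.transpose(1, 2))    # Conv1d expects (B, C, T)
        return x.transpose(1, 2)             # (B, T/9, C) window-level features for the BiLSTM
```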
C. I3D

Inflated 3D ConvNet (I3D) [43] was originally developed for the task of human action recognition; however, its application has demonstrated outstanding performance on isolated SLR [42]. In particular, the I3D architecture is an extended version of GoogLeNet, which contains several 3D convolutional layers followed by 3D max-pooling layers. The key insight of this architecture is the endowing of the 2D sub-modules (filters and pooling kernels) with an additional temporal dimension. This methodology makes it feasible to learn spatio-temporal features from videos, while it leverages efficient, known architecture designs and parameters.
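The inflation step that I3D relies on can be sketched independently of the full network: a trained 2D kernel is repeated along a new temporal axis and rescaled so that, on a temporally constant input, the 3D filter initially reproduces the 2D response. The helper below is an illustrative sketch of that bootstrapping trick, not the reference implementation of [43].

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Turn a trained 2D convolution into a 3D one by repeating its kernel over time."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim, *conv2d.kernel_size),
                       stride=(1, *conv2d.stride),
                       padding=(time_dim // 2, *conv2d.padding),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # (out, in, kH, kW) -> (out, in, time_dim, kH, kW), rescaled by 1/time_dim
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```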
D. 3D ResNet+LSTM

Pu et al. [45] propose a framework comprising a 3D CNN for feature extraction, an RNN for sequence learning and two different decoding strategies, one performed with CTC and the other with an attentional decoder RNN. The glosses predicted by the attentional decoder are utilised to draw a warping path using a soft-DTW [47] alignment constraint. The warping paths display the alignments between glosses and video segments. The proposed pseudo-alignments are then employed for iterative optimization.

V. SEQUENCE LEARNING TRAINING CRITERIA FOR CSLR

A summary of the notations used in this paper is provided in this section, so as to enhance its readability and understanding. Let us denote by U the label (i.e. gloss) vocabulary and by blank the new blank token, representing the silence or transition between two consecutive labels. The extended vocabulary can be defined as V = U ∪ {blank} ∈ R^L, where L is the total number of labels. From now on, given a sequence f of length F, we denote its first and last p elements by f_{1:p} and f_{p:F}, respectively. An input frame sequence of length N can be defined as X = (x_1, ..., x_N). The corresponding target sequence of labels (i.e. glosses) of length K is defined as y = (y_1, ..., y_K). In addition, let G_v = (g_v^1, ..., g_v^T) ∈ R^{L×T} be the predicted output sequence of a softmax classifier, where T ≤ N and v ∈ V. g_v^t can be interpreted as the probability of observing label v at time-step t. Hence, G_v defines a distribution over the set V^T ∈ R^{L×T}:

p(\pi \mid X) = \prod_{t=1}^{T} g^{t}_{\pi_t}, \quad \forall \pi \in V^{T}    (1)

The elements of V^T are referred to as paths and denoted by π. In order to map y to π, one can define a mapping function B : V^T → U^{≤T}, with U^{≤T} being the set of possible labellings. B removes repeated labels and blanks from a given path. Similarly, one can denote the inverse operation of B as B^{-1}, which maps target labels to all the valid paths. From this perspective, the conditional probability of y is computed as:

p(y \mid X) = \sum_{\pi \in B^{-1}(y)} p(\pi \mid X)    (2)

A. Traditional CTC criterion

Connectionist Temporal Classification (CTC) [28] is widely utilized for labelling unsegmented sequences. The time complexity of (2) is O(L^{NK}), which means that the amount of valid paths grows exponentially with N. To efficiently calculate p(y|X), a recursive formula is derived, which exploits the existence of common sub-paths. Furthermore, to allow for blanks in the paths, a modified gloss sequence y' of length K' = 2K + 1 is used, by adding blanks before and after each gloss in y. The forward and backward probabilities α_t(s) of y'_{1:s} at t and β_t(s) of y'_{s:K'} at t are defined as:

\alpha_t(s) \triangleq \sum_{B(\pi_{1:t}) = y'_{1:s}} \; \prod_{t'=1}^{t} g^{t'}_{\pi_{t'}}    (3)

\beta_t(s) \triangleq \sum_{B(\pi_{t:T}) = y'_{s:K'}} \; \prod_{t'=t}^{T} g^{t'}_{\pi_{t'}}    (4)

Therefore, to calculate p(y|X) for any t, we sum over all s in y' as:

p(y \mid X) = \sum_{s=1}^{K'} \frac{\alpha_t(s)\,\beta_t(s)}{g^{t}_{y'_s}}    (5)

Finally, the CTC criterion is derived as:

L_{ctc} = -\log p(y \mid X)    (6)
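In practice, the criterion of Eq. (6) does not need to be re-derived from the forward-backward recursions; deep learning frameworks provide a numerically stable implementation. A minimal PyTorch usage sketch is shown below (our own tensor names and sizes; the built-in loss expects log-probabilities and reserves one vocabulary index for the blank).

```python
import torch
import torch.nn as nn

T, B, L = 60, 2, 311      # time-steps, batch size, vocabulary size including the blank (index 0)
logits = torch.randn(T, B, L, requires_grad=True)    # unnormalised classifier outputs
log_probs = logits.log_softmax(dim=-1)                # nn.CTCLoss expects log-probabilities

targets = torch.randint(1, L, (B, 8))                 # gloss indices; 0 is reserved for blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 8, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)   # Eq. (6), averaged over the batch
loss.backward()
```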
The error signal of L_{ctc} with respect to g_v^t is:

\frac{\partial L_{ctc}}{\partial g_v^t} = -\frac{1}{p(y \mid X)\, g_v^t} \sum_{\{\pi \in B^{-1}(y),\; \pi_t = v\}} p(\pi \mid X)    (7)

From (7) it can be observed that the error signal is proportional to the fraction of all valid paths. As soon as a path dominates the rest, the error signal enforces all the probabilities to concentrate on a single path. Moreover, (1) and (7) indicate that the probabilities of a gloss occurring at following time-steps are independent, which is known as the conditional independence assumption. For these reasons, two learning criteria are introduced in CSLR: a) one that encounters the ambiguous segmentation boundaries of adjacent glosses, and b) one that is able to model the intra-gloss dependencies, by incorporating a learnable language model during training (as opposed to other approaches that use it only during the CTC decoding stage).
B. Entropy Regularization CTC

The CTC criterion can be extended [30] based on maximum conditional entropy [52], by adding an entropy regularization term H:

H(p(\pi \mid y, X)) = -\sum_{\pi \in B^{-1}(y)} p(\pi \mid X, y) \log p(\pi \mid X, y) = -\frac{Q(y)}{p(y \mid X)} + \log p(y \mid X),    (8)

where Q(y) = \sum_{\pi \in B^{-1}(y)} p(\pi \mid X) \log p(\pi \mid X).

H aims to prevent the entropy of the non-dominant paths from decreasing rapidly. Consequently, the entropy regularization CTC criterion (EnCTC) is formulated as:

L_{enctc} = L_{ctc} - \phi \, H(p(\pi \mid y, X)),    (9)

where φ is a hyperparameter. The introduction of the entropy term H prevents the error signal from gathering into the dominant path, and rather encourages the exploration of nearby ones. By increasing the probabilities of the alternative paths, the peaky distribution problem is alleviated.

C. Stimulated CTC

Stimulated learning [53], [54], [55] augments the training process by regularizing the activations of the sequence learning RNN, h_t. Stimulated CTC (StimCTC) [56] constricts the independence assumption of traditional CTC. To generate the appropriate stimuli, an auxiliary uni-directional Language Model RNN (RNN-LM) is utilized. The RNN-LM encoded hidden states (h_k) encapsulate the sentence's history, up to gloss k. h_t is stimulated by utilizing the non-blank probabilities α'_t and β'_t ∈ R^K. Then, the weighting factor γ_t can be calculated as:

\gamma_t = \frac{\beta'_t \odot \alpha'_t}{\beta'_t \cdot \alpha'_t}    (10)

where the numerator denotes the element-wise product and the denominator the inner product.

Intuitively, γ_t can be seen as the probabilities of any gloss in the target sequence y being mapped to time-step t. The linguistic structure of SL is then incorporated as:

L_{stimuli} = \frac{1}{K \cdot T} \sum_{k}^{K} \sum_{t}^{T} \gamma_t(k)\, \| h_t - h_k \|^{2},    (11)

Thereby, h_t is enforced to comply with h_k. The RNN-LM model is trained using the cross-entropy criterion, denoted as L_{lm}. Finally, the StimCTC criterion is defined as:

L_{stim} = L_{ctc} + \lambda L_{lm} + \theta L_{stimuli},    (12)

where λ and θ are hyper-parameters. The described criteria can be combined, resulting in the Entropy Stimulated CTC (EnStimCTC) criterion:

L_{enstim} = L_{ctc} - \phi H(p(\pi \mid y)) + \lambda L_{lm} + \theta L_{stimuli}    (13)
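As a sketch of how the combined criterion of Eq. (13) could be assembled, the snippet below implements the stimuli term of Eq. (11) directly and treats the path-entropy term H of Eq. (8) and the RNN-LM cross-entropy L_lm as tensors computed elsewhere; the hyper-parameter defaults are the values reported in Section VI-C. This is our own illustrative code, not the authors' implementation.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

def stimuli_term(h_t, h_k, gamma):
    """Eq. (11): gamma-weighted squared distance between sequence-model states h_t (T, D)
    and RNN-LM states h_k (K, D); gamma has shape (T, K)."""
    sq_dist = (h_t.unsqueeze(1) - h_k.unsqueeze(0)).pow(2).sum(-1)   # (T, K)
    return (gamma * sq_dist).sum() / (gamma.shape[0] * gamma.shape[1])

def enstim_ctc(log_probs, targets, in_lens, tgt_lens, h_t, h_k, gamma,
               lm_loss, path_entropy, phi=0.1, lam=1.0, theta=0.5):
    """Eq. (13): L_ctc - phi*H + lambda*L_lm + theta*L_stimuli.
    `path_entropy` (Eq. (8)) and `lm_loss` are assumed to be precomputed tensors."""
    l_ctc = ctc_loss(log_probs, targets, in_lens, tgt_lens)
    return l_ctc - phi * path_entropy + lam * lm_loss + theta * stimuli_term(h_t, h_k, gamma)
```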
VI. EXPERIMENTAL EVALUATION

In order to provide a fair evaluation, we re-implemented the selected approaches and evaluated them on multiple large-scale datasets, in both isolated and continuous SLR. Re-implementations are based on the original authors' guidelines and any modifications are explicitly referenced. For the continuous setup, the criteria CTC, EnCTC, and EnStimCTC are evaluated in all architectures. For a fair comparison between different models, we opt to use the full frame modality, since it is the common modality between the selected datasets and it is more suitable for real-life applications. We omit the iterative optimization process; instead we pretrain each model on the respective dataset's isolated version, if present. Otherwise, extracted pseudo-alignments from other models (i.e. Phoenix) are used for isolated pretraining (implementations and experimental results are publicly available to enforce reproducibility in SLR³).

³ https://zenodo.org/record/3941811#.XxrZXZZRU5k

A. Datasets and Evaluation metrics

The following datasets have been chosen for experimental evaluation: ASL 100 and 1000, CSL isol., GSL isol. for the isolated setup, and Phoenix SD and Phoenix SI, CSL SD, CSL SI, GSL SD, GSL SI for the CSLR setup. To evaluate recognition performance on continuous datasets, the word error rate (WER) metric has been adopted, which quantifies the similarity between the predicted glosses and the ground truth gloss sequence. WER measures the least number of operations needed to transform the aligned predicted sequence into the ground truth and can be defined as:

WER = \frac{S + D + I}{N},    (14)

where S is the total number of substitutions, D is the total number of deletions, I is the total number of insertions and N is the total number of glosses in the ground truth.
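Eq. (14) amounts to a Levenshtein alignment between the predicted and the ground-truth gloss sequences; a small self-contained sketch of the metric is given below.

```python
def wer(reference, hypothesis):
    """Word error rate between two gloss sequences: (S + D + I) / N via edit distance."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # i deletions
    for j in range(m + 1):
        dist[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / max(n, 1)

print(wer("HELLO CAN HELP".split(), "HELLO HELP CAN".split()))   # 2/3 ≈ 0.67
```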
TABLE II
GLOSS TEST ACCURACY IN PERCENTAGE – ISOLATED SLR

Method | ASL 1000 | ASL 100 | CSL isol. | GSL isol.
GoogLeNet+TConvs [25] | - | 44.92 | 79.31 | 86.03
3D-ResNet [45] | - | 50.48 | 89.91 | 86.23
I3D [57] | 40.99 | 72.07 | 95.68 | 89.74

B. Data augmentation and implementation details

The same data preprocessing methods are used for all datasets. Each frame is normalised by the mean and standard deviation of the ImageNet dataset. To increase the variability of the training videos, the following data augmentation techniques are adopted. Frames are resized to 256x256 and cropped at a random position to 224x224. Random frame sampling is used up to 80% of the video length. Moreover, random jittering of the brightness, contrast, saturation and hue values of each frame is applied. The models are trained with the Adam optimizer with initial learning rate λ0 = 10^-4, which is reduced to λi = 10^-5 when the validation loss starts to plateau. For isolated SLR experiments, the batch size is set to 2. Videos are rescaled to a fixed length that is equal to the average gloss length of each dataset. For CSLR experiments, videos are downsampled to a maximum length of 250 frames, if necessary. The batch size is set to 1, due to GPU memory constraints. The experiments are conducted on an NVIDIA GeForce GTX-1080 Ti GPU with 12 GB of memory and 32 GB of RAM. All models, depending on the dataset, require 10 to 25 epochs to converge.
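The preprocessing recipe above maps directly onto standard torchvision transforms; the composition below is an illustrative sketch only. The exact jitter magnitudes and scheduler patience are our own assumptions, and in practice the random crop and jitter parameters should be shared across all frames of a clip.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Per-frame preprocessing; sample the random parameters once per video so that all
# frames of a clip are augmented consistently.
frame_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

model = nn.Linear(10, 10)                               # placeholder for any of the SLR models
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(  # drops the LR towards 1e-5 on plateau
    optimizer, mode="min", factor=0.1, patience=2)
```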
The referenced models, depending on the dataset, have been modified as follows. In SubUNets, AlexNet [58] is used as the feature extractor instead of CaffeNet [59], as they share a similar architecture. Additionally, for the CSL and GSL datasets, we reduce the bidirectional LSTM hidden size by half, due to computational space complexity. In the isolated setup, the LSTM layers of SubUNets are trained along with the feature extractor. In order to achieve the maximum performance of GoogLeNet+TConvs, a manual customization of the TConvs 1D CNN kernels and pooling sizes is necessary. The intuition behind it is that the receptive field should approximately cover the average gloss duration. Each 1D CNN layer includes 1024 filters. In CSL, the 1D CNN layers are set with kernel size 7, stride 1 and the max-pooling layers with kernel sizes and strides equal to 3, to cover the average gloss duration of 58 frames. For the GSL dataset the TConvs are tuned with kernel sizes equal to 5 and pooling sizes equal to 3. In order to deploy 3D-ResNet and I3D in a CSLR setup, a sliding window technique is adopted on the input sequence. Window size and stride are selected to cover the average gloss duration. Then, a 2-layer bidirectional LSTM is added to model the long-term temporal correlations in the feature sequence. In CSL, the window size is set to 50 and the stride to 36, whereas in GSL the window size is set to 25, with stride equal to 12. I3D and 3D-ResNet are initialized with weights pretrained on Kinetics. Also, for the 3D-ResNet method, we omit the attentional decoder from the original paper, keeping the 3D-ResNet+LSTM model.
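The sliding-window deployment of the 3D backbones can be expressed as unfolding the frame sequence into overlapping clips before the 3D CNN and the BiLSTM are applied. The helper below is a sketch with the GSL setting (window 25, stride 12) as defaults; the function name and the padding choice for short videos are our own.

```python
import torch

def sliding_windows(frames: torch.Tensor, window: int = 25, stride: int = 12) -> torch.Tensor:
    """Split a video (T, C, H, W) into overlapping clips (num_windows, window, C, H, W).

    Each clip is fed to the 3D CNN to produce one feature vector, so the BiLSTM that
    follows sees roughly one time-step per (average) gloss duration.
    """
    t = frames.shape[0]
    if t < window:                                   # pad short videos by repeating the last frame
        pad = frames[-1:].repeat(window - t, 1, 1, 1)
        frames, t = torch.cat([frames, pad], dim=0), window
    starts = list(range(0, t - window + 1, stride))
    if starts[-1] != t - window:                     # make sure the tail of the video is covered
        starts.append(t - window)
    return torch.stack([frames[s:s + window] for s in starts])

clips = sliding_windows(torch.randn(100, 3, 224, 224))   # -> (8, 25, 3, 224, 224)
```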

TABLE III
FINE-TUNING IN CSLR DATASETS. RESULTS ARE REPORTED IN WER

Method | Phoenix SD (Val. / Test) | Phoenix SI (Val. / Test) | CSL SI (Test) | CSL SD (Test) | GSL SI (Test) | GSL SD (Test)
I3D (Kinetics) | 53.81 / 51.27 | 65.53 / 62.38 | 23.19 | 72.39 | 34.52 | 75.42
I3D (Kinetics + ASL 1000) | 40.89 / 40.49 | 59.60 / 58.36 | 16.73 | 64.72 | 27.09 | 71.05

In initial experiments it was observed that, by training with StimCTC, all baseline models were unable to converge. The main reason is that the networks produce unstable output probability distributions in the early stage of training. On the contrary, introducing L_stim in the late training stage constantly improved the networks' performance. The overall best results were obtained with EnStimCTC. The reason is that, while the entropy term H introduces more variability in the early optimisation process, convergence is hindered in the late training stage. By removing H and introducing L_stim, the possible alignments generated by EnCTC are filtered. Regarding the hyper-parameters of the selected criteria, tuning was necessary. For EnCTC, the hyperparameter φ is varied in the range of 0.1 to 0.2. For EnStimCTC, λ is set to 1. Concerning θ, evaluations for θ = 0.1, 0.2, 0.5, 1 are performed. The best results were obtained with θ = 0.5 and φ = 0.1.

C. Experimental results

In Table II, quantitative results are reported for the isolated setup. Classification accuracy is reported in percentage. It can be seen that 3D baseline methods achieve a higher gloss recognition rate than 2D ones. I3D clearly outperforms the other architectures in this setup, by a minimum margin of 2.2% to a maximum of 21.6%. I3D and 3D-ResNet were pretrained on Kinetics, which explains their superiority in performance. The 3D CNN models achieve satisfactory results in datasets created under laboratory conditions, yet in challenging scenarios I3D clearly outperforms 3D-ResNet. Specifically, in ASL 1000, where glosses are not executed in a controlled environment, only I3D is able to converge. SubUNets performed poorly or did not converge at all and its results are deliberately excluded. SubUNets' inability to converge may be due to their large number of parameters (roughly 125M).

In Table III, I3D+LSTM is fine-tuned on CSLR datasets with CTC in two configurations: a) using the pretrained weights from Kinetics, and b) pretraining on ASL 1000. Results are improved by 6.79% on average for the second configuration. This was expected due to the task relevance.

TABLE IV
COMPARISON OF PRETRAINING SCHEMES: RESULTS OF THE I3D ARCHITECTURE, AS MEASURED IN TEST WER, USING MULTIPLE FULLY-SUPERVISED APPROACHES BEFORE TRAINING IN CSLR

Method | CSL SI (Test) | GSL SI (Val. / Test)
SubUNets alignments | 5.94 | 18.43 / 20.00
Uniform alignments | 16.98 | 27.30 / 29.08
Transfer learning from ASL | 16.73 | 25.89 / 27.09
Proximal transfer learning | 6.45 | 8.78 / 8.62

Table IV presents an evaluation of the impact of transfer learning versus training with initial pseudo-alignments, as a pretraining scheme. The following four cases are considered:

• directly train a shallow model (i.e. SubUNets) without pretraining, to obtain initial pseudo-alignments,
• assume uniform pseudo-alignments over the input video for each gloss in a sentence,
• transfer learning from a large-scale isolated dataset (ASL), and
• proximal transfer learning from the respective dataset's isolated subset.

Experiments are conducted on the CSL SI and the GSL SI evaluation sets, since they have annotated isolated subsets for proximal transfer learning. For this particular experiment I3D is used, since it is the best performing model in the isolated setup (Table II). SubUNets are chosen to infer the initial pseudo-alignments, because pretraining is not required by design. Training was performed with the traditional CTC criterion. In CSL SI, the best strategy, by a relative margin of 7.9%, seems to be pretraining on pseudo-alignments. On the contrary, in GSL SI the best results are acquired with proximal transfer learning, by a relative gain of 56.9% compared to pseudo-alignments. Producing pseudo-alignments requires more training time, while training with a proximal isolated subset is not always available.

In Tables V and VI, quantitative results regarding CSLR are reported. The selected architectures are evaluated on CSLR datasets in both SD and SI subsets, using the proposed criteria. Training with EnCTC needs more epochs to converge, due to the fact that a greater number of possible paths is explored, yet it converges to a better local optimum. Overall, EnCTC shows an average improvement of 1.59% in WER (9.73% relative). A further reduction of 1.60% in WER (5.69% relative) is observed by adding StimCTC. It can be seen that the proposed EnStimCTC criterion improves recognition in all datasets by an overall WER gain of 3.26% (14.56% relative). In the reported average gains SubUNets are excluded due to performance deterioration.

In the Phoenix SD subset, all models benefit from training with the EnStimCTC loss by 1.59% less WER on average. Fig. 3 depicts the models' WER in the Phoenix SD validation set. SubUNets have a WER of 29.51% in the validation set and 29.22% in the test set, which is an average reduction of 12.59% WER compared to the original paper's results (42.1% vs 30.62%) [26].

TABLE V
REPORTED RESULTS IN CONTINUOUS SD SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED SUBSET.

Phoenix SD (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 30.51 / 30.62 | 32.02 / 31.61 | 29.51 / 29.22
GoogLeNet+TConvs [25] | 32.18 / 31.37 | 31.66 / 31.74 | 28.87 / 29.11
3D-ResNet+LSTM [45] | 38.81 / 37.79 | 38.80 / 37.50 | 36.74 / 35.51
I3D+LSTM [57] | 32.88 / 31.92 | 32.60 / 32.70 | 31.16 / 31.48

CSL SD (Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 78.31 | 81.33 | 80.13
GoogLeNet+TConvs [25] | 65.83 | 64.04 | 64.43
3D-ResNet+LSTM [45] | 72.44 | 70.20 | 68.35
I3D+LSTM [57] | 64.73 | 64.06 | 60.68

GSL SD (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 52.79 / 54.31 | 58.11 / 60.09 | 55.03 / 57.49
GoogLeNet+TConvs [25] | 43.54 / 48.46 | 42.69 / 44.11 | 38.92 / 42.33
3D-ResNet+LSTM [45] | 61.94 / 68.54 | 63.47 / 66.54 | 57.88 / 61.64
I3D+LSTM [57] | 51.74 / 53.48 | 51.37 / 53.48 | 49.89 / 49.99

TABLE VI
REPORTED RESULTS IN CONTINUOUS SI SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED SUBSET.

Phoenix SI (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 56.56 / 55.06 | 55.59 / 53.42 | 55.01 / 54.11
GoogLeNet+TConvs [25] | 46.70 / 46.67 | 47.14 / 46.70 | 46.42 / 46.41
3D-ResNet+LSTM [45] | 55.88 / 53.77 | 54.69 / 54.57 | 52.88 / 50.98
I3D+LSTM [57] | 55.24 / 54.43 | 54.42 / 53.92 | 53.70 / 52.71

CSL SI (Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 3.29 | 5.13 | 4.14
GoogLeNet+TConvs [25] | 4.06 | 2.46 | 2.41
3D-ResNet+LSTM [45] | 19.09 | 13.36 | 14.31
I3D+LSTM [57] | 6.49 | 4.26 | 2.72

GSL SI (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 24.64 / 24.03 | 21.73 / 20.58 | 21.65 / 20.62
GoogLeNet+TConvs [25] | 8.08 / 7.95 | 7.63 / 6.91 | 6.99 / 6.75
3D-ResNet+LSTM [45] | 33.61 / 33.07 | 27.80 / 26.75 | 25.58 / 24.01
I3D+LSTM [57] | 8.78 / 8.62 | 7.69 / 6.55 | 6.63 / 6.10

Furthermore, 2D-based CNNs produce similar results with negligible difference in performance. Similarly, in Phoenix SI, GoogLeNet+TConvs trained with EnStimCTC is the best performing setup, with an average of 10.9% relative less WER compared to the others. Finally, all architectures in Phoenix SI have worse recognition performance compared to their SD counterparts, due to a reduction of more than 20% in the training data.

Fig. 3. Validation WER of the implemented architectures in the Phoenix SD dataset, trained with EnStimCTC loss.

In the CSL SI dataset all methods, except for 3D-ResNet, have comparable recognition performance. They achieve high recognition accuracy due to the large size of the dataset and the small size of the vocabulary. I3D+LSTM seems to benefit the most when trained with EnStimCTC, with a 3.77% absolute WER reduction. GoogLeNet+TConvs has the best performance with 2.41% WER, which is 5.36% less WER on average than the other models and 1.65% less compared to CTC training (Fig. 5). This method outperforms the current state-of-the-art method on CSL SI [2] by an absolute reduction of 1.39% WER (3.80 vs 2.41) and relatively by 36.58%. In CSL SD, WER results are considerably higher than the CSL SI results, with an average WER gain of 70.00%. The best performing model is I3D+LSTM with 60.68% WER.

In GSL SI, the I3D+LSTM and GoogLeNet+TConvs recognition results are close, with 6.63%/6.10% and 6.99%/6.75% WER in the development and test set, respectively. In GSL SD, GoogLeNet+TConvs yields the lowest WER (42.33%), yet this is still higher than its GSL SI results by 35.58% absolute WER. Regarding CSL and GSL, both datasets have a relatively low number of unique gloss combinations and their SD test sets contain unseen gloss sequences. For these reasons, all models tend to predict combinations of glosses similar to the ones seen during training. This explains the superior recognition rates in the SI subsets of CSL and GSL.

VII. DISCUSSION

A. Performance comparison of implemented architectures

From the group of experiments in isolated SLR in Table II, it was experimentally shown that 3D methods are more suitable for isolated gloss classification compared to the 2D models. This is justified by the fact that 2D CNNs do not model dependencies between neighbouring frames, where motion features play a crucial role in classifying a gloss. I3D is considered as the most capable of directly modeling the intra-gloss dependencies, leading to superior generalization capabilities. The advantage of the 3D inception layer lies in the ability to project multi-channel spatio-temporal features into dense, lower dimensional embeddings. This leads to accumulating higher semantic representations (more abstract output features).

In the opposite direction, deploying skip connections maintains previous layers' feature maps that correspond to lower semantic content, which does not assist in isolated SLR.

Modeling intermediate short temporal dependencies was experimentally shown (Tables V, VI) to enhance the CSLR performance. The implemented 3D CNN architectures directly capture spatio-temporal correlations as intermediate representations. The design choice of providing the input video in a sliding window restricts the network's temporal receptive field. Based on a sequential structure, architectures such as GoogLeNet+TConvs achieve the same goal by grouping consecutive spatial features. Such a sequential approach can prove beneficial in many datasets, given that the spatial filters are well-trained. For this reason, such approaches require heavy pretraining of the backbone network. The superiority in performance of the implemented sequential approach is justified by the careful manual tuning of temporal kernels and strides. However, manual design significantly downgrades the advantages of transfer learning. The sliding window technique can be easily adapted based on the particularities of each dataset, making 3D CNNs more scalable and suitable for real-life applications. To summarize, both techniques aim to approximate the average gloss duration. This is interpreted as a guidance in models, based on the statistics of the SL dataset. On the other hand, utilizing only LSTMs to capture the temporal dependencies (i.e. SubUNets) results in an ineffective modeling of intra-gloss correlations. LSTMs are designed to model the long-term dependencies that correspond to the inter-gloss dependencies. Taking a closer look at the predicted alignments of each approach, it is noticed that 3D architectures do not provide as precise gloss boundaries for true positives as the 2D ones. We strongly believe that this is the reason that 3D models benefit more from the introduced variations of the traditional CTC.

Fig. 5. Comparison of validation WER of CTC and EnStimCTC criteria with GoogLeNet+TConvs in the CSL SI and Phoenix SD datasets.

Ground truth: I(1) PAPER EXCUSE CHECK AFTER YOU PAPER PROOF I_GIVE_YOU
EnStimCTC:    I(1) PAPER EXCUSE CHECK AFTER PAPER PROOF I_GIVE_YOU
EnCTC:        I(1) PAPER EXCUSE AFTER PAPER PROOF I_GIVE_YOU
CTC:          I(1) PAPER EXCUSE AFTER PAPER APPROVAL I_GIVE_YOU

Fig. 4. Visual comparison of ground truth alignments with the predictions of the proposed training criteria. GoogLeNet+TConvs is used for evaluation in the GSL SD dataset.

B. Comparison between CTC variations

The reported experimental results exhibit the negative influence of CTC's drawbacks (overconfident paths and the conditional independence assumption) in CSLR. EnCTC's contribution to alleviating the overconfident paths is illustrated in Fig. 4. The ground truth gloss "PROOF" is recognized with the introduction of H, instead of "APPROVAL". The latter has a six times higher occurrence frequency. After a careful examination of the aforementioned signs, one can notice that they are close in terms of hand position and execution speed, which justifies the depicted predictions. Furthermore, it is observed that EnCTC boosts performance mostly in CSL SI and GSL SI, due to the limited diversity and vocabulary. It can be highlighted that EnCTC did not boost SubUNets' performance. The latter generates per-frame predictions (T = N), whereas the rest of the approaches generate grouped predictions (T ≈ N/4). This results in a significantly larger space of possible alignments that is harder to explore with this criterion. From Fig. 4, it can be visually validated that EnStimCTC remedies the conditional independence assumption. For instance, the gloss "CHECK" was only recognised with stimulated training. By bringing closer the predictions that correspond to the same target gloss, the intra-gloss dependencies are effectively modeled. In parallel, the network was also able to correctly classify transitions between glosses as blank. It should also be noted that EnStimCTC does not increase time and space complexity during inference.

C. Evaluation of pretraining schemes

Due to the limited contribution of CTC gradients in the feature extractor, an effective pretraining is mandatory. As shown in Fig. 3, pretraining significantly affects the starting WER of each model. Without pretraining, all models congregate around the most dominant glosses, which significantly slows down the CSLR training process and limits the learning capacity of the network. Fully supervised pretraining can be interpreted as a domain shift towards the distribution of the SL dataset, which speeds up the early training stage in CSLR. Regarding the pretraining scheme, in datasets with a limited vocabulary and a limited number of gloss sequences (i.e., CSL), inferring initial pseudo-alignments proved beneficial, as shown in Table IV. This is explained by the fact that the data distribution of the isolated subset had different particularities, such as sign execution speed. However, producing initial pseudo-alignments is time-consuming. Hence, the small deterioration in performance is an acceptable trade-off between recognition rate and training time. The proposed GSL dataset contains nearly double the vocabulary and roughly three times the number of unique gloss sentences, with fewer training instances. More importantly, its isolated subset draws instances from the same distribution as the continuous one. In such cases, proximal transfer learning significantly outperforms training with pseudo-alignments (56.90% relative improvement on the GSL dataset).
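As a minimal sketch of this proximal transfer-learning scheme, the frame-level feature extractor can first be trained as an isolated gloss classifier on the isolated subset and then used to initialise the CSLR model, which is optimised with a CTC-based criterion. All class names, layer sizes and vocabulary sizes below are illustrative placeholders, not the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

NUM_ISOLATED_GLOSSES = 310   # placeholder vocabulary size of the isolated subset
NUM_CSLR_GLOSSES = 310       # placeholder vocabulary size of the continuous subset

# --- Stage 1: fully supervised pretraining on the proximal isolated subset ---
backbone = googlenet(pretrained=True)                        # ImageNet initialisation
backbone.fc = nn.Linear(1024, NUM_ISOLATED_GLOSSES)          # isolated gloss classifier head
# ... train `backbone` with cross-entropy on isolated gloss clips ...

# --- Stage 2: transfer the feature extractor to continuous SLR ---
class CSLRModel(nn.Module):
    """2D CNN feature extractor + temporal modelling + CTC output layer (sketch)."""

    def __init__(self, feature_extractor: nn.Module, num_glosses: int):
        super().__init__()
        feature_extractor.fc = nn.Identity()                  # keep the 1024-d frame features
        self.cnn = feature_extractor                          # initialised from Stage 1
        self.temporal = nn.LSTM(1024, 512, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(1024, num_glosses + 1)    # +1 for the CTC blank

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        feats, _ = self.temporal(feats)
        return self.classifier(feats).log_softmax(dim=-1)

model = CSLRModel(backbone, NUM_CSLR_GLOSSES)
ctc_loss = nn.CTCLoss(blank=0)   # the model is then trained end-to-end with (En/Stim)CTC
```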
VIII. CONCLUSIONS AND FUTURE WORK

In this paper, an in-depth analysis of the most characteristic DNN-based SLR model architectures was conducted. Through extensive experiments on three publicly available datasets, a comparative evaluation of the most representative SLR architectures was presented. Alongside this evaluation, a new publicly available large-scale RGB+D dataset was introduced for the Greek SL, suitable for SLR benchmarking. Two CTC variations known from other application fields, EnCTC and StimCTC, were evaluated for CSLR, and it was observed that their combination tackled two important issues: the ambiguous boundaries of adjacent glosses and the intra-gloss dependencies. Moreover, a pretraining scheme was provided, in which transfer learning from a proximal isolated dataset can serve as a good initialization for CSLR training. The main finding of this work was that, while 3D CNN-based architectures were more effective in isolated SLR, 2D CNN-based models with an intermediate per-gloss representation achieved superior results on the majority of the CSLR datasets. In particular, our implementation of GoogLeNet+TConvs, with the proposed pretraining scheme and the EnStimCTC criterion, yielded state-of-the-art results on CSL SI.
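To illustrate the intermediate per-gloss representation mentioned above, the sketch below stacks 1D temporal convolutions and pooling on top of frame-level 2D CNN features before a recurrent layer, in the spirit of GoogLeNet+TConvs. The kernel sizes, strides and feature dimensions are illustrative assumptions rather than the tuned values of our model.

```python
import torch
import torch.nn as nn

class TConvBlock(nn.Module):
    """1D temporal convolution followed by max pooling over the frame axis (sketch)."""

    def __init__(self, channels: int, kernel_size: int = 5, pool: int = 2):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, T)
        return self.pool(torch.relu(self.conv(x)))

class TemporalEncoder(nn.Module):
    """Stacks TConv blocks so that each output step roughly spans one gloss."""

    def __init__(self, feat_dim: int = 1024, num_blocks: int = 2):
        super().__init__()
        self.tconvs = nn.Sequential(*[TConvBlock(feat_dim) for _ in range(num_blocks)])
        self.blstm = nn.LSTM(feat_dim, 512, bidirectional=True, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:   # (B, T, feat_dim)
        x = self.tconvs(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, T/4, feat_dim)
        out, _ = self.blstm(x)                                 # gloss-level representations
        return out                                             # fed to the CTC classifier

frame_feats = torch.randn(2, 120, 1024)   # 2 clips of 120 frame features (placeholder shapes)
print(TemporalEncoder()(frame_feats).shape)                    # torch.Size([2, 30, 1024])
```

Each pooling stage halves the temporal resolution, so two blocks reduce roughly four consecutive frames to one feature vector, which is the granularity at which the recurrent layer and the CTC criterion operate.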
Concerning future work, efficient ways of integrating depth information to guide the feature extraction training phase can be devised. Moreover, another promising direction is to investigate the incorporation of additional sequence learning modules, such as attention-based approaches, in order to adequately model inter-gloss dependencies. Future SLR architectures may also be enhanced by fusing highly semantic representations that correspond to the manual and non-manual features of SL, similarly to humans. Finally, it would be of great importance for deaf-non-deaf communication to bridge the gap between SLR and SL translation. Advancements in this domain will drive research towards SL translation, as well as SL-to-SL translation, which have not yet been thoroughly studied.

IX. ACKNOWLEDGEMENTS

This work was supported by the Greek General Secretariat of Research and Technology under contract Τ1ΕΔΚ-02469 EPIKOINONO.

The authors would like to express their gratitude to Vasileios Angelidis, Chrysoula Kyrlou and Georgios Gkintikas from the Greek sign language center (https://www.keng.gr/) for their valuable feedback and contribution to the Greek sign language recordings.