A Comprehensive Study on Sign Language
Recognition Methods
Nikolas Adaloglou1* , Theocharis Chatzis1* , Ilias Papastratis1* , Andreas Stergioulas1* , Georgios Th.
Papadopoulos1 , Member, IEEE, Vassia Zacharopoulou2 , George J. Xydopoulos2 , Klimnis Atzakas2 , Dimitris
Papazachariou2 , and Petros Daras1 , Senior Member, IEEE
Centre for Research and Technology Hellas1
University of Patras2
* These authors contributed equally.
Abstract—In this paper, a comparative experimental assessment of computer vision-based methods for sign language recognition is conducted. By implementing the most recent deep neural
network methods in this field, a thorough evaluation on multiple
publicly available datasets is performed. The aim of the present
study is to provide insights on sign language recognition, focusing
on mapping non-segmented video streams to glosses. For this
task, two new sequence training criteria, known from the fields of
speech and scene text recognition, are introduced. Furthermore, a
plethora of pretraining schemes is thoroughly discussed. Finally,
a new RGB+D dataset for the Greek sign language is created. To
the best of our knowledge, this is the first sign language dataset
where sentence and gloss level annotations are provided for a
video capture.
Index Terms—Sign Language Recognition, Greek sign language, Deep neural networks, stimulated CTC, conditional entropy CTC.
I. INTRODUCTION
Spoken languages make use of the “vocal - auditory”
channel, as they are articulated with the mouth and perceived
with the ear. All writing systems also derive from, or are
representations of, spoken languages. Sign languages (SLs)
are different as they make use of the “corporal - visual”
channel, produced with the body and perceived with the eyes.
SLs are not international and they are widely used by the
communities of deaf people. They are natural languages, since
they are developed spontaneously wherever deaf people have
the opportunity to congregate and communicate mutually. SLs
are not derived from spoken languages; they have their own
independent vocabularies and their own grammatical structures
[1]. The signs used by deaf people actually have internal
structure in the same way as spoken words. Just as hundreds of
thousands of English words are produced using a small number
of different sounds, the signs of SLs are produced using a
finite number of gestural features. Thus, signs are not holistic
gestures but are rather analyzable, as a combination of linguistically significant features. Similarly to spoken languages,
SLs are composed of the following indivisible features:
• Manual features, i.e. hand shape, position, movement, orientation of the palm or fingers, and
• Non-manual features, namely eye gaze, head-nods/shakes, shoulder orientations, and various kinds of facial expressions, such as mouthing and mouth gestures.
Combinations of the above-mentioned features represent a
gloss, which is the fundamental building block of a SL and
represents the closest meaning of a sign [2]. SLs, similar to the
spoken ones, include an inventory of flexible grammatical rules
that govern both manual and non-manual features [3]. Both
of them, are simultaneously (and often with loose temporal
structure) used by signers, in order to construct sentences in
a SL. Depending on the context, a specific feature may be the
most critical factor towards interpreting a gloss. It can modify
the meaning of a verb, provide spatial/temporal reference and
discriminate between objects and people.
Due to the intrinsic difficulty of the deaf community to
interact with the rest of the society (according to [4], around
500,000 people use the American SL to communicate in
the USA), the development of robust tools for automatic
SL recognition would greatly alleviate this communication
gap. As stated in [5], there is an increased demand for
interdisciplinary collaboration including the deaf community
and for the creation of representative public video datasets.
Sign Language Recognition (SLR) can be defined as the
task of inferring glosses performed by a signer from video
captures. Even though there is a significant amount of work in the field of SLR, there is a profound lack of a complete experimental study. Moreover, most publications do not report results on all available datasets or share their code. Thus, experimental results in the field of SL are rarely reproducible and often lack interpretation. Apart from the inherent difficulties related to
human motion analysis (e.g. differences in the appearance of
the subjects, the human silhouette features, the execution of the
same actions, the presence of occlusions, etc.) [6], automatic
SLR exhibits the following key additional challenges:
• Exact position in surrounding space and context have a large impact on the interpretation of SL. For example, personal pronouns (e.g. “he”, “she”, etc.) do not exist. Instead, the signer points directly to any involved referent or, when reproducing the contents of a conversation, pronouns are modeled by twisting his/her shoulders or gaze.
• Many glosses are only distinguishable by their constituent non-manual features, and they are typically difficult to be accurately detected, since even very slight human movements may impose different grammatical or semantic interpretations depending on the context [7].
• The execution speed of a given gloss may indicate a different meaning or the particular signer's attitude. For instance, signers would not use two glosses to express “run quickly”, but they would simply speed up the execution of the involved signs [7].
• Signers often discard a gloss sub-feature, depending on previously performed and proceeding glosses. Hence, different instances of the exact same gloss, originating even from the same signer, can be observed.
• For most SLs so far, very few formal standardization activities have been implemented, to the extent that signers of the same country exhibit distinguishable differences during the execution of a given gloss [8].
Historically, before the advent of deep learning methods, the
focus was on identifying isolated glosses and gesture spotting.
Developed methods were often making use of hand crafted
techniques [9], [10]. For spatial representation of the different
sub-gloss components, they usually used handcrafted features
and/or fusion of multiple modalities. Temporal modeling was
achieved by classical sequence learning models, such as
Hidden Markov Model (HMM) [11], [12], [13] and hidden
conditional random fields [14]. The rise of deep networks
was met with a significant boost in performance for many
video-related tasks, like human action recognition [15], [16],
gesture recognition, [17], [18], motion capturing [19], [20],
etc. SLR is a task closely related to computer vision, which is the reason that most approaches tackling SLR have moved in this direction.
In this paper, SLR using Deep Neural Network (DNN)
methods is investigated. The main contributions of this work
are summarized as follows:
• A comprehensive, holistic and in-depth analysis of multiple literature DNN-based SLR methods is performed, in order to provide meaningful and detailed insights to the task at hand.
• Two new sequence learning training criteria are proposed, known from the fields of speech and scene text recognition.
• A new pretraining scheme is discussed, where transfer learning is compared to initial pseudo-alignments.
• A new publicly available large-scale RGB+D Greek Sign Language (GSL) dataset is introduced, containing real-life conversations that may occur in different public services. This dataset is particularly suitable for DNN-based approaches that typically require large quantities of expert annotated data.
The remainder of this paper is organized as follows: in
Section II, related work is described. In Section III, an
overview of the publicly available datasets in SLR is provided,
along with the introduction of a new GSL dataset. In Section
IV, a description of the implemented architectures is given.
In Section V, a description of the proposed sequence training
criteria is detailed. In Section VI, the performed experimental
results are reported. Then, in Section VII, interpretations and
insights of the conducted experiments are discussed. Finally,
conclusions are drawn and future research directions are
highlighted in Section VIII.
II. RELATED WORK
The various automatic SLR tasks, depending on the modeling's level of detail and the subsequent recognition step, can be roughly divided into the following categories (Fig. 1):
• Isolated SLR: Methods of this category aim to address the task of video segment classification (where the segment boundaries are provided), based on the fundamental assumption that a single gloss is present [9], [21], [18].
• Sign detection in continuous streams: The aim of these
approaches is to detect a set of predefined glosses in a
continuous video stream [11], [22], [23].
• Continuous SLR (CSLR): These methods aim at recognizing the sequence of glosses that are present in
a continuous/non-segmented video sequence [3], [24],
[25]. This category of approaches exhibits characteristics
that are most suitable for the needs of real-life SLR
applications [5]; hence, it has gained increased research
attention and will be further discussed in the remainder
of this section.
A. Continuous sign language recognition
By definition, CSLR is a task very similar to the one of
continuous human action recognition, where a sequence of
glosses (instead of actions) needs to be identified in a continuous stream of video data. However, glosses typically exhibit
a significantly shorter duration than actions (i.e. they may
only involve a very small number of frames), while transitions
among them are often very subtle for their temporal boundaries
to be efficiently recognized. Additionally, glosses may only
involve very detailed and fine-grained human movements (e.g.
finger signs or facial expressions), while human actions usually
refer to more concrete and extensive human body actions. The
latter facts highlight the particular challenges that are present
in the CSLR field [3].
Due to the lack of gloss-level annotations, CSLR is regularly cast as a weakly supervised learning problem. The majority of CSLR architectures usually consists of a feature extractor, followed by a temporal modeling mechanism [26], [27]. The feature extractor is used to compute feature representations from individual input frames (using 2D CNNs) or sets of neighbouring frames (using 3D CNNs). The temporal modeling scheme, on the other hand, models the feature representations of SL units (i.e. gloss-level, sentence-level representations). With respect to temporal modeling,
sequence learning can be achieved using HMMs, Connectionist Temporal Classification (CTC) [28] or Dynamic Time
Warping (DTW) [29] techniques. From the aforementioned
categories, CTC has, in general, shown superior performance
and the majority of works in CSLR has established CTC
as the main sequence training criterion (for instance, HMMs
may fail to efficiently model complex dynamic variations, due
to expressiveness limitations [25]). However, CTC has the
tendency to produce overconfident peak distributions, that are
Fig. 1. An overview of SLR categories (isolated SLR, gloss spotting, and CSLR), from a 2D or 3D CNN-based feature extraction phase and temporal modeling to the prediction phase.
prone to overfitting [30]. Moreover, CTC introduces limited
contribution towards optimizing the feature extractor [31].
For these reasons, some recent approaches have adopted an
iterative training optimization methodology. The latter essentially comprises a two-step process. In particular, a set of
temporally-aligned pseudo-labels are initially estimated and
used to guide the training of the feature extraction module.
In the beginning, the pseudo-labels can be either estimated by
statistical approaches [3] or extracted from a shallower model
[25]. After training the model in an isolated setup, the trained
feature extractor is utilized for the continuous SLR setup. This
process may be performed in an iterative way, similarly to the
Expectation Maximization (EM) algorithm [32]. Finally, CTC
imposes a conditional independence constraint, where output
predictions are independent, given the entire input sequence.
B. 2D CNN-based CSLR approaches
One of the first deployed architectures in CSLR is based on [33], where a CNN-HMM network is proposed. GoogLeNet serves as the backbone architecture, fed with cropped
hand regions and trained in an iterative manner. The same
network architecture is deployed in a CSLR prediction setup
[34], where the CNN is trained using glosses as targets
instead of hand shapes. Later on, in [35], the same authors
extend their previous work by incorporating a Long Short-Term Memory unit (LSTM) [36] on top of the aforementioned
network. In a more recent work [24], the authors present a
three-stream CNN-LSTM-HMM network, using full frame,
cropped dominant hand and signer’s mouth region modalities.
These models, since they employ HMM for sequence learning,
have to make strong initial assumptions in order to overcome
HMM’s expressive limitations.
In [26], the authors introduce an end-to-end system in
CSLR without iterative training. Their model follows a 2D
CNN-LSTM architecture, replacing HMM with LSTM-CTC.
It consists of two streams, one responsible for processing the
full frame sequences and one for processing only the signer’s
cropped dominant hand. In [27], the authors employ a 2D
CNN-LSTM architecture and in parallel with the LSTMs,
a weakly supervised gloss-detection regularization network,
consisting of stacked temporal 1D convolutions. The same
authors in [25] extend their previous work by proposing a
module composed of a series of temporal 1D CNNs followed
by max pooling, between the feature extractor and the LSTM,
while fully embracing the iterative optimization procedure.
That module is able to produce compact representations for
a video segment, which approximate the average duration of
a gloss. Thereby, the LSTM captures the context information
between gloss segments, instead of individual frames as in
previous works. In [2], a hybrid 2D-3D CNN architecture [37]
is developed. Features are extracted in a structured manner,
where temporal dependencies are modeled by two LSTMs,
without pretraining or using an iterative procedure. This approach, however, yields the best results only in continuous SL
datasets where a plethora of training data is available.
C. 3D CNN-based CSLR approaches
One of the first works that employs 3D-CNNs in SLR
is introduced in [38]. The authors present a multi-modal
approach for the task of isolated SLR, using spatio-temporal
Convolutional 3D networks (C3D) [39], known from the research field of action recognition. Multi-modal representations
are then fused at a late stage and fed to a Support Vector Machine (SVM)
[40] classifier. The C3D architecture has also been utilized in
CSLR by [41]. The developed two-stream 3D CNN processes
both full frame and cropped hand images. The full network,
named LS-HAN, consists of the proposed 3D CNN network,
along with a hierarchical attention network, capable of latent
space-based recognition modeling. In a later work [42], the
authors introduce the I3D [43] architecture to SLR. The model
is deployed on an isolated SLR setup, with pretrained weights
on action recognition datasets. The signer’s body bounding box
is provided as input. For the evaluated dataset, it yielded state-of-the-art results. In [31], the authors adopted and enhanced the
original I3D model with a gated Recurrent Neural Network
(RNN). The whole architecture is a 3D CNN-RNN-CTC
architecture trained iteratively with a dynamic pseudo-label
decoding method. Their aim is to accommodate features from
different time scales. In another work [44], the authors introduce the 3D ResNet architecture to extract features. Furthermore, they substituted LSTM with stacked dilated temporal
convolutions and CTC for sequence alignment and decoding.
With this approach, they manage to have very large receptive
fields while reducing time and space complexity, compared to
LSTM. Finally, in [45] Pu et al. propose a framework that also
consists of a 3D ResNet backbone. The features are provided
in both an attentional encoder-decoder network [46] and a
CTC decoder for sequence learning. Both decoded outputs
are jointly trained while the soft-DTW [47] is utilized to align
them.
III. PUBLICLY AVAILABLE DATASETS
Existing SLR datasets can be characterized as isolated or continuous, taking into account whether annotations are provided at the gloss (fine-grained) or the sentence (coarse-grained) level. Additionally, they can be divided into Signer
Dependent (SD) and Signer Independent (SI) ones, based on
the defined evaluation scheme. In particular, in the SI datasets
a signer cannot be present in both the training and the test
set. In Table I, the following most widely known public SLR
datasets, along with their main characteristics, are illustrated:
• The Signum SI and the Signum subset [48] include
laboratory capturings of the German Sign Language.
They are both created under strict laboratory settings with
the most frequent everyday glosses.
• The Chinese Sign Language (CSL) SD, the CSL SI
and the CSL isol. datasets [41] are also recorded in
a predefined laboratory environment with Chinese SL
words that are widely used in daily conversations.
• The Phoenix SD [49], the Phoenix SI [49] and the
Phoenix-T [50] datasets comprise videos of German SL,
originating from the weather forecast domain.
• The American Sign Language (ASL) [42] dataset contains videos of various real-life settings. The collected
videos exhibit large variations in background, image
quality, lighting and positioning of the signers.
A. The GSL dataset
1) Dataset description: In order to boost scientific research
in the deep learning era, large-scale public datasets need to be
created. In this respect and with a particular focus on the case
of the GSL recognition, a corresponding public dataset has
been created in this work. In particular, a set of seven native
GSL signers are involved in the capturings. The considered
application includes cases of deaf people interacting with
different public services, namely police departments, hospitals
and citizen service centers. For each application case, 5
individual and commonly met scenarios (of increasing duration
and vocabulary complexity) are defined. The average length
of each scenario is twenty sentences with 4.23 glosses per
sentence on average. Subsequently, each signer was asked to
perform the pre-defined dialogues in GSL five consecutive
times. In all cases, the simulation considers a deaf person
communicating with a single public service employee, while
all interactions are performed in GSL (the involved signer
performed the sequence of glosses of both agents in the discussion). Overall, the resulting dataset includes 10,290 sentence
instances, 40,785 gloss instances, 310 unique glosses (vocabulary size) and 331 unique sentences. For the definition of the
dialogues in the identified application cases, the particularities
of the GSL and the corresponding annotation guidelines, GSL
linguistic experts are involved. The video annotation process
is performed both at gloss and sentence level. The provided
annotated segments enable benchmarking in SLR (using the
glosses) and SL translation (using the standard modern Greek).
Fig. 2. Example keyframes of the introduced GSL dataset
The recordings are conducted using an Intel RealSense D435
RGB+D camera at a rate of 30 fps. Both the RGB and the
depth streams are acquired in the same spatial resolution of
848x480 pixels. To increase variability in videos, the camera
position and orientation are slightly altered within subsequent
recordings. Exemplary cropped frames of the captured videos
are depicted in Fig. 2.
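For reference, acquiring synchronized RGB and depth streams at the reported resolution and frame rate with an Intel RealSense D435 could be configured roughly as follows. This is a sketch based on the standard pyrealsense2 API, not the actual capture software used for the GSL recordings.

```python
import pyrealsense2 as rs

# Configure both streams at 848x480, 30 fps, as reported for the GSL recordings.
pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 848, 480, rs.format.z16, 30)
config.enable_stream(rs.stream.color, 848, 480, rs.format.bgr8, 30)

profile = pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()      # one synchronized RGB+D frame pair
    depth = frames.get_depth_frame()
    color = frames.get_color_frame()
    print(depth.get_width(), depth.get_height(), color.get_width(), color.get_height())
finally:
    pipeline.stop()
```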
2) GSL evaluation sets: Regarding the evaluation settings,
the dataset includes the following setups: a) the continuous
GSL SD, b) the continuous GSL SI, and c) the GSL isol. In
GSL SD, roughly 80% of the videos are used for training,
corresponding to 8,189 instances. The remaining 1,063 (10%) are kept for validation and 1,043 (10%) for testing. The selected test gloss sequences are not used in the training set, while all the individual glosses exist in the training set. In GSL SI, the recordings of one signer are left out for validation and testing (588 and 881 instances, respectively). The remaining 8,821 instances are utilized for training. A similar strategy is followed in GSL isol., wherein the validation set consists of 2,231 gloss instances, the test set of 3,500, while the remaining 34,995 are used for training.
3) Linguistic analysis and annotation process: As already
mentioned, the provided annotations are both at individual
gloss and sentence level. Native signers annotated and labelled
individual glosses, as well as whole sentences. Sign linguists
and SL professional interpreters consistently validated the
annotation of the individual glosses. A great effort was devoted
to determining individual glosses following the “one form
one meaning” principle (i.e. a distinctive set of signs), taking
into consideration the linguistic structure of the GSL and
not its translation to the spoken standard modern Greek. We
addressed and provided a solution for the following issues: a)
compound words, b) synonyms, c) regional or stylistic variants
of the same meaning, and d) agreement verbs.
In particular, compound words are composed of smaller
meaningful units with distinctive form and meaning, i.e. the
equivalent of morphemes of the spoken languages, which
can also be simple individual words, for example: SON =
MAN+BIRTH. Following the “one form one meaning” principle, we split a compound word into its indivisible parts.
Based on the above design, a computer vision system does
not confuse compound words with their constituents.
Synonyms (e.g. two different signs with similar meaning)
were distinguished from each other with the use of consecutively
numbered lemmas. For instance, the two different signs which
have the meaning: “DOWN” were annotated as DOWN(1) and
DOWN(2). The same strategy was adopted for the annotation
of regional and stylistic variants of the same meaning. For
TABLE I
LARGE-SCALE PUBLICLY AVAILABLE SLR DATASETS

Datasets            | Language | Signers | Classes | Video instances | Duration (hours) | Resolution | fps     | Type       | Modalities | Year
Signum SI [48]      | German   | 25      | 780     | 19,500          | 55.3             | 776x578    | 30      | continuous | RGB        | 2007
Signum isol. [48]   | German   | 25      | 455     | 11,375          | 8.43             | 776x578    | 30      | both       | RGB        | 2007
Signum subset [48]  | German   | 1       | 780     | 2,340           | 4.92             | 776x578    | 30      | both       | RGB        | 2007
Phoenix SD [49]     | German   | 9       | 1,231   | 6,841           | 10.71            | 210x260    | 25      | continuous | RGB        | 2014
Phoenix SI [49]     | German   | 9       | 1,117   | 4,667           | 7.28             | 210x260    | 25      | continuous | RGB        | 2014
CSL SD [41]         | Chinese  | 50      | 178     | 25,000          | 100+             | 1920x1080  | 30      | continuous | RGB+D      | 2016
CSL SI [41]         | Chinese  | 50      | 178     | 25,000          | 100+             | 1920x1080  | 30      | continuous | RGB+D      | 2016
CSL isol. [38]      | Chinese  | 50      | 500     | 125,000         | 67.75            | 1920x1080  | 30      | isolated   | RGB+D      | 2016
Phoenix-T [50]      | German   | 9       | 1,231   | 8,257           | 10.53            | 210x260    | 25      | continuous | RGB        | 2018
ASL 100 [42]        | English  | 189     | 100     | 5,736           | 5.55             | varying    | varying | isolated   | RGB        | 2019
ASL 1000 [42]       | English  | 222     | 1,000   | 25,513          | 24.65            | varying    | varying | isolated   | RGB        | 2019
GSL isol. (new)     | Greek    | 7       | 310     | 40,785          | 6.44             | 848x480    | 30      | isolated   | RGB+D      | 2019
GSL SD (new)        | Greek    | 7       | 310     | 10,290          | 9.59             | 848x480    | 30      | continuous | RGB+D      | 2019
GSL SI (new)        | Greek    | 7       | 310     | 10,290          | 9.59             | 848x480    | 30      | continuous | RGB+D      | 2019
example, the two different regional variants of “DOCTOR” were annotated as DOCTOR(1) and DOCTOR(2).
Another interesting case is the agreement verbs of sign
languages, which contain the subject and/or object within the
sign of the agreement verb. Agreement verbs indicate subjects
and/or objects by changing the direction of the movement
and/or the orientation of the hand. Therefore, it was decided
that they cannot be distinguished as autonomous signs and
are annotated as a single gloss. A representative example is
the following: “I DISCUSS WITH YOU” versus “YOU DISCUSS
WITH HIM”. For the described annotation guideline, the
internationally accepted notation for the sign verbs is followed
[51], [1].
IV. SLR APPROACHES
In order to gain a better insight on the behavior of the
various automatic SLR approaches, the best performing and
the most widely adopted methods of the literature are discussed in this section. The selected approaches cover all
different categories of methods that have been proposed so
far. The quantitative comparative evaluation of the latter, using
multiple publicly available datasets, will facilitate towards
providing valuable feedback regarding the pros and cons of
each automatic SLR methodology.
A. SubUNets

Camgoz et al. [26] introduce a DNN-based approach for solving the simultaneous alignment and recognition problems, typically referred to as “sequence-to-sequence” learning. In particular, the overall problem is decomposed into a series of specialized systems, termed SubUNets. The overall goal is to model the spatio-temporal relationships among these SubUNets to solve the task at hand. More specifically, SubUNets allow the injection of domain-specific expert knowledge into the system regarding suitable intermediate representations. Additionally, they also allow implicitly performing transfer learning between different interrelated tasks.

B. GoogLeNet + TConvs

In contrast to other 2D CNN-based methods that employ HMMs, Cui et al. [25] propose a model that includes an extra temporal module (TConvs), after the feature extractor (GoogLeNet). The TConvs module consists of two 1D CNN layers and two max pooling layers. It is designed to capture the fine-grained dependencies, which exist inside a gloss (intra-gloss dependencies) between consecutive frames, into compact per-window feature vectors. Finally, bidirectional RNNs are applied in order to capture the long-term temporal dependencies of the entire sentence. The total architecture is trained iteratively, in order to exploit the expressive capability of DNN models with limited data.

C. I3D

Inflated 3D ConvNet (I3D) [43] was originally developed for the task of human action recognition; however, its application has demonstrated outstanding performance on isolated SLR [42]. In particular, the I3D architecture is an extended version of GoogLeNet, which contains several 3D convolutional layers followed by 3D max-pooling layers. The key insight of this architecture is the endowing of the 2D sub-modules (filters and pooling kernels) with an additional temporal dimension. This methodology makes it feasible to learn spatio-temporal features from videos, while it leverages efficient known architecture designs and parameters.
D. 3D ResNet+LSTM
Pu et al. [45] propose a framework comprising a 3D CNN
for feature extraction, a RNN for sequence learning and
two different decoding strategies, one performed with CTC
and the other with an attentional decoder RNN. The glosses
predicted by the attentional decoder are utilised to draw a
warping path using a soft-DTW [47] alignment constraint.
The warping paths display the alignments between glosses
and video segments. The proposed pseudo-alignments are then
employed for iterative optimization.
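For illustration, the following is a minimal PyTorch sketch of a TConvs-style temporal module as described in Section IV-B. The layer sizes follow the CSL configuration reported later in Section VI (1024 filters, 1D kernels of size 7 with stride 1, max pooling of size and stride 3); the remaining details (padding, activation) are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TConvs(nn.Module):
    """Two 1D convolutions, each followed by max pooling, producing
    compact per-window representations of consecutive frame features."""
    def __init__(self, in_dim=1024, hidden=1024, kernel=7, pool=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=kernel, stride=1, padding=kernel // 2),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=pool, stride=pool),
            nn.Conv1d(hidden, hidden, kernel_size=kernel, stride=1, padding=kernel // 2),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=pool, stride=pool),
        )

    def forward(self, x):            # x: (batch, feature_dim, num_frames) from the 2D CNN
        return self.net(x)           # (batch, hidden, num_frames // 9) window-level features

feats = torch.randn(2, 1024, 180)    # e.g. frame-level features for two 180-frame videos
windows = TConvs()(feats)            # -> (2, 1024, 20); a bidirectional RNN follows in [25]
print(windows.shape)
```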
V. SEQUENCE LEARNING TRAINING CRITERIA FOR CSLR

A summary of the notations used in this paper is provided in this section, so as to enhance its readability and understanding. Let us denote by U the label (i.e. gloss) vocabulary and by blank the new blank token, representing the silence or transition between two consecutive labels. The extended vocabulary can be defined as V = U ∪ {blank} ∈ R^L, where L is the total number of labels. From now on, given a sequence f of length F, we denote its first and last p elements by f_{1:p} and f_{p:F}, respectively. An input frame sequence of length N can be defined as X = (x_1, ..., x_N). The corresponding target sequence of labels (i.e. glosses) of length K is defined as y = (y_1, ..., y_K). In addition, let G_v = (g_v^1, ..., g_v^T) ∈ R^{L×T} be the predicted output sequence of a softmax classifier, where T ≤ N and v ∈ V. g_v^t can be interpreted as the probability of observing label v at time-step t. Hence, G_v defines a distribution over the set V^T ∈ R^{L×T}:

p(π|X) = ∏_{t=1}^{T} g^t_{π_t},  ∀π ∈ V^T    (1)

A. Traditional CTC criterion

The elements of V^T are referred to as paths and denoted by π. In order to map y to π, one can define a mapping function B : V^T → U^{≤T}, with U^{≤T} being the set of possible labellings. B removes repeated labels and blanks from a given path. Similarly, one can denote the inverse operation of B as B^{-1}, that maps target labels to all the valid paths. From this perspective, the conditional probability of y is computed as:

p(y|X) = ∑_{π ∈ B^{-1}(y)} p(π|X)    (2)

Connectionist Temporal Classification (CTC) [28] is widely utilized for labelling unsegmented sequences. The time complexity of (2) is O(L^N K), which means that the amount of valid paths grows exponentially with N. To efficiently calculate p(y|X), a recursive formula is derived, which exploits the existence of common sub-paths. Furthermore, to allow for blanks in the paths, a modified gloss sequence y′ of length K′ = 2K + 1 is used, by adding blanks before and after each gloss in y. Forward and backward probabilities α_t(s) of y′_{1:s} at t and β_t(s) of y′_{s:K′} at t are defined as:

α_t(s) = ∑_{B(π_{1:t}) = y′_{1:s}} ∏_{t′=1}^{t} g^{t′}_{π_{t′}}    (3)

β_t(s) = ∑_{B(π_{t:T}) = y′_{s:K′}} ∏_{t′=t}^{T} g^{t′}_{π_{t′}}    (4)

Therefore, to calculate p(y|X) for any t, we sum over all s in y′ as:

p(y|X) = ∑_{s=1}^{K′} α_t(s) β_t(s) / g^t_{y′_s}    (5)

Finally, the CTC criterion is derived as:

L_ctc = − log p(y|X)    (6)

The error signal of L_ctc with respect to g^t_v is:

∂L_ctc / ∂g^t_v = − (1 / (p(y|X) g^t_v)) ∑_{π ∈ B^{-1}(y), π_t = v} p(π|X)    (7)

From (7) it can be observed that the error signal is proportional to the fraction of all valid paths. As soon as a path dominates the rest, the error signal enforces all the probabilities to concentrate on a single path. Moreover, (1) and (7) indicate that the probabilities of a gloss occurring at following time-steps are independent, which is known as the conditional independence assumption. For these reasons, two learning criteria are introduced in CSLR: a) one that encounters the ambiguous segmentation boundaries of adjacent glosses, and b) one that is able to model the intra-gloss dependencies, by incorporating a learnable language model during training (as opposed to other approaches that use it only during the CTC decoding stage).

B. Entropy Regularization CTC

The CTC criterion can be extended [30] based on maximum conditional entropy [52], by adding an entropy regularization term H:

H(p(π|y, X)) = − ∑_{π ∈ B^{-1}(y)} p(π|X, y) log p(π|X, y) = − Q(y) / p(y|X) + log p(y|X),    (8)

where Q(y) = ∑_{π ∈ B^{-1}(y)} p(π|X) log p(π|X). H aims to prevent the entropy of the non-dominant paths from decreasing rapidly. Consequently, the entropy regularization CTC criterion (EnCTC) is formulated as:

L_enctc = L_ctc − φ H(p(π|y, X)),    (9)

where φ is a hyperparameter. The introduction of the entropy term H prevents the error signal from gathering into the dominant path, but rather encourages the exploration of nearby ones. By increasing the probabilities of the alternative paths, the peaky distribution problem is alleviated.
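To make the above concrete, the following self-contained sketch (a toy example with randomly generated posteriors, not tied to any model of this paper) enumerates the valid paths of B^{-1}(y) for a tiny vocabulary, recovering p(y|X) of Eq. (2), the CTC loss of Eq. (6), the path entropy of Eq. (8) and the EnCTC objective of Eq. (9); in practice, the forward-backward recursion of Eqs. (3)-(5) replaces the enumeration.

```python
import numpy as np
from itertools import product

def collapse(path, blank=0):
    """The mapping B: remove repeated labels, then blanks."""
    out, prev = [], None
    for v in path:
        if v != prev and v != blank:
            out.append(v)
        prev = v
    return tuple(out)

def ctc_by_enumeration(g, y, blank=0):
    """g: (T, L) softmax outputs g_v^t; y: target gloss indices (no blanks)."""
    T, L = g.shape
    path_probs = []
    for pi in product(range(L), repeat=T):                  # all paths in V^T
        if collapse(pi, blank) == tuple(y):                  # keep pi in B^{-1}(y)
            path_probs.append(np.prod([g[t, v] for t, v in enumerate(pi)]))
    p_y = float(np.sum(path_probs))                          # Eq. (2)
    q = np.array(path_probs) / p_y                           # p(pi | y, X)
    entropy = float(-(q * np.log(q)).sum())                  # Eq. (8)
    return p_y, entropy

rng = np.random.default_rng(0)
T, L = 6, 4                          # 6 time-steps, blank + 3 glosses
g = rng.random((T, L))
g /= g.sum(axis=1, keepdims=True)    # toy per-frame posteriors
y = [1, 2]                           # toy target gloss sequence

p_y, H = ctc_by_enumeration(g, y)
l_ctc = -np.log(p_y)                 # Eq. (6)
l_enctc = l_ctc - 0.1 * H            # Eq. (9) with phi = 0.1
print(f"p(y|X)={p_y:.4f}  L_ctc={l_ctc:.4f}  H={H:.4f}  L_enctc={l_enctc:.4f}")
```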
C. Stimulated CTC

Stimulated learning [53], [54], [55] augments the training process by regularizing the activations h_t of the sequence learning RNN. Stimulated CTC (StimCTC) [56] addresses the conditional independence assumption of traditional CTC. To generate the appropriate stimuli, an auxiliary uni-directional Language Model RNN (RNN-LM) is utilized. The RNN-LM encoded hidden states (h_k) encapsulate the sentence's history, up to gloss k. h_t is stimulated by utilizing the non-blank probabilities α′_t and β′_t ∈ R^K. Then, the weighting factor γ_t can be calculated as:

γ_t = (β′_t ⊙ α′_t) / (β′_t · α′_t)    (10)

Intuitively, γ_t can be seen as the probabilities of any gloss in the target sequence y to be mapped to time-step t. The linguistic structure of SL is then incorporated as:

L_stimuli = (1 / (K·T)) ∑_{k=1}^{K} ∑_{t=1}^{T} γ_t(k) ‖h_t − h_k‖²,    (11)

Thereby, h_t is enforced to comply with h_k. The RNN-LM model is trained using the cross-entropy criterion denoted as L_lm. Finally, the StimCTC criterion is defined as:

L_stim = L_ctc + λL_lm + θL_stimuli,    (12)

where λ and θ are hyper-parameters. The described criteria can be combined, resulting in the Entropy Stimulated CTC (EnStimCTC) criterion:

L_enstim = L_ctc − φH(p(π|y)) + λL_lm + θL_stimuli    (13)
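As an illustration of how Eqs. (10)-(13) combine, the following PyTorch sketch computes γ_t and L_stimuli for a single sample; the hidden states and non-blank forward/backward probabilities are assumed to be produced by the respective networks (here they are random placeholders), and the weighting values simply mirror those reported in Section VI (φ = 0.1, λ = 1, θ = 0.5).

```python
import torch

def stimulated_terms(h_enc, h_lm, alpha_nb, beta_nb):
    """Eq. (10)-(11) for one sample.

    h_enc:    (T, D) hidden states h_t of the sequence-learning RNN
    h_lm:     (K, D) RNN-LM hidden states h_k, one per target gloss
    alpha_nb: (T, K) non-blank forward probabilities alpha'_t
    beta_nb:  (T, K) non-blank backward probabilities beta'_t
    """
    prod = alpha_nb * beta_nb
    gamma = prod / prod.sum(dim=1, keepdim=True)                      # Eq. (10)
    sq_dist = ((h_enc[:, None, :] - h_lm[None, :, :]) ** 2).sum(-1)   # ||h_t - h_k||^2, (T, K)
    T, K = gamma.shape
    l_stimuli = (gamma * sq_dist).sum() / (K * T)                     # Eq. (11)
    return gamma, l_stimuli

# Toy shapes: T time-steps, K target glosses, D hidden units.
T, K, D = 40, 5, 256
h_enc, h_lm = torch.randn(T, D), torch.randn(K, D)
alpha_nb, beta_nb = torch.rand(T, K), torch.rand(T, K)
gamma, l_stimuli = stimulated_terms(h_enc, h_lm, alpha_nb, beta_nb)

# Combined objectives, given l_ctc, entropy_H and l_lm from their own modules:
# l_stim   = l_ctc + 1.0 * l_lm + 0.5 * l_stimuli                     # Eq. (12)
# l_enstim = l_ctc - 0.1 * entropy_H + 1.0 * l_lm + 0.5 * l_stimuli   # Eq. (13)
```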
VI. EXPERIMENTAL EVALUATION
In order to provide a fair evaluation, we re-implemented
the selected approaches and evaluated them on multiple large-scale datasets, in both isolated and continuous SLR. Re-implementations are based on the original authors' guidelines
and any modifications are explicitly referenced. For the continuous setup, the criteria CTC, EnCTC, and EnStimCTC are
evaluated in all architectures. For a fair comparison between
different models, we opt to use the full frame modality, since
it is the common modality between selected datasets and it is
more suitable for real-life applications. We omit the iterative
optimization process; instead, we pretrain each model on the
respective dataset’s isolated version, if present. Otherwise,
extracted pseudo-alignments from other models (i.e. Phoenix)
are used for isolated pretraining (implementations and experimental results are publicly available at https://zenodo.org/record/3941811#.XxrZXZZRU5k to enforce reproducibility in SLR).
A. Datasets and Evaluation metrics
The following datasets have been chosen for experimental
evaluation: ASL 100 and 1000, CSL isol., GSL isol. for the
isolated setup, and Phoenix SD and Phoenix SI, CSL SD,
CSL SI, GSL SD, GSL SI for the CSLR setup. To evaluate
recognition performance in continuous datasets, the word error
rate (WER) metric has been adopted, which quantifies the
similarity between predicted glosses and ground truth gloss
sequence. WER measures the least number of operations
needed to transform the aligned predicted sequence to the
ground truth and can be defined as:
WER = (S + D + I) / N,    (14)
where S is the total number of substitutions, D is the total
number of deletions, I is the total number of insertions and
N is the total number of glosses in the ground truth.
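As an illustration, WER can be computed with a standard edit-distance dynamic program over gloss sequences; the sketch below is a generic implementation, not the exact evaluation script used in the experiments.

```python
def wer(reference, hypothesis):
    """Word (gloss) error rate: (S + D + I) / N, via edit distance (Eq. (14))."""
    ref, hyp = list(reference), list(hypothesis)
    # d[i][j]: minimum number of edits to turn hyp[:j] into ref[:i]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])    # substitution or match
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("HELLO CAN HELP".split(), "HELLO HELP".split()))       # 1 deletion / 3 glosses = 0.33
```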
TABLE II
GLOSS TEST ACCURACY IN PERCENTAGE - ISOLATED SLR

Method                 | ASL 1000 | ASL 100 | CSL isol. | GSL isol.
GoogLeNet+TConvs [25]  | -        | 44.92   | 79.31     | 86.03
3D-ResNet [45]         | -        | 50.48   | 89.91     | 86.23
I3D [57]               | 40.99    | 72.07   | 95.68     | 89.74
B. Data augmentation and implementation details
The same data preprocessing methods are used for all
datasets. Each frame is normalised by the mean and standard
deviation of the ImageNet dataset. To increase the variability of the training videos, the following data augmentation
techniques are adopted. Frames are resized to 256x256 and cropped at a random position to 224x224. Random frame sampling is used for up to 80% of the video length. Moreover, random jittering of the brightness, contrast, saturation and hue values of each frame is applied. The models are trained with the Adam optimizer with an initial learning rate λ_0 = 10^{-4}, which is reduced to λ_i = 10^{-5} when the validation loss starts to plateau. For isolated SLR experiments, the batch size is set to 2. Videos are rescaled to a fixed length that is equal to the average gloss length of each dataset. For CSLR experiments, videos are downsampled to a maximum length of 250 frames, if necessary.
Batch size is set to 1, due to GPU memory constraints. The
experiments are conducted on an NVIDIA GeForce GTX-1080
Ti GPU with 12 GB of memory and 32 GB of RAM. All
models, depending on the dataset, require 10 to 25 epochs to
converge.
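As a rough illustration of the preprocessing above, a torchvision-based per-frame pipeline could look as follows; the jitter magnitudes are assumptions, and in practice the same random crop and jitter parameters are typically shared across all frames of a video.

```python
import torch
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]    # ImageNet statistics used for normalisation
IMAGENET_STD = [0.229, 0.224, 0.225]

frame_transform = transforms.Compose([
    transforms.Resize((256, 256)),        # resize to 256x256
    transforms.RandomCrop(224),           # crop at a random position to 224x224
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

def sample_frame_indices(num_frames, keep_ratio=0.8):
    """Random temporal sampling: keep a sorted random subset of the frames."""
    keep = max(1, int(num_frames * keep_ratio))
    idx = torch.randperm(num_frames)[:keep]
    return torch.sort(idx).values
```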
The referenced models, depending on the dataset, have been
modified as follows: In SubUNets, AlexNet [58] is used as
feature extractor instead of CaffeNet [59], as they share a similar architecture. Additionally, for the CSL and GSL datasets,
we reduce the bidirectional LSTM hidden size by half, due
to computational space complexity. In the isolated setup, the
LSTM layers of SubUNets are trained along with the feature
extractor. In order to achieve the maximum performance of
GoogLeNet+TConvs, a manual customization of TConvs 1D
CNN kernels and pooling sizes is necessary. The intuition behind it is that the receptive field should approximately cover the average gloss duration. Each 1D CNN layer
includes 1024 filters. In CSL, the 1D CNN are set with kernel
size 7, stride 1 and the max-pooling layers with kernel sizes
and strides equal to 3, to cover the average gloss duration of
58 frames. For the GSL dataset the TConvs are tuned with
kernel sizes equal to 5 and pooling sizes equal to 3. In order
to deploy 3D-ResNet and I3D in a CSLR setup, a sliding
window technique is adopted in the input sequence. Window
size and stride are selected to cover the average gloss duration.
Then, a 2-layer bidirectional LSTM is added to model the
long-term temporal correlations in the feature sequence. In
CSL, the window size is set to 50 and stride 36, whereas in
GSL the window size is set to 25, with stride equal to 12.
I3D and 3D-ResNet are initialized with weights pretrained
on Kinetics. Also, for the 3D-ResNet method, we omit the
attentional decoder from the original paper, keeping the 3D-ResNet+LSTM model.

TABLE III
FINE-TUNING IN CSLR DATASETS. RESULTS ARE REPORTED IN WER

Method                    | Phoenix SD (Val. / Test) | Phoenix SI (Val. / Test) | CSL SI (Test) | CSL SD (Test) | GSL SI (Test) | GSL SD (Test)
I3D (Kinetics)            | 53.81 / 51.27            | 65.53 / 62.38            | 23.19         | 72.39         | 34.52         | 75.42
I3D (Kinetics + ASL 1000) | 40.89 / 40.49            | 59.60 / 58.36            | 16.73         | 64.72         | 27.09         | 71.05
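To make the sliding-window input scheme described above concrete, the following is a minimal PyTorch sketch; the window and stride values follow the GSL configuration reported above (25/12), while the tensor layout is an assumption for illustration.

```python
import torch

# Sliding-window input for the 3D CNNs; each window roughly covers a gloss.
video = torch.randn(180, 3, 224, 224)          # (num_frames, C, H, W)
window, stride = 25, 12
clips = video.unfold(0, window, stride)        # (num_windows, C, H, W, window)
clips = clips.permute(0, 1, 4, 2, 3)           # (num_windows, C, T, H, W) for a 3D CNN
# Each clip is encoded independently; the resulting per-window features form the
# sequence that the 2-layer bidirectional LSTM models for CTC-based decoding.
print(clips.shape)                             # torch.Size([13, 3, 25, 224, 224])
```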
In initial experiments it was observed that by training with
StimCTC, all baseline models were unable to converge. The
main reason is that the networks produce unstable output
probability distributions in the early stage of training. On
the contrary, introducing Lstim in the late training stage,
constantly improved the networks’ performance. The overall
best results were obtained with EnStimCTC. The reason is
that, while the entropy term H introduces more variability in
the early optimisation process, convergence is hindered in the
late training stage. By removing H and introducing Lstim ,
the possible alignments generated by EnCTC are filtered.
Regarding the hyper-parameters of the selected criteria, a
tuning was necessary. For EnCTC, the hyperparameter φ is
varied in the range of 0.1 to 0.2. For EnStimCTC, λ is set
to 1. Concerning θ, evaluations for θ = 0.1, 0.2, 0.5, 1 are
performed. The best results were obtained with θ = 0.5 and
φ = 0.1.
C. Experimental results
In Table II, quantitative results are reported for the isolated
setup. Classification accuracy is reported in percentage. It
can be seen that 3D baseline methods achieve higher gloss
recognition rate than 2D ones. I3D clearly outperforms other
architectures in this setup, by a minimum margin of 2.2% to a
maximum of 21.6%. I3D and 3D-ResNet were pretrained on
Kinetics, which explains their superiority in performance. The
3D CNN models achieve satisfactory results in datasets created
under laboratory conditions, yet in challenging scenarios, I3D
clearly outperforms 3D-ResNet. Specifically in ASL 1000,
where glosses are not executed in a controlled environment,
only I3D is able to converge. SubUNets performed poorly or did not converge at all and their results are deliberately excluded. SubUNets' inability to converge may be due to their large number of parameters (roughly 125M).
TABLE IV
COMPARISON OF PRETRAINING SCHEMES: RESULTS OF THE I3D ARCHITECTURE, AS MEASURED IN TEST WER, USING MULTIPLE FULLY-SUPERVISED APPROACHES BEFORE TRAINING IN CSLR

Method                     | CSL SI (Test) | GSL SI (Val. / Test)
SubUNets alignments        | 5.94          | 18.43 / 20.00
Uniform alignments         | 16.98         | 27.30 / 29.08
Transfer learning from ASL | 16.73         | 25.89 / 27.09
Proximal transfer learning | 6.45          | 8.78 / 8.62
In Table III, I3D+LSTM is fine-tuned on CSLR datasets
with CTC in 2 configurations: a) using the pretrained weights
from Kinetics, and b) pretraining in ASL 1000. Results are
improved by 6.79% on average for the second configuration.
This was expected due to the task relevance.
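For reference, this figure follows directly from the test columns of Table III: the absolute WER reductions are 10.78, 4.02, 6.46, 7.67, 7.43 and 4.37, i.e. (10.78 + 4.02 + 6.46 + 7.67 + 7.43 + 4.37)/6 ≈ 6.79.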
Table IV presents an evaluation of the impact of transfer
learning versus training with initial pseudo-alignments, as a
pretraining scheme. The following four cases are considered:
• directly train a shallow model (i.e. SubUNets) without pretraining, to obtain initial pseudo-alignments,
• assume uniform pseudo-alignments over the input video for each gloss in a sentence,
• transfer learning from a large-scale isolated dataset (ASL), and
• proximal transfer learning from the respective dataset's isolated subset.
Experiments are conducted on the CSL SI and the GSL SI
evaluation sets, since they have annotated isolated subsets for
proximal transfer learning. For the particular experiment I3D
is used, since it is the best performing model in isolated setup
(Table II). SubUNets are chosen to infer the initial pseudo-alignments, because pretraining is not required by design.
Training was performed with the traditional CTC criterion. In
CSL SI, the best strategy by a relative margin of 7.9%, seems
to be pretraining on pseudo-alignments. On the contrary, in
GSL SI the best results are acquired with proximal transfer
learning, by a relative gain of 56.9% compared to pseudo-alignments. Producing pseudo-alignments requires more training time, while a proximal isolated subset is not always available.
In Tables V and VI, quantitative results regarding CSLR are
reported. The selected architectures are evaluated in CSLR
datasets in both SD and SI subsets, using the proposed
criteria. Training with EnCTC needs more epochs to converge, due to the fact that a greater number of possible paths is explored, yet it converges to a better local optimum. Overall,
EnCTC shows an average improvement of 1.59% in WER
(9.73% relative). A further reduction of 1.60% in WER (5.69%
relative) is observed by adding StimCTC. It can be seen that
the proposed EnStimCTC criterion improves recognition in all
datasets by an overall WER gain of 3.26% (14.56% relative).
In the reported average gains SubUNets are excluded due to
performance deterioration.
In the Phoenix SD subset, all models benefit from training with the EnStimCTC loss by 1.59% less WER on average. Fig. 3 depicts the models' WER in the Phoenix SD validation set.
SubUNets have a WER of 29.51% in validation set and
29.22% in test set, which is an average reduction of 12.59%
TABLE V
REPORTED RESULTS IN CONTINUOUS SD SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED IN THE RESPECTIVE ISOLATED SUBSET.

Phoenix SD (Val. / Test):
Method                 | CTC           | EnCTC         | EnStimCTC
SubUNets [26]          | 30.51 / 30.62 | 32.02 / 31.61 | 29.51 / 29.22
GoogLeNet+TConvs [25]  | 32.18 / 31.37 | 31.66 / 31.74 | 28.87 / 29.11
3D-ResNet+LSTM [45]    | 38.81 / 37.79 | 38.80 / 37.50 | 36.74 / 35.51
I3D+LSTM [57]          | 32.88 / 31.92 | 32.60 / 32.70 | 31.16 / 31.48

CSL SD (Test):
Method                 | CTC   | EnCTC | EnStimCTC
SubUNets [26]          | 78.31 | 81.33 | 80.13
GoogLeNet+TConvs [25]  | 65.83 | 64.04 | 64.43
3D-ResNet+LSTM [45]    | 72.44 | 70.20 | 68.35
I3D+LSTM [57]          | 64.73 | 64.06 | 60.68

GSL SD (Val. / Test):
Method                 | CTC           | EnCTC         | EnStimCTC
SubUNets [26]          | 52.79 / 54.31 | 58.11 / 60.09 | 55.03 / 57.49
GoogLeNet+TConvs [25]  | 43.54 / 48.46 | 42.69 / 44.11 | 38.92 / 42.33
3D-ResNet+LSTM [45]    | 61.94 / 68.54 | 63.47 / 66.54 | 57.88 / 61.64
I3D+LSTM [57]          | 51.74 / 53.48 | 51.37 / 53.48 | 49.89 / 49.99
TABLE VI
REPORTED RESULTS IN CONTINUOUS SI SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED IN THE RESPECTIVE ISOLATED SUBSET.

Phoenix SI (Val. / Test):
Method                 | CTC           | EnCTC         | EnStimCTC
SubUNets [26]          | 56.56 / 55.06 | 55.59 / 53.42 | 55.01 / 54.11
GoogLeNet+TConvs [25]  | 46.70 / 46.67 | 47.14 / 46.70 | 46.42 / 46.41
3D-ResNet+LSTM [45]    | 55.88 / 53.77 | 54.69 / 54.57 | 52.88 / 50.98
I3D+LSTM [57]          | 55.24 / 54.43 | 54.42 / 53.92 | 53.70 / 52.71

CSL SI (Test):
Method                 | CTC   | EnCTC | EnStimCTC
SubUNets [26]          | 3.29  | 5.13  | 4.14
GoogLeNet+TConvs [25]  | 4.06  | 2.46  | 2.41
3D-ResNet+LSTM [45]    | 19.09 | 13.36 | 14.31
I3D+LSTM [57]          | 6.49  | 4.26  | 2.72

GSL SI (Val. / Test):
Method                 | CTC           | EnCTC         | EnStimCTC
SubUNets [26]          | 24.64 / 24.03 | 21.73 / 20.58 | 21.65 / 20.62
GoogLeNet+TConvs [25]  | 8.08 / 7.95   | 7.63 / 6.91   | 6.99 / 6.75
3D-ResNet+LSTM [45]    | 33.61 / 33.07 | 27.80 / 26.75 | 25.58 / 24.01
I3D+LSTM [57]          | 8.78 / 8.62   | 7.69 / 6.55   | 6.63 / 6.10
Fig. 3. Validation WER of the implemented architectures in the Phoenix SD dataset trained with EnStimCTC loss.
WER, compared to the original paper’s results (42.1% vs
30.62%) [26]. Furthermore, 2D-based CNNs produce similar
results with negligible difference in performance. Similarly, in
Phoenix SI, GoogLeNet+TConvs trained with EnStimCTC is
the best performing setup with an average of 10.9% relative
less WER, compared to the others. Finally, all architectures in
Phoenix SI have worse recognition performances compared to
their SD counterparts, due to a reduction of more than 20% in the training
data.
In the CSL SI dataset, all methods, except for 3D-ResNet,
have comparable recognition performance. They achieve high
recognition accuracy due to the large size of the dataset
and the small size of the vocabulary. I3D+LSTM seems to
benefit the most when trained with EnStimCTC, with 3.77%
absolute WER reduction. GoogLeNet+TConvs has the best
performance with 2.41% WER, which is 5.36% less WER
on average than the other models and 1.65% less compared
to CTC training (Fig. 5). This method outperforms the current
state-of-the-art method on CSL SI [2] by an absolute reduction
of 1.39% WER (3.80 vs 2.41) and relatively by 36.58%.
In CSL SD, WER results are considerably higher than in CSL SI, with an average absolute WER increase of 70.00%. The best
performing model is I3D+LSTM with 60.68% WER.
In GSL SI, I3D+LSTM and GoogLeNet+TConvs recognition results are close, with 6.63%/6.10% and 6.99%/6.75%
WER in development and test set, respectively. In GSL SD
GoogLeNet+TConvs yields the lowest WER (42.33%) yet is
still higher than its GSL SI results by 35.58% absolute WER.
Regarding CSL and GSL, both datasets have a relatively low
number of unique gloss combinations and their SD test sets
contain unseen gloss sequences. For these reasons, all models
tend to predict combinations of glosses similar to the ones seen
during training. This explains the superior recognition rates in
the SI subsets of CSL and GSL.
VII. DISCUSSION
A. Performance comparison of implemented architectures
From the group of experiments in isolated SLR in Table II, it
was experimentally shown that 3D methods are more suitable
for isolated gloss classification compared to the 2D models.
This is justified by the fact that 2D CNNs do not model
dependencies between neighbouring frames, where motion
features play a crucial role in classifying a gloss. I3D is considered the most capable of directly modeling the intra-gloss dependencies, leading to superior generalization capabilities. The advantage of the 3D inception layer lies in its ability to project multi-channel spatio-temporal features into dense, lower-dimensional embeddings. This leads to accumulating higher
semantic representations (more abstract output features). In the opposite direction, deploying skip connections maintains previous layers' feature maps that correspond to lower semantic content that does not assist in isolated SLR.

Modeling intermediate short temporal dependencies was experimentally shown (Tables V, VI) to enhance the CSLR performance. The implemented 3D CNN architectures directly capture spatio-temporal correlations as intermediate representations. The design choice of providing the input video in a sliding window restricts the network's temporal receptive field. Based on a sequential structure, architectures such as GoogLeNet+TConvs achieve the same goal, by grouping consecutive spatial features. Such a sequential approach can prove beneficial in many datasets, given that spatial filters are well-trained. For this reason, such approaches require heavy pretraining in the backbone network. The superiority in performance of the implemented sequential approach is justified in the careful manual tuning of temporal kernels and strides. However, manual design significantly downgrades the advantages of transfer learning. The sliding window technique can be easily adapted based on the particularities of each dataset, making 3D CNNs more scalable and suitable for real-life applications. To summarize, both techniques aim to approximate the average gloss duration. This is interpreted as a guidance in models, based on the statistics of the SL dataset. On the other hand, utilizing only LSTMs to capture the temporal dependencies (i.e. SubUNets) results in an ineffective modeling of intra-gloss correlations. LSTMs are designed to model the long-term dependencies that correspond to the inter-gloss dependencies. Taking a closer look at the predicted alignments of each approach, it is noticed that 3D architectures do not provide as precise gloss boundaries for true positives as the 2D ones. We strongly believe that this is the reason that 3D models benefit more from the introduced variations of the traditional CTC.

Fig. 4. Visual comparison of ground truth alignments with the predictions of the proposed training criteria. GoogLeNet+TConvs is used for evaluation in the GSL SD dataset.

Fig. 5. Comparison of validation WER of CTC and EnStimCTC criteria with GoogLeNet+TConvs in CSL SI and Phoenix SD datasets.
B. Comparison between CTC variations
The reported experimental results exhibit the negative influence of CTC’s drawbacks (overconfident paths and conditional
independence assumption) in CSLR. EnCTC’s contribution
to alleviating the overconfident paths is illustrated in Fig. 4.
The ground truth gloss “PROOF” is recognized with the
introduction of H, instead of “APPROVAL”. The latter has six
times higher occurrence frequency. After a careful examination
of the aforementioned signs, one can notice that these signs
are close in terms of hand position and execution speed, which
justifies the depicted predictions. Furthermore, it is observed
that EnCTC boosts performance mostly in CSL SI and GSL
SI, due to the limited diversity and vocabulary. It can be highlighted that EnCTC did not boost SubUNets’ performance. The
latter generates per-frame predictions (T = N), whereas the rest of the approaches generate grouped predictions (T ≈ N/4). This results in a significantly larger space of possible alignments that is harder to explore with this criterion. From Fig.
4, it can be visually validated that EnStimCTC remedies the
conditional independence assumption. For instance, the gloss
“CHECK” was only recognised with stimulated training. By
bringing closer predictions that correspond to the same target
gloss, the intra-gloss dependencies are effectively modeled.
In parallel, the network was also able to correctly classify
transitions between glosses as blank. It should be also noted
that EnStimCTC does not increase time and space complexity
during inference.
C. Evaluation of pretraining schemes
Due to the limited contribution of CTC gradients in the feature extractor, an effective pretraining is mandatory. As shown
in Fig. 3, pretraining significantly affects the starting WER of
each model. Without pretraining, all models congregate around
the most dominant glosses, which significantly slows down
the CSLR training process and limits the learning capacity
of the network. Fully supervised pretraining is interpreted
as a domain shift to the distribution of the SL dataset that
speeds up the early training stage in CSLR. Regarding the
pretraining scheme, in datasets with limited vocabulary and
gloss sequences (i.e. CSL), inferring initial pseudo-alignments
proved beneficial, as shown in Table IV. This is explained
due to the fact that the data distribution of the isolated
subset had different particularities, such as sign execution
speed. However, producing initial pseudo-alignments is time
consuming. Hence, the small deterioration in performance is
an acceptable trade-off between recognition rate and time to
train. The proposed GSL dataset contains nearly double the
vocabulary and roughly three times the number of unique
gloss sentences, with less training instances. More importantly,
the isolated subset draws instances from the same distribution
as the continuous one. In such cases it can be stated that
proximal transfer learning significantly outperforms training
with pseudo-alignments (56.90% relative improvement in the
GSL dataset).
VIII. CONCLUSIONS AND FUTURE WORK
In this paper, an in-depth analysis of the most characteristic
DNN-based SLR model architectures was conducted. Through
extensive experiments in three publicly available datasets,
a comparative evaluation of the most representative SLR
architectures was presented. Alongside this evaluation,
a new publicly available large-scale RGB+D dataset was
introduced for the Greek SL, suitable for SLR benchmarking.
Two CTC variations known from other application fields,
EnCTC & StimCTC, were evaluated for CSLR and it was
noticed that their combination tackled two important issues,
the ambiguous boundaries of adjacent glosses and intra-gloss
dependencies. Moreover, a pretraining scheme was provided,
in which transfer learning from a proximal isolated dataset can
be a good initialization for CSLR training. The main finding
of this work was that while 3D CNN-based architectures were
more effective in isolated SLR, 2D CNN-based models with
an intermediate per gloss representation achieved superior
results in the majority of the CSLR datasets. In particular,
our implementation of GoogLeNet+TConvs, with the proposed
pretraining scheme and EnStimCTC criterion, yielded state-of-the-art results in CSL SI.
Concerning future work, efficient ways for integrating depth
information that will guide the feature extraction training
phase, can be devised. Moreover, another promising direction is to investigate the incorporation of more sequence
learning modules, like attention-based approaches, in order
to adequately model inter-gloss dependencies. Future SLR
architectures may be enhanced by fusing highly semantic
representations that correspond to the manual and non-manual
features of SL, similar to humans. Finally, it would be of great
importance for the deaf-non deaf communication to bridge the
gap between SLR and SL translation. Advancements in this
domain will drive research to SL translation as well as SL to
SL translation, which have not yet been thoroughly studied.
IX. ACKNOWLEDGEMENTS
This work was supported by the Greek General Secretariat
of Research and Technology under contract Τ1Ε∆Κ-02469
EPIKOINONO.
The authors would like to express their gratitude to Vasileios
Angelidis, Chrysoula Kyrlou and Georgios Gkintikas from the
Greek sign language center (https://www.keng.gr/) for their valuable feedback and
contribution to the Greek sign language capturings.
REFERENCES
[1] W. Sandler and D. Lillo-Martin, Sign language and linguistic universals.
Cambridge University Press, 2006.
[2] Z. Yang, Z. Shi, X. Shen, and Y.-W. Tai, “Sf-net: Structured feature
network for continuous sign language recognition,” arXiv preprint
arXiv:1908.01341, 2019.
[3] O. Koller, J. Forster, and H. Ney, “Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling
multiple signers,” Computer Vision and Image Understanding, vol. 141,
pp. 108–125, 2015.
[4] R. E. Mitchell, T. A. Young, B. BACHELDA, and M. A. Karchmer,
“How many people use asl in the united states? why estimates need
updating,” Sign Language Studies, vol. 6, no. 3, pp. 306–335, 2006.
[5] D. Bragg, O. Koller, M. Bellard, L. Berke, P. Boudrealt, A. Braffort,
N. Caselli, M. Huenerfauth, H. Kacorri, T. Verhoef et al., “Sign
language recognition, generation, and translation: An interdisciplinary
perspective,” arXiv preprint arXiv:1908.08597, 2019.
[6] G. T. Papadopoulos and P. Daras, “Human action recognition using 3d
reconstruction data,” IEEE Transactions on Circuits and Systems for
Video Technology, vol. 28, no. 8, pp. 1807–1823, 2016.
[7] H. Cooper, B. Holt, and R. Bowden, “Sign language recognition,” in
Visual Analysis of Humans. Springer, 2011, pp. 539–562.
[8] F. Ronchetti, F. Quiroga, C. A. Estrebou, L. C. Lanzarini, and A. Rosete,
“Lsa64: an argentinian sign language dataset,” in XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016), 2016.
[9] M. W. Kadous et al., “Machine recognition of auslan signs using
powergloves: Towards large-lexicon recognition of sign language,” in
Proceedings of the Workshop on the Integration of Gesture in Language
and Speech, vol. 165, 1996.
[10] C. Wang, Z. Liu, and S.-C. Chan, “Superpixel-based hand gesture
recognition with kinect depth camera,” IEEE transactions on multimedia,
vol. 17, no. 1, pp. 29–39, 2014.
[11] G. D. Evangelidis, G. Singh, and R. Horaud, “Continuous gesture recognition from articulated poses,” in European Conference on Computer
Vision. Springer, 2014, pp. 595–607.
[12] J. Zhang, W. Zhou, C. Xie, J. Pu, and H. Li, “Chinese sign language
recognition with adaptive hmm,” in 2016 IEEE International Conference
on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
[13] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep sign: Enabling
robust statistical continuous sign language recognition via hybrid cnn-hmms,” International Journal of Computer Vision, vol. 126, no. 12, pp.
1311–1325, 2018.
[14] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell,
“Hidden conditional random fields for gesture recognition,” in 2006
IEEE Computer Society Conference on Computer Vision and Pattern
Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1521–1527.
[15] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional
networks for visual recognition and description,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2015, pp.
2625–2634.
[16] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream
network fusion for video action recognition,” in Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016, pp. 1933–
1941.
[17] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3d convolutional neural networks,” in Proceedings of the IEEE
conference on computer vision and pattern recognition workshops, 2015,
pp. 1–7.
[18] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Using convolutional 3d neural networks for user-independent continuous gesture
recognition,” in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 49–54.
[19] D. S. Alexiadis, A. Chatzitofis, N. Zioulis, O. Zoidi, G. Louizis,
D. Zarpalas, and P. Daras, “An integrated platform for live 3d human
reconstruction and motion capturing,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 27, no. 4, pp. 798–813, 2016.
[20] D. S. Alexiadis and P. Daras, “Quaternionic signal processing techniques
for automatic evaluation of dance performances from mocap data,” IEEE
Transactions on Multimedia, vol. 16, no. 5, pp. 1391–1406, 2014.
[21] H. Cooper, E.-J. Ong, N. Pugeault, and R. Bowden, “Sign language
recognition using sub-units,” Journal of Machine Learning Research,
vol. 13, no. Jul, pp. 2205–2231, 2012.
[22] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “Moddrop: adaptive
multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2015.
[23] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and
J.-M. Odobez, “Deep dynamic neural networks for multimodal gesture
segmentation and recognition,” IEEE transactions on pattern analysis
and machine intelligence, vol. 38, no. 8, pp. 1583–1597, 2016.
[24] O. Koller, C. Camgoz, H. Ney, and R. Bowden, “Weakly supervised
learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos,” IEEE transactions on pattern analysis
and machine intelligence, 2019.
[25] R. Cui, H. Liu, and C. Zhang, “A deep neural framework for continuous
sign language recognition by iterative training,” IEEE Transactions on
Multimedia, 2019.
[26] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Subunets: End-to-end hand shape and continuous sign language recognition,” in 2017
IEEE International Conference on Computer Vision (ICCV). IEEE,
2017, pp. 3075–3084.
[27] R. Cui, H. Liu, and C. Zhang, “Recurrent convolutional neural networks
for continuous sign language recognition by staged optimization,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2017, pp. 7361–7369.
[28] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with
recurrent neural networks,” in Proceedings of the 23rd international
conference on Machine learning. ACM, 2006, pp. 369–376.
[29] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization
for spoken word recognition,” IEEE transactions on acoustics, speech,
and signal processing, vol. 26, no. 1, pp. 43–49, 1978.
[30] H. Liu, S. Jin, and C. Zhang, “Connectionist temporal classification with
maximum entropy regularization,” in Advances in Neural Information
Processing Systems, 2018, pp. 831–841.
[31] H. Zhou, W. Zhou, and H. Li, “Dynamic pseudo label decoding for
continuous sign language recognition,” in 2019 IEEE International
Conference on Multimedia and Expo (ICME). IEEE, 2019, pp. 1282–
1287.
[32] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal
processing magazine, vol. 13, no. 6, pp. 47–60, 1996.
[33] O. Koller, H. Ney, and R. Bowden, “Deep hand: How to train a cnn
on 1 million hand images when your data is continuous and weakly
labelled,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 3793–3802.
[34] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep sign: Hybrid
cnn-hmm for continuous sign language recognition,” in Proceedings of
the British Machine Vision Conference 2016, 2016.
[35] O. Koller, S. Zargaran, and H. Ney, “Re-sign: Re-aligned end-to-end
sequence modelling with deep recurrent cnn-hmms,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2017, pp. 4297–4305.
[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[38] J. Pu, W. Zhou, and H. Li, “Sign language recognition with multi-modal
features,” in Pacific Rim Conference on Multimedia. Springer, 2016,
pp. 252–261.
[39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning
spatiotemporal features with 3d convolutional networks,” in Proceedings
of the IEEE international conference on computer vision, 2015, pp.
4489–4497.
[40] J. Shawe-Taylor and N. Cristianini, “Support vector machines,” An Introduction to Support Vector Machines and Other Kernel-based Learning
Methods, pp. 93–112, 2000.
[41] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, “Video-based sign
language recognition without temporal segmentation,” in Thirty-Second
AAAI Conference on Artificial Intelligence, 2018.
[42] H. R. V. Joze and O. Koller, “Ms-asl: A large-scale data set and
benchmark for understanding american sign language,” arXiv preprint
arXiv:1812.01053, 2018.
[43] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” arXiv preprint arXiv:1705.07750, 2017.
[44] J. Pu, W. Zhou, and H. Li, “Dilated convolutional network with iterative
optimization for continuous sign language recognition.” in IJCAI, vol. 3,
2018, p. 7.
[45] J. Pu, W. Zhou, and H. Li, “Iterative alignment network for continuous
sign language recognition,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 4165–4174.
[46] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” arXiv preprint arXiv:1409.0473,
2014.
[47] M. Cuturi and M. Blondel, “Soft-dtw: a differentiable loss function for
time-series,” in Proceedings of the 34th International Conference on
Machine Learning-Volume 70. JMLR. org, 2017, pp. 894–903.
[48] U. Von Agris, M. Knorr, and K.-F. Kraiss, “The significance of facial
features for automatic sign language recognition,” in 2008 8th IEEE
International Conference on Automatic Face & Gesture Recognition.
IEEE, 2008, pp. 1–6.
[49] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, “Extensions
of the sign language recognition and translation corpus rwth-phoenix-weather,” in LREC, 2014, pp. 1911–1916.
[50] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 7784–7793.
[51] A. Baker, B. van den Bogaerde, R. Pfau, and T. Schermer, The linguistics of sign languages: An introduction. John Benjamins Publishing Company, 2016.
[52] E. T. Jaynes, “Information theory and statistical mechanics,” Physical
review, vol. 106, no. 4, p. 620, 1957.
[53] S. Tan, K. C. Sim, and M. Gales, “Improving the interpretability of deep
neural networks with stimulated learning,” in 2015 IEEE Workshop on
Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015,
pp. 617–623.
[54] C. Wu, P. Karanasou, M. J. Gales, and K. C. Sim, “Stimulated deep
neural network for speech recognition,” in Interspeech 2016, 2016, pp.
400–404.
[55] C. Wu, M. J. Gales, A. Ragni, P. Karanasou, and K. C. Sim, “Improving
interpretability and regularization in deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp.
256–265, 2017.
[56] J. Heymann, K. C. Sim, and B. Li, “Improving ctc using stimulated learning for sequence modeling,” in ICASSP 2019-2019 IEEE
International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 5701–5705.
[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
the inception architecture for computer vision,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2016, pp.
2818–2826.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,
S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for
fast feature embedding,” in Proceedings of the 22nd ACM international
conference on Multimedia. ACM, 2014, pp. 675–678.