
A Comprehensive Study on Sign Language Recognition Methods

2020, ArXiv


arXiv:2007.12530v1 [cs.CV] 24 Jul 2020

Nikolas Adaloglou1*, Theocharis Chatzis1*, Ilias Papastratis1*, Andreas Stergioulas1*, Georgios Th. Papadopoulos1, Member, IEEE, Vassia Zacharopoulou2, George J. Xydopoulos2, Klimnis Atzakas2, Dimitris Papazachariou2, and Petros Daras1, Senior Member, IEEE
1 Centre for Research and Technology Hellas, 2 University of Patras
* Authors contributed equally.

Abstract—In this paper, a comparative experimental assessment of computer vision-based methods for sign language recognition is conducted. By implementing the most recent deep neural network methods in this field, a thorough evaluation on multiple publicly available datasets is performed. The aim of the present study is to provide insights on sign language recognition, focusing on mapping non-segmented video streams to glosses. For this task, two new sequence training criteria, known from the fields of speech and scene text recognition, are introduced. Furthermore, a plethora of pretraining schemes is thoroughly discussed. Finally, a new RGB+D dataset for the Greek sign language is created. To the best of our knowledge, this is the first sign language dataset where sentence and gloss level annotations are provided for a video capture.

Index Terms—Sign Language Recognition, Greek sign language, Deep neural networks, stimulated CTC, conditional entropy CTC.

I. INTRODUCTION

Spoken languages make use of the "vocal - auditory" channel, as they are articulated with the mouth and perceived with the ear. All writing systems also derive from, or are representations of, spoken languages. Sign languages (SLs) are different, as they make use of the "corporal - visual" channel: they are produced with the body and perceived with the eyes. SLs are not international and they are widely used by the communities of deaf people. They are natural languages, since they are developed spontaneously wherever deaf people have the opportunity to congregate and communicate with each other. SLs are not derived from spoken languages; they have their own independent vocabularies and their own grammatical structures [1]. The signs used by deaf people actually have internal structure, in the same way as spoken words. Just as hundreds of thousands of English words are produced using a small number of different sounds, the signs of SLs are produced using a finite number of gestural features. Thus, signs are not holistic gestures but are rather analyzable as a combination of linguistically significant features. Similarly to spoken languages, SLs are composed of the following indivisible features:
• Manual features, i.e. hand shape, position, movement and orientation of the palm or fingers, and
• Non-manual features, namely eye gaze, head-nods/shakes, shoulder orientations and various kinds of facial expression, such as mouthing and mouth gestures.
Combinations of the above-mentioned features represent a gloss, which is the fundamental building block of a SL and represents the closest meaning of a sign [2]. SLs, similarly to the spoken ones, include an inventory of flexible grammatical rules that govern both manual and non-manual features [3]. Both of them are used simultaneously (and often with loose temporal structure) by signers, in order to construct sentences in a SL. Depending on the context, a specific feature may be the most critical factor towards interpreting a gloss. It can modify the meaning of a verb, provide spatial/temporal reference and discriminate between objects and people.

Due to the intrinsic difficulty of the deaf community to interact with the rest of society (according to [4], around 500,000 people use the American SL to communicate in the USA), the development of robust tools for automatic SL recognition would greatly alleviate this communication gap. As stated in [5], there is an increased demand for interdisciplinary collaboration including the deaf community and for the creation of representative public video datasets. Sign Language Recognition (SLR) can be defined as the task of inferring glosses performed by a signer from video captures. Even though there is a significant amount of work in the field of SLR, the lack of a complete experimental study is profound. Moreover, most publications do not report results on all available datasets or share their code. Thus, experimental results in the field of SL are rarely reproducible and often lack interpretation. Apart from the inherent difficulties related to human motion analysis (e.g. differences in the appearance of the subjects, the human silhouette features, the execution of the same actions, the presence of occlusions, etc.) [6], automatic SLR exhibits the following key additional challenges:
• Exact position in the surrounding space and context have a large impact on the interpretation of SL. For example, personal pronouns (e.g. "he", "she", etc.) do not exist. Instead, the signer points directly to any involved referent or, when reproducing the contents of a conversation, pronouns are modeled by twisting his/her shoulders or gaze.
• Many glosses are only distinguishable by their constituent non-manual features, and these are typically difficult to detect accurately, since even very slight human movements may impose different grammatical or semantic interpretations depending on the context [7].
• The execution speed of a given gloss may indicate a different meaning or the particular signer's attitude. For instance, signers would not use two glosses to express "run quickly"; they would simply speed up the execution of the involved signs [7].
• Signers often discard a gloss sub-feature, depending on the previously performed and the succeeding glosses. Hence, different instances of the exact same gloss, originating even from the same signer, can be observed.
• For most SLs so far, very few formal standardization activities have been implemented, to the extent that signers of the same country exhibit distinguishable differences during the execution of a given gloss [8].
Historically, before the advent of deep learning methods, the focus was on identifying isolated glosses and on gesture spotting. The developed methods often made use of hand-crafted techniques [9], [10]. For the spatial representation of the different sub-gloss components, they usually relied on hand-crafted features and/or the fusion of multiple modalities. Temporal modeling was achieved by classical sequence learning models, such as Hidden Markov Models (HMMs) [11], [12], [13] and hidden conditional random fields [14]. The rise of deep networks was met with a significant boost in performance for many video-related tasks, like human action recognition [15], [16], gesture recognition [17], [18], motion capturing [19], [20], etc. SLR is a task closely related to computer vision.
This is the reason that most approaches tackling SLR have adjusted to this direction. In this paper, SLR using Deep Neural Network (DNN) methods is investigated. The main contributions of this work are summarized as follows:
• A comprehensive, holistic and in-depth analysis of multiple literature DNN-based SLR methods is performed, in order to provide meaningful and detailed insights into the task at hand.
• Two new sequence learning training criteria are proposed, known from the fields of speech and scene text recognition.
• A new pretraining scheme is discussed, where transfer learning is compared to initial pseudo-alignments.
• A new publicly available large-scale RGB+D Greek Sign Language (GSL) dataset is introduced, containing real-life conversations that may occur in different public services. This dataset is particularly suitable for DNN-based approaches that typically require large quantities of expert annotated data.
The remainder of this paper is organized as follows: in Section II, related work is described. In Section III, an overview of the publicly available datasets in SLR is provided, along with the introduction of a new GSL dataset. In Section IV, a description of the implemented architectures is given. In Section V, a description of the proposed sequence training criteria is detailed. In Section VI, the performed experimental results are reported. Then, in Section VII, interpretations and insights of the conducted experiments are discussed. Finally, conclusions are drawn and future research directions are highlighted in Section VIII.

II. RELATED WORK

The various automatic SLR tasks, depending on the modeling's level of detail and the subsequent recognition step, can be roughly divided as follows (Fig. 1):
• Isolated SLR: Methods of this category target the task of video segment classification (where the segment boundaries are provided), based on the fundamental assumption that a single gloss is present [9], [21], [18].
• Sign detection in continuous streams: The aim of these approaches is to detect a set of predefined glosses in a continuous video stream [11], [22], [23].
• Continuous SLR (CSLR): These methods aim at recognizing the sequence of glosses that are present in a continuous/non-segmented video sequence [3], [24], [25]. This category of approaches exhibits characteristics that are most suitable for the needs of real-life SLR applications [5]; hence, it has gained increased research attention and will be further discussed in the remainder of this section.

A. Continuous sign language recognition

By definition, CSLR is a task very similar to that of continuous human action recognition, where a sequence of glosses (instead of actions) needs to be identified in a continuous stream of video data. However, glosses typically exhibit a significantly shorter duration than actions (i.e. they may only involve a very small number of frames), while transitions among them are often too subtle for their temporal boundaries to be efficiently recognized. Additionally, glosses may only involve very detailed and fine-grained human movements (e.g. finger signs or facial expressions), while human actions usually refer to more concrete and extensive human body actions. The latter facts highlight the particular challenges that are present in the CSLR field [3]. Due to the lack of gloss-level annotations, CSLR is regularly cast as a weakly supervised learning problem.
The majority of CSLR architectures usually consists of a feature extractor, followed by a temporal modeling mechanism [26], [27]. The feature extractor is used to compute feature representations from individual input frames (using 2D CNNs) or from sets of neighbouring frames (using 3D CNNs). In turn, the temporal modeling scheme enables the modeling of the SL unit feature representations (i.e. gloss-level, sentence-level). With respect to temporal modeling, sequence learning can be achieved using HMMs, Connectionist Temporal Classification (CTC) [28] or Dynamic Time Warping (DTW) [29] techniques. From the aforementioned categories, CTC has, in general, shown superior performance and the majority of works in CSLR has established CTC as the main sequence training criterion (for instance, HMMs may fail to efficiently model complex dynamic variations, due to expressiveness limitations [25]). However, CTC has the tendency to produce overconfident peak distributions, which are prone to overfitting [30]. Moreover, CTC contributes little towards optimizing the feature extractor [31]. For these reasons, some recent approaches have adopted an iterative training optimization methodology. The latter essentially comprises a two-step process. In particular, a set of temporally-aligned pseudo-labels is initially estimated and used to guide the training of the feature extraction module. In the beginning, the pseudo-labels can either be estimated by statistical approaches [3] or extracted from a shallower model [25]. After training the model in an isolated setup, the trained feature extractor is utilized for the continuous SLR setup. This process may be performed in an iterative way, similarly to the Expectation-Maximization (EM) algorithm [32]. Finally, CTC imposes a conditional independence constraint, where output predictions are independent, given the entire input sequence.

Fig. 1. An overview of SLR categories: isolated SLR (classification of an individual gloss), gloss spotting and CSLR (recognition of an entire sequence of glosses), each built from a 2D or 3D CNN-based feature extraction phase (frame-level or segment-level representations), a temporal modeling stage and a prediction phase.

B. 2D CNN-based CSLR approaches

One of the first architectures deployed in CSLR is based on [33], where a CNN-HMM network is proposed. GoogLeNet serves as the backbone architecture, fed with cropped hand regions and trained in an iterative manner. The same network architecture is deployed in a CSLR prediction setup in [34], where the CNN is trained using glosses as targets instead of hand shapes. Later on, in [35], the same authors extend their previous work by incorporating a Long Short-Term Memory unit (LSTM) [36] on top of the aforementioned network. In a more recent work [24], the authors present a three-stream CNN-LSTM-HMM network, using full frame, cropped dominant hand and signer's mouth region modalities. These models, since they employ HMMs for sequence learning, have to make strong initial assumptions in order to overcome the HMM's expressive limitations. In [26], the authors introduce an end-to-end system for CSLR without iterative training. Their model follows a 2D CNN-LSTM architecture, replacing the HMM with LSTM-CTC. It consists of two streams: one responsible for processing the full frame sequences and one for processing only the signer's cropped dominant hand.
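As an illustration of the generic CSLR pipeline described above (a 2D CNN feature extractor, a recurrent temporal model and CTC-based sequence learning), the following minimal PyTorch sketch shows how the pieces fit together. It is a simplified, hypothetical baseline, not the implementation of any of the cited works; the ResNet-18 backbone, the hidden sizes and the blank index are assumptions made only for this sketch.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CSLRBaseline(nn.Module):
    """Generic CSLR pipeline: per-frame 2D CNN features -> BiLSTM -> gloss posteriors."""
    def __init__(self, vocab_size, hidden=512):
        super().__init__()
        backbone = models.resnet18()                   # 2D CNN feature extractor (assumed choice)
        backbone.fc = nn.Identity()                    # keep 512-d frame embeddings
        self.backbone = backbone
        self.temporal = nn.LSTM(512, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the CTC blank

    def forward(self, frames):                         # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))    # (B*T, 512)
        feats = feats.view(b, t, -1)
        feats, _ = self.temporal(feats)                # (B, T, 2*hidden)
        return self.classifier(feats).log_softmax(-1)  # per-frame gloss log-probabilities

# Toy CTC training step on random data (blank index 0 chosen by convention here).
model = CSLRBaseline(vocab_size=310)
video = torch.randn(1, 16, 3, 224, 224)                # one clip of 16 frames
log_probs = model(video).transpose(0, 1)               # nn.CTCLoss expects (T, B, C)
targets = torch.tensor([[5, 17, 42]])                  # gloss indices of the target sentence
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           input_lengths=torch.tensor([16]),
                           target_lengths=torch.tensor([3]))
loss.backward()
```

The iterative schemes discussed above would additionally re-train the backbone on pseudo-aligned gloss segments before repeating such a CTC fine-tuning step.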
In [27], the authors employ a 2D CNN-LSTM architecture and in parallel with the LSTMs, a weakly supervised gloss-detection regularization network, consisting of stacked temporal 1D convolutions. The same authors in [25] extend their previous work by proposing a module composed of a series of temporal 1D CNNs followed by max pooling, between the feature extractor and the LSTM, while fully embracing the iterative optimization procedure. That module is able to produce compact representations for a video segment, which approximate the average duration of a gloss. Thereby, the LSTM captures the context information between gloss segments, instead of individual frames as in previous works. In [2], a hybrid 2D-3D CNN architecture [37] is developed. Features are extracted in a structured manner, where temporal dependencies are modeled by two LSTMs, without pretraining or using an iterative procedure. This approach however, yields the best results only in continuous SL datasets where a plethora of training data is available. C. 3D CNN-based CSLR approaches One of the first works that employs 3D-CNNs in SLR is introduced in [38]. The authors present a multi-modal approach for the task of isolated SLR, using spatio-temporal Convolutional 3D networks (C3D) [39], known from the research field of action recognition. Multi-modal representations are lately fused and fed to a Support Vector Machine (SVM) [40] classifier. The C3D architecture has also been utilized in CSLR by [41]. The developed two-stream 3D CNN processes both full frame and cropped hand images. The full network, named LS-HAN, consists of the proposed 3D CNN network, along with a hierarchical attention network, capable of latent space-based recognition modeling. In a later work [42], the authors propose the I3D [43] architecture in SLR. The model is deployed on an isolated SLR setup, with pretrained weights on action recognition datasets. The signer’s body bounding box is served as input. For the evaluated dataset it yielded state-ofthe-art results. In [31], the authors adopted and enhanced the original I3D model with a gated Recurrent Neural Network (RNN). The whole architecture is a 3D CNN-RNN-CTC architecture trained iteratively with a dynamic pseudo-label decoding method. Their aim is to accommodate features from different time scales. In another work [44], the authors introduce the 3D ResNet architecture to extract features. Furthermore, they substituted LSTM with stacked dilated temporal convolutions and CTC for sequence alignment and decoding. With this approach, they manage to have very large receptive fields while reducing time and space complexity, compared to LSTM. Finally, in [45] Pu et al. propose a framework that also consists of a 3D ResNet backbone. The features are provided in both an attentional encoder-decoder network [46] and a CTC decoder for sequence learning. Both decoded outputs 4 are jointly trained while the soft-DTW [47] is utilized to align them. III. P UBLICLY AVAILABLE DATASETS Existing SLR datasets can be characterized as isolated or continuous, taking into account whether annotation are provided at the gloss (fine-grained) or the sentence (coarsegrained) levels. Additionally, they can be divided into Signer Dependent (SD) and Signer Independent (SI) ones, based on the defined evaluation scheme. In particular, in the SI datasets a signer cannot be present in both the training and the test set. 
In Table I, the following most widely known public SLR datasets, along with their main characteristics, are illustrated: • The Signum SI and the Signum subset [48] include laboratory capturings of the German Sign Language. They are both created under strict laboratory settings with the most frequent everyday glosses. • The Chinese Sign Language (CSL) SD, the CSL SI and the CSL isol. datasets [41] are also recorded in a predefined laboratory environment with Chinese SL words that are widely used in daily conversations. • The Phoenix SD [49], the Phoenix SI [49] and the Phoenix-T [50] datasets comprise videos of German SL, originating from the weather forecast domain. • The American Sign Language (ASL) [42] dataset contains videos of various real-life settings. The collected videos exhibit large variations in background, image quality, lighting and positioning of the signers. A. The GSL dataset 1) Dataset description: In order to boost scientific research in the deep learning era, large-scale public datasets need to be created. In this respect and with a particular focus on the case of the GSL recognition, a corresponding public dataset has been created in this work. In particular, a set of seven native GSL signers are involved in the capturings. The considered application includes cases of deaf people interacting with different public services, namely police departments, hospitals and citizen service centers. For each application case, 5 individual and commonly met scenarios (of increasing duration and vocabulary complexity) are defined. The average length of each scenario is twenty sentences with 4.23 glosses per sentence on average. Subsequently, each signer was asked to perform the pre-defined dialogues in GSL five consecutive times. In all cases, the simulation considers a deaf person communicating with a single public service employee, while all interactions are performed in GSL (the involved signer performed the sequence of glosses of both agents in the discussion). Overall, the resulting dataset includes 10,290 sentence instances, 40,785 gloss instances, 310 unique glosses (vocabulary size) and 331 unique sentences. For the definition of the dialogues in the identified application cases, the particularities of the GSL and the corresponding annotation guidelines, GSL linguistic experts are involved. The video annotation process is performed both at gloss and sentence level. The provided annotated segments enable benchmarking in SLR (using the glosses) and SL translation (using the standard modern Greek). Fig. 2. Example keyframes of the introduced GSL dataset The recordings are conducted using an Intel RealSense D435 RGB+D camera at a rate of 30 fps. Both the RGB and the depth streams are acquired in the same spatial resolution of 848x480 pixels. To increase variability in videos, the camera position and orientation are slightly altered within subsequent recordings. Exemplary cropped frames of the captured videos are depicted in Fig.2. 2) GSL evaluation sets: Regarding the evaluation settings, the dataset includes the following setups: a) the continuous GSL SD, b) the continuous GSL SI, and c) the GSL isol. In GSL SD, roughly 80% of the videos are used for training, corresponding to 8,189 instances. The rest 1,063 (10%) are kept for validation and 1,043 (10%) for testing. The selected test gloss sequences are not used in the training set, while all the individual glosses exist in the training set. 
In GSL SI, the recordings of one signer are left out for validation and testing (588 and 881 instances, respectively). The remaining 8,821 instances are utilized for training. A similar strategy is followed in GSL isol., wherein the validation set consists of 2,231 gloss instances, the test set of 3,500, while the remaining 34,995 are used for training.

TABLE I: LARGE-SCALE PUBLICLY AVAILABLE SLR DATASETS

Dataset | Language | Signers | Classes | Video instances | Duration (hours) | Resolution | fps | Type | Modalities | Year
Signum SI [48] | German | 25 | 780 | 19,500 | 55.3 | 776x578 | 30 | continuous | RGB | 2007
Signum isol. [48] | German | 25 | 455 | 11,375 | 8.43 | 776x578 | 30 | both | RGB | 2007
Signum subset [48] | German | 1 | 780 | 2,340 | 4.92 | 776x578 | 30 | both | RGB | 2007
Phoenix SD [49] | German | 9 | 1,231 | 6,841 | 10.71 | 210x260 | 25 | continuous | RGB | 2014
Phoenix SI [49] | German | 9 | 1,117 | 4,667 | 7.28 | 210x260 | 25 | continuous | RGB | 2014
CSL SD [41] | Chinese | 50 | 178 | 25,000 | 100+ | 1920x1080 | 30 | continuous | RGB+D | 2016
CSL SI [41] | Chinese | 50 | 178 | 25,000 | 100+ | 1920x1080 | 30 | continuous | RGB+D | 2016
CSL isol. [38] | Chinese | 50 | 500 | 125,000 | 67.75 | 1920x1080 | 30 | isolated | RGB+D | 2016
Phoenix-T [50] | German | 9 | 1,231 | 8,257 | 10.53 | 210x260 | 25 | continuous | RGB | 2018
ASL 100 [42] | English | 189 | 100 | 5,736 | 5.55 | varying | varying | isolated | RGB | 2019
ASL 1000 [42] | English | 222 | 1,000 | 25,513 | 24.65 | varying | varying | isolated | RGB | 2019
GSL isol. (new) | Greek | 7 | 310 | 40,785 | 6.44 | 848x480 | 30 | isolated | RGB+D | 2019
GSL SD (new) | Greek | 7 | 310 | 10,290 | 9.59 | 848x480 | 30 | continuous | RGB+D | 2019
GSL SI (new) | Greek | 7 | 310 | 10,290 | 9.59 | 848x480 | 30 | continuous | RGB+D | 2019

3) Linguistic analysis and annotation process: As already mentioned, the provided annotations are both at the individual gloss and at the sentence level. Native signers annotated and labelled individual glosses, as well as whole sentences. Sign linguists and SL professional interpreters consistently validated the annotation of the individual glosses. A great effort was devoted to determining individual glosses following the "one form one meaning" principle (i.e. a distinctive set of signs), taking into consideration the linguistic structure of the GSL and not its translation to spoken standard modern Greek. We addressed and provided a solution for the following issues: a) compound words, b) synonyms, c) regional or stylistic variants of the same meaning, and d) agreement verbs. In particular, compound words are composed of smaller meaningful units with distinctive form and meaning, i.e. the equivalent of morphemes of the spoken languages, which can also be simple individual words, for example: SON = MAN+BIRTH. Following the "one form one meaning" principle, we split a compound word into its indivisible parts. Based on the above design, a computer vision system does not confuse compound words with their constituents. Synonyms (e.g. two different signs with a similar meaning) were distinguished from each other with the use of consecutively numbered lemmas. For instance, the two different signs that have the meaning "DOWN" were annotated as DOWN(1) and DOWN(2). The same strategy was adopted for the annotation of regional and stylistic variants of the same meaning. For example, the two different regional variants of "DOCTOR" were annotated as DOCTOR(1) and DOCTOR(2). Another interesting case is the agreement verbs of sign languages, which contain the subject and/or object within the sign of the agreement verb. Agreement verbs indicate subjects and/or objects by changing the direction of the movement and/or the orientation of the hand.
Therefore, it was decided that they cannot be distinguished as autonomous signs, and they are annotated as a single gloss. A representative example is: "I DISCUSS WITH YOU" versus "YOU DISCUSS WITH HIM". For the described annotation guideline, the internationally accepted notation for sign verbs is followed [51], [1].

IV. SLR APPROACHES

In order to gain a better insight into the behavior of the various automatic SLR approaches, the best performing and the most widely adopted methods of the literature are discussed in this section. The selected approaches cover all different categories of methods that have been proposed so far. The quantitative comparative evaluation of the latter, using multiple publicly available datasets, will facilitate providing valuable feedback regarding the pros and cons of each automatic SLR methodology.

A. SubUNets

Camgoz et al. [26] introduce a DNN-based approach for solving the simultaneous alignment and recognition problems, typically referred to as "sequence-to-sequence" learning. In particular, the overall problem is decomposed into a series of specialized systems, termed SubUNets. The overall goal is to model the spatio-temporal relationships among these SubUNets to solve the task at hand. More specifically, SubUNets allow to inject domain-specific expert knowledge into the system regarding suitable intermediate representations. Additionally, they also allow to implicitly perform transfer learning between different interrelated tasks.

B. GoogLeNet + TConvs

In contrast to other 2D CNN-based methods that employ HMMs, Cui et al. [25] propose a model that includes an extra temporal module (TConvs), after the feature extractor (GoogLeNet). The TConvs module consists of two 1D CNN layers and two max pooling layers. It is designed to capture the fine-grained dependencies that exist inside a gloss (intra-gloss dependencies) between consecutive frames, into compact per-window feature vectors. Finally, bidirectional RNNs are applied in order to capture the long-term temporal dependencies of the entire sentence. The total architecture is trained iteratively, in order to exploit the expressive capability of DNN models with limited data.

C. I3D

The Inflated 3D ConvNet (I3D) [43] was originally developed for the task of human action recognition; however, its application has demonstrated outstanding performance on isolated SLR [42]. In particular, the I3D architecture is an extended version of GoogLeNet, which contains several 3D convolutional layers followed by 3D max-pooling layers. The key insight of this architecture is the endowing of the 2D sub-modules (filters and pooling kernels) with an additional temporal dimension. This methodology makes it feasible to learn spatio-temporal features from videos, while it leverages efficient known architecture designs and parameters.

D. 3D ResNet+LSTM

Pu et al. [45] propose a framework comprising a 3D CNN for feature extraction, an RNN for sequence learning and two different decoding strategies, one performed with CTC and the other with an attentional decoder RNN. The glosses predicted by the attentional decoder are utilised to draw a warping path using a soft-DTW [47] alignment constraint. The warping paths display the alignments between glosses and video segments. The proposed pseudo-alignments are then employed for iterative optimization.

V. SEQUENCE LEARNING TRAINING CRITERIA FOR CSLR

A summary of the notation used in this paper is provided in this section, so as to enhance its readability and understanding.
Let us denote by U the label (i.e. gloss) vocabulary and by blank the blank token, which represents the silence or transition between two consecutive labels. The extended vocabulary can be defined as V = U ∪ {blank}, where L is the total number of labels. From now on, given a sequence f of length F, we denote by f_{1:p} its first p elements and by f_{p:F} its elements from position p up to F. An input frame sequence of length N can be defined as X = (x_1, ..., x_N). The corresponding target sequence of labels (i.e. glosses) of length K is defined as y = (y_1, ..., y_K). In addition, let G = (g^1, ..., g^T) ∈ R^{L×T} be the predicted output sequence of a softmax classifier, where T ≤ N; for v ∈ V, g^t_v can be interpreted as the probability of observing label v at time-step t. Hence, G defines a distribution over the set V^T:

p(\pi|X) = \prod_{t=1}^{T} g^{t}_{\pi_t}, \quad \forall \pi \in V^T. \qquad (1)

The elements of V^T are referred to as paths and denoted by π. In order to map y to π, one can define a mapping function B : V^T → U^{≤T}, with U^{≤T} being the set of possible labellings. B removes the repeated labels and the blanks from a given path. Similarly, one can denote the inverse operation of B as B^{-1}, which maps target labels to all the valid paths. From this perspective, the conditional probability of y is computed as:

p(y|X) = \sum_{\pi \in B^{-1}(y)} p(\pi|X). \qquad (2)

A. Traditional CTC criterion

Connectionist Temporal Classification (CTC) [28] is widely utilized for labelling unsegmented sequences. The time complexity of evaluating (2) directly is O(L^N K), which means that the number of valid paths grows exponentially with N. To efficiently calculate p(y|X), a recursive formula is derived, which exploits the existence of common sub-paths. Furthermore, to allow for blanks in the paths, a modified gloss sequence y' of length K' = 2K + 1 is used, obtained by adding blanks before and after each gloss in y. The forward probability α_t(s) of y'_{1:s} at time-step t and the backward probability β_t(s) of y'_{s:K'} at time-step t are defined as:

\alpha_t(s) \triangleq \sum_{B(\pi_{1:t}) = y'_{1:s}} \prod_{t'=1}^{t} g^{t'}_{\pi_{t'}}, \qquad (3)

\beta_t(s) \triangleq \sum_{B(\pi_{t:T}) = y'_{s:K'}} \prod_{t'=t}^{T} g^{t'}_{\pi_{t'}}. \qquad (4)

Therefore, to calculate p(y|X) for any t, we sum over all s in y' as:

p(y|X) = \sum_{s=1}^{K'} \frac{\alpha_t(s)\,\beta_t(s)}{g^{t}_{y'_s}}. \qquad (5)

Finally, the CTC criterion is derived as:

L_{ctc} = -\log p(y|X). \qquad (6)

The error signal of L_{ctc} with respect to g^t_v is:

\frac{\partial L_{ctc}}{\partial g^{t}_{v}} = -\frac{1}{p(y|X)\, g^{t}_{v}} \sum_{\{\pi \in B^{-1}(y),\, \pi_t = v\}} p(\pi|X). \qquad (7)

From (7) it can be observed that the error signal is proportional to the fraction of all valid paths that pass through label v at time-step t. As soon as a path dominates the rest, the error signal enforces all the probabilities to concentrate on that single path. Moreover, (1) and (7) indicate that the probabilities of a gloss occurring at subsequent time-steps are independent, which is known as the conditional independence assumption. For these reasons, two learning criteria are introduced in CSLR: a) one that addresses the ambiguous segmentation boundaries of adjacent glosses, and b) one that is able to model the intra-gloss dependencies, by incorporating a learnable language model during training (as opposed to other approaches that use it only during the CTC decoding stage).
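To make the forward recursion behind (3)-(6) concrete, the following NumPy sketch computes the forward probabilities over the blank-extended sequence y' and the resulting L_ctc = -log p(y|X) for a toy posterior matrix. It is a didactic sketch of the standard CTC forward pass (assuming blank index 0), not an optimized or numerically stabilized implementation.

```python
import numpy as np

def ctc_forward_loss(probs, targets, blank=0):
    """probs: (T, L) per-frame label posteriors; targets: gloss indices (no blanks).
    Returns L_ctc = -log p(y|X) via the CTC forward recursion."""
    T = probs.shape[0]
    # Extended target y': blanks before, between and after every gloss (K' = 2K + 1).
    ext = [blank]
    for g in targets:
        ext += [g, blank]
    S = len(ext)

    alpha = np.zeros((T, S))
    alpha[0, 0] = probs[0, ext[0]]          # start with a blank ...
    alpha[0, 1] = probs[0, ext[1]]          # ... or directly with the first gloss
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]
            if s > 0:
                a += alpha[t - 1, s - 1]
            # Skipping the previous blank is allowed only between two different glosses.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * probs[t, ext[s]]

    p_y_given_x = alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
    return -np.log(p_y_given_x)

# Toy example: 4 frames, vocabulary {blank, A, B}, target sentence "A B".
probs = np.array([[0.1, 0.8, 0.1],
                  [0.2, 0.6, 0.2],
                  [0.2, 0.2, 0.6],
                  [0.7, 0.1, 0.2]])
print(ctc_forward_loss(probs, targets=[1, 2]))
```

In practice the recursion is carried out in log-space, and the backward probabilities β_t(s) are computed analogously to obtain the error signal in (7).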
B. Entropy Regularization CTC

The CTC criterion can be extended [30] based on maximum conditional entropy [52], by adding an entropy regularization term H:

H(p(\pi|y,X)) = -\sum_{\pi \in B^{-1}(y)} p(\pi|X,y)\log p(\pi|X,y) = -\frac{Q(y)}{p(y|X)} + \log p(y|X), \qquad (8)

where Q(y) = \sum_{\pi \in B^{-1}(y)} p(\pi|X)\log p(\pi|X). H aims to prevent the entropy of the non-dominant paths from decreasing rapidly. Consequently, the entropy regularization CTC criterion (EnCTC) is formulated as:

L_{enctc} = L_{ctc} - \phi H(p(\pi|y,X)), \qquad (9)

where φ is a hyper-parameter. The introduction of the entropy term H prevents the error signal from gathering into the dominant path and instead encourages the exploration of nearby ones. By increasing the probabilities of the alternative paths, the peaky distribution problem is alleviated.

C. Stimulated CTC

Stimulated learning [53], [54], [55] augments the training process by regularizing the activations h_t of the sequence learning RNN. Stimulated CTC (StimCTC) [56] relaxes the conditional independence assumption of traditional CTC. To generate the appropriate stimuli, an auxiliary uni-directional Language Model RNN (RNN-LM) is utilized. The RNN-LM encoded hidden states h_k encapsulate the sentence's history up to gloss k. h_t is stimulated by utilizing the non-blank forward and backward probabilities α'_t, β'_t ∈ R^K. The weighting factor γ_t can then be calculated as:

\gamma_t = \frac{\beta'_t \odot \alpha'_t}{\beta'_t \cdot \alpha'_t}. \qquad (10)

Intuitively, γ_t can be seen as the probability of each gloss in the target sequence y being mapped to time-step t. The linguistic structure of the SL is then incorporated as:

L_{stimuli} = \frac{1}{K \cdot T} \sum_{k=1}^{K} \sum_{t=1}^{T} \gamma_t(k)\, \lVert h_t - h_k \rVert^2. \qquad (11)

Thereby, h_t is enforced to comply with h_k. The RNN-LM model is trained using the cross-entropy criterion, denoted as L_{lm}. Finally, the StimCTC criterion is defined as:

L_{stim} = L_{ctc} + \lambda L_{lm} + \theta L_{stimuli}, \qquad (12)

where λ and θ are hyper-parameters. The described criteria can be combined, resulting in the Entropy Stimulated CTC (EnStimCTC) criterion:

L_{enstim} = L_{ctc} - \phi H(p(\pi|y,X)) + \lambda L_{lm} + \theta L_{stimuli}. \qquad (13)

VI. EXPERIMENTAL EVALUATION

In order to provide a fair evaluation, we re-implemented the selected approaches and evaluated them on multiple large-scale datasets, in both isolated and continuous SLR. The re-implementations are based on the original authors' guidelines and any modifications are explicitly referenced. For the continuous setup, the criteria CTC, EnCTC and EnStimCTC are evaluated on all architectures. For a fair comparison between different models, we opt to use the full frame modality, since it is the common modality among the selected datasets and it is more suitable for real-life applications. We omit the iterative optimization process; instead, we pretrain each model on the respective dataset's isolated version, if present. Otherwise, pseudo-alignments extracted from other models (i.e. for Phoenix) are used for isolated pretraining (implementations and experimental results are publicly available to enforce reproducibility in SLR^3).

A. Datasets and Evaluation metrics

The following datasets have been chosen for the experimental evaluation: ASL 100 and ASL 1000, CSL isol. and GSL isol. for the isolated setup, and Phoenix SD, Phoenix SI, CSL SD, CSL SI, GSL SD and GSL SI for the CSLR setup. To evaluate recognition performance on the continuous datasets, the word error rate (WER) metric has been adopted, which quantifies the similarity between the predicted glosses and the ground truth gloss sequence. WER measures the least number of operations needed to transform the aligned predicted sequence into the ground truth and is defined as:

WER = \frac{S + D + I}{N}, \qquad (14)

where S is the total number of substitutions, D is the total number of deletions, I is the total number of insertions and N is the total number of glosses in the ground truth.
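As a concrete illustration of (14), the short Python function below computes WER as the edit (Levenshtein) distance between a hypothesis and a reference gloss sequence, normalized by the reference length. The gloss strings in the usage example are made up for demonstration only.

```python
def wer(reference, hypothesis):
    """Word error rate: (S + D + I) / N, computed via edit distance over glosses."""
    n, m = len(reference), len(hypothesis)
    # d[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # deletions
    for j in range(m + 1):
        d[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[n][m] / max(n, 1)

# Example with hypothetical gloss sequences: one deletion out of three glosses.
print(wer(["HELLO", "CAN", "HELP"], ["HELLO", "HELP"]))  # -> 0.333...
```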
^3 https://zenodo.org/record/3941811#.XxrZXZZRU5k

TABLE II: GLOSS TEST ACCURACY IN PERCENTAGE - ISOLATED SLR (missing entries correspond to models that did not converge on ASL 1000)

Method | ASL 1000 | ASL 100 | CSL isol. | GSL isol.
GoogLeNet+TConvs [25] | - | 44.92 | 79.31 | 86.03
3D-ResNet [45] | - | 50.48 | 89.91 | 86.23
I3D [57] | 40.99 | 72.07 | 95.68 | 89.74

B. Data augmentation and implementation details

The same data preprocessing methods are used for all datasets. Each frame is normalised by the mean and standard deviation of the ImageNet dataset. To increase the variability of the training videos, the following data augmentation techniques are adopted. Frames are resized to 256x256 and cropped at a random position to 224x224. Random frame sampling is applied for up to 80% of the video length. Moreover, random jittering of the brightness, contrast, saturation and hue values of each frame is applied. The models are trained with the Adam optimizer with an initial learning rate λ0 = 10^-4, which is reduced to λi = 10^-5 when the validation loss starts to plateau. For the isolated SLR experiments, the batch size is set to 2. Videos are rescaled to a fixed length that is equal to the average gloss length of each dataset. For the CSLR experiments, videos are downsampled to a maximum length of 250 frames, if necessary. The batch size is set to 1, due to GPU memory constraints. The experiments are conducted on an NVIDIA GeForce GTX-1080 Ti GPU with 12 GB of memory and 32 GB of RAM. All models, depending on the dataset, require 10 to 25 epochs to converge.

TABLE III: FINE-TUNING IN CSLR DATASETS. RESULTS ARE REPORTED IN WER

Method | Phoenix SD (Val. / Test) | Phoenix SI (Val. / Test) | CSL SI (Test) | CSL SD (Test) | GSL SI (Test) | GSL SD (Test)
I3D (Kinetics) | 53.81 / 51.27 | 65.53 / 62.38 | 23.19 | 72.39 | 34.52 | 75.42
I3D (Kinetics + ASL 1000) | 40.89 / 40.49 | 59.60 / 58.36 | 16.73 | 64.72 | 27.09 | 71.05

The referenced models, depending on the dataset, have been modified as follows. In SubUNets, AlexNet [58] is used as the feature extractor instead of CaffeNet [59], as they share a similar architecture. Additionally, for the CSL and GSL datasets, we reduce the bidirectional LSTM hidden size by half, due to computational space complexity. In the isolated setup, the LSTM layers of SubUNets are trained along with the feature extractor. In order to achieve the maximum performance of GoogLeNet+TConvs, a manual customization of the TConvs 1D CNN kernel and pooling sizes is necessary. The intuition behind it is that the receptive field should approximately cover the average gloss duration. Each 1D CNN layer includes 1024 filters. In CSL, the 1D CNNs are set with kernel size 7 and stride 1, and the max-pooling layers with kernel sizes and strides equal to 3, to cover the average gloss duration of 58 frames. For the GSL dataset, the TConvs are tuned with kernel sizes equal to 5 and pooling sizes equal to 3. In order to deploy 3D-ResNet and I3D in a CSLR setup, a sliding window technique is adopted on the input sequence. Window size and stride are selected to cover the average gloss duration. Then, a 2-layer bidirectional LSTM is added to model the long-term temporal correlations in the feature sequence. In CSL, the window size is set to 50 and the stride to 36, whereas in GSL the window size is set to 25, with a stride equal to 12. I3D and 3D-ResNet are initialized with weights pretrained on Kinetics. Also, for the 3D-ResNet method, we omit the attentional decoder of the original paper, keeping the 3D-ResNet+LSTM model.
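As a rough sketch of the sliding-window setup described above, the snippet below splits a frame sequence into overlapping temporal windows (e.g. size 25 and stride 12, as used here for GSL) so that each window can be encoded by a 3D CNN before the BiLSTM. The tensor layout and the padding strategy are assumptions of this sketch, not details taken from the paper.

```python
import torch

def make_windows(frames, window=25, stride=12):
    """frames: (T, C, H, W) -> (num_windows, C, window, H, W) for a 3D CNN."""
    t = frames.shape[0]
    if t < window:                              # pad short videos by repeating the last frame
        pad = frames[-1:].repeat(window - t, 1, 1, 1)
        frames = torch.cat([frames, pad], dim=0)
    # unfold over the temporal dimension, then move channels in front of time
    wins = frames.unfold(0, window, stride)     # (num_windows, C, H, W, window)
    return wins.permute(0, 1, 4, 2, 3).contiguous()

video = torch.randn(100, 3, 224, 224)           # 100 RGB frames
windows = make_windows(video)                   # -> (7, 3, 25, 224, 224)
# Each window yields one feature vector from the 3D CNN; the resulting sequence of
# roughly T/stride vectors is then passed to the 2-layer BiLSTM for CSLR.
```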
In initial experiments, it was observed that when training with StimCTC, all baseline models were unable to converge. The main reason is that the networks produce unstable output probability distributions in the early stage of training. On the contrary, introducing L_stim in the late training stage consistently improved the networks' performance. The overall best results were obtained with EnStimCTC. The reason is that, while the entropy term H introduces useful variability in the early optimisation process, it hinders convergence in the late training stage. By removing H and introducing L_stim, the possible alignments generated by EnCTC are filtered. Regarding the hyper-parameters of the selected criteria, tuning was necessary. For EnCTC, the hyper-parameter φ is varied in the range of 0.1 to 0.2. For EnStimCTC, λ is set to 1. Concerning θ, evaluations for θ = 0.1, 0.2, 0.5 and 1 are performed. The best results were obtained with θ = 0.5 and φ = 0.1.

C. Experimental results

In Table II, quantitative results are reported for the isolated setup. Classification accuracy is reported in percentage. It can be seen that the 3D baseline methods achieve a higher gloss recognition rate than the 2D ones. I3D clearly outperforms the other architectures in this setup, by a minimum margin of 2.2% up to a maximum of 21.6%. I3D and 3D-ResNet were pretrained on Kinetics, which explains their superiority in performance. The 3D CNN models achieve satisfactory results in datasets created under laboratory conditions, yet in challenging scenarios I3D clearly outperforms 3D-ResNet. Specifically in ASL 1000, where glosses are not executed in a controlled environment, only I3D is able to converge. SubUNets performed poorly or did not converge at all and its results are deliberately excluded. SubUNets' inability to converge may be due to its large number of parameters (roughly 125M).

In Table III, I3D+LSTM is fine-tuned on the CSLR datasets with CTC in two configurations: a) using the pretrained weights from Kinetics, and b) further pretraining on ASL 1000. Results are improved by 6.79% on average for the second configuration. This was expected due to the task relevance.

Table IV presents an evaluation of the impact of transfer learning versus training with initial pseudo-alignments, as a pretraining scheme. The following four cases are considered:
• directly train a shallow model (i.e. SubUNets) without pretraining, to obtain initial pseudo-alignments,
• assume uniform pseudo-alignments over the input video for each gloss in a sentence,
• transfer learning from a large-scale isolated dataset (ASL), and
• proximal transfer learning from the respective dataset's isolated subset.

TABLE IV: COMPARISON OF PRETRAINING SCHEMES. RESULTS OF THE I3D ARCHITECTURE, AS MEASURED IN TEST WER, USING MULTIPLE FULLY-SUPERVISED APPROACHES BEFORE TRAINING IN CSLR

Method | CSL SI (Test) | GSL SI (Val. / Test)
SubUNets alignments | 5.94 | 18.43 / 20.00
Uniform alignments | 16.98 | 27.30 / 29.08
Transfer learning from ASL | 16.73 | 25.89 / 27.09
Proximal transfer learning | 6.45 | 8.78 / 8.62

Experiments are conducted on the CSL SI and the GSL SI evaluation sets, since they have annotated isolated subsets for proximal transfer learning. For this particular experiment, I3D is used, since it is the best performing model in the isolated setup (Table II). SubUNets are chosen to infer the initial pseudo-alignments, because pretraining is not required by design. Training was performed with the traditional CTC criterion. In CSL SI, the best strategy, by a relative margin of 7.9%, seems to be pretraining on pseudo-alignments.
On the contrary, in GSL SI the best results are acquired with proximal transfer learning, with a relative gain of 56.9% compared to pseudo-alignments. Producing pseudo-alignments requires more training time, while a proximal isolated subset to train on is not always available.

In Tables V and VI, quantitative results regarding CSLR are reported. The selected architectures are evaluated on the CSLR datasets, in both the SD and the SI subsets, using the proposed criteria. Training with EnCTC needs more epochs to converge, due to the fact that a greater number of possible paths is explored, yet it converges to a better local optimum. Overall, EnCTC shows an average improvement of 1.59% in WER (9.73% relative). A further reduction of 1.60% in WER (5.69% relative) is observed by adding StimCTC. It can be seen that the proposed EnStimCTC criterion improves recognition in all datasets by an overall WER gain of 3.26% (14.56% relative). In the reported average gains, SubUNets is excluded due to performance deterioration.

TABLE V: REPORTED RESULTS IN CONTINUOUS SD SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED SUBSET.

Phoenix SD (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 30.51/30.62 | 32.02/31.61 | 29.51/29.22
GoogLeNet+TConvs [25] | 32.18/31.37 | 31.66/31.74 | 28.87/29.11
3D-ResNet+LSTM [45] | 38.81/37.79 | 38.80/37.50 | 36.74/35.51
I3D+LSTM [57] | 32.88/31.92 | 32.60/32.70 | 31.16/31.48

CSL SD (Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 78.31 | 81.33 | 80.13
GoogLeNet+TConvs [25] | 65.83 | 64.04 | 64.43
3D-ResNet+LSTM [45] | 72.44 | 70.20 | 68.35
I3D+LSTM [57] | 64.73 | 64.06 | 60.68

GSL SD (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 52.79/54.31 | 58.11/60.09 | 55.03/57.49
GoogLeNet+TConvs [25] | 43.54/48.46 | 42.69/44.11 | 38.92/42.33
3D-ResNet+LSTM [45] | 61.94/68.54 | 63.47/66.54 | 57.88/61.64
I3D+LSTM [57] | 51.74/53.48 | 51.37/53.48 | 49.89/49.99

TABLE VI: REPORTED RESULTS IN CONTINUOUS SI SLR DATASETS, AS MEASURED IN WER. PRETRAINING IS PERFORMED ON THE RESPECTIVE ISOLATED SUBSET.

Phoenix SI (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 56.56/55.06 | 55.59/53.42 | 55.01/54.11
GoogLeNet+TConvs [25] | 46.70/46.67 | 47.14/46.70 | 46.42/46.41
3D-ResNet+LSTM [45] | 55.88/53.77 | 54.69/54.57 | 52.88/50.98
I3D+LSTM [57] | 55.24/54.43 | 54.42/53.92 | 53.70/52.71

CSL SI (Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 3.29 | 5.13 | 4.14
GoogLeNet+TConvs [25] | 4.06 | 2.46 | 2.41
3D-ResNet+LSTM [45] | 19.09 | 13.36 | 14.31
I3D+LSTM [57] | 6.49 | 4.26 | 2.72

GSL SI (Val. / Test):
Method | CTC | EnCTC | EnStimCTC
SubUNets [26] | 24.64/24.03 | 21.73/20.58 | 21.65/20.62
GoogLeNet+TConvs [25] | 8.08/7.95 | 7.63/6.91 | 6.99/6.75
3D-ResNet+LSTM [45] | 33.61/33.07 | 27.80/26.75 | 25.58/24.01
I3D+LSTM [57] | 8.78/8.62 | 7.69/6.55 | 6.63/6.10

In the Phoenix SD subset, all models benefit from training with the EnStimCTC loss, with 1.59% less WER on average. Fig. 3 depicts the models' WER on the Phoenix SD validation set. SubUNets has a WER of 29.51% on the validation set and 29.22% on the test set, which is an average reduction of 12.59% WER compared to the original paper's results (42.1% vs 30.62%) [26].

Fig. 3. Validation WER of the implemented architectures on the Phoenix SD dataset, trained with the EnStimCTC loss.

Furthermore, the 2D-based CNNs produce similar results with negligible differences in performance. Similarly, in Phoenix SI, GoogLeNet+TConvs trained with EnStimCTC is the best performing setup, with on average 10.9% relatively less WER compared to the others. Finally, all architectures have worse recognition performance in Phoenix SI compared to their Phoenix SD counterparts, due to a reduction of more than 20% in the training data.
In the CSL SI dataset, all methods except for 3D-ResNet have comparable recognition performance. They achieve high recognition accuracy due to the large size of the dataset and the small size of the vocabulary. I3D+LSTM seems to benefit the most when trained with EnStimCTC, with a 3.77% absolute WER reduction. GoogLeNet+TConvs has the best performance with 2.41% WER, which is 5.36% less WER on average than the other models and 1.65% less compared to CTC training (Fig. 5). This method outperforms the current state-of-the-art method on CSL SI [2] by an absolute reduction of 1.39% WER (3.80 vs 2.41) and relatively by 36.58%. In CSL SD, the WER results are considerably higher than the CSL SI results, with an average WER gap of 70.00%. The best performing model is I3D+LSTM, with 60.68% WER. In GSL SI, the I3D+LSTM and GoogLeNet+TConvs recognition results are close, with 6.63%/6.10% and 6.99%/6.75% WER on the development and test set, respectively. In GSL SD, GoogLeNet+TConvs yields the lowest WER (42.33%), yet this is still higher than its GSL SI result by 35.58% absolute WER. Regarding CSL and GSL, both datasets have a relatively low number of unique gloss combinations and their SD test sets contain unseen gloss sequences. For these reasons, all models tend to predict combinations of glosses similar to the ones seen during training. This explains the superior recognition rates in the SI subsets of CSL and GSL.

Fig. 4. Visual comparison of ground truth alignments with the predictions of the proposed training criteria (CTC, EnCTC and EnStimCTC decodings against the ground truth gloss sequence). GoogLeNet+TConvs is used for evaluation on the GSL SD dataset.

Fig. 5. Comparison of validation WER of the CTC and EnStimCTC criteria with GoogLeNet+TConvs on the CSL SI and Phoenix SD datasets.

VII. DISCUSSION

A. Performance comparison of implemented architectures

From the group of experiments in isolated SLR in Table II, it was experimentally shown that 3D methods are more suitable for isolated gloss classification than the 2D models. This is justified by the fact that 2D CNNs do not model dependencies between neighbouring frames, whereas motion features play a crucial role in classifying a gloss. I3D is considered the most capable of directly modeling the intra-gloss dependencies, leading to superior generalization capabilities. The advantage of the 3D inception layer lies in its ability to project multi-channel spatio-temporal features into dense, lower-dimensional embeddings. This leads to the accumulation of higher semantic representations (more abstract output features). In the opposite direction, deploying skip connections maintains previous layers' feature maps that correspond to lower semantic content, which does not assist in isolated SLR.

Modeling intermediate short temporal dependencies was experimentally shown (Tables V, VI) to enhance the CSLR performance. The implemented 3D CNN architectures directly capture spatio-temporal correlations as intermediate representations. The design choice of providing the input video in a sliding window restricts the network's temporal receptive field. Based on a sequential structure, architectures such as GoogLeNet+TConvs achieve the same goal by grouping consecutive spatial features. Such a sequential approach can prove beneficial in many datasets, given that the spatial filters are well-trained. For this reason, such approaches require heavy pretraining of the backbone network. The superiority in performance of the implemented sequential approach is justified by the careful manual tuning of temporal kernels and strides. However, manual design significantly downgrades the advantages of transfer learning. The sliding window technique can be easily adapted based on the particularities of each dataset, making 3D CNNs more scalable and suitable for real-life applications. To summarize, both techniques aim to approximate the average gloss duration. This is interpreted as a guidance to the models, based on the statistics of the SL dataset. On the other hand, utilizing only LSTMs to capture the temporal dependencies (i.e. SubUNets) results in an ineffective modeling of intra-gloss correlations. LSTMs are designed to model the long-term dependencies that correspond to the inter-gloss dependencies. Taking a closer look at the predicted alignments of each approach, it is noticed that 3D architectures do not provide as precise gloss boundaries for true positives as the 2D ones. We strongly believe that this is the reason that 3D models benefit more from the introduced variations of the traditional CTC.

B. Comparison between CTC variations

The reported experimental results exhibit the negative influence of CTC's drawbacks (overconfident paths and the conditional independence assumption) in CSLR. EnCTC's contribution to alleviating the overconfident paths is illustrated in Fig. 4. The ground truth gloss "PROOF" is recognized with the introduction of H, instead of "APPROVAL". The latter has a six times higher occurrence frequency. After a careful examination of the aforementioned signs, one can notice that they are close in terms of hand position and execution speed, which justifies the depicted predictions. Furthermore, it is observed that EnCTC boosts performance mostly in CSL SI and GSL SI, due to their limited diversity and vocabulary. It can be highlighted that EnCTC did not boost SubUNets' performance. The latter generates per-frame predictions (T = N), whereas the rest of the approaches generate grouped predictions (T ≈ N/4). This results in a significantly larger space of possible alignments that is harder to explore with this criterion. From Fig. 4, it can be visually validated that EnStimCTC remedies the conditional independence assumption. For instance, the gloss "CHECK" was only recognised with stimulated training. By bringing closer the predictions that correspond to the same target gloss, the intra-gloss dependencies are effectively modeled. In parallel, the network was also able to correctly classify transitions between glosses as blank. It should also be noted that EnStimCTC does not increase time or space complexity during inference.

C. Evaluation of pretraining schemes

Due to the limited contribution of the CTC gradients to the feature extractor, an effective pretraining is mandatory. As shown in Fig. 3, pretraining significantly affects the starting WER of each model. Without pretraining, all models congregate around the most dominant glosses, which significantly slows down the CSLR training process and limits the learning capacity of the network. Fully supervised pretraining is interpreted as a domain shift towards the distribution of the SL dataset that speeds up the early training stage in CSLR. Regarding the pretraining scheme, in datasets with limited vocabulary and gloss sequences (i.e.
CSL), inferring initial pseudo-alignments proved beneficial, as shown in Table IV. This is explained due to the fact that the data distribution of the isolated subset had different particularities, such as sign execution speed. However, producing initial pseudo-alignments is time consuming. Hence, the small deterioration in performance is an acceptable trade-off between recognition rate and time to train. The proposed GSL dataset contains nearly double the vocabulary and roughly three times the number of unique gloss sentences, with less training instances. More importantly, the isolated subset draws instances from the same distribution as the continuous one. In such cases it can be stated that proximal transfer learning significantly outperforms training with pseudo-alignments (56.90% relative improvement in the GSL dataset). VIII. C ONCLUSIONS AND FUTURE WORK In this paper, an in-depth analysis of the most characteristic DNN-based SLR model architectures was conducted. Through extensive experiments in three publicly available datasets, a comparative evaluation of the most representative SLR architectures was presented. Alongside with this evaluation, a new publicly available large-scale RGB+D dataset was introduced for the Greek SL, suitable for SLR benchmarking. Two CTC variations known from other application fields, EnCTC & StimCTC, were evaluated for CSLR and it was noticed that their combination tackled two important issues, the ambiguous boundaries of adjacent glosses and intra-gloss dependencies. Moreover, a pretraining scheme was provided, in which transfer learning from a proximal isolated dataset can be a good initialization for CSLR training. The main finding of this work was that while 3D CNN-based architectures were more effective in isolated SLR, 2D CNN-based models with an intermediate per gloss representation achieved superior results in the majority of the CSLR datasets. In particular, our implementation of GoogLeNet+TConvs, with the proposed pretraining scheme and EnStimCTC criterion, yielded state-ofthe-art results in CSL SI. Concerning future work, efficient ways for integrating depth information that will guide the feature extraction training phase, can be devised. Moreover, another promising direction is to investigate the incorporation of more sequence learning modules, like attention-based approaches, in order to adequately model inter-gloss dependencies. Future SLR architectures may be enhanced by fusing highly semantic representations that correspond to the manual and non-manual features of SL, similar to humans. Finally, it would be of great importance for the deaf-non deaf communication to bridge the gap between SLR and SL translation. Advancements in this domain will drive research to SL translation as well as SL to SL translation, which have not yet been thoroughly studied. IX. ACKNOWLEDGEMENTS This work was supported by the Greek General Secretariat of Research and Technology under contract Τ1Ε∆Κ-02469 EPIKOINONO. The authors would like to express their gratitude to Vasileios Angelidis, Chrysoula Kyrlou and Georgios Gkintikas from the Greek sign language center4 for their valuable feedback and contribution to the Greek sign language capturings. R EFERENCES [1] W. Sandler and D. Lillo-Martin, Sign language and linguistic universals. Cambridge University Press, 2006. [2] Z. Yang, Z. Shi, X. Shen, and Y.-W. Tai, “Sf-net: Structured feature network for continuous sign language recognition,” arXiv preprint arXiv:1908.01341, 2019. [3] O. Koller, J. Forster, and H. 
Ney, “Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers,” Computer Vision and Image Understanding, vol. 141, pp. 108–125, 2015. [4] R. E. Mitchell, T. A. Young, B. BACHELDA, and M. A. Karchmer, “How many people use asl in the united states? why estimates need updating,” Sign Language Studies, vol. 6, no. 3, pp. 306–335, 2006. [5] D. Bragg, O. Koller, M. Bellard, L. Berke, P. Boudrealt, A. Braffort, N. Caselli, M. Huenerfauth, H. Kacorri, T. Verhoef et al., “Sign language recognition, generation, and translation: An interdisciplinary perspective,” arXiv preprint arXiv:1908.08597, 2019. [6] G. T. Papadopoulos and P. Daras, “Human action recognition using 3d reconstruction data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 8, pp. 1807–1823, 2016. [7] H. Cooper, B. Holt, and R. Bowden, “Sign language recognition,” in Visual Analysis of Humans. Springer, 2011, pp. 539–562. [8] F. Ronchetti, F. Quiroga, C. A. Estrebou, L. C. Lanzarini, and A. Rosete, “Lsa64: an argentinian sign language dataset,” in XXII Congreso Argentino de Ciencias de la Computación (CACIC 2016)., 2016. [9] M. W. Kadous et al., “Machine recognition of auslan signs using powergloves: Towards large-lexicon recognition of sign language,” in Proceedings of the Workshop on the Integration of Gesture in Language and Speech, vol. 165, 1996. [10] C. Wang, Z. Liu, and S.-C. Chan, “Superpixel-based hand gesture recognition with kinect depth camera,” IEEE transactions on multimedia, vol. 17, no. 1, pp. 29–39, 2014. [11] G. D. Evangelidis, G. Singh, and R. Horaud, “Continuous gesture recognition from articulated poses,” in European Conference on Computer Vision. Springer, 2014, pp. 595–607. [12] J. Zhang, W. Zhou, C. Xie, J. Pu, and H. Li, “Chinese sign language recognition with adaptive hmm,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6. [13] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep sign: Enabling robust statistical continuous sign language recognition via hybrid cnnhmms,” International Journal of Computer Vision, vol. 126, no. 12, pp. 1311–1325, 2018. [14] S. B. Wang, A. Quattoni, L.-P. Morency, D. Demirdjian, and T. Darrell, “Hidden conditional random fields for gesture recognition,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), vol. 2. IEEE, 2006, pp. 1521–1527. [15] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634. 4 https://www.keng.gr/ 12 [16] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1933– 1941. [17] P. Molchanov, S. Gupta, K. Kim, and J. Kautz, “Hand gesture recognition with 3d convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2015, pp. 1–7. [18] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Using convolutional 3d neural networks for user-independent continuous gesture recognition,” in 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, pp. 49–54. [19] D. S. Alexiadis, A. Chatzitofis, N. Zioulis, O. Zoidi, G. 
[19] D. S. Alexiadis, A. Chatzitofis, N. Zioulis, O. Zoidi, G. Louizis, D. Zarpalas, and P. Daras, “An integrated platform for live 3d human reconstruction and motion capturing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 4, pp. 798–813, 2016.
[20] D. S. Alexiadis and P. Daras, “Quaternionic signal processing techniques for automatic evaluation of dance performances from mocap data,” IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1391–1406, 2014.
[21] H. Cooper, E.-J. Ong, N. Pugeault, and R. Bowden, “Sign language recognition using sub-units,” Journal of Machine Learning Research, vol. 13, no. Jul, pp. 2205–2231, 2012.
[22] N. Neverova, C. Wolf, G. Taylor, and F. Nebout, “Moddrop: adaptive multi-modal gesture recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1692–1706, 2015.
[23] D. Wu, L. Pigou, P.-J. Kindermans, N. D.-H. Le, L. Shao, J. Dambre, and J.-M. Odobez, “Deep dynamic neural networks for multimodal gesture segmentation and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 8, pp. 1583–1597, 2016.
[24] O. Koller, C. Camgoz, H. Ney, and R. Bowden, “Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[25] R. Cui, H. Liu, and C. Zhang, “A deep neural framework for continuous sign language recognition by iterative training,” IEEE Transactions on Multimedia, 2019.
[26] N. C. Camgoz, S. Hadfield, O. Koller, and R. Bowden, “Subunets: End-to-end hand shape and continuous sign language recognition,” in 2017 IEEE International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 3075–3084.
[27] R. Cui, H. Liu, and C. Zhang, “Recurrent convolutional neural networks for continuous sign language recognition by staged optimization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7361–7369.
[28] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[29] H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 26, no. 1, pp. 43–49, 1978.
[30] H. Liu, S. Jin, and C. Zhang, “Connectionist temporal classification with maximum entropy regularization,” in Advances in Neural Information Processing Systems, 2018, pp. 831–841.
[31] H. Zhou, W. Zhou, and H. Li, “Dynamic pseudo label decoding for continuous sign language recognition,” in 2019 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2019, pp. 1282–1287.
[32] T. K. Moon, “The expectation-maximization algorithm,” IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47–60, 1996.
[33] O. Koller, H. Ney, and R. Bowden, “Deep hand: How to train a cnn on 1 million hand images when your data is continuous and weakly labelled,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3793–3802.
[34] O. Koller, S. Zargaran, H. Ney, and R. Bowden, “Deep sign: Hybrid cnn-hmm for continuous sign language recognition,” in Proceedings of the British Machine Vision Conference 2016, 2016.
[35] O. Koller, S. Zargaran, and H. Ney, “Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent cnn-hmms,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4297–4305.
[36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[38] J. Pu, W. Zhou, and H. Li, “Sign language recognition with multi-modal features,” in Pacific Rim Conference on Multimedia. Springer, 2016, pp. 252–261.
[39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[40] J. Shawe-Taylor and N. Cristianini, “Support vector machines,” An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, pp. 93–112, 2000.
[41] J. Huang, W. Zhou, Q. Zhang, H. Li, and W. Li, “Video-based sign language recognition without temporal segmentation,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[42] H. R. V. Joze and O. Koller, “Ms-asl: A large-scale data set and benchmark for understanding american sign language,” arXiv preprint arXiv:1812.01053, 2018.
[43] J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” CoRR, abs/1705.07750, 2017.
[44] J. Pu, W. Zhou, and H. Li, “Dilated convolutional network with iterative optimization for continuous sign language recognition,” in IJCAI, vol. 3, 2018, p. 7.
[45] J. Pu, W. Zhou, and H. Li, “Iterative alignment network for continuous sign language recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4165–4174.
[46] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[47] M. Cuturi and M. Blondel, “Soft-dtw: a differentiable loss function for time-series,” in Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017, pp. 894–903.
[48] U. Von Agris, M. Knorr, and K.-F. Kraiss, “The significance of facial features for automatic sign language recognition,” in 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition. IEEE, 2008, pp. 1–6.
[49] J. Forster, C. Schmidt, O. Koller, M. Bellgardt, and H. Ney, “Extensions of the sign language recognition and translation corpus rwth-phoenix-weather,” in LREC, 2014, pp. 1911–1916.
[50] N. Cihan Camgoz, S. Hadfield, O. Koller, H. Ney, and R. Bowden, “Neural sign language translation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7784–7793.
[51] A. Baker, B. van den Bogaerde, R. Pfau, and T. Schermer, The linguistics of sign languages: An introduction. John Benjamins Publishing Company, 2016.
[52] E. T. Jaynes, “Information theory and statistical mechanics,” Physical Review, vol. 106, no. 4, p. 620, 1957.
[53] S. Tan, K. C. Sim, and M. Gales, “Improving the interpretability of deep neural networks with stimulated learning,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 617–623.
[54] C. Wu, P. Karanasou, M. J. Gales, and K. C. Sim, “Stimulated deep neural network for speech recognition,” in Interspeech 2016, 2016, pp. 400–404.
[55] C. Wu, M. J. Gales, A. Ragni, P. Karanasou, and K. C. Sim, “Improving interpretability and regularization in deep learning,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 2, pp. 256–265, 2017.
[56] J. Heymann, K. C. Sim, and B. Li, “Improving ctc using stimulated learning for sequence modeling,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5701–5705.
[57] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[59] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 2014, pp. 675–678.