Sociolinguistic factors affecting change in sign language, along with the help a 2-way translator provides in keeping up with that change, are discussed in Section 2. The methodology is discussed in Section 3, and the application of NLP for text to sign conversion in Section 3.1. The results are discussed in Section 4. The conclusion, along with the impact of a 2-way communicator on society, is presented in Section 5.
2. SOCIOLINGUISTIC FACTORS AFFECTING SIGN LANGUAGE
2.1 Political Correctness
Various social factors have influenced modern sign language. One of the main influences is political correctness. A survey of how British Sign Language (BSL) is used by deaf people of various ages across the UK revealed that a significant shift has occurred in the signs used by different generations. Today, signing a slanted eye to refer to Chinese people is widely criticized by the public. Hence, for deaf people aged between 16 and 30, the culturally sensitive way to sign China is to draw the right hand from the signer's heart horizontally across the chest and then down towards the hip, tracing the shape of a Mao coat [11].

It is also considered offensive to mime a hook nose when referring to Jewish people. The sign now used for a Jewish man or woman is a hand resting against the chin and making a short downward movement, in the shape of a beard. Likewise, a finger pointing to an imaginary spot in the middle of the forehead is no longer appropriate as the sign for India; the modern sign for India is a mime of the triangular shape of the subcontinent [11].

There is also some variation according to gender. Some men and women use different signs for 'hello', 'terrible' and 'welcome' (Figure 1).

Figure 1. Variation in the 'welcome' sign for different genders (male – top, female – bottom).

A wide variety of social factors thus influence the language of the deaf, and the rate at which sign language changes is tied to the rapid sociocultural evolution of society. A mobile phone-based two-way translator is ideal for keeping up with these changes. Primarily, a mobile-based two-way translator connected to a cloud database, with functionality to upload gestures, will ensure that the database keeps pace with sociolinguistic change. Moreover, two-way communication from sign to text and vice versa grants a deaf person enhanced independence and self-confidence.
3.2.1 Preprocessing the data
3.2.1.1 Parsing
We used the Stanford Arabic Parser, which segments a sentence and outputs the set of segmented words with their grammatical category [15]. The Stanford Parser is a probabilistic parser that returns the most probable analysis of new sentences based on knowledge acquired from hand-parsed sentences. The parser assumes the Arabic tokenization used in the Penn Arabic Treebank (ATB) and relies on a whitespace tokenizer.
3.2.1.2 Parts of Speech Tagging
We applied the Stanford Arabic Parts of Speech (POS) Tagger, which reads Arabic text and assigns each token a part of speech such as noun, verb, or adjective. To recognize each token, the Stanford Arabic POS Tagger is pre-trained on the Penn Arabic Treebank (ATB).
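For illustration, the segmentation and tagging steps can be reproduced with Stanza, the Stanford NLP Group's Python library. This is a minimal sketch and an assumed stand-in for the Java-based Stanford Arabic Parser and POS Tagger used in this work; the example sentence and pipeline configuration are illustrative only.

```python
# Minimal sketch of Arabic segmentation and POS tagging with Stanza
# (assumed substitute for the Stanford Arabic Parser / POS Tagger).
import stanza

stanza.download("ar")                                  # fetch Arabic models once
nlp = stanza.Pipeline(lang="ar", processors="tokenize,mwt,pos")

doc = nlp("ذهب الولد إلى المدرسة")                     # "The boy went to school"
for sentence in doc.sentences:
    for word in sentence.words:
        # each segmented word together with its grammatical category
        print(word.text, word.upos)
```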
3.2.1.3 Cleaning the Text
Certain rules are applied to the input text. The rules are as follows (a brief sketch of rules I and III follows the list):

I. Delete Special Characters: Some special characters (e.g. &, *, ", %, #) have no significance to the meaning of an Arabic sentence. Hence, such characters are removed.

II. Spell Checking: We also check for spelling errors to ensure that each word entered is correct, which makes the system more error-tolerant and robust.

III. Delete Stop Words: Stop words hold syntactic significance but carry very little meaning and hence are almost unrelated to the subject matter. The Arabic stop words were defined using a common library available online [16]. The input text was then scanned, and all detected stop words were filtered out.

IV. Retain Exception Words: In Arabic Sign Language, the signs for organizations, people, and locations generally do not have a specific translation. To recognize such named entities, we use a Named Entity Recognition (NER) module, which identifies the entities and categorizes them into three classes: name, person, or location. In these cases, the words are translated as separate letters [17].

V. Morphological Analysis: We used the SARF Morphological Analyzer to analyze the text. It extracts the root, pattern, stem, part of speech, prefixes, and suffixes of the text.
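As an illustration of rules I and III, the sketch below strips the meaningless special characters and filters out stop words. It assumes the stop-word list from [16] has been saved locally as arabic_stopwords.txt (one word per line); the file name and the clean() helper are illustrative, not part of the described system.

```python
# Illustrative sketch of cleaning rules I (special characters) and
# III (stop words). The stop-word file name is an assumption.
import re

with open("arabic_stopwords.txt", encoding="utf-8") as f:
    STOP_WORDS = {line.strip() for line in f if line.strip()}

def clean(text):
    # Rule I: delete special characters with no bearing on meaning
    text = re.sub(r'[&*"%#]', "", text)
    # Rule III: drop stop words, keeping the remaining tokens in order
    return [token for token in text.split() if token not in STOP_WORDS]
```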
Figure 2. Preprocessing the data.

3.2.2 Tokenization and Translation
The last step before obtaining the signs' videos is the tokenization of the preprocessed text. We first perform a sliding-window search over the preprocessed text to identify patterns corresponding to compound sentences. If a match for a pattern is found, the pattern is added to the queue of tokens as a single token. The entities recognized by the NER module are broken down into characters and inserted as separate tokens. All remaining words are inserted into the queue in the appropriate order. The videos corresponding to the tokens are retrieved from the sign language database. The resulting video queue, which follows Arabic syntax, is rearranged into Arabic Sign Language syntax based on the grammar preservation rules in Table 1 (see the sketch after the table). The videos are then concatenated to form a single video, and the final video is streamed back to the mobile application.

Table 1. Sample of rules for Arabic Syntax to Arabic Sign Language Syntax conversion

No | Arabic Syntax      | ArSL Syntax
1  | S+V                | S+V
2  | V+S                | S+V
3  | S+P                | S+P
4  | S+V+O              | S+O+V
5  | S+V+O (Adj, Adv)   | S+O+V (Adj, Adv)
6  | S+P+(Adj, Adv)     | S+P+(Adj, Adv)
7  | S+V+Pr             | S+V
8  | V+O                | O+V
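To illustrate how the grammar preservation rules are applied, the sketch below reorders a role-annotated token sequence according to rule 4 of Table 1 (S+V+O becomes S+O+V). The role labels and the reorder() helper are hypothetical; the actual system applies a sliding-window pattern search over all of the rules before retrieving the sign videos.

```python
# Hypothetical sketch of rule 4 from Table 1: Arabic S+V+O -> ArSL S+O+V.
def reorder(tagged):
    """tagged: list of (token, role) pairs, with role in {'S', 'V', 'O', ...}."""
    roles = [role for _, role in tagged]
    if roles == ["S", "V", "O"]:          # Arabic syntax: subject, verb, object
        subject, verb, obj = tagged
        return [subject, obj, verb]       # ArSL syntax: subject, object, verb
    return tagged                         # no matching rule: keep original order

tokens = [("الولد", "S"), ("أكل", "V"), ("التفاحة", "O")]   # "the boy ate the apple"
print([token for token, _ in reorder(tokens)])              # subject, object, verb
```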
3.3 Sign to Text Translation

3.3.1 Preprocessing the Gesture Data
To build the dataset for sign to text translation, we picked out 200 signs and had 4 subjects enact each gesture 10 times. A Python script was written to store and process the live input video frame by frame at a rate of 30 FPS, producing 30 frames for every second of video. As the average duration of each gesture in the amassed gesture video dictionary is 3 seconds, this provided us with 3,600 frames for each gesture. In total, the dataset comprises 720,000 frames. The dataset was split into 70% for training and 30% for testing. Each frame in the training dataset was labeled with its correct gesture class label, whereas each gesture sequence in the testing data was annotated as a whole.
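A minimal sketch of the frame-capture step is shown below using OpenCV. The camera index, file naming, and command-line gesture label are assumptions; only the 30 FPS rate and the roughly 3-second gesture duration come from the description above.

```python
# Sketch of capturing one gesture recording frame by frame with OpenCV.
import sys
import cv2

gesture_label = sys.argv[1]                # e.g. "hello" (assumed CLI argument)
cap = cv2.VideoCapture(0)                  # live camera input
cap.set(cv2.CAP_PROP_FPS, 30)              # 30 FPS, as described above

frames = []
while len(frames) < 90:                    # 3 s per gesture * 30 FPS = 90 frames
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(frame)
cap.release()

for i, frame in enumerate(frames):
    # each frame is stored under its gesture class label
    cv2.imwrite(f"{gesture_label}_{i:03d}.png", frame)
```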
3.3.2 Dynamic Gesture Recognition using CNNLSTM

Figure 3. CNNLSTM architecture.

For dynamic gesture recognition, we use the CNNLSTM [18] deep learning architecture. The Convolutional Neural Network Long Short-Term Memory network, often abbreviated as CNNLSTM, is an LSTM architecture designed for temporal prediction problems with spatial inputs such as images or videos. It combines Convolutional Neural Network (CNN) layers for feature extraction on the input data with LSTMs for temporal prediction. More precisely, the architecture consists of two convolutional layers, a flattening layer, and a Long Short-Term Memory recurrent layer followed by a SoftMax output layer, as depicted in Figure 3. Each convolutional layer is comprised of convolution and max-pooling operations. Moreover, the CNNLSTM is a type of Elman recurrent neural network and can consequently be trained with Backpropagation Through Time (BPTT).

The differential image is then given as input to the first convolutional layer at each time step. After processing the differential image, the first convolutional layer produces a set of feature maps, which are subsequently processed by the second convolutional layer. The output from the second layer is flattened, passed as input into the hidden layer, and finally into the LSTM blocks of the recurrent layer. The recurrent layer, responsible for mapping the temporal context, ultimately assigns a gesture label to the differential image.
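A compact Keras sketch of such an architecture is given below: two time-distributed convolution/max-pooling stages, a flattening layer, a fully connected hidden layer, an LSTM layer, and a softmax output over the gesture classes. The filter counts, kernel sizes, input resolution, and sequence length are illustrative assumptions rather than the hyperparameters used in this work.

```python
# Illustrative Keras sketch of a CNNLSTM for gesture classification.
# Hyperparameters (filters, kernel sizes, 64x64 inputs, 90-frame sequences)
# are assumptions, not the paper's exact settings.
from tensorflow.keras import layers, models

NUM_GESTURES = 200                                        # size of the sign vocabulary

model = models.Sequential([
    layers.Input(shape=(90, 64, 64, 1)),                  # sequence of differential images
    layers.TimeDistributed(layers.Conv2D(16, (5, 5), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Conv2D(32, (5, 5), activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D((2, 2))),
    layers.TimeDistributed(layers.Flatten()),
    layers.TimeDistributed(layers.Dense(64, activation="relu")),  # hidden layer
    layers.LSTM(128),                                     # temporal context
    layers.Dense(NUM_GESTURES, activation="softmax"),     # gesture label
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```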
4. RESULTS
4.1 Text to Sign
To check the accuracy of the complete system, an interpreter tested it in the deaf domain. The interpreter tested a sample of 58 sentences and rated the system on factors such as grammar translation, appropriate sign representation, and semantic transfer. The performance of the system was evaluated using parameters such as accuracy, recall, and precision, as shown in Table 2. The results indicate that most of the sentences are correctly translated; the minority of sentences that were incorrectly translated contain words not found in our library.

For instance, Figure 5 shows an example of an Arabic sentence being processed correctly to produce its sign language equivalent.
Figure 6. Videos corresponding to the tokens are retrieved and displayed in order.

4.2 Sign to Text

Table 3. Metrics for Sign to Text translation using the CNNLSTM model

Accuracy | Precision | Recall | F1-measure
88.67%   | 87.52%    | 85.75% | 86.62%

The CNNLSTM model was evaluated using hold-out cross-validation. For each gesture sequence in the testing dataset, the differential images corresponding to each of the frames in the sequence were processed by the CNNLSTM model, and the label assigned most frequently to the differential images was considered representative of the gesture. Similar to the text to sign evaluation, the CNNLSTM model was evaluated on accuracy, precision, recall, and F1-measure, as shown in Table 3. Furthermore, Figure 7 illustrates the frame-by-frame motion tracking used to classify some of the signs.
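The sequence-level decision can be sketched as follows, assuming a differential image is the difference between consecutive frames (following [18]; the exact preprocessing is not reproduced here) and assuming a predict_frame_label() helper that returns the model's label for a single differential image. Both assumptions are illustrative.

```python
# Sketch of sequence-level gesture labeling by majority vote over the
# per-frame predictions. differential_images() assumes simple frame
# differencing; predict_frame_label is an assumed helper, not the paper's API.
from collections import Counter
import numpy as np

def differential_images(frames):
    """frames: list of grayscale frames -> absolute consecutive differences."""
    return [np.abs(frames[i + 1].astype(np.int16) - frames[i].astype(np.int16))
            for i in range(len(frames) - 1)]

def predict_gesture(frames, predict_frame_label):
    labels = [predict_frame_label(d) for d in differential_images(frames)]
    # the label assigned most frequently is taken as the gesture's label
    return Counter(labels).most_common(1)[0][0]
```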
Figure 7. Motion of the ArSL gesture for the phrase 'How are you?' (above) and 'meeting' (below).

5. CONCLUSION
To bridge the communication gap between the deaf and the hearing community, we have built a 2-way sign language translator which can be implemented on a smartphone. The CNNLSTM architecture used for sign to text translation is especially well suited to this task, as it works with RGB input from a regular smartphone camera. For sign language to text translation, the proposed work is currently limited to translating individual dynamic words and phrases, and could be improved to translate complete sentences. Additionally, the model could be extended to accommodate 2-way translation across multiple languages. Moreover, connecting the model to a cloud database holding a crowdsourced gesture library would ensure that the model remains robust to the sociolinguistic changes affecting sign language. Overall, a two-way translator will have a profound impact on the educational sector and on society as a whole.

Almost 15% of school-age children (ages 6-19) in the United States alone have some degree of hearing loss [20]. The technology behind the 2-way translator could be adapted to fit into current schools and colleges. This would greatly help hearing-impaired students integrate into mainstream schools rather than being sent away to special schools. Use of the two-way translator in educational institutions would also help reduce the cost of expensive specialized institutions and accompanying interpreters. In addition, it would enable deaf students to freely and fully voice their thoughts and queries regarding discussions in the institution. Furthermore, it would give the hearing-impaired community equitable educational and employment opportunities, which are often lost due to communication gaps or errors in translation.

The translator will give the deaf a choice between 'Deaf Culture' and 'Normal' culture [21]. Communication via the mobile device would allow the deaf to explore and interact with more places and people, giving them richer social experiences.
A precise two-way communicator would also vastly improve the emotional health of a deaf person. Forty percent of displaced refugees from war-torn areas suffer from hearing loss due to high-pressure waves from bombing. On top of the sudden pressure of adjusting to a new language and culture, loss of hearing is an added disadvantage. Overcoming the communication barrier is of vital importance to deaf refugees, and a two-way communicator would immensely help them secure a sustainable life.
6. REFERENCES
[1] United Nations. "Sign Language, Deaf, Advocacy, Human Rights, Disability." United Nations, https://www.un.org/en/events/signlanguagesday.
[2] Sarji, David K. 2008. "HandTalk: Assistive Technology for the Deaf." Computer, vol. 41, no. 7, pp. 84–86. doi:10.1109/mc.2008.226.
[3] Molchanov, P., Yang, X., Gupta, S., Kim, K., Tyree, S., and Kautz, J. 2016. "Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Network." In Proc. CVPR.
[4] Neverova, N., Wolf, C., Taylor, G., and Nebout, F. 2014. "Multi-scale Deep Learning for Gesture Detection and Localization." In ECCV Workshops.
[5] Wu, D., Pigou, L., Kindermans, P.-J., Le, N., Shao, L., Dambre, J., and Odobez, J.-M. 2016. "Deep Dynamic Neural Networks for Multimodal Gesture Segmentation and Recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1583–1597.
[6] Camgoz, Necati Cihan, et al. 2017. "SubUNets: End-to-End Hand Shape and Continuous Sign Language Recognition." 2017 IEEE International Conference on Computer Vision (ICCV). doi:10.1109/iccv.2017.332.
[7] El Alfi, A.E.E., El Basuony, M.M.R., and El Atawy, S.M. 2014. "Intelligent Arabic Text to Arabic Sign Language Translation for Easy Deaf Communication." Int J Comput Appl, 92, pp. 22–29.
[8] Halawani, S.M., and Zaitun, A.B. 2012. "An Avatar Based Translation System from Arabic Speech to Arabic Sign Language for Deaf People." Int J Inf Sci Educ, 2, pp. 13–20. ISSN 2231-1262.
[9] Al-Khalifa, H. 2010. "Introducing Arabic Sign Language for Mobile Phones." Comput Help People Spec Needs, 6180, pp. 213–220. Springer Berlin Heidelberg.
[10] El-Gayyar, Mahmoud M., et al. 2016. "Translation from Arabic Speech to Arabic Sign Language Based on Cloud Computing." Egyptian Informatics Journal, vol. 17, no. 3, pp. 295–303. doi:10.1016/j.eij.2016.04.001.
[11] Stamp, R., Schembri, A., Fenlon, J., Rentelis, R., Woll, B., and Cormier, K. 2014. "Lexical Variation and Change in British Sign Language." PLoS ONE 9(4): e94053. https://doi.org/10.1371/journal.pone.0094053
[12] Stamp, Rose, et al. 2014. "Lexical Variation and Change in British Sign Language." PLoS ONE, vol. 9, no. 4. doi:10.1371/journal.pone.0094053.
[13] Minisi, M.N. 2015. Arabic Sign Language Dictionary. http://www.menasy.com/
[14] @esl_zayed. 2016. "Emirati Sign Language (ESL): Instagram Photos and Videos." Instagram, https://www.instagram.com/esl_zayed/?hl=en
[15] The Stanford Natural Language Processing Group. "The Stanford NLP Group." https://nlp.stanford.edu/projects/arabic.shtml.
[16] Mohataher. 2017. "mohataher/arabic-stop-words." GitHub, 22 Jan. https://github.com/mohataher/arabic-stop-words.
[17] Oudalab. "oudalab/Arabic-NER." GitHub, https://github.com/oudalab/Arabic-NER.
[18] Tsironi, Eleni, et al. 2017. "An Analysis of Convolutional Long Short-Term Memory Recurrent Neural Networks for Gesture Recognition." Neurocomputing, vol. 268, pp. 76–86. doi:10.1016/j.neucom.2016.12.088.
[19] Lin, Tsung-Yi, Maire, Michael, Belongie, Serge, Hays, James, Perona, Pietro, Ramanan, Deva, Dollár, Piotr, and Zitnick, C. Lawrence. 2014. "Microsoft COCO: Common Objects in Context." vol. 8693. doi:10.1007/978-3-319-10602-1_48.
[20] Yuan, Lisa. "Hearing Loss Facts and Demographics." HLAA, http://hlaa-la.org/better-hearing/hearing-loss-statistics-and-demographics/.
[21] Cripps, Joanne. 2017. "What Is Deaf Culture?" DEAF CULTURE CENTRE, 27 Dec. https://deafculturecentre.ca/what-is-deaf-culture/.