The myth of signing avatars
Rosalee Wolfe
[email protected]
Institute for Language and Speech Processing, Athena RC, Greece
John C. McDonald
[email protected]
School of Computing, DePaul University, Chicago, USA
Eleni Efthimiou
[email protected]
Institute for Language and Speech Processing, Athena RC, Greece
Evita Fotinea
[email protected]
Institute for Language and Speech Processing, Athena RC, Greece
Frankie Picron
[email protected]
European Union of the Deaf, Brussels, Belgium
Davy Van Landuyt
[email protected]
European Union of the Deaf, Brussels, Belgium
Tina Sioen
[email protected]
European Union of the Deaf, Brussels, Belgium
Annelies Braffort
[email protected]
Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
Michael Filhol
[email protected]
Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay, France
Sarah Ebling
[email protected]
Department of Computational Linguistics, University of Zurich, Switzerland
Thomas Hanke
[email protected]
Institut für Deutsche Gebärdensprache, Universität Hamburg, Germany
Verena Krausneker
[email protected]
Institut für Sprachwissenschaft, Universität Wien, Vienna, Austria
Abstract
Development of automatic translation between signed and spoken languages has lagged behind that of automatic translation between spoken languages, but it is a common misperception that extending machine translation techniques to include signed languages should be a straightforward process. A contributing factor is the lack of an acceptable method for displaying sign language apart from interpreters on video. This position paper examines the challenges of displaying a signed language as a target in automatic translation, analyses the underlying causes, and suggests strategies to develop display technologies that are acceptable to sign language communities.
1. Introduction
Deaf sign language users around the world face continual challenges in daily interaction with
hearing, non-signing populations. The gold standard for translating between signed and spoken
languages¹ is the certified sign language interpreter, who is essential to facilitating communication in education, healthcare, and legal consultation, among other situations. However, many transactions in daily living consist of short conversations over a hotel desk, at a store counter or in an office foyer. These interactions are so limited in scope and duration that hiring a qualified interpreter would be prohibitively expensive, unnecessary, or even impossible, since most countries have a shortage of qualified interpreters. In such situations, an automatic translation system between spoken and signed language would ease communication barriers and improve inclusivity. For technology of this sort to be useful, it must display sign language in a way that is acceptable to members of the sign language community.
To be effective, an automated translation system must be able to produce legible, acceptable utterances in the desired target language that are grammatically, phonologically, and phonetically correct, with minimal or no human involvement. Researchers have made significant progress in translating between high-resource languages that have a written form, and some have suggested that automatic translation has achieved human parity in some domains (Hassan et al., 2018).
Progress in translating between signed and spoken languages has lagged significantly in
comparison. Traditionally, this task has been conceived as one of text-to-text translation, involving written representations of sign languages. Since sign languages have no widely accepted written form, an additional required step in going from a spoken language to a sign
language is that of displaying signed languages in their natural moving form, in the visual modality (Ebling, 2016). This position paper examines the challenges of displaying signed language as a target in automatic translation, analyses the underlying impediments and suggests
strategies to develop display technologies that are acceptable to deaf sign language users.
2. Background
Sign languages are distinct from their surrounding spoken languages. For example, in France,
many deaf persons have Langue des Signes Française (LSF), not French, as their preferred
language. Since French is a second language to them, even its written form poses a barrier.
Many researchers have noted that written language poses barriers to members of the Deaf communities (Traxler, 2000; Gutjahr, 2006; Hennies, 2010; Konrad, 2011).
Deaf sign language users consider themselves members of a minority group, with a distinct language, culture, and shared experiences, rather than as simply persons with a disability
(De Meulder, Krausneker, Turner, & Conama, 2019). They continually struggle with the reality
that policy makers in governmental departments, educational institutions and health care agencies are primarily hearing people who are not familiar with the values, goals and concerns of sign language communities (Branson & Miller, 1998). As a result, there is a history of disenfranchisement, which adds a barrier of distrust to the barrier of language that exists between deaf and hearing communities. At present, technologies claiming to translate between spoken and signed languages are not viewed favourably by sign language communities. Rather, the technology is often perceived as a ploy to replace human interpreters (World Federation of the Deaf, 2018; European Union of the Deaf, 2018), or even as cultural appropriation by predominantly hearing researchers, who do not always have linguistic knowledge of these languages and often have little connection with sign language communities (Erard, 2017).
¹ The term spoken language refers to any language that is not signed, whether represented as speech or as text.
Linguists have noted that as long as avatars are only capable of artificial and flawed language, they are very likely to be counterproductive (Austrian Association of Applied Linguistics, 2019).
This scepticism, and often downright hostility, towards automatic translation systems is exacerbated by the generally poor quality of their sign language output (Sayers et al., 2021). To date, such systems have exhibited robotic movement and have been mostly unable to reproduce all of the multimodal articulation mechanisms necessary to be legible. They are comparable to early speech synthesis systems, which featured robotic-sounding voices that chained words together with little regard to coarticulation and no attention to prosody.
3. Quality of the target language
Just as with text-to-text translation applications, users will judge the quality of the application by the quality of its output in the target language. The same is true when the target language is
signed. Poor-quality signing is difficult to understand, just as poor-quality speech synthesis or
egregious misspellings are difficult to understand. It undermines the viewer’s confidence in the
quality of the translation. Worse, poor quality signing alienates the sign language community.
Being forced to struggle with poor signing is no better than being forced to lip-read or use captions in a second language; it simply reconfirms a continuing disenfranchisement. For these reasons, the quality of the signed language display must be given the highest priority in a spoken-to-signed translation system. The motion should be indistinguishable from that of a
human signing the same utterance. This visual Turing Test should be the ultimate goal of any
sign language display.
4. Sign language in automatic translation services
Among the challenges to acceptable sign language display as part of an automatic translation system, three issues stand out: 1) the difference in modality between signed and spoken languages, 2) the representation used to characterize sign languages, and 3) the development of the technology required to display sign languages.
4.1. Modality
The modality of sign languages differs markedly from that of spoken languages, which utilize
the vocal apparatus for production, and hearing for reception. Spoken languages use visible
communicative behaviours like gestures as well, but listeners can comprehend audio-only
sources. In contrast, signed languages use only visible actions for production, and vision for
reception. Whereas speech utilizes a single vibrating column of air for producing utterances,
signed languages use the configuration and movement of multiple body parts concurrently, including hands with all the fingers, head, face, eyes, and torso.
All sign languages have linguistic processes that are not linearly ordered. For example, in
American Sign Language (ASL) the appearance of pursed lips in conjunction with the sign
SMOOTH intensifies the degree of smoothness. In signed language, layers of processes ranging
from the phonological to the prosodic can co-occur (Crasborn, 2006). Co-occurrence is a more
general term than synchronized or simultaneous, as co-occurring events do not necessarily start
or end at the same time, but they overlap in their duration.
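To make the notion of co-occurrence concrete, the following minimal sketch models an utterance as events on independent tiers and tests for temporal overlap rather than for shared start and end points. It is an illustration only: the tier names, labels and timings are invented for this example and do not follow any standard annotation scheme.

    from dataclasses import dataclass

    @dataclass
    class Event:
        tier: str    # articulatory layer, e.g. "manual", "mouth", "brow"
        label: str   # gloss or non-manual label
        start: float # seconds
        end: float

    def co_occur(a: Event, b: Event) -> bool:
        # Two events co-occur if their durations overlap; they need
        # not start or end at the same time.
        return a.start < b.end and b.start < a.end

    # Hypothetical fragment: pursed lips overlapping the sign SMOOTH.
    utterance = [
        Event("manual", "SMOOTH", start=0.40, end=1.10),
        Event("mouth", "pursed-lips", start=0.55, end=1.00),
        Event("brow", "raised", start=0.00, end=0.70),
    ]

    smooth, lips, brow = utterance
    print(co_occur(smooth, lips))  # True: overlapping, not simultaneous
    print(co_occur(smooth, brow))  # True: partial overlap only

Note that none of the three events shares a boundary, yet each pair overlaps; this is the sense in which layers co-occur without being synchronized.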
Although there are many discrete lexical items in signed languages, much information is conveyed through depicting forms of essentially infinite variability, unlike fixed dictionary signs. A case in point is classifiers, which represent general categories or “classes” of objects. They
can be used to describe the size and shape of an object, and they can also represent how an
object moves or is utilized. Through the use of classifiers, a signer can describe a scenario with
few discrete lexical items. The signer creates an image in space. This is not simply an informal
gesture as there are well-documented linguistic rules governing classifier usage (Lepic &
Occhino, 2018). These are evocative, not necessarily iconic, and are extremely powerful. In a
story about a motorcycle ride (Dudis, 2004), a signer can use an instrument classifier to indicate
that the rider is revving the engine and a vehicle classifier to show the rider driving away on a
hilly highway (Figure 1).
Figure 1. Classifier usage: the motorcyclist, and driving up a hill (Dudis, 2004).
The presence of multiple articulators that can co-occur and classifier usage are examples
of the stark difference between signed and spoken languages. For these reasons, it is essential
to avoid the trap of casting the problem of signed/spoken translation as a case of simply retrieving lexical items or phrasal units from a dictionary and concatenating them.
4.2. Representation
The second of the three challenges is the question of representation. Languages commonly processed by automatic translation systems have a written form. Signed languages do not. They
are languages and cultures that have been preserved and transmitted from generation to generation by “hand to eye to hand”. Determining a standard transcription/annotation system that can
capture all of the linguistic information contained in a signed message is still an open question.
A linear stream of glosses, even with accompanying superscript strings to indicate prosody and
syntax (Adamo-Villani & Wilbur, 2015), does not contain the entire semantic content of a
signed utterance, in particular the depicting and spatialized linguistic structures.
This is not analogous to the difference between reading printed text on a page and witnessing an actor perform the text: less information is captured in a gloss stream than is conveyed in written text. A hearing person may argue that not all features of articulation are captured in a printed sentence of a spoken language, such as speed of delivery; but in languages where adverbs are not necessarily expressed as separate lexical items, the lack of a speed indication loses semantic information, not just performance information.
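A small extension of the earlier sketch shows what a linear gloss stream discards. The serialization below (again with invented notation; no standard exists) keeps manual glosses and tags them with overlapping non-manual labels, but all timing, spatial and depicting structure is lost.

    # Hypothetical multi-tier fragment, as in the earlier sketch.
    utterance = [
        {"tier": "manual", "label": "SMOOTH", "start": 0.40, "end": 1.10},
        {"tier": "mouth", "label": "pursed-lips", "start": 0.55, "end": 1.00},
        {"tier": "brow", "label": "raised", "start": 0.00, "end": 0.70},
    ]

    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    def to_gloss_stream(events):
        # Flatten the tiers into a linear gloss stream with
        # superscript-style tags; timing detail is discarded.
        out = []
        for sign in (e for e in events if e["tier"] == "manual"):
            tags = [e["label"] for e in events
                    if e["tier"] != "manual" and overlaps(sign, e)]
            out.append(sign["label"] + (f"^[{','.join(tags)}]" if tags else ""))
        return " ".join(out)

    print(to_gloss_stream(utterance))  # SMOOTH^[pursed-lips,raised]

The result records that the non-manuals occurred, but not when they began or ended relative to the sign, nor anything about the use of space.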
4.3. Sign language display
The third challenge is the display of a sign language when it is the target. The most commonly
used strategy for this purpose is avatar technology. Three-dimensional avatars have the
advantages of consistency and flexibility. When recording a human signer with traditional video, special care must be taken to ensure consistency of the studio setup and the appearance of the signer between recording sessions, which requires additional time and money. With an avatar, the lighting and camera setup can be fixed, and the clothing, hair and makeup can be chosen by the viewer. No additional resources are required to ensure consistency.
In addition, avatars gain flexibility through the use of animation techniques. They can display co-occurring linguistic processes, and proper application of coarticulation can provide smooth transitions and inflect signs according to syntactic rules. These properties are necessary for a translation system to produce novel utterances.
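As a rough illustration of the kind of transition generation involved, the sketch below linearly interpolates from the final pose of one sign to the initial pose of the next. This is a deliberate oversimplification on invented data: real coarticulation models operate on many articulators under biomechanical constraints, not a single joint-angle vector.

    def lerp_pose(pose_a, pose_b, t):
        # Blend two joint-angle vectors; t runs from 0.0 to 1.0.
        return [a + t * (b - a) for a, b in zip(pose_a, pose_b)]

    def transition_frames(end_pose, start_pose, n_frames):
        # In-between frames from one sign's final pose to the next
        # sign's initial pose.
        return [lerp_pose(end_pose, start_pose, i / (n_frames - 1))
                for i in range(n_frames)]

    # Invented example: three joint angles (degrees) per pose.
    sign1_end = [10.0, 45.0, -5.0]
    sign2_start = [30.0, 20.0, 15.0]
    for frame in transition_frames(sign1_end, sign2_start, n_frames=5):
        print([round(angle, 1) for angle in frame])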
Avatars also have flexibility in appearance. They can easily be adapted to look like the original speaker or source, such as a presenter, a cartoon character or a movie character. This flexibility in appearance can also anonymize a signer, so that the signer's identity remains hidden.
Anonymization also confers one of the key properties of written language, which is inherently more anonymous than a live performance that is spoken or signed. With an anonymously presented avatar, content can be communicated without knowing the person who expressed it.
5. The promise and mythology of avatars
Given that there is a century’s worth of development in animation, and nearly half that supporting video game technology, it would be tempting to dismiss the question of using avatars to
display sign languages as a solved problem. However, a closer analysis shows that significant challenges have yet to be fully addressed.
Animation, the precursor to avatar technology, is powerfully communicative. Animation
artists abstract and emphasize the salient features of a character for greater audience appeal and
engagement. Simplification of a character’s appearance is vital to maximizing emotional impact. This is the reason that the eyes of Disney cartoon characters are twice the size of those of
a human and spaced more widely apart.
However, the requirements for sign language display are different from those for portraying cartoon characters. Beyond communicative power, display of sign language requires precision. It must adhere more closely to physical reality. For example, the hands of animation characters such as Mickey Mouse or Homer Simpson have only three fingers. For a hearing audience, this is perfectly acceptable, but three fingers aren’t enough to distinguish between the
fingerspelled letter W and the number 4 (Figure 2). Another consideration is that while character
animation effectively uses the face and body to express emotion, the facial animation is typically at a lower quality than what would be required to portray a sign language legibly.
Figure 2. The difference between the letter W and the digit 4 would disappear in a three-fingered
character.
Several ground-breaking animations have received attention and praise from sign language communities (Stewart, 2008; Fundación Fesord CV, 2007). These were manually created by artists with the assistance of motion capture. The artists create the underlying natural processes of
coordinated muscle action, coarticulation at a biomechanical level, and ambient movement. While creating the animation, the artists continually check whether the draft effectively communicates the intended message and edit it when there are flaws.
However, animations are intended for playback only and are not extensible without manual
intervention. Once completed, they are archived, and without additional manual editing cannot
be utilized for generating new utterances. In short, animations are not created in real time and
are not interactive.
In contrast, video game characters move in response to player input in real time and are
highly interactive. Thus, using video game technology might seem like an expedient approach
to sign language display for a translation system. However, many game players continue to
comment on the poor quality of the game characters. This is due to the effect of the uncanny
valley (Tinwell, 2014). If a character appears more human-like, viewers expect the character to
behave in a more human-like manner. But because the character’s motion cannot be refined and
edited by human animators before it is displayed, the results are unsatisfying. As explained by
a professional animator (Trentskiroonie, 2015),
For something like film or television, I could create a kickass animation of a monster jumping
off a building and landing on the street below, but to do the same thing in a game, the movement
has to be broken up into separate parts. This is because he probably won't do the exact same
action every time. There may be buildings of different heights in the game, so I can't hard-code
the height of the jump into the animation. I have to create an initial jump animation, then an idle
hang-time animation to play while he's in the air, and then a landing animation. The programmer
then strings the jump, hang-time, and landing together and decides the timing and trajectory of
the hang-time part procedurally. ... That takes artistic control away from the animator and can
result in some fugly animation.
Unfortunately, a “fugly” motion on a sign language avatar can destroy the legibility and even the meaning of the message, making the avatar bothersome or even useless for a deaf sign language user. Finally, the representation of signed languages through avatars will also affect hearing people's perception of these minority languages. Hearing viewers should not be confronted with “fugly” signed texts and misled into thinking that this is real sign language in all its beauty and richness.
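The animator's jump example can be made concrete with a short sketch: pre-authored clips are strung together at run time, and the hang-time segment is stretched procedurally to fit the fall. Every name and number below is hypothetical; the point is only that the timing is decided by code rather than by an animator, which is exactly where the “fugly” artefacts enter.

    import math

    # Hypothetical pre-authored clip durations in seconds.
    CLIPS = {"jump": 0.4, "hang": 0.5, "land": 0.3}

    def string_jump(height_m, gravity=9.81):
        # String jump, hang-time and landing clips together,
        # stretching the hang-time clip to match the computed fall
        # duration for this particular building.
        fall_time = math.sqrt(2 * height_m / gravity)
        hang_scale = fall_time / CLIPS["hang"]  # procedural time-stretch
        return [("jump", CLIPS["jump"], 1.0),
                ("hang", CLIPS["hang"], hang_scale),
                ("land", CLIPS["land"], 1.0)]

    # Buildings of different heights yield different procedural timings.
    for height in (3.0, 12.0):
        print(height, string_jump(height))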
The analysis of the requirements for a sign language avatar shows that it must have the
expressivity of manual animation but the flexibility of a video game character. These two requirements are in conflict. It is still an open question as to how to reconcile these goals.
6. Moving forward
The establishment of a set of best practices would be a substantive step toward the development
of better sign language displays in automatic translation systems, but it cannot happen without genuine collaboration with sign language stakeholders (Tupi, 2019). Deaf leadership is vital
for the establishment of a validated methodology for user evaluation of avatar technology. Once
created and reviewed, the methodology should be made publicly available to all researchers
working in this area. Currently in Austria, there is a small research project aiming to create a
Best Practice Protocol for the use of signing avatars (Krausneker, 2021).
This is consistent with the World Federation of the Deaf’s position paper on Sign Language
Work (World Federation of the Deaf, 2014).
The WFD considers exclusion of Deaf Community and their national organizations from sign
language work ... a violation of the linguistic human rights of deaf people. Decisions regarding
sign languages should always remain within the linguistic community, in this case deaf people.
Best practice for reviewing research papers would include an awareness of the multidisciplinary
qualifications required. It is not enough to know about machine translation. Reviewers must also
be aware of sign language linguistics, the deaf experience and previous work in sign language
machine translation.
When reporting on an advance in sign language avatar technology, researchers should include a sample of the sign language produced by the technique outlined in the paper. Since the
sample would necessarily contain motion, it could either take the form of a media file in a
commonly available format such as MPEG-4, or a web application available online. Conference organizers and journal editors need to collaborate with academic and professional organizations to archive media accompanying research papers.
7. Conclusion
“Together, we are strong.” -- Lutz König, Hamburg, 14 November 2017
Together, machine translation (MT) researchers, sign language linguists and the deaf sign language community have the potential to form powerful partnerships to educate policy makers (Bragg et al., 2019). Ideally, Deaf professionals should be educated, supported, and actively sought out for inclusion in research projects relevant to sign languages.
To hearing researchers: Get to know members of sign language communities and learn
about deaf culture.
• Take a class in the national sign language of your country. You already know several spoken languages -- why not discover an entirely new world? Or, if you don't feel you have time,
• Go to a deaf event -- see a play in sign language, go to a deaf trade show.
• When writing grant proposals that include work relevant for sign languages, include the local and/or national deaf community. Most countries in the world have a National Association of the Deaf. Include budget for interpreters.
• Listen. Just because an idea or a result is incredibly appealing to an MT researcher does not mean that it will be useful or welcomed within the sign language community. Take feedback seriously and act on it.
Through exchange of ideas and concerns, the sign language community can inform MT researchers about their priorities, and MT researchers can clarify the capabilities and limitations
of today’s technologies. A clear understanding of priorities, expectations, potentials, and limitations will move the state of the art closer to realization of better inclusivity.
Acknowledgments
This work is supported in part by the EASIER (Intelligent Automatic Sign Language Translation) Project. EASIER has received funding from the European Union’s Horizon 2020 research
and innovation programme, grant agreement n° 101016982.
Bibliography
Adamo-Villani, N., & Wilbur, R. B. (2015). ASL-Pro: American sign language animation with prosodic elements. International Conference on Universal Access in Human-Computer Interaction (pp. 307–318).
Austrian Association of Applied Linguistics. (2019, August). Position Paper on Automated Translations and Signing Avatars. Retrieved from verbal; Verband für Angewandte Linguistik Österreich: https://www.verbal.at/stellungnahmen/Position_PaperAvatars_verbal_2019.pdf
Bragg, D., Koller, O., Bellard, M., Berke, L., Boudreault, P., Braffort, A., . . . others. (2019).
Sign language recognition, generation, and translation: An interdisciplinary
perspective. The 21st International ACM SIGACCESS Conference on Computers
and Accessibility (pp. 16–31).
Branson, J., & Miller, D. (1998). Nationalism and the linguistic rights of Deaf communities:
Linguistic imperialism and the recognition and development of sign languages.
Journal of Sociolinguistics, 2, 3–34.
Crasborn, O. A. (2006). Nonmanual structures in sign language. In K. Brown (Ed.), Encyclopedia of Language and Linguistics (2nd ed., pp. 668–672). Oxford: Elsevier.
De Meulder, M., Krausneker, V., Turner, G., & Conama, J. B. (2019). Sign language communities. In G. Hogan-Brun & B. O'Rourke (Eds.), The Palgrave Handbook of Minority Languages and Communities (pp. 207–232). London: Palgrave Macmillan.
Dudis, P. G. (2004). Body partitioning and real-space blends. Cognitive Linguistics, 15(2),
223-238.
Ebling, S. (2016). Automatic Translation from German to Synthesized Swiss German Sign
Language. Ph.D. dissertation, University of Zurich.
Erard, M. (2017, November 9). Why sign-language gloves don't help deaf people. Retrieved from The Atlantic: https://www.theatlantic.com/technology/archive/2017/11/why-sign-language-gloves-dont-help-deaf-people/545441/
European Union of the Deaf. (2018, October 26). Accessibility of information and communication. Retrieved from European Union of the Deaf: https://www.eud.eu/about-us/eud-position-paper/accessibility-information-and-communication/
Fundación Fesord CV. (2007, Jan 26). World Federation of the Deaf 2007. Retrieved from YouTube: https://www.youtube.com/watch?v=wW2KBXrPEdM
Gutjahr, A. E. (2006). Lesekompetenz Gehörloser: Ein Forschungsüberblick. Ph.D.
dissertation.
Hassan, H., Aue, A., Chen, C., Chowdhary, V., Clark, J., Federmann, C., . . . Zhou, M. (2018, Mar 15). Achieving Human Parity on Automatic Chinese to English News Translation. Retrieved from Microsoft.com: https://www.microsoft.com/en-us/research/uploads/prod/2018/03/final-achieving-human.pdf
Hennies, J. (2010). Lesekompetenz gehörloser und schwerhöriger SchülerInnen: Ein Beitrag zur empirischen Bildungsforschung in der Hörgeschädigtenpädagogik.
Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., . . . others. (2017).
Google’s multilingual neural machine translation system: Enabling zero-shot
translation. Transactions of the Association for Computational Linguistics, 5, 339–
351.
Konrad, R. (2011). Die lexikalische Struktur der Deutschen Gebärdensprache im Spiegel
empirischer Fachgebärdenlexikographie. Gunter Narr Verlag.
Krausneker, V. (2021). Avatars and sign languages: Developing a best practice protocol on quality in accessibility. Retrieved from University of Vienna: https://avatar-bestpractice.univie.ac.at/
Lepic, R., & Occhino, C. (2018). A construction morphology approach to sign language analysis. In The construction of words (pp. 141–172). Springer.
Sayers, D., Sousa-Silva, R., Höhn, S., Ahmedi, L., Allkivi-Metsoja, K., Anastasiou, D., . . . others. (2021). The dawn of the human-machine era: A forecast of new and emerging language technologies. Retrieved from LITHME: https://lithme.eu/wp-content/uploads/2021/05/The-dawn-of-the-human-machine-era-a-forecast-report-2021-final.pdf
Stewart, J. (2008, July 21). The Forest - A story in ASL. Retrieved from YouTube: https://www.youtube.com/watch?v=oUclQ10BsH8
Tinwell, A. (2014). The uncanny valley in games and animation. CRC Press.
Traxler, C. B. (2000). The Stanford achievement test: National norming and performance
standards for deaf and hard-of-hearing students. Journal of deaf studies and deaf
education, 5, 337–348.
Trentskiroonie. (2015). Let's talk about Animation Quality! Retrieved from reddit.com: https://www.reddit.com/r/truegaming/comments/2x4fqy/lets_talk_about_animation_quality/
Tupi, E. (2019). Sign language rights in the framework of the Council of Europe and its member states. Helsinki: Ministry for Foreign Affairs of Finland.
World Federation of the Deaf. (2014, February 19). WFD statement of sign language work. Retrieved from World Federation of the Deaf: http://wfdeaf.org/wp-content/uploads/2016/11/WFD-statement-sign-language-work.pdf
World Federation of the Deaf. (2018, March 14). WFD and WASLI statement of use of signing avatars. Retrieved from World Federation of the Deaf: https://wfdeaf.org/news/resources/wfd-wasli-statement-use-signing-avatars/