Zurich Open Repository and
Archive
University of Zurich
Main Library
Strickhofstrasse 39
CH-8057 Zurich
www.zora.uzh.ch
Year: 2012
Laugh machine
Urbain, Jerôme ; Niewiadomski, Radoslaw ; Hofmann, Jennifer ; Bantegnie, Emeline ; Baur, Tobias ;
Berthouze, Nadia ; Cakmak, Hüseyin ; Cruz, Richard Thomas ; Dupont, Stephane ; Geist, Matthieu ;
Griffin, Harry ; Lingenfelser, Florian ; Mancini, Maurizio ; McKeown, Gary ; Miranda, Miguel ; Pammi,
Sathish ; Pietquin, Olivier ; Piot, Bilal ; Platt, Tracey ; Ruch, Willibald ; Sharma, Abhishek ; Volpe,
Gualtiero ; Wagner, Johannes
Abstract: The Laugh Machine project aims at endowing virtual agents with the capability to laugh
naturally, at the right moment and with the correct intensity, when interacting with human participants.
In this report we present the technical development and evaluation of such an agent in one specific
scenario: watching TV along with a participant. The agent must be able to react to both the video and
the participant's behaviour. A full processing chain has been implemented, integrating components to
sense the human behaviours, decide when and how to laugh and, finally, synthesize audiovisual laughter
animations. The system was evaluated in its capability to enhance the affective experience of naive
participants, with the help of pre- and post-experiment questionnaires. Three interaction conditions have
been compared: laughter-enabled or not, reacting to the participant's behaviour or not. Preliminary
results (the number of experiments is currently too small to obtain statistically significant differences)
show that the interactive, laughter-enabled agent is positively perceived and increases the emotional
dimension of the experiment.
Posted at the Zurich Open Repository and Archive, University of Zurich
ZORA URL: https://doi.org/10.5167/uzh-76341
Book Section
Published Version
Originally published at:
Urbain, Jerôme; Niewiadomski, Radoslaw; Hofmann, Jennifer; Bantegnie, Emeline; Baur, Tobias; Berthouze,
Nadia; Cakmak, Hüseyin; Cruz, Richard Thomas; Dupont, Stephane; Geist, Matthieu; Griffin, Harry;
Lingenfelser, Florian; Mancini, Maurizio; McKeown, Gary; Miranda, Miguel; Pammi, Sathish; Pietquin,
Olivier; Piot, Bilal; Platt, Tracey; Ruch, Willibald; Sharma, Abhishek; Volpe, Gualtiero; Wagner, Johannes (2012). Laugh machine. In: Pietquin, Olivier. Proceedings eNTERFACE'12. The 8th International Summer Workshop on Multimodal Interfaces, 2nd – 27th of July 2012. Metz, France: Supélec,
13-34.
Forewords
The eNTERFACE’12 workshop was organized by the Metz’ Campus of Supélec and co-sponsored
by the ILHAIRE and Allegro European projects.
The previous workshops in Mons (Belgium), Dubrovnik (Croatia), Istanbul (Turkey), Paris
(France), Genoa (Italy), Amsterdam (The Netherlands) and Plzen (Czech Republic) had an impressive success record and had proven the viability and usefulness of this original workshop. eNTERFACE'12, hosted by Supélec in Metz (France), took this line of fruitful collaboration one step
further. Previous editions of eNTERFACE have already inspired competitive projects in the area
of multimodal interfaces, have secured the contributions of leading professionals and have encouraged
the participation of a large number of graduate and undergraduate students.
We received high-quality project proposals, among which the following 8 projects were selected.
1. Speech, gaze and gesturing - multimodal conversational interaction with Nao robot
2. Laugh Machine
3. Human motion recognition based on videos
4. M2M -Socially Aware Many-to-Machine Communication
5. Is this guitar talking or what!?
6. CITYGATE, The multimodal cooperative intercity Window
7. Active Speech Modifications
8. ArmBand: Inverse Reinforcement Learning for a BCI driven robotic arm control
All the projects led to promising results and demonstrations, which are reported in the
rest of this document. The workshop gathered more than 70 attendees coming from 16 countries
all around Europe and even further. We welcomed 4 invited speakers (Laurent Bougrain, Thierry
Dutoit, Kristiina Jokinen and Anton Batliner), whose talks were greatly appreciated. The workshop was held in a brand new 800 m² building in which robotics materials as well as many sensors
were available to the attendees. This is why we proposed a special focus of this edition on topics
related to human-robot and human-environment interaction. This event was a unique opportunity
for students and experts to meet and work together, and to foster the development of tomorrow’s
multimodal research community.
All this has been made possible thanks to the good will of many of my colleagues who
volunteered before and during the workshop. In particular, I want to address many thanks to Jérémy,
who did a tremendous job making this event as enjoyable and fruitful as possible. Thanks a lot
to Matthieu, Thérèse, Danièle, Jean-Baptiste, Senthil, Lucie, Edouard, Bilal, Claudine, Patrick,
Michel, Dorothée, Serge, Calogero, Yves, Eric, Véronique, Christian, Nathalie and Elisabeth. Organizing this workshop was a real pleasure for all of us and we hope we could make it a memorable
moment of work and fun.
Olivier Pietquin
Chairman of eNTERFACE’12
The eNTERFACE’12 Sponsors
We want to express our gratitude to all the organizations which made this event possible.
The eNTERFACE’12 Scientific Committee
Niels Ole Bernsen, University of Southern Denmark - Odense, Denmark
Thierry Dutoit, Faculté Polytechnique de Mons, Belgium
Christine Guillemot, IRISA, Rennes, France
Richard Kitney, University College London, United Kingdom
Benoı̂t Macq, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
Cornelius Malerczyk, Zentrum für Graphische Datenverarbeitung e.V, Germany
Ferran Marques, Universitat Politècnica de Catalunya (UPC), Spain
Laurence Nigay, Université Joseph Fourier, Grenoble, France
Olivier Pietquin, Supélec, Metz, France
Dimitrios Tzovaras, Informatics and Telematics Institute, Greece
Jean-Philippe Thiran, Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
Jean Vanderdonckt, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
The eNTERFACE’12 Local Organization Committee
General chair: Olivier Pietquin
Co-chair: Jeremy Fix
Web management: Claudine Mercier
Technical support: Jean-Baptiste Tavernier
Social activities: Matthieu Geist
Administration: Danielle Cebe, Thérèse Pirrone
eNTERFACE 2012 - Project reports
Project   Title                                                                               Pages
P1        Speech, gaze and gesturing - multimodal conversational interaction with Nao robot   7-12
P2        Laugh Machine                                                                       13-34
P3        Human motion recognition based on videos                                            35-38
P5        M2M - Socially Aware Many-to-Machine Communication                                  39-46
P6        Is this guitar talking or what!?                                                    47-56
P7        CITYGATE, The multimodal cooperative intercity Window                               57-60
P8        Active Speech Modifications                                                         61-82
P10       ArmBand: Inverse Reinforcement Learning for a BCI driven robotic arm control        83-88
Laugh Machine
Jérôme Urbain1 , Radoslaw Niewiadomski2 , Jennifer Hofmann3 , Emeline Bantegnie4 , Tobias Baur5 ,
Nadia Berthouze6 , Hüseyin Çakmak1 , Richard Thomas Cruz7 , Stéphane Dupont1 , Matthieu Geist10 ,
Harry Griffin6 , Florian Lingenfelser5 Maurizio Mancini8 , Miguel Miranda7 , Gary McKeown9 , Sathish Pammi2 ,
Olivier Pietquin10 , Bilal Piot10 , Tracey Platt3 , Willibald Ruch3 , Abhishek Sharma2 , Gualtiero Volpe8
and Johannes Wagner5
1 TCTS Lab, Faculté Polytechnique, Université de Mons, Place du Parc 20, 7000 Mons, Belgium
2 CNRS - LTCI UMR 5141 - Telecom ParisTech, Rue Dareau, 37-39, 75014 Paris, France
3 Universität Zürich, Binzmuhlestrasse, 14/7, 8050 Zurich, Switzerland
4 LA CANTOCHE PRODUCTION, Hauteville, 68, 75010 Paris, France
5 Institut für Informatik, Universität Augsburg, Universitätsstr. 6a, 86159 Augsburg, Germany
6 UCL Interaction Centre, University College London, Gower Street, London, WC1E 6BT, United Kingdom
7 Center for Empathic Human-Computer Interactions, De la Salle University, Manila, Philippines
8 Universita Degli Studi di Genova, Viale Francesco Causa, 13, 16145 Genova, Italy
9 The Queen's University of Belfast, University Road, Lanyon Building, BT7 1NN Belfast, United Kingdom
10 École Supérieure d'Électricité, Rue Edouard Belin, 2, 57340 Metz, France
Abstract—The Laugh Machine project aims at endowing
virtual agents with the capability to laugh naturally, at the
right moment and with the correct intensity, when interacting
with human participants. In this report we present the technical
development and evaluation of such an agent in one specific
scenario: watching TV along with a participant. The agent
must be able to react to both the video and the participant's
behaviour. A full processing chain has been implemented, integrating components to sense the human behaviours, decide when
and how to laugh and, finally, synthesize audiovisual laughter
animations. The system was evaluated in its capability to enhance
the affective experience of naive participants, with the help
of pre- and post-experiment questionnaires. Three interaction
conditions have been compared: laughter-enabled or not, reacting
to the participant's behaviour or not. Preliminary results (the
number of experiments is currently too small to obtain statistically
significant differences) show that the interactive, laughter-enabled
agent is positively perceived and increases the emotional
dimension of the experiment.
Index Terms—Laughter, virtual agent.
I. INTRODUCTION
LAUGHTER is a significant feature of human communication, and machines acting in roles like companions
or tutors should not be blind to it. So far, limited progress
has been made towards allowing computer-based applications
to deal with laughter. In consequence, only a few interactive
multimodal systems exist that use laughter in the interactions.
Within the long term aim of building a truly interactive
machine able to laugh and respond to human laughter, during
the eNTERFACE Summer Workshop 2012 we have developed
the Laugh Machine project.
This project had three main objectives. First of all we aimed
to build an interactive system that is able to detect human laughs and to laugh back appropriately (i.e., with the right timing and the right type of laughter) to the human and the context. Secondly, we
wanted to use the laughing agent to support psychological
studies investigating the benefits of laughter in human-machine
interaction and consequently improve the system towards more
naturalness and believability. The third aim was the collection
of multimodal data on human interactions with the agent-based
system.
To achieve these aims, we tuned and integrated several
existing analysis components that can detect laughter events, as
well as interpreters that control how the virtual agent should
react to them. In addition, we also provided output components
that are able to synthesize audio-visual laughs. All these
components were integrated to work in real-time. Secondly,
we focused on building an interactive scenario where our
laughing agent can be used. In our scenario, the participant
watches a funny stimulus (i.e., film clip, cartoon) together
with the virtual agent. The agent is able to laugh, reacting
to both the stimulus and the user's behavior. We evaluated
the impact of the agent through user evaluation questionnaires
(e.g., assessing mood pre- and post-experiment, funniness
and aversiveness ratings to both stimuli and agent behavior,
etc.). At the same time we were able to collect multimodal data
(audio, facial expressions, shoulder movements, and Kinect
depth maps) of people interacting with the system.
This report is organized as follows. First, related work is
presented in Section II. Then, the experimental scenarios are
outlined in Section III, so that the framework for developing
the technical setup is known. The data used for training the
components is presented in Section IV. Section V shows the
global architecture of the Laugh Machine system. The next
sections focus on the components of this system: details about
the input components are given in Section VI, Section VII is
related to the dialog manager and the output components are
described in Section VIII. Then, the conducted experiments to
evaluate the system are explained in Section IX. The results of
these experiments are discussed in Section X. Section XI refers
to the data that has been collected during the experiments.
Finally, Section XII presents the conclusions of the project.
II. RELATED WORK
Building an interactive laughing agent requires tools from
several fields: at least audiovisual laughter synthesis for the
output, and components able to detect particular events like
participant’s laughs and decide when and how to laugh. In
the following paragraphs we will present the main works in
audiovisual laughter recognition, acoustic laughter synthesis
and visual laughter synthesis, then the interactive systems
involving laughter that have already been built. Regarding a
decision component dealing with laughter as input and output,
to the best of our knowledge there is no existing work.
A. Audiovisual laughter recognition
In the last decade, several systems have been built to
distinguish laughter from other sounds like speech. It started
with audio-only detection. The global approach followed up
to now for discriminating speech and laughter is to compute
standard acoustic features (MFCCs, pitch, energy, ...) and
feed them into typical classifiers: Gaussian Mixture Models
(GMMs), Support Vector Machines (SVMs) or Multi-Layer
Perceptrons (MLPs). Kennedy and Ellis [1] obtained 87%
classification accuracy with SVMs fed with 6 MFCCs; Truong
and van Leeuwen [2] reached slightly better results (equal
error rate of 11%) with MLPs fed with Perceptual Linear
Prediction features; Knox and Mirghafori [3] obtained better
performance (around 5% error) by using temporal feature
windows (feeding the MLPs with the features belonging to the
past, current and future frames).
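As a rough illustration of this temporal-window idea, the sketch below concatenates each frame's features with those of neighbouring frames before classification. The feature dimensions, context size and classifier settings are illustrative assumptions, not the configurations used in the cited studies.

```python
# Sketch: laughter vs. speech classification with temporal feature windows,
# in the spirit of [3]. Assumes pre-extracted per-frame MFCC features and
# binary labels (1 = laughter, 0 = speech); hyperparameters are illustrative.
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(frames, left=3, right=3):
    """Concatenate each frame with its `left` past and `right` future frames."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(left + right + 1)])

# frames: (n_frames, n_mfcc) array, labels: (n_frames,) array of 0/1
frames = np.random.randn(1000, 6)          # placeholder for real MFCC frames
labels = np.random.randint(0, 2, 1000)     # placeholder for real annotations

X = stack_context(frames)                  # each sample now spans a temporal window
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, labels)
print(clf.predict(stack_context(frames[:10])))
```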
In 2008, Petridis and Pantic started to enrich the so far
mainly audio-based work in laughter detection by consulting
audio-visual cues for decision level fusion approaches [4]–
[6]. They combined spectral and prosodic features from the
audio modality with head movement and facial expressions
from the video channel. Results suggest that integrated information from audio and video leads to improved classification
reliability compared to a single modality - even with fairly
simple fusion methods. They reported a classification accuracy of 74.7% to distinguish three classes, namely unvoiced
laughter, voiced laughter and speech. In [7] they present a
new classification approach for discriminating laughter from
speech by modelling the relationship between acoustic and
visual features with Neural Networks.
B. Acoustic laughter synthesis
Acoustic laughter synthesis is an almost unexplored domain.
Only two attempts have been reported in the literature. Sundaram
and Narayanan [8] modeled the laughter intensity rhythmic
envelope with the equations governing an oscillating mass-spring system and synthesized laughter vowels by Linear Prediction.
This approach to laughter synthesis was interesting, but the
produced laughs were judged as non-natural by listeners.
Lasarcyk and Trouvain [9] compared laughs synthesized by
an articulatory system (a 3D modeling of the vocal tract) and
diphone concatenation. The articulatory system gave better
results, but they were still evaluated as significantly less
natural than human laughs. In 2010, Cox conducted an online
evaluation study to measure to what extent (copy-)synthesized
laughs were perceived as generated by a human or a computer
[10]. Laughs synthesized by the 2 aforementioned groups were
included in the study, as well as a burst-concatenation copy-synthesized laughter proposed by UMONS, which obtained the
best results with almost 60% of the 6000 participants thinking
it could be a human laugh. Nevertheless, this number is far
from the 80% achieved by a true human laugh.
C. Visual laughter synthesis
The audio-synchronous visual synthesis of laughter requires
the development of innovative hybrid approaches that combine
several existing animation techniques such as data-driven
animation, procedural animation and machine learning based
animation. Some preliminary audio-driven models of laughter
have been proposed. In particular Di Lorenzo et al. [11] proposed an anatomic model of torso respiration during laughter,
while Cosker and Edge [12] worked on facial animation during
laughter. The first model does not work in real-time while the
second is limited to only facial animation.
D. Laughing virtual agents
Urbain et al. [13] have proposed the AVLaughterCycle
machine, a system able to detect and respond to human laughs
in real time. With the aim of creating an engaging interaction
loop between a human and the agent they built a system
capable of recording the user’s laugh and responding to it with
a similar laugh. The virtual agent response is automatically
chosen from an audio-visual laughter database by analyzing
acoustic similarities with the input laughter. This database
is composed of audio samples accompanied by the motion
capture data of facial expressions. While the audio content is
directly replayed, the corresponding motion capture data are
retargeted to the virtual model.
Shahid et al. [14] proposed Adaptive Affective Mirror, a tool
that is able to detect user’s laughs and to present audio-visual
affective feedback, which may elicit more positive emotions in
the user. In more detail, Adaptive Affective Mirror produces
a distortion of the audio-visual input using real-time graphical
filters such as bump distortion. These distortions are driven by
the amount and type of user’s laughter that has been detected.
Fukushima et al. [15] built a system able to increase users’
laughter reactions. It is composed of a set of toy robots that
shake their heads and play preregistered laughter sounds when the
system detects the initial user laughter. The evaluation study
showed that the system enhances the users’ laughing activity
(i.e., generates the effect of contagion).
Finally, Becker-Asano et al. [16] studied the impact of
auditory and behavioral signals of laughter in different social
robots. They discovered that the social effect of laughter
depends on the situational context, including the type of task
executed by the robot and the verbal and nonverbal behaviors (other
than laughing) that accompany the laughing act [17]. They also
claim that inter-cultural differences exist in the perception of
naturalness of laughing humanoids [16].
III. SCENARIOS AND STIMULUS FILM
In our evaluation scenario the virtual agent and its laughter
behavior were investigated. The experimental setup involved a
participant watching a funny video with a virtual agent visually
present on a separate screen. The expressive behavior of the
virtual agent was varied among three conditions, systematically altering the degree of expressed appreciation of the clip
(amusement) in verbal and non-verbal behavior, as well as
different degrees of interaction with the participant’s behavior.
The three conditions are:
• “fixed speech”: the agent is verbally expressing amusement at pre-defined times of the video
• “fixed laughter”: the agent is expressing amusement
through laughs at pre-defined times of the video
• “interactive laughter”: the agent is expressing amusement
through laughter, in reaction to both the stimulus video
and the participant’s behavior
Furthermore, participant-related variables were assessed
with self-report instruments and allowed for the investigation
of the influence of mood and personality on the perception and
evaluation of the virtual agent. This allowed for the control of
systematic biases on the evaluation of the virtual agent, which
are independent of its believability (e.g., individuals with a
fear of being laughed at perceive all laughter negatively). The
impact of the agent was assessed by investigating the influence
of the session on participant’s mood, as well as by self-report
questionnaires assessing the perception of the virtual agent and
the participant’s cognitions, beliefs and emotions.
The stimulus film consisted of five candid camera pranks
with a total length of 8 minutes. The clips were chosen
by one expert rater who screened a large number of video
clips (approximately 4 hours) and chose five representative,
culturally unbiased prank sections of approximately 1 to 2
minutes in length. All pranks were soundless and consisted of
incongruity-resolution humor.
IV. DATA USED FOR TRAINING
Several datasets have been used in the project: two
existing databases and two datasets specifically recorded to
develop Laugh Machine. These databases are briefly presented
in this section.
A. The SEMAINE database
The SEMAINE database [18] was collected for the
SEMAINE project by Queen's University Belfast with the technical support of the HCI2 group of Imperial College London.
The corpus includes recordings from users while holding
conversations with an operator who adopts in sequence four
roles designed to evoke emotional reactions. One of the roles
(Poppy), being happy and outgoing, often evokes natural and
spontaneous laughter from the user. The corpus is freely available
for research purposes and offers high audio quality, as well as
frontal and profile video recordings. The latter is important as
15
it allows incorporation of visual features, which is part of the
future work of Laugh Machine.
Within Laugh Machine, the SEMAINE database has been
used to design a framework for laughter recognition (see
Section VI-B) and select the most relevant audio features for
this task.
Even though laughter is included as a class in the transcriptions of the SEMAINE database, the provided laughter annotation
tracks turned out to be too coarse to be used in the Laugh
Machine training process. Hence, 19 sessions (each about 4-7 minutes long), which were found to include a sufficient
number of laughs, were selected and manually corrected.
B. The AVLaughterCycle database
Secondly, we used the AudioVisualLaughterCycle (AVLC)
corpus [19] that contains about 1000 spontaneous audio-visual
laughter episodes with no overlapping speech. The episodes
were recorded with the participation of 24 subjects. Each
subject was recorded watching a 10-minute comedy video.
Thus it is expected that the corpus contains mainly amusement
laughter. Each episode was captured with one motion capture
system (either Optitrack or Zigntrack) and synchronized with
the corresponding audiovisual sample. The material was manually segmented into episodes containing just one laugh. The
number of laughter episodes for a subject ranges from 4 to
82. The annotations also include phonetic transcriptions of the
laughter episodes [20].
Within Laugh Machine, the AVLaughterCycle database has
been used to design the output components (audiovisual laughter synthesis, see Section VIII).
C. Belfast interacting dyads
The first corpus recorded especially for Laugh Machine contains human-human interactions when watching the stimulus
film (see Section III). Two dyads (one female-female, one male-male) were asked to watch the film. The two participants
were placed in two rooms; they watched the same film
simultaneously on two separate LCD displays. They could
also see the other participant's reactions, as a small window
showing the other person was placed on top of the
displayed content. The data contains the close-up view of each
participant's face, 90-degree views (all at 50 fps) of half of
the body, as well as audio tracks obtained from close-talk and
far-field microphones for each participant, sampled at 48 kHz
and stored as 24-bit PCM. Laughs have been segmented from
the recorded signals.
This interaction data has been used to train the dialog
manager component (see Section VII).
D. Augsburg scenario recordings
In order to tune the laughter detection (initially developed
on the SEMAINE database) to the sensors actually used in
Laugh Machine, a dedicated dataset has been recorded.
Since laughter includes respiratory, vocal, facial and
skeletomuscular elements [21], we can expect to capture signs
of laughter if we install sensors to capture the user's voice,
facial expressions, and movements of the upper body. To have
a minimum of sensors we decided to work with only two
devices: the Microsoft Kinect and the Chest Band developed
at University College London (see Section VI-D). The
latest version of the Microsoft Kinect SDK1 not only offers
full 3D body tracking, but also a real-time 3D mesh of facial
features—tracking the head position, location of eyebrows,
shape of the mouth, etc.
TABLE I
RECORDED SIGNALS

Recording device     Captured signal           Description
Microsoft Kinect     Video                     RGB, 30fps, 640x480
                     Face points
                     Facial action units
                     Head pose
                     Skeleton joints
                     Audio                     16 kHz, 16 bit, mono
Respiration Sensor   Thoracic circumference    120Hz, 8 bit
The recorded signals are summarized in Table I. Recordings
took place at the University of Augsburg, using the Social
Signal Interpretation (SSI, see Section VI-A) tool. During the
sessions 10 German and 10 Arabic students were recorded
while watching the stimulus film. By including participants
with different cultural backgrounds, it is our hope to improve
the robustness of the final system. The recordings were then
manually annotated at three levels: 1) beginning and ending
of laughter in the audio track, 2) any non-laughter event in the
audio track, such as speech and other noises, and 3) beginning
and ending of smiles in the video track.
V. SYSTEM ARCHITECTURE
The general system architecture is displayed in Figure 1.
We can distinguish 3 types of components: input components,
decision components and output components. They are respectively explained in Sections VI, VII and VIII.
The input components are responsible for multimodal data
acquisition and real-time laughter-related analysis. They include laughter detection from audiovisual features, body movement analysis (with laughter likelihood), respiration signal
acquisition (also with laughter likelihood) and input laughter
intensity estimation.
The decision components receive the information from the
input components (i.e., laughter likelihoods and intensity from
multimodal features) as well as contextual information (i.e.,
the funniness of the stimulus, see Section IX-C2, in green
on Figure 1) and determine how the virtual agent should
react. There are actually two decision components: the dialog
manager, which decides if and how the agent should laugh
at each time frame (typically 200ms), is followed by a block
called “Laughter Planner”, which decides whether or not the
instruction to laugh should be forwarded to the synthesis
components. In some cases, for example when there is an
ongoing animation, it is indeed preferable not to transmit new
synthesis instructions.
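As a rough illustration of this gating role, the sketch below forwards a laugh request only when no previously triggered animation is still playing. The message fields and timing logic are assumptions for illustration; the actual component works on the ActiveMQ/BML messages described in this section.

```python
# Sketch of the "Laughter Planner" gating logic described above: forward a
# laugh instruction to the synthesis components only when no animation is
# still playing. Field names and timings are illustrative assumptions.
import time

class LaughterPlanner:
    def __init__(self):
        self.busy_until = 0.0          # wall-clock time when the current animation ends

    def plan(self, decision, now=None):
        """decision: dict with 'intensity' in [0, 1] and 'duration' in seconds."""
        now = time.time() if now is None else now
        if decision is None or decision["intensity"] <= 0.0:
            return None                # dialog manager decided not to laugh
        if now < self.busy_until:
            return None                # an animation is still ongoing: drop the request
        self.busy_until = now + decision["duration"]
        return decision                # forwarded to audio/visual synthesis

planner = LaughterPlanner()
print(planner.plan({"intensity": 0.7, "duration": 2.5}))   # forwarded
print(planner.plan({"intensity": 0.9, "duration": 1.0}))   # dropped (still busy)
```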
The output components are responsible for the audiovisual laughter synthesis that is displayed when the decision
components instruct them to do so. In the current state of these
components, it is not possible to interrupt a laughter animation
(e.g., to decide abruptly to stop laughing or on the other hand
to laugh more intensely before the current output laughter
is finished). This is the reason why the “Laughter Planner”
module has been added. The Laugh Machine project includes
one component for audio synthesis and 2 different animation
Realizers, Greta and LivingActor (see Section VIII).
All the components have to work in real-time. Thus, the
organization of the communication between the different components is crucial in such a project. For this purpose we use the
SEMAINE2 architecture, which was originally aimed at building a
Sensitive Artificial Listener (SAL) agent. The SEMAINE API is a distributed multi-platform component integration framework for
real-time interactive systems. The architecture of SEMAINE
API uses a message-oriented middleware (MoM) in order to
integrate several components, where the actual processing of the
system is defined. Such components communicate via a set
of topics. Here, a topic is a virtual channel where each and
every published message, addressed to that topic, is delivered
to its subscribed consumers. The communication passes via the
message-oriented middleware ActiveMQ™ [22], which supports multiple operating systems and programming languages.
For component integration, the SEMAINE API encapsulates
the communication layer in terms of components that receive
and send messages, and a system manager that verifies the
overall system state and provides a centralized clock independent
of the individual system clocks.
To integrate the Laugh Machine components we used the same
message exchange server (i.e., ActiveMQ) and the SEMAINE
API. Each Laugh Machine component can read and write
to some specific ActiveMQ topics. For this purpose we defined a hierarchy of message topics and, for each topic, the
appropriate message format. Simple data (such as input data
or clock signals) were coded in simple text messages as
string/value tuples, so-called MapMessages; e.g., the message
AUDIO LAUGHTER DETECTION 1 is sent whenever
laughter is detected from the audio channel. On the other
hand more complex information such as the description of
the behavior to be displayed was coded in standard XML
languages such as the Behavior Markup Language3 (BML).
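As an illustration of this message-based integration, the sketch below publishes a simple laughter-detection event to an ActiveMQ topic from Python over STOMP. It assumes the broker's STOMP connector is enabled (default port 61613) and uses the stomp.py client; the topic name and payload are illustrative, not the project's actual topic hierarchy (the real components use the SEMAINE API and MapMessages).

```python
# Sketch: publishing a simple laughter-detection event to an ActiveMQ topic.
# Assumes the broker exposes its STOMP connector (default port 61613) and uses
# the stomp.py client; the topic name and key/value payload are illustrative,
# not the actual Laugh Machine topic hierarchy.
import stomp

conn = stomp.Connection([("localhost", 61613)])
conn.connect(wait=True)

# MapMessage-like string/value tuple: "laughter detected from the audio channel"
conn.send(destination="/topic/laughmachine.input.audio.laughter",
          body="AUDIO_LAUGHTER_DETECTION 1",
          headers={"content-type": "text/plain"})
conn.disconnect()
```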
It should be noted that, since the available data to train
the decision components (i.e., the Belfast dyads data) did not
contain Kinect or respiration signals, the decision modules
currently use only the acoustic laughter detection and acoustic
laughter intensity. The other input components (in yellow on
Figure 1) are nevertheless integrated in the system architecture
and their data is recorded in order to train the decision modules
with these additional signals in the future.
VI. INPUT COMPONENTS
To work properly, our system must be able to capture
sufficient information about the user, coming from different
1 http://www.microsoft.com/en-us/kinectforwindows/
2 http://www.semaine-project.eu/
3 http://www.mindmakers.org/projects/bml-1-0/wiki/Wiki?version=10
Fig. 1. Overall architecture of Laugh Machine.
modalities such as sound, visual tracking, and chest movement. To facilitate the multimodal data processing and the
synchronisation between the different signals, we have used
the Social Signal Interpretation (SSI) [23] software developed
at the University of Augsburg. This software will be presented
first in this section, then we will present the different analysis
components that have been developed: audiovisual laughter
detection, laughter intensity estimation, respiration signal acquisition and body movement analysis. All these components
have been plugged directly into SSI, except the body motion
analysis, due to a problem of sharing the Kinect data in real time.
A. SSI
The desired recognition component has to be equipped with
certain sensors to capture multimodal signals. First, the raw
sensor data is collected, synchronized and buffered for further
processing. Then the individual streams are filtered, e.g. to
remove noise, and transformed into a compact representation
by extracting a set of feature values from the time- and
frequency space. The signal parameterized in this way can be
classified by either comparing it to some threshold or applying
a more sophisticated classification scheme. The latter usually
requires a training phase where the classifier is tuned using
pre-annotated sample data. The collection of training data is
thus another task of the recognition component. Often, an
activity detection is required in the first place in order to
identify interesting segments, which are subject to a deeper
analysis. Finally, a meaningful interpretation of the detected
events is only possible against the background of past events
and events from other modalities. For instance, detecting
several laughter events within a short time frame increases
the probability that the user is in fact laughing. On the other hand, if we detect that the user is talking right now, we would decrease the confidence for a detected smile. The different tasks the recognition component is involved with are visualized in Figure 2.
Fig. 2. Scheme of the laughter recognition component implemented with
the Social Signal Interpretation (SSI) framework. Its central part consists of
a recognition pipeline that processes the raw sensory input in real-time. If an
interesting event is detected it is classified and fused with previous events and
those of other modalities. The final decision can be shared through the network
with external components. In order to train the recognition components a
logging mechanism is incorporated in order to capture processed signals and
add manual annotation. In an offline learning step the recognition components
can now be tuned to improve accuracy.
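The following toy sketch illustrates the kind of temporal and cross-modal fusion described above: recent audio laughter events raise the overall confidence, while detected speech lowers the weight given to a smile. The weights, window length and boosting rule are assumptions, not SSI's actual vector-based fusion.

```python
# Toy sketch of the event-fusion idea described above: recent laughter events
# raise the overall laughter confidence, while detected speech lowers the
# confidence assigned to a smile. Weights and window length are assumptions.
from collections import deque

class EventFusion:
    def __init__(self, window=3.0):
        self.window = window                     # seconds of history to keep
        self.events = deque()                    # (timestamp, modality, confidence)

    def add(self, t, modality, confidence):
        self.events.append((t, modality, confidence))
        while self.events and t - self.events[0][0] > self.window:
            self.events.popleft()

    def laughter_confidence(self, t):
        recent = [e for e in self.events if t - e[0] <= self.window]
        audio = [c for (_, m, c) in recent if m == "audio_laughter"]
        smile = [c for (_, m, c) in recent if m == "smile"]
        speaking = any(m == "speech" for (_, m, _) in recent)
        score = 0.6 * (sum(audio) / max(len(audio), 1)) + 0.4 * (sum(smile) / max(len(smile), 1))
        if speaking:                             # a talking user makes a "smile" less informative
            score *= 0.5
        return min(score * (1 + 0.1 * len(audio)), 1.0)   # several events in a row boost the score

fusion = EventFusion()
fusion.add(0.2, "audio_laughter", 0.8)
fusion.add(0.5, "smile", 0.9)
print(fusion.laughter_confidence(1.0))
```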
The Social Signal Interpretation (SSI) software [23], developed at Augsburg University, suits all the mentioned tasks and
was therefore used as a general framework to implement the
recognition component. SSI provides wrappers for a large
range of commercial sensors, such as web/dv cameras and
multi-channel ASIO audio devices, as well as the Nintendo Wii
remote control, Microsoft Kinect and various physiological
sensors like NeXus, ProComp, AliveHeartMonitor, IOM or
Emotiv. A patch-based architecture allows a developer to
quickly construct pipelines to simultaneously manipulate the
ENTERFACE’12 SUMMER WORKSHOP - FINAL REPORT; PROJECT P2 : LAUGH MACHINE
raw signals captured by multiple devices, where the length
of the processing window can be adjusted for each modality
individually. Many common filter algorithms, such as moving
and sliding average, Butterworth, Chebyshev, Elliptic, etc., as
well as derivative and integral filters, are part of the core
system and can be easily combined with a range of low-level features, such as Fourier coefficients, intensity, cepstra,
spectrogram, or pitch, as well as more than 100 functionals,
such as crossings, extremes, moments, regression, percentiles,
etc. In addition, a plug-in system encourages developers to
extend the core functions with whatever algorithm is required.
A peak detection component is included, too, which can be
applied to any continuous signal in order to detect segments
above a certain activity level. If an event is detected it can
be classified using one of various classification models such
as K-Nearest Neighbor (KNN), Linear Discriminant Analysis
(LDA), Support Vector Machines (SVM) or Hidden Markov
Models (HMM). Tools for training and evaluation are available
and can be combined with several feature selection algorithms
(e.g., SFS) and over-sampling techniques (e.g., SMOTE [24])
for boosting underrepresented classes. Finally, classified events
can be fused over time using vector-based event fusion. SSI
offers an XML interface to put the different components into
a single pipeline and keep control of important parameters.
In the Laugh Machine project, SSI was used for body and
face tracking as well as audio and respiration recording. To
have access to the new features provided in the latest Microsoft
Kinect SDK, the Kinect wrapper in SSI was revised and
updated accordingly. To access the stretch values measured
by the respiration sensor, a new sensor wrapper was written
using a serial connection.
After finishing the integration of the sensor devices, a
recording pipeline was set up to record a training corpus for
tuning the final recognition system (the Augsburg scenario
recordings presented in Section IV-D). The pipeline also
includes a playback component that allows replay of a video
file to the user in order to induce laughter. This feature was
used to drive the stimulus video directly from SSI. Since the
video playback is then synchronized with the recorded signals,
it is possible to relate captured laughter bouts to a certain
stimulus in the video. The same pipeline was later used in
our experiments. It is illustrated in Figure 3, which presents
the Laugh Machine architecture from the point of view of
SSI. The following sections present components that have
been integrated into SSI: laughter detection, laughter intensity
estimation and respiration signal acquisition.
B. Laughter detection
Starting from the literature, one can find several studies
dealing with the detection of laughter from speech (e.g.,
[1]–[3], see Section II-A). Most of them are purely offline
studies, and the explored feature types and classification
methods vary largely. This circumstance makes it difficult to
decide from scratch which feature set and classifier would be
the best choice for an online laughter detector. Hence, it was
decided to run a fair comparison of the suggested methods in
Fig. 3. SSI roles in the Laugh Machine system. While the user is watching
funny video clips, his or her non-verbal behavior is analyzed by a recognition
component developed with SSI. If a laughter event is detected, this information
is shared with the behavior model, which controls the avatar engine. According
to the input the avatar is now able to respond in an appropriate way, e.g., join
the user’s laughter bout.
a large scale study. To this end, the SEMAINE database has
been used and annotations of 19 files containing laughter were
manually edited (as explained in Section IV-A).
Based on the edited annotations, for each second (a segment length
commonly found in the literature) it was decided whether the
segment includes only silence (1906 samples), pure speech
(5328), pure laughter (370), or both, speech and laughter (261).
Samples were then equally distributed in a training and test
set, while it was ensured that samples of the same user would
not occur in both sets. To have an equal number of samples
for each class, underrepresented classes were oversampled
in the training set using SMOTE. After some preliminary
tests it was decided to leave out silence, as it can be easily
distinguished from speech and laughter using activity detection. It
was also decided to leave out samples including both speech
and laughter, as the goal of the experiment was to find features
that best discriminate the two classes.
After setting up the database, large parts of the openSMILE
(Speech & Music Interpretation by Large-space Extraction)
feature extraction toolkit developed at the Technical University
Munich (TUM) [25] were integrated into SSI. OpenSMILE
is an open source state-of-the-art implementation of common
audio features for speech and music processing. An important
feature is its capability of on-line incremental processing,
which makes it possible to run even complex and time-consuming algorithms, such as pitch extraction, in real-time.
Based on the findings of earlier studies, the following speech-related low-level features were selected as most promising candidates: Intensity, MFCCs, Pitch, PLPs. On these, the following
11 groups of functionals were tested: Zero-Crossings, DCT
(Discrete Cosine Transform) Coefficients, Segments, Times,
Extremes, Means, Onsets, Peaks, Percentiles, Linear and
Quadratic Regression, and Moments. Regarding classification,
four well known methods were chosen: Naive Bayes (NB),
Gaussian Mixture Models (GMM), Hidden Markov Models
(HMM) and Support Vector Machines (SVM). Finally, the
frame size at which low-level features are extracted was also
altered.
ENTERFACE’12 SUMMER WORKSHOP - FINAL REPORT; PROJECT P2 : LAUGH MACHINE
A large scale experiment was then conducted. First, each
of the 11 groups of functionals was tested independently with
each of the four low-level feature types. In the case of MFCCs,
the number of coefficients was also altered and higher-order
derivatives (up to 4) were added. Results suggest that the most
reliable performance is achieved using Intensity and MFCCs, while
adding Pitch and PLP features did not improve results on the
studied corpus. Among the functionals, Regression, Moments,
Peaks, Crossings, Means and Segments are considered to carry
most distinctive information. Regarding classification, SVM
with a linear kernel clearly outperformed all other tested
recognition methods. In terms of frame size, accuracy was
highest at a frame size of 10 ms with 2/3 overlap. In the best
case an overall accuracy of 88.2% at an unweighted average
recall of 91.2% was obtained.
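A minimal sketch of the selected configuration is given below: intensity and MFCC low-level descriptors summarised by a few functionals over 1-second segments and classified with a linear-kernel SVM. librosa and scikit-learn stand in for openSMILE and the SSI/SVM implementation actually used, and the functionals and parameters are illustrative.

```python
# Sketch of the selected configuration: MFCC + intensity low-level descriptors
# summarised by simple functionals over 1-second segments and fed to a
# linear-kernel SVM. librosa stands in for openSMILE; parameters illustrative.
import numpy as np
import librosa
from sklearn.svm import SVC

def segment_features(y, sr):
    """Functionals (mean, std, min, max) of MFCC and RMS energy for one segment."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, hop_length=int(0.010 * sr))
    rms = librosa.feature.rms(y=y, hop_length=int(0.010 * sr))
    llds = np.vstack([mfcc, rms])                          # low-level descriptors
    return np.concatenate([llds.mean(axis=1), llds.std(axis=1),
                           llds.min(axis=1), llds.max(axis=1)])

def train(segments, labels, sr=16000):
    """segments: list of 1-second waveforms; labels: 1 = laughter, 0 = speech."""
    X = np.array([segment_features(y, sr) for y in segments])
    clf = SVC(kernel="linear")
    return clf.fit(X, labels)

# usage (with real, annotated 1-second audio segments):
# clf = train(train_segments, train_labels)
# clf.predict([segment_features(new_segment, 16000)])
```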
The developed laughter detection framework was then tuned
to the specific Laugh Machine scenario and input components
(i.e., the audio is recorded by the Kinect), thanks to the
Augsburg scenario recordings (see Section IV-D). The annotations of the audio tracks were used to re-train the laughter
detector described above, with the features extracted in the
Laugh Machine scenario conditions. The obtained laughter
model was finally combined with a silence detection to filter out
silent frames in the first place and classify all remaining
frames into laughter or noise. The frame size was set to 1
second with an overlap of 0.8 seconds, i.e., a new classification
is received every 0.2 seconds. The annotations of the video
tracks are meant for training an additional smile detector in
the future. The same holds for the respiration signal (see Section
VI-D), which will in the future serve as a third input channel to
the laughter detector.
C. Laughter intensity
Knowing the intensity of incoming laughs is important for determining the appropriate behavior of the virtual
agent.
In [26], naive participants have been asked to rate the
intensity of laughs from the AVLaughterCycle database [19]
on a scale from 1 (very low intensity) to 5 (very high intensity).
One global intensity value had to be assigned to each laugh.
Audiovisual features that correlate with this perceived global
intensity have then been investigated.
Here, we wanted not only to estimate the global laughter
intensity after the laugh has finished, but also to measure the instantaneous intensity in real time. As a first step, only the audio
modality was included. 49 acoustic laughs, produced by 3
subjects of the AVLaughterCycle database and distributed over
the ranges of annotated global intensity values, have been
continuously annotated in intensity by one labeler. Acoustic
features have been extracted with the objective of predicting the
continuous intensity curves.
Figure 4 displays the manual intensity curve for one laugh,
together with the automatic intensity prediction obtained from
two acoustic features: loudness and pitch. The intensity curve
is obtained by a linear combination between the maximum
pitch and the maximum loudness values over a sliding 200ms
window, followed by median filtering to smooth the curve. The
overall trend is followed, even though there are differences,
mostly at the edge of the manually spotted bursts, and the
manual curve is smoother than the automatic one. Furthermore, the overall laughter intensity can be extracted from the
continuous annotation curve: correlation coefficients between
the median intensity scored by users and the intensity predicted
from acoustic features are over 0.7 for 21 out of 23 subjects4 .
Fig. 4. Example of laughter continuous intensity curve. Top: waveform;
Bottom: manual and automatic intensity curves.
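A minimal sketch of this estimate is shown below: loudness (RMS energy) and pitch are extracted over 200 ms frames, normalised, linearly combined and median-filtered. The combination weights are assumptions, since the report does not give the coefficients of the linear combination.

```python
# Sketch of the continuous intensity estimate described above: a linear
# combination of loudness and pitch over sliding 200 ms windows, smoothed
# with a median filter. The combination weights (a, b) are assumptions.
import numpy as np
import librosa
from scipy.signal import medfilt

def intensity_curve(y, sr, a=0.6, b=0.4, win=0.2, hop=0.05):
    hop_length = int(hop * sr)
    rms = librosa.feature.rms(y=y, frame_length=int(win * sr), hop_length=hop_length)[0]
    f0 = librosa.yin(y, fmin=80, fmax=500, sr=sr,
                     frame_length=int(win * sr), hop_length=hop_length)
    f0 = np.nan_to_num(f0)
    # normalise both cues to [0, 1] before combining them
    loud = rms / (rms.max() + 1e-9)
    pitch = f0 / (f0.max() + 1e-9)
    curve = a * loud + b * pitch
    return medfilt(curve, kernel_size=5)      # median filtering smooths the curve

# usage:
# y, sr = librosa.load("laugh.wav", sr=16000)
# print(intensity_curve(y, sr))
```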
During the eNTERFACE workshop, work has been done
to improve the computation of the continuous intensity curve.
Indeed, the linear combination is able to capture trends for
one subject (which laugh or laugh segment is more intense
than another one), but the outputted values fall in different
ranges from one subject to another. Classification with Weka
[27] has been investigated to overcome this problem. First,
neural networks have been trained in Weka to predict the
continuous intensity curve from acoustic features (MFCCs and
spectral flatness). The correlation with the manually annotated
curves was over 0.8, using a “leave-one-subject-out” scheme
for testing. Second, other neural networks have been used
to compute the global laughter intensity from the predicted
continuous intensity. To keep the number of features constants,
5 functionals (max, std, range, mean, sum) of the continuous
intensity have been used as inputs. The results again show a
good correlation between the predicted global intensity and
the one rated by naive participants, in this case with similar
values for all the subjects of the AVLaughterCycle database.
However, the speaker-independent intensity detection with
Weka could not yet be integrated into the full Laugh Machine
system. Only the linear combination has been used in
our experiments. Further work to improve laughter intensity
prediction includes the extension of the feature set to visual
features, the integration of the Weka classification within the
Laugh Machine framework and possibly the adaptation of the
functions to the user.
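The two-stage scheme can be sketched as follows, with scikit-learn regressors standing in for the Weka neural networks: one network maps MFCC and spectral-flatness frames to the continuous intensity curve, and a second maps the five functionals of that curve to the global intensity. Network sizes and the (commented-out) training calls are illustrative.

```python
# Sketch of the two-stage scheme described above, with scikit-learn MLPs
# standing in for the Weka neural networks: (1) predict the continuous
# intensity curve from MFCC + spectral-flatness frames, (2) predict the global
# intensity from five functionals of that curve. Settings are illustrative.
import numpy as np
from sklearn.neural_network import MLPRegressor

def functionals(curve):
    """The five functionals used as inputs of the second network."""
    return np.array([curve.max(), curve.std(), curve.max() - curve.min(),
                     curve.mean(), curve.sum()])

# stage 1: frame-level features (MFCCs + spectral flatness) -> continuous intensity
frame_net = MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000)
# frame_net.fit(frame_features, continuous_intensity_labels)

# stage 2: functionals of the predicted curve -> global intensity (1-5 scale)
global_net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=1000)
# global_net.fit(np.vstack([functionals(c) for c in predicted_curves]), global_labels)

# usage at run time:
# curve = frame_net.predict(frame_features_of_new_laugh)
# global_intensity = global_net.predict(functionals(curve).reshape(1, -1))
```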
D. Respiration
The production of audible laughter is, in essence, a respiratory act since it requires the exhalation of air to produce
distinctive laughter sounds (“Ha”) or less obvious sigh- or
hiss-like verbalizations. The respiratory patterns of laughter
have been extensively researched as Ruch & Ekman [21]
summarize. A distinctive respiration pattern has emerged of
4 The 24th subject of the AVLC corpus only laughed 4 times and all these
laughs were rated with the same global intensity, which prevents us from
computing correlations for this subject.
a rapid exhalation followed by a period of smaller exhalations
at close-to-minimum lung volume. This pattern is reflected by
changes in the volume of the thoracic and abdominal cavities,
which rapidly decrease to reach a minimum value within
approximately 1 s [28]. These volumetric changes can be
seen through the simpler measure of thoracic circumference,
noted almost a century ago by Feleky [29]. In order to
capture these changes, we constructed a respiration sensor
based on the design of commercially available sensors: the
active component is a length of extensible conductive fabric
within an otherwise inextensible band that is fitted around
the upper thorax. Expansions and contraction of the thorax
change the length of the conductive fabric causing changes in
its resistance. These changes in resistance are used to modulate
an output voltage that is monitored by the Arduino prototyping
platform5 . Custom-written code on the Arduino converts the
voltage to a 1-byte serial signal, linear with respect to actual
circumference, which is passed to a PC over a USB connection
at a rate of approximately 120Hz.
Automatic detection of laughter from respiratory actions has
previously been investigated using electromyography (EMG).
Fukushima et al. analyzed the frequency characteristics of
diaphragmatic muscle activity to distinguish laughter, which
contained a large high-frequency component, from rest, which
contained mostly low-frequency components [15]. We exploited the predictable respiration pattern of laughter to use
simpler techniques that do not rely on computationally demanding frequency decomposition. We identified laughter onset through the appearance of 3 respiration events (see Figure
5):
1) A sharp change in current respiration state (inhalation,
pause, standard exhalation) to rapid exhalation.
2) A period of rapid exhalation resulting in rapid decrease
in lung volume.
3) A period of very low lung volume
Fig. 5. Example of thoracic circumference, with laughter episode marked
in red, and notable features of laughter initiation. Feature 1 - a sharp change
in current respiration state to rapid exhalation; feature 2 - a period of rapid
exhalation; feature 3 - a period of very low lung volume.
These appear as distinctive events in the thoracic circumference measure and its derivatives:
1) A negative spike in the second derivative of thoracic
circumference.
5 http://www.arduino.cc/
2) A negative period in the first derivative of thoracic
circumference.
3) A period of very low thoracic circumference.
These were identified by calculating a running mean (λf)
and standard deviation (σf) for each measure. A running
threshold (Tf) for each measure was calculated as:

Tf = λf − αf σf    (1)
where αf is a coefficient for that measure, empirically
determined to optimise the sensitivity/specificity trade-off.
Each feature was determined to be present if the value of
the measure fell below the threshold at that sample. Laughter
onset was identified by the presence of all three features in
the relevant order (1 before 2 before 3) in a sliding window
of approximately 1 s. This approach restricts the number of
parameters to 3 (α1–3) but does introduce the lag necessary for
calculating valid derivatives from potentially noisy data. It
also requires a period for the running means and standard
deviations, and so the running thresholds, to stabilise. This
process would be jeopardised by the presence of large, rapid
respiratory events such as coughs and sneezes. We were unable
to integrate these rules into the Laugh Machine system due to
technical errors. Future recordings on the Laugh Machine platform, incorporating the respiration data, will allow optimisation of these rules and the fusion of respiration data with other
modalities for real-time laughter/non-laughter discrimination.
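A minimal sketch of these rules is given below: running mean/standard-deviation thresholds as in Eq. (1) are applied to the circumference signal and its first two derivatives, and an onset is declared when the three features occur in the required order within a window of about 1 s. The α coefficients, the use of expanding running statistics and the window length are assumptions.

```python
# Sketch of the respiration-based onset rules: running mean/std thresholds
# (Eq. 1) on the thoracic circumference and its first two derivatives, and an
# onset declared when features 1, 2, 3 occur in order within ~1 s. The alpha
# values, the expanding running statistics and the window length are assumed.
import numpy as np

def running_threshold(x, alpha):
    n = np.arange(1, len(x) + 1)
    mean = np.cumsum(x) / n                                   # running mean (lambda_f)
    var = np.cumsum(x ** 2) / n - mean ** 2
    std = np.sqrt(np.maximum(var, 0.0))                       # running std (sigma_f)
    return mean - alpha * std                                 # T_f = lambda_f - alpha_f * sigma_f

def detect_laughter_onsets(circ, fs=120, alphas=(2.0, 1.5, 1.5), window_s=1.0):
    circ = np.asarray(circ, dtype=float)
    d1 = np.gradient(circ)                                    # first derivative
    d2 = np.gradient(d1)                                      # second derivative
    f1 = d2 < running_threshold(d2, alphas[0])                # 1) negative spike in d2
    f2 = d1 < running_threshold(d1, alphas[1])                # 2) negative period in d1
    f3 = circ < running_threshold(circ, alphas[2])            # 3) very low circumference
    w = int(window_s * fs)
    onsets = []
    for t in range(w, len(circ)):
        i1 = np.flatnonzero(f1[t - w:t])
        if not i1.size:
            continue
        i2 = np.flatnonzero(f2[t - w:t])
        i2 = i2[i2 > i1[0]]                                   # feature 2 after feature 1
        if not i2.size:
            continue
        i3 = np.flatnonzero(f3[t - w:t])
        if (i3 > i2[0]).any():                                # feature 3 after feature 2
            onsets.append(t)
    return onsets

# usage: onsets = detect_laughter_onsets(circumference_samples)
```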
E. Body analysis
The EyesWeb XMI platform is a modular system that
allows both expert (e.g., researchers in computer engineering) and non-expert users (e.g., artists) to create multimodal
installations in a visual way [30]. The platform provides
modules, called blocks, that can be assembled intuitively (i.e.,
by operating only with the mouse) to create programs, called
patches, that exploit the system's resources such as multimodal
files, webcams, sound cards, multiple displays and so on. The
body analysis input component consists of an EyesWeb XMI
patch performing analysis of the user’s body movements in
real time. The computation performed by the patch can be split
into a sequence of distinct steps, described in the following
subsections.
1) Shoulder tracking: The task of the body analysis module
is to track the user’s shoulders and perform some computation
on the variation of their position in real time. In order to do
that, we could provide the Kinect shoulder data extracted
by SSI (see Section VI-B) as input to our component. However, we observed that the shoulder positions extracted by the
Kinect do not consistently follow the user's real shoulder
movements: in the Kinect skeleton, the shoulders' position is
determined by performing a statistical algorithm on the
user's silhouette and depth map, and usually this computation
cannot track subtle shoulder movements, for example, small
upward/downward movements. To overcome this limitation
we fixed two markers on the user’s body: two small and
lightweight green polystyrene spheres have been fixed on the
user’s clothes just over the user’s shoulders. The EyesWeb
patch separates the green channel of the input video signal
ENTERFACE’12 SUMMER WORKSHOP - FINAL REPORT; PROJECT P2 : LAUGH MACHINE
to isolate the position on the video frame of the two spheres.
Then a tracking algorithm is performed to follow the motion
of the sphere frame by frame, as shown in Figure 6.
Fig. 6. Two green spheres placed on the user's shoulders are tracked in
real time (red and blue trajectories).
The position of each user's shoulder is associated with the
barycenter of the corresponding sphere, which can be computed in two
ways. The first consists in computing the graphical
barycenter of each sphere, that is, the mean of the pixels
of each sphere's silhouette. The second option
includes some additional steps: after computing the barycenter
as in the first case, we consider a square region around it and
we apply the Lucas-Kanade [31] algorithm to this area. The
result is a set of 3 points of which we compute the mean: the
resulting point is taken as the position of the shoulder.
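A rough OpenCV sketch of the marker-based tracking is given below. It uses colour segmentation in HSV space rather than the green-channel separation of the EyesWeb patch, and returns the barycenters of the two largest green blobs; the threshold values are assumptions.

```python
# Sketch of the marker-based shoulder tracking described above: segment the
# green spheres in each frame and use the barycenter of each segmented blob as
# the shoulder position. The green threshold values are assumptions.
import cv2
import numpy as np

GREEN_LO = np.array([40, 80, 80])     # HSV lower bound (assumed)
GREEN_HI = np.array([80, 255, 255])   # HSV upper bound (assumed)

def shoulder_positions(frame_bgr):
    """Return the barycenters (x, y) of the two largest green blobs."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, GREEN_LO, GREEN_HI)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours = sorted(contours, key=cv2.contourArea, reverse=True)[:2]
    centers = []
    for c in contours:
        m = cv2.moments(c)
        if m["m00"] > 0:
            centers.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))  # blob barycenter
    return centers   # e.g. [(left_x, left_y), (right_x, right_y)]

# usage with a webcam:
# cap = cv2.VideoCapture(0)
# ok, frame = cap.read()
# print(shoulder_positions(frame))
```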
2) Correlation: Correlation ρ is computed as the Pearson
correlation coefficient between the vertical position of the
user’s left shoulder and the vertical position of the user’s right
shoulder. Vertical positions are approximated by the y coordinate of each shoulder’s barycenter extracted as mentioned
above.
3) Kinetic energy: It is computed from the speed of the user's
shoulders and their percentage mass, as reported in [32]:

E = 1/2 (m1 v1² + m2 v2²)
4) Periodicity: Kinetic energy is serialized in a sliding
window time-series having a fixed length. Periodicity is then
computed on such time-series, using Periodicity Transforms
[33]. The input data is decomposed into a sum of its periodic components by projecting data onto periodic subspaces.
Periodicity Transforms also provide a measure of the relative
contribution of each periodic signal to the original one. Among
many algorithms for computing Periodicity Transforms, we
chose mbest. It determines the m periodic components that,
subtracted from the original signal, minimize residual energy.
Compared with the other algorithms, it also provides better
accuracy and does not need the definition of a threshold.
5) Body Laughter Index: Body Laughter Index (BLI) stems
from the combination of the averages of shoulders’ correlation
and kinetic energy, integrated with the Periodicity Index. Such
averages are computed over a fixed range of frames; however,
such a range could also be automatically determined by applying
a motion segmentation algorithm to the video source. A
weighted sum of the mean correlation of shoulders’ movement
and of the mean kinetic energy is carried out as follows:
BLI = αρ̄ + β Ē
As reported in [21], rhythmical patterns produced during
laughter usually have frequencies around 5 Hz. In order to take
into account such rhythmical patterns, the Periodicity Index is
used. In particular, the computed BLI value is acknowledged
only if the mean Periodicity Index belongs to the arbitrary
range [fps/8, fps/2], where fps is the input video frame rate
(number of frames per second).
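The sketch below puts these steps together: Pearson correlation of the two shoulders' vertical positions, mean kinetic energy from their speeds, and a weighted sum gated by a periodicity check. The weights, the mass values, and the FFT-based stand-in for the Periodicity Transform are assumptions.

```python
# Sketch of the Body Laughter Index: correlation of the shoulders' vertical
# positions, kinetic energy from their speeds, a weighted sum gated by a crude
# periodicity check. Weights, masses and the FFT-based periodicity test are
# assumptions (the EyesWeb patch uses Periodicity Transforms / mbest).
import numpy as np

def bli(left_y, right_y, fps=25.0, alpha=0.5, beta=0.5, m1=0.025, m2=0.025):
    left_y = np.asarray(left_y, dtype=float)
    right_y = np.asarray(right_y, dtype=float)
    corr = np.corrcoef(left_y, right_y)[0, 1]                  # shoulder correlation
    v1 = np.abs(np.diff(left_y)) * fps                         # vertical speeds
    v2 = np.abs(np.diff(right_y)) * fps
    energy = 0.5 * (m1 * v1 ** 2 + m2 * v2 ** 2)               # per-frame kinetic energy
    # crude periodicity check: dominant period (in frames) of the energy series
    spectrum = np.abs(np.fft.rfft(energy - energy.mean()))
    freqs = np.fft.rfftfreq(len(energy), d=1.0)                # cycles per frame
    k = int(np.argmax(spectrum[1:])) + 1                       # skip the DC bin
    period = 1.0 / freqs[k]                                    # dominant period in frames
    if not (fps / 8.0 <= period <= fps / 2.0):                 # outside [fps/8, fps/2]
        return 0.0                                             # BLI not acknowledged
    return alpha * corr + beta * float(energy.mean())

# usage: print(bli(left_y_positions, right_y_positions, fps=25))
```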
6) ActiveMQ: The EyesWeb XMI platform can be extended with new functionalities, packaged into new sets of programming modules (blocks). To
allow the communication between the body analysis patch and
the other components (e.g., the SSI audio and face analysis
component) we implemented two new blocks: the ActiveMQ
receiver and the ActiveMQ sender. The body analysis component
sends two types of data using the ActiveMQ message format
described in Section V: data messages and clock messages.
Data messages contain tuples representing the values of the
user’s shoulders movement features presented above. Clock
messages contain the system clock of the machine on which
the EyesWeb XMI platform is running. They are sent to
the ActiveMQ server on which all the other components are
registered. So, the local clock of all the components (audio
and face analysis, dialogue generation and so on) is constantly
updated with the same value and synchronization between
the different components can be assured. In the future we
aim to exploit the synchronization features embedded in the
SEMAINE platform, which is implemented as a layer on top of
the ActiveMQ communication protocol.
VII. DIALOG MANAGER
The laughter-enabled dialogue management module aims at
deciding, given the information from the input components
(i.e., laughter likelihoods and intensity from multimodal features) as well as contextual information (i.e., the funniness
of the stimulus), when and how to laugh so as to generate
a natural interaction with human users. To this end, the
dialogue management task is seen as a sequential decision-making process, meaning that the behavior is not only influenced by the current context but also by the history of the
dialogue. This is a main difference compared with the other
interactive systems such as SEMAINE. The optimal sequence
of decisions is learned from actual human-human or human-computer interaction data and is not rule-based or handcrafted,
which is another difference with the SEMAINE system.
The decision of whether and how to laugh must be taken at
each time frame. For the eNTERFACE workshop a time frame
lasts ∆t = 200 ms. The input I received by the Dialog manager
at each time frame is a vector (I ∈ [0, 1]^k: each feature has
been normalized) where k is the number of chosen multimodal
features. The output O produced at each time frame is a vector
(O ∈ [0, 1] × [0, time_max]) where the first dimension codes the
laughter intensity and the second dimension codes the duration
of the laugh.
The method used to build the decision rule during the
eNTERFACE workshop is a supervised learning method. A
supervised learning method is able, via a training data set D = {(x_i, y_i)}_{1≤i≤J} (where {x_i}_{1≤i≤J} are the inputs, which belong to the set X, {y_i}_{1≤i≤J} are the labels, which belong to the set Y, and J ∈ N*), to build a decision rule π. The decision rule π is a function from X to Y that generalizes the relation between the inputs x_i and the labels y_i of the training data set. There are two types of supervised methods: classification, when the set of possible outputs is finite, and regression, when it is infinite.
A. Training of the Dialog manager
To apply a supervised learning method to our dialog manager, we need a training data set specific to our scenario (see Section III). The Belfast interaction dyads (see Section IV-C) were recorded for this purpose. Let us name the two interacting participants P1 and P2, respectively recorded on tracks T1 and T2. We recall that the participants watch the same stimulus video simultaneously, and can also see (and hear) each other on the display screen: P2 is viewable by P1 and is considered as playing the role of the virtual agent. The length of a recording is H = K∆t. Thus, on T1 we have the inputs (i.e., laughter likelihoods and intensity from multimodal features of P1) {I_i}_{1≤i≤K} of the Dialog manager and on T2 the corresponding outputs {O_i}_{1≤i≤K}, which are the intensities and durations of the laughs of P2.
The aim of the supervised method is to find a decision rule
such that the virtual agent will be able to imitate P 2. Before
applying a supervised method, we decided to cluster the inputs into N clusters (via a k-means method) and to cluster the outputs into M clusters (via a Gaussian Mixture Model, or GMM, method). k-means clustering is a method of cluster analysis which aims to partition n ∈ N* observations into 0 ≤ k ≤ n clusters in which each observation belongs to the cluster with the nearest mean. This results in a partitioning of the data space into Voronoi cells. GMM clustering is a method of cluster analysis where each cluster is parameterized by a Gaussian distribution. The choice of the GMM method for the output clustering is explained in Section VII-B.
Thanks to clustering, the input data becomes the input clustered data {I_i^C}_{1≤i≤K} with I_i^C ∈ {1, . . . , N} and the output data becomes the output clustered data {O_i^C}_{1≤i≤K} with O_i^C ∈ {1, . . . , M}. Clustering the inputs allows us to have a finite decision rule, which means that the decision rule can be represented by a finite vector. Clustering the outputs allows using a classification method such as the k-nearest neighbors (k-nn) instead of a regression method, which would be more difficult to implement.
Finally, the supervised method used on the clustered data {I_i^C, O_i^C}_{1≤i≤K} is a k-nearest-neighbor method, which gives us the decision rule π, a function from {1, . . . , N} to {1, . . . , M}. k-nn is a method for classifying objects based on the closest training examples: the object is assigned to the most common label amongst its k nearest neighbors. Figure 7 represents the training phase of the Dialog manager needed to obtain the decision rule π.
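A minimal sketch of this training phase with scikit-learn, assuming the synchronized features of P1 and the laugh intensities/durations of P2 are already available as arrays (the array names, the cluster counts N and M and the number of neighbors are illustrative, not values taken from the report):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder training data: K frames of normalized multimodal features
    # of P1 ({I_i}) and the corresponding (intensity, duration) outputs of P2 ({O_i}).
    K, k_features = 5000, 6
    X = np.random.rand(K, k_features)
    Y = np.random.rand(K, 2)

    N, M = 8, 4                                   # assumed numbers of clusters
    input_clustering = KMeans(n_clusters=N).fit(X)
    output_clustering = GaussianMixture(n_components=M).fit(Y)

    I_C = input_clustering.predict(X)             # input cluster labels
    O_C = output_clustering.predict(Y)            # output cluster labels

    # Decision rule pi: k-nn classification from input clusters to output clusters.
    pi = KNeighborsClassifier(n_neighbors=5).fit(I_C.reshape(-1, 1), O_C)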
B. Using the Dialog manager
The decision rule π obtained by the classification method
on {IiC , OiC }1≤i≤K is a function from {1, . . . , N } to
{1, . . . , M }: it takes an input cluster and it gives an output
cluster. However, our dialog manager must be able to take an
input I ∈ [0, 1]^k and give an output O ∈ [0, 1] × [0, time_max].
Fig. 7. Dialog Manager training
So, first we need to assign the input I ∈ [0, 1]^k to the corresponding input cluster I^C ∈ {1, . . . , N}. To do that, we choose the cluster whose mean is the closest to I ∈ [0, 1]^k:
I^C = argmin_{1≤i≤N} ‖I − µ_i^I‖_2^2,    (2)
where µ_i^I is the mean of the input cluster i ∈ {1, . . . , N} and ‖·‖_2 is the Euclidean norm. This operation is called the
input cluster choice. Second, to be able to generate O from the selected output cluster l ∈ {1, . . . , M}, the question is: which element of the output cluster l must we choose in order to correspond to the data {O_i}_{1≤i≤K}? This is why we use a GMM method for clustering the outputs: each cluster l can be seen, in the 2-dimensional intensity-duration plane, as a Gaussian of law N(µ_l^O, Σ_l^O), where µ_l^O is the mean of the output cluster l and Σ_l^O is the covariance matrix of the output cluster l. Therefore, to obtain an output, it is sufficient to sample an element O of law N(µ_l^O, Σ_l^O). This operation is called the output generation.
Let us summarize the functioning of the Dialog manager (see also Figure 8): we receive the input I, we associate this input with its corresponding input cluster I^C ∈ {1, . . . , N}, then the decision rule π gives the output cluster π(I^C) ∈ {1, . . . , M}, and finally the output O is chosen in the output cluster π(I^C) via the output generation.
Fig. 8. Dialog Manager functioning
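A corresponding sketch of one decision step (input cluster choice, decision rule, output generation), using NumPy and assuming the cluster means and covariances learned above are available; the names are illustrative:

    import numpy as np

    def dm_step(I, input_means, pi, output_means, output_covs, time_max, rng):
        # Input cluster choice (Eq. 2): nearest input-cluster mean.
        c_in = int(np.argmin(((input_means - I) ** 2).sum(axis=1)))
        # Decision rule: map the input cluster to an output cluster.
        c_out = int(pi.predict([[c_in]])[0])
        # Output generation: sample (intensity, duration) from that cluster's Gaussian.
        intensity, duration = rng.multivariate_normal(output_means[c_out],
                                                      output_covs[c_out])
        return float(np.clip(intensity, 0.0, 1.0)), float(np.clip(duration, 0.0, time_max))

    # Example call, reusing the objects fitted in the training sketch:
    # intensity, duration = dm_step(I, input_clustering.cluster_centers_, pi,
    #                               output_clustering.means_, output_clustering.covariances_,
    #                               time_max=5.0, rng=np.random.default_rng())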
C. Laughter Planner
In the Laugh Machine architecture, the dialog manager is followed by the Laughter Planner, which adapts the outputs of the dialog manager to the constraints (instruction format, avoidance of conflicting information, etc.) of the synthesis modules. While it is technically a decision component, the Laughter Planner is described in the visual synthesis section (Section VIII-B).
VIII. AUDIOVISUAL LAUGHTER SYNTHESIS
A. Acoustic laughter synthesis
Given 1) the lack of naturalness resulting from previous attempts at acoustic laughter synthesis, 2) the need for high-level control of the laugh synthesizer and 3) the good performance achieved with Hidden Markov Model (HMM) based speech synthesis [34], we decided to investigate the potential of this technique for acoustic laughter synthesis. We opted for the HMM-based Speech Synthesis System (HTS) [35], as it is free and widely used in speech synthesis research.
Since explaining the details of speech synthesis with HMMs or HTS goes beyond the scope of this project report, we will here only describe the major modifications that have been made to adapt our laughter data to HTS and, vice versa, to adapt functions or parameters of the HTS demo (provided with the HTS toolbox) to improve the quality of laughter synthesis. Readers who would like to know more about HTS are encouraged to consult the publications listed on the HTS webpage (http://hts.sp.nitech.ac.jp/?Publications), and in particular [34] for an overview or Yoshimura's PhD thesis [36] for more detailed explanations.
The following paragraphs respectively focus on the selection
of the training data, the modifications implemented in the HTS
demo and, finally, the resulting process for acoustic laughter
synthesis.
1) Selection and adaptation of acoustic data:
HMM-based acoustic synthesis requires a large quantity of
data: statistical models for each unit (in speech: phonemes) can
only be accurately estimated if there are numerous training
examples. Furthermore, the data should be labeled (i.e., a
phonetic transcription must be provided) and come from a
single person, whose voice is going to be modeled. HMM-based speech synthesis is usually trained with hours of speech.
It is difficult to obtain such large quantities of spontaneous
laughter data. The only laughter database including phonetic
transcriptions is the AVLaughterCycle database [19], [20],
which contains in total 1 hour of laughter from 24 subjects.
We decided to use that database for our acoustic laughter
synthesis.
To fully exploit the potential of HTS, the phonetic annotations of the AVLaughterCycle database have been extended
to syllables. Indeed, HTS is able to distinguish contexts that
lead to different acoustic realizations of a single phoneme
(and on the other hand, HTS groups the contexts that yield
acoustically similar realizations of a phoneme). In speech, the
context of a phoneme is defined not only with the surrounding
phonemes, but also with prosodic information such as the
position of the phoneme within the syllable, the number of
phonemes in the previous, current and following syllables;
the number of syllables in the previous, current and following
words; the number of words in the phrase; etc. Apart from the surrounding phones^6, such contextual information was not available in the AVLC annotations, as there was no annotation of the laughs in terms of syllables or words.
^6 Since the phonological notion of "phoneme" is not clearly defined for laughter, we prefer to use the word "phone" for the acoustic units found in our laughter database.
It was decided to add a syllabic annotation of the data to provide as much contextual information as possible. There is no clear definition
of laughter syllables, and the practical definition that has been
used for the syllabic annotation was to consider one syllable
as a set of phones that was acoustically perceived as forming
one block (or burst), usually containing one single vowel
(but not always, as laughter can take different structures from
speech). Since the syllabic annotation is time-consuming, it
was decided to do it only for the subjects who laughed the most
in the AVLaughterCycle database: subjects 5, 6, 14, 18 and 20.
These subjects laugh around 5 minutes each, which is still far from the hours of training data used in speech synthesis, but they seemed to represent the best hope for good-quality laughter synthesis. The HTS contextual information was then
formed by assimilating a full laughter episode to a speech
sentence and laughter exhalation and inhalation segments to
words.
In addition, due to the limited available data, the phonetic
labels have been grouped in 8 broad phonetic classes—
namely: fricatives, plosives, vowels, hum-like (including nasal
consonants), glottal stops, nareal fricatives (noisy respiration
airflow going through the nasal cavities), cackles (very short
vowel similar to hiccup sound) and silence—instead of the
200 phones annotated in the AVLaughterCycle database [20].
Indeed, most of these phones had very few examples for each speaker, and hence could not be accurately modeled. Grouping acoustically similar phones makes it possible to obtain better models, at the cost of reduced acoustic variability (e.g., all the vowels are grouped into an average model that is close to 'a', and we lose the possibility to generate the few 'o's in the database).
An example of the resulting phonetic transcription is presented in Figure 9.
Finally, the laughs from the AVLaughterCycle database
have been processed to reduce background noise and remove
saturations.
2) Modifications of the HTS demo process:
Several minor modifications have been applied to HTS.
Some of them are simple parameter variations compared to the
standard values used in speech (and in the HTS demo). For
example, the boundaries for fundamental frequency estimation
have been extended (the values have been manually determined for each subject), the threshold for pruning decision
trees has been increased, etc. In addition, the list of questions available to the decision trees has been extended, considering the new contextual information available for laughter.
More importantly, two standard HTS algorithms have been
replaced by more efficient methods. First, the standard Dirac
pulse train for voiced excitation has been replaced by the
DSM model [37], which better fits the human vocal excitation
shapes and reduces the buzziness of the synthesized voice.
Second, the standard vocal tract and fundamental frequency
estimation algorithms provided by HTS have been replaced
by the STRAIGHT method [38], which is known in speech
processing to provide better estimations.
3) Synthesis process:
With the explained modifications to the AVLaughterCycle
database and the HTS demo, we were able to train laughter
synthesis models, with which we can produce acoustic laughs when giving an acoustic laughter transcription as input.
Fig. 9. Laughter phonetic and syllabic annotation, from top to bottom: a) waveform; b) spectrogram; c) phonetic annotation (using the 8 broad classes); d) syllable annotation; e) respiration phases (inhalation or exhalation).
It is worth noting that there is currently no module to generate such
laughter phonetic transcriptions from high-level instructions
(e.g., a type of laughter, its duration and its intensity). We
are thus constrained to play existing laughter transcriptions.
Additionally, we noticed that the synthesis quality drops if
we want to synthesize a phonetic transcription from speaker
A with the models trained on voice B. In consequence, we
currently stick to re-synthesizing laughs from one speaker,
using both the phonetic transcription and the models trained
from the same subject.
A perceptive evaluation study still has to be carried out.
Nevertheless, the first qualitative tests are promising. The
modifications explained in the previous paragraphs largely
improved the quality of the laughter synthesis. There remain
some laughs or sounds that are not properly synthesized,
possibly due to the limited training data. Future work will
investigate this issue as well as the possibility to generate
new laughter phonetic transcriptions (or modify existing ones)
that can be synthesized properly. Nevertheless, at the end
of this project, we are able to synthesize a decent number
of good quality laughs for the best voices coming from the
AVLaughterCycle database.
B. Visual Laughter Synthesis
Two different virtual agents and four different approaches
were used for the visual synthesis. The visual synthesis
component is composed of a Laughter Planner and 2 Realizers
and Players (see Figure 10).
The Laughter Planner receives from the dialog manager the
information about the appropriate laugh reaction through the
ActiveMQ/SEMAINE architecture (see Section VII). Next it
chooses one laugh episode from the library of predefined laugh
samples and generates the appropriate BML command that is
sent through ActiveMQ/SEMAINE to one out of two realizers
available in the project: Living Actor or Greta Realizer.
Fig. 10. Visual Synthesis Component Pipeline
In Figure 10 we present the detailed processing pipeline
of our visual synthesis component. The Laughter Planner is
connected to the Greta Behavior Realizer and the Cantoche
Sender. The latter is responsible for the communication with
the Living Actor component (see Section VIII-B4). Both
Behavior Realizer and Cantoche Sender receive the same BML
message. As these realizers use completely different methods
for controlling the animation (Greta can be controlled by high-
level facial behavior description in FACS and low-level facial animation parameterization (MPEG-4/FAPs) while Living
Actor plays predefined animations) we use realizer-specific
extensions of BML to assure that the animations played with
different agents are similar. If necessary, the Laughter Planner
can also send commands in a high-level language called
FML (FMLSemaineSender box) or control facial animations
at very low level by specifying the values of facial animation
parameters (FAPs) (FAPSender box). Independently of which of these pipelines is used, the final animation is described using low-level facial animation parameters (FAPs) and is sent through ActiveMQ/SEMAINE to the visualization module (FAPSender box). At the moment we use the Player from the SEMAINE Project. Four characters are included in this Player (2 male, 2 female) but for the purpose of the evaluation we used only one of them.
The Laughter Planner module can work in three different
conditions, related to the three experimental scenarios: fixed
speech condition (FSC), fixed laughter condition (FLC) and
interactive laughter condition (ILC). In the first two conditions
(FSC and FLC), the Laughter Planner receives the information
about the context (time of funny event, see Section IX-C2) and
it sends the agent a verbal (FSC) or nonverbal (FLC) reaction, pre-scripted in BML, to be displayed to the user. The list of these behaviors was chosen manually.
In the ILC condition the behavior of the agent is flexible as it is adapted to the participant and the context. The Laughter Planner receives the information on duration and intensity of laughter responses and, using these values, it chooses the laugh episode from the library that best matches both values. At the moment, the synthesis components do not allow for interruptions of the animation. Once it is chosen, the laugh episode has to be played until the end. During this period the Laughter Planner does not take into account any new information coming from the dialog manager. All the episodes start and end with a neutral expression. Thus they cannot be concatenated without passing through a neutral face. Additionally, the presynthesized audio wave file was synchronized with the animation.
Four different approaches were used in the project to prepare
the lexicon of laughs: animation from the manual annotation
of action units; animation from automatic facial movements
detection; motion capture data driven; and manual animation.
They are explained in the next subsections.
1) Animation from manual Action Units:
The Facial Action Coding System (FACS; [39]) is a comprehensive, anatomically based system for measuring all visually discernible facial movement. It describes all distinguishable facial activity on the basis of 44 unique Action Units (AUs), as well as several categories for head and eye position movements and miscellaneous actions. Facial expressions of emotions are emotion events that comprise a set of different AUs expressed simultaneously. Using FACS and viewing digitally recorded facial behavior at frame rate and in slow motion, certified FACS coders are able to distinguish and code all possible facial expressions. Utilizing this technique, a selection of twenty pre-recorded, laboratory-stimulated laughter events were coded. These codes were then used to model the facial
behavior on the agent.
Four subjects interacting in same sex dyads watching the
stimulus videos (see Section IV-C) were annotated by one
certified FACS coder. Inter-rater reliability was obtained by the additional coding of 50% of the videos by a second certified coder. The inter-rater reliability was sufficient (r = .80) and consensus was reached on events with disagreement.
Furthermore, a selection of 20 laughter events from the AVLC
laughter database [19] (subject 5) were coded by one certified
coder.
The Greta agent is able to display any configuration of
action units. For 3 characters (two females—Poppy and Prudence—and one male—Obadiah) single action units were defined and validated by certified FACS coders. A BML language implemented in Greta permits independent control of each action unit of the agent (its duration and intensity).
Furthermore, as a quality control, the animated AUs of the virtual agent were scrutinized by the FACS coders for a) anatomical appearance change accuracy and b) subtle differences and dominance rules relating to changes in the face when different intensities of facial expressions are produced.
During eNTERFACE we also developed a tool that automatically converts manual FACS annotation files to BML.
Consequently any file containing manual annotation of action
units can be easily displayed with the Greta agent.
2) Animation from Automatic Facial Movements detection:
Greta uses Facial Animation Parameters (FAPs) to realize low
level facial behavior. FAPs in Greta framework are represented
as movements of MPEG-4 facial points compared to ‘neutral’
face. In order to estimate FAPs of natural facial expressions, we make use of an open-source face tracking tool—FaceTracker [40]—to track facial landmark localizations. It uses a Constrained Local Model (CLM) fitting approach that includes a Regularized Landmark Mean-Shift (RLMS) optimization strategy. It can detect 66 facial landmark coordinates with real-time latency, depending on the system's configuration.
Figure 11 shows an example of 2D and 3D landmark coordinates predicted by FaceTracker.
Fig. 11. Landmarks estimated by FaceTracker
Facial geometry differs from one person to another. Therefore, it is difficult to estimate FAPs without neutral-face calibration. To compute FAPs from facial landmarks, a neutral face model is created with the help of 50 neutral faces of different persons. With the help of this model, FAPs are estimated as the distance between the facial landmarks and the neutral-face landmarks. In the case of user-specific FAP estimation in a real-time scenario, the neutral face is estimated from a few seconds of video by explicitly requesting the user to remain neutral. However, a better estimation of FAPs requires manual intervention for tweaking the weights that map landmarks to FAPs, which is a downside of this methodology. Figure 12 shows a comparison of the locations between the MPEG-4 FAP standard and the FaceTracker's landmark localizations.
Fig. 12. (a) MPEG-4 FAP standard (left); (b) FaceTracker's landmark locations (right).
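As a very rough sketch of the idea (not the actual mapping used in the project, whose weights are hand-tuned), FAP values can be approximated as weighted displacements of the tracked landmarks from the calibrated neutral-face landmarks; the array shapes and the linear weight matrix below are hypothetical:

    import numpy as np

    def neutral_face(landmark_frames):
        # Estimate the neutral face from a few seconds of video in which the
        # user is asked to keep a neutral expression.
        return np.mean(landmark_frames, axis=0)

    def estimate_faps(landmarks, neutral_landmarks, weights):
        # landmarks, neutral_landmarks: (66, 2) arrays of tracked/neutral positions
        # weights: (n_faps, 132) hypothetical linear map from landmark offsets to FAPs
        displacement = (landmarks - neutral_landmarks).reshape(-1)
        return weights @ displacement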
The landmark coordinates produced by the FaceTracker are noisy due to discontinuities and outliers in each facial point localization. In particular, the realized behavior is unnatural when we re-target the observed behavior onto Greta. In order to smooth the face tracking parameters, a temporal regression strategy is applied on individual landmarks by fitting 3rd-order polynomial coefficients on a sliding window, where the sliding window size is 0.67 seconds (i.e., 16 frames) and the sliding rate is 0.17 seconds (i.e., 4 frames). An example of the final animation can be seen in Figure 13.
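A simplified sketch of this smoothing step for a single landmark coordinate, using NumPy (the report does not state how overlapping windows are blended, so this version simply refits each window in place):

    import numpy as np

    def smooth_track(track, win=16, hop=4, order=3):
        # Piecewise cubic regression over sliding windows of 16 frames (0.67 s)
        # advanced by 4 frames (0.17 s), applied to one landmark coordinate.
        track = np.asarray(track, dtype=float).copy()
        t = np.arange(len(track))
        for start in range(0, len(track) - win + 1, hop):
            idx = slice(start, start + win)
            coeffs = np.polyfit(t[idx], track[idx], order)
            track[idx] = np.polyval(coeffs, t[idx])
        return track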
3) Animation from Motion Capture Data:
The AVLC corpus (see Section IV-B) contains motion
capture data of laugh episodes that has to be retargeted to the
virtual model. The main problem in this kind of approach consists in finding appropriate mappings for each participant's face geometry and different virtual models. Existing solutions are typically linear (e.g., methods based on blendshape mapping) and do not take into account dynamical aspects of the facial motion itself. Recently, Zeiler et al. [41] proposed to apply variants of Temporal Restricted Boltzmann Machines^7 (TRBM) to the facial retargeting problem. TRBMs are a family of models that permit tractable inference while allowing complicated structures to be extracted from time series data. These models
can encode a complex nonlinear mapping from the motion of
one individual to another which captures facial geometry and
dynamics of both source and target. In the original application
[41] these models were trained on a dataset of facial motion
capture data of two subjects, asked to perform a set of isolated
facial movements based on FACS. The first subject had 313 markers (939 dimensions per frame) and the second subject had 332 markers (996 dimensions per frame). Interestingly, there was no correspondence between the marker sets.
^7 The source code for these models is publicly available at http://www.matthewzeiler.com/software/RetargetToolbox/Documentation/index.html
We decided to use TRBM models for our project which
involves retargeting from an individual to a virtual character.
In our case, we take as input the AVLC mocap data and output
the corresponding facial animation parameters (FAP) values.
This task has two interesting aspects. First, the performance of these models was previously evaluated only on retargeting isolated slow expressions, whereas our case involves transitions from laughing to some other expression (smile or idleness) as well as very fast movements. Second, we use fewer markers compared to the original application. Our mocap data had only 27 markers on the face, which is very sparse.
So far we used the AVLC data on one participant (number 5)
as a source mocap data. We used two sequences, one of 250
frames and another one of 150 frames, to train this model.
Target data (i.e., facial animation parameters) for this training
set was generated using the manual retargeting procedure
explained in [13]. Both the input and output data vectors
were reduced to 32 dimensions by retaining only their first 32
principal components. Since this model typically learns much
better on scaled data (around [-1,1]), the data was then normalized to have zero mean and scaled by the average standard
deviation of all the elements in the training set. Having trained
the model, we used it to generate facial animation parameter values for 2-minute-long mocap data (2500 frames coming from the same participant). The first results are promising, but more variability in the training set is needed to retarget different types of movements more precisely. It is important to note that this procedure needs to be repeated for each virtual model (e.g., Poppy, Prudence, Obadiah).
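The data preparation described above (dimensionality reduction to 32 principal components, then zero-mean normalization and scaling) could be sketched as follows with scikit-learn; the variable names are placeholders and the exact normalization used in the project may differ in detail:

    import numpy as np
    from sklearn.decomposition import PCA

    def prepare_training_data(frames, n_components=32):
        # Keep only the first 32 principal components of each data vector.
        pca = PCA(n_components=n_components)
        reduced = pca.fit_transform(frames)            # shape: (n_frames, 32)
        # Zero-mean the data, then scale by the average standard deviation
        # of all the elements so values lie roughly in [-1, 1].
        centered = reduced - reduced.mean(axis=0)
        scaled = centered / centered.std()
        return scaled, pca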
4) Manual Animation:
The Laugh Machine Living Actor module is composed of a real-time 3D rendering component using Living Actor technology and a communication component that constitutes the interface between the Living Actor agent and the ActiveMQ messaging system. Two virtual characters have been chosen for the first prototype: a girl and a boy, both in a cartoonish style. Two types of laughter animations were created for each one by 3D computer graphics artists, by visually matching the real-person movies from the video database of interacting dyads (see Section IV-C).
Laughter capability has been added to the Living Actor
character production tools and rendering component: specific
facial morphing data are exported from 3D character animation
tools and later rendered in real time. Laughter audio can be
played from an audio file, which can either be the recording of a human laugh or a synthetic laugh synchronized with
the real laughs. A user interface has been added to test various
avatar behaviors and play sounds.
An Application Programming Interface has been added to
the Laugh Machine Living Actor module to remotely control
the avatar using BML scripts. A separate component was
created in Java to make the interface between the Laugh
Machine messaging system using ActiveMQ and the TCP/IP messages of the Living Actor API. At this stage, the supported
BML instructions are restricted to a few commands, triggering
predefined laughs, but the foundation for more complex scripts is ready.
Fig. 13. Animation from Automatic Facial Movements detection
When there are no instructions sent, the real-time 3D
rendering component automatically triggers “Idle” animations
during which the virtual agent is breathing, making it more realistic and ensuring animation continuity.
C. Audiovisual laughter synthesis
In the present work, no new laughter is generated. Instead, existing laughs are re-synthesized. All the animations can thus be prepared in advance. For all the laughter animations, we synthesized
separately the acoustic and the visual modalities, using the
original audiovisual signals (with synchronized audio and
video flows). In consequence the synthesized audio and video
modalities are also synchronized. Each acoustic laugh was
synthesized and the produced WAVE file was made available
to the virtual agent. When the agent receives the instruction
to laugh, it loads simultaneously the acoustic (WAVE) file and
the BML animation file, and plays them synchronously.
IX. E XPERIMENTS
A. Participants
Twenty-one participants (13 males; ages ranging from 25 to
56 years, M = 33.16, SD = 8.11) volunteered to participate.
Four participants were assigned to the fixed speech condition,
5 to the fixed laughter condition and 11 to the interactive
condition.
B. State and Trait influences on the perception of the virtual
agent and its evaluation
Three kinds of subjective ratings were utilized to assess
a) habitual and b) actual factors affecting the perception of
the virtual agent and c) the evaluation of the interaction. For
the habitual factors, two concepts were used: the dispositions towards ridicule and laughter, and the temperamental
basis of the sense of humor, with one questionnaire each
(PhoPhiKat-45; [42]; State-Trait Cheerfulness Inventory, STCI; [43]). Actual factors were assessed by measuring the participants' mood before and after the experiment (state version of the STCI; [44]). The evaluation of the interaction was assessed
with the Avatar Interaction Evaluation Form (AIEF; [45]).
1) Habitual Factors:
The assessment of personality variables allowed for a
control of habitual factors influencing the perception of the
virtual agent, independent of its believability. For example,
gelotophobes, individuals with a fear of being laughed at
(see [46]), do not perceive any laughter as joyful or relaxing
and they fear being laughed at even in ambiguous situations.
Therefore, the laughing virtual agent might be interpreted as a threat and the evaluation would be biased by the individual's fear. By assessing the gelotophobic trait, individuals with at
least a slight fear of being laughed at can either be excluded
from further analysis, or the influence of gelotophobia can
be investigated for the dependent variables. Further, the joy
of being laughed at (gelotophilia) and the joy of laughing at
others (katagelasticism) might alter the experience with the
agent, as katagelasticists might enjoy laughing at the agent, while gelotophiles may feel laughed at by the agent and derive pleasure from this. Both dispositions may increase
the positive experience of interacting with an agent. The
PhoPhiKat-45 is a 45-item measure of gelotophobia (“When
they laugh in my presence I get suspicious”), gelotophilia
(“When I am with other people, I enjoy making jokes at my
own expense to make the others laugh”), and katagelasticism
(“I enjoy exposing others and I am happy when they get
laughed at”). Answers are given on a 4-point Likert scale
(1 = strongly disagree to 4 = strongly agree). Ruch and Proyer
[42] found high internal consistencies (all alphas ≥ .84) and
high retest-reliabilities ≥ .77 and ≥ .73 (three to six months).
In the present sample, reliabilities were satisfactory to high and ranged from α = .81 to .83.
Also, it was shown that the traits and states representing
the temperamental basis of the sense of humor influence an
individual’s threshold for smiling and laughter, being amused,
appreciating humor or humorous interactions (for an overview
see [47]). It was assumed that trait cheerful individuals would
enjoy the interaction more than low trait cheerful individuals,
as they have a lower threshold for smiling and laughter, those
behaviors are more contagious and there are generally more
elicitors of amusement to individuals with high scores. For
trait bad mood, it was expected that individuals with high
scores would experience less positive affect when interacting
with the agent, compared to individuals with low scores, as
individuals with high scores have an increased threshold for
being exhilarated, and they do not easily engage in humorous
interactions.
The STCI assesses the temperamental basis of the sense of
humor in the three constructs of cheerfulness (CH), seriousness
(SE), and bad mood (BM) as both states (STCI-S) and traits
(STCI-T). Participants completed the STCI-T before the experiment to be able to investigate the influence of cheerfulness,
seriousness and bad mood on the interaction. The standard
state form (STCI-S<30>; [44]) assesses the respective states
of cheerfulness, seriousness and bad mood with ten items each
(also on a four-point answering scale). Ruch and Köhler [48]
report high internal consistencies for the traits (CH: .93, SE:
.88, and BM: .94). The one-month test-retest stability was high for the traits (between .77 and .86), but low for the states (between .33 and .36), confirming the nature of enduring traits
and transient states.
2) Actual Factors:
Different experiments and studies on the state-trait model of cheerfulness, seriousness, and bad mood showed that participants' mood alters the experience of experimental interventions and natural interactions (for an overview, see [47]). Also, individuals' moods change due to interactions and interventions; for example, state seriousness and bad mood decrease when participating in carnival celebrations, while cheerfulness
increases. Therefore, state cheerfulness, seriousness and bad
mood were assessed before and after the experiment to investigate mood influence on the interaction with the agent (with
the above mentioned STCI-S).
3) Evaluation:
To evaluate the quality of the interaction with the virtual
agent, the naturalness of the virtual agent and cognitions and
beliefs toward it, a questionnaire was designed for the purposes
of the experiment. The aim of the Avatar Interaction Evaluation Form (AIEF) is to assess the perception of the agent, the
emotions experienced in the interaction, as well as opinions
and cognitions towards it on broad dimensions. The instrument
consists of 32 items and 3 open questions, which were
developed following a rational construction approach. The first
seven statements refer to general opinions/beliefs and feelings
on virtual agents (e.g., “generally I enjoy interacting with
virtual agents”). Then, 25 statements are listed to evaluate the
experimental session. The following components are included:
positive emotional experience (8 items; e.g., “the virtual agent
increased my enjoyment”), social (and motivational) aspects
(7 items; e.g., "being with the virtual agent just felt like being with another person"), judgment of technical features of the virtual agent/believability (5 items; e.g., "the laughter of the virtual agent was very natural"), and cognitive aspects assigned to the current virtual agent (5 items; e.g., "the virtual agent seemed to have a personality"). All statements are judged on a seven-point Likert scale (1 = strongly disagree to 7 = strongly
agree). In the three open questions, participants can express
any other thoughts, feelings or opinions they would like to
mention, as well as describing what they liked best/least.
4) Further Evaluation Questions and Consent Form:
To end the experimental session, the participants were asked general questions to assess their liking of candid camera humor in general ("Do you like candid camera clips
in general?” “How funny were the clips?” “How aversive were
the clips?” “Would you like to see more clips of this kind?”).
All questions were answered on a seven-point Likert scale.
Then, participants were asked to give written consent to the use
of the collected data for research and demonstration purposes
(eNTERFACE workshop and the ILHAIRE^8 project).
C. Conditions
1) Overview:
^8 http://www.ilhaire.eu
To create an interaction setting, the participants were asked
to watch a film together with the virtual agent. Three conditions were designed (fixed speech, fixed laughter, interactive),
systematically altering the degree of expressed appreciation
of the clip (amusement) in verbal and non-verbal behavior, as
well as different degrees of interaction with the participant’s
behavior. In the fixed speech and fixed laughter conditions,
the agent would be acting independently of the participant, but
still be signaling appreciation. In the interactive condition, the
agent was responding to the participant’s behavior. In other
words, only the contextual information was used in the fixed
speech and fixed laughter conditions, while the input and
decision components (see Sections VI and VII) were active
in the interactive condition.
2) Selection of pre-defined time points for the fixed laughter
and fixed speech condition:
The pre-defined times were chosen from the stimulus video. Firstly, 14 subjects (three females) watched the video material and annotated its funniness on a continuous funniness rating scale (ranging from "not funny at all" to "slightly funny", to "funny", to "really funny", to "irresistibly funny"). Averaged and normalized funniness scores were computed over all subjects, leading to sections with steep increases in funniness (apexes; see Figure 14) over the video. Secondly, trained raters assigned "punch lines" to the stimulus material, based on assumptions of incongruity-resolution humor theory.
Whenever the incongruous situation/prank was resolved for the
subject involved, and amusement in the observer would occur
from observing the resolution moment, a peak punch line was
assigned. Punch lines were assigned for the first punch line
occurring and the last punch line occurring in a given clip.
When matching the continuous ratings with the punch lines,
it was shown that the funniness apexes did cluster within the
first and last punch lines for all subjects and all pranks, apart
from one outlier. Table II shows the overall and apex durations of each clip, as well as the number and intensity of the peaks that have been fixed. For the three long apex sections, two responses were fixed, where the averaged funniness ratings peaked. Those peaks were rated on an intensity scale from 1 to 4. Pre-defined time points were controlled for a 1.5 s delay in the rating/recording, due to the reaction latency of the subjects and motor response delay.
TABLE II
DURATION, APEX AND NUMBER OF FIXED RESPONSES FOR EACH OF THE STIMULUS CLIPS

Clip  Duration (s)  Apex duration (s)  Fixed responses  Intensity
1     95            69                 2                4
2     131           56                 2                2
3     72            26                 1                4
4     72            16                 1                3
5     78            50                 2                1

Notes: 1. Duration of apex (1st to last punch line). 2. Intensity (1 = strong; 4 = weak)
3) Fixed Speech:
In the fixed speech condition, the agent expressed verbal
appreciation in 8 short phrases (e.g., “oh, that is funny”, “I
liked that one”, “ups”, “this is great”, “how amusing”, “phew”,
nodding, “I wonder what is next”) at pre-defined times. The
verbal responses were rated for intensity on a four-point scale and matched to the intensity scores of the pre-defined time points.
Fig. 14. Continuous funniness ratings (means in blue and standard deviations in red) over the stimulus video for 14 subjects and expert-assigned punch lines (first and last, in blue) for each clip. Red arrows indicate time points for fixed responses.
4) Fixed Laughter:
In the fixed laughter condition, the agent laughed at predefined times during the video. The times were the same as the
time points in the fixed speech condition. The agent displayed
8 laughs which varied in intensity and duration, according to
the intensity ratings of the pre-defined time points. A laughter
bout may be segmented into an onset (i.e., the pre-vocal facial
part), an apex (i.e., the period where vocalization or forced
exhalation occurs), and an offset (i.e., a post-vocalization part;
often a long-lasting smile fading out smoothly; see [21]).
Therefore, the onset was emulated by an independent smiling
action just before the laughter (apex) would occur at the fixed
time. The offset of the laughter was already integrated in the 8 laughs chosen.
5) Interactive Condition:
In the interactive condition, which follows the architecture
presented in Section V and Figure 1, the agent was using
two sources of information to respond to the participant: the
continuous funniness ratings to the clip (context, shown in Figure 14) and the participant’s acoustic laughter vocalizations.
The dialog manager was receiving these two information flows
and continuously taking decisions about whether and how the
virtual agent had to laugh, providing intensity and duration
values of the laugh to display. These instructions were then
transmitted to the audiovisual synthesis modules. Due to the
limited number of laughs available for synthesis (14 at the
time of the experiments), it was decided to cluster them into
4 groups based on their intensities and durations. The output of the dialog manager then points to one of the clusters, inside which the laugh to synthesize is randomly picked.
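A small sketch of this selection step; the cluster data structure is hypothetical, since the report only states that the dialog manager output points to a cluster from which a laugh is picked at random:

    import random

    def pick_laugh(intensity, duration, clusters):
        # clusters: list of dicts {"centroid": (mean_intensity, mean_duration),
        #                          "laughs": [laugh ids]}  -- assumed structure
        def dist(cluster):
            ci, cd = cluster["centroid"]
            return (ci - intensity) ** 2 + (cd - duration) ** 2
        # Choose the cluster whose centroid best matches the requested values,
        # then pick one of its laugh episodes at random.
        best = min(clusters, key=dist)
        return random.choice(best["laughs"])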
D. Problems encountered
Several problems appeared during the experiments.
First of all, the computers we used were not powerful
enough to run all the components on a single computer. We
had to connect four computers together: one master computer
running the stimulus video and the Kinect recording and
analysis (+ the context), one computer running a webcam
with shoulder movement tracking driven by Eyesweb, another
one running the dialog manager and finally one computer for
displaying the virtual agent. Still, the master computer could
not record the video stream from the Kinect. We decided to run
the experiments without recording that video as we still have
the webcam recording, but this issue should be investigated
in the future. Furthermore, during some experiments, data transmission from one computer to another suffered from important delays (5-10 s), which obviously affected the quality of the interaction. Reducing these delays will be one of the most important future developments.
Second, the audio detection module had been trained with
data containing mostly laughs, and relatively few other noises.
Hence, there was confusion between laughter and other loud
noises. In addition, the detection was audio-only, which does not make it possible to take smiles or very subtle laughs (with low audio) into account. We are already working on improving the laughter detection and including other modalities (video, respiration) to increase its robustness.
Third, from the training data, it appeared that the context
was by far the best factor to explain participants’ laughs:
in consequence, the dialog manager did not pay attention to
what the participant was doing, but only triggered laughs from
the contextual input. Since this is undesirable behavior in the interactive condition (which in that case is actually similar to the fixed laughter condition, as every reaction is only context-dependent), we decided to omit the context in the interactive condition: the virtual agent was then only reacting to what the participant was doing. Better models should be built in the future to allow both the context and the participant's reactions to be considered simultaneously.
Fourth, the pool of available laughs for synthesis is currently
limited. There are not a lot of laughs from one single voice
for which we have good quality synthesis for both the audio
and the visual modalities. This limits the range of actions
the virtual agent is able to perform and some participants
with whom the agent laughed a lot might have noticed
some repetitions. This will be improved in the future with two solutions: 1) a larger pool of available laughs; 2) the possibility to generate new laughter transcriptions and/or modify existing ones in real time.
Finally, a connection problem with the respiration sensor
prevented us from recording respiration data.
E. Procedure of the evaluation study
Participants were recruited through e-mail announcement
of an “evaluation study of the Laugh Machine project” at the
eNTERFACE workshop. As an incentive, participants were offered feedback on the questionnaire measures on request. It was announced that the study consisted of filling in questionnaires (approximately 30-45 minutes) and a session of 30 minutes on two given days. No further information on the
aims of the study was given. Participants chose a date for the
experimental session via the Internet and received confirmation
by email.
At the experimental session, participants were welcomed
by one of the two female experimenters and asked to fill
in the STCI and the PhoPhiKat. Then, participants were
asked to fill in the STCI-S to assess their current mood.
Meanwhile, the participants were assigned to one of the
three conditions. Afterwards, the second female experimenter
accompanied the participant to the experimenting room, where
the participant was asked to sit in front of a television screen.
A camera allowed for the frontal filming of the head and shoulders, as well as the upper body of the participant. Two
male experimenters concerned with the technical components
were present. Participants were asked for consent to have
their shoulder and body movements recorded. They were also
given headphones to hear the virtual agent. The experimenter
explained that the participant was asked to watch a film
together with Poppy and that the experimenters would leave
the room when the experiment started. Once the experimenters
left the room, the agent greeted the participant (“Hi, I’m
Greta. I’m looking forward to watch this video with you.
Let’s start”) and subsequently, the video started. After the
film, the experimenters entered the room again and the female
experimenter accompanied the participant back to the location
where the post measure of the STCI-S, as well as the AIEF
and five further evaluation questions were filled in. After all
questionnaires were completed, the first female experimenter
debriefed the participant and asked for written permission to
use the obtained data.
The following setup was used in this experimental session
(see Figure 15). Two LCD displays were used: the bigger
one (46”) was used to display the stimuli (the funny film,
see Section III). The smaller (19”) LCD display placed on
the right side of the big one was used to display the agent
(a close-up view of the agent with only the face visible was
used). Four computers were used to collect the user data,
run the Dialog Module and to control the agent audio-visual
synthesis. Participant’s behaviors were collected through a
Kinect (sound, depth map, and camera) and a second webcam
synchronized with the EyesWeb software (see Section VI).
Because of technical issues we were not able to use the
respiration sensor in this experimental session. Participants
were asked to sit on a cushion about 1m from the screen.
They were asked to wear headphones.
In the evaluation we used 14 laugh episodes from the AVLC dataset (subject 5). For consistency reasons we used only one female agent (i.e., Poppy) and animations created with only one method, i.e., automatic facial movement detection (see Section VIII-B2).
X. R ESULTS
A. Preliminary Analysis
Scale means for cheerfulness, seriousness, bad mood, gelotophobia, gelotophilia and katagelasticism were investigated.
Fig. 15. Setup of the experiment.
The sample characteristics of the PhoPhiKat and the STCI-T resembled norm scores for adult populations. In this sample,
the internal consistencies were satisfactory for all trait scales,
ranging from α = .74 for trait seriousness, to α = .91
for trait cheerfulness. With respect to trait variables biasing the evaluation, three subjects were identified as exceeding the cut-off point for gelotophobia. Means for the state cheerfulness, seriousness and bad mood scores showed higher state bad mood scores before the experiment, compared to previous participants of studies on personality and humor. With respect to the AIEF, the internal consistencies (Cronbach's alpha) of the scales were satisfactory, ranging from α = .78 (cognitive aspects) to α = .90 (positive emotional experience).
B. Traits
In line with previous findings, trait cheerfulness was correlated negatively to trait bad mood (r = −.61, p < .01), as
well as trait seriousness (r = −.16, n.s.), but less strongly
to the latter one. Trait seriousness and bad mood were correlated positively (r = .22, n.s.). Gelotophobia was correlated
negatively to gelotophilia (r = −.50, p < .05), as well as
(but less so) to katagelasticism. The latter negative (but not
significant; r = −.35, p = .117) correlation was unusual, as
gelotophobia usually shows zero correlations to katagelasticism. Katagelasticism was positively related to gelotophilia
(r = .26, n.s.). Generally, correlations of the AIEF to the trait scales did not reach statistical significance. Correlating the dimensions and items of the AIEF to gelotophobia (bivariate Pearson correlations) showed that three of the four AIEF scales were negatively correlated with gelotophobia, indicating that higher scores in gelotophobia went along with less positive emotions, less assignment of cognition and less believability of the virtual agent. Feeling social presence by the agent was positively correlated to gelotophobia. Gelotophilia correlated positively with all dimensions of the AIEF. Further, higher scores in katagelasticism went along with more positive emotions, higher perceived believability and higher perceived social presence. With regard to the temperamental basis of the sense of humor, the highest correlations to the AIEF dimensions were found for trait bad mood. Unlike a priori assumptions, trait bad mood correlated positively to the AIEF dimensions, and correlations to trait cheerfulness were generally very low. Trait seriousness was correlated negatively to the AIEF scales.
C. States
Correlating the states to their respective traits showed that
trait cheerfulness was positively correlated to state cheerfulness both pre and post the experiment (but all n.s.). Trait
seriousness was positively correlated to state seriousness after the experiment, whereas trait bad mood was negatively correlated to state bad mood both pre and post the experiment (both p < .01).
In this sample, a few individuals with low scores in trait bad
mood came to the experiment with high values in state bad
mood, whereas a few individuals with high scores on bad
mood came to the experiment with comparably low scores
in state bad mood. In the descriptive analysis of the mood before and after the experiment, it was found that the interaction with the virtual agent led to a decrease in seriousness over all conditions, whereas state cheerfulness stayed stable. State bad mood before the experiment predicted lower scores on the AIEF dimensions, suggesting that individuals who feel more grumpy or sad generally experience less positive emotions with the virtual agent, assign the agent less cognitive capability, experience less social interaction and judge it as less believable (all r < −.489, p < .05).
D. AIEF Scales/Dimensions
Due to the low cell sizes, no test of significance could be performed to test the influence of the condition on the AIEF dimensions. Nevertheless, descriptive inspection of
the group means showed that the conditions differed in their
elicitation of positive outcomes on all dimensions of the
AIEF. The interactive condition yielded highest means on
all four dimensions, implying that the participants felt more
positive emotions, felt more social interaction with the agent,
considered it more natural and assigned it more cognitive
capabilities than in the fixed conditions (see Figure 16). The
means of the interactive condition were followed by the means
of the fixed laughter condition.
Interestingly, the fixed speech condition yielded similarly
high scores on the beliefs on cognition as the interactive
condition, whereas the other means were numerically lowest
for the fixed speech condition.
The results stayed stable when excluding the three individuals exceeding the cut-off point for gelotophobia.
E. Open Answers
Out of the 21 participants, 14 gave answers to the question of what they liked least about the session. Half of the participants mentioned that the video was not very funny or would have been funnier with sound. Two participants mentioned that they could not concentrate on both the virtual agent and the
film. Two of the three gelotophobes gave feedback (subject 2:
“Poppy’s expression while laughing was more a smirk than a
Fig. 16. Profiles of the means in the AIEF scales for the three experimental
conditions separately
laugh”; subject 21: “it’s hard to act naturally when watching
a film when you feel like you should laugh”). Seventeen participants responded to what was liked best about the session.
Best liked was the laughter of the virtual agent through the
headphones (it was considered amusing and contagious; three
nominations), the video (five nominations), the set up (four
nominations) and one participant stated: “It was interesting to
see in what situations and in what manner the virtual agent
responded to my laughter and to funny situations respectively”
(subject 12).
XI. C OLLECTED DATA
Multimodal laughter corpora of human-to-human interactions are rare. Even rarer are corpora of human-machine interaction that contain any episodes of laughter. The evaluation of our interactive laughter system gave us the unique opportunity to gather new data about human behavior in such a human-machine interactive scenario. Consequently, we have collected multimodal data from the participants to our experiments. In more detail, our corpus contains:
• audio data, recorded by the Kinect at 16 kHz and stored in mono WAVE files (PCM, 16 bits)
• Kinect depth maps
• recordings from two web cameras
• data on the shoulder movements extracted from the video stream (for this purpose two small markers were placed on the shoulders of each participant)
All these data can be synchronized with the context (see Section IX-C2) and the agent's reactions. The collected corpus is an important result of the Laugh Machine project. It will be widely used in the ILHAIRE project and will become freely available for research purposes.
XII. C ONCLUSIONS
The first results of the evaluation experiment are highly promising: it was shown that the three conditions elicited different degrees of positive emotions in the participants and differed in the amount of social interaction induced, as well as in the cognitions and capabilities assigned to the agent. Also, the believability differed across the three conditions. It was shown that the interactive condition yielded the most positive outcomes on all
dimensions, implying that the feedback given to the participant
by mimicking his or her laughter is best capable of creating a
“mutual film watching experience” that is pleasurable.
In sum, expressing laughter increases the positive experience of the interaction with an agent when watching an amusing video: both laughter conditions elicited more positive emotions than the fixed speech condition. The fixed speech condition yielded the numerically lowest means on the AIEF dimensions, apart from the dimension "beliefs on cognition", where the means were as high as in the interactive condition, implying that speech leads to the assignment of cognitive ability as much as responding to the participant's behavior. Naturally, the fixed speech condition should yield the lowest scores, as there was no laughter expressed in this condition and some items targeted the contagiousness and appropriateness of the laughter displayed by the agent.
Obviously, in the interactive condition, the amount of laughter displayed by the agent varied for each participant, depending on how many times the participant actually laughed. Therefore, the agent behaved similarly to the participant, which seems to be natural and comfortable for the participant. Nevertheless, the current state of data analysis does not allow differentiating between individuals who displayed a lot of laughter—and consequently had a lot of laughter feedback from the agent—and individuals who showed only little laughter—and received little laughter feedback from the agent. An in-depth analysis of the video material obtained during the eNTERFACE evaluation experiment will allow for an investigation of how many times the participants actually laughed and how this influenced the perception of the setting. This will be done by applying the FACS [39]. Further, an analysis of the eye movements (gaze behavior) will allow for an estimation of the attention paid to the agent.
The results of the trait and state cheerfulness, seriousness,
and bad mood variables clearly show the importance of including personality variables into such evaluation experiments.
Especially state bad mood influenced the interaction and the later perception of the virtual agent, leading to a mood-dependent bias. Individuals with high scores in state bad mood before the experiment evaluated the virtual agent less favorably. This is likely due to their increased threshold for engaging in cheerful/humorous situations/interactions and—in the case of grumpiness—their unwillingness to be exhilarated and—in the case of depressed/melancholic mood—their incapability to be exhilarated. Therefore, personality should always be controlled for in future studies. Generally, there was sufficient variance in the gelotophobia scores, even in the small sample obtained
in the evaluation. Gelotophobia showed some systematic relations to the dimensions of the AIEF. For future studies,
the assessment and control of gelotophobia is essential to get
unbiased evaluations of an agent. Furthermore, those results
might help the understanding of the fear of being laughed at
and how it influences the thoughts, feelings and behavior of
ENTERFACE’12 SUMMER WORKSHOP - FINAL REPORT; PROJECT P2 : LAUGH MACHINE
individuals with gelotophobia.
Nevertheless, more participants are needed to test the hypothesis that the condition influences the AIEF dimensions and to detect statistically significant differences between the conditions (see the analysis sketch below). To improve the experimental setup, lessons learned at eNTERFACE as well as the participants' feedback will be integrated to optimize the design and procedure. For example, the stimulus video consisted of only one type of humorous material. It is well established in psychological research that inter-individual differences exist in the appreciation of types of humor. A lack of experienced amusement on the side of the participant might therefore also be due to a dislike of candid camera clips as one specific type of humor. Any manipulation by the experimental conditions should not be overshadowed by the quality or type of the stimulus video. Therefore, a more representative set of clips with sound is needed, presented in counter-balanced order and extending the overall interaction time with the virtual agent.
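As a minimal sketch of the planned between-condition comparison, the following assumes a one-way ANOVA over per-participant scores on one AIEF dimension; the report does not specify the exact statistical procedure, and the ratings below are hypothetical placeholders, not collected data.

from scipy.stats import f_oneway

# Hypothetical AIEF ratings for one dimension in the three conditions.
fixed_speech   = [3.1, 2.8, 3.4, 2.9, 3.0]
fixed_laughter = [3.6, 3.9, 3.5, 4.0, 3.7]
interactive    = [4.2, 4.5, 3.9, 4.4, 4.1]

f_stat, p_value = f_oneway(fixed_speech, fixed_laughter, interactive)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
# With a larger sample, state bad mood and gelotophobia scores could be added
# as covariates (e.g., in an ANCOVA) to control for the biases discussed above.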
Furthermore, it needs to be clear to participants beforehand what the virtual agent is capable of doing. At the beginning of the experiment, the virtual agent should display some laughter, so that the participant knows the agent is capable of this behavior. This ensures that the participant is not merely surprised and amused by the fact that the virtual agent can laugh when it eventually does so during the film. If this information is not available to participants, their amusement might stem only from the excitement or pleasure of seeing the technical achievement of making a virtual agent laugh. Ruch and Ekman's [21] overview of laughter (respiration, vocalization, facial action, body movement) illustrated the mechanisms of laughter and defined its elements. While acknowledging that more variants of this vocal expressive-communicative signal might exist, they focused on the common denominators of all forms and proposed distinguishing between laughing spontaneously (emotional laughter) and laughing voluntarily (contrived or faked laughter). In this experiment, only displays of amusement laughter (differing in intensity and duration) were used. Further studies may also include different types of laughter.
On the technical side, the main outcome of the project is a full processing chain whose components communicate with each other to perform multimodal data acquisition, real-time laughter-related analysis, laughter decision making and audiovisual laughter synthesis (a sketch of such a chain is given below). Progress has been made on all these aspects during the Laugh Machine project: the development of the respiration sensor and the integration of all input devices in a synchronized framework, which will enable multimodal laughter detection; the construction of a real-time, speaker-independent laughter intensity estimator; the design of the first dialog manager dealing with laughter; the first advances in acoustic laughter synthesis with the introduction of HMM-based processes; the implementation of four different animation techniques; and the collection of a unique database of humans interacting with a laughing virtual agent.
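The following self-contained sketch only illustrates the structure of that chain (sensing, intensity estimation, laughter decision, synthesis command). The real system distributes these components over several machines connected by message-oriented middleware (ActiveMQ [22]); here a single in-process queue stands in for that bus, and the intensity values, threshold and message fields are illustrative assumptions, not the project's implementation.

import queue
import random

bus = queue.Queue()

def sensing_step():
    """Stand-in for multimodal acquisition: emits a pseudo audio frame."""
    bus.put({"type": "frame", "energy": random.random()})

def intensity_estimator(frame):
    """Stand-in for the real-time laughter intensity estimator."""
    return {"type": "laughter_intensity", "value": frame["energy"]}

def dialog_manager(intensity, context_funny=True):
    """Stand-in for the dialog manager: decides whether and how the agent laughs."""
    if context_funny and intensity["value"] > 0.6:  # illustrative threshold
        return {"type": "laugh_command", "intensity": intensity["value"]}
    return None

def synthesizer(command):
    """Stand-in for audiovisual laughter synthesis (audio plus animation)."""
    print(f"Agent laughs with intensity {command['intensity']:.2f}")

if __name__ == "__main__":
    for _ in range(10):
        sensing_step()
        frame = bus.get()
        command = dialog_manager(intensity_estimator(frame))
        if command is not None:
            synthesizer(command)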
Each of these components can be improved, and several issues arose during the experiments. Without going into the details of each component, future work will include: improving laughter detection and intensity computation with the help of visual and respiration signals (see the fusion sketch below); reducing the communication delays between the computers hosting the different modules; better balancing the influence of the context in the dialog manager; extending the range of output laughs by allowing laughs to be generated or modified on the fly; ensuring that all experimental data can be recorded flawlessly; and adapting the virtual agent's behavior to the participant's personality (e.g., gelotophobia) and mood to maximize the participant's perception of the interaction. Furthermore, future agents should not be limited to facial expressions and vocal utterances, as laughter also entails lacrimation, respiration, body movements (e.g., [49]) and changes in body posture.
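One possible way to combine audio, visual and respiration cues, as mentioned in the future work above, is a weighted late fusion of per-modality laughter probabilities. The sketch below is only an illustration: the weights, probabilities and decision threshold are hypothetical, and the report does not prescribe a fusion method.

def fuse_laughter_probabilities(p_audio, p_video, p_respiration,
                                weights=(0.5, 0.3, 0.2), threshold=0.5):
    """Return (fused probability, binary laughter decision)."""
    fused = (weights[0] * p_audio
             + weights[1] * p_video
             + weights[2] * p_respiration)
    return fused, fused >= threshold

if __name__ == "__main__":
    # Hypothetical per-frame outputs of three modality-specific detectors.
    prob, is_laugh = fuse_laughter_probabilities(0.8, 0.6, 0.4)
    print(f"fused = {prob:.2f}, laughter detected: {is_laugh}")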
However, despite all the identified issues, the first evaluation results are positive. This is very encouraging and indicates that the full Laugh Machine system, while imperfect, is already working and provides both a useful benchmark and a reusable framework for evaluating future developments.
ACKNOWLEDGMENT
This work was supported by the European FP7-ICT projects ILHAIRE (FET, grant no. 270780), CEEDS (FET, grant no. 258749) and TARDIS (STREP, grant no. 288578).
The authors would also like to thank the organizers of the
eNTERFACE’12 Workshop for making the project possible
and making available high quality material and recording
rooms to us. Finally, all the participants of our experiments
are gratefully acknowledged.
REFERENCES
[1] L. Kennedy and D. Ellis, “Laughter detection in meetings,” in NIST
ICASSP 2004 Meeting Recognition Workshop, Montreal, May 2004, pp.
118–121.
[2] K. P. Truong and D. A. van Leeuwen, “Automatic discrimination
between laughter and speech,” Speech Communication, vol. 49, pp. 144–
158, 2007.
[3] M. T. Knox and N. Mirghafori, “Automatic laughter detection using neural networks,” in Proceedings of Interspeech 2007, Antwerp, Belgium,
August 2007, pp. 2973–2976.
[4] S. Petridis and M. Pantic, “Fusion of audio and visual cues for laughter
detection,” in Proceedings of the 2008 international conference on
Content-based image and video retrieval. ACM, 2008, pp. 329–338.
[5] ——, “Audiovisual discrimination between laughter and speech,” in
Proceedings of the IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Las Vegas, Nevada, 2008, pp. 5117–
5120.
[6] ——, “Audiovisual laughter detection based on temporal features,”
in Proceedings of the 10th international conference on Multimodal
interfaces. ACM, 2008, pp. 37–44.
[7] S. Petridis, A. Asghar, and M. Pantic, “Classifying laughter and speech
using audio-visual feature prediction,” in Proceedings of the 2010 IEEE
International Conference on Acoustics Speech and Signal Processing
(ICASSP). Dallas, Texas: IEEE, 2010, pp. 5254–5257.
[8] S. Sundaram and S. Narayanan, “Automatic acoustic synthesis of human-like laughter,” Journal of the Acoustical Society of America, vol. 121,
no. 1, January 2007, pp. 527–535.
[9] E. Lasarcyk and J. Trouvain, “Imitating conversational laughter with an articulatory speech synthesizer,” in Proceedings of the Interdisciplinary
Workshop on The Phonetics of Laughter, August 2007, pp. 43–48.
[10] T. Cox, “Laughter’s secrets: faking it – the results,” New Scientist,
27 July 2010. [Online]. Available: http://www.newscientist.com/article/
dn19227-laughters-secrets-faking-it--the-results.html
[11] P. DiLorenzo, V. Zordan, and B. Sanders, “Laughing out loud: control
for modeling anatomically inspired laughter using audio,” in ACM
Transactions on Graphics (TOG), vol. 27, no. 5. ACM, 2008, p. 125.
[12] D. Cosker and J. Edge, “Laughing, crying, sneezing and yawning:
Automatic voice driven animation of non-speech articulations,” in Proc.
of Computer Animation and Social Agents (CASA09). Citeseer, 2009,
pp. 21–24.
[13] J. Urbain, R. Niewiadomski, E. Bevacqua, T. Dutoit, A. Moinet,
C. Pelachaud, B. Picart, J. Tilmanne, and J. Wagner, “AVLaughterCycle:
Enabling a virtual agent to join in laughing with a conversational partner
using a similarity-driven audiovisual laughter animation,” Journal on
Multimodal User Interfaces, vol. 4, no. 1, pp. 47–58, 2010.
[14] S. Shahid, E. Krahmer, M. Swerts, W. Melder, and M. Neerincx,
“Exploring social and temporal dimensions of emotion induction using
an adaptive affective mirror,” in 27th international conference extended
abstracts on Human factors in computing systems. ACM, 2009, pp.
3727–3732.
[15] S. Fukushima, Y. Hashimoto, T. Nozawa, and H. Kajimoto, “Laugh
enhancer using laugh track synchronized with the user’s laugh motion,”
in Proceedings of the 28th of the international conference on Human
factors in computing systems (CHI’10), 2010, pp. 3613–3618.
[16] C. Becker-Asano, T. Kanda, C. Ishi, and H. Ishiguro, “How about
laughter? Perceived naturalness of two laughing humanoid robots,” in
Affective Computing and Intelligent Interaction, 2009, pp. 49–54.
[17] C. Becker-Asano and H. Ishiguro, “Laughter in social robotics - no
laughing matter,” in Intl. Workshop on Social Intelligence Design
(SID2009), 2009, pp. 287–300.
[18] G. McKeown, M. F. Valstar, R. Cowie, and M. Pantic, “The SEMAINE corpus of emotionally coloured character interactions,” in Proceedings of IEEE Int’l Conf. Multimedia, Expo (ICME’10), Singapore, July 2010, pp. 1079–1084.
[19] J. Urbain, E. Bevacqua, T. Dutoit, A. Moinet, R. Niewiadomski,
C. Pelachaud, B. Picart, J. Tilmanne, and J. Wagner, “The AVLaughterCycle database,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), Valletta, Malta,
May 2010.
[20] J. Urbain and T. Dutoit, “A phonetic analysis of natural laughter, for
use in automatic laughter processing systems,” in International Conference on Affective Computing and Intelligent Interaction (ACII2011),
Memphis, Tennessee, October 2011, pp. 397–406.
[21] W. Ruch and P. Ekman, “The expressive pattern of laughter,” in Emotion,
qualia and consciousness, A. Kaszniak, Ed. Tokyo: World Scientific
Publishers, 2001, pp. 426–443.
[22] The Apache Software Foundation, “Apache ActiveMQ™ [computer program webpage],” http://activemq.apache.org/, consulted on August
24, 2012.
[23] J. Wagner, F. Lingenfelser, and E. André, “The social signal interpretation framework (SSI) for real time signal processing and recognition,”
in Proceedings of Interspeech 2011, 2011.
[24] N. Chawla, K. Bowyer, L. Hall, and W. Kegelmeyer, “SMOTE: synthetic
minority over-sampling technique,” Journal of Artificial Intelligence
Research, vol. 16, pp. 321–357, 2002.
[25] F. Eyben, M. Wöllmer, and B. Schuller, “openSMILE: the Munich
versatile and fast open-source audio feature extractor,” in Proceedings of
the international conference on Multimedia, ser. MM ’10. New York,
NY, USA: ACM, 2010, pp. 1459–1462.
[26] R. Niewiadomski, J. Urbain, C. Pelachaud, and T. Dutoit, “Finding out
the audio and visual features that influence the perception of laughter
intensity and differ in inhalation and exhalation phases,” in Proceedings
of the ES 2012 4th International Workshop on Corpora for Research on
Emotion, Sentiment & Social Signals, Satellite of LREC 2012,
Istanbul, Turkey, May 2012.
[27] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and
I. Witten, “The WEKA data mining software: an update,” ACM SIGKDD
Explorations Newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[28] M. Filippelli, R. Pellegrino, I. Iandelli, G. Misuri, J. Rodarte, R. Duranti,
V. Brusasco, and G. Scano, “Respiratory dynamics during laughter,”
Journal of Applied Physiology, vol. 90, no. 4, p. 1441, 2001.
[29] A. Feleky, “The influence of the emotions on respiration.” Journal of
Experimental Psychology, vol. 1, no. 3, pp. 218–241, 1916.
[30] A. Camurri, P. Coletta, G. Varni, and S. Ghisio, “Developing multimodal
interactive systems with eyesweb xmi,” in Proceedings of the 2007
Conference on New Interfaces for Musical Expression (NIME07), 2007,
pp. 302–305.
[31] B. Lukas and T. Kanade, “An iterative image registration technique with
an application to stereo vision,” in Proceedings of the 7th international
joint conference on Artificial intelligence, 1981.
[32] D. Winter, “Biomechanics and motor control of human movement,”
1990.
[33] W. Sethares and T. Staley, “Periodicity transforms,” IEEE Transactions on Signal Processing, vol. 47, no. 11, pp. 2953–2964, 1999.
[34] K. Tokuda, H. Zen, and A. Black, “An HMM-based speech synthesis
system applied to English,” in IEEE Speech Synthesis Workshop, Santa
Monica, California, September 2002, pp. 227–230.
[35] K. Oura, “HMM-based speech synthesis system (HTS),” http://hts.sp.
nitech.ac.jp/, consulted on June 22, 2011.
[36] T. Yoshimura, “Simultaneous modeling of phonetic and prosodic parameters, and characteristic conversion for HMM-based text-to-speech systems,” Ph.D. dissertation, Nagoya Institute of Technology, 2002.
[37] T. Drugman and T. Dutoit, “The deterministic plus stochastic model of
the residual signal and its applications,” Audio, Speech, and Language
Processing, IEEE Transactions on, vol. 20, pp. 968–981, 2012.
[38] H. Kawahara, “STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds,” Acoustical
science and technology, vol. 27, no. 6, pp. 349–353, 2006.
[39] P. Ekman, W. Friesen, and J. Hager, “Facial action coding system: A
technique for the measurement of facial movement,” 2002.
[40] J. Saragih, S. Lucey, and J. Cohn, “Deformable model fitting by
regularized landmark mean-shift,” International Journal of Computer
Vision, vol. 91, no. 2, pp. 200–215, 2011.
[41] M. Zeiler, G. Taylor, L. Sigal, I. Matthews, and R. Fergus, “Facial
expression transfer with input-output temporal restricted boltzmann
machines,” in Neural Information Processing Systems Conference NIPS
2011, 2011, pp. 1629–1637.
[42] W. Ruch and R. Proyer, “Extending the study of gelotophobia: On gelotophiles and katagelasticists,” Humor: International Journal of Humor
Research, vol. 22, no. 1-2, pp. 183–212, 2009.
[43] W. Ruch, G. Köhler, and C. Van Thriel, “Assessing the “humorous temperament”: Construction of the facet and standard trait forms of the state-trait-cheerfulness-inventory–STCI,” Humor: International Journal of Humor Research, vol. 9, pp. 303–339, 1996.
[44] W. Ruch, G. Köhler, and C. Van Thriel, “To be in good or bad humour:
Construction of the state form of the state-trait-cheerfulness-inventory–
STCI,” Personality and Individual Differences, vol. 22, no. 4, pp. 477–
491, 1997.
[45] J. Hofmann, T. Platt, and W. Ruch, “Avatar interaction evaluation form
(AIEF),” 2012, unpublished research instrument.
[46] W. Ruch and R. Proyer, “The fear of being laughed at: Individual and
group differences in gelotophobia.” Humor: International Journal of
Humor Research, vol. 21, pp. 47–67, 2008.
[47] W. Ruch and J. Hofmann, “A temperament approach to humor,” in Humor and health promotion, P. Gremigni, Ed. New York: Nova Science Publishers, 2012.
[48] W. Ruch and G. Köhler, “A temperament approach to humor,” in The
sense of humor: Explorations of a personality characteristic, W. Ruch,
Ed. Berlin: Mouton de Gruyter, 2007, pp. 203–230.
[49] G. Hall and A. Allin, “The psychology of tickling, laughing, and the
comic,” The American Journal of Psychology, vol. 9, no. 1, pp. 1–41,
1897.