Comfortability Recognition from Visual Non-verbal Cues
Maria Elena Lechuga Redondo
Italian Institute of Technology
Genova, Italy
[email protected]

Alessandra Sciutti
Italian Institute of Technology
Genova, Italy
[email protected]

Francesco Rea
Italian Institute of Technology
Genova, Italy
[email protected]

Radoslaw Niewiadomski
University of Trento
Rovereto, Italy
[email protected]
ABSTRACT
As social agents, we experience situations in which we enjoy being involved and others from which we desire to withdraw. Being aware of others' "comfort towards the interaction" helps us enhance our communication; thus, this becomes a fundamental skill for any interactive agent (either a robot or an Embodied Conversational Agent (ECA)). For this reason, the current paper considers Comfortability, the internal state that focuses on the person's desire to maintain or withdraw from an interaction, exploring whether it is possible to recognize it from human non-verbal behaviour. To this aim, videos collected during real Human-Robot Interactions (HRI) were segmented, manually annotated and used to train four standard classifiers. Concretely, different combinations of various facial and upper-body movements (i.e., Action Units, Head Pose, Upper-body Pose and Gaze) were fed to the following feature-based Machine Learning (ML) algorithms: Naive Bayes, Neural Networks, Random Forest and Support Vector Machines. The results indicate that the best model, obtaining a 75% recognition accuracy, is trained with all the aforementioned cues together and based on Random Forest. These findings indicate, for the first time, that Comfortability can be automatically recognized, paving the way to its future integration into interactive agents.
CCS CONCEPTS
• Human-centered computing → HCI theory, concepts and models.
KEYWORDS
Comfortability, Human-Agent Interaction, Affective Computing,
Multimodal Emotion Recognition
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICMI ’22, November 7–11, 2022, Bengaluru, India
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9390-4/22/11. . . $15.00
https://doi.org/10.1145/3536221.3556631
ACM Reference Format:
Maria Elena Lechuga Redondo, Alessandra Sciutti, Francesco Rea, and Radoslaw Niewiadomski. 2022. Comfortability Recognition from Visual Nonverbal Cues. In INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION (ICMI ’22), November 7–11, 2022, Bengaluru, India. ACM, New York,
NY, USA, 10 pages. https://doi.org/10.1145/3536221.3556631
1 INTRODUCTION
Interactions entail a tangled mix of emotional, affective and internal states that emerge between the people who are communicating [11]. As a consequence, identifying others' internal states plays a very relevant role within social contexts [5, 22]. For this reason, any interactive agent would greatly benefit from being socially intelligent [42, 49]. Given that developing fully socially intelligent agents is a challenge beyond our reach, providing them with foundational skills could already have a positive impact on human-agent exchanges. Hence, this paper tackles one of these basic skills: Comfortability detection.

Comfortability was introduced in [45] as "(disapproving of or approving of) the situation that arises as a result of an interaction, which influences one's own desire of maintaining or withdrawing from it". The strong point of Comfortability is that it focuses on how a person feels with respect to other agents' actions, without delving into the specific emotional or affective states that might arise in parallel. This way, a system capable of identifying someone's Comfortability would be able to understand whether it has acted appropriately and according to its user's expectations, and could assess whether it needs to adapt its behaviour. Albert Mehrabian established in 1967 the 7%–38%–55% rule, declaring that 7% of communication is verbal, 38% is vocal and 55% is visual [38]. This statement underlines the importance of non-verbal communication, highlighting at the same time how relevant it is to be capable of understanding and recognizing others' non-verbal cues. Additionally, Maréchal et al. [35] wrote: "A challenge in multi-modal emotion analysis is to efficiently explore emotion, not only on one but on highly expressive nature modalities." Therefore, this paper presents, for the first time, a model capable of classifying whether someone is uncomfortable or not by paying attention to several non-verbal features. Given that the face is one of the most expressive modalities [5], different cues associated with it (e.g., Action Units (AUs) and Gaze), in addition to Upper-body and Head Pose cues, have been considered. Concisely, different feature-based Machine Learning (ML) algorithms (i.e., Naive Bayes (NB), Neural Networks (NN), Random Forest (RF)
and Support Vector Machines (SVM)) have been fed with such features. Furthermore, the features under study were automatically
extracted from spontaneous reactions recorded during several real
Human-Robot Interaction (HRI) interviews.
This paper provides promising results, together with ideas that might enhance subsequent Comfortability classifiers and, at the same time, help non-human agents to better understand their human partners.
2 LITERATURE REVIEW
2.1 Expressing and Perceiving Internal States through Non-verbal cues
One of the main channels to express and perceive emotional and affective states is the face. Almost all interactive agents have a face, and probably for this reason, humans are capable of identifying faces within the first few days after birth [48]. Barrett et al. [5] have deeply explored this area and explained that the concept "emotion" refers to a category of instances that vary from one another in their physical (e.g., facial and body movements) and mental (e.g., pleasantness, arousal, etc.) features. This way, they implied that an emotion (e.g., anger) will not have identical characteristics across situations, people and cultures. Conversely, Paul Ekman [18] defends that "there is a core facial configuration that can be used to diagnose a person's emotional state in the same way that a fingerprint can be used to uniquely recognize a person" [5]. He defines emotions as "a process, a particular kind of automatic appraisal influenced by our evolutionary and personal past, in which we sense that something important to our welfare is occurring, and a set of physiological changes and emotional behaviours begin to deal with the situation" [18]. In his book [18], he added that words are one way to deal with emotions, and that even though we use words when we are emotional, we cannot reduce emotions to words. Together with Friesen [20], both studied how people from an isolated tribe in New Guinea, who had not interacted with anyone from outside their tribe, expressed and perceived each one of the so-called six basic emotions (i.e., anger, surprise, fear, happiness, disgust and sadness). To analyze their expressiveness, Ekman and Friesen defined several stories (associated with each one of the basic emotions) and asked the New Guineans to imagine themselves in such situations. Their faces were recorded and given to American collaborators, who had never traveled to New Guinea or been in contact with people from this tribe, to classify them into one of the six basic emotions. The results showed that the American collaborators were able to correctly classify all the videos except the ones associated with fear and surprise, which were interchangeably classified. To understand the New Guineans' perceptive abilities, Ekman prepared another experiment. This time, the New Guineans had to associate a story (which Ekman read to them) with a picture of a Caucasian face posing one of the six basic emotions. This experiment was performed with more than 300 people. The results showed that the subjects were very good at identifying happiness, anger, disgust and sadness. However, similarly to the Americans, they were unable to distinguish fear from surprise. They argued that this phenomenon might be due to not-well-formulated stories, but also to the fact that fear and surprise may often be intermingled in these people's lives, and thus not distinguished. Both studies
provided evidence in favour of assuming that there are innate facial movements associated with standard and reproducible emotional states across situations and cultures.
Even though Barrett's and Ekman's positions offer evidence against each other, both provide rich information about how emotions are expressed and perceived by people. Barrett stated that when we see someone performing a facial movement (e.g., smiling) and subsequently infer that that person is in a specific emotional state (e.g., happy), we are assuming that the smile reveals something about the person's internal state which cannot be accessed directly. This skill requires calculating a conditional probability of that person being in a particular internal state given the observable set of features (in this case, facial features). This approach is not different from how machine learning systems operate to recognize emotions, even though as humans we do it constantly and without realizing it. Ekman and Friesen avoided the issue of associating specific facial configurations with specific internal states by creating/expanding the Facial Action Coding System (FACS) in 1978 [19], originally introduced by Hjortsjö in 1970 [25]. This system describes all the visually discernible facial movements, breaking down facial configurations into individual components of muscle movement, called Action Units (AUs). Hjortsjö explored 23 AUs and, afterwards, Ekman and Friesen expanded the system to 64. To date, there are 46 AUs which consider facial movements, 8 which consider head movements and 4 focused only on eye movements. Additionally, Baltrušaitis et al. developed OpenFace [3], a software that automatically detects some AUs. Concretely, OpenFace is capable of detecting: AU1, AU2 and AU4 (which represent the muscle movements around the eyebrows), AU5, AU6, and AU7 (which represent the muscle movements around the eyes), AU9 (which represents a nose wrinkle), AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25 and AU26 (which represent the muscle movements around the mouth) and AU45 (which represents the action of blinking).
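To make this AU representation concrete, below is a minimal sketch of how the AU intensities produced by OpenFace could be loaded for later processing. The file path is hypothetical, and the `AU01_r`-style column naming (and possible leading spaces in the headers) reflects OpenFace's usual CSV output, which may differ across versions.

```python
import pandas as pd

# Hypothetical path to an OpenFace FeatureExtraction output file.
CSV_PATH = "openface_output/participant_01.csv"

# AUs listed in the text above; OpenFace writes them zero-padded (AU01, AU02, ...).
OPENFACE_AUS = ["AU01", "AU02", "AU04", "AU05", "AU06", "AU07", "AU09", "AU10",
                "AU12", "AU14", "AU15", "AU17", "AU20", "AU23", "AU25", "AU26", "AU45"]

df = pd.read_csv(CSV_PATH)
df.columns = [c.strip() for c in df.columns]  # headers may carry leading spaces

# Intensity columns are assumed to follow the "<AU>_r" naming convention (0-5 scale);
# presence columns would end in "_c" (0/1).
intensity_cols = [f"{au}_r" for au in OPENFACE_AUS if f"{au}_r" in df.columns]
print(df[intensity_cols].describe())
```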
Although the face has been deeply explored over the years, there are other channels that can reveal plenty of information about our internal states as well. One clear example is the body, which transmits a huge amount of information and has also been explored by Ekman [16] and other researchers who are trying to exploit its full potential. For example, Hidalgo et al. [24] developed OpenPose, a recognition software that automatically detects corporal poses distributed along the whole body, providing precise information about the face, hands, and feet. Another good example is the voice. Human auditory information (i.e., pitch, timbre, loudness, and vocal tone) has been proven to express emotions during speech generation [12]. Up to this point, all the introduced emotional channels are perceivable by the human senses to a certain extent. Cacioppo et al. [7] affirmed that humans' affective response is a psycho-physiological process triggered by stimuli, which is often manifested through observable behavioural channels. Although not all physiological signals can be efficiently perceived through our senses, their changes can be measured with technological devices. This means that, even though we might not be able to use this information in daily interactions, we might want to evaluate someone's internal states by considering these features as well. For example, Lobbestael et al. [30] conducted a study focused on anger, where they exposed sixty-four participants to one specific stimulus (either a movie, a stressful interview, punishment or harassment).
To measure their anger, they considered self-reports and a list of physiological signals: blood pressure, heart rate, skin conductance level, and skin conductance response. They found that all the stimuli produced similar self-reports, but that cardiovascular effects and electrodermal activity increased more during the harassment and the stressful interview. This might suggest, again, that people can "control" what they verbally say or how they voluntarily behave, but they cannot control how their heart beats or how their body sweats. Hence, physiological signals might bring extremely useful information to establish ground truths concerning internal states.
2.2 What should affective computing compute?
Most research in affective computing tackles the well-known "six basic emotions" (i.e., happiness, sadness, anger, fear, disgust and surprise). In fact, the most popular databases are based on them (e.g., the JAFFE [33, 34], KDEF [32], BU-3DFE [54], CK+ [31], MMI [41], SFEW [13], EmotioNet [21], AffectNet [40], and RAF-AU [53] databases) and plenty of researchers have developed algorithms capable of classifying them, obtaining promising results [43, 51]. Nonetheless, "non-basic emotions" (e.g., engagement, boredom, confusion, frustration and so on) were found to be five times more frequent in real-life situations [14]. It makes sense that six basic emotions might be insufficient to cover all the complex feelings and feedback felt and expressed during social situations. At the same time, almost all the expressions contained in these databases are acted, which means that they might not be a real reflection of the expressions that arise in real life. To date, there are some popular databases which include spontaneous, not-acted data (e.g., the DISFA [37] and BP4D [57] databases); however, they still lack more complex internal states.

In general, it can be noticed that there is a need for databases that contain not only spontaneous reactions, but also internal states that emerge during daily situations. This way, future affective computing would better address human-machine interactions. As a contribution to this idea, this paper provides a start by developing a Comfortability recognition system based on genuine human behaviour.
2.3 Machine Learning Algorithms
To automatically recognize any aspect of communication (e.g., an
emotion/internal state), a Machine Learning (ML) algorithm is usually designed and trained. On the basis of the input information,
ML algorithms can be divided into two branches: Deep Learning
(when the algorithm is capable of processing information without
any previous computation) and Feature-based Learning (when the
algorithm receives a set of pre-processed features). At the same
time, ML algorithms can be characterized as supervised or unsupervised depending on their classification strategy. Supervised ML algorithms require labelled data, whereas unsupervised ones do not, as they autonomously identify clustering principles.
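As a toy illustration of this distinction (not taken from the paper), the following sketch contrasts a supervised, feature-based classifier, which needs labels, with an unsupervised clustering algorithm, which does not; the data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # pre-processed feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels, only used by the supervised model

# Supervised, feature-based learning: labelled data guide the decision boundary.
clf = GaussianNB().fit(X, y)
print("supervised training accuracy:", clf.score(X, y))

# Unsupervised learning: groups are found without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster sizes:", np.bincount(clusters))
```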
Recent studies focused on affective computing have considered different approaches and modalities. For example, Rajan et al. [43]
created a model based on Convolutional Neural Network (CNN) and
Long-Short Term Memory (LSTM) that would take into account
dynamic temporal information for facial expression recognition.
They tested the model with well-known databases (CK+, MMI, and
SFEW), obtaining accuracies of 99%, 80% and 56%, respectively. For some reason, the classes anger, fear and sadness were classified worse than the others. In the same fashion, Bartlett et al. [6] developed a conceptor-based low/high engagement classifier based on Recurrent Neural Networks (RNN). To feed the classifier, they extracted skeletal and facial landmarks using the OpenPose [8] software from the videos contained in the PInSoRo [29] data-set (children performing tasks), taking movement into account. They obtained a recognition accuracy of 60% for the clips annotated as High Engagement and a recognition accuracy of 75% for the clips annotated as Low Engagement. Castellano et al. [9] also studied the role of movement when inferring emotions. To do so, they used videos collected during the third summer school of the Human-Machine Interaction Network on Emotion (HUMAINE) EU-IST project held in Genova in 2006. In particular, they used 240 dynamic gestures of 8 different emotions (anger, despair, interest, pleasure, sadness, irritability, joy and pride) acted by 10 different actors. They represented each movement by computing its Quantity of Motion (QoM), Contraction Index, Velocity, Acceleration and the fluidity of the hand's barycenter. Then, they applied a Dynamic Time Warping (DTW) [28] algorithm to measure similarities between movements. After comparing the five corporal features, they learnt that QoM was the one with the lowest classification error when distinguishing between anger, joy, pleasure and sadness. The remaining emotions were not successfully classified by any of the proposed features. One last example relevant to this paper is Matsufuji et al. [36], who developed a model to detect awkward situations. They considered voice intonation (i.e., maximum pitch and speech length) and corporal information extracted with the Kinect sensor (i.e., head pitch, yaw, neck, shoulder and elbow velocity vectors; and head x and z axes) of 5 subjects. They used these features with the Weka [15] software and several ML algorithms. They obtained a recognition accuracy of 83% for Bayesian Networks, 72% for Random Forest, 72% for Support Vector Machines, and 70% for Naive Bayes.
As the literature shows, there are plenty of modalities and algorithms that can be considered. We agree with the literature that the more modalities are present (e.g., physiological, auditory, visual, etc.), the more likely the model's performance will improve. Nonetheless, as this paper presents an initial approach to build a Comfortability classifier, it was decided to tackle one aspect at a time. On the one hand, given that the long-term aim of this project is to build a system capable of working in ecological scenarios (i.e., where no external devices are placed on the subject), physiological data were not considered. On the other hand, we noticed (analyzing the recordings) that other modalities (i.e., body movements, audio (e.g., pitch and tone), context and verbal content (i.e., the use of verbs)) seemed to be relevant to represent Comfortability. However, we observed that some of them might be quite challenging to interpret, as people might desire to hide their Uncomfortability with verbal statements and/or feel different under similar circumstances. Hence, we decided to focus on the facial and upper-body information, leaving the other features for further studies. Regarding the ML algorithm, as a first attempt we decided to explore several feature-based learning algorithms, feeding them positions, velocities and AUs (as these do not rely on faces' contours, colors, gadgets and hairstyles). More
complex features (e.g., QoM) as well as DL approaches are tentative
candidates for future studies.
3 METHODS
To capture spontaneous and genuine reactions within a Human-Machine Interaction scope, the iCub robot [39] interviewed several researchers for a real and novel column of our institutional online magazine (https://opentalk.iit.it/i-got-interviewed-by-a-robot/) [44]. During the interviews, the participants were exposed to a stressful interaction where the robot complimented them at the beginning and interrupted, ignored and misunderstood them at the end. Even though plenty of data (auditory, visual and physiological) were collected, only the visual information is explored in this paper.

A total of 29 videos (one per interview) of 17:54 (±5:17 SD) minutes on average were recorded. From those videos, only 26 were used for this study, because 3 participants were interviewed from a different perspective (instead of a total frontal view, they were recorded slightly turned, like a classical TV interview). Our data-set is peculiar, not only because the reactions are provoked by a non-human agent, but because our participants are from very different cultures and ethnicities, which to date is rare to find [5]. To analyze the visual information, the audio was excluded from the videos with the intention of not allowing the annotator to discover the context and, hence, be biased. Afterwards, the videos were trimmed into smaller segments and subsequently labelled.
3.1 Preparing our Data-set - Trimming and Labelling
Reis et al. [46] wrote that "the most fundamental property of a coding scheme for observing social interactions is the technique adopted for sampling behaviour, known as unitizing". Unitizing means dividing an observable sample into discrete smaller samples. According to cognitive sciences [55], unitizing is an automatic component of the human perceptual processing of the ongoing situation. That is to say, we as humans make sense of reality by breaking it into smaller units. Ceccaldi et al. [10] added that artificial agents should master unitizing skills to reach a comprehensive understanding of the interaction itself. With that goal in mind, they explored the drawbacks and benefits of the two main unitizing techniques (Interval and Continuous Coding). On the one hand, Interval Coding consists of identifying a fixed-length time interval into which the sample will be segmented. It is expected that the raters should be able to find occurrences of the targeted behaviour in those pieces. Established research [2] proved that thin slices (i.e., from 2 seconds to 5 minutes) are a well-known approach for personality, affect and interpersonal relationship samples. Even though this technique might cut actions in between, and thus relevant information can be lost, it is fast, easy to automatize, objective, and there is no need for prior knowledge of the context. On the other hand, Continuous Coding stands for identifying specific behaviours that are likely to last different amounts of time, where each segment will have its own size. While this technique captures exactly the desired information, it is much more time-consuming and often requires trained annotators. Moreover, it is likely that establishing a continuous segmentation will require a coding scheme itself (some are predefined, like ACT4Teams [27]). Regarding our samples,
we initially thought of using Continuous Coding, given that the observed behaviours did not seem constant and hence varied over time. We started by looking at each clip, trying to isolate each facial configuration (which by itself could represent a particular Comfortability level) from the others. Nevertheless, after several attempts at performing a customized isolation of these movements, we realized that this was not effective. Identifying the beginning and ending point of a unitary facial movement was not trivial and required considering both facial movements and other complex features (e.g., facial skin color). Furthermore, the specific moment a facial movement started and ended could be perceived differently by different people, and even by the same person at different times. In fact, Afzal et al. reached the same conclusions [1]. After annotating a data-set which contained spontaneous, unpredictable and natural reactions, they concluded: "Even for a human expert, it is difficult to define what constitutes an emotion. Segmenting the original videos into emotionally salient clips was the most labour-intensive and time-consuming process. Demarcating the beginning and end of emotional expressions is incredibly hard as they often overlap, co-occur or blend subtly into a background expression". This aspect made us believe that segmenting our data-set following a continuous segmentation would require an experimental set-up of its own, which is not the focus of our research. Therefore, we decided to go with Interval Coding, even though some expressions might be cut in between. To avoid cutting movements related to different events, a two-layer segmentation process was performed. The first step was to segment each video into 24 segments, one for each relevant interview part associated with a Comfortability level (also in line with the self-report's structure). The second step was to segment each one of those segments into smaller pieces. To decide the length of each piece, the time a macro-expression (from 1/2 to 4 seconds) and a micro-expression (from 1/2 to 1 second) tend to last [17] was taken into account. Therefore, each one of those 24 segments (counting all the participants, a total of 696 segments) was segmented again into 3-second segments. The final amount of prepared segments was 10,468 units. The segments which were shorter than 3 seconds, as a result of the trimming, were discarded, leaving a final amount of 8,467 units.
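The second segmentation layer can be illustrated with a minimal sketch that computes the 3-second clip boundaries inside one interview part and drops the shorter leftover; the function name and the example durations are hypothetical, not taken from the study.

```python
# Minimal sketch of the second segmentation layer: cutting one interview part
# into fixed 3-second windows and discarding the residue shorter than a window.

CLIP_LEN = 3.0  # seconds, in line with the macro/micro-expression durations above

def interval_code(part_start: float, part_end: float, clip_len: float = CLIP_LEN):
    """Return (start, end) pairs of fixed-length clips inside one interview part."""
    clips = []
    t = part_start
    while t + clip_len <= part_end:
        clips.append((t, t + clip_len))
        t += clip_len
    return clips  # any leftover shorter than clip_len is dropped

# Example: a 20-second interview part yields 6 clips; the trailing 2 s are discarded.
print(interval_code(0.0, 20.0))
```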
Afterwards, each one of the three-second samples was labelled following a 7-point Likert scale from 1 (Extremely Uncomfortable) to 7 (Extremely Comfortable). Annotating a sample, i.e., judging each participant's response, is a very challenging and demanding task. The mood and fatigue of the annotator, as well as the previously annotated sample, can bias the evaluation criteria, inducing error and subjectivity [47]. Also, it is known that facial movements can be consciously shaped. For example, the "Duchenne Smile" might be interpreted at first sight as a "happiness" indicator. However, it was found that it can be intentionally produced to signal submission or affiliation [23]. In addition to this finding, Hoque et al. [26] performed an experiment to study friendly vs. polite smiles. The experiment consisted of people interested in banking services meeting with a professional banker face-to-face. They discovered that amused smiles last longer and are more symmetrical than those enacted out of politeness. In addition, as can happen during unitizing, an annotator can become an expert by performing several rounds of annotations, learning after each repetition a particular aspect of the emotional response or
cue under study. Hence, trying to minimize these weak spots, we became experts by running three annotation rounds and annotated the data-set. During the annotation, the three-second videos were presented one at a time on the screen with a 1920x1080 resolution using the MUltiple VIdeos LABelling (MuViLab) annotation tool (github.com/ale152/muvilab). Once a video appeared, it was played in a loop, allowing the annotator to introduce a Comfortability level from 1 to 7 by pressing a number on the keyboard, until the annotator decided to pass to the next one. The clips were presented in a random order, which prevented the annotator from familiarizing with one specific person and understanding the specific context of the expression under analysis.
3.2 Non-Verbal Features
A set of 118 features were extracted from the three-second videos and considered for the Comfortability classifier. In particular, four different algorithms: Naive Bayes (NB), Neural Networks (NN), Random Forest (RF) and Support Vector Machines (SVM) were trained and tested with the following features:
3.2.1 AUs. The person's Action Units (AUs) were extracted using the OpenFace software [3]. Specifically, the Action Units AU01, AU02, AU04, AU05, AU06, AU07, AU09, AU10, AU12, AU14, AU15, AU17, AU20, AU23, AU25, AU26 and AU45 were used, which are the ones the software recognizes. The mean and standard deviation of the intensity of each of the seventeen AUs were included for each three-second video clip. Thus, a total of 34 features associated with the person's Action Units were included.
3.2.2 BodyPose. The person's corporal information was extracted using the OpenPose software [8, 50]. Specifically, information about the person's upper-body positions was used (i.e., x and y of 3 key-points per arm, 1 in between the shoulders, and 5 in the head). The mean and standard deviation of each of these key-point coordinates were computed for each three-second clip. Thus, a total of 48 features associated with the person's upper body were included.
3.2.3 Gaze. The person's gaze was extracted using the OpenFace software [52]. Specifically, the eye gaze direction vector in world coordinates for each eye (i.e., the x, y and z coordinates for the left eye and the x, y and z coordinates for the right eye) and the eye gaze angle direction averaged over both eyes were used. The mean and standard deviation of all these features were considered for each three-second video clip. Thus, a total of 16 features associated with the person's gaze were included.
3.2.4 HeadPose. The person's head position and rotation were extracted using the OpenFace software [4, 56]. Specifically, the location of the head with respect to the camera in millimeters (i.e., the x, y and z coordinates, where a positive z means being further away from the camera) and the rotation of the head in world coordinates with the camera as the origin (i.e., the x, y and z coordinates representing the pitch, yaw and roll respectively) were used. The mean, standard deviation, velocity and acceleration of the head location and rotation during each three-second video clip were considered when creating the classifier. Thus, a total of 20 features associated with the person's head location and movement were included.
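A minimal sketch of how the per-frame signals could be turned into the per-clip mean/standard-deviation features described above is given below; the column names, frame rate and data are placeholders, not the authors' actual pipeline.

```python
import numpy as np
import pandas as pd

def clip_features(frames: pd.DataFrame, columns: list) -> np.ndarray:
    """Aggregate the per-frame signals of one 3-second clip into mean/std features."""
    means = frames[columns].mean().to_numpy()
    stds = frames[columns].std().to_numpy()
    return np.concatenate([means, stds])

# Hypothetical per-frame table for one clip: 17 AU intensity columns yield
# 2 x 17 = 34 features, matching the AU feature count reported above.
au_columns = [f"AU{idx:02d}_r" for idx in
              (1, 2, 4, 5, 6, 7, 9, 10, 12, 14, 15, 17, 20, 23, 25, 26, 45)]
frames = pd.DataFrame(np.random.rand(90, len(au_columns)), columns=au_columns)  # ~30 fps x 3 s
print(clip_features(frames, au_columns).shape)  # (34,)
```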
4 RESULTS
In order to build a Comfortability model capable of classifying whether someone is uncomfortable, several ML algorithms were developed, where each algorithm's parameters were tuned to their optimal performance for each feature set received as input. More details are provided in the subsequent sections. Also, even though some of the features associated with a specific three-second clip took into account temporal dynamics, the algorithms did not consider a sequence between clips. Thus, the data did not follow the interview sequence. In addition, the data were divided into 70% for training (with 30% used for cross-validation) and 30% for testing (see Table 1). The clips were not discriminated among subjects, which means that a clip reserved for testing was not seen during the whole training procedure, but the subject was. The algorithm with the best accuracy was also tested with a leave-one-subject-out approach. Figure 1 shows the percentage of clips annotated with each one of the 7 Comfortability levels. It can be seen that the Comfortability extremes are poorly represented, making up barely 4% of the data-set. Nevertheless, the data were appreciably balanced when splitting the samples into Not-Uncomfortable (i.e., being comfortable or neutral), 51%, and Uncomfortable, 49%. Table 1 includes the specific number of clips used for training and testing for each Comfortability label.
Figure 1: Percentages of video clips annotated for each Comfortability level
Comfortability Label | # Clips Training | # Clips Testing
Not-Uncomfortable | 3020 | 1309
Uncomfortable | 2907 | 1230

Table 1: Number of video clips used to train and evaluate the interviewees' Comfortability
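A minimal sketch of the label grouping and the 70/30 split could look as follows; the features and annotations are placeholders, and mapping levels 1-3 to Uncomfortable and 4-7 (neutral or comfortable) to Not-Uncomfortable is an assumption consistent with the grouping described above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: one row of the 118 features per 3-second clip,
# plus the 7-point Likert annotation of each clip.
X = np.random.rand(8467, 118)
likert = np.random.randint(1, 8, size=8467)

# Assumed grouping: levels 1-3 -> Uncomfortable (1); levels 4-7 (neutral or
# comfortable) -> Not-Uncomfortable (0).
y = (likert <= 3).astype(int)

# 70% training / 30% testing; clips are shuffled since no sequence is modelled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, shuffle=True, random_state=0)
print(X_train.shape, X_test.shape)
```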
4.1 Naive Bayes
Table 2 shows the accuracy, precision and recall for different combinations of features used to train and evaluate the Naive Bayes classifier. From these results, it can be seen that the AUs together with the Gaze are the features that the algorithm learns best during training, obtaining a 69% accuracy. When evaluating the model with unseen data, AUs together with BodyPose are the features that work best, obtaining a Comfortability accuracy of almost 65%.
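A minimal sketch of how such a Naive Bayes baseline could be fitted and scored with scikit-learn is shown below; the paper does not state which Naive Bayes variant was used, so GaussianNB and the placeholder data are assumptions.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder AUs + BodyPose features (34 + 48 columns) and binary labels.
X = np.random.rand(8467, 82)
y = np.random.randint(0, 2, size=8467)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# GaussianNB is an assumption; the paper does not state which Naive Bayes variant was used.
nb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test),
                            target_names=["Not-Uncomfortable", "Uncomfortable"]))
```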
Input | Training-set Accuracy | Testing-set Accuracy | Precision | Recall
AUs | .603 | .591 | .595 | .587
BodyPose | .582 | .580 | .607 | .587
Gaze | .565 | .570 | .585 | .562
HeadPose | .510 | .515 | .757 | .500
AUs + BodyPose | .649 | .649 | .649 | .648
AUs + Gaze | .691 | .594 | .600 | .589
AUs + HeadPose | .510 | .515 | .757 | .500
BodyPose + Gaze | .607 | .626 | .631 | .628
BodyPose + HeadPose | .516 | .522 | .634 | .507
Gaze + HeadPose | .510 | .515 | .757 | .500
AUs + BodyPose + Gaze | .644 | .647 | .649 | .645
AUs + BodyPose + HeadPose | .516 | .522 | .634 | .507
AUs + Gaze + HeadPose | .510 | .515 | .757 | .500
BodyPose + Gaze + HeadPose | .516 | .522 | .634 | .507
AUs + BodyPose + Gaze + HeadPose | .516 | .522 | .634 | .507

Table 2: Naive Bayes Comfortability classification considering the features extracted from ecological three-second clips
4.2 Neural Networks
Table 3 shows the accuracy, precision and recall for different combinations of features used to train and evaluate the MLPClassifier Neural Networks classifier of sklearn. To obtain the best accuracy, the classifier was tuned for each specific input, varying its activation function (identity, logistic, tanh or relu), solver (lbfgs, sgd and adam) and hidden layers' size (from 1 to 35 layers) until obtaining its maximum accuracy. As a result, the model trained with AUs and HeadPose features obtained the highest training-set performance, with more than 78% accuracy. On top of that, a combination of AUs, BodyPose and HeadPose features, and a logistic, adam and 30-hidden-layers configuration, led to the highest performance with unseen data, obtaining a 72% Comfortability recognition accuracy.
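The tuning procedure described here can be sketched with a scikit-learn grid search; the exact search space is an assumption (the text's "hidden layers' size from 1 to 35" is interpreted as layer width here), and the data are placeholders.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Placeholder clip features (e.g., AUs + BodyPose + HeadPose) and binary labels.
X = np.random.rand(500, 102)
y = np.random.randint(0, 2, size=500)

# Grid mirroring the options mentioned in the text; the exact search space is an assumption.
param_grid = {
    "activation": ["identity", "logistic", "tanh", "relu"],
    "solver": ["lbfgs", "sgd", "adam"],
    "hidden_layer_sizes": [(n,) for n in (1, 10, 20, 30, 35)],
}
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=0),
                      param_grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```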
4.3 Random Forest
Table 4 shows the accuracy, precision and recall for different combinations of features used to train and evaluate the Random Forest classifier. From these results, it can be seen that all the features and combinations of features performed perfectly with the training-set. Additionally, it was found that a combination of all the features is the best bet for this algorithm with unknown data. Merging the AUs, BodyPose, Gaze and HeadPose features enhanced the model to a 75% Comfortability recognition accuracy, precision and recall.
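A minimal sketch of this best-performing configuration, concatenating the four feature groups from Section 3.2 and training a Random Forest, could look as follows; the placeholder data and the number of trees are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

n_clips = 8467
# Placeholder blocks standing in for the four feature groups of Section 3.2.
aus, body = np.random.rand(n_clips, 34), np.random.rand(n_clips, 48)
gaze, head = np.random.rand(n_clips, 16), np.random.rand(n_clips, 20)
X = np.hstack([aus, body, gaze, head])      # 118 features in total
y = np.random.randint(0, 2, size=n_clips)   # Uncomfortable vs Not-Uncomfortable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("test accuracy:", rf.score(X_te, y_te))
```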
4.4 Support Vector Machines
Table 5 shows the accuracy, precision and recall for different combinations of features used to train and evaluate the Support Vector Machines classifier. For each input feature set, all possible combinations of kernels (linear, polynomial, rbf and sigmoid), C (from .001 to 100) and gamma (from .001 to 100) values were run, choosing the one with the best accuracy. As a result, most of the inputs perform ideally with data already seen. On the other hand, AUs together with Gaze are the features that best represent someone's Comfortability level in the testing-set, reaching a 71% recognition accuracy. This model was trained with a Radial Basis Function (rbf) kernel, with gamma = .1 and C = 20.1.
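A minimal sketch of that reported SVM configuration is given below; the kernel, gamma and C values come from the text, while the feature scaling step and the placeholder data are added assumptions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder AUs + Gaze features (34 + 16 columns) and binary labels.
X = np.random.rand(2000, 50)
y = np.random.randint(0, 2, size=2000)

# Kernel and hyperparameters taken from the text; the scaling step is an added assumption.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=0.1, C=20.1))
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
```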
4.5 Best Algorithms and Features
The combination of features which led to the best classification accuracy for each one of the tested ML algorithms is shown in Table 6. Looking at the training set, the RF and SVM algorithms are the ones that perform better, reaching a perfect recognition response. Considering the test set, NB, NN and SVM do not perform very differently, while RF remains the one with the highest results. To explore its performance more deeply, Figure 2 reports the classification performed on the training data-set and Figure 3 reports the classification performed on the testing data-set. As can be noticed, the algorithm recognizes "Not-Uncomfortable" levels slightly better (77% of the time) than it recognizes "Uncomfortable" levels (73% of the time).
Figure 2: Training-set Comfortability classification accuracy
The algorithm with the best accuracy was also tested with a leave-one-subject-out procedure. Therefore, the model was trained 26 different times, each time leaving one subject out of the training set to test the final system accuracy with it. This way, the system is tested on subjects it was not trained on.
Input | Training-set Accuracy | Testing-set Accuracy | Precision | Recall
AUs | .747 | .682 | .683 | .680
BodyPose | .712 | .684 | .684 | .683
Gaze | .694 | .673 | .673 | .673
HeadPose | .641 | .648 | .650 | .645
AUs + BodyPose | .702 | .687 | .688 | .686
AUs + Gaze | .745 | .713 | .713 | .713
AUs + HeadPose | .782 | .703 | .703 | .703
BodyPose + Gaze | .530 | .523 | .594 | .535
BodyPose + HeadPose | .705 | .679 | .679 | .678
Gaze + HeadPose | .641 | .648 | .651 | .645
AUs + BodyPose + Gaze | .722 | .697 | .697 | .697
AUs + BodyPose + HeadPose | .740 | .720 | .720 | .720
AUs + Gaze + HeadPose | .760 | .720 | .721 | .721
BodyPose + Gaze + HeadPose | .716 | .685 | .685 | .683
AUs + BodyPose + Gaze + HeadPose | .745 | .706 | .706 | .704

Table 3: Neural Networks Comfortability classification considering the features extracted from ecological three-second clips
Input | Training-set Accuracy | Testing-set Accuracy | Precision | Recall
AUs | 1 | .700 | .700 | .700
BodyPose | 1 | .739 | .739 | .739
Gaze | 1 | .667 | .667 | .666
HeadPose | 1 | .724 | .724 | .724
AUs + BodyPose | 1 | .749 | .749 | .748
AUs + Gaze | 1 | .715 | .715 | .716
AUs + HeadPose | 1 | .737 | .737 | .737
BodyPose + Gaze | 1 | .744 | .744 | .743
BodyPose + HeadPose | 1 | .738 | .738 | .738
Gaze + HeadPose | 1 | .739 | .738 | .738
AUs + BodyPose + Gaze | 1 | .745 | .745 | .745
AUs + BodyPose + HeadPose | 1 | .747 | .747 | .746
AUs + Gaze + HeadPose | 1 | .740 | .740 | .740
BodyPose + Gaze + HeadPose | 1 | .747 | .747 | .746
AUs + BodyPose + Gaze + HeadPose | 1 | .752 | .751 | .751

Table 4: Random Forest Comfortability classification considering the features extracted from ecological three-second clips
As a result, the Random Forest classifier trained with a combination of the subject's AUs, BodyPose, Gaze and HeadPose features obtained an average classification accuracy of 56.6% (±14.2% SD). Paying attention to the individual subjects, it is observed that not everyone was classified with the same accuracy. While some obtained very poor results (from 27% to 47%), others achieved quite good performances (from 53% to 81%). The difference between the classification accuracy of the testing-set procedure (75%) and the leave-one-subject-out procedure (57%) might be due to the subjects' sample size. That is to say, since it is highly likely that people express Comfortability in their own manner, a system not familiar with a particular person might find it extremely difficult to understand what being Uncomfortable or Not-Uncomfortable means, i.e., how people behave when being Uncomfortable or Not-Uncomfortable. Both systems were only trained with 26 subjects (counting the one/s used for testing). Instead, if the models were fed with many more subjects, they would be more likely to encounter people who express common Comfortability patterns and would thus classify data from unknown subjects better. For this reason, it has been particularly challenging for the leave-one-subject-out model to generalize and predict how an unknown person would express their own Comfortability. In spite of that, the model is capable of classifying data it has never been exposed to better than chance.
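The leave-one-subject-out procedure described above can be sketched with scikit-learn's LeaveOneGroupOut, using per-clip subject identifiers as groups; the data and subject assignments below are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Placeholder clip features, binary labels and per-clip subject identifiers (26 subjects).
X = np.random.rand(2600, 118)
y = np.random.randint(0, 2, size=2600)
subjects = np.repeat(np.arange(26), 100)

# One fold per subject: the model never sees the held-out subject during training.
scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, groups=subjects, cv=LeaveOneGroupOut())
print("mean accuracy: %.3f (+/- %.3f SD)" % (scores.mean(), scores.std()))
```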
5 DISCUSSION AND FUTURE WORK
This paper has presented several ML models capable of recognizing Comfortability by taking into account different non-verbal cues that arose during a real interaction between a person and a humanoid robot. Specifically, the features under study comprised information about the participants' facial and upper-body movements (i.e., Action Units (AUs), Head Position, Gaze and Upper-body Position). The best algorithm was trained with a combination of all the proposed features, obtaining a 75% accuracy. At the same time,
Input | Training-set Accuracy | Testing-set Accuracy | Precision | Recall
AUs | .717 | .686 | .687 | .684
BodyPose | 1 | .529 | .641 | .514
Gaze | .614 | .620 | .622 | .616
HeadPose | .999 | .586 | .619 | .577
AUs + BodyPose | 1 | .528 | .638 | .514
AUs + Gaze | .748 | .709 | .709 | .708
AUs + HeadPose | 1 | .584 | .618 | .575
BodyPose + Gaze | 1 | .529 | .641 | .515
BodyPose + HeadPose | 1 | .517 | .604 | .502
Gaze + HeadPose | .998 | .613 | .624 | .608
AUs + BodyPose + Gaze | 1 | .528 | .638 | .514
AUs + BodyPose + HeadPose | 1 | .516 | .576 | .501
AUs + Gaze + HeadPose | .998 | .623 | .637 | .617
BodyPose + Gaze + HeadPose | 1 | .519 | .591 | .504
AUs + BodyPose + Gaze + HeadPose | 1 | .518 | .586 | .503

Table 5: SVM Comfortability classification considering the features extracted from ecological three-second clips
Algorithm | Input | Accuracy Train/Test
Naive Bayes | AUs + Gaze | 69% / 60%
Neural Networks | AUs + HeadPose | 78% / 70%
Random Forest | AUs + BodyPose + Gaze + HeadPose | 100% / 75%
Support Vector Machines | AUs + HeadPose | 100% / 59%

Table 6: Best algorithms performance considering the whole data-set and 2 Comfortability labels: being Uncomfortable vs being Not-Uncomfortable
Figure 3: Testing-set Comfortability classification accuracy
the same architecture was evaluated leaving one subject out during training to test with it. This decreased the accuracy obtained before, but still maintained a recognition accuracy better than chance (i.e., 58%). This means that the model is capable of recognizing whether someone is uncomfortable or not, not only for people it has already interacted with, but also for totally unknown faces it has never seen. It has been shown that, even though it is not the case for all the four ML algorithms explored, the more features are combined,
the more accurate the predictions produced by the model. Bearing this in mind, it is likely that the classifier presented in this paper could be enhanced if more modalities were taken into account. As mentioned before, synchronized audio, video and physiological signals have been recorded, together with the context (the type of question being asked at the precise time). Future steps could focus on exploring each one of these features deeply and individually, and then merging them together, to discover how to best combine them to build an effective Comfortability Artificial Intelligence. Additionally, the features used to feed the system can be polished and selected. At the moment, averages and standard deviations have been used to represent the temporal dynamics and static positions of the aforementioned features. However, some of these features might be poorly or redundantly represented. For these reasons, more complex features (like the contraction index of the expression, the quantity of motion (QoM) of the subject and so on) and dimensionality reduction techniques like PCA should be computed and applied to improve the model. In the same fashion, more complex models (possibly Deep-Learning based) could be considered.
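As an illustration of the dimensionality-reduction step suggested here, a hedged sketch applying PCA to the 118-feature representation before a classifier could look as follows; the 95% explained-variance threshold is an illustrative choice, not the authors'.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Placeholder 118-feature clips; PCA keeps enough components for 95% of the variance.
X = np.random.rand(8467, 118)
y = np.random.randint(0, 2, size=8467)

model = make_pipeline(PCA(n_components=0.95), RandomForestClassifier(random_state=0))
model.fit(X, y)
print("components kept:", model.named_steps["pca"].n_components_)
```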
Another very important aspect that might considerably improve the model's performance regarding unknown faces is to expand the data-set by collecting more videos from a bigger sample of subjects. Given that people express internal states differently, a much more varied data-set could increase the chance of recognizing Comfortability levels expressed in several unexpected ways.
Overall, this study has presented an accurate Comfortability recognition system and highlighted relevant factors that might improve the Comfortability classifier considerably.
ACKNOWLEDGMENTS
Alessandra Sciutti is supported by a Starting Grant from the European Research Council (ERC) under the European Union’s Horizon
2020 research and innovation program. G.A. No. 804388, wHiSPER.
REFERENCES
[1] Shazia Afzal and Peter Robinson. 2011. Natural affect data: Collection and annotation. In New perspectives on affect and learning technologies. Springer, 55–70.
[2] Nalini Ambady and Robert Rosenthal. 1992. Thin slices of expressive behavior as predictors of interpersonal consequences: A meta-analysis. Psychological bulletin 111, 2 (1992), 256.
[3] Tadas Baltrušaitis, Marwa Mahmoud, and Peter Robinson. 2015. Cross-dataset learning and person-specific normalisation for automatic action unit detection. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Vol. 6. IEEE, 1–6.
[4] Tadas Baltrusaitis, Peter Robinson, and Louis-Philippe Morency. 2013. Constrained local neural fields for robust facial landmark detection in the wild. In Proceedings of the IEEE international conference on computer vision workshops. 354–361.
[5] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. 2019. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological science in the public interest 20, 1 (2019), 1–68.
[6] Madeleine Bartlett, Daniel Hernandez Garcia, Serge Thill, and Tony Belpaeme. 2019. Recognizing human internal states: a conceptor-based approach. arXiv preprint arXiv:1909.04747 (2019).
[7] John T Cacioppo, Louis G Tassinary, and Gary Berntson. 2007. Handbook of psychophysiology. Cambridge university press.
[8] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
[9] Ginevra Castellano, Santiago D Villalba, and Antonio Camurri. 2007. Recognising human emotions from body movement and gesture dynamics. In International Conference on Affective Computing and Intelligent Interaction. Springer, 71–82.
[10] Eleonora Ceccaldi, Nale Lehmann-Willenbrock, Erica Volta, Mohamed Chetouani, Gualtiero Volpe, and Giovanna Varni. 2019. How unitizing affects annotation of cohesion. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 1–7.
[11] C. Clare. 2012. Communicate: how to say what needs to be said when it needs to be said in the way it needs to be said. National Library of Australia Cataloguing-in-Publication entry. 187 pages.
[12] Poorna Banerjee Dasgupta. 2017. Detection and analysis of human emotions through voice and speech pattern processing. arXiv preprint arXiv:1710.10198 (2017).
[13] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. 2011. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2106–2112.
[14] Sidney D'Mello and Rafael A Calvo. 2013. Beyond the basic emotions: what should affective computing compute? In CHI'13 extended abstracts on human factors in computing systems. 2287–2294.
[15] Frank Eibe, Mark A Hall, and Ian H Witten. 2016. The WEKA workbench. Online appendix for data mining: practical machine learning tools and techniques. In Morgan Kaufmann. Elsevier, Amsterdam, The Netherlands.
[16] Paul Ekman. 1965. Differential communication of affect by head and body cues. Journal of personality and social psychology 2, 5 (1965), 726.
[17] Paul Ekman. 2003. Darwin, deception, and facial expression. Annals of the New York Academy of Sciences 1000, 1 (2003), 205–221.
[18] P. Ekman. 2004. Emotions revealed. BMJ 328, Suppl S5 (2004).
[19] Paul Ekman and Wallace V Friesen. 1978. Facial action coding system. Environmental Psychology & Nonverbal Behavior (1978).
[20] Paul Ekman, E Richard Sorenson, and Wallace V Friesen. 1969. Pan-cultural elements in facial displays of emotion. Science 164, 3875 (1969), 86–88.
[21] C Fabian Benitez-Quiroz, Ramprakash Srinivasan, and Aleix M Martinez. 2016. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5562–5570.
[22] D. Golleman. 2006. Social Intelligence: The Revolutionary New Science of Human Relationships. Editorial Kairos. 544 pages.
[23] Sarah D Gunnery and Judith A Hall. 2014. The Duchenne smile and persuasion. Journal of Nonverbal Behavior 38, 2 (2014), 181–194.
[24] Gines Hidalgo, Yaadhav Raaj, Haroon Idrees, Donglai Xiang, Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2019. Single-network whole-body pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6982–6991.
[25] Carl-Herman Hjortsjö. 1969. Man's face and mimic language. Studentlitteratur.
[26] Mohammed Hoque, Louis-Philippe Morency, and Rosalind W Picard. 2011. Are you friendly or just polite? Analysis of smiles in spontaneous face-to-face interactions. In International Conference on Affective Computing and Intelligent Interaction. Springer, 135–144.
[27] Simone Kauffeld, Nale Lehmann-Willenbrock, and Annika L Meinecke. 2018. 21 The Advanced Interaction Analysis for Teams (act4teams) Coding Scheme. (2018).
[28] Eamonn Keogh and Chotirat Ann Ratanamahatana. 2005. Exact indexing of dynamic time warping. Knowledge and information systems 7, 3 (2005), 358–386.
[29] Séverin Lemaignan, Charlotte ER Edmunds, Emmanuel Senft, and Tony Belpaeme. 2018. The PInSoRo dataset: Supporting the data-driven study of child-child and child-robot social dynamics. PloS one 13, 10 (2018), e0205999.
[30] Jill Lobbestael, Arnoud Arntz, and Reinout W Wiers. 2008. How to push someone's buttons: A comparison of four anger-induction methods. Cognition & Emotion 22, 2 (2008), 353–373.
[31] Patrick Lucey, Jeffrey F Cohn, Takeo Kanade, Jason Saragih, Zara Ambadar, and Iain Matthews. 2010. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. IEEE, 94–101.
[32] Daniel Lundqvist, Anders Flykt, and Arne Öhman. 1998. Karolinska directed emotional faces. Cognition and Emotion (1998).
[33] Michael Lyons, Shigeru Akamatsu, Miyuki Kamachi, and Jiro Gyoba. 1998. Coding facial expressions with Gabor wavelets. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition. IEEE, 200–205.
[34] Michael J Lyons. 2021. "Excavating AI" Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset. arXiv preprint arXiv:2107.13998 (2021).
[35] Catherine Marechal, Dariusz Mikolajewski, Krzysztof Tyburek, Piotr Prokopowicz, Lamine Bougueroua, Corinne Ancourt, and Katarzyna Wegrzyn-Wolska. 2019. Survey on AI-Based Multimodal Methods for Emotion Detection. High-performance modelling and simulation for big data applications 11400 (2019), 307–324.
[36] Akihiro Matsufuji, Tatsuya Shiozawa, Wei Fen Hsieh, Eri Sato-Shimokawara, Toru Yamaguchi, and Lieu-Hen Chen. 2017. The analysis of nonverbal behavior for detecting awkward situation in communication. In 2017 Conference on Technologies and Applications of Artificial Intelligence (TAAI). IEEE, 118–123.
[37] S Mohammad Mavadati, Mohammad H Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F Cohn. 2013. DISFA: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing 4, 2 (2013), 151–160.
[38] Albert Mehrabian and Susan R Ferris. 1967. Inference of attitudes from nonverbal communication in two channels. Journal of consulting psychology 31, 3 (1967), 248.
[39] G. Metta, G. Sandini, D. Vernon, L. Natale, and F. Nori. 2008. The iCub humanoid robot: an open platform for research in embodied cognition. In Proceedings of the 8th workshop on performance metrics for intelligent systems. 50–56.
[40] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. 2017. AffectNet: A database for facial expression, valence, and arousal computing in the wild. IEEE Transactions on Affective Computing 10, 1 (2017), 18–31.
[41] Maja Pantic, Michel Valstar, Ron Rademaker, and Ludo Maat. 2005. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo. IEEE, 5 pp.
[42] R.W. Picard. 2003. Affective computing: challenges. International Journal of Human-Computer Studies 59, 1-2 (jul 2003), 55–64. https://doi.org/10.1016/S10715819
[43] Saranya Rajan, Poongodi Chenniappan, Somasundaram Devaraj, and Nirmala Madian. 2020. Novel deep learning model for facial expression recognition based on maximum boosted CNN and LSTM. IET Image Processing 14, 7 (2020), 1373–1381.
[44] M. E. L. Redondo, A. Sciutti, S. Incao, F. Rea, and R. Niewiadomski. 2021. Can Robots Impact Human Comfortability During a Live Interview?. In Companion of the 2021 ACM/IEEE International Conference on Human-Robot Interaction. 186–189.
[45] M. E. L. Redondo, A. Vignolo, R. Niewiadomski, F. Rea, and A. Sciutti. 2020. Can Robots Elicit Different Comfortability Levels?. In Wagner A.R. et al. (eds) Social Robotics. ICSR 2020. Lecture Notes in Computer Science, Vol. 12483. Springer, 664–675. https://doi.org/10.1007/978-3-030-62056-1_55
[46] Harry T Reis, Harry T Reis, Charles M Judd, et al. 2000. Handbook of research methods in social and personality psychology. Cambridge University Press.
[47] Judy Hanwen Shen, Agata Lapedriza, and Rosalind W Picard. 2019. Unintentional affective priming during labeling may bias labels. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 587–593.
[48] Alan Slater and Rachel Kirby. 1998. Innate and learned perceptual abilities in the newborn infant. Experimental Brain Research 123, 1 (1998), 90–94.
[49] E. Thorndike. 1992. Intelligence and Its Use. Harper's Magazine. 227–235 pages.
[50] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. 2016. Convolutional pose machines. In CVPR.
[51] Guihua Wen, Tianyuan Chang, Huihui Li, and Lijun Jiang. 2020. Dynamic objectives learning for facial expression recognition. IEEE Transactions on Multimedia 22, 11 (2020), 2914–2925.
[52] Erroll Wood, Tadas Baltrusaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, and Andreas Bulling. 2015. Rendering of eyes for eye-shape registration and gaze estimation. In Proceedings of the IEEE International Conference on Computer Vision. 3756–3764.
[53] Wen-Jing Yan, Shan Li, Chengtao Que, Jiquan Pei, and Weihong Deng. 2020. RAF-AU database: in-the-wild facial expressions with subjective emotion judgement and objective AU annotations. In Proceedings of the Asian Conference on Computer Vision.
[54] Lijun Yin, Xiaozhou Wei, Yi Sun, Jun Wang, and Matthew J Rosato. 2006. A 3D facial expression database for facial behavior research. In 7th International Conference on Automatic Face and Gesture Recognition (FGR06). IEEE, 211–216.
[55] Jeffrey M Zacks and Khena M Swallow. 2007. Event segmentation. Current Directions in Psychological Science 16, 2 (2007), 80–84.
[56] Amir Zadeh, Yao Chong Lim, Tadas Baltrusaitis, and Louis-Philippe Morency. 2017. Convolutional experts constrained local model for 3D facial landmark detection. In Proceedings of the IEEE International Conference on Computer Vision Workshops. 2519–2528.
[57] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, and Peng Liu. 2013. A high-resolution spontaneous 3D dynamic facial expression database. In 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 1–6.